Using OCR to Digitize Documents
Services provided: Text Annotation
Published date: 24.04.2023
Read time: 3 min
Industry: IT Services & Consulting
The client is a global leader in digital transformation as well as cybersecurity, cloud and high-performance computing. Their purpose is to help design the future of the information space with expertise and services to support the development of knowledge, education and research.
We helped our client anonymize personally identifiable information (PII) such as names, addresses and emails. We also annotated entities in business documents to create training data for machine learning models.
The client had a significant number of physical vendor invoices, payment advice, purchase orders and other documents, which made it difficult to store and track information. They were looking for ways to digitize this information with the help of an AI solution, but all of the datasets first needed to be annotated with bounding boxes and tags. What further complicated the project was that the documents were in several languages (Spanish, Italian, Korean, Swedish, Finnish, Czech, Norwegian, Danish, Russian, Portuguese, French, Polish, Chinese, Japanese, Dutch, Turkish, Hungarian, Arabic, Croatian, Greek, Thai, Vietnamese, Slovakian, Latvian, Slovenian).
While the dataset included many languages, documents in English, German, French and Spanish would be pre-annotated by an ML system. The client therefore needed qualified annotators to verify that the system had placed the bounding boxes correctly. The remaining languages listed above needed to be annotated manually. The client was looking for a trusted data annotation partner with extensive experience in OCR and the capability to assemble a sizable team on short notice.
Why Mindy Support
Mindy Support was recommended to the client by one of our partners as an experienced and trustworthy data annotation provider. Our skills and expertise combined to exceed client expectations and keep the project on track.
Solutions Delivered to the Client
Mindy Support started out by recruiting qualified candidates with proficiency in all of the languages requested by the client. Since we are one of the largest data annotation providers in Eastern Europe and we have proven recruitment processes in place, it was not challenging for us to source and recruit the needed candidates.
After we assembled the team, we began the project by placing bounding boxes around the relevant PII and labeling each piece of data as an “email”, “name”, “telephone number” etc. All of the data annotation work was exported to a JSON file to make the data easier for our client to use. In addition, we provided QA for the data annotation performed by the automated system. We were meticulous throughout this process to ensure the highest level of quality.
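As a rough illustration of the workflow described above, labeled bounding boxes can be serialized to JSON along the following lines. This is only a minimal sketch: the work was done in the client's proprietary tool, whose actual schema is not described here, so the field names (document_id, language, x, y, width, height) and the sample values are hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class BoxAnnotation:
    """One labeled bounding box (hypothetical schema, pixel coordinates)."""
    label: str   # entity type, e.g. "email", "name", "telephone number"
    x: int       # left edge of the box
    y: int       # top edge of the box
    width: int
    height: int


def export_annotations(document_id, language, boxes):
    """Serialize the annotations for one document into a JSON string."""
    record = {
        "document_id": document_id,   # hypothetical identifier
        "language": language,
        "annotations": [asdict(box) for box in boxes],
    }
    # ensure_ascii=False keeps non-Latin text readable in the output
    return json.dumps(record, ensure_ascii=False, indent=2)


# Made-up sample data for illustration only
boxes = [
    BoxAnnotation("name", 120, 80, 210, 24),
    BoxAnnotation("email", 120, 112, 260, 24),
]
exported = export_annotations("invoice-0001", "pl", boxes)
print(exported)
```

Keeping one JSON record per document, with the label stored next to the box coordinates, makes it straightforward to feed the annotations into a training pipeline or to review them during QA.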
All of this work was done inside the client’s proprietary tool. Our team members had extensive experience using this tool, so no additional training was needed.
- 98%+ quality rate for manual data annotation
- 2 years of experience working on the project
- Documents were annotated in 16 languages
- Helped the client transition from manual data annotation to quality validation of pre-labeled data