Helena Merker
MIT EECS | Citadel
Robust OCR Pipelines for Automated Key-Value Pair Extraction and Document Classification
2022–2023
Electrical Engineering and Computer Science
- Graphics and Vision
Amar Gupta
Despite the existence of numerous pipelines for extracting textual information, most current approaches are tailored to a specific document format. Since many large companies handle a myriad of document types, there is a need for a generalized solution that processes image-based documents in batches. We propose an automated key-value pair extraction pipeline for electronic parts specification documents. The main goal is to extract information with high accuracy and versatility, since the specification documents come in many formats and orientations. The pipeline consists of two main modules, a classification module and an extraction module, which together balance the number of individual parsing algorithms needed against the accuracy achieved. By employing a conservative number of extraction algorithms, together with machine learning classification approaches that use both visual and textual document features, we achieve a system that produces competitive accuracy without a high performance overhead.
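As a rough illustration of the two-module design described above, the Python sketch below routes text that has already been through OCR to a classifier and then to a class-specific parser. Everything in it is an assumption made for illustration: the layout labels, the keyword cues, and the function names (`classify`, `extract_colon_list`, `extract_table`, `process_document`) are hypothetical, and the keyword scoring only stands in for the machine learning classifier, which the project describes as using both visual and textual features.

```python
import re
from typing import Callable, Dict, Tuple

# Hypothetical layout classes and keyword cues (assumptions, not from the project).
# A real classifier would be a trained model over visual and textual features.
LAYOUT_KEYWORDS = {
    "colon_list": ("part number:", "voltage:"),
    "table": ("parameter", "value"),
}


def classify(text: str) -> str:
    """Classification module stand-in: pick the layout with the most keyword hits."""
    lowered = text.lower()
    scores = {
        label: sum(keyword in lowered for keyword in keywords)
        for label, keywords in LAYOUT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def extract_colon_list(text: str) -> Dict[str, str]:
    """Parse lines of the form 'Key: Value'."""
    pattern = r"^[ \t]*([^:\n]+?)[ \t]*:[ \t]*(.+?)[ \t]*$"
    return {key: value for key, value in re.findall(pattern, text, flags=re.MULTILINE)}


def extract_table(text: str) -> Dict[str, str]:
    """Parse two-column rows separated by two or more spaces, skipping the header."""
    pairs = {}
    for line in text.splitlines():
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) == 2 and parts[0].lower() != "parameter":
            pairs[parts[0]] = parts[1]
    return pairs


# Extraction module: one parser per layout class, so the number of parsing
# algorithms stays small while each one handles a whole family of documents.
EXTRACTORS: Dict[str, Callable[[str], Dict[str, str]]] = {
    "colon_list": extract_colon_list,
    "table": extract_table,
}


def process_document(text: str) -> Tuple[str, Dict[str, str]]:
    """Classify the document, then dispatch to the matching extractor."""
    label = classify(text)
    extractor = EXTRACTORS.get(label)
    return label, extractor(text) if extractor else {}
```

For example, calling `process_document("Part Number: LM317\nVoltage: 1.25 V\n")` in this sketch would select the `colon_list` parser. Adding support for a new document family means adding one classifier label and one parser, rather than one parser per individual document format.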
I am participating in SuperUROP because my project focuses on the use and implementation of optical character recognition, as well as text identification and extraction. Academically, this project aligns with my desire to learn about image processing techniques and to concentrate on research in document processing and analysis. Overall, I hope to further my programming abilities and become a better researcher and experimentalist.