Helena Merker

hmerker@mit.edu

Scholar Title

MIT EECS | Citadel Undergraduate Research and Innovation Scholar

Research Title

Robust OCR Pipelines for Automated Key-Value Pair Extraction and Document Classification

Cohort

2022–2023

Department

Electrical Engineering and Computer Science

Research Areas

Graphics and Vision

Supervisor

Amar Gupta

agupta@mit.edu

Abstract

Despite the existence of numerous pipelines to extract textual information, most current approaches are tailored to a specific document format. Since many large companies handle a myriad of document types, there is a need for a generalized solution to process image-based documents in batches. We propose an automated key-value pair extraction pipeline for electronic parts specification documents. The main goal is to extract information with high accuracy and versatility, especially since the specification documents come in various formats and orientations. The pipeline comprises two main modules, a classification module and an extraction module, designed to balance the number of individual parsing algorithms needed against the accuracy achieved. By employing a conservative number of extraction algorithms, together with machine learning classification approaches that utilize both visual and textual document features, we achieve a system that produces competitive accuracy without a high performance overhead.

Quote

I am participating in SuperUROP because my project focuses on the use and implementation of optical character recognition, as well as text identification and extraction. Academically, this project aligns with my desire to learn about image processing techniques and to concentrate on research within the domains of document processing and analysis. Overall, I hope to further my programming abilities and become a better researcher and experimentalist.
