Jung Soo Chu

Scholar Title

MIT EECS Undergraduate Research and Innovation Scholar

Research Title

Automated Document Processing and Data Extraction with Computer Vision, OCR, and NLP

Cohort

2021–2022

Department

EECS

Research Areas
  • Artificial Intelligence and Machine Learning
Supervisor

Amar Gupta

Abstract

Optical character recognition (OCR), computer vision (CV), and natural language processing (NLP) have made significant leaps in the past two decades and have transformed many industries, including automated document processing. Automated processing offers a tremendous advantage over human workers in processing time and consistency, and sometimes even in accuracy. However, extracting information in a structured format from a pool of documents with widely varying formats remains a challenge that very few have met successfully so far. In this paper, we present a solution to this problem in the setting of electronic parts specification documents. The highlight of our solution is its two-module design, consisting of a classification module and an extraction module. The classification module identifies the format of the input document, and the extraction module then uses that information to choose the best algorithm for extracting information from it. This design strikes a balanced compromise between extraction accuracy and the time it takes to build the algorithms.
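The two-module design described in the abstract can be illustrated with a minimal sketch. This is not the project's implementation: the format labels ("key_value", "table"), the keyword-based classifier, and the function names are all hypothetical stand-ins, and the input is assumed to be text that has already passed through OCR.

```python
import re
from typing import Callable, Dict

# Hypothetical format labels and rules; a real classification module would
# more likely be a trained model working on page images or OCR output.
def classify_format(text: str) -> str:
    """Classification module: guess the layout of the document."""
    if "|" in text:
        return "table"
    if re.search(r"^\s*\w[\w ]*:\s*\S", text, re.MULTILINE):
        return "key_value"
    return "unknown"

def extract_key_value(text: str) -> Dict[str, str]:
    """Extractor for 'Name: value' layouts."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields

def extract_table(text: str) -> Dict[str, str]:
    """Extractor for pipe-delimited tables: header row plus first data row."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if "|" in line
    ]
    if len(rows) < 2:
        return {}
    return dict(zip(rows[0], rows[1]))

# Extraction module: one algorithm per known format, selected by label.
EXTRACTORS: Dict[str, Callable[[str], Dict[str, str]]] = {
    "key_value": extract_key_value,
    "table": extract_table,
}

def process(text: str) -> Dict[str, object]:
    """Run classification first, then dispatch to the matching extractor."""
    fmt = classify_format(text)
    extractor = EXTRACTORS.get(fmt)
    return {"format": fmt, "fields": extractor(text) if extractor else {}}
```

Under these assumptions, supporting a new document format means adding one label to the classifier and one entry to the dispatch table, which is one way to read the trade-off between build time and accuracy that the abstract mentions.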

Quote

As a senior in 6-3 and 18, I have been most interested in both the applied and theoretical sides of AI/ML. This SuperUROP opportunity has given me invaluable experience in applying my AI/ML knowledge to a real-world problem that affects the largest electronic parts distributor in the world. The SuperUROP assignments, together with the presentations and deliverables for the sponsor company, gave me a complete view and deep understanding of the project that would have been impossible otherwise. Through this project, I grew as a software engineer and AI/ML researcher, and I was motivated to pursue an MEng with an AI/ML concentration and to work in this field in industry afterwards.
