Jung Soo Chu

Scholar Title

MIT EECS Undergraduate Research and Innovation Scholar

Research Title

Automated Document Processing and Data Extraction with Computer Vision, OCR, and NLP





Research Areas
  • Artificial Intelligence and Machine Learning

Amar Gupta


Optical character recognition (OCR), computer vision (CV), and natural language processing (NLP) have made significant leaps in the past two decades and have transformed many industries, including automated document processing. Automated processing offers a tremendous advantage over human workers in processing time and consistency, and sometimes even in accuracy. However, extracting information in a structured format from a pool of documents with widely varying formats remains a challenge that very few have solved so far. In this paper, we present a solution to this problem in the setting of electronic parts specification documents. The highlight of our solution is its two-module design, consisting of a classification module and an extraction module. The classification module identifies the format of the input document, and the extraction module then uses that information to choose the best algorithm for extracting information from the document. This design strikes a balance between the time it takes to build the algorithms and the accuracy they achieve.
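
The two-module design described above can be illustrated with a minimal sketch. This is not the project's actual implementation: the function names, the format labels ("key_value", "table"), and the simple text rules are all hypothetical stand-ins for the CV-, OCR-, and NLP-based components, and the input is assumed to be text that OCR has already produced. The sketch only shows how a classifier's output can be used to dispatch to a format-specific extractor.

```python
from typing import Callable, Dict

# An extractor maps document text to structured fields (hypothetical type).
Extractor = Callable[[str], Dict[str, str]]


def classify_format(text: str) -> str:
    """Classification module (toy stand-in): guess the document format.

    A real system would likely use a trained model; here a simple
    character rule is assumed purely for illustration.
    """
    if "|" in text:
        return "table"
    if ":" in text:
        return "key_value"
    return "unknown"


def extract_key_value(text: str) -> Dict[str, str]:
    """Extract fields from lines of the form 'Key: Value'."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def extract_table(text: str) -> Dict[str, str]:
    """Extract fields from a pipe-delimited header row and value row."""
    rows = [
        [cell.strip() for cell in line.split("|")]
        for line in text.splitlines()
        if "|" in line
    ]
    if len(rows) < 2:
        return {}
    return dict(zip(rows[0], rows[1]))


def extract_fallback(text: str) -> Dict[str, str]:
    """Used when no format-specific extractor applies."""
    return {}


# Extraction module: one algorithm per known format (hypothetical registry).
EXTRACTORS: Dict[str, Extractor] = {
    "key_value": extract_key_value,
    "table": extract_table,
}


def process_document(text: str) -> Dict[str, str]:
    """Classify the document, then run the extractor chosen for its format."""
    fmt = classify_format(text)
    extractor = EXTRACTORS.get(fmt, extract_fallback)
    return extractor(text)
```

Under these assumptions, supporting a new document format means adding one classifier label and one extractor to the registry, which reflects the trade-off between development time and accuracy mentioned above.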


As a senior in 6-3 and 18, I have been most interested in both the applied and theoretical sides of AI/ML. This SuperUROP opportunity has given me invaluable experience in applying my AI/ML knowledge to a real-world problem that affects the largest electronic parts distributor in the world. The SuperUROP assignments, together with the presentations and deliverables to the sponsor company, gave me a full view and a deep understanding of the project that would have been impossible otherwise. Through this project, I grew as a software engineer and AI/ML researcher, and I was motivated to pursue an MEng with an AI/ML concentration and to work in this field in industry afterwards.

