Parbhakar Kafle
MIT EECS | Himawan Undergraduate Research and Innovation Scholar
Extracting Information from Financial Documents
2021–2022
EECS
- Natural Language and Speech Processing
Amar Gupta
Storing information in databases makes it easier to organize and operate on. At the same time, paper documents are easier and simpler to use. This dichotomy has given rise to a growing need to process scanned images and PDFs to extract information in digital form. Our project will focus on processing and extracting information from financial documents. We plan to work on two major ideas. First, we aim to use deep learning to build models that extract structured information from documents with the help of an Optical Character Recognition (OCR) engine. Second, we plan to build a validation framework for the extracted data. This would include checks such as verifying that a field labeled as an email contains a valid email address, or that a ZIP code is valid.
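As a rough illustration of the validation idea, the kind of per-field check described above might look like the following minimal sketch. The field labels (`email`, `zip_code`), the `validate_fields` helper, and both regular expressions are assumptions made for this example, not the project's actual design. The email pattern is deliberately loose, and the ZIP check covers only the US five-digit and ZIP+4 formats; it does not confirm that a ZIP code is actually assigned.

```python
import re

# Hypothetical patterns, chosen for illustration only.
# Loose email shape: something@something.something, with no spaces or extra "@".
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
# US ZIP format only: five digits, optionally followed by "-" and four digits.
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")

# Map each (assumed) field label to a function that checks its value.
VALIDATORS = {
    "email": lambda value: EMAIL_RE.fullmatch(value) is not None,
    "zip_code": lambda value: ZIP_RE.fullmatch(value) is not None,
}


def validate_fields(fields):
    """Return {label: bool} for every label that has a validator.

    Labels without a registered validator are skipped rather than failed.
    """
    results = {}
    for label, value in fields.items():
        check = VALIDATORS.get(label)
        if check is not None:
            # Strip surrounding whitespace, which OCR output often carries.
            results[label] = check(value.strip())
    return results


if __name__ == "__main__":
    extracted = {"email": "jane@example.com", "zip_code": "0213", "name": "Jane"}
    print(validate_fields(extracted))
```

Keeping the checks in a dictionary keyed by field label means a new field type can be supported by registering one more function, without changing the loop that runs the checks.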
I joined Dr. Gupta’s research group this summer to work on financial document processing. It has been fun to work on the project so far, and it has helped me grow technically, professionally, and as a team member. We still have ideas that we couldn’t pursue over the summer, and I am really excited to continue the project and work on them. I believe SuperUROP will give me a more immersive and involved experience over the fall and spring semesters.