Daria Kryvosheieva
Data-Efficient Training of LLMs
2025–2026
Electrical Engineering and Computer Science
- AI and Machine Learning
Kim, Yoon
Some AI models, such as image classification models, can achieve human-level performance after training on a human-comparable amount of data. Large language models (LLMs), by contrast, typically require several orders of magnitude more language data than a human is exposed to by the time they become fluent in their native language. This SuperUROP will focus on developing strategies for data-efficient training of LLMs. To this end, I will explore approaches including, but not limited to, linguistic inductive biases, architectural modifications, data and training-objective curricula, and reinforcement-based tuning. The practical impact of more data-efficient LLMs is a reduction in both the cost and the CO2 emissions associated with training.
I am participating in this SuperUROP because I want to make LLM training more accessible to organizations and individuals around the world. Throughout the project, I will draw on my prior experience with language models, which includes MIT's NLP class (6.8611), an LLM training internship at Jina AI, and past UROPs at EvLab and CPL. I am excited to better understand how training data shapes the performance and behavior of LLMs.
