Daria Kryvosheieva
Data-Efficient Training of LLMs
2025–2026
Electrical Engineering and Computer Science
- AI and Machine Learning
Kim, Yoon
Some AI models, such as image classification models, can achieve human-level performance after training on a human-comparable amount of data. Large language models (LLMs), by contrast, typically require several orders of magnitude more language data than a human is exposed to by the time they become fluent in their native language. This SuperUROP will focus on developing strategies for data-efficient training of LLMs. To this end, I will explore approaches including, but not limited to, linguistic inductive biases, architectural modifications, data and training-objective curricula, and reinforcement-based tuning. The practical impact of more data-efficient LLMs is a reduction in both the cost and the CO2 emissions associated with training.
I am participating in this SuperUROP because I want to make LLM training more accessible to organizations and individuals around the world. Throughout the project, I will draw on my prior experience with language models, which includes MIT's NLP class (6.8611), an LLM training internship at Jina AI, and past UROPs at EvLab and CPL. I am excited to better understand how training data shapes the performance and behavior of LLMs.
