
Isha Agarwal
Measuring the Generalizability of Model Editing Techniques for Debiasing Language Models
2025–2026
Electrical Engineering and Computer Science
- AI and Machine Learning
Ghassemi, Marzyeh
As language models become widely adopted, it is important to ensure that they do not perpetuate harmful stereotypes. This project investigates model editing—a technique for modifying a model’s knowledge of a specific fact without affecting its broader knowledge—as a potential debiasing tool. We examine how debiasing edits alter the model’s underlying features for a target group. Specifically, we decompose the features that distinguish a target group (e.g., Asian women) from its corresponding broader concept (e.g., people) in the model’s embedding space and track how these features shift pre- and post-edit. Our findings shed light on the robustness of model editing as a cost-effective debiasing strategy and provide insight into how edits transform internal representations.
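As a concrete illustration (not the project's actual pipeline), the sketch below shows one simple way such a decomposition could be measured: a difference-of-means direction between the mean hidden states of target-group prompts and broader-concept prompts, recomputed after an edit and compared by cosine similarity. The model choice (gpt2), the prompt sets, and the helper functions are all illustrative assumptions.

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"  # assumed stand-in; any LM exposing hidden states works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def mean_embedding(prompts):
    """Mean of last-layer hidden states, averaged over tokens and prompts."""
    vecs = []
    for text in prompts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, d_model)
        vecs.append(hidden.mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

def group_direction(target_prompts, broader_prompts):
    """Difference-of-means direction: target group minus broader concept."""
    direction = mean_embedding(target_prompts) - mean_embedding(broader_prompts)
    return direction / direction.norm()

# Hypothetical prompt sets for a target group vs. its broader concept.
target = ["Asian women are", "An Asian woman works as"]
broader = ["People are", "A person works as"]

pre_edit_dir = group_direction(target, broader)

# ... apply a debiasing edit here (e.g., a ROME-style rank-one weight update) ...
# post_edit_dir = group_direction(target, broader)  # recompute after the edit
# shift = torch.nn.functional.cosine_similarity(pre_edit_dir, post_edit_dir, dim=0)
# print(f"Cosine similarity of group direction, pre vs. post edit: {shift.item():.3f}")

A cosine similarity near 1 would suggest the edit left the target group's distinguishing direction largely intact, while lower values would indicate a substantive shift in the group's representation.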
By participating in SuperUROP, I hope to gain hands-on experience completing an end-to-end research project. I’m particularly excited to apply the mechanistic interpretability experience I gained through UROPs and summer programs to debiasing research. I look forward to learning more about interpretability and bias in AI while making a meaningful contribution in this space.