
Isha Agarwal
Measuring the Generalizability of Model Editing Techniques for Debiasing Language Models
2025–2026
Electrical Engineering and Computer Science
- AI and Machine Learning
Ghassemi, Marzyeh
As language models become widely adopted, it is important to ensure that they do not perpetuate harmful stereotypes. This project investigates model editing—a technique for modifying a model’s knowledge of a specific fact without affecting its broader knowledge—as a potential debiasing tool. We examine how debiasing edits alter the model’s underlying features for a target group. Specifically, we decompose the features that distinguish a target group (e.g., Asian women) from its corresponding broader concept (e.g., people) in the model’s embedding space and track how these features shift pre- and post-edit. Our findings shed light on the robustness of model editing as a cost-effective debiasing strategy and provide insight into how edits transform internal representations.
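As a concrete illustration (not the project's actual pipeline), the sketch below shows one simple way such a decomposition could be measured: a difference-of-means direction between the mean hidden states of target-group prompts and broader-concept prompts, recomputed after an edit and compared by cosine similarity. The model choice (gpt2), the prompt sets, and the helper functions are all illustrative assumptions.

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"  # assumed stand-in; any LM exposing hidden states works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def mean_embedding(prompts):
    """Mean of last-layer hidden states, averaged over tokens and prompts."""
    vecs = []
    for text in prompts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, d_model)
        vecs.append(hidden.mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

def group_direction(target_prompts, broader_prompts):
    """Difference-of-means direction: target group minus broader concept."""
    direction = mean_embedding(target_prompts) - mean_embedding(broader_prompts)
    return direction / direction.norm()

# Hypothetical prompt sets for a target group vs. its broader concept.
target = ["Asian women are", "An Asian woman works as"]
broader = ["People are", "A person works as"]

pre_edit_dir = group_direction(target, broader)

# ... apply a debiasing edit here (e.g., a ROME-style rank-one weight update) ...
# post_edit_dir = group_direction(target, broader)  # recompute after the edit
# shift = torch.nn.functional.cosine_similarity(pre_edit_dir, post_edit_dir, dim=0)
# print(f"Cosine similarity of group direction, pre vs. post edit: {shift.item():.3f}")

A cosine similarity near 1 would suggest the edit left the target group's distinguishing direction largely intact, while lower values would indicate a substantive shift in the group's representation.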
By participating in SuperUROP, I hope to gain hands-on experience completing an end-to-end research project. I’m particularly excited to apply the mechanistic interpretability experience I gained through UROPs and summer programs to debiasing research. I look forward to learning more about interpretability and bias in AI while making a meaningful contribution in this space.