MIT Tang Family FinTech Undergraduate Research and Innovation Scholar
Investigating Internal Signatures of Neural Network Failure Modes
- Artificial Intelligence and Machine Learning
Nir N. Shavit
Neural Nets are often easy to trick, whether through restricted perturbations of the input (Lp Adversarial Examples) or through prompt injections (LLM jailbreaks). These failures are often studied while treating models as black boxes; this has yielded some results, but we remain far from using these insights to build robust models. Instead, we study Neural Network failures by investigating models' internal computation. By using mechanistic interpretability techniques (e.g., path patching, dictionary learning) and examining internal model statistics, we aim to produce theories of Neural Network failures that are grounded in phenomena observable in the internal computations of DNNs. We hope that such theories can inspire novel approaches to building robust models.
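For concreteness, an Lp adversarial example of the kind mentioned above can be sketched with a single fast-gradient-sign step on a toy linear classifier. Everything here (the weights, the input, the epsilon budget) is illustrative, not part of the project itself; the point is only that a small L-infinity-bounded perturbation, chosen along the sign of the input gradient, increases the model's loss.

```python
import numpy as np

# Toy setup: a 2-class linear classifier on 10 features.
# All values are illustrative, not from any real model.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 10))   # class logit weights
x = rng.normal(size=10)        # clean input
y = 0                          # "true" label for this input

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def loss_and_grad(x):
    """Cross-entropy loss and its gradient w.r.t. the input."""
    p = softmax(W @ x)
    onehot = np.eye(2)[y]
    return -np.log(p[y]), W.T @ (p - onehot)

eps = 0.3                      # L-infinity perturbation budget
loss_clean, g = loss_and_grad(x)
x_adv = x + eps * np.sign(g)   # one FGSM step along the gradient sign
loss_adv, _ = loss_and_grad(x_adv)

# The perturbation stays within the L-infinity ball but raises the loss.
assert np.max(np.abs(x_adv - x)) <= eps + 1e-12
assert loss_adv > loss_clean
```

Because the toy loss is convex in the input, this single step provably increases the loss; on deep networks the same recipe (or its iterated variants) is what produces the familiar imperceptible-perturbation failures.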
Through this SuperUROP, I want to gain experience understanding Neural Network failures by opening up model black boxes and studying their internals. I look forward to applying both my prior experience researching Adversarial Examples in CNNs and my math background to this project. I am most excited about treating the study of Deep Learning as a systematic science, and figuring out the techniques best suited to understanding these models.