Evan Vogelbaum
MIT EECS | Quick Undergraduate Research and Innovation Scholar
Measuring Benchmark Value
2022–2023
Mathematics & Electrical Engineering and Computer Science
- Artificial Intelligence & Machine Learning
Aleksander Madry
Benchmarks are ubiquitous across machine learning tasks. Whether we are building models for image classification or agents that can play the game of Go, novel methods are evaluated and compared to existing ones on standard benchmarks such as MNIST or ImageNet so that they can demonstrate their comparative value. While these benchmarks are convenient, sparing us from evaluating a new method on a wide array of applications to demonstrate its utility, they can also be misleading, since some benchmarks do not accurately represent the difficulty of real-world tasks. For example, it is easy to reach 97% accuracy on MNIST with an SVM, placing it on par with neural network methods. However, it is extraordinarily challenging to achieve similar performance on ImageNet with an SVM, whereas neural networks have excelled on that challenge in recent years. In other words, using MNIST to benchmark methods intended for a task as challenging as ImageNet yields poor predictions of the models' relative performance.
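To make the MNIST side of that comparison concrete, here is a minimal sketch (assuming scikit-learn and an internet connection to download MNIST from OpenML; the subset size and hyperparameters are illustrative choices, not part of this project) showing that an off-the-shelf SVM with no feature engineering already scores very highly on MNIST:

```python
# Illustrative sketch: an off-the-shelf RBF-kernel SVM on MNIST.
# High MNIST accuracy like this says little about performance on
# harder benchmarks such as ImageNet.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load MNIST: 70,000 28x28 grayscale digits flattened to 784 pixel features.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0  # scale pixel intensities to [0, 1]

# Subsample for speed; an RBF SVM on the full dataset trains slowly.
X_train, X_test, y_train, y_test = train_test_split(
    X[:15000], y[:15000], test_size=0.2, random_state=0
)

clf = SVC(kernel="rbf", C=5, gamma="scale")
clf.fit(X_train, y_train)

# On a subset of this size, accuracy typically lands around 0.96-0.98.
print(f"SVM test accuracy: {clf.score(X_test, y_test):.3f}")
```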
This leads to a "meta" question in machine learning research: how do we know when a benchmark is no longer useful for comparing methods? Even more importantly, can we define a quantitative criterion showing exactly how much better one benchmark is than another for a given target task? This project aims to develop exactly that criterion, so that machine learning researchers can better choose which benchmarks to use when assessing the relative merits of novel methods.
I am doing SuperUROP to further my academic research experience and develop new skills that will make me a better researcher. While I have had substantial experience developing the theory and models needed to push the state of the art forward on various tasks, I have not yet had the chance to look introspectively at models to understand *why* they perform the way they do. I am excited to further that understanding this year!