As artificial intelligence models become increasingly prevalent and integrated into sectors such as healthcare, finance, education, transportation, and entertainment, it is important to understand how they work under the hood. Interpreting the mechanisms underlying AI models lets us audit them for safety and bias, and deepens our understanding of the science behind intelligence itself.
What if we could directly interrogate the human brain by manipulating its individual neurons to examine their role in perceiving specific objects? Such experiments would be prohibitively invasive in a human brain, but far more feasible in another kind of neural network: an artificial one. Yet artificial models, which contain millions of neurons, are still too large and complex to interrogate by hand, making interpretability at scale a monumental task.
To address this, researchers at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) took an automated approach to interpreting artificial vision models that evaluate different characteristics of images. They developed MAIA (Multimodal Automated Interpretability Agent), a system that automates a variety of neural network interpretation tasks using a vision-language model backbone equipped with tools for experimenting on other AI systems.
“Our goal is to equip AI researchers to run interpretability experiments autonomously. Existing automated interpretability methods merely label or visualize data in a one-off process. MAIA, on the other hand, can generate hypotheses, design experiments to test them, and refine its understanding through iterative analysis,” says Tamar Rott Shaham, a postdoc in MIT Electrical Engineering and Computer Science (EECS) at CSAIL and co-author of a new paper on the research. “By combining a pre-trained vision-language model with a library of interpretability tools, our multimodal method can answer user queries by composing and running targeted experiments on specific models, continually refining its approach until it can provide a comprehensive answer.”
The automated agent has been demonstrated on three key tasks: labeling individual components inside visual models to describe the visual concepts that activate them; cleaning up image classifiers by removing irrelevant features to make them more robust to new situations; and hunting for hidden biases in AI systems to uncover potential fairness issues in their outputs. “But the main advantage of a system like MAIA is its flexibility,” says Sarah Schwettmann PhD ’21, a research scientist at CSAIL and co-lead of the study. “While we’ve demonstrated MAIA’s usefulness on a few specific tasks, because the system is built from a foundation model with broad reasoning capabilities, it can answer many different kinds of interpretability questions from users and design experiments on the fly to explore them.”
Neuron by neuron
In one example task, a human user asks MAIA to explain the concept that a particular neuron inside a visual model is responsible for detecting. To investigate this question, MAIA first uses a tool that retrieves “dataset exemplars” from the ImageNet dataset that maximally activate the neuron. For this example neuron, those images show people in formal attire and close-ups of their chins and necks. MAIA generates different hypotheses about what drives the neuron’s activity: facial expressions, chins, or neckties. MAIA then uses its tools to design experiments that test each hypothesis individually by generating and editing synthetic images. In one experiment, adding a bow tie to an image of a human face increased the neuron’s response. “This approach allows us to determine the specific cause of the neuron’s activity, much like a real scientific experiment,” says Rott Shaham.
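The basic primitive underlying such experiments is straightforward: measure how strongly a chosen unit responds to an image, then compare its responses on an original image and an edited one. The sketch below illustrates that primitive with a pretrained torchvision ResNet; the layer, the unit index, and the image filenames are arbitrary placeholders for illustration, not details taken from the paper.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# An arbitrary pretrained ImageNet classifier to probe (an example choice,
# not necessarily a model used in the paper).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()

activations = {}

def save_activation(module, inputs, output):
    # Average each channel's activation map so every unit gets a single score.
    activations["layer4"] = output.mean(dim=(2, 3))  # shape: (batch, channels)

model.layer4.register_forward_hook(save_activation)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def unit_response(image_path: str, unit: int) -> float:
    """Return the mean activation of one channel ('neuron') for one image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        model(img)
    return activations["layer4"][0, unit].item()

# Hypothetical experiment in the spirit of the one described above: compare the
# unit's response to a face image with and without an added bow tie (the edited
# image would come from an image-editing tool).
# baseline = unit_response("face.jpg", unit=289)
# edited = unit_response("face_with_bowtie.jpg", unit=289)
# print(f"activation change: {edited - baseline:+.3f}")
```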
MAIA’s explanations of neuron behavior are evaluated in two main ways. First, the researchers use synthetic systems with known ground-truth behavior to assess the accuracy of MAIA’s interpretations. Second, for “real” neurons inside trained AI systems with no ground-truth descriptions, the authors designed a new automated evaluation protocol that measures how well MAIA’s descriptions predict neuron behavior on unseen data.
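The intuition behind this kind of predictive evaluation can be sketched in a few lines: a description earns a high score if the activations it predicts for held-out images track the activations the neuron actually produces. The toy scorer below is an assumed simplification for illustration, not the protocol or metric used in the paper.

```python
import numpy as np

def predictive_score(predicted: np.ndarray, measured: np.ndarray) -> float:
    """Correlation between the activations a description predicts for held-out
    images and the activations the neuron actually produces on them."""
    return float(np.corrcoef(predicted, measured)[0, 1])

# Toy usage: a description claiming "the neuron fires on neckties" predicts
# high activation for the first two (tie) images and low for the last two.
predicted = np.array([1.0, 1.0, 0.0, 0.0])    # hypothetical predictions
measured = np.array([0.9, 0.7, 0.1, 0.4])     # hypothetical measured responses
print(predictive_score(predicted, measured))  # close to 1.0 -> good description
```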
The CSAIL-led approach outperformed baseline methods at describing individual neurons in a variety of vision models, including ResNet, CLIP, and the vision transformer DINO. MAIA also performed well on the new dataset of synthetic neurons with known ground-truth descriptions. For both the real and synthetic systems, the descriptions were often on par with those written by human experts.
How can describing AI system components, like individual neurons, help? “Understanding and localizing behaviors within large AI systems is a critical part of auditing the safety of these systems before they are deployed. In several of our experiments, we show how MAIA can be used to find neurons with unwanted behaviors and remove those behaviors from a model,” says Schwettmann. “We are building toward a more resilient AI ecosystem, where the tools for understanding and monitoring AI systems keep pace with the systems’ scaling, allowing us to investigate and understand unexpected challenges posed by new models.”
A peek inside neural networks
The emerging field of interpretability is maturing into its own research area with the rise of “black box” machine learning models. How can researchers unravel these models and understand how they work?
Current methods for peering under the hood tend to be limited in terms of scale and explanatory precision. Moreover, existing methods tend to be tailored to specific models and specific tasks. This led researchers to ask: How can we build a general-purpose system that allows users to answer interpretability questions about AI models, while combining the flexibility of human experimentation with the scalability of automated techniques?
One of the key areas they wanted to address with this system was bias. To determine whether an image classifier showed bias against particular subcategories of images, the team looked at the final layer of the classification stream (a system designed to sort or label items, much like a machine deciding whether a photo shows a dog, cat, or bird) and the probability scores it assigns to input images (the confidence level the machine attaches to its guesses). To uncover potential biases in image classification, MAIA was asked to find a subset of images of a particular class (for example, “Labrador retriever”) that were likely to be mislabeled by the system. In this example, MAIA found that images of black Labradors were more likely to be misclassified, suggesting the model is biased toward retrievers with yellow fur.
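The kind of probe described above can be approximated with any off-the-shelf classifier: score each image of a class by the probability assigned to its true label, then surface the lowest-scoring examples for inspection. The sketch below assumes a pretrained torchvision ResNet and a hypothetical folder of Labrador images; it illustrates the idea rather than MAIA’s actual implementation.

```python
import torch
from torchvision import models
from PIL import Image
from pathlib import Path

weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()

LABRADOR = 208  # "Labrador retriever" index in the standard ImageNet-1k labels

def true_class_probability(image_path: Path) -> float:
    """Probability the classifier assigns to the correct ('Labrador') label."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = model(img).softmax(dim=1)
    return probs[0, LABRADOR].item()

# Hypothetical folder of Labrador images: the lowest-scoring examples are the
# ones most likely to be mislabeled, which the agent would then inspect for a
# shared attribute such as fur color.
# ranked = sorted(Path("labradors/").glob("*.jpg"), key=true_class_probability)
# print("most likely to be mislabeled:", [p.name for p in ranked[:10]])
```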
Because MAIA relies on external tools to design its experiments, its performance is bounded by the quality of those tools; as tools such as image synthesis models improve, so will MAIA. MAIA also sometimes exhibits confirmation bias, wrongly confirming its initial hypothesis. To mitigate this, the researchers built an image-to-text tool that uses a separate instance of a language model to summarize experimental results. Another failure mode is overfitting to a particular experiment, where the model sometimes draws premature conclusions from minimal evidence.
“We think a natural next step for our lab is to move beyond artificial systems and apply similar experiments to human perception,” says Rott Shaham. “Traditionally, testing this has required designing and testing stimuli by hand, which is labor-intensive. With our agent, we can scale up this process, designing and testing many stimuli simultaneously, which may even allow us to compare human visual perception with that of artificial systems.”
“Neural networks are difficult for humans to understand because they consist of hundreds of thousands of neurons, each with complex patterns of behavior. MAIA bridges this gap by developing an AI agent that can automatically analyze these neurons and report the extracted results in a human-friendly way,” said Jacob Steinhardt, an assistant professor at the University of California, Berkeley, who was not involved in the research. “Scaling up these methods could be one of the most important ways to understand and safely monitor AI systems.”
In addition to Rott Shaham and Schwettmann, the paper’s authors include CSAIL undergraduate Franklin Wang, MIT freshman Achyuta Rajaram, EECS doctoral student Evan Hernandez SM ’22, and EECS professors Jacob Andreas and Antonio Torralba. Their work was supported in part by the MIT-IBM Watson AI Lab, Open Philanthropy, Hyundai Motor Company, the Army Research Laboratory, Intel, the National Science Foundation, the Zuckerman STEM Leadership Program, and a Viterbi Fellowship. The researchers will present their findings this week at the International Conference on Machine Learning.