Because machine learning models can make incorrect predictions, researchers often give their models the ability to tell users how confident they are in a particular decision. This is especially important in high-stakes situations, such as when a model is used to identify diseases in medical images or filter job applications.
But quantifying a model’s uncertainty is only useful if it is accurate: if a model says there is a 49 percent chance that pleural effusion will be seen in a medical image, then the model should be correct 49 percent of the time.
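A standard, paper-independent way to check that property is expected calibration error, which groups predictions by their stated confidence and compares each group's average confidence with its actual accuracy. A minimal sketch, assuming NumPy arrays of per-prediction confidences and 0/1 correctness flags:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence and compare each bin's
    average confidence with its observed accuracy; the weighted gap
    across bins is the expected calibration error (ECE)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of points in bin
    return ece
```

A perfectly calibrated model would score zero: every bin's accuracy would match its confidence.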
Researchers at MIT have introduced a new approach that can improve uncertainty estimation in machine learning models: not only does the technique produce more accurate uncertainty estimates than other techniques, but it does so more efficiently.
What’s more, the technique is scalable, making it applicable to the large-scale deep learning models being increasingly deployed in healthcare and other safety-critical situations.
This technique could give end users without machine learning expertise better information for deciding whether to trust a model’s predictions, and whether the model should be deployed for a particular task at all.
“It’s clear that these models perform extremely well in some scenarios, and it’s easy to assume that they’ll perform equally well in other scenarios. That makes it particularly important to advance this kind of research to better calibrate the uncertainty in these models and make them more consistent with human notions of uncertainty,” says lead author Nathan Ng, a visiting student at MIT and a graduate student at the University of Toronto.
Ng co-authored the paper with Roger Grosse, an assistant professor in the University of Toronto’s Department of Computer Science, and senior author Marzyeh Ghassemi, an associate professor in the Department of Electrical Engineering and Computer Science and a member of the Institute for Medical Engineering and Science and the Laboratory for Information and Decision Systems. The research will be presented at the International Conference on Machine Learning.
Quantifying uncertainty
Uncertainty quantification methods often require complex statistical calculations that do not scale well to machine learning models with millions of parameters, and they require users to make assumptions about the model and the data used to train it.
The MIT researchers took a different approach. They use something called the minimum description length principle (MDL), which doesn’t require assumptions that can hinder the accuracy of other methods. MDL is used to better quantify and accommodate the uncertainty in the test points that a model is asked to label.
The technique the researchers developed, called IF-COMP, makes MDL fast enough to be used with large-scale deep learning models deployed in many real-world environments.
MDL considers all the labels a model could plausibly assign to a test point; if many alternative labels fit the point well, the model's confidence in the label it chose should decrease accordingly.
“One way to understand how much trust a model has is to give it counterfactual information and see how likely it is to believe it,” Ng says.
For example, take a model that says a medical image shows pleural effusion. If a researcher tells the model the image instead shows edema, and the model readily updates its belief, then it should be less confident in its original decision.
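The sketch below illustrates that idea in a deliberately simplified way: nudge a copy of the model toward a counterfactual label with one gradient step and measure how much its confidence in the original answer drops. This is only a stand-in for the paper's approach, which uses influence functions to approximate the effect of such an update without retraining; the model, input, and labels here are hypothetical.

```python
import copy
import torch
import torch.nn.functional as F

def belief_shift(model, x, original_label, counterfactual_label, lr=1e-3):
    """Crude counterfactual probe: take one SGD step toward the
    counterfactual label on a copy of the model, then measure how much
    confidence in the original label drops. A large drop suggests the
    model is easily swayed, i.e., not very certain to begin with."""
    with torch.no_grad():
        before = F.softmax(model(x), dim=-1)[0, original_label].item()

    probe = copy.deepcopy(model)
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    loss = F.cross_entropy(probe(x), torch.tensor([counterfactual_label]))
    loss.backward()
    opt.step()

    with torch.no_grad():
        after = F.softmax(probe(x), dim=-1)[0, original_label].item()
    return before - after
```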
In MDL, when the model is confident in labeling a data point, it should use a very short code to describe that point. When the decision is uncertain because a point could be labeled with many other labels, it should use a longer code to capture these possibilities.
The amount of code needed to label a data point is called stochastic data complexity. If researchers ask a model how willing it is to update its belief about a data point given contrary evidence, the stochastic data complexity should decrease if the model is confident.
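To make the code-length picture concrete, a Shannon-style code assigns a label a length of -log2 p(label | x) bits under the model's predictive distribution, so confident predictions compress well and uncertain ones do not. This is an illustrative calculation, not the paper's exact estimator of stochastic data complexity:

```python
import numpy as np

def code_length_bits(probs, label):
    """Shannon code length, -log2 p(label | x), for encoding `label`
    under the model's predictive distribution `probs`."""
    return -np.log2(probs[label])

confident = np.array([0.95, 0.03, 0.02])   # model strongly favors label 0
uncertain = np.array([0.40, 0.35, 0.25])   # several labels fit the point

print(code_length_bits(confident, 0))  # ~0.07 bits: short code
print(code_length_bits(uncertain, 0))  # ~1.32 bits: longer code
```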
However, testing each data point using MDL requires a significant amount of computation.
Speeding up the process
With IF-COMP, the researchers developed an approximation technique that can accurately estimate stochastic data complexity using a special function known as an influence function. They also employed a statistical technique called temperature scaling, which improves the calibration of the model’s outputs. Combining influence functions with temperature scaling yields a high-quality approximation of stochastic data complexity.
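Temperature scaling on its own is a standard post-hoc calibration step: a single scalar temperature is fit on held-out data and used to rescale the model's logits before the softmax. A minimal sketch of that piece alone, with illustrative function names (the influence-function approximation that IF-COMP adds on top is not shown):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def nll_at_temperature(T, logits, labels):
    """Average negative log-likelihood of `labels` after dividing
    the logits by temperature T before the softmax."""
    scaled = logits / T
    log_probs = scaled - logsumexp(scaled, axis=1, keepdims=True)
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels):
    """Choose the single temperature that minimizes negative
    log-likelihood on a held-out validation set; the same T is
    then reused to rescale logits at test time."""
    result = minimize_scalar(nll_at_temperature, bounds=(0.05, 10.0),
                             args=(val_logits, val_labels), method="bounded")
    return result.x
```

A temperature above 1 softens overconfident predictions; a temperature below 1 sharpens underconfident ones.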
Ultimately, IF-COMP can efficiently generate a well-calibrated uncertainty quantification that reflects the true reliability of the model. The technique can also determine if the model has mislabeled certain data points and reveal which data points are outliers.
The researchers tested their system on all three tasks and found it to be faster and more accurate than competing approaches.
“Having confidence that models are well calibrated is crucial, and there is a growing need to detect when specific predictions don’t look right. Auditing tools are increasingly necessary for machine learning problems, as we use large amounts of unvalidated data to build models that are then applied to human-facing problems,” Ghassemi says.
Because IF-COMP is model-agnostic, it can provide accurate uncertainty quantification for many types of machine learning models, allowing them to be deployed in a wider range of real-world environments, ultimately empowering more experts to make better decisions.
“People need to understand that these systems are highly fallible and can make up facts on the spot. They may seem very confident, but there are a lot of things they can be talked into believing when given evidence to the contrary,” Ng says.
In the future, the researchers are interested in applying this approach to large-scale language models and exploring other potential use cases for the minimum description length principle.