How do explanation methods for machine learning models work?

Imagine a team of physicians using a neural network to detect cancer in mammogram images. Even though the model appears to be working well, it could be focusing on image features that are spuriously correlated with cancers, such as a watermark or a timestamp, rather than on actual signs of disease.

 

Researchers use feature-attribution methods to test such models. These techniques are meant to identify which parts of an image were most relevant to the neural network’s prediction. But what if the attribution method misses features that are important to the model? Since researchers don’t know which features are important in the first place, they have no way of knowing that their evaluation method is ineffective.

 

MIT researchers have devised a process that modifies the original data so they know for certain which features are important to the model. They then use the modified dataset to evaluate whether feature-attribution methods correctly identify those important features.

 

The researchers found that even the most popular methods often miss important features in an image, and that some methods perform worse than a random baseline. This could have serious implications for neural networks used in high-stakes settings such as medical diagnosis, says Yilun Zhou, an electrical engineering and computer science graduate student at the Computer Science and Artificial Intelligence Laboratory (CSAIL).

 

All of these methods are widely used, particularly in high-stakes situations like detecting cancer from X-rays and CT scans, yet the feature-attribution methods themselves may be incorrect: they can point to something that does not correspond to the true feature the model uses to make its prediction, which the team found to often be the case. If feature-attribution methods are going to be used as evidence that a model works correctly, he says, then the attribution methods themselves must first be shown to work correctly.

 

Zhou co-authored the paper with fellow EECS graduate student Serena Booth, Microsoft Research researcher Marco Tulio Ribeiro, and senior author Julie Shah, an MIT professor of aeronautics and astronautics and director of the Interactive Robotics Group at CSAIL.

 

Focusing only on features

 

In image classification, each pixel of an image is a feature the neural network can use to make predictions, so there are literally millions of features it could focus on. Suppose researchers want to build an algorithm to help aspiring photographers improve, for example by training a model to distinguish professional photos from casual tourists’ photos. The model could then be used to compare amateur photos to professional ones and provide feedback on how to improve. Researchers would want this model to pick up on the artistic elements of professional photos, such as composition, color space, and postprocessing. But it just so happens that professionally shot photos often include a watermark of the photographer’s name, while few tourist photos do, so the model could take a shortcut and simply look for the watermark.

 

We don’t want to tell photographers that a watermark is all it takes to succeed, so we want to make sure the model relies on those artistic features rather than on the presence of a watermark. Zhou says that while it is tempting to use feature-attribution methods to analyze the model, there is ultimately no guarantee they will work properly, since the model could be using the artistic features, the watermark, or any other feature.

 

“We don’t know what spurious correlations are in the dataset,” Booth says. There could be many things that are completely invisible to the naked eye, such as the resolution of an image. Even if a feature is not perceptible to humans, a neural network can likely pick it out and use it to classify. That is the root of the problem: not only is it difficult to understand our data, it is impossible to understand them fully.

 

To test these methods, the researchers first modify the dataset to sever all correlations between the original images and the data labels, which guarantees that none of the original features will be important to the model anymore.

 

They then add a new feature that is so obvious the neural network has to focus on it to make its prediction: for example, bright rectangles in different colors for different image classes.
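
As a concrete illustration, here is a minimal sketch of how such a modified dataset could be built. It assumes images stored as NumPy arrays, a hypothetical per-class color palette, and label shuffling as the way of severing the original image-label correlations; the researchers’ exact construction may differ.

```python
import numpy as np

# Hypothetical palette: one distinctive rectangle color per class.
PALETTE = {0: (255, 0, 0), 1: (0, 255, 0), 2: (0, 0, 255)}

def add_class_rectangle(image, label, size=16):
    """Paint a bright, class-coded rectangle into the top-left corner.

    Once the original image-label correlations have been severed, this
    rectangle is the only feature that reliably predicts the label.
    """
    out = image.copy()                     # image: (H, W, 3) uint8 array
    out[:size, :size, :] = PALETTE[label]
    return out

def build_modified_dataset(images, labels, seed=0):
    """Shuffle the labels to sever the original correlations, then plant
    a class-coded rectangle in every image."""
    rng = np.random.default_rng(seed)
    new_labels = rng.permutation(labels)
    new_images = np.stack(
        [add_class_rectangle(img, int(y)) for img, y in zip(images, new_labels)]
    )
    return new_images, new_labels
```

Any model that reaches high accuracy on such a dataset can only be reading the rectangle, which is what lets the researchers say with certainty which feature the model must be using.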

 

We can confidently say that any model achieving high confidence has to focus on the colored rectangle we put in, Zhou says. That lets the researchers check whether the feature-attribution methods rush to highlight that area over everything else.

 

“Especially alarming” results

 

The researchers used this technique to test a variety of feature-attribution methods. These methods produce what is called a saliency map, which shows how the most important features are distributed across an image. For instance, if the neural network is classifying images of birds, the saliency map might show that 80 percent of the most important features are concentrated around the bird’s beak.
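
The article does not say which attribution methods were tested, but as one common example of how a saliency map is produced, here is a minimal vanilla-gradient sketch in PyTorch, assuming a generic image classifier named model; it is an illustration, not the researchers’ implementation.

```python
import torch

def gradient_saliency(model, image, target_class):
    """Vanilla-gradient saliency map: how strongly each pixel of `image`
    (a tensor of shape (1, C, H, W)) influences the score of `target_class`."""
    model.eval()
    image = image.clone().detach().requires_grad_(True)
    score = model(image)[0, target_class]           # scalar class score
    score.backward()                                # gradients w.r.t. the input
    # Keep the largest absolute gradient across color channels for each pixel.
    saliency = image.grad.abs().max(dim=1).values   # shape (1, H, W)
    return saliency.squeeze(0)
```

Many variants exist (integrated gradients, SmoothGrad, occlusion-based methods, and so on), but the evaluation described next applies to any method that outputs such a map.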

 

After severing all the correlations in the image data, the researchers manipulated the photos in a variety of ways, such as blurring parts of an image, adjusting the brightness, or adding a watermark. If a feature-attribution method is working correctly, nearly 100 percent of the most important features it identifies should be located around the area the researchers manipulated.
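
One simple way to score that criterion, sketched here under the assumption that the manipulated region is available as a boolean mask and that we measure what fraction of the top-ranked saliency pixels fall inside it (the researchers’ exact metric may differ):

```python
import numpy as np

def fraction_inside_region(saliency, region_mask, top_frac=0.1):
    """Fraction of the most-salient pixels that land inside the manipulated region.

    saliency:    2-D array of per-pixel importance scores (e.g., a saliency map).
    region_mask: boolean array of the same shape, True inside the planted feature.
    A value near 1.0 means the attribution method found the planted feature;
    the article reports that many methods hovered around a 50 percent baseline.
    """
    flat = saliency.ravel()
    k = max(1, int(top_frac * flat.size))
    top_idx = np.argpartition(flat, -k)[-k:]   # indices of the k largest scores
    return float(region_mask.ravel()[top_idx].mean())
```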

 

The results were not encouraging. None of the feature-attribution methods came close to the 100 percent goal; most only managed to reach a random baseline of 50 percent, and some performed even worse than that baseline. In some cases the methods failed to highlight the new feature at all, even though it was the only feature the model could have used to make its prediction.

 

“None of these methods seem to be very reliable across all the different types of spurious correlations. This is particularly alarming because we don’t know which spurious correlations may apply in natural datasets,” Zhou says. It could be many factors, and although the team thought these methods could be trusted to tell the truth, the experiment shows it is very difficult to trust them.

 

They also found that all of the feature-attribution methods were better at detecting an anomaly than at recognizing its absence: the methods could find a watermark more readily than they could recognize that an image does not contain one. In this case, it would be harder for humans to trust a model that makes a negative prediction.

 

The team’s work shows that feature-attribution methods should be tested before they are applied to a real-world model, particularly in high-stakes scenarios.

 

Shah explains that researchers and practitioners may adopt explanation techniques such as feature-attribution methods to instill a person’s trust in a model, but that trust cannot be established unless the explanation technique has first been thoroughly evaluated. While an explanation technique can be used to help calibrate someone’s trust in a model, it is equally important to calibrate a person’s trust in the explanations of the model, she says.

 

The researchers plan to continue using their evaluation process to examine more subtle and realistic features that could cause spurious correlations. They also plan to study how humans can better understand saliency maps in order to make better decisions based upon predictions from a neural network.
