In an experiment conducted by MIT researchers to see how biased AI recommendations affect emergency decisions, participants were asked to decide whether to call for medical or police assistance on behalf of individuals experiencing mental health crises.
It is well known that people harbour prejudices, some of which may be painfully subliminal. The typical person might assume that computers, machines typically made of silicon, steel, glass, plastic, and various metals, are neutral. That assumption may hold for computer hardware, but it does not always hold for computer software, which is written by fallible humans and is often fed data that is itself flawed.
For instance, artificial intelligence (AI) systems based on machine learning are increasingly being utilised in medicine to analyse X-rays and help detect specific diseases, and they also support decisions in other areas of healthcare. However, as recent research has shown, machine learning models can encode prejudices against minority subgroups, and the recommendations they provide may carry those same prejudices.
Medical AI models can also be unreliable and inconsistent, partly because the data used to train them is often not representative of real-world conditions. Different X-ray machines, for instance, can record images differently and therefore yield different results. In addition, models trained largely on white patients may be less accurate when applied to other populations. The Communications Medicine study, however, is chiefly concerned with problems caused by bias and with methods for minimising adverse outcomes.
Experiment
To find out how AI biases affect decision-making, an experiment was conducted with 954 participants: 438 clinicians and 516 nonexperts. Participants were shown call summaries from a fictitious crisis hotline, each featuring a male subject undergoing a mental health crisis. The summaries included details such as the subject’s race, Caucasian or African American, and in some cases his religion, noting whether he was Muslim. A typical summary might describe an African American man found at home in a delirious state, adding that “he has not used drugs or alcohol as he is a practising Muslim.” Participants were instructed to call the police if they believed the patient was likely to become violent; otherwise, they were to seek medical help.
Some participants in the experiment received no AI advice at all; the researchers gave the four other groups recommendations generated by either unbiased or biased models, presented in either “prescriptive” or “descriptive” form. A biased model, for instance, would be more likely than an unbiased one to advise calling the police in a scenario involving an African American or Muslim subject. Participants did not know which type of model they were seeing, or how likely it was to be biased. Prescriptive advice spells out exactly what a person should do, telling them to call the police in one case and to seek medical help in another. Descriptive advice is less blunt: a flag is raised if the AI system perceives a risk of violence associated with a call, and no flag is shown if the perceived risk is slight.
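To make the distinction concrete, the Python sketch below shows how a hypothetical system might turn a model’s estimated risk of violence into the two advice formats described in the study. The risk score, threshold, function name, and message wording are illustrative assumptions, not details taken from the experiment.

```python
# A minimal sketch (not the study's actual system) of prescriptive vs. descriptive advice.
# The risk score, threshold, and wording below are illustrative assumptions.

def present_advice(violence_risk: float, style: str, threshold: float = 0.5) -> str:
    """Turn a model's estimated risk of violence into advice shown to a participant."""
    if style == "prescriptive":
        # Prescriptive: tells the participant exactly what to do.
        return "Call the police." if violence_risk >= threshold else "Seek medical help."
    if style == "descriptive":
        # Descriptive: only flags the call; the decision is left to the participant.
        return "FLAG: possible risk of violence." if violence_risk >= threshold else "(no flag shown)"
    raise ValueError("style must be 'prescriptive' or 'descriptive'")

# Example: the same (possibly biased) risk estimate, presented two ways.
print(present_advice(0.7, "prescriptive"))  # -> Call the police.
print(present_advice(0.7, "descriptive"))   # -> FLAG: possible risk of violence.
```

Presented descriptively, the same underlying risk estimate leaves the final call to the participant, which is the flexibility the study suggests helps preserve unbiased decision-making.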
Conclusion
The key finding of the experiment, according to the researchers, was that participants “were heavily influenced by prescriptive recommendations from a biased AI system.” However, they also found that by using descriptive rather than prescriptive recommendations, “participants were able to retain their initial, unbiased decision-making.” In other words, by carefully framing the advice, the researchers could lessen the effect of the prejudice ingrained in an AI model. Why, then, do the outcomes vary depending on how the recommendation is presented? When someone is told to do something, such as call the police, says lead author Hammaad Adam, there is very little room for doubt. When the situation is merely described, however, whether or not a flag is present, “it creates flexibility for a participant’s interpretation; it allows them to be more flexible and analyse the issue for themselves.”
The researchers also found that the language models frequently used to generate such recommendations are easy to skew. Language models, a class of machine learning system, are typically trained on vast quantities of text, including the whole of Wikipedia and other internet content. But when these models are “fine-tuned” on a far smaller selection of training data, only 2,000 phrases as opposed to 8 million web pages, they can readily be skewed.
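As a rough illustration of how little data such fine-tuning involves, the sketch below fine-tunes a small pretrained language model on roughly 2,000 labelled call summaries using the Hugging Face transformers and datasets libraries. The model choice, placeholder texts, and labels are assumptions made for this example, not the researchers’ actual setup; the point is only that if the small fine-tuning set were systematically harsher towards one group, the resulting classifier would inherit that bias.

```python
# A minimal sketch, not the researchers' code: fine-tuning a small pretrained
# language model on ~2,000 labelled call summaries. All texts, labels, and the
# model choice are placeholder assumptions for illustration.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # any small pretrained language model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical fine-tuning set: call summaries labelled 0 = "seek medical help",
# 1 = "call the police". If these labels were systematically harsher for one
# demographic group, the fine-tuned model would reproduce that bias.
train_data = Dataset.from_dict({
    "text": ["Summary of a fictitious crisis-hotline call ..."] * 2000,  # placeholders
    "label": [0, 1] * 1000,                                              # placeholders
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-crisis-model",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_data,
)
trainer.train()  # a couple of thousand examples are enough to shift the model's behaviour
```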
A third finding from the MIT researchers was that biased algorithmic recommendations can mislead decision-makers even if those decision-makers are not themselves biased. Whether or not participants had medical training made no significant difference to their responses. “Clinicians were as much affected by biased models as non-experts were,” the researchers concluded.