You want to play a song? You ask Alexa to play it for you. You want to call a friend? You ask Siri, and it dials the call. You come across a new word while reading? You search for it and get all the related synonyms and antonyms.
This is Natural Language Processing. Alexa, Siri, email and text autocorrect, and chatbots all use Natural Language Processing (NLP) to process and respond to human language, spoken as well as written.
With AI becoming an indispensable part of our day-to-day lives, NLP and Natural Language Understanding (NLU) are also growing by leaps and bounds. However, the incredible complexity, inconsistency, and fluidity of human language still create several challenges that NLP needs to resolve.
Exclusion
Due to the situatedness of language, every data set carries a demographic bias, and overfitting to this bias can severely limit the applicability of findings. In Natural Language Processing, this overfitting amounts to the model's assumption that all language use is identical to the training sample.
This leads to exclusion or demographic misinterpretation. For instance, standard language technology may work better for white men than for women or Asian men, because data from the former group dominated its development.
This hidden bias reinforces already existing demographic differences and makes the technology less user-friendly for the underrepresented groups. In the long run, it can lead to their exclusion altogether.
The fix – Standard measures against imbalanced data and overfitting, such as resampling or reweighting, can be used to correct demographic bias in the data.
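To make this concrete, here is a minimal sketch of one such measure: reweighting training samples so that an underrepresented demographic group contributes as much to the training loss as the majority group. The synthetic data, the group labels, and the scikit-learn setup are illustrative assumptions, not part of the original article.

```python
# Sketch: counter demographic imbalance by weighting each sample inversely
# to its group's frequency. Data and group labels here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)

# Toy features and task labels: 900 samples from group A, 100 from group B.
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)
group = np.array(["A"] * 900 + ["B"] * 100)

# "balanced" weights make each group contribute equally overall, so the model
# cannot simply overfit to the majority group's usage patterns.
weights = compute_sample_weight(class_weight="balanced", y=group)

clf = LogisticRegression()
clf.fit(X, y, sample_weight=weights)
```

Evaluating accuracy separately per group, rather than only in aggregate, would then verify whether the reweighting actually narrows the performance gap.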
Overgeneralization
While exclusion is a side effect of the data, overgeneralization is a side effect of modeling. Let's consider automatic inference of user attributes, an interesting NLP task. Though it holds promising solutions and practical benefits, wrong assumptions can prove counterproductive.
Let's say you receive an email congratulating you on your wedding anniversary when it is actually your 28th birthday. You would be amused. However, if a similar confusion happened with your religious views or sexual orientation, you might not take it so lightly.
The fix – To address overgeneralization, we can use dummy variables, such as an explicit 'unknown' class, rather than the tertium non datur (no third option) approach to classification.
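As an illustration, the sketch below lets a classifier fall back to an "unknown" dummy class when its confidence is low, instead of forcing a choice between fixed attribute values. The threshold, the synthetic data, and the scikit-learn setup are assumptions made for the example.

```python
# Sketch: abstain ("unknown") on low-confidence attribute predictions
# instead of forcing one of the known classes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = rng.integers(0, 2, size=200)  # e.g., a binary user attribute

clf = LogisticRegression().fit(X_train, y_train)

def predict_with_unknown(model, X, threshold=0.8):
    """Return the predicted class, or -1 ("unknown") below the threshold."""
    proba = model.predict_proba(X)
    best = proba.max(axis=1)
    preds = proba.argmax(axis=1)
    return np.where(best >= threshold, preds, -1)

X_new = rng.normal(size=(5, 4))
print(predict_with_unknown(clf, X_new))
```

Downstream logic can then treat the unknown class as "do not personalize," which is usually safer than acting on a shaky inference about a sensitive attribute.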
Overexposure
Unlike exclusion and overgeneralization, which can be addressed algorithmically, overexposure stems from research design. The effect usually appears when mainstream attention to a particular topic increases; such topic overexposure feeds the availability heuristic.
This is a psychological effect in which people judge events (or groups, individuals, characteristics, etc.) as important simply because they recall them more often. These heuristics become ethically charged when negative emotions are strongly associated with certain groups. For instance, if research repeatedly characterizes a particular group as brutal, oppressive, or abnormal, it entrenches those perceived differences.
The fix – To avoid overexposure, it should be ensured that models are trained ethically, without the overgeneralizations or exclusions described above.
Underexposure
Natural Language Processing focuses far more on Indo-European data sources than on smaller languages from other language families. This focus creates an imbalance in the amount of labeled data available, as most of it covers only a small set of languages.
In particular, Natural Language Processing is geared mostly toward English, which has created an underexposure to typological variety in both syntax and morphology.
The fix – Though there are several approaches to developing multilingual NLP tools, they carry weaker commercial incentives than English-only tools, so researchers are less likely to work on them.
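For those who do want to work beyond English, one accessible starting point is a pretrained multilingual encoder. The sketch below assumes the Hugging Face transformers library and the publicly available bert-base-multilingual-cased checkpoint; it is one possible approach, not a method prescribed by the article.

```python
# Sketch: a single pretrained multilingual encoder embeds text from
# typologically different languages with one model.
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # trained on 100+ languages
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

sentences = [
    "The cat sat on the mat.",        # English
    "Die Katze saß auf der Matte.",   # German
    "Kissa istui matolla.",           # Finnish, a non-Indo-European language
]
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs)
    # One contextual vector per token, produced by the same model throughout.
    print(text, outputs.last_hidden_state.shape)
```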
The conclusion
Even if we address the concerns listed above, NLP can still have negative consequences for people's lives. What should be remembered is that it still offers immense benefits to businesses. And with new technologies cropping up every other day, these glitches will likely be worked out in the coming years.
However, this does not diminish the importance of caution and correction in remedying these issues today.
Source: indiaai.gov.in