The impact that Artificial Intelligence (AI) is projected to have on our economy and our daily lives is nothing short of astounding. AI is predicted to contribute $15.7 trillion to the world economy by 2030. While its prominence has broadened its adoption and use cases, criticism abounds: job losses, unintended biases, privacy and surveillance concerns, and even the energy-hungry data centres used to build AI models. As with any new technology, whether it is abused or put to safe and productive use, with the right ethics and regulations, rests on us.
With significant adoption underway in all facets of life and business, the challenges of training AI with unbiased data, data scarcity, trust, explainability and privacy are becoming the top concerns for broader adoption. Researchers and thought leaders worldwide are working to solve them, and several new frontiers are emerging and being explored. We took a deeper dive to understand these challenges and summarise what we learned here.
Current AI systems are largely black-box models with limited explainability, which creates barriers to adoption, especially in regulated environments like healthcare. This is where Explainable AI (XAI) comes in, as captured in a comprehensive paper by DARPA experts. XAI addresses the black-box problem through two approaches: post-hoc systems, which provide local explanations of individual predictions, and ante-hoc systems, which are interpretable by design. Both try to turn the black box into a glass box, or at least a semi-glass box. Another route to the glass box is Interactive Machine Learning (IML), which keeps a human in the loop to observe trends in the algorithmic loops and make decisions that ultimately yield a better understanding of the model. Several XAI frameworks and tools are in development, and a plethora of research is ongoing.
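To make the post-hoc idea concrete, here is a minimal sketch of one simple way to produce a local explanation: replace each input feature with a baseline value and record how much the prediction moves. The `black_box` risk model, the feature names and the baseline values are all hypothetical, invented for illustration; real XAI tools use more careful attribution methods than this.

```python
def black_box(features):
    """Hypothetical opaque model: returns a risk score for one patient."""
    age, blood_pressure, cholesterol = features
    bp_term = 0.5 if blood_pressure > 140 else 0.0
    return 0.02 * age + bp_term + 0.001 * cholesterol


def local_explanation(model, instance, baseline, names):
    """Attribute one prediction to features by swapping each for a baseline.

    The attribution for a feature is the drop in the model output when only
    that feature is replaced by its baseline value.
    """
    original = model(instance)
    attributions = {}
    for i, name in enumerate(names):
        perturbed = list(instance)
        perturbed[i] = baseline[i]
        attributions[name] = original - model(tuple(perturbed))
    return attributions


names = ["age", "blood_pressure", "cholesterol"]
patient = (60, 150, 200)    # the single prediction we want explained
reference = (40, 120, 180)  # assumed "typical" baseline patient

attributions = local_explanation(black_box, patient, reference, names)
for name, value in sorted(attributions.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:+.3f}")
```

The explanation is local: it describes why this one patient received this score, not how the model behaves in general.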
Artificial Intelligence research has picked up significantly in India, and our review of patents and research shows a solid research base here in ‘edge AI’ and ‘Federated Learning’. Tech giants have released edge frameworks orthogonal to the well-entrenched cloud-based AI/ML. Federated learning involves a central server that collates information from many edge-trained models to create a global model, without transferring the local data used for training. It enables a hyper-personalised approach, is time-efficient and cost-effective, and is supposedly privacy friendly, as user data is not sent to the cloud.
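The central-server flow described above can be sketched in a few lines. This is an illustrative toy, assuming the common federated averaging scheme (each client trains locally, the server takes a size-weighted average of the resulting weights); the linear model, the synthetic data and every function name here are our own assumptions, not any particular framework's API.

```python
import random


def local_update(weights, data, lr=0.1, epochs=20):
    """Gradient descent on one client's private data for y = w*x + b."""
    w, b = weights
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = (w * x + b) - y
            grad_w += 2 * err * x / len(data)
            grad_b += 2 * err / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(client_weights, client_sizes):
    """Server step: average client weights, weighted by local dataset size."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)


def make_client_data(rng, n):
    """Hypothetical private data following y = 3x + 1 with a little noise."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3 * x + 1 + rng.gauss(0, 0.05)) for x in xs]


rng = random.Random(0)
clients = [make_client_data(rng, n) for n in (20, 50, 30)]

global_weights = (0.0, 0.0)
for _ in range(30):  # communication rounds
    # Only model weights travel to the server; raw data stays on each client.
    updates = [local_update(global_weights, data) for data in clients]
    global_weights = federated_average(updates, [len(d) for d in clients])

print(global_weights)
```

Note that the server never sees a single `(x, y)` pair, which is the basis of the privacy claim, although the weights themselves can still leak information in practice.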
At the same time, to accelerate AI at the edge, a new generation of AI edge chips (in neuromorphic and digital-analogue flavours) is arriving to handle much heavier training and inferencing at the edge, running orders of magnitude faster and with a significantly lower power footprint. The new release of Google Chrome has implemented Federated Learning of Cohorts (FLoC), Google’s take on Federated Learning and an initiative to eliminate the pervasive online trackers and cookies in a privacy- and security-conscious world. FedVision is an open-source platform that supports the development of edge-powered computer vision applications, where uploading videos is a big privacy concern. With over 700 active AI startups in India, we expect to see some quality initiatives here.
AutoML has seen significant progress in ensuring that data scientists are not stuck in repetitive and time-consuming tasks, from data cleaning through experimenting with different models and hyperparameters to fine-tuning them for the best results. AutoML often relies on a reinforcement learning and recurrent neural network approach, so that models and parameters start from an initial input or an auto-picked choice and are then continuously and automatically refined based on results.
There is a wide variety of platforms in the market today, and we are at Gen 3 of AutoML evolution, with more verticalised, domain-specific platforms. Most platforms still automate only the selection of the model and the hyperparameters, which means that data scientists still need to do the bulk of the work in data preparation and cleaning, where the majority of time is often spent. More advanced platforms also cover cleaning, encoding and feature extraction, a must for building a good model quickly, but the approach is template driven and may not always be a good fit.
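The "start from an initial choice, then refine automatically based on results" loop can be illustrated with a much simpler search than the reinforcement learning approach mentioned above. The sketch below swaps in plain hill climbing over a single hyperparameter (the `k` of a nearest-neighbour regressor) scored on a validation split. The dataset, the model and the starting value are all assumptions made for illustration.

```python
import random


def knn_predict(train, x, k):
    """Predict y at x as the mean of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def validation_error(train, valid, k):
    """Mean squared error of the k-NN model on held-out data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in valid) / len(valid)


rng = random.Random(1)
xs = [rng.uniform(-2, 2) for _ in range(120)]
points = [(x, x * x + rng.gauss(0, 0.1)) for x in xs]  # hypothetical data
train, valid = points[:80], points[80:]

best_k = 40  # deliberately poor starting hyperparameter
best_err = validation_error(train, valid, best_k)
initial_err = best_err

for _ in range(30):
    # Propose a nearby candidate and keep it only if validation error improves.
    candidate = max(1, min(len(train), best_k + rng.randint(-5, 5)))
    err = validation_error(train, valid, candidate)
    if err < best_err:
        best_k, best_err = candidate, err

print(f"k={best_k}, validation MSE {initial_err:.4f} -> {best_err:.4f}")
```

Everything in this loop assumes clean, numeric data is already available, which is exactly the preparation work that most platforms leave to the data scientist.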
AI practitioners have always been plagued by a paucity of data, hence the effort to build acceptable models from reduced datasets or simply the quest to find more data. Ways to find more data include public annotated datasets (e.g. Google public datasets, AWS open data), data augmentation, which runs transforms on the available data, and transfer learning, where a similar but larger dataset is used to train the models. Rapid progress continues on the creation of artificial or synthetic data. Synthetic Minority Over-sampling Technique (SMOTE) and several of its modifications are used in classic cases where minority-class data is sparse and is therefore oversampled. Generating completely new data through self-learning (AlphaGo self-played 4.9 million times) and simulation (recreating city traffic scenarios using gaming engines) are more recent approaches to creating synthetic data. Unfortunately, more data also amplifies the resource and time constraints of training, including the time and effort required to clean the data and remove noise, redundancies, outliers and so on. The holy grail of AI training is Few-Shot Learning (FSL), that is, training with a much smaller dataset. It is an area of active research, as highlighted in this recent survey paper.
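As an illustration of how SMOTE-style oversampling creates synthetic minority data, the sketch below implements only the core idea: pick a minority sample, pick one of its nearest minority neighbours, and generate a new point somewhere on the line between them. It is a simplified toy rather than a full implementation, and the sample points and parameter values are made up.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen sample (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical sparse minority class with two features per sample.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
for point in new_points:
    print(point)
```

Because every synthetic point lies between two real minority samples, the method densifies the region the minority class already occupies rather than inventing genuinely new information.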
A vast number of open-source models and datasets, together with active collaboration and benchmarks, continue to accelerate AI development. OpenAI’s GPT-3 launch took NLP to another level with 175 billion parameters trained on 570 gigabytes of text. Huawei recently trained a Chinese counterpart to GPT-3 on 1.1 terabytes of Chinese text. Alphabet subsidiary DeepMind’s AlphaFold achieved the most significant breakthrough in biology, with 92.4 percent accuracy in the well-known protein structure and folding prediction competition. Cityscapes has built a large-scale dataset of diverse urban street scenes from 50 cities. Beyond image and language recognition, the next frontier of AI is intent understanding from video. While India rose in the AI Vibrancy index from rank 23 to 5 in 2021, a lot still needs to be done in terms of collaboration, open source and India-specific datasets.
With the growing need to secure sensitive and private information, there is a call for machine learning algorithms to run on data that stays protected by encryption. Homomorphic encryption (HE) is a concept now being leveraged to train models on data without decrypting it and risking data leaks. Intel is one of the players in this space and has collaborated with Microsoft to develop silicon for this purpose. With growing interest in research and development in this field, HE methods will become more commonplace and advanced.
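To show what computing on encrypted data looks like, here is a toy sketch using the Paillier cryptosystem, a well-known additively homomorphic scheme. It is our choice for illustration and not necessarily what the vendors above use. The example computes a linear model score on encrypted features. It does not train a model, the primes are far too small to be secure, and Python's `random` module is not a cryptographic random source. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier keys from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (n, lam, mu)


def encrypt(n, message, rng):
    """Encrypt an integer 0 <= message < n under public key n."""
    n_sq = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(private_key, ciphertext):
    n, lam, mu = private_key
    u = pow(ciphertext, lam, n * n)
    return ((u - 1) // n) * mu % n


rng = random.Random(42)
public_n, private_key = keygen()
n_sq = public_n * public_n

features = [5, 7, 11]  # private inputs, encrypted by the data owner
weights = [2, 3, 1]    # plaintext model weights held by the server
encrypted_features = [encrypt(public_n, m, rng) for m in features]

# Server side: multiplying ciphertexts adds the plaintexts, and raising a
# ciphertext to a power multiplies its plaintext by that constant.
encrypted_score = 1
for w, c in zip(weights, encrypted_features):
    encrypted_score = (encrypted_score * pow(c, w, n_sq)) % n_sq

score = decrypt(private_key, encrypted_score)
print(score)  # 2*5 + 3*7 + 1*11 = 42
```

The server computes the weighted sum without ever seeing the features in the clear; only the holder of the private key can read the result.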
Removing toxicity and biases is the aim of Ethical AI or Responsible AI, but development is at a nascent stage. Google and Accenture have announced Responsible AI frameworks. The European Commission’s white paper on AI focuses on trust, and the formation of the UN AI ethics committee is an excellent initiative.
The evolution of AI is happening at a breakneck pace, and 2021 will be no different.
Source: cio.economictimes.indiatimes.com