MIT Technology Review named synthetic data for AI one of its 10 breakthrough technologies of 2022. Data scientists now report getting results from synthetic datasets comparable to those from real ones. For instance, Amazon’s Alexa AI team used synthetic data to train its natural language understanding (NLU) system when launching versions of Alexa in three new languages: Hindi, Brazilian Portuguese and US Spanish. This laid the foundation for AI researchers to train their models on new languages without large volumes of customer-interaction data. Some other use cases:
- Waymo, a subsidiary of Google’s parent company Alphabet, uses synthetic data to train its autonomous vehicles.
- Amazon uses synthetic images to train the vision recognition systems behind Amazon Go, its cashier-less stores.
- American Express uses synthetic financial data to improve its fraud detection.
Fascinating, isn’t it? But what exactly is synthetic data?
To be precise, synthetic data is an alternative to real-world data: annotated information created with the help of algorithms or computer simulations.
Why is synthetic data gaining traction?
Democratising access to data: Today, machine learning models require enormous amounts of training data. For instance, a typical image classification problem may require tens of thousands of images to build a classifier. Such data, however, is not always easy to obtain. Moreover, datasets can cost millions of dollars to create, if usable data exists in the first place.
“Data is the new oil in today’s age of AI, but only a lucky few are sitting on a gusher. So, many are making their own fuel, one that’s both inexpensive and effective. It’s called synthetic data,” wrote Gerard Andrews, Sr. Product Marketing Manager at NVIDIA, in a blog post.
Potential to weed out biases: Even the best datasets frequently carry biases that negatively affect a model’s performance. For example, researchers at Data Science Nigeria found that engineers training computer-vision algorithms could choose from many accessible data sets featuring Western apparel, but none featuring African clothing. The scientists countered the disparity by using AI to create synthetic images of African fashion, generating a completely new data set from scratch.
Guards against privacy concerns: From edge computing, machine learning and AI to Web 3.0 and even the metaverse, today’s proliferating technologies rely on large datasets to function effectively. In parallel, users are increasingly aware of the need to protect their personal data and to prevent firms from misusing it.
Multiple regulations, including GDPR in Europe and the Personal Data Protection (PDP) Bill in the Indian Parliament, aim to prevent businesses from utilising personal data without explicit consent, and some of the largest tech firms have already run afoul of such laws. Synthetic data has an advantage here: because it need not contain real individuals’ records, it can be generated and shared with far fewer privacy-compliance concerns.
Provides scalability: Only a handful of data scientists have datasets of the quality and scale required to train their predictive models, while many have none at all. Synthetic data can bridge this gap. Many data scientists now use synthetic data to augment their real-world records, swiftly scaling up existing data, or only the necessary subsets of it, to surface more insightful observations and patterns.
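As a rough illustration of scaling up a small set of real-world records, the sketch below fits a very simple statistical model to a handful of rows and samples new ones from it. The records, the column meanings (age, monthly spend) and the function names are hypothetical, and treating every column as an independent Gaussian is a deliberate simplification that ignores correlations; production generators are considerably more sophisticated.

```python
import random
import statistics

def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of one numeric column."""
    return statistics.mean(values), statistics.stdev(values)

def synthesize(real_rows, n, seed=0):
    """Sample n synthetic rows, modelling each column as an independent Gaussian.

    Assumption: columns are numeric and independent of each other.
    """
    rng = random.Random(seed)
    columns = list(zip(*real_rows))  # transpose rows into columns
    params = [fit_gaussian(col) for col in columns]
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in params) for _ in range(n)]

# Hypothetical "real" records: (age, monthly_spend)
real = [(23, 410.0), (35, 620.5), (41, 580.0), (29, 450.25), (52, 700.0)]

synthetic = synthesize(real, n=1000)
augmented = real + synthetic  # real records enhanced with synthetic ones
```

Here five real records become a training set of 1,005 rows whose per-column statistics roughly follow the originals.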
The way forward
Given the benefits above, synthetic data does hold the potential to improve ML model training, but a cautious approach is required to realise that potential. It is important to note that synthetic data is itself often generated from a limited amount of original data. What if the original data lacks diversity? That question needs to be answered every time before synthetic datasets are used to train future AI models.
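The diversity concern can be made concrete with a small sketch. It assumes a naive generator that only learns category frequencies from the original data; the labels, the expected category set and the function names are all hypothetical. A generator of this kind cannot produce a category it never saw, so a gap in the original data carries straight through to the synthetic set.

```python
import random
from collections import Counter

def fit_category_model(labels):
    """Learn the relative frequency of each category in the original data."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def sample_labels(model, n, seed=0):
    """Draw n synthetic labels according to the learned frequencies."""
    rng = random.Random(seed)
    labels = list(model)
    weights = [model[label] for label in labels]
    return rng.choices(labels, weights=weights, k=n)

def missing_categories(labels, expected):
    """Return the expected categories that never appear in the labels."""
    return set(expected) - set(labels)

# Hypothetical original data with no African clothing at all
original = ["western"] * 95 + ["south_asian"] * 5
expected = {"western", "south_asian", "african"}

synthetic = sample_labels(fit_category_model(original), n=10_000)

gap_before = missing_categories(original, expected)
gap_after = missing_categories(synthetic, expected)
```

Running a check such as `missing_categories` on both the original and the synthetic data is one simple way to ask the diversity question before training.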
Source: https://indiaai.gov.in/article/training-ai-with-synthetic-data-is-odds-on-favourite-here-s-why