As a step toward supporting 1,000 languages, Google researchers have just released an upgrade to their Universal Speech Model (USM). According to the researchers, the model outperforms OpenAI's Whisper across the board for automatic speech recognition.
Researchers can request access to the USM API here.
According to the paper, “Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages,” the model’s encoder is pre-trained on a huge, unlabeled multilingual dataset and then fine-tuned on a much smaller amount of labelled data, which allows it to recognize under-represented languages. The training procedure also adapts effectively to new data and languages.
The researchers demonstrated the pre-trained encoder’s effectiveness by fine-tuning it on multilingual speech data from YouTube Captions. Despite the limited supervised data available from YouTube, the model achieves a record-low average word error rate (WER) of below 30% across all 73 languages. Compared with Whisper (large-v2), which was trained on more than 400k hours of labelled data, USM achieves, on average, a 32.7% relatively lower WER on the 18 languages that Whisper can decode with a WER under 40%. Overall, USM outperforms Whisper across the board for automatic speech recognition.
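For readers unfamiliar with the metric, the snippet below sketches how WER and a relative WER reduction are computed. The transcripts and numbers in it are purely illustrative examples, not results from the paper.

```python
# A minimal sketch of word error rate (WER) and relative WER reduction.
# The example strings and percentages are illustrative only.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative improvement of one system's WER over a baseline's."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat mat")
    print(f"WER: {wer:.2%}")
    # e.g. dropping from 45% to 30.3% WER is a ~32.7% relative reduction
    print(f"relative reduction: {relative_wer_reduction(0.45, 0.303):.1%}")
```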
The 1,000 Languages Initiative was introduced last November with the goal of building a machine learning model that supports the 1,000 most spoken languages in the world for greater inclusivity on a global scale. The main challenge is supporting languages with few speakers or little available data, since some of these languages are spoken by fewer than twenty million people.
The USM is a family of speech models with two billion parameters, trained on a massive dataset of 12 million hours of audio and 28 billion sentences of text spanning more than 300 languages. The models can automatically recognize speech in low-resource languages such as Amharic, Cebuano, Assamese, and Azerbaijani, to name a few, and are already used on YouTube for closed captions.
The upgraded model uses the standard encoder-decoder architecture. The encoder is the Conformer, a convolution-augmented transformer, whose key component is the Conformer block consisting of attention, feed-forward, and convolutional modules. The encoder takes the log-mel spectrogram of the speech signal as input, performs convolutional sub-sampling, and then applies a series of Conformer blocks and a projection layer to produce the final embeddings.
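To make the block structure concrete, here is a minimal PyTorch sketch of a Conformer block with attention, feed-forward, and convolution modules. The layer sizes and kernel width are illustrative placeholders rather than USM's actual configuration, and the convolutional sub-sampling and projection layer mentioned above are omitted.

```python
import torch
import torch.nn as nn


class ConvModule(nn.Module):
    """Conformer convolution module: pointwise conv -> GLU -> depthwise
    conv -> batch norm -> activation -> pointwise conv."""
    def __init__(self, dim: int, kernel_size: int = 15):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()
        self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):  # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)          # (batch, dim, time)
        y = self.glu(self.pointwise1(y))
        y = self.act(self.bn(self.depthwise(y)))
        y = self.pointwise2(y).transpose(1, 2)    # back to (batch, time, dim)
        return y


class ConformerBlock(nn.Module):
    """One Conformer block: half-step feed-forward, self-attention,
    convolution module, second half-step feed-forward, final layer norm."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        def feed_forward():
            return nn.Sequential(nn.LayerNorm(dim),
                                 nn.Linear(dim, 4 * dim), nn.SiLU(),
                                 nn.Linear(4 * dim, dim))
        self.ff1, self.ff2 = feed_forward(), feed_forward()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim)
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, time, dim) sub-sampled frame embeddings
        x = x + 0.5 * self.ff1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(x)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)


# Example: a batch of two utterances, 100 frames each, 256 features per frame.
frames = torch.randn(2, 100, 256)
out = ConformerBlock()(frames)   # -> shape (2, 100, 256)
```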
Training begins with self-supervised learning on speech recordings covering hundreds of languages. For this, the researchers use BEST-RQ (BERT-based Speech pre-Training with Random-projection Quantizer), which performs well on multilingual tasks when working with very large amounts of unsupervised audio data.
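The core idea behind BEST-RQ is to mask parts of the speech input and train the encoder to predict discrete labels produced by a frozen random-projection quantizer. The sketch below illustrates that idea only; the feature dimensions, masking scheme, and the stand-in encoder are assumptions for the example, not Google's implementation.

```python
# A rough sketch of BEST-RQ-style masked prediction with a frozen
# random-projection quantizer; sizes and modules are simplified placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RandomProjectionQuantizer(nn.Module):
    """Frozen random projection + random codebook: each speech frame is
    mapped to the index of its nearest codebook vector."""
    def __init__(self, feat_dim: int = 80, code_dim: int = 16,
                 codebook_size: int = 8192):
        super().__init__()
        self.register_buffer("proj", torch.randn(feat_dim, code_dim))
        self.register_buffer("codebook", torch.randn(codebook_size, code_dim))

    @torch.no_grad()
    def forward(self, feats):                    # feats: (batch, time, feat_dim)
        z = F.normalize(feats @ self.proj, dim=-1)        # project each frame
        codes = F.normalize(self.codebook, dim=-1)
        dists = torch.cdist(z, codes.expand(z.size(0), -1, -1))
        return dists.argmin(dim=-1)              # (batch, time) target labels


def best_rq_step(encoder, classifier, quantizer, feats, mask_prob=0.15):
    """One self-supervised step: mask random frames, run the encoder on the
    corrupted input, and predict the quantized labels of the masked frames."""
    labels = quantizer(feats)                          # targets from clean audio
    mask = torch.rand(feats.shape[:2]) < mask_prob     # (batch, time)
    corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)
    logits = classifier(encoder(corrupted))            # (batch, time, codebook)
    return F.cross_entropy(logits[mask], labels[mask])


# Illustrative usage with stand-in modules (USM uses a Conformer encoder).
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
classifier = nn.Linear(256, 8192)
loss = best_rq_step(encoder, classifier, RandomProjectionQuantizer(),
                    torch.randn(2, 100, 80))
loss.backward()
```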
In the second, optional step, the researchers use multi-objective supervised pre-training to incorporate knowledge from additional text data, which improves the model’s quality and language coverage. Whether this step is used depends on whether text data is available, although USM performs best with it.
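The article does not detail the individual objectives, so the snippet below only illustrates the general shape of multi-objective training: losses computed on speech batches and on text batches are weighted and summed into one scalar so that a single backward pass updates the shared encoder. The weighting and function name are hypothetical.

```python
# Illustrative only: combine a speech objective and a text objective into
# one training loss. The specific objectives and weights are assumptions.
import torch

def multi_objective_loss(speech_loss: torch.Tensor,
                         text_loss: torch.Tensor,
                         text_weight: float = 1.0) -> torch.Tensor:
    """Weighted sum of a speech objective (e.g. the masked-prediction loss
    above) and a text objective computed on the shared encoder."""
    return speech_loss + text_weight * text_loss

# Hypothetical usage: both losses depend on the same encoder parameters,
# so one backward pass updates it with respect to both objectives.
total = multi_objective_loss(torch.tensor(2.3, requires_grad=True),
                             torch.tensor(1.1, requires_grad=True))
total.backward()
```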
In the final step, the model is fine-tuned on the downstream tasks. Thanks to pre-training, it reaches good quality with only a minimal amount of task-specific supervised data.
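As a rough illustration of this final step, the sketch below pairs a stand-in pre-trained encoder with a simple CTC output head and runs one training step on a fake labelled batch. The vocabulary size, encoder, and data are placeholders, and USM's actual downstream heads are configured differently.

```python
# Hedged sketch of fine-tuning on a small labelled ASR set with a CTC head.
import torch
import torch.nn as nn

vocab_size = 64                                   # blank + characters (assumed)
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(),
                        nn.Linear(256, 256))      # stand-in for the pre-trained Conformer
head = nn.Linear(256, vocab_size)                 # new task-specific projection
ctc = nn.CTCLoss(blank=0)
optim = torch.optim.AdamW(list(encoder.parameters()) + list(head.parameters()),
                          lr=1e-4)

# One step on a fake labelled batch: 2 utterances, 100 frames of 80-dim
# features each, with short integer transcripts.
feats = torch.randn(2, 100, 80)
targets = torch.randint(1, vocab_size, (2, 20))
input_lengths = torch.full((2,), 100, dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)

log_probs = head(encoder(feats)).log_softmax(-1).transpose(0, 1)  # (time, batch, vocab)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
optim.step()
```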