As a step toward supporting 1,000 languages, Google researchers have just released an upgrade to their Universal Speech Model (USM). According to the researchers, the model outperforms OpenAI's Whisper across the board for automatic speech recognition.
Researchers can request access to the USM API here.
According to the paper, “Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages,” the model’s encoder is pre-trained on a huge, unlabeled multilingual dataset and then fine-tuned on a much smaller amount of labelled data, which allows it to recognize under-represented languages. The training procedure also adapts effectively to new data and languages.
The researchers demonstrated the pre-trained encoder’s effectiveness by fine-tuning it on multilingual speech data from YouTube Captions. Despite the limited supervised data available from YouTube, the model achieves a record-low average word error rate (WER) of below 30% across all 73 languages. Compared with Whisper (large-v2), which was trained on more than 400k hours of labelled data, USM achieves, on average, a 32.7% relatively lower WER on the 18 languages that Whisper can decode with a WER under 40%. Overall, USM outperforms Whisper across the board for automatic speech recognition.
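For readers unfamiliar with the metric, the snippet below sketches how WER and a relative WER reduction are computed. The transcripts and numbers in it are purely illustrative examples, not results from the paper.

```python
# A minimal sketch of word error rate (WER) and relative WER reduction.
# The example strings and percentages are illustrative only.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative improvement of one system's WER over a baseline's."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat mat")
    print(f"WER: {wer:.2%}")
    # e.g. dropping from 45% to 30.3% WER is a ~32.7% relative reduction
    print(f"relative reduction: {relative_wer_reduction(0.45, 0.303):.1%}")
```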
The 1,000 Languages Initiative was introduced last November with the goal of building a machine learning model that supports the 1,000 most spoken languages in the world for greater inclusivity on a global scale. The main challenge is supporting languages with few speakers or little available data, since some of these languages are spoken by fewer than twenty million people.
The USM is a family of speech models with two billion parameters, trained on a massive dataset of 12 million hours of audio and 28 billion sentences of text spanning more than 300 languages. The models can automatically recognize speech in low-resource languages such as Amharic, Cebuano, Assamese, and Azerbaijani, to name a few, and are already used on YouTube for closed captions.
The upgraded model uses the standard encoder-decoder architecture. The encoder is the Conformer, a convolution-augmented transformer, whose key component is the Conformer block consisting of attention, feed-forward, and convolutional modules. The encoder takes the log-mel spectrogram of the speech signal as input, performs convolutional sub-sampling, and then applies a series of Conformer blocks and a projection layer to produce the final embeddings.
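To make the block structure concrete, here is a minimal PyTorch sketch of a Conformer block with attention, feed-forward, and convolution modules. The layer sizes and kernel width are illustrative placeholders rather than USM's actual configuration, and the convolutional sub-sampling and projection layer mentioned above are omitted.

```python
import torch
import torch.nn as nn


class ConvModule(nn.Module):
    """Conformer convolution module: pointwise conv -> GLU -> depthwise
    conv -> batch norm -> activation -> pointwise conv."""
    def __init__(self, dim: int, kernel_size: int = 15):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()
        self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):  # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)          # (batch, dim, time)
        y = self.glu(self.pointwise1(y))
        y = self.act(self.bn(self.depthwise(y)))
        y = self.pointwise2(y).transpose(1, 2)    # back to (batch, time, dim)
        return y


class ConformerBlock(nn.Module):
    """One Conformer block: half-step feed-forward, self-attention,
    convolution module, second half-step feed-forward, final layer norm."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        def feed_forward():
            return nn.Sequential(nn.LayerNorm(dim),
                                 nn.Linear(dim, 4 * dim), nn.SiLU(),
                                 nn.Linear(4 * dim, dim))
        self.ff1, self.ff2 = feed_forward(), feed_forward()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim)
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, time, dim) sub-sampled frame embeddings
        x = x + 0.5 * self.ff1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(x)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)


# Example: a batch of two utterances, 100 frames each, 256 features per frame.
frames = torch.randn(2, 100, 256)
out = ConformerBlock()(frames)   # -> shape (2, 100, 256)
```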
Training begins with self-supervised learning on speech recordings covering hundreds of languages. For this, the researchers use BEST-RQ (BERT-based Speech pre-Training with Random-projection Quantizer), which performs well on multilingual tasks when working with very large amounts of unsupervised audio data.
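The core idea behind BEST-RQ is to mask parts of the speech input and train the encoder to predict discrete labels produced by a frozen random-projection quantizer. The sketch below illustrates that idea only; the feature dimensions, masking scheme, and the stand-in encoder are assumptions for the example, not Google's implementation.

```python
# A rough sketch of BEST-RQ-style masked prediction with a frozen
# random-projection quantizer; sizes and modules are simplified placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RandomProjectionQuantizer(nn.Module):
    """Frozen random projection + random codebook: each speech frame is
    mapped to the index of its nearest codebook vector."""
    def __init__(self, feat_dim: int = 80, code_dim: int = 16,
                 codebook_size: int = 8192):
        super().__init__()
        self.register_buffer("proj", torch.randn(feat_dim, code_dim))
        self.register_buffer("codebook", torch.randn(codebook_size, code_dim))

    @torch.no_grad()
    def forward(self, feats):                    # feats: (batch, time, feat_dim)
        z = F.normalize(feats @ self.proj, dim=-1)        # project each frame
        codes = F.normalize(self.codebook, dim=-1)
        dists = torch.cdist(z, codes.expand(z.size(0), -1, -1))
        return dists.argmin(dim=-1)              # (batch, time) target labels


def best_rq_step(encoder, classifier, quantizer, feats, mask_prob=0.15):
    """One self-supervised step: mask random frames, run the encoder on the
    corrupted input, and predict the quantized labels of the masked frames."""
    labels = quantizer(feats)                          # targets from clean audio
    mask = torch.rand(feats.shape[:2]) < mask_prob     # (batch, time)
    corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)
    logits = classifier(encoder(corrupted))            # (batch, time, codebook)
    return F.cross_entropy(logits[mask], labels[mask])


# Illustrative usage with stand-in modules (USM uses a Conformer encoder).
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
classifier = nn.Linear(256, 8192)
loss = best_rq_step(encoder, classifier, RandomProjectionQuantizer(),
                    torch.randn(2, 100, 80))
loss.backward()
```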
In the second, optional step, the researchers use multi-objective supervised pre-training to incorporate knowledge from additional text data, which improves the model’s quality and language coverage. Whether this step is used depends on whether text data is available, although USM performs best with it.
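The article does not detail the individual objectives, so the snippet below only illustrates the general shape of multi-objective training: losses computed on speech batches and on text batches are weighted and summed into one scalar so that a single backward pass updates the shared encoder. The weighting and function name are hypothetical.

```python
# Illustrative only: combine a speech objective and a text objective into
# one training loss. The specific objectives and weights are assumptions.
import torch

def multi_objective_loss(speech_loss: torch.Tensor,
                         text_loss: torch.Tensor,
                         text_weight: float = 1.0) -> torch.Tensor:
    """Weighted sum of a speech objective (e.g. the masked-prediction loss
    above) and a text objective computed on the shared encoder."""
    return speech_loss + text_weight * text_loss

# Hypothetical usage: both losses depend on the same encoder parameters,
# so one backward pass updates it with respect to both objectives.
total = multi_objective_loss(torch.tensor(2.3, requires_grad=True),
                             torch.tensor(1.1, requires_grad=True))
total.backward()
```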
In the final step, the model is fine-tuned on the downstream tasks. Thanks to pre-training, it reaches good quality with only a minimal amount of task-specific supervised data.
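As a rough illustration of this final step, the sketch below pairs a stand-in pre-trained encoder with a simple CTC output head and runs one training step on a fake labelled batch. The vocabulary size, encoder, and data are placeholders, and USM's actual downstream heads are configured differently.

```python
# Hedged sketch of fine-tuning on a small labelled ASR set with a CTC head.
import torch
import torch.nn as nn

vocab_size = 64                                   # blank + characters (assumed)
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(),
                        nn.Linear(256, 256))      # stand-in for the pre-trained Conformer
head = nn.Linear(256, vocab_size)                 # new task-specific projection
ctc = nn.CTCLoss(blank=0)
optim = torch.optim.AdamW(list(encoder.parameters()) + list(head.parameters()),
                          lr=1e-4)

# One step on a fake labelled batch: 2 utterances, 100 frames of 80-dim
# features each, with short integer transcripts.
feats = torch.randn(2, 100, 80)
targets = torch.randint(1, vocab_size, (2, 20))
input_lengths = torch.full((2,), 100, dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)

log_probs = head(encoder(feats)).log_softmax(-1).transpose(0, 1)  # (time, batch, vocab)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
optim.step()
```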