ChatGPT was introduced to the public just over two months ago by OpenAI, catapulting the AI-powered chatbot into the spotlight and prompting discussions about how the new language model will transform business, education, and other fields. Google and Chinese internet giant Baidu then unveiled their own chatbots to show the public that their so-called “generative AI” (technology that can generate conversational text, graphics, and more) was also ready for widespread use. Now Amazon has introduced a new language model in an effort to outperform GPT-3.5.
On the ScienceQA benchmark, Amazon's new language model outperforms GPT-3.5 (75.17% accuracy) by about 16 percentage points and also surpasses average human performance. ScienceQA is a large-scale benchmark of annotated multimodal science questions, comprising over 21,000 multimodal multiple-choice questions (MCQs). Recent technical advances enable large language models (LLMs) to perform well on tasks that demand complex reasoning. Chain-of-thought (CoT) prompting is the technique of generating intermediate reasoning steps that lead to the final answer. Most current CoT research, however, considers only the language modality; to carry out CoT reasoning over multiple modalities, researchers turn to the Multimodal-CoT paradigm, in which the input consists of several modalities, such as text and images.
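To make the idea of CoT prompting concrete, here is a minimal sketch of a CoT-style prompt for a multiple-choice science question. The question, options, and rationale are invented for illustration and are not drawn from ScienceQA.

```python
# Minimal sketch of a chain-of-thought (CoT) prompt for a multiple-choice
# science question. The example content below is illustrative only.
few_shot_example = (
    "Question: Which property matches this object: a rubber band?\n"
    "Options: (A) hard (B) stretchy\n"
    "Rationale: A rubber band can be pulled and returns to its shape, "
    "so it is stretchy rather than hard.\n"
    "Answer: (B)\n\n"
)

new_question = (
    "Question: Which property matches this object: a glass marble?\n"
    "Options: (A) hard (B) stretchy\n"
    "Rationale:"  # the model is asked to produce the reasoning steps first
)

prompt = few_shot_example + new_question
# `prompt` would be sent to an LLM; the generated rationale is followed by a
# final "Answer: (X)" line, which is parsed to obtain the prediction.
```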
How does it work?
Multimodal-CoT decomposes multi-step problems into intermediate reasoning steps that lead to the final answer, even when the inputs come from different modalities such as language and vision. One of the most common ways to implement Multimodal-CoT is to convert the inputs from all modalities into a single modality (for instance, by captioning images into text) before prompting an LLM to perform CoT. The disadvantage of this strategy is that a sizable amount of information is lost in the conversion. Alternatively, small language models can be fine-tuned to fuse language and vision features and perform CoT reasoning in multimodality; a sketch of this kind of fusion is given below. The main issue with this approach is that such models have a propensity to produce hallucinated reasoning patterns that significantly mislead the answer inference.
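The sketch below shows one plausible way a small fine-tuned model might fuse text-encoder states with image features. The dimensions, gating scheme, and module names are illustrative assumptions, not the exact architecture used by Amazon's Multimodal-CoT.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Toy sketch of fusing text-encoder states with image features."""
    def __init__(self, d_model=768, d_vision=1024):
        super().__init__()
        self.vision_proj = nn.Linear(d_vision, d_model)  # map image features into the text space
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, text_states, vision_feats):
        # text_states:  (batch, text_len, d_model)    from a text encoder
        # vision_feats: (batch, num_patches, d_vision) from a frozen vision model
        v = self.vision_proj(vision_feats)
        # Let each text token attend over the image patches.
        attended, _ = self.cross_attn(query=text_states, key=v, value=v)
        # A gate controls how much visual information each token absorbs.
        g = torch.sigmoid(self.gate(torch.cat([text_states, attended], dim=-1)))
        return (1 - g) * text_states + g * attended

fused = GatedFusion()(torch.randn(2, 16, 768), torch.randn(2, 49, 1024))
print(fused.shape)  # torch.Size([2, 16, 768])
```

The fused states would then be fed to a decoder that generates the rationale or the answer, which is where hallucinated rationales can creep in if the visual signal is weak or poorly aligned.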
To lessen the effect of these hallucinations, Amazon researchers created Multimodal-CoT, which incorporates vision features in a decoupled training framework that separates rationale generation from answer inference. It is the first work of its kind to study CoT reasoning across different modalities. The researchers report that the technique surpasses GPT-3.5 by 16 percentage points in accuracy on the ScienceQA benchmark and also exceeds human performance, demonstrating state-of-the-art results.
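The sketch below outlines how such a two-stage pipeline could be wired together: a first model generates a rationale from the text and vision features, and a second model appends that rationale to the input before inferring the answer. The model interfaces (`rationale_model.generate`, `answer_model.generate`) are hypothetical stand-ins for two fine-tuned multimodal sequence-to-sequence models, not Amazon's actual code.

```python
# Illustrative sketch of a two-stage Multimodal-CoT pipeline, under the
# assumption that both models accept text plus precomputed vision features.
def multimodal_cot(question: str, context: str, options: str,
                   image_features, rationale_model, answer_model) -> str:
    # Stage 1: rationale generation. The model reads the textual input
    # together with the vision features and produces intermediate reasoning.
    stage1_input = f"Question: {question}\nContext: {context}\nOptions: {options}"
    rationale = rationale_model.generate(text=stage1_input, vision=image_features)

    # Stage 2: answer inference. The generated rationale is appended to the
    # original input, and a second model (again conditioned on the vision
    # features) predicts the final answer.
    stage2_input = f"{stage1_input}\nRationale: {rationale}"
    return answer_model.generate(text=stage2_input, vision=image_features)
```

Separating the two stages in this way means the answer model always conditions on an explicit rationale grounded in the image, which is the mechanism the researchers credit for reducing hallucinated reasoning.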