Kosmos-1, a multimodal large language model (MLLM) from Microsoft, can handle both linguistic and visual data. It can be used for a variety of tasks, including image captioning, visual question answering, and more.
ChatGPT has popularized LLMs such as the GPT models and their ability to transform a text prompt or input into a generated output.
In a paper titled “Language Is Not All You Need: Aligning Perception with Language Models,” Microsoft’s AI researchers argue that while users are impressed by these conversational abilities, LLMs still have trouble handling multimodal inputs such as images and audio prompts. The study suggests that multimodal perception, or knowledge acquisition and “grounding” in the real world, is required to advance from ChatGPT-like abilities to artificial general intelligence (AGI).
Last year, Alphabet’s robotics subsidiary Everyday Robots and Google’s Brain Team demonstrated the value of grounding by using LLMs to get robots to follow human descriptions of physical tasks. The approach involved grounding the language model in tasks that were feasible in a specific real-world environment. Microsoft likewise used grounding in its Prometheus AI model, which integrates OpenAI’s GPT models with real-world feedback from Bing’s search ranking and search results.
According to Microsoft, the Kosmos-1 MLLM can perceive general modalities, follow instructions (zero-shot learning), and learn in context (few-shot learning). The goal of the work, according to the paper, is to “align perception with LLMs, so that the models are able to see and talk.”
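To make those terms concrete, the following minimal sketch (a hypothetical illustration, not Microsoft's released API) shows how zero-shot and few-shot prompts might be laid out for a model that accepts images interleaved with text; the `ImageSegment` type and helper functions are names assumed here for clarity.

```python
# Minimal sketch (hypothetical, not Microsoft's released API) of zero-shot vs.
# few-shot prompts for a model that accepts interleaved image and text input.
# Image content is represented by file paths; a real MLLM would replace these
# with encoded image embeddings spliced between the surrounding text tokens.
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class ImageSegment:
    path: str  # stand-in for pixel data or an image embedding

Prompt = List[Union[str, ImageSegment]]

def zero_shot_caption_prompt(image_path: str) -> Prompt:
    """Instruction only: the model has seen no worked examples of the task."""
    return [ImageSegment(image_path), "Describe this image in one sentence:"]

def few_shot_vqa_prompt(examples: List[Tuple[str, str, str]],
                        image_path: str, question: str) -> Prompt:
    """In-context learning: solved (image, question, answer) triples precede
    the query so the model can pick up the task format without any training."""
    prompt: Prompt = []
    for ex_img, ex_q, ex_a in examples:
        prompt += [ImageSegment(ex_img), f"Question: {ex_q} Answer: {ex_a}"]
    prompt += [ImageSegment(image_path), f"Question: {question} Answer:"]
    return prompt

if __name__ == "__main__":
    print(zero_shot_caption_prompt("photo.jpg"))
    print(few_shot_vqa_prompt([("cat.jpg", "What animal is this?", "A cat.")],
                              "dog.jpg", "What animal is this?"))
```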
The paper's illustrations demonstrate how MLLMs like Kosmos-1 could automate tasks in various scenarios. They could, for instance, explain to a Windows 10 user how to restart their computer (or carry out any other task presented as a visual prompt), read a web page to answer questions about it, interpret health data from a device, caption photographs, and so forth. The model, however, cannot analyze video.
The researchers also tested Kosmos-1's performance on the Raven IQ test. The findings show a “significant performance difference between the current model and the average level of adults.” Nevertheless, the model's accuracy suggested that MLLMs might be able to “perceive abstract conceptual patterns in a nonverbal context” by aligning perception with language models.
Given Microsoft's desire to use Transformer-based language models to make Bing a more formidable rival to Google Search, the work on “web page question answering” looks particularly intriguing.
Conclusion
In this study, the researchers introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, they train Kosmos-1 from scratch on large-scale multimodal corpora from the web, including arbitrarily interleaved text and images, image-caption pairs, and text data.
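As a rough illustration of how those three corpus types could be unified to train a single decoder-only model, the sketch below flattens each into one stream with image placeholders wrapped in boundary markers; aside from the `<image>`/`</image>` delimiters, which echo the convention the paper describes for embedded images, the helper names and data references are assumptions made here.

```python
# Illustrative sketch of how the three corpus types above could be flattened
# into a single token stream for one decoder-only model. The <image>...</image>
# boundary markers echo the convention the paper describes for embedded
# images; the helper names and "emb:" references are assumptions made here.
IMG_START, IMG_END = "<image>", "</image>"

def text_only(doc: str) -> str:
    # Plain text data passes through unchanged.
    return doc

def image_caption(image_ref: str, caption: str) -> str:
    # Image-caption pair: the image placeholder precedes its caption.
    return f"{IMG_START}{image_ref}{IMG_END} {caption}"

def interleaved(segments: list) -> str:
    # Web document with images embedded at their original positions in the text.
    parts = []
    for kind, value in segments:
        parts.append(f"{IMG_START}{value}{IMG_END}" if kind == "image" else value)
    return " ".join(parts)

if __name__ == "__main__":
    print(image_caption("emb:0012", "A dog catching a frisbee."))
    print(interleaved([("text", "Pancake recipe:"),
                       ("image", "emb:0487"),
                       ("text", "Mix flour, eggs, and milk.")]))
```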
Without any gradient updates or fine-tuning, the researchers evaluate several settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a variety of tasks (a minimal sketch of these settings follows the list below). Experimental results show that Kosmos-1 performs impressively on
(i) language understanding, generation, and even OCR-free NLP (directly fed with document images),
(ii) perception-language tasks such as multimodal dialogue, image captioning, and visual question answering, and
(iii) vision tasks such as image recognition with descriptions (specifying classification via text instructions).
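Here is the sketch of those evaluation settings referenced above. It is written under stated assumptions: `generate` is a stand-in for a frozen MLLM's completion call, the `<image:...>` tags are placeholders for embedded images, and the two-stage `multimodal_cot` helper reflects the general rationale-then-answer pattern of multimodal chain-of-thought prompting rather than the paper's exact procedure.

```python
# Sketch of the three evaluation settings mentioned above, all run against a
# frozen model with no gradient updates. `generate` is a placeholder for an
# MLLM's text-completion call, and "<image:...>" tags stand in for embedded
# images; both are assumptions made for illustration, not the paper's code.
from typing import Callable, List, Tuple

def zero_shot(generate: Callable[[str], str], image_tag: str, question: str) -> str:
    # Single query: the image followed by the instruction, no examples.
    return generate(f"{image_tag} {question}")

def few_shot(generate: Callable[[str], str],
             demos: List[Tuple[str, str, str]],
             image_tag: str, question: str) -> str:
    # In-context learning: demonstrations are concatenated before the query.
    context = " ".join(f"{img} {q} {a}" for img, q, a in demos)
    return generate(f"{context} {image_tag} {question}")

def multimodal_cot(generate: Callable[[str], str], image_tag: str, question: str) -> str:
    # Stage 1: ask the frozen model to reason about the image first.
    rationale = generate(f"{image_tag} Let's think step by step:")
    # Stage 2: condition the final answer on the generated rationale.
    return generate(f"{image_tag} {rationale} Therefore, {question}")

if __name__ == "__main__":
    fake_model = lambda prompt: f"<completion of: {prompt[:40]}...>"
    print(multimodal_cot(fake_model, "<image:chart.png>", "what is the overall trend?"))
```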
The researchers also demonstrate that MLLMs can benefit from cross-modal transfer, i.e., the transfer of knowledge from one modality to another. Finally, they provide a dataset for the Raven IQ test, which gauges how effectively MLLMs can perform nonverbal, abstract reasoning.