In the ongoing story of artificial intelligence, 2024 looks set to be another turning point, much like the launch of ChatGPT. This new chapter, however, is framed through the prism of visual thinking rather than written in words. AI’s shift from linguistic ability to visual acuity heralds a transformative era in which machines’ integration of vision and cognition promises to fundamentally reshape our conception of intelligence.
From Bits To Gigabits: From Language To Vision
In 2023, AI’s emphasis on language reasoning, exemplified by the achievements of large language models, elevated people’s imaginations to unprecedented levels. But relying only on linguistic input naturally restricts AI’s understanding of the human condition and, more significantly, of the foundation of human knowledge.
Despite its great expressiveness, language makes up only a small portion of our cognitive range. The real world is experienced, sensed and interacted with; it is not merely described. If we want to fully utilise the potential of AI, we need to go beyond text and speech. Above all, inputs must be grounded in where they sit in space and time.
Visual World Models in Mind
Now let’s move on to the next generation of foundation models: visual world models. These systems do more than merely produce text or images. By modelling and extracting trillions of intricate patterns from visual data collected across time and space, they mimic the human ability to interpret sensory input.
In ways we can scarcely anticipate today, this next generation of AI will be able to autonomously analyse billions of photos on social media and draw conclusions from them, or uncover patterns in satellite data. These models also hold the promise of revolutionising machine learning and human-machine interaction, and of assisting humans in discovering new fundamental laws of nature.
Novel Information Finding
AI’s ability to integrate visual data expands its capabilities beyond their current limits and opens up new areas of knowledge. Much of the new knowledge humans produce is grounded in vision. Insights that take a researcher months or even years to reach could be found far faster by a machine able to interrogate data more effectively and efficiently.
The Undiscovered Data Treasure Trove
This breakthrough depends on AI’s capacity to access enormous repositories of “hidden data”: visual data gathered from the real world that has, until now, remained largely untapped. This data, which comes from sources as varied as YouTube, government agencies and insurance companies, is essential for pretraining the most capable models in existence. Using novel techniques for training and inference, AI can sort through this data, turning complex information into actionable insights.
Developing Human Potential
The ramifications of visually enhanced AI extend beyond machines that see to improving human vision and cognition. AI can handle data analysis, freeing people to concentrate on the more creative, strategic and ethical aspects of problem-solving. This symbiosis of human and machine intelligence makes possible an unparalleled level of creativity and exploration.
Implementation Difficulties
As we’ve seen to date, most LLMs need significant fine-tuning on task-relevant corporate data to reach entry-level associate competency. Although this is unlikely to remain true for newer models, world models will probably still need alignment to maximise their productive value to organisations and to generalise to previously unseen situations while preventing misuse.
This will require extensive backtesting, and perhaps limiting use cases to verified, well-defined jobs. As we saw with earlier computer vision models for detection, most organisations are unlikely to be able to operate their own models efficiently, as is the case with most early technology. It’s also critical to remember that models, like people, have blind spots.
Limitations And Ethics
Approaching this new period with caution, and iterating according to impact, is essential. When AI can perceive and understand our reality, the ethical and cultural ramifications will be enormous. This path calls for a cooperative strategy centred on exploration rather than upfront perfection, involving not only AI researchers but also ethicists, legislators and other stakeholders.