- Predicting the Next Token Is Understanding:
Large Language Models (LLMs) like GPT-4 don't just learn language; they learn about the world. By predicting the next token across vast amounts of text, these models gradually build a 'world model'.
This means they grasp not only the structure of language but also the complex web of human knowledge, behavior, and societal norms. Essentially, they learn how the world works, one word at a time.
Now, imagine you're deeply immersed in a detective novel, rich with clues, complex characters, and plot twists. The story builds to a climax where the detective declares, "Now I'm going to reveal the name of the murderer, and it is ___."
If an AI model can correctly predict the word that fills this blank, it is showing an understanding that extends far beyond language structure. To complete the sentence accurately, the AI must understand the entire novel: every plot twist, character arc, and subtle hint.
This analogy illustrates how predicting the next word in a sequence both requires and reflects a deep understanding of context.
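To make the mechanism concrete, here is a minimal sketch of next-token prediction, assuming PyTorch and the Hugging Face transformers library, with the small public GPT-2 checkpoint standing in for a much larger model. A causal language model outputs a probability distribution over its whole vocabulary at each position; training simply pushes those distributions toward the tokens that actually come next.

```python
# A minimal sketch of next-token prediction (assumes `torch` and
# `transformers` are installed; "gpt2" is a small public checkpoint
# standing in for a far more capable model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Now I'm going to reveal the name of the murderer, and it is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# The distribution over the *next* token lives at the last position.
probs = torch.softmax(logits[0, -1], dim=-1)

# Show the model's top guesses for the blank.
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r}  p={p.item():.3f}")
```

Filling the blank well in the detective scenario means putting high probability on exactly one name, which is only possible if the model has, in some operational sense, tracked the whole story.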
- Enhancing AI's Perception with Visual Data:
By some estimates, roughly half of the human cortex is involved in visual processing, highlighting the importance of visual information in human understanding. Similarly, when AI incorporates visual data, its comprehension undergoes a transformative shift.
A compelling example is how LLMs, despite never having 'seen' a single photon, gradually develop an understanding of color. That knowledge is indirectly 'leaked' into the model through the vast textual data it processes.
This process mirrors how human understanding of concepts like color can be shaped through descriptions, even without direct visual experience.
But by integrating visual data directly, AI's learning can be dramatically accelerated, much like adding a powerful new sense to its existing capabilities; one common way to ground language in vision is sketched below.
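As an illustration of such grounding, here is a minimal sketch using CLIP, a model trained to place images and their captions near each other in a shared embedding space, via the transformers library and its public openai/clip-vit-base-patch32 checkpoint; the image path is a hypothetical placeholder. With this kind of training, a concept like color is learned from pixels directly instead of leaking in through text.

```python
# A minimal sketch of image-text grounding with CLIP (assumes `torch`,
# `transformers`, and `Pillow`; "bird.jpg" is a hypothetical local file).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

image = Image.open("bird.jpg")
captions = ["a red bird", "a blue bird", "a yellow bird"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each caption;
# softmax turns it into a distribution over the candidate descriptions.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0]):
    print(f"{caption}: {p.item():.3f}")
```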
- Videos: The Next Frontier:
Incorporating videos into AI's learning process adds the dimensions of time and motion. Videos help AI understand how objects and entities interact and change over time, following the physical laws of our universe.
It's the difference between a static picture of a bird and a video showing the fluid motion of its flight. By learning from videos, AI not only recognizes but understands the dynamics of the world around us, completing its transition from static observer to dynamic participant; one common modeling recipe is sketched below.
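One widely used recipe, sketched here purely as an illustration, is to embed sampled frames with an image encoder and then let a sequence model attend over them in time; random tensors stand in for real frame features, and none of the names below refer to a specific system's API.

```python
# A minimal sketch of temporal modeling over video frames (assumes
# `torch`; random tensors stand in for per-frame embeddings that a real
# pipeline would get from an image encoder such as CLIP's vision tower).
import torch
import torch.nn as nn

num_frames, dim = 16, 512  # 16 sampled frames, 512-dim frame embeddings

# Stand-in for per-frame features: (batch, time, dim).
frame_embeddings = torch.randn(1, num_frames, dim)

# A temporal transformer relates frames across time, capturing motion
# and interaction rather than treating each frame as an isolated image.
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
temporal_encoder = nn.TransformerEncoder(layer, num_layers=2)

video_tokens = temporal_encoder(frame_embeddings)  # (1, num_frames, dim)
video_embedding = video_tokens.mean(dim=1)         # pool over time -> (1, dim)
print(video_embedding.shape)
```

The point of the pooled vector is that it encodes change, not just appearance: the same bird in the same pose yields different representations depending on how it moves.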
In this way, AI's progression from text to imagery, and eventually to videos, reflects a deepening and broadening of its understanding, paralleling the multi-faceted way humans perceive and interact with the world around us.