Time has sped up tremendously. In December, Prof. Jordan Peterson gave a remarkable interview to Lord Conrad Black in which he touched on many topics. Among them, prompted by the example of the then newly released ChatGPT, was the question of the “dark side of AI”.
Peterson warned about the following.
“Be prepared for things to come up on the AI front over the next year that will make your hair stand on end.
Now there is already an AI capable of building its own picture of the world solely from the analysis of a colossal corpus of texts. And this AI is already smarter than many of us. But within a year it will become incomparably more intelligent than most of us, for it will build its picture of the world from trillions of patterns extracted not only from people’s texts, but also from the world itself (its visual and other images). The knowledge at the heart of its vision of the world will come not only from the linguistic statistics of the texts that describe this world (as it does for ChatGPT today), but also from the statistics of the patterns of formation and dynamics of the interactions of objects in this world.
So keep your hats on, ladies and gentlemen. As Jonathan Pageau said, “the giants will come to Earth again, and we may live to see it.”
Less than three months later, Prof. Peterson’s warning began to come true.
A group of artificial intelligence researchers from Google and the Technical University of Berlin presented the first step towards what Peterson was talking about:
PaLM-E is a 562-billion-parameter multimodal visual-language model (VLM) that combines vision and language to control robots.
Given the command “bring me some rice chips from the kitchen drawer”, PaLM-E can generate an action plan for a mobile robotic platform with a mechanical arm (developed by Google Robotics) and execute the entire sequence of generated actions.
PaLM-E does this by analyzing data from the robot’s camera, without requiring a pre-processed representation of the scene. This eliminates the need for human pre-processing or annotation and allows the robot to operate autonomously.
PaLM-E is a next-token predictor. It is so named because it is based on Google’s large language model (LLM) PaLM, similar to the technology behind ChatGPT.
But Google has made PaLM “embodied” (the “E” in the name) by adding sensory information and robotic control.
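To make “next-token predictor” concrete, here is a minimal greedy-decoding sketch in PyTorch. Everything in it is a placeholder for illustration (the DummyLM class, the vocabulary size, the eos_id), not PaLM’s actual API; the point is only the loop: the model scores every possible next token given everything seen so far, the most likely one is appended, and the process repeats.

```python
import torch

class DummyLM(torch.nn.Module):
    """Tiny stand-in model that returns random logits; only the control flow matters here."""
    def __init__(self, vocab_size: int = 100):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len) token ids -> (batch, seq_len, vocab_size) scores
        return torch.rand(ids.shape[0], ids.shape[1], self.vocab_size)

def greedy_decode(model, input_ids: torch.Tensor,
                  max_new_tokens: int = 32, eos_id: int = 2) -> torch.Tensor:
    """Repeatedly score all possible next tokens and append the most likely one."""
    ids = input_ids
    for _ in range(max_new_tokens):
        logits = model(ids)                                       # score the whole vocabulary
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # pick the best continuation
        ids = torch.cat([ids, next_id], dim=1)                    # append and repeat
        if next_id.item() == eos_id:                              # stop at end-of-sequence
            break
    return ids

generated = greedy_decode(DummyLM(), torch.tensor([[5, 17, 42]]))
print(generated.shape)  # (1, 3 + up to 32) token ids
```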
Since it is built on a language model, PaLM-E continuously takes in observations such as images or sensor readings and encodes them into a sequence of vectors of the same dimensionality as the language token embeddings. This allows the model to “understand” sensory information in the same way it processes language.
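This encoding step can be sketched roughly as follows. The module names, dimensions, and the stand-in image encoder below are my own assumptions for illustration, not Google’s implementation: camera observations are turned into feature vectors, projected into the same embedding space as the language tokens, and placed alongside the tokenized text prompt before the whole sequence goes into the next-token predictor.

```python
import torch
import torch.nn as nn

D_MODEL = 1024        # assumed embedding size of the language model
N_IMG_TOKENS = 16     # assumed number of vectors produced per camera frame
VOCAB_SIZE = 32000    # assumed size of the text vocabulary

text_embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)    # token id -> vector
image_encoder = nn.Sequential(                        # stand-in for a vision encoder: one frame -> N_IMG_TOKENS vectors
    nn.Flatten(),
    nn.Linear(3 * 224 * 224, N_IMG_TOKENS * D_MODEL),
)
project_to_lm = nn.Linear(D_MODEL, D_MODEL)           # projection into the LM's token embedding space

def build_multimodal_sequence(prompt_ids: torch.Tensor, frame: torch.Tensor) -> torch.Tensor:
    """Combine projected image vectors with text token embeddings in one sequence."""
    img_feats = image_encoder(frame).view(-1, N_IMG_TOKENS, D_MODEL)
    img_tokens = project_to_lm(img_feats)              # now the same size as a language token embedding
    txt_tokens = text_embedding(prompt_ids)            # (batch, prompt_len, D_MODEL)
    # the camera frame is treated like a run of extra "words" placed before the instruction
    return torch.cat([img_tokens, txt_tokens], dim=1)

prompt_ids = torch.randint(0, VOCAB_SIZE, (1, 12))     # dummy tokenized instruction
frame = torch.rand(1, 3, 224, 224)                     # dummy camera frame
sequence = build_multimodal_sequence(prompt_ids, frame)
print(sequence.shape)                                  # torch.Size([1, 28, 1024]) -- 16 image + 12 text tokens
```

A decoder-only transformer then consumes this mixed sequence and, as in the earlier sketch, predicts text tokens one at a time; in PaLM-E’s case those tokens spell out the steps of the robot’s plan.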
The new model demonstrates interesting and unexpected abilities.
For example, it exhibits “positive transfer”: it can transfer the knowledge and skills it has learned from one task to another, resulting in significantly higher performance than single-task robot models.
In addition, the model demonstrates multimodal chains of reasoning (it can analyze a sequence of inputs that includes both linguistic and visual information) and multi-image inference (it can use several images as input to draw a conclusion or make a prediction), even though it was trained only on single-image prompts.
Peterson was right. Keep your hats on, ladies and gentlemen. For the giants are already approaching.