FROM THE DUMPSTER FIRE

One information addict's rescue work for the internet age

Whither AGI?

Stanford’s Fei-Fei Li argues that AGI is impossible with text-only LLMs, making the case that spatial intelligence—understanding and reasoning about the 3D world—is fundamental to true AI. Her evolutionary perspective on why vision predates and underlies intelligence is a compelling counterpoint to the text-obsessed mainstream of current AI.

I started watching this video of Fei-Fei Li because I saw a snippet of one of her interviews on Twitter.

It’s an interesting video. A couple of things stood out to me:

AGI (whatever that means) is impossible with only text-based LLMs

I have a habit, as a computer-vision scientist, of looking to evolution and brain science for inspiration. When I search for the next north-star problem, I ask what evolution and brain development have already solved. Human language emerged relatively recently—on the order of hundreds of thousands of years—and humans are the only species with language as a full tool for communication, reasoning, and abstraction. Vision is different. The ability to understand, navigate, and act in a 3D world traces back roughly 540 million years to the first trilobites, and it sparked an evolutionary arms race in intelligence. That’s why solving spatial intelligence—understanding, generating, and reasoning about the 3D world—is a fundamental problem for AI. AGI won’t be complete without it. Getting there requires world models that go beyond flat pixels and beyond language to capture the true 3D structure of reality.

Why vision is harder than language

Language is fundamentally one-dimensional: syllables come in sequence, which is why sequence-to-sequence modeling is such a natural fit. What people often miss is that language is purely generative. There is no “language” in nature—you can’t touch it or see it. It emerges from the human mind as a constructed signal.
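
To make the one-dimensional point concrete (my gloss, not from the talk): the standard autoregressive formulation treats a sentence as nothing more than a token sequence x_1, …, x_T and factors its probability into a chain of next-token predictions,

    p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})

Everything the model needs lives along a single axis, which is the structure sequence-to-sequence models were built to exploit.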

The world, by contrast, is far more complex. It is fundamentally 3D—and if you add time, 4D. Visual perception is also a projection: eyes or cameras collapse 3D into 2D, which makes it a mathematically ill-posed problem. That’s why biological vision relies on multiple sensors.
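
To see why that projection makes single-image 3D understanding ill-posed, here is a minimal sketch (mine, not from the talk) using the textbook pinhole camera model: every 3D point along the same ray through the camera center lands on the same 2D pixel, so the image alone cannot tell you how far away anything is.

    import numpy as np

    def project(point_3d, focal_length=1.0):
        """Pinhole projection: (X, Y, Z) -> (f*X/Z, f*Y/Z). Depth Z is divided out."""
        x, y, z = point_3d
        return np.array([focal_length * x / z, focal_length * y / z])

    near = np.array([1.0, 2.0, 4.0])   # a point 4 units in front of the camera
    far = near * 10.0                  # same viewing ray, 10x farther away
    print(project(near))               # [0.25 0.5 ]
    print(project(far))                # [0.25 0.5 ]  -- identical pixel; depth is gone

Inverting that map means picking one 3D explanation out of infinitely many that are consistent with the pixels, which is why stereo, motion, and learned priors all get pulled in.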

Moreover, the world is not purely generative. You can generate virtual 3D worlds, but they must obey physics. At the same time, there is a real world that demands reconstruction and interaction. AI has to move fluidly between generation and reconstruction—across domains from gaming and the metaverse to robotics. All of this sits on the continuum of world modeling and spatial intelligence.

Unlike language, where abundant internet data exists, spatial intelligence lacks easily accessible datasets—it mostly lives in our heads. That makes the problem harder, but also more exciting. If it were easy, someone else would have solved it already. My career has always been about pursuing problems so difficult they border on delusional. Spatial intelligence is exactly that problem.

This reminds me of something Yann LeCun said:

My prediction is that autoregressive LLMs are doomed. A few years from now, nobody in their right mind will use them. They sometimes produce nonsense because of this autoregressive prediction. We’re missing something really big—a new concept of how to build AI systems.

We’re never going to get to human-level AI by just training large language models on bigger datasets. It’s not going to happen. We can’t even reproduce what a cat can do. A cat has an amazing understanding of the physical world, can plan complex actions, and has causal models of consequences. Humans are even more remarkable: a ten-year-old can load a dishwasher the first time they’re asked, and a teenager can learn to drive in 20 hours—because they have mental models of the world.

Meanwhile, autonomous driving companies have hundreds of thousands of hours of training data and still no Level 5 self-driving cars. AI systems can pass bar exams and prove theorems, but where is my domestic robot? The physical world is much more complicated than language. That’s Moravec’s paradox: tasks hard for humans, like solving equations or playing chess, are easy for machines, while tasks easy for humans, like perception and motor control, remain incredibly hard.

We’re not going to get to human-level AI by just training on text.
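
LeCun’s line about LLMs producing nonsense “because of this autoregressive prediction” is usually unpacked as an error-compounding argument: if each generated token has some independent chance of going off the rails, the probability that a long answer stays on track shrinks exponentially with its length. A back-of-envelope sketch, using made-up numbers and an admittedly simplistic independence assumption:

    # Toy illustration of the error-compounding worry (hypothetical numbers, not LeCun's).
    # Assume each token is generated correctly with independent probability 1 - e.
    e = 0.01                     # assumed per-token error rate
    n = 500                      # assumed answer length in tokens
    p_on_track = (1 - e) ** n
    print(f"P(all {n} tokens correct) = {p_on_track:.3f}")   # ~ 0.007

Real models don’t have independent per-token errors, so this overstates the effect, but it shows why “sometimes produces nonsense” gets framed as a structural property of autoregressive generation rather than a bug that more data will train away.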

Read the original on Stanford HAI →