
When we read or engage in dialog, we don’t just memorize and index words; we develop “mental models.” These are rich structures that our minds use to model, or represent, how we think the world works. We build mental images of the entities and agents involved, their properties, connections, and relationships, and the causal mechanisms at work, and these models allow us to reason, predict, and make sense of what we hear, read, or experience.

Our mental models are connected to language and can be communicated through language, but they are ultimately more fundamental to our understanding than the language we use to communicate them. We can describe the same mental model in many ways and even in many languages. The mental model is the primary representation — the model that in effect causes the language to occur and provides the basis for it to be understood.

It shouldn’t be surprising, then, that computers, which focus strictly on words and their occurrences, struggle with mental models. When we read or listen, we humans draw from our mental models to fill in so much that the language misses.

Let’s take a closer look at what’s in our own mental models, even for a simple kindergarten story:

  • Enzo and Mia were running a race. Enzo fell. He hurt his knee.
  • Mia looked back. She was almost at the finish line. She wanted to win. If she kept running, she would win.
  • Enzo was her friend. Mia stopped. She ran back to Enzo. She helped him up.

[adapted from ReadWorks.org]

On the surface, this language seems very simple. But if you reflect on your own mental model, you’ll start to see just how deep it goes.

We humans pack a pretty sophisticated model of the world into our heads — one that captures all manner of mechanisms, relationships, and mental sketches.

Start with the spatial model. How is the scene set when Mia considers whether to keep running? You can see it in your mind’s eye: Enzo is on the ground, maybe holding his knee. Mia is ahead, between Enzo and the finish line, her head pivoted around to look at him. Even for the details that aren’t entirely clear, you still have a general sense. For instance, the whole scene is probably outside, maybe on a track or in a field. It might be indoors in a gym, but probably not in a kitchen. You filled in all this information yourself, using clues from the text but also a powerful and detailed understanding of the world.

And that’s just the spatial configuration at one moment in the story; there’s so much more you can infer with your mental model! You understand how Enzo and Mia move around over time. You realize why Mia ran back: to help Enzo up, she had to be next to him. You know the reason she wanted to help him up: he was her friend, and people generally want to help their friends. She also wanted to win, because people generally want to win competitions. She would win only if she got to the finish line first, because that’s how races work. The fact that she chose to help Enzo instead of winning suggests that she’s a kind person. Whew!

All this is not in the words. If you’re thinking it’s still kind of obvious, you’re right. But it’s only obvious to humans. These inferences spring from the shared mental models we have built up over time. Without that foundation, a computer may still answer questions, but it is hard to imagine it is reasoning in ways we can understand and trust.

Sure enough, asking questions that probe these mental models is a surefire way to stump some of today’s AI. For example, below are a few multiple-choice questions about Enzo and Mia’s race, showing the answers given by another state-of-the-art system called XLNet. We ran a version of XLNet that was trained on stories for English language learners in middle and high school, so it should have no trouble with kindergarten.

And yet:

These examples are not outliers. We asked XLNet over 200 similar questions probing its understanding of time, space, causality, and motivation in the story. On standard reading comprehension tests, the system is reported to answer over 80% of questions correctly. But on our mental model-based questions it scores closer to 20%, and without the ability to dig into its reasoning, it’s impossible to know whether even that 20% was basically luck. Clearly, XLNet’s mental model is out of whack.
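To make the probing setup concrete, here is a minimal sketch of how one might pose a multiple-choice question about the story to an XLNet-style model using the open-source Hugging Face Transformers library. The “xlnet-base-cased” checkpoint, the sample question, and the answer choices are illustrative stand-ins, not the fine-tuned system or the questions used in our experiments, and an off-the-shelf checkpoint’s multiple-choice head would need fine-tuning (for example, on a reading comprehension dataset) before its answers mean anything.

```python
# Sketch only: "xlnet-base-cased" is a public base checkpoint, not the
# fine-tuned reading-comprehension model discussed above, and its
# multiple-choice head is untrained until fine-tuned.
import torch
from transformers import XLNetForMultipleChoice, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetForMultipleChoice.from_pretrained("xlnet-base-cased")
model.eval()

story = (
    "Enzo and Mia were running a race. Enzo fell. He hurt his knee. "
    "Mia looked back. She was almost at the finish line. She wanted to win. "
    "If she kept running, she would win. Enzo was her friend. Mia stopped. "
    "She ran back to Enzo. She helped him up."
)
question = "Where was Mia when Enzo fell?"  # an illustrative spatial question
choices = [
    "Behind Enzo, near the start of the race.",
    "Between Enzo and the finish line.",
    "In the kitchen.",
]

# Pair the same (story + question) prompt with each answer choice, then add a
# batch dimension so the inputs have shape (batch=1, num_choices, seq_len).
prompt = f"{story} {question}"
encoding = tokenizer([prompt] * len(choices), choices, return_tensors="pt", padding=True)
inputs = {name: tensor.unsqueeze(0) for name, tensor in encoding.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # one score per answer choice, shape (1, num_choices)

print("Model picks:", choices[logits.argmax(dim=-1).item()])
```

The point of the sketch is that the interface is text in, scores out; nothing in it exposes a spatial or causal model of the scene that we could inspect.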

When we humans are reading, the mental models that we build make up the foundations of reading comprehension. When a system lacks these foundations, it struggles to understand what it reads. That fact explains the mistakes we’ve seen. Once you know to probe for mental models, it’s easy to poke holes in a system’s facade of understanding.

Of course, the billion-dollar question is how to give computers those mental models. We humans acquire them by walking around and experiencing the world with senses, goals, and interests, which today’s computers obviously don’t have.

Some have argued that more of the conventional style of research — more BERTs and GPT-2s and XLNets — is all we need. But all these systems are structured and trained in essentially the same way: they learn how our words correlate with each other. They don’t see the underlying mental models that cause us to say the words, and it’s not clear how the standard approach can ever recover those mental models.
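To see what “learning how words correlate” looks like in practice, here is a small, hedged illustration using the Hugging Face Transformers library and the public “bert-base-uncased” checkpoint as an arbitrary stand-in for this family of models. This shows BERT’s masked-word objective; GPT-2 and XLNet use related word-prediction objectives, but the training signal is purely textual in each case.

```python
# Illustration only: masked-word prediction with a public BERT checkpoint.
# The model ranks candidate words by how well they fit the surrounding text
# statistically; nothing in this objective requires a model of Enzo, Mia,
# the race, or anyone's goals.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Mia stopped and ran back to [MASK] her friend.")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```

This is not meant to diminish the pattern-matching power of these systems, only to show that the training signal is words predicting words.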

At EC, we believe that if AI is to be a fluent and trustworthy collaborator with humans, it needs to be able to construct robust, explicit, and transparent mental models from human language, and to communicate its own mental models back in the same form.