
In my last post, I shared some telling examples where computers failed to understand what they read. The errors they made were bizarre and fundamental. But why? Computers are clearly missing something, but can we pin down precisely what?

Let’s examine one specific error that sheds some light on the situation. My team ran an experiment where we took the same first-grade story I discussed last time, but truncated the final sentence:

Fernando and Zoey go to a plant sale. They buy mint plants. They like the minty smell of the leaves.

Fernando puts his plant near a sunny window. Zoey puts her plant in her bedroom. Fernando’s plant looks green and healthy after a few days. But Zoey’s plant has some brown leaves.

“Your plant needs more light,” Fernando says.

Zoey moves her plant to a sunny window. Soon, ___________.

[adapted from]

Then we asked workers on Amazon Mechanical Turk to fill in the blank. Here’s what the workers suggested:

The humans didn’t say the exact same thing. But they did express substantially the same idea — an idea very close to the original author’s ending: “Both plants look green and healthy.” All the humans are on the same page.

Now let’s see what a computer says. We gave the same prompt to OpenAI’s GPT-2, touted specifically for its ability to generate surprisingly realistic continuations of a document. We queried it 15 times. Here are some examples of the results:

To give credit where it’s due, a few of its outputs were close to the human answers:

But the system gave no sign that it recognized these answers were better, and on the whole, it’s clear that GPT-2 has no idea what’s going on. Why?

When humans read a text — at least a straightforward text like a first-grade story — we come up with largely the same mental model. Computers do not. This is my core diagnosis of all the bizarre errors that computers make with language: the mental model is what’s missing.

  • What’s in a mental model?

It shouldn’t be surprising that computers struggle with mental models; we humans fill in so much when we read! Let’s take a closer look at what’s in our own mental models.

For variety’s sake, we’ll switch to a kindergarten story:

Enzo and Mia were running a race. Enzo fell. He hurt his knee.

Mia looked back. She was almost at the finish line. She wanted to win. If she kept running, she would win.

Enzo was her friend. Mia stopped. She ran back to Enzo. She helped him up.

[adapted from]

On the surface, this story seems simple. But if you reflect on your own mental model, you’ll start to see just how deep it goes.

We humans pack a pretty sophisticated model of the world into our heads — one that captures all manner of mechanisms, relationships, and mental sketches.

Start with the spatial. How is the scene set when Mia considers whether to keep running? You can see it in your mind’s eye: Enzo is on the ground, maybe holding his knee. Mia is ahead, between Enzo and the finish line, her head pivoted around to look at him. Even for the details that aren’t entirely clear, you still have a general sense. For instance, the whole scene is probably outside, maybe on a track or in a field. It might be indoors in a gym, but probably not in a kitchen. You filled in all this information yourself, using clues from the text but also a powerful and detailed understanding of the world.

And that’s just the spatial configuration at one moment in the story; there’s so much more you infer! You understand how Enzo and Mia move around over time. You realize why Mia ran back: to help Enzo up, she had to be next to him. You know the reason she wanted to help him up: he was her friend, and people generally want to help their friends. She also wanted to win, because people generally want to win competitions. She would win only if she got to the finish line first, because that’s how races work. The fact that she chose to help Enzo instead of winning suggests that she’s a kind person.

If you’re thinking this is all kind of obvious, you’re right! But it’s only obvious to humans. I claim that a mental model like this is the core thing computers are missing.

  • Mental model mistakes

Sure enough, asking questions that probe these mental models is a surefire way to stump computers. For example, below are a few multiple-choice questions about Enzo and Mia’s race, along with the answers given by another state-of-the-art system called XLNet. We ran a version of XLNet that was trained on stories for English language learners in middle and high school, so a kindergarten story should give it no trouble.

And yet:

Again, these examples are not outliers. We asked XLNet over 200 similar questions probing its understanding of time, space, causality, and motivation in the story. The system is reported to answer over 80% of standard reading comprehension questions correctly. But on our mental-model questions, it gets more like 20%, and without the ability to dig into the model’s reasoning, it’s impossible to know whether even that 20% was more than luck. Clearly, XLNet’s mental model is out of whack.

When we humans are reading, the mental models that we build make up the foundations of reading comprehension. When a system lacks these foundations, it struggles to understand what it reads. That fact explains pretty much all the mistakes we’ve seen.

It also explains what we saw in the previous post. BERT claimed that when Zoey bought her plant, it was located “in her bedroom” — spatial model out of whack. It attributed her plant having too little light to “brown leaves” — causal model out of whack. Once you know to probe for mental models, it’s easy to poke holes in a system’s facade of understanding.

Of course, the billion-dollar question is how to give computers those mental models. We humans acquire them by walking around and experiencing the world with senses, goals, and interests, which today’s computers obviously don’t have.

Some have argued that more of the conventional style of research — more BERTs and GPT-2s and XLNets — is all we need. But all these systems are structured and trained in essentially the same way: they learn how our words correlate with each other. They don’t see the underlying mental models that cause us to say the words, and it’s not clear how the standard approach can ever recover those mental models. More on that in my next post.