Computers that claim to “understand” language still lack world models—faithful representations of how objects and events relate to each other. As I established in my last post, you can reliably trip up today’s systems by asking about when events occur, where things are located, and why it’s all happening.
At the same time, the broader field of natural language processing (NLP) seems to be advancing by leaps and bounds. Researchers keep announcing ever-more-powerful tools for translation, chatbots, and even writing assistance. Most recently, OpenAI’s GPT-3 system has proven to be an impressively fluent and adaptable text generator. With all this progress, why does NLP still struggle with world models?
The problem comes into focus when you consider how today’s NLP systems work. They’re all what I like to call “super-parrots.” They do well at narrow tasks, but only by constantly asking themselves one simple question: “Based on all the documents I’ve seen, what would a human likely say here?”
The catch is that those documents were written by humans for humans. What’s left unwritten, then, is the vast body of shared experience that led humans to write the words in the first place.
Machines do not share that world knowledge, and as we’ll see, squeezing it out of typical text would be impractical, maybe even impossible. That’s why super-parrots find world models so hard: their training leaves them blind to the world we’re writing about—and if we continue down the super-parrot path, it’s hard to see how that can change.
The text, lots of text, and nothing but the text
To see why today’s systems aren’t getting what they need from their training, it helps to understand how they work. In my last few posts, I’ve shared experiments on three state-of-the-art systems: BERT, GPT-2, and XLNet. All three systems are language models—the technology behind most of the recent breakthroughs in NLP, including GPT-3.
A language model is just a string probability guesser. Its superpower is to look at a string of text—a word, a sentence, a paragraph—and guess how likely it is that a human would write that string. To make these guesses, language models analyze mounds of text in search of statistical patterns, such as what words tend to appear near what other words, or how key terms repeat throughout a paragraph.
It turns out that guessing what humans might say is an enormously versatile power. Say you want to know who “she” refers to in the sentence “Joan thanked Susan for all the help she had given,” but you don’t speak English. With a good language model, you could substitute “Joan” and “Susan” for “she,” then check which version of the sentence humans are more likely to write. A well-trained model will recognize that “Joan thanked Susan for all the help Susan had given” is more likely than “Joan thanked Susan for all the help Joan had given.”
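To make the trick concrete, here is a deliberately tiny sketch: a bigram counter standing in for a real language model, scoring the two substituted sentences against a made-up three-sentence corpus. Everything here is a toy assumption (the corpus, the smoothing, the scoring function); a real system learns far subtler patterns from billions of words, but the underlying question it asks is the same.

```python
from collections import defaultdict

# Hypothetical mini-corpus standing in for the mounds of text
# a real language model trains on.
corpus = [
    "joan thanked susan for all the help susan had given",
    "mary thanked ann for all the help ann had given",
    "bob thanked carl for all the help carl had given",
]

# Count how often each word follows each other word.
bigrams = defaultdict(int)
unigrams = defaultdict(int)
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    for prev, cur in zip(words, words[1:]):
        bigrams[(prev, cur)] += 1
        unigrams[prev] += 1

def score(sentence):
    """Estimate how likely a human is to write this string:
    a product of smoothed bigram probabilities."""
    words = ["<s>"] + sentence.lower().split()
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(unigrams))
    return prob

susan = score("Joan thanked Susan for all the help Susan had given")
joan = score("Joan thanked Susan for all the help Joan had given")
print(susan > joan)  # prints True: the "Susan" reading looks more human-like
```

Note that the model resolves the pronoun without any notion of who Joan and Susan are, or what thanking and helping mean; it simply prefers the string that better matches patterns it has seen before.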
You could use similar tricks to choose the best answer to a multiple-choice question. You could even mimic how someone would finish an email by guessing the most likely next word at each step. All useful behaviors, of course, as I’ll discuss in upcoming posts—but not necessarily demonstrations of understanding.
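The email-completion trick can be sketched the same way. The toy below, built on a handful of made-up example sentences, finishes a phrase by greedily picking the likeliest next word at each step; in miniature, that is all a text generator built on a language model does.

```python
from collections import defaultdict

# Hypothetical example sentences; a real model learns from vast text.
corpus = [
    "thanks again for the report",
    "thanks again for the help",
    "thanks again for the report you sent",
]

# For each word, count which words tend to follow it.
nxt = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for prev, cur in zip(words, words[1:]):
        nxt[prev][cur] += 1

def complete(prefix, steps=3):
    """Finish a phrase by repeatedly guessing the likeliest next word."""
    words = prefix.split()
    for _ in range(steps):
        options = nxt[words[-1]]
        if not options:
            break
        words.append(max(options, key=options.get))
    return " ".join(words)

print(complete("thanks again"))  # prints "thanks again for the report"
```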
It’s not in the data
Language models and their descendants have helped tremendously with tasks like translation and web search. But ultimately, NLP aims higher. We want machines to understand what they read, and to converse, answer questions, and act based on their understanding. So I have to wonder: how much closer are we now than we were a decade ago, before neural network language models swept the field?
As impressive as today’s NLP is, I worry that it’s still on a path that comes with severe limitations. A system’s “understanding” can only go so far when its world consists entirely of what writers typically say. The concepts we want machines to learn just aren’t evident in the data we’re giving them.
Yes, language models can learn that humans often write “bowl” near “kitchen.” But that’s the grand total of what a language model understands about bowls and kitchens. Everything else that humans know about these objects—that bowls have raised edges, that bowls often break apart if you drop them, that people go to kitchens when hungry to find food—is taken for granted. All this context is obvious to us thanks to our shared experiences, so writers don’t bother to lay it all out.
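In miniature, that “grand total” looks like the sketch below: a co-occurrence counter run over a hypothetical three-sentence corpus. The resulting number is, quite literally, everything such a statistic records about bowls and kitchens; the raised edges, the breakability, and the hunger never enter the picture.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus; real systems count over billions of words.
sentences = [
    "she left the bowl in the kitchen",
    "the kitchen shelf held a blue bowl",
    "he carried the bowl to the table",
]

# The model's entire "knowledge": which words appear near which others.
cooccur = Counter()
for sentence in sentences:
    for a, b in combinations(set(sentence.split()), 2):
        cooccur[tuple(sorted((a, b)))] += 1

print(cooccur[("bowl", "kitchen")])  # prints 2, and that is the whole story
```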
Of course, some facts about the world, including bowls’ association with kitchens, do leak through into language. With enough text to reveal such associations, maybe a powerful computer really could reverse-engineer the world. After all, machine learning excels at inferring correlations!
But let’s consider what that would entail. Think of all the visceral, physical phenomena you could write about—things like shivering in the cold or spilling a glass of water. Now take one of those phenomena and imagine it as a motivation—say, someone preventively moving a glass out of the way of an incoming pot roast. Then wrap that motivation in conversation:
Manny approached the table with the steaming roast. “Watch out!” said Emma. “There, I moved the water for you.”
Can a machine really be expected to unravel all the layers of this situation into a faithful model? Could it really explain why Emma warned Manny, or predict the fate of the table had Emma done nothing, when all the machine is trained to do is guess what phrase comes next?
If language models are the endgame of NLP, then the answer would have to be yes. Language models would have to be able to simulate the world accurately enough to reason about endlessly many situations at least as complex as the one above. The models would also have to be able to translate language into these detailed simulations. And they would have to learn all this just by squinting at the world through the pinhole of language, asking themselves time and time again what words a human would write.
Impossible? Perhaps not. But it certainly strains plausibility. Some aspects of the world might just never be reflected in corpora, even indirectly. After all, we humans don’t use language to articulate full models of the world; we use it to point to elements of the model we already share. Indeed, when conversation partners don’t share enough of their models, communication fails. So if all a computer can see is what people typically say, it’s not clear that a world model is even theoretically available to extract.
When two humans communicate, their words merely point to concepts each already has in their mental model of the world. Successful communication hinges on both participants sharing enough background knowledge and experiences that their mental models are similar.
A language model does not share most of the background knowledge that humans do; by default, all a computer has access to is the words, not the concepts they point to.
As for the information that is extractable, much of it would be a huge computational lift. As the pot roast example demonstrates, details of the world model would have to be ferreted out from the most oblique, distorted cues. For a language model, it’s just so much easier to find shallow patterns in the letters than to bootstrap an accurate world model. And if shallow patterns are all a model knows, the best it can hope to be is a super-parrot.
A paradigm shift
I’m not just saying that we’re not there yet. My fear is that the super-parrot paradigm will never achieve deeper understanding.
Of course, language modeling does have its appeal. The industry wants to harvest low-hanging fruit, and shallow textual analysis is tremendously valuable. (GPT-3 pulls off some genuinely impressive feats. Even though it lacks a world model that it’s communicating about, its unparalleled pattern-matching will surely find many applications.) Academics, meanwhile, like incremental progress against clear benchmarks, which are easier to define for text strings. Then there’s the issue of data: especially with the Internet, it’s far easier to harvest raw text than it is to gather rich data from which machines could learn a world model.
But if the end goal of NLP is intelligent assistants, guides, and conversationalists, language models and their ilk might just be hopelessly impoverished. They can certainly help, both by guiding guesses about world models and by improving fluency when communicating with humans. But for the core task of getting from text to understanding, it’s time to explore richer paradigms.
Of course, the billion-dollar question is what paradigm would suffice. I don’t claim to have all the answers, but we at Elemental Cognition have some pretty specific ideas. In future posts, I’ll start to lay out our vision.