
For computers to understand what they read, they need to know how the world works. Take, for example, the following story from the ROCStories corpus:

Gage was riding his bike. A car turned in front of him. Gage turned his bike sharply. He fell off of his bike. Gage skinned his knee.

To grasp what happened here, you rely on a lot of common sense. You recognize that the car’s turning put Gage in danger; that after Gage fell he was on the ground; that falling caused Gage to skin his knee; and on and on. All this commonsense knowledge comes naturally to humans, but computers still struggle with basic cause and effect; they simply lack so much background knowledge about the world.

Ideally, computers would acquire background knowledge by tapping into humanity’s vast collective experience. Two commonly used tools for this purpose are crowdsourcing and statistical language models. Crowdsourcing lets researchers elicit input from large numbers of people, who state their knowledge explicitly. Language models, meanwhile, automatically pick up knowledge from more implicit cues: they mine large bodies of text for linguistic patterns that reflect facts about the world.

For knowledge about cause and effect, though, it’s not obvious how to apply either tool. A language model learns from documents, but those documents tend to leave commonsense facts implicit; people rarely write statements as obvious as “Falling down leads to being on the ground.” They might describe a particular falling-down event, but they would likely leave the cause, the effect, or the connection between them implicit, or describe them in a convoluted way. That makes it difficult—maybe even impossible—for a language model to infer the causal relationships.

The gap could be filled by crowdsourcing, with crowd workers writing down explicit causal connections. But guiding humans to articulate how they are connecting the dots is harder than it might seem. When past projects have had humans annotate causes and effects, the resulting datasets have generally been limited in scope, size, and/or quality. It has remained unclear how crowdsourcing can yield causal knowledge that is extensive, accurate, and easy to apply to new situations.

Elemental Cognition’s GLUCOSE (“GeneraLized and COntextualized Story Explanations”) dataset, described in a soon-to-be-published paper, offers a solution. The paper demonstrates how to crowdsource causal explanations using a new data representation that captures the causes and effects of each event in a short story. GLUCOSE contains more than 310K entries, each pairing a story-specific explanation of an event (e.g., “Gage falls off his bike leads to Gage being on the ground”) with a general causal rule showing how the cause/effect relationship generalizes to other situations (e.g., “Person X falls off of something leads to Person X being on the ground”). The paper also shows how the dataset can help language models hypothesize about the causal relationships in new stories.

  • The GLUCOSE knowledge model

GLUCOSE aims to provide partial explanations of the causes and effects of each event in a short story. Consider the event Gage turned his bike sharply from the story above. GLUCOSE asks five questions about its causes: for example, what event causes or enables it, and what emotion or drive motivates it.

For each of these questions, there is also a mirror-image version about the event’s effects—events caused/enabled by Gage’s turning, emotions engendered by his turning, etc. In total, then, GLUCOSE includes 10 “dimensions of causal explanation” for each event in the story.

The answers to these questions describe causes of this one event in this particular story. But for a machine to apply common sense to every document it reads, it also needs to be able to generalize its knowledge to new stories. That’s why each entry in GLUCOSE contains not just a story-specific causal explanation, but a general causal rule exemplified by the story-specific explanation. For instance, the example answer for Dimension #2 (emotions/drives) might be generalized to:

Someone wants safety Causes/Enables Someone moves away from Something that is dangerous

For each question, then, the crowd workers both record a general causal rule and demonstrate how it applies in this context. This combination of general and specific gives AI systems a powerful way to learn about what cause/effect relationships are relevant in which contexts. (It’s also the source of the “GeneraLized and COntextualized” part of the name “GLUCOSE.”)
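
To make the specific-plus-general pairing concrete, here is a minimal sketch of how a single GLUCOSE-style entry could be represented, built from the example explanation quoted earlier. The field names and dimension numbering are illustrative assumptions for this post, not the dataset’s actual schema.

```python
# Illustrative sketch of one GLUCOSE-style entry.
# Field names and dimension numbering are hypothetical, not the dataset's schema.
entry = {
    "story": (
        "Gage was riding his bike. A car turned in front of him. "
        "Gage turned his bike sharply. He fell off of his bike. "
        "Gage skinned his knee."
    ),
    # The event being explained, and which of the 10 dimensions the explanation addresses.
    "selected_sentence": "He fell off of his bike.",
    "dimension": 1,  # hypothetical numbering for this illustration
    # Story-specific explanation, tied to this particular story.
    "specific_statement": "Gage falls off his bike leads to Gage being on the ground",
    # General causal rule exemplified by the specific statement above.
    "general_rule": "Person X falls off of something leads to Person X being on the ground",
}
```

Pairing the two statements in one entry is what lets a learner see both the abstract rule and a concrete context in which it applies.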

The GLUCOSE knowledge model is the product of many iterations of experiments with Amazon Mechanical Turk workers. At first we targeted a richer suite of concepts and relationships, drawing on evidence from cognitive science and from our own pilot experiments about what makes for satisfying explanations. But our workers had trouble distinguishing some of our initial categories—e.g., “cause” vs. “enable”—so we simplified away the more confusing elements of our design. We also built a constrained input interface that helps the workers construct semi-standardized statements and rules. These measures helped guide workers to consistent explanations.

  • Injecting GLUCOSE into language models

GLUCOSE provides the crowdsourced knowledge, but natural language understanding (NLU) systems still need to learn from it. An NLU system might need to explain an infinite variety of stories, so it needs to generalize from its training, both by applying causal rules it has seen to new contexts and by hypothesizing new causal relationships. Such generalization is the second contribution of our paper: we demonstrate how to use GLUCOSE data with language models to automatically generate partial causal explanations for new scenarios.

The idea is to cast the task as reading a passage of text and generating more text in response—a setup language models are well-suited to learn. The model reads in the story, the sentence to be explained, and the dimension number to explain, all formatted as one paragraph. The model is then supposed to complete the paragraph by spitting out a story-specific explanation for the specified dimension.
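
As a rough illustration of this text-in, text-out setup, the sketch below packs the story, the sentence to be explained, and a dimension number into one prompt and asks a generic pretrained encoder-decoder (T5 from Hugging Face Transformers) to complete it. The prompt format, model choice, and dimension numbering are assumptions for illustration, not the formatting used in the paper, and an off-the-shelf model will not produce useful explanations until it has been fine-tuned on GLUCOSE-style data (see below).

```python
# Minimal sketch of the text-in, text-out setup, using a generic pretrained
# encoder-decoder. The prompt format below is illustrative, not the paper's.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

story = (
    "Gage was riding his bike. A car turned in front of him. "
    "Gage turned his bike sharply. He fell off of his bike. "
    "Gage skinned his knee."
)
sentence = "He fell off of his bike."
dimension = 1  # which of the 10 dimensions to explain (illustrative numbering)

# Pack the story, target sentence, and dimension into a single input paragraph.
prompt = f"dimension {dimension}: {sentence} context: {story}"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```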


Results from different models on the GLUCOSE explanation task, as rated by Turker evaluation. Explanations were rated on a 4-point scale, with 0 meaning “completely incorrect” and 3 meaning “completely correct.”

When we put this task to an out-of-the-box language model, the results (light purple above) were underwhelming: most answers were rated by crowd evaluators as completely or mostly incorrect. But when we gave the same language model some extra training on the GLUCOSE data—a process known as “fine-tuning”—the scores on most dimensions more than doubled (medium purple). The results improved even further with a more sophisticated, state-of-the-art “encoder-decoder” model (dark purple), which leveraged the general causal rules to improve story-specific explanations.
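
For readers curious what that fine-tuning step could look like in practice, here is a minimal sketch that continues training a T5 model on prompt/explanation pairs formatted like the earlier snippet. The example pair, the “**” separator joining the specific statement and general rule, and the hyperparameters are all illustrative assumptions, not the paper’s actual training setup.

```python
# Minimal fine-tuning sketch: continue training a seq2seq model on
# (formatted prompt, reference explanation) pairs derived from GLUCOSE.
# Data format and hyperparameters here are placeholders, not the paper's setup.
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

train_pairs = [
    (
        "dimension 1: He fell off of his bike. context: Gage was riding his bike. "
        "A car turned in front of him. Gage turned his bike sharply. "
        "He fell off of his bike. Gage skinned his knee.",
        # Target: specific statement and general rule, joined by an assumed separator.
        "Gage falls off his bike leads to Gage being on the ground ** "
        "Person X falls off of something leads to Person X being on the ground",
    ),
    # ... remaining training examples would go here ...
]

optimizer = AdamW(model.parameters(), lr=3e-5)
model.train()
for epoch in range(3):
    for prompt, target in train_pairs:
        batch = tokenizer(prompt, return_tensors="pt", truncation=True)
        labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
        loss = model(**batch, labels=labels).loss  # standard seq2seq cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```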

Of course, these scores don’t tell the full story; there’s still plenty of room for improvement, particularly on texts beyond ROCStories. But our results do show that the technique is promising.

  • What it means for future research

We’re excited about GLUCOSE for several reasons. First, it demonstrates that, despite the subtleties that have hampered previous attempts, we can collect high-quality, large-scale data about commonsense causality. Each story event in the dataset received a diversity of explanations from different annotators, and the entries total more than 310K. Yet even with this scale and diversity, the quality remains high: workers continuously received expert feedback on their submission quality; we retained submissions mainly from workers with high expert ratings; and crowdsourced human evaluation rated most submissions as mostly or entirely correct (gold bars above).

The second reason to be excited is that, as our experiments demonstrate, crowdsourced data can make language models smarter about cause and effect. The language models bring general knowledge about the world and language; the crowdsourced data adds knowledge about causality; and what emerges from their union is a powerful system for hypothesizing causal explanations for new scenarios.

For Elemental Cognition, though, getting causal statements in GLUCOSE’s semi-structured format is just the start. We’re hoping GLUCOSE-trained models can help computers understand and reason about what they read, filling in the necessary background on cause and effect. Accordingly, we’re working to incorporate GLUCOSE into a broader NLU architecture that can interpret what it’s reading, answer questions, acquire missing knowledge from users, and explain its answers. We have a patent pending on how GLUCOSE can play a role there, so stay tuned for more research in this vein. We’re looking forward to continuing to share our progress toward general-purpose NLU!