
For computers to understand what they read, they need to know how the world works. Take, for example, the following story from the ROCStories corpus:

Gage was riding his bike. A car turned in front of him. Gage turned his bike sharply. He fell off of his bike. Gage skinned his knee.

To grasp what happened here, you rely on a lot of common sense. You recognize that the car’s turning put Gage in danger; that after Gage fell he was on the ground; that falling caused Gage to skin his knee; and on and on. All this commonsense knowledge comes naturally to humans, but computers still struggle with basic cause and effect; they simply lack so much background knowledge about the world.

Ideally, computers would acquire background knowledge by tapping into humanity’s vast collective experience. Two commonly used tools for this purpose are crowdsourcing and statistical language models. Crowdsourcing lets researchers elicit input from large numbers of people, who state their knowledge explicitly. Language models, meanwhile, automatically pick up knowledge from more implicit cues: they mine large bodies of text for linguistic patterns that reflect facts about the world.

For knowledge about cause and effect, though, it’s not obvious how to apply either tool. A language model learns from documents, but those documents tend to leave commonsense facts implicit; people rarely write statements as obvious as “Falling down leads to being on the ground.” They might describe a particular falling-down event, but they would likely leave the cause, the effect, or the connection between them implicit, or describe them in a convoluted way. That makes it difficult—maybe even impossible—for a language model to infer the causal relationships.

The gap could be filled by crowdsourcing, with crowd workers writing down explicit causal connections. But guiding humans to articulate how they are connecting the dots is harder than it might seem. When past projects have had humans annotate causes and effects, the resulting datasets have generally been limited in scope, size, and/or quality. It has remained unclear how crowdsourcing can yield causal knowledge that is extensive, accurate, and easy to apply to new situations.

Elemental Cognition’s GLUCOSE (“GeneraLized and COntextualized Story Explanations”) dataset, described in a soon-to-be-published paper, offers a solution. The paper demonstrates how to crowdsource causal explanations using a new data representation that captures the causes and effects of each event in a short story. GLUCOSE contains more than 310K entries, each pairing a story-specific explanation of an event (e.g., “Gage falls off his bike leads to Gage being on the ground”) with a general causal rule showing how the cause/effect relationship generalizes to other situations (e.g., “Person X falls off of something leads to Person X being on the ground”). The paper also shows how the dataset can help language models hypothesize about the causal relationships in new stories.

  • The GLUCOSE knowledge model

GLUCOSE aims to provide partial explanations of the causes and effects of each event in a short story. Consider the event Gage turned his bike sharply from the story above. GLUCOSE asks five questions about its causes: for example, what event causes or enables it, and what emotion or drive motivates it.

For each of these questions, there is also a mirror-image version about the event’s effects—events caused/enabled by Gage’s turning, emotions engendered by his turning, etc. In total, then, GLUCOSE includes 10 “dimensions of causal explanation” for each event in the story.

The answers to these questions describe causes of this one event in this particular story. But for a machine to apply common sense to every document it reads, it also needs to be able to generalize its knowledge to new stories. That’s why each entry in GLUCOSE contains not just a story-specific causal explanation, but a general causal rule exemplified by the story-specific explanation. For instance, the example answer for Dimension #2 (emotions/drives) might be generalized to:

Someone wants safety Causes/Enables Someone moves away from Something that is dangerous

For each question, then, the crowd workers both record a general causal rule and demonstrate how it applies in this context. This combination of general and specific gives AI systems a powerful way to learn about what cause/effect relationships are relevant in which contexts. (It’s also the source of the “GeneraLized and COntextualized” part of the name “GLUCOSE.”)
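
To make the specific-plus-general pairing concrete, here is a minimal sketch of how a single GLUCOSE-style entry could be represented, built from the example explanation quoted earlier. The field names and dimension numbering are illustrative assumptions for this post, not the dataset’s actual schema.

```python
# Illustrative sketch of one GLUCOSE-style entry.
# Field names and dimension numbering are hypothetical, not the dataset's schema.
entry = {
    "story": (
        "Gage was riding his bike. A car turned in front of him. "
        "Gage turned his bike sharply. He fell off of his bike. "
        "Gage skinned his knee."
    ),
    # The event being explained, and which of the 10 dimensions the explanation addresses.
    "selected_sentence": "He fell off of his bike.",
    "dimension": 1,  # hypothetical numbering for this illustration
    # Story-specific explanation, tied to this particular story.
    "specific_statement": "Gage falls off his bike leads to Gage being on the ground",
    # General causal rule exemplified by the specific statement above.
    "general_rule": "Person X falls off of something leads to Person X being on the ground",
}
```

Pairing the two statements in one entry is what lets a learner see both the abstract rule and a concrete context in which it applies.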

The GLUCOSE knowledge model is the product of many iterations of experiments with Amazon Mechanical Turk workers. At first we targeted a richer suite of concepts and relationships, drawing on evidence from cognitive science and from our own pilot experiments about what makes for satisfying explanations. But our workers had trouble distinguishing some of our initial categories—e.g., “cause” vs. “enable”—so we simplified away the more confusing elements of our design. We also built a constrained input interface that helps the workers construct semi-standardized statements and rules. These measures helped guide workers to consistent explanations.

  • Injecting GLUCOSE into language models

GLUCOSE provides the crowdsourced knowledge, but natural language understanding (NLU) systems still need to learn from it. An NLU system might need to explain an infinite variety of stories, so it needs to generalize from its training, both by applying causal rules it has seen to new contexts and by hypothesizing new causal relationships. Such generalization is the second contribution of our paper: we demonstrate how to use GLUCOSE data with language models to automatically generate partial causal explanations for new scenarios.

The idea is to cast the task as reading a passage of text and generating more text in response—a setup language models are well-suited to learn. The model reads in the story, the sentence to be explained, and the dimension number to explain, all formatted as one paragraph. The model is then supposed to complete the paragraph by spitting out a story-specific explanation for the specified dimension.
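
As a rough illustration of this text-in, text-out setup, the sketch below packs the story, the sentence to be explained, and a dimension number into one prompt and asks a generic pretrained encoder-decoder (T5 from Hugging Face Transformers) to complete it. The prompt format, model choice, and dimension numbering are assumptions for illustration, not the formatting used in the paper, and an off-the-shelf model will not produce useful explanations until it has been fine-tuned on GLUCOSE-style data (see below).

```python
# Minimal sketch of the text-in, text-out setup, using a generic pretrained
# encoder-decoder. The prompt format below is illustrative, not the paper's.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

story = (
    "Gage was riding his bike. A car turned in front of him. "
    "Gage turned his bike sharply. He fell off of his bike. "
    "Gage skinned his knee."
)
sentence = "He fell off of his bike."
dimension = 1  # which of the 10 dimensions to explain (illustrative numbering)

# Pack the story, target sentence, and dimension into a single input paragraph.
prompt = f"dimension {dimension}: {sentence} context: {story}"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```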


Results from different models on the GLUCOSE explanation task, as rated by Turker evaluation. Explanations were rated on a 4-point scale, with 0 meaning “completely incorrect” and 3 meaning “completely correct.”

When we put this task to an out-of-the-box language model, the results (light purple above) were underwhelming: most answers were rated by crowd evaluators as completely or mostly incorrect. But when we gave the same language model some extra training on the GLUCOSE data—a process known as “fine-tuning”—the scores on most dimensions more than doubled (medium purple). The results improved even further with a more sophisticated, state-of-the-art “encoder-decoder” model (dark purple), which leveraged the general causal rules to improve story-specific explanations.
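
For readers curious what that fine-tuning step could look like in practice, here is a minimal sketch that continues training a T5 model on prompt/explanation pairs formatted like the earlier snippet. The example pair, the “**” separator joining the specific statement and general rule, and the hyperparameters are all illustrative assumptions, not the paper’s actual training setup.

```python
# Minimal fine-tuning sketch: continue training a seq2seq model on
# (formatted prompt, reference explanation) pairs derived from GLUCOSE.
# Data format and hyperparameters here are placeholders, not the paper's setup.
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

train_pairs = [
    (
        "dimension 1: He fell off of his bike. context: Gage was riding his bike. "
        "A car turned in front of him. Gage turned his bike sharply. "
        "He fell off of his bike. Gage skinned his knee.",
        # Target: specific statement and general rule, joined by an assumed separator.
        "Gage falls off his bike leads to Gage being on the ground ** "
        "Person X falls off of something leads to Person X being on the ground",
    ),
    # ... remaining training examples would go here ...
]

optimizer = AdamW(model.parameters(), lr=3e-5)
model.train()
for epoch in range(3):
    for prompt, target in train_pairs:
        batch = tokenizer(prompt, return_tensors="pt", truncation=True)
        labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
        loss = model(**batch, labels=labels).loss  # standard seq2seq cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```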

Of course, these scores don’t tell the full story; there’s still plenty of room for improvement, particularly on texts beyond ROCStories. But our results do show that the technique is promising.

  • What it means for future research

We’re excited about GLUCOSE for several reasons. First, it demonstrates that, despite the subtleties that have hampered previous attempts, we can collect high-quality, large-scale data about commonsense causality. Each story event in the dataset received a diversity of explanations from different annotators, and the entries total more than 310K. Yet even with this scale and diversity, the quality remains high: workers continuously received expert feedback on their submission quality; we retained submissions mainly from workers with high expert ratings; and crowdsourced human evaluation rated most submissions as mostly or entirely correct (gold bars above).

The second reason to be excited is that, as our experiments demonstrate, crowdsourced data can make language models smarter about cause and effect. The language models bring general knowledge about the world and language; the crowdsourced data adds knowledge about causality; and what emerges from their union is a powerful system for hypothesizing causal explanations for new scenarios.

For Elemental Cognition, though, getting causal statements in GLUCOSE’s semi-structured format is just the start. We’re hoping GLUCOSE-trained models can help computers understand and reason about what they read, filling in the necessary background on cause and effect. Accordingly, we’re working to incorporate GLUCOSE into a broader NLU architecture that can interpret what it’s reading, answer questions, acquire missing knowledge from users, and explain its answers. We have a patent pending on how GLUCOSE can play a role there, so stay tuned for more research in this vein. We’re looking forward to continuing to share our progress toward general-purpose NLU!