Elemental Cognition CEO David Ferrucci started the company with a big vision in mind. “If computers could understand what they read, the impact on humanity would be enormous,” he said. “Machines could help us in all kinds of new ways: reducing our information overload, answering our questions, and even using what they’ve read to help us think through new ideas. That’s why building a machine that can read, understand, and explain its answers is at the core of Elemental Cognition’s scientific mission.”

Getting there is a monumental challenge, but a crucial first step is defining the problem: what would it even mean for a machine to comprehend, and how would we know it had succeeded?

That question is the subject of our new position paper, “To Test Machine Comprehension, Start by Defining Comprehension,” slated for publication at this July’s ACL conference. The paper critically reviews the answers offered by current research and suggests a higher bar for evaluating machines’ comprehension.

Don’t just look around the reading comprehension gym

How to evaluate “machine reading comprehension,” or MRC, is a question the AI community has been grappling with for decades. Today, systems are typically evaluated on datasets that consist of text passages accompanied by a battery of questions. Comprehension is judged by how many of the questions a system can answer correctly.
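In code terms, the dominant paradigm boils down to roughly the loop below. This is a minimal sketch: the dataset format and the `answer_question` function are illustrative stand-ins, not any particular benchmark’s actual API, and real benchmarks typically normalize answers and allow multiple references rather than using exact string match.

```python
# Minimal sketch of the standard MRC evaluation loop described above.
# The dataset format and `answer_question` callable are illustrative
# stand-ins, not any specific benchmark's API.

from typing import Callable, Dict, List

Example = Dict[str, str]  # {"passage": ..., "question": ..., "answer": ...}

def evaluate_mrc(dataset: List[Example],
                 answer_question: Callable[[str, str], str]) -> float:
    """Return the fraction of questions the system answers correctly."""
    correct = 0
    for ex in dataset:
        prediction = answer_question(ex["passage"], ex["question"])
        # Real benchmarks usually normalize text and accept several
        # reference answers; exact match keeps the sketch short.
        if prediction.strip().lower() == ex["answer"].strip().lower():
            correct += 1
    return correct / len(dataset)

# Usage (assuming a hypothetical `my_system` with an answer_question method):
# accuracy = evaluate_mrc(dev_set, my_system.answer_question)
# print(f"Comprehension score: {accuracy:.1%}")
```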

On many of these benchmarks, recent natural language processing (NLP) technology has matched or even surpassed human performance. Accordingly, the past few years have seen a flurry of new datasets, each designed to make the task harder. For instance, they might ask questions that hinge on implicit commonsense facts like “People sometimes get headaches when they don’t sleep,” or that require counting how many times something happens.

In the paper, we argue that this difficulty-centric approach misses the mark. It’s like training to become a professional sprinter by “glancing around the gym and adopting any exercises that look hard.” You’ll end up working some relevant muscles, but you probably won’t achieve your ultimate goal.

Similarly, the goal of MRC isn’t to answer some arbitrary collection of tricky questions; it’s for systems to do something with what they’ve read. Ultimately, we want automated research assistants that synthesize information from many documents; game characters that act on what we tell them; legal assistance tools that understand our predicaments and help evaluate options; and so on. Today’s MRC research is effectively a training program for the longer-term goal of such sophisticated applications—and the dominant paradigm fails to keep its eye on the prize.

Measuring comprehension with a Template of Understanding

Instead of focusing on difficulty, we propose starting by establishing what content—what information expressed, implied, or relied on by a passage—machines need to comprehend for a given purpose. Then, MRC tests can ask questions that systematically evaluate systems’ progress toward that longer-term goal.

We demonstrate our approach with a “template of understanding” (ToU) for stories, which (as we argue in the paper) are a particularly fruitful genre for downstream applications. Specifically, we propose that story-oriented applications will need the answers to at least the following four questions:

  • Spatial: Where is everything located and how is it positioned throughout the story?
  • Temporal: What events occur and when?
  • Causal: How do events lead mechanistically to other events?
  • Motivational: Why do the characters decide to take the actions they take?

The paper goes into much more detail on where these questions come from, how they can be instantiated for any given story, what reasonable answers might look like, and how systems might be evaluated on their answers. In short, we show how the ToU for stories can serve as the foundation of a more thorough and systematic MRC evaluation.
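To make the template a bit more concrete, here is one minimal way a filled-in ToU for a single story might be represented in code. The class names, field names, and the example story are illustrative inventions for this sketch; the paper describes ToU content and evaluation in prose, not as a data structure.

```python
# Illustrative sketch of a filled-in Template of Understanding for one story.
# The classes, field names, and example story are invented for this sketch;
# the paper does not prescribe a particular data structure.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ToUEntry:
    question: str  # an instantiated ToU question about the story
    answer: str    # an answer a comprehending reader could reasonably give

@dataclass
class TemplateOfUnderstanding:
    spatial: List[ToUEntry] = field(default_factory=list)       # where things are and how they're positioned
    temporal: List[ToUEntry] = field(default_factory=list)      # what events occur, and when
    causal: List[ToUEntry] = field(default_factory=list)        # how events lead mechanistically to others
    motivational: List[ToUEntry] = field(default_factory=list)  # why characters act as they do

# A hypothetical instantiation for a made-up story:
tou = TemplateOfUnderstanding(
    temporal=[ToUEntry("When does Maria find the letter?",
                       "The morning after the storm, before breakfast.")],
    motivational=[ToUEntry("Why does Maria hide the letter?",
                           "She doesn't want her brother to hear the news from anyone but her.")],
)
```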

There’s still a long way to go before computers can reliably answer the ToU questions, as anyone familiar with MRC systems would probably agree. But to provide some concrete evidence, we created a simplified multiple-choice MRC test by fleshing out answers to the ToU questions for a few stories from an existing MRC dataset. When we put the test to a near-state-of-the-art system, it could correctly answer just 20% of the questions.

Building and evaluating MRC systems that perform better is going to take a lot of work from across the AI/NLP community and beyond. If you’d like to join us in hashing out these ideas, you can read the paper now on arXiv, chat with us on Twitter, and/or come to our (virtual!) talk at ACL in July. We’re looking forward to continuing this line of research, and we hope it stimulates some great conversations!