When Elemental Cognition CEO David Ferrucci started the company, he had a big vision in mind.
“If computers could understand what they read, the impact on humanity would be enormous,” he said. “Machines could help us in all kinds of new ways: reducing our information overload, answering our questions, and even using what they’ve read to help us think through new ideas. That’s why building a machine that can read, understand, and explain its answers is at the core of Elemental Cognition’s scientific mission.”
Getting there will be a monumental challenge, but a crucial first step is defining the problem: what would it even mean for a machine to comprehend, and how would we know it had succeeded?
How to evaluate ‘machine reading comprehension’, or MRC, is a question the AI community has been grappling with for decades. Today, systems are typically evaluated on datasets that consist of text passages accompanied by a battery of questions. Comprehension is judged by how many of the questions a system can answer correctly.
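As a rough illustration of that scoring scheme, here is a minimal sketch of how a system might be graded on such a dataset. The `Question` record, the `normalize` helper, and the exact-match criterion are all assumptions made for this example; real benchmarks define their own data formats and matching rules.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Question:
    """Hypothetical record for one benchmark item (not any dataset's real schema)."""
    passage: str  # text the system is asked to read
    prompt: str   # question about the passage
    gold: str     # reference answer


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are not counted as errors."""
    return " ".join(answer.lower().split())


def accuracy(system: Callable[[str, str], str], questions: Sequence[Question]) -> float:
    """Return the fraction of questions answered correctly under exact match.

    `system` is any callable that takes (passage, prompt) and returns an answer string.
    """
    if not questions:
        return 0.0
    correct = sum(
        normalize(system(q.passage, q.prompt)) == normalize(q.gold)
        for q in questions
    )
    return correct / len(questions)
```

A score computed this way says how many questions were answered correctly, but nothing about which aspects of the passage those questions actually covered.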
On many of these benchmarks, recent natural language processing (NLP) technology has matched or even surpassed human performance. Accordingly, the past few years have seen a flurry of new datasets, each designed to make the task harder. For instance, they might ask questions that hinge on implicit commonsense facts like “People sometimes get headaches when they don’t sleep,” or questions that require counting how many times something happens.
In our paper, we argue that this difficulty-centric approach misses the mark. It’s like training to become a professional sprinter by “glancing around the gym and adopting any exercises that look hard.” You’ll end up working some relevant muscles, but you probably won’t achieve your ultimate goal.
Similarly, the goal of MRC isn’t to answer some arbitrary collection of tricky questions. It’s for systems to do something with what they’ve read. Ultimately, we want automated research assistants that synthesize information from many documents; game characters that act on what we tell them; legal assistance tools that understand our predicaments and help evaluate options; and so on. Today’s MRC research is effectively a training program for the longer-term goal of such sophisticated applications—and the dominant paradigm fails to keep its eye on the prize.
Instead of focusing on difficulty, we propose starting by establishing what content—what information expressed, implied, or relied on by a passage—machines need to comprehend for a given purpose. Then, MRC tests can ask questions that systematically evaluate systems’ progress toward that longer-term goal.
We demonstrate our approach with a “template of understanding” (ToU) for stories, which (as we argue in the paper) are a particularly fruitful genre for downstream applications.
The paper details which questions the ToU includes, how they can be instantiated for any given story, what reasonable answers might look like, and how systems might be evaluated on their answers. In short, we show how the ToU for stories can serve as the foundation of a more thorough and systematic MRC evaluation.
There’s still a long way to go before computers can reliably answer the ToU questions, as anyone familiar with MRC systems would probably agree. But to provide some concrete evidence, we created a simplified multiple-choice MRC test by fleshing out answers to the ToU questions for a few stories from an existing MRC dataset. When we put the test to a near-state-of-the-art system, it could correctly answer just 20% of the questions.
Building and evaluating MRC systems that perform better is going to take a lot of work from across the AI/NLP community and beyond.
If you’d like to join us in hashing out these ideas, you can read the paper now on arXiv or chat with us on Twitter.