Elemental Cognition CEO David Ferrucci started the company with a big vision in mind. “If computers could understand what they read, the impact on humanity would be enormous,” he said. “Machines could help us in all kinds of new ways: reducing our information overload, answering our questions, and even using what they’ve read to help us think through new ideas. That’s why building a machine that can read, understand, and explain its answers is at the core of Elemental Cognition’s scientific mission.”

Getting there is a monumental challenge, but a crucial first step is defining the problem: what would it even mean for a machine to comprehend, and how would we know it had succeeded?

That question is the subject of our new position paper, “To Test Machine Comprehension, Start by Defining Comprehension,” slated for publication at this July’s ACL conference. The paper critically reviews the answers offered by current research and suggests a higher bar for evaluating machines’ comprehension.

Don’t just look around the reading comprehension gym

How to evaluate “machine reading comprehension,” or MRC, is a question the AI community has been grappling with for decades. Today, systems are typically evaluated on datasets that consist of text passages accompanied by a battery of questions. Comprehension is judged by how many of the questions a system can answer correctly.
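In code terms, the dominant paradigm boils down to roughly the loop below. This is a minimal sketch: the dataset format and the `answer_question` function are illustrative stand-ins, not any particular benchmark’s actual API, and real benchmarks typically normalize answers and allow multiple references rather than using exact string match.

```python
# Minimal sketch of the standard MRC evaluation loop described above.
# The dataset format and `answer_question` callable are illustrative
# stand-ins, not any specific benchmark's API.

from typing import Callable, Dict, List

Example = Dict[str, str]  # {"passage": ..., "question": ..., "answer": ...}

def evaluate_mrc(dataset: List[Example],
                 answer_question: Callable[[str, str], str]) -> float:
    """Return the fraction of questions the system answers correctly."""
    correct = 0
    for ex in dataset:
        prediction = answer_question(ex["passage"], ex["question"])
        # Real benchmarks usually normalize text and accept several
        # reference answers; exact match keeps the sketch short.
        if prediction.strip().lower() == ex["answer"].strip().lower():
            correct += 1
    return correct / len(dataset)

# Usage (assuming a hypothetical `my_system` with an answer_question method):
# accuracy = evaluate_mrc(dev_set, my_system.answer_question)
# print(f"Comprehension score: {accuracy:.1%}")
```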

On many of these benchmarks, recent natural language processing (NLP) technology has matched or even surpassed human performance. Accordingly, the past few years have seen a flurry of new datasets, each designed to make the task harder. For instance, they might ask questions that hinge on implicit commonsense facts like “People sometimes get headaches when they don’t sleep,” or that require counting how many times something happens.

In the paper, we argue that this difficulty-centric approach misses the mark. It’s like training to become a professional sprinter by “glancing around the gym and adopting any exercises that look hard.” You’ll end up working some relevant muscles, but you probably won’t achieve your ultimate goal.

Similarly, the goal of MRC isn’t to answer some arbitrary collection of tricky questions; it’s for systems to do something with what they’ve read. Ultimately, we want automated research assistants that synthesize information from many documents; game characters that act on what we tell them; legal assistance tools that understand our predicaments and help evaluate options; and so on. Today’s MRC research is effectively a training program for the longer-term goal of such sophisticated applications—and the dominant paradigm fails to keep its eye on the prize.

Measuring comprehension with a Template of Understanding

Instead of focusing on difficulty, we propose starting by establishing what content—what information expressed, implied, or relied on by a passage—machines need to comprehend for a given purpose. Then, MRC tests can ask questions that systematically evaluate systems’ progress toward that longer-term goal.

We demonstrate our approach with a “template of understanding” (ToU) for stories, which (as we argue in the paper) are a particularly fruitful genre for downstream applications. Specifically, we propose that story-oriented applications will need the answers to at least the following four questions:

  • Spatial: Where is everything located and how is it positioned throughout the story?
  • Temporal: What events occur and when?
  • Causal: How do events lead mechanistically to other events?
  • Motivational: Why do the characters decide to take the actions they take?

The paper goes into much more detail on where these questions come from, how they can be instantiated for any given story, what reasonable answers might look like, and how systems might be evaluated on their answers. In short, we show how the ToU for stories can serve as the foundation of a more thorough and systematic MRC evaluation.
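To make the template a bit more concrete, here is one minimal way a filled-in ToU for a single story might be represented in code. The class names, field names, and the example story are illustrative inventions for this sketch; the paper describes ToU content and evaluation in prose, not as a data structure.

```python
# Illustrative sketch of a filled-in Template of Understanding for one story.
# The classes, field names, and example story are invented for this sketch;
# the paper does not prescribe a particular data structure.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ToUEntry:
    question: str  # an instantiated ToU question about the story
    answer: str    # an answer a comprehending reader could reasonably give

@dataclass
class TemplateOfUnderstanding:
    spatial: List[ToUEntry] = field(default_factory=list)       # where things are and how they're positioned
    temporal: List[ToUEntry] = field(default_factory=list)      # what events occur, and when
    causal: List[ToUEntry] = field(default_factory=list)        # how events lead mechanistically to others
    motivational: List[ToUEntry] = field(default_factory=list)  # why characters act as they do

# A hypothetical instantiation for a made-up story:
tou = TemplateOfUnderstanding(
    temporal=[ToUEntry("When does Maria find the letter?",
                       "The morning after the storm, before breakfast.")],
    motivational=[ToUEntry("Why does Maria hide the letter?",
                           "She doesn't want her brother to hear the news from anyone but her.")],
)
```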

There’s still a long way to go before computers can reliably answer the ToU questions, as anyone familiar with MRC systems would probably agree. But to provide some concrete evidence, we created a simplified multiple-choice MRC test by fleshing out answers to the ToU questions for a few stories from an existing MRC dataset. When we put the test to a near-state-of-the-art system, it could correctly answer just 20% of the questions.

Building and evaluating MRC systems that perform better is going to take a lot of work from across the AI/NLP community and beyond. If you’d like to join us in hashing out these ideas, you can read the paper now on arXiv, chat with us on Twitter, and/or come to our (virtual!) talk at ACL in July. We’re looking forward to continuing this line of research, and we hope it stimulates some great conversations!