It’s time for a new way to measure machine intelligence

When Elemental Cognition CEO David Ferrucci started the company, he had a big vision in mind.

“If computers could understand what they read, the impact on humanity would be enormous,” he said. “Machines could help us in all kinds of new ways: reducing our information overload, answering our questions, and even using what they’ve read to help us think through new ideas. That’s why building a machine that can read, understand, and explain its answers is at the core of Elemental Cognition’s scientific mission.”

Getting there will be a monumental challenge, but a crucial first step is defining the problem: what would it even mean for a machine to comprehend, and how would we know it had succeeded?

That question is the subject of our position paper, “To Test Machine Comprehension, Start by Defining Comprehension,” published at the ACL conference in July 2020. The paper critically reviews the answers offered by current research and suggests a higher bar for evaluating machines’ comprehension.

How to evaluate ‘machine reading comprehension’, or MRC, is a question the AI community has been grappling with for decades. Today, systems are typically evaluated on datasets that consist of text passages accompanied by a battery of questions. Comprehension is judged by how many of the questions a system can answer correctly.
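To make that scoring scheme concrete, here is a minimal sketch of question-counting evaluation. It is an illustration only: the `Question` class, the `normalize` and `accuracy` helpers, the toy passages, and the stand-in `toy_predict` system are all hypothetical, and real benchmarks typically use more forgiving answer matching than the exact match assumed here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Question:
    """One benchmark item (hypothetical structure, for illustration)."""
    passage: str
    text: str
    gold_answer: str


def normalize(answer: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not count as errors.
    return " ".join(answer.lower().split())


def accuracy(questions: List[Question],
             predict: Callable[[str, str], str]) -> float:
    """Fraction of questions whose predicted answer matches the gold answer."""
    if not questions:
        return 0.0
    correct = sum(
        normalize(predict(q.passage, q.text)) == normalize(q.gold_answer)
        for q in questions
    )
    return correct / len(questions)


# Toy data and a stand-in "system" (both invented for this sketch).
questions = [
    Question("Ana skipped sleep and woke with a headache.",
             "Why did Ana have a headache?", "she skipped sleep"),
    Question("Ben planted three trees.",
             "How many trees did Ben plant?", "three"),
]


def toy_predict(passage: str, question: str) -> str:
    # Answers counting questions with a fixed guess and nothing else.
    return "Three" if "How many" in question else "unknown"


print(accuracy(questions, toy_predict))  # 0.5
```

Note what the single number hides: the toy system scores 50% without understanding either passage, which is the kind of gap between benchmark scores and comprehension that the rest of this post is concerned with.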

On many of these benchmarks, recent natural language processing (NLP) technology has matched or even surpassed human performance. Accordingly, the past few years have seen a flurry of new datasets, each designed to make the task harder. For instance, they might ask questions that hinge on implicit commonsense facts like “People sometimes get headaches when they don’t sleep,” or questions that require counting how many times something happens.

In the paper, we argue that this difficulty-centric approach misses the mark. It’s like training to become a professional sprinter by “glancing around the gym and adopting any exercises that look hard.” You’ll end up working some relevant muscles, but you probably won’t achieve your ultimate goal.

Similarly, the goal of MRC isn’t to answer some arbitrary collection of tricky questions. It’s for systems to do something with what they’ve read. Ultimately, we want automated research assistants that synthesize information from many documents; game characters that act on what we tell them; legal assistance tools that understand our predicaments and help evaluate options; and so on. Today’s MRC research is effectively a training program for the longer-term goal of such sophisticated applications—and the dominant paradigm fails to keep its eye on the prize.

Instead of focusing on difficulty, we propose starting by establishing what content—what information expressed, implied, or relied on by a passage—machines need to comprehend for a given purpose. Then, MRC tests can ask questions that systematically evaluate systems’ progress toward that longer-term goal.

We demonstrate our approach with a “template of understanding” (ToU) for stories, which (as we argue in the paper) are a particularly fruitful genre for downstream applications.

The paper goes into much more detail on what questions are included in the ToU, how they can be instantiated for any given story, what reasonable answers might look like, and how systems might be evaluated on their answers. In short, we show how the ToU for stories can serve as the foundation of a more thorough and systematic MRC evaluation.

There’s still a long way to go before computers can reliably answer the ToU questions, as anyone familiar with MRC systems would probably agree. But to provide some concrete evidence, we created a simplified multiple-choice MRC test by fleshing out answers to the ToU questions for a few stories from an existing MRC dataset. When we put the test to a near-state-of-the-art system, it could correctly answer just 20% of the questions.

Building and evaluating MRC systems that perform better is going to take a lot of work from across the AI/NLP community and beyond.

If you’d like to join us in hashing out these ideas, you can read the paper now on arXiv or chat with us on Twitter.

Benjamin Gilbert, EVP of Marketing, Elemental Cognition

