Take the recent question-answering system called Aristo. When it came out, early articles proclaimed that the system was “ready for high school science, maybe even college” and called it “as smart as/smarter than an eighth-grader.”
Aristo got there by “reading” millions of background documents. But it performs well only at one specific task: answering multiple-choice questions from standardized science tests. If you ask it to do anything outside that domain (for example, to explain its reasoning or even to answer without multiple-choice options), it will fail miserably. As more sober voices soon pointed out, whatever Aristo got from its reading is a far cry from human understanding.
How big is that gap? Most people working on natural language processing, including Aristo’s creators, would readily admit it’s a gaping chasm. But as someone working to engineer a more capable system, I want to emphasize just how distant Aristo and its cousins are from what the coverage sometimes suggests and, more importantly, from what we want out of language understanding AI. Let’s consider a few examples to see how meager AI’s understanding really is.
Mastering Wikipedia? Not so fast
Consider the following passage condensed from Wikipedia:
The Standing Liberty quarter is a 25-cent coin that was struck by the United States Mint from 1916 to 1930. It succeeded the Barber quarter, which had been minted since 1892. The coin was designed by sculptor Hermon Atkins MacNeil.
In 1915, Director of the Mint Robert W. Woolley set in motion efforts to replace the Barber dime, quarter, and half dollar, as he mistakenly believed that the law required new designs. MacNeil submitted a militaristic design that showed Liberty on guard against attacks.
In late 1916, Mint officials made major changes to the design without consulting MacNeil. The sculptor complained after receiving the new issue in January 1917. The Mint obtained special legislation to allow MacNeil to redesign the coin as he desired.
This is exactly the sort of text that most question-answering systems are trained on. So it’s no surprise that Google’s “BERT” system* can respond to questions about it by highlighting the answers:
If a human gave answers like these, you’d assume they understood quite a lot: that the Mint produced Barber quarters, then Standing Liberty quarters; that MacNeil objected to the Mint’s design procedure; and so on. BERT also seems to understand these facts — so much so that similar models led to breathless headlines like, “Computers are getting better than humans at reading.”
Really? Let’s probe deeper by tweaking the questions:
Yikes! Somehow BERT can’t distinguish which coin was retired when. It also attributes a second event involving Woolley to the same cause as before, and its explanation for MacNeil’s displeasure lurches from one irrelevant fact to another based on the phrasing of the question. Of course, getting it to reason about multiple facts together seems like a lost cause.
Even though BERT gets many questions right, mistakes like the ones I’ve highlighted are revealing: they prompt us to question whether the machine really understands what it is reading. Imagine if a child aced the questions that BERT does, then brought forth this set of ridiculous errors. We’d find their responses not just disappointing, but puzzling! We’d have to conclude that their successes hadn’t relied on much understanding after all.
Instead, the strategy revealed by the mistakes is more like that of a student who skipped class and is now trying to compensate by skimming for words from the question. BERT saw Woolley in a “why” question, so it assumed the answer was whatever followed “as” in a sentence about Woolley. It found a sentence about something being minted since 1892, so the answer must be the nearby Barber quarter. The system is doing linguistic pattern-matching: powerful, but superficial.
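To make the skimming strategy concrete, here is a deliberately crude sketch of it in Python. To be clear, this is not how BERT works internally; it is a hand-written caricature, and the function names, the second question, and the "whatever follows 'as'" rule are my own illustrative assumptions. It picks the sentence that shares the most words with the question and, for a "why" question, returns whatever follows "as" in that sentence.

```python
import re

# Two sentences from the condensed Wikipedia passage above.
SENTENCES = [
    "In 1915, Director of the Mint Robert W. Woolley set in motion efforts "
    "to replace the Barber dime, quarter, and half dollar, as he mistakenly "
    "believed that the law required new designs.",
    "MacNeil submitted a militaristic design that showed Liberty on guard "
    "against attacks.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def skim_answer(question, sentences):
    """Toy heuristic (an assumption for illustration, not BERT's mechanism).

    1. Pick the sentence sharing the most words with the question.
    2. For a "why" question, return whatever follows ", as " in it.
    3. Otherwise, return the whole sentence.
    """
    question_words = tokenize(question)
    best = max(sentences, key=lambda s: len(question_words & tokenize(s)))
    if question.lower().startswith("why") and ", as " in best:
        return best.split(", as ", 1)[1].rstrip(".")
    return best


# A question the passage really answers.
q1 = "Why did Woolley set in motion efforts to replace the Barber quarter?"
# A hypothetical question about an event the passage never mentions.
q2 = "Why did Woolley visit the Mint?"

print(skim_answer(q1, SENTENCES))
print(skim_answer(q2, SENTENCES))
```

Both calls print the same clause, "he mistakenly believed that the law required new designs". The heuristic looks right on the first question and confidently wrong on the second, because nothing in it represents what actually happened or why. Real models match patterns far more flexibly than this, but the failure has the same flavor.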
Back to first grade
You might be tempted to defend BERT on the grounds that, like a confused six-year-old taking an eighth-grade science test, it just finds the Woolley/MacNeil passage too complex. The passage involves employment, minting, even legislative procedure. Maybe BERT would do better on a simple first-grade story about a different sort of mint:
Fernando and Zoey went to a plant sale. They bought mint plants. Fernando put his plant near a sunny window. Zoey put her plant in her bedroom. Fernando’s plant looked green and healthy after a few days. But Zoey’s plant had some brown leaves. “Your plant needs more light,” Fernando said. Zoey moved her plant to a sunny window. Soon, both plants looked green and healthy! [adapted from ReadWorks.org]
Again, BERT does correctly answer many questions, including where Fernando and Zoey buy mint plants (plant sale), what Zoey placed in her home (plant), and where her plant was before she moved it (her bedroom). But other answers point to a wildly inconsistent interpretation of the world of the story. A sampling of BERT’s errors:
Even when the answers have the right form, they don’t make sense. How could Zoey’s plant have already been in her bedroom when she bought it? How could brown leaves, an effect of too little light, have caused too little light? The system has clearly failed to conceptualize the relationships between the entities and events in the passage, the questions, and the world at large.
Are we on the road to machines that understand, or to super-parrots?
BERT, Aristo, GPT-2, XLNet…all these models are made of the same stuff. They’re phenomenal at learning the form of language, and when the task is heavily constrained, that can be enough to game the current AI metrics for reading comprehension.
But as their mistakes show, these systems have only the barest shadow of an understanding of the events and relationships in the text. As I’ll discuss next time, that’s a problem for the entire field. If this is the state of the art, the art desperately needs a new target state — a new standard for machines that understand.
At Elemental Cognition, we’re aiming higher. Stay tuned for my next few posts on how we think we can get there.
* For the technically curious, these answers were generated using the standard BERT model fine-tuned on the SQuAD 1.1 dataset. Depending on how you measure, the model’s performance is either a few points behind or nearly a point ahead of the best models on the SQuAD leaderboard as of November 2019.