To realize their full potential, language models must be augmented with logical reasoning.

As the recent release of ChatGPT demonstrates, the evolution of language models over the past few years has been a marvel to witness. They can respond creatively and fluently to just about any prompt.

How do they do it? They find statistical patterns in large volumes of text, then extrapolate what comes next in new contexts. After a question comes an answer. After a problem comes a solution. There are technical details, but that is the core. Amazingly simple!
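
To make that concrete, here is a toy sketch of next-word prediction, boiled down to counting which word tends to follow which in a tiny corpus. Real language models learn far richer patterns with neural networks, but the objective is the same: extend the text with a likely continuation.

```python
# Toy illustration of "predict what comes next": count which word follows which
# in a small corpus, then extend a prompt with the most frequent continuation.
# Real models do this with neural networks over vast corpora; the objective is
# the same: predict the next token.
from collections import Counter, defaultdict

corpus = "after a question comes an answer . after a problem comes a solution .".split()

follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def continue_text(prompt: str, steps: int = 4) -> str:
    words = prompt.split()
    for _ in range(steps):
        counts = follows.get(words[-1])
        if not counts:
            break
        words.append(counts.most_common(1)[0][0])  # most likely next word
    return " ".join(words)

print(continue_text("after a question"))  # -> "after a question comes an answer ."
```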

You will see debates about whether a model that predicts “what comes next” could ever “truly understand” language. This is a philosophical question. At Elemental Cognition, we are more interested in a practical one: what can such models be relied upon to do?

We continue to believe that such a model’s usability depends on the strength of two characteristics: fluency and correctness. Our view is that while statistical language models can provide a fluent interface, they are not reliable substitutes for correct, logical reasoning. And without correct logical reasoning, we cannot trust them to make mission-critical decisions.

You will see discussion about how these systems are opaque, and how the patterns that power them are encoded as uninterpretable parameters in a neural network. This is true, and it makes the systems hard to trust: we don’t know what they “think”, so we can’t predict what they will do. But if that seems a bit abstract, there is also a more practical problem: the things language models say are often just wrong.

It’s always a bit of a downer to ring this bell, but it’s more important now than ever. Precisely because these models are increasingly fluent, it is increasingly hard to tell when they veer into nonsense. Elemental Cognition was founded in 2015 to create a single AI architecture that combines the ability to do precise logical reasoning with the ability to use language fluently. This is not easy to do well, but it is clearly necessary. Look no further than ChatGPT for examples of why.

The popular Q&A website Stack Overflow had to ban answers generated by ChatGPT, and their explanation is illuminating (emphasis in original):

The primary problem is that while the answers which ChatGPT produces have a high rate of being incorrect, they typically look like they might be good […] The volume of these answers (thousands) and the fact that the answers often require a detailed read by someone with at least some subject matter expertise in order to determine that the answer is actually bad has effectively swamped our volunteer-based quality curation infrastructure.

Or take this concrete example. The text that ChatGPT has been trained on includes many facts about US geography: travelogues, routing instructions, and incidental mentions of taking a highway from one state to another. Has its pattern-matching made it good enough to give directions? Let’s see.

This sounds pretty good, especially if you aren’t familiar with the map. You might even trust it! That would be a mistake. I-80 goes through northern Nevada, but it is nowhere near Las Vegas. I-94 does not pass through South Dakota at all. It does pass through Wisconsin but does not come within 100 miles of Green Bay. I-80 and I-94 don’t even intersect until east of Green Bay. What you need is a GPS system (or roadmap expert) designed to give directions accurately, not a language model designed to produce something that sounds like directions.
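
For contrast, here is a minimal sketch of what “designed to give directions accurately” looks like: shortest-path search over an explicit road graph. The cities and mileages below are rough placeholders rather than real routing data, but whatever map you feed it, Dijkstra’s algorithm provably returns the shortest route on that map.

```python
# Minimal route finder: Dijkstra's shortest-path search over an explicit graph.
# The cities and mileages are illustrative placeholders, not real highway data;
# the point is that the answer follows provably from the map, rather than from
# patterns in text that merely sound like directions.
import heapq

roads = {  # city -> {neighbor: approximate miles}
    "Las Vegas":      {"Salt Lake City": 420, "Denver": 750},
    "Salt Lake City": {"Las Vegas": 420, "Omaha": 930},
    "Denver":         {"Las Vegas": 750, "Omaha": 540},
    "Omaha":          {"Salt Lake City": 930, "Denver": 540, "Chicago": 470},
    "Chicago":        {"Omaha": 470, "Green Bay": 210},
    "Green Bay":      {"Chicago": 210},
}

def shortest_route(start: str, goal: str):
    queue = [(0, start, [start])]   # (total miles, current city, path so far)
    best = {start: 0}
    while queue:
        dist, city, path = heapq.heappop(queue)
        if city == goal:
            return dist, path
        for nxt, miles in roads[city].items():
            new_dist = dist + miles
            if new_dist < best.get(nxt, float("inf")):
                best[nxt] = new_dist
                heapq.heappush(queue, (new_dist, nxt, path + [nxt]))
    return None  # goal not reachable on this map

print(shortest_route("Las Vegas", "Green Bay"))
```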

Here is an even simpler example, just to make the point.

The correct answer is 4,943,492,465 – about 10x ChatGPT’s answer. Its statistical patterns know that an arithmetic problem is usually followed by a number, and it has a rough idea of what that number looks like. Just don’t bank on it being exactly (or approximately) correct.
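
The fix for arithmetic is even simpler: compute the answer instead of predicting it. The operands below are stand-ins (the original prompt is not reproduced here), but a few lines of Python always return the exact product.

```python
# Exact, arbitrary-precision integer arithmetic: the result is computed, not
# predicted, so it is never just a "roughly right-looking" number.
# (The operands are stand-ins; the prompt from the ChatGPT example above is
# not reproduced here.)
a = 123_456
b = 789_012
print(f"{a} * {b} = {a * b}")
```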

Earlier, we said that language models learn from patterns in text: after a question comes an answer. Now we clearly see the problem: the language models have only managed to learn that after something that reads like a question comes something that reads like an answer. Sometimes it’s the right answer, sometimes it isn’t. The language model doesn’t know the difference.

When it comes to solving problems that require correctness, logic, and precision, language models fail. When it comes to tasks where moderate accuracy is good enough, or where a human in the loop verifies the output, language models can be a big help. That is how we use them at EC. However, they are only part of the puzzle.

At EC, we pursue a hybrid AI approach. We evaluate and combine the best of language models with the best of logical reasoning to deliver solutions that are both fluent and correct. With large language models rapidly becoming commodities, the real magic is in the combination: human-like language fluency paired with the provable correctness that formal reasoning has to offer.
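
As a purely schematic sketch of what such a combination can look like (not a description of EC’s actual architecture), imagine a pipeline in which a language model translates a free-text question into a formal task, a deterministic solver computes the answer, and the language model verbalizes the verified result. The LLM calls below are stubbed out with simple string handling so the sketch stays self-contained.

```python
# Schematic hybrid pipeline: language model at the fluent-language ends, a
# deterministic reasoner in the middle. This is an illustrative sketch only,
# not Elemental Cognition's actual architecture.

def parse_with_llm(question: str) -> tuple[int, int]:
    """Stand-in for an LLM call that extracts a formal task from free text.
    Faked here with string handling to keep the example runnable."""
    a, b = question.lower().replace("what is", "").replace("?", "").split("x")
    return int(a), int(b)

def solve_exactly(a: int, b: int) -> int:
    """The reasoning step: exact, verifiable computation instead of prediction."""
    return a * b

def verbalize_with_llm(a: int, b: int, result: int) -> str:
    """Stand-in for an LLM call that turns the verified result back into prose."""
    return f"{a} multiplied by {b} is exactly {result}."

question = "What is 123456 x 789012?"
a, b = parse_with_llm(question)
print(verbalize_with_llm(a, b, solve_exactly(a, b)))
```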