How did AI get so much smarter?

(This month’s photo is a picture of a brown bat. It’s small and fluffy with a stubby nose, and it is clinging to the gray bark of a tree. Photo by N. J. Stewart wildlife, unmodified and used under the Creative Commons license.)

When I write about intelligence, I tend to downplay AI and Deep Learning. These are powerful problem solving tools, but they’re over-hyped, and they don’t “think” the way people do. They have no memory, no sense of self, and no goals, at least in the usual sense of the words. But large language models (LLMs) like OpenAI’s GPT are shockingly good at generating text that seems like something a person might make. They’re much more human-like than anything that came before. Why is that? The short answer is that they use a new kind of Deep Learning architecture known as a Transformer, which introduced a few small tricks that make a very big difference.

The first thing to note is that, while lots of people argue about whether LLMs can answer questions, reason, solve problems, brainstorm, or make art, what they really do is text prediction. They take some words as a starting point and then they guess what comes next based on their training data. If LLMs have any deeper cognitive abilities than that, they must be somehow tapping into the human cultural intelligence that is embedded within that text. Or, maybe they’re just parroting back fragments of intelligent things other people have said, without any understanding or integration—mindless idiots, randomly stringing words together in ways that sound just smart enough to distract us. We honestly don’t know yet! But whatever intelligence they possess, it exists entirely in the realm of language.
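
To make “text prediction” concrete, here is a deliberately tiny sketch in Python. It is not how an LLM works internally; it only counts which word most often follows another in a made-up training text, then guesses accordingly. The function names and the toy corpus are my own inventions for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical toy corpus; a real model trains on vastly more text.
corpus = "the bat flew away . the bat flew home . the cat sat down ."
model = train_bigrams(corpus)
print(predict_next(model, "bat"))  # flew
```

An LLM replaces these raw counts with learned vectors and looks at far more than one preceding word, but the task is the same: given what came before, guess what comes next.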

Research into getting computers to understand text and speech (known as Natural Language Processing, or NLP) started back in the 1950s. Back then, computers were specialists’ tools, and making one that anyone could use just by telling it what to do was a dream. At first, researchers tried to formally describe language as we use it, feeding computers dictionaries, grammar rules, and lists of facts, but this never worked! It turns out, we don’t explicitly know all the rules of human language that we intuitively follow, and they’re usually fuzzy rules, with lots of conditions and exceptions. The key challenge of NLP was getting computers (which are obsessively logical and precise) to deal with this messiness and ambiguity, which we don’t even fully understand ourselves. Perhaps the most important advance was when researchers gave up trying to explain language to computers, and instead started teaching them by example.

Modern NLP represents words as lists of numbers called “vectors.” Like an (X, Y) coordinate, each vector represents a point in space. Not physical space, though, more like an abstract space of concepts. Maybe nouns go to the right, verbs to the left. Natural concepts are up, man-made concepts are down. Except, instead of two dimensions, maybe there are 10,000 of them. The layout of this space is pretty arbitrary. The absolute position of a word doesn’t mean anything, only where it is relative to other words. Nearby words have similar meanings, and relationships between words are represented by the distance and angle between them. This is all weirdly self-referential. Words are only defined in terms of other words! But it works surprisingly well. You don’t need explicit rules about which words go together and how; you can just look at lots of examples and infer those relationships with statistics. People talk about “training” an AI by having it “read” lots of text, but really all that means is iteratively tweaking the lists of numbers, slowly moving the words through this abstract meaning space until they settle into positions that reflect how they co-occur in the training text.
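
Here is a minimal sketch of what “nearby words have similar meanings” looks like in code. The three-number vectors below are hand-made for illustration (real ones are learned from text and are much longer), and cosine similarity is just one common way to measure the angle between two vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-made vectors; a trained model would learn these numbers.
vectors = {
    "bat":    [0.9, 0.8, 0.1],
    "bird":   [0.8, 0.9, 0.2],
    "hammer": [0.1, 0.2, 0.9],
}

def nearest(word):
    """Find the other word whose vector points in the most similar direction."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(nearest("bat"))  # bird
```

Notice that nothing here says what a bat is. The only “meaning” in the system is that the numbers for “bat” sit closer to the numbers for “bird” than to the numbers for “hammer.”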

There’s one big problem with representing words as vectors, though: ambiguity. What do you do with a word like “bat,” which has several meanings? There’s no way one vector can represent this. The trick is to look for context. When you see a phrase like “brown bat” or “wooden bat,” the meaning is clear. Instead of thinking of these as pairs of words, you might think of them as compound words, each with its own distinct meaning. This is a powerful idea, but hard to generalize. Take a more difficult example: “Hearing a strange flutter and crash in the dark, he grabbed his bat for defense and went to investigate.” Which kind of “bat” are we talking about? Words like “flutter” and “dark” might suggest the animal, but “grabbing” a bat for “defense” suggests the object instead. We need context to disambiguate, but which context? We’d like to ignore the first half of the sentence (which isn’t talking about the “bat”) and focus on the second half of the sentence (which is).

NLP researchers have found elegant ways to solve this problem. They call these techniques “attention,” since the model is learning to “pay attention” to some words and not others, but I find that name misleading. For human beings, attention is something very different. We seem to have a “mind’s eye” that we can move about at will. We can choose to pay attention to this or that, our attention gets drawn to salient features, and we may even notice our attention drifting and redirect it. But these AIs have no mind’s eye, no will, and no intuition about relevance. The attention models we’re talking about are just more vector math. In addition to finding vectors to represent the meaning of each word individually, they also find vectors to represent patterns of words. They learn, “in this context, these words together mean that.” Adding an extra layer of complexity lets the model represent how words interact to change the meaning of other words or the sentence as a whole.
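
To show how un-mysterious that vector math is, here is a rough sketch of one common formulation, usually called scaled dot-product attention, applied to the “bat” sentence. Everything specific here is an assumption for illustration: the two-number vectors (first number “animal-ness,” second “tool-ness”) and the query vector are hand-picked, whereas a trained model would learn them.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weigh each context word by how well it matches the query, then blend."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Hypothetical vectors: [animal-ness, tool-ness]
context = {
    "flutter": [1.0, 0.0],
    "dark":    [0.8, 0.0],
    "grabbed": [0.0, 1.0],
    "defense": [0.0, 1.0],
}
query = [0.5, 2.0]  # hand-picked stand-in for what "bat" looks for in its context

keys = list(context.values())
weights, blended = attend(query, keys, keys)  # the same vectors serve as keys and values
for word, weight in zip(context, weights):
    print(f"{word}: {weight:.2f}")
print("blended context:", [round(x, 2) for x in blended])
```

With these made-up numbers, “grabbed” and “defense” get more weight than “flutter” and “dark,” so the blended context leans toward the tool meaning. There is no choosing going on, only multiplication and addition.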

Researchers have explored many variations on this attention trick. Transformer models use an advanced kind of attention that represents context bi-directionally. They model how different words tend to get modified by context, and how different contexts tend to modify nearby words. The benefit of this is that such a model doesn’t just learn that “brown bat” is the name of an animal, but it might learn that “brown” is an adjective that applies to physical objects, that in English adjectives tend to modify the noun that follows them, and that “bat” can refer to one of several animal species, sometimes distinguished by color. That is, rather than modeling some particular context, models like this can learn general rules and relationships between different kinds of words. They can learn grammar. Not just the “official” grammar of a language like English, but any system of relationships and interactions between words, including dialects, domain-specific jargon, storytelling tropes, or the gender roles of a society.

The other trick that makes Transformers better with language is pluralism. Some NLP systems represent more complex meanings by using bigger vectors. More numbers in each vector means a larger conceptual space. Instead, Transformers use more vectors. They don’t learn the one meaning of this word, they learn to represent the many meanings of this word in the many contexts that contain it. This works a bit like voting. When processing a sentence, several different “attention heads” each consider one possible interpretation of a word, attending to different patterns of contextual cues. The overall meaning is determined by adding them all together. This is really useful for weighing subtle cues against each other to resolve ambiguity, but also to represent sentences with multiple layers of meaning. A word can have many meanings at the same time, and the many meanings of all the words in a sentence can interact in complex ways. The fancy kind of attention used in Transformers can automatically discover this sort of layered structure in language.
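
As a rough sketch of the “more vectors” idea, the code below splits each word’s vector into slices and lets each attention head look at only its own slice. This is a simplification under my own assumptions: real Transformers give each head its own learned projections of the full vector, and combine the heads by joining their outputs and passing them through one more learned layer, which works out to a learned weighted sum. The vectors are made up.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head attention: blend the values, weighted by query-key match."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

def multi_head_attend(query, keys, values, num_heads):
    """Each head attends over its own slice of every vector; outputs are joined."""
    dim = len(query)
    if dim % num_heads != 0:
        raise ValueError("vector size must divide evenly among the heads")
    size = dim // num_heads
    output = []
    for head in range(num_heads):
        lo, hi = head * size, (head + 1) * size
        head_out = attend(query[lo:hi],
                          [k[lo:hi] for k in keys],
                          [v[lo:hi] for v in values])
        output.extend(head_out)  # simplified: no learned mixing layer here
    return output

# Hypothetical 4-number vectors for three context words
words = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 1.0],
         [0.0, 1.0, 1.0, 0.0]]
query = [1.0, 0.0, 0.0, 1.0]
print(multi_head_attend(query, words, words, num_heads=2))
```

Because each head computes its own set of weights, one head can settle on one set of contextual cues while another settles on a different set, and the final result carries both.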

As clever as these attention methods are, they are not the secret to Transformers’ success. They do greatly improve the richness of NLP models, but at first they were mostly used with “recurrent neural networks,” a kind of Deep Learning model that processes data sequentially. That’s probably because they work a bit like how we imagine a human reader does: they “read” each word in a text, one at a time, using attention to figure out how each new word should update the meaning of the text so far. This works pretty well, but it doesn’t scale up to long passages of text. These models have a limited attention span, eventually forgetting important details they read several sentences ago. Also, processing long texts one word at a time is painfully slow. Even on the world’s fastest computer, reading a book from beginning to end takes time per page, and training a model like this takes vast amounts of text, so this was a major limitation.

The paper that first introduced Transformers was called “Attention Is All You Need,” which highlights the key innovation: they got rid of the recurrent network, and built an AI using just this attention mechanism, all on its own. In other words, they found a way to do the same vector math, but solving for a large block of text all at once (and possibly out of order) rather than word-by-word. This doesn’t make the model “smarter.” It doesn’t even reduce the overall amount of number crunching. It just makes the work more parallelizable. Instead of having one computer read War and Peace from cover to cover, they could have many computers each read a few paragraphs, then combine their results. This made it possible to throw more money at the problem, using whole datacenters of computers to train a language model on vastly more text than ever before. Billions of documents, trillions of words. It’s the sheer volume of training data that made LLMs so much better. That’s why they’re called “large” language models.
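
The difference is easier to see in a sketch. In the recurrent-style loop below, each step needs the result of the step before it, so the work has to happen in order. In the attention-style version, the result for each position depends only on the raw input, so the positions can be handed out to separate workers. Both functions are toy stand-ins of my own, not real model code, and the thread pool is only there to show that the pieces are independent; real systems get their speed from specialized hardware, not Python threads.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def recurrent_pass(vectors):
    """Each step mixes the new word into a running state, so order matters."""
    state = [0.0] * len(vectors[0])
    states = []
    for vec in vectors:
        state = [0.5 * s + 0.5 * x for s, x in zip(state, vec)]  # needs previous state
        states.append(state)
    return states

def attention_at(position, vectors):
    """Attention output for one position, computed from the raw inputs alone."""
    query = vectors[position]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(len(query))]

def attention_pass(vectors, workers=4):
    """Hand each position to a worker; no position waits on another's result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: attention_at(i, vectors), range(len(vectors))))

# Hypothetical 2-number vectors for a four-word text
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
print(attention_pass(text))
```

Computing the positions in reverse order, or on different machines, gives exactly the same answer for the attention version. That is the whole trick: same arithmetic, but nothing forces it to happen one word at a time.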

So, how should we think about LLMs like GPT? Well, first off, human language is irregular and complex, but it’s also highly structured. Cleverly designed statistical learning tools can automatically discover that hidden structure just by processing obscene amounts of text. Neural networks are great for letting computers work with these sorts of fuzzy rules. They can extract meaning from text, manipulate it, and generate new text. But to an LLM, words are just vectors, defined by their relationships to each other. They have no connection to physical reality, because LLMs have no physical existence. There is no communication going on when you have a “conversation” with an LLM. To the AI, a dialog is just a sequence of vectors that follow one another according to some grammar. The AI has no mind, no intentions, and no meaning it wishes to convey. It has no conception of being truthful or helpful, only what words tend to follow certain questions. It does not learn from a conversation; it just re-reads the full chat history each time it makes a response. It appears to be a good conversational partner, because it is made to imitate one, but what’s happening behind the screen isn’t “thinking” as we know it.

Still, LLMs really are much more human-like than any other AI that came before. Representing language with a high-dimensional abstract concept space works surprisingly well, and so do the “attention” methods described above. They let us represent a huge, open-ended space of ideas that can build on and interact with each other. They let us represent ambiguity, nuance, and innuendo. So, maybe those vector math tricks could actually teach us something about how language processing works in the brain? On the other hand, LLMs are also remarkable in how different they are from humans. An LLM can learn English, but only by reading every document on the internet, not one word at a time, but all at once. In contrast, babies learn language by interacting with the world, learning how words relate to objects, people, events, actions, and desires. Even though they’re exposed to far less language, they learn much faster, and in a way that tightly integrates all of their senses, relationships, and the lifestyle they were born into. Since LLMs seem so human-like, it’s very tempting to imagine them with the same kind of awareness, purpose, and empathy that we have, but they simply aren’t there. Those are a product of being alive in the world, and can’t be found in text, no matter how much of it.