LLM Terminology: Every Term You Need to Know About Large Language Models
Complete guide to large language model terminology. From architecture details to training methods and deployment.
Architecture Terms
Transformer: The neural network architecture behind virtually all modern LLMs, built on self-attention mechanisms. Introduced in 2017, transformers process input tokens in parallel rather than sequentially.
Decoder-Only: The architecture used by GPT, Claude, and Llama. Only the decoder portion of the transformer is used, trained with causal (left-to-right) attention, so each token can attend only to previous tokens.
Encoder-Decoder: The original transformer design, used by T5 and BART. An encoder processes the full input, then a decoder generates the output. Common for translation and summarization.
Multi-Head Attention: Running multiple attention computations in parallel, each learning different relationship patterns. GPT-3, for example, uses 96 attention heads per layer.
Feed-Forward Network (FFN): The dense layers between attention layers in a transformer. FFNs process each token independently and typically contain most of the model's parameters.
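The causal attention described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration (the helper name causal_attention is ours, not from any library); production implementations are batched, multi-headed, and fused.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: arrays of shape (seq_len, d_k). Each position's output is a
    weighted average of value vectors at positions <= it (decoder-only style).
    """
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)            # (seq_len, seq_len) similarities
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[future] = -np.inf                   # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v
```

Because of the mask, the first position can only attend to itself, so its output is exactly its own value vector; multi-head attention runs several such computations side by side on sliced projections.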
Training Concepts
Pre-training: The initial training phase where a model learns language patterns from massive text corpora (trillions of tokens). Pre-training teaches grammar, facts, reasoning, and coding abilities.
Post-training: All training that happens after pre-training, including supervised fine-tuning (SFT), RLHF, and DPO. Post-training aligns models with human preferences and improves instruction-following.
Supervised Fine-Tuning (SFT): Training on curated examples of ideal input-output pairs. SFT teaches models to follow instructions and produce helpful responses.
RLHF (Reinforcement Learning from Human Feedback): Training a reward model on human preference rankings, then optimizing the LLM against that reward model with reinforcement learning.
DPO (Direct Preference Optimization): A simpler alternative to RLHF that trains directly on pairs of preferred vs. rejected responses, without needing a separate reward model.
Curriculum Learning: Training on progressively harder examples, mimicking how humans learn from simple to complex.
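The DPO objective can be written down compactly. Below is a minimal per-example sketch of the standard DPO loss (the function name and scalar inputs are our simplification; real implementations work on batched token-level log-probabilities from the policy and a frozen reference model):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trainable policy and the frozen reference model.
    beta controls how far the policy may drift from the reference.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)): small when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy matches the reference exactly, both margins are zero and the loss sits at -log(0.5); pushing probability toward the preferred response drives it down, which is what replaces the explicit reward model of RLHF.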
Tokenization and Context
Token: The basic unit of text processing. A token can be a word, subword, or character. The average English word is about 1.3 tokens. Models have vocabulary sizes of 32K to 256K tokens.
BPE (Byte-Pair Encoding): The most common tokenization algorithm. BPE iteratively merges the most frequent character pairs to build a vocabulary. Used by GPT, Claude, and most modern LLMs.
Context Window: The maximum number of tokens a model can process. Context windows in 2026 range from 8K (small models) to 2M (Gemini). Longer contexts enable processing of entire documents, but attention compute grows quadratically with sequence length.
KV-Cache: A memory optimization that stores previously computed key-value pairs during autoregressive generation, avoiding redundant computation. KV-cache size grows linearly with sequence length.
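The BPE merge loop described above fits in a short function. This is a toy trainer over whole words (the function name bpe_merges is ours; real tokenizers like GPT's operate on bytes and add many optimizations):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words.

    Each word starts as a tuple of characters; every step merges the most
    frequent adjacent symbol pair across the corpus into one new symbol.
    """
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        merged = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges
```

Running enough merges turns frequent character sequences into single vocabulary entries, which is why common words end up as one token while rare words split into several.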
Inference and Optimization
Autoregressive Generation: The standard method where models generate one token at a time, each conditioned on all previous tokens. This is why LLM generation appears to stream word by word.
Speculative Decoding: An inference optimization where a small draft model generates candidate tokens that a larger model verifies in parallel. This can speed up generation by 2-3x.
Beam Search: A decoding strategy that maintains multiple candidate sequences and selects the highest-probability output. Used in translation and structured generation.
Top-k / Top-p Sampling: Decoding methods that restrict token selection to the most probable candidates. Top-k selects from the k most likely tokens. Top-p (nucleus sampling) selects the smallest set whose cumulative probability exceeds p.
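The two sampling filters are easy to show concretely. A minimal sketch over a plain probability vector (function names are ours; libraries typically apply these to logits before the softmax):

```python
import numpy as np

def top_k_filter(probs, k=50):
    """Zero out all but the k most probable tokens, then renormalize."""
    keep = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p=0.9):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative
    probability reaches p, then renormalize."""
    order = np.argsort(probs)[::-1]            # tokens, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()
```

Note the difference: top-k keeps a fixed number of candidates regardless of how peaked the distribution is, while top-p adapts, keeping few tokens when the model is confident and many when it is uncertain.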
Scaling and Efficiency
Scaling Laws: Mathematical relationships showing how model performance improves with increased parameters, data, and compute. Chinchilla scaling laws suggest compute-optimal training uses roughly 20 tokens per parameter.
Mixture of Experts (MoE): An architecture where input is routed to specialized sub-networks (experts). Only a few experts (often 2-4) activate per token, enabling larger total models at fixed inference cost. Mixtral uses MoE, and GPT-4 is widely reported to as well.
Quantization: Reducing weight precision from FP32 to FP16, INT8, or INT4. 4-bit quantization cuts memory 8x relative to FP32 with minimal quality loss. GGUF and GPTQ are popular quantization formats.
Knowledge Distillation: Training a smaller student model to reproduce a larger teacher model's outputs. Distillation enables deploying powerful capabilities on mobile devices and edge hardware.
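The Chinchilla rule of thumb above can be turned into back-of-the-envelope arithmetic. Combining the common approximation that training compute is C ≈ 6·N·D FLOPs (N parameters, D tokens) with D ≈ 20·N gives C ≈ 120·N², so N ≈ √(C/120). A sketch under those assumptions (the function name is ours):

```python
def chinchilla_optimal(compute_flops):
    """Rough compute-optimal model size and data size for a FLOP budget.

    Assumes training compute C ~ 6*N*D and the Chinchilla heuristic
    D ~ 20*N, which gives C ~ 120*N^2.
    """
    n_params = (compute_flops / 120) ** 0.5   # N = sqrt(C / 120)
    n_tokens = 20 * n_params                  # D = 20 * N
    return n_params, n_tokens
```

For example, a budget of 1.2e20 FLOPs works out to roughly a 1B-parameter model trained on about 20B tokens; real scaling-law fits use measured exponents, so treat this as an order-of-magnitude estimate only.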
Prompting and Interaction
System Prompt: Instructions provided to the model that set its behavior, persona, and constraints. System prompts are processed before user messages and influence all subsequent responses.
Few-Shot Prompting: Providing examples of desired input-output pairs in the prompt. Few-shot learning emerges at scale and enables task adaptation without fine-tuning.
Chain-of-Thought (CoT): Instructing the model to show its reasoning step-by-step before answering. CoT significantly improves performance on math, logic, and complex reasoning tasks.
Tool Use / Function Calling: The ability for models to invoke external tools (APIs, databases, code interpreters) during generation. Tool use extends model capabilities beyond text generation to real-world actions.
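Few-shot prompting is ultimately just string assembly. A minimal sketch of building such a prompt (the function name and the "Input:"/"Output:" labels are illustrative conventions, not a fixed API):

```python
def few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: a task instruction, worked input/output
    examples, then the new query left open for the model to complete."""
    parts = [instruction, ""]
    for inp, out in examples:
        parts += [f"Input: {inp}", f"Output: {out}", ""]
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)
```

Ending the prompt at "Output:" is the key trick: an autoregressive model continues the established pattern, so the demonstrated format steers its answer without any fine-tuning.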
Safety and Alignment
Alignment: Ensuring AI systems act in accordance with human intentions and values. Current alignment techniques include RLHF, constitutional AI, and red-teaming. Alignment remains an open research problem.
Red-Teaming: Systematic adversarial testing to find model vulnerabilities, biases, and harmful outputs. Red teams try to elicit dangerous, biased, or false content through creative prompting.
Jailbreaking: Techniques to bypass model safety guardrails. Common approaches include role-playing prompts, encoded instructions, and multi-step manipulation. Models are continuously hardened against known jailbreaks.
Constitutional AI (CAI): Anthropic's approach where models critique and revise their own outputs based on a set of principles, reducing the need for human feedback in alignment training.
Evaluation Benchmarks
MMLU (Massive Multitask Language Understanding): A benchmark testing knowledge across 57 academic subjects from elementary to professional level. State-of-the-art models score above 90%.
HumanEval: A code generation benchmark with 164 programming problems. Models must generate correct Python functions from docstrings. Top models solve 90%+ of problems.
MATH: A benchmark of 12,500 competition-level mathematics problems across algebra, geometry, calculus, and number theory.
ARC (AI2 Reasoning Challenge): A benchmark testing scientific reasoning with grade-school science questions. ARC-Challenge includes questions that require multi-step reasoning.
Chatbot Arena: A live platform where users compare model outputs head-to-head. Elo-style ratings from Arena are widely regarded as one of the most reliable measures of overall model quality.