LLM Terminology: Every Term You Need to Know About Large Language Models
Complete guide to large language model terminology. From architecture details to training methods and deployment.
Architecture Terms
Transformer: The neural network architecture behind virtually all modern LLMs, built on self-attention mechanisms. Introduced in 2017, transformers process input tokens in parallel rather than sequentially.
Decoder-Only: The architecture used by GPT, Claude, and Llama. Only the decoder portion of the transformer is used, trained with causal (left-to-right) attention, so each token can attend only to previous tokens.
Encoder-Decoder: The original transformer design, used by T5 and BART. An encoder processes the full input, then a decoder generates the output. Common for translation and summarization.
Multi-Head Attention: Running multiple attention computations in parallel, each learning different relationship patterns. GPT-3, for example, uses 96 attention heads per layer.
Feed-Forward Network (FFN): The dense layers between attention layers in a transformer. FFNs process each token independently and typically contain most of the model's parameters.
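The causal attention described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration (the helper name causal_attention is ours, not from any library); production implementations are batched, multi-headed, and fused.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: arrays of shape (seq_len, d_k). Each position's output is a
    weighted average of value vectors at positions <= it (decoder-only style).
    """
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)            # (seq_len, seq_len) similarities
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[future] = -np.inf                   # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v
```

Because of the mask, the first position can only attend to itself, so its output is exactly its own value vector; multi-head attention runs several such computations side by side on sliced projections.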
Training Concepts
Pre-training: The initial training phase where a model learns language patterns from massive text corpora (trillions of tokens). Pre-training teaches grammar, facts, reasoning, and coding abilities.
Post-training: All training that happens after pre-training, including supervised fine-tuning (SFT), RLHF, and DPO. Post-training aligns models with human preferences and improves instruction-following.
Supervised Fine-Tuning (SFT): Training on curated examples of ideal input-output pairs. SFT teaches models to follow instructions and produce helpful responses.
RLHF (Reinforcement Learning from Human Feedback): Training a reward model on human preference rankings, then optimizing the LLM against that reward model with reinforcement learning.
DPO (Direct Preference Optimization): A simpler alternative to RLHF that trains directly on pairs of preferred vs. rejected responses, without needing a separate reward model.
Curriculum Learning: Training on progressively harder examples, mimicking how humans learn from simple to complex.
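The DPO objective can be written down compactly. Below is a minimal per-example sketch of the standard DPO loss (the function name and scalar inputs are our simplification; real implementations work on batched token-level log-probabilities from the policy and a frozen reference model):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trainable policy and the frozen reference model.
    beta controls how far the policy may drift from the reference.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)): small when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy matches the reference exactly, both margins are zero and the loss sits at -log(0.5); pushing probability toward the preferred response drives it down, which is what replaces the explicit reward model of RLHF.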
Tokenization and Context
Token: The basic unit of text processing. A token can be a word, subword, or character. The average English word is about 1.3 tokens. Models have vocabulary sizes of 32K to 256K tokens.
BPE (Byte-Pair Encoding): The most common tokenization algorithm. BPE iteratively merges the most frequent character pairs to build a vocabulary. Used by GPT, Claude, and most modern LLMs.
Context Window: The maximum number of tokens a model can process. Context windows in 2026 range from 8K (small models) to 2M (Gemini). Longer contexts enable processing of entire documents, but attention compute grows quadratically with sequence length.
KV-Cache: A memory optimization that stores previously computed key-value pairs during autoregressive generation, avoiding redundant computation. KV-cache size grows linearly with sequence length.
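The BPE merge loop described above fits in a short function. This is a toy trainer over whole words (the function name bpe_merges is ours; real tokenizers like GPT's operate on bytes and add many optimizations):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words.

    Each word starts as a tuple of characters; every step merges the most
    frequent adjacent symbol pair across the corpus into one new symbol.
    """
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        merged = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges
```

Running enough merges turns frequent character sequences into single vocabulary entries, which is why common words end up as one token while rare words split into several.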
Inference and Optimization
Autoregressive Generation: The standard method where models generate one token at a time, each conditioned on all previous tokens. This is why LLM generation appears to stream word by word.
Speculative Decoding: An inference optimization where a small draft model generates candidate tokens that a larger model verifies in parallel. This can speed up generation by 2-3x.
Beam Search: A decoding strategy that maintains multiple candidate sequences and selects the highest-probability output. Used in translation and structured generation.
Top-k / Top-p Sampling: Decoding methods that restrict token selection to the most probable candidates. Top-k selects from the k most likely tokens. Top-p (nucleus sampling) selects the smallest set whose cumulative probability exceeds p.
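The two sampling filters are easy to show concretely. A minimal sketch over a plain probability vector (function names are ours; libraries typically apply these to logits before the softmax):

```python
import numpy as np

def top_k_filter(probs, k=50):
    """Zero out all but the k most probable tokens, then renormalize."""
    keep = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p=0.9):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative
    probability reaches p, then renormalize."""
    order = np.argsort(probs)[::-1]            # tokens, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()
```

Note the difference: top-k keeps a fixed number of candidates regardless of how peaked the distribution is, while top-p adapts, keeping few tokens when the model is confident and many when it is uncertain.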
Scaling and Efficiency
Scaling Laws: Mathematical relationships showing how model performance improves with increased parameters, data, and compute. Chinchilla scaling laws suggest compute-optimal training uses roughly 20 tokens per parameter.
Mixture of Experts (MoE): An architecture where input is routed to specialized sub-networks (experts). Only a few experts (often 2-4) activate per token, enabling larger total models at fixed inference cost. Mixtral uses MoE, and GPT-4 is widely reported to as well.
Quantization: Reducing weight precision from FP32 to FP16, INT8, or INT4. 4-bit quantization cuts memory 8x relative to FP32 with minimal quality loss. GGUF and GPTQ are popular quantization formats.
Knowledge Distillation: Training a smaller student model to reproduce a larger teacher model's outputs. Distillation enables deploying powerful capabilities on mobile devices and edge hardware.
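The Chinchilla rule of thumb above can be turned into back-of-the-envelope arithmetic. Combining the common approximation that training compute is C ≈ 6·N·D FLOPs (N parameters, D tokens) with D ≈ 20·N gives C ≈ 120·N², so N ≈ √(C/120). A sketch under those assumptions (the function name is ours):

```python
def chinchilla_optimal(compute_flops):
    """Rough compute-optimal model size and data size for a FLOP budget.

    Assumes training compute C ~ 6*N*D and the Chinchilla heuristic
    D ~ 20*N, which gives C ~ 120*N^2.
    """
    n_params = (compute_flops / 120) ** 0.5   # N = sqrt(C / 120)
    n_tokens = 20 * n_params                  # D = 20 * N
    return n_params, n_tokens
```

For example, a budget of 1.2e20 FLOPs works out to roughly a 1B-parameter model trained on about 20B tokens; real scaling-law fits use measured exponents, so treat this as an order-of-magnitude estimate only.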
Prompting and Interaction
System Prompt: Instructions provided to the model that set its behavior, persona, and constraints. System prompts are processed before user messages and influence all subsequent responses.
Few-Shot Prompting: Providing examples of desired input-output pairs in the prompt. Few-shot learning emerges at scale and enables task adaptation without fine-tuning.
Chain-of-Thought (CoT): Instructing the model to show its reasoning step-by-step before answering. CoT significantly improves performance on math, logic, and complex reasoning tasks.
Tool Use / Function Calling: The ability for models to invoke external tools (APIs, databases, code interpreters) during generation. Tool use extends model capabilities beyond text generation to real-world actions.
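Few-shot prompting is ultimately just string assembly. A minimal sketch of building such a prompt (the function name and the "Input:"/"Output:" labels are illustrative conventions, not a fixed API):

```python
def few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: a task instruction, worked input/output
    examples, then the new query left open for the model to complete."""
    parts = [instruction, ""]
    for inp, out in examples:
        parts += [f"Input: {inp}", f"Output: {out}", ""]
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)
```

Ending the prompt at "Output:" is the key trick: an autoregressive model continues the established pattern, so the demonstrated format steers its answer without any fine-tuning.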
Safety and Alignment
Alignment: Ensuring AI systems act in accordance with human intentions and values. Current alignment techniques include RLHF, constitutional AI, and red-teaming. Alignment remains an open research problem.
Red-Teaming: Systematic adversarial testing to find model vulnerabilities, biases, and harmful outputs. Red teams try to elicit dangerous, biased, or false content through creative prompting.
Jailbreaking: Techniques to bypass model safety guardrails. Common approaches include role-playing prompts, encoded instructions, and multi-step manipulation. Models are continuously hardened against known jailbreaks.
Constitutional AI (CAI): Anthropic's approach where models critique and revise their own outputs based on a set of principles, reducing the need for human feedback in alignment training.
Evaluation Benchmarks
MMLU (Massive Multitask Language Understanding): A benchmark testing knowledge across 57 academic subjects from elementary to professional level. State-of-the-art models score above 90%.
HumanEval: A code generation benchmark with 164 programming problems. Models must generate correct Python functions from docstrings. Top models solve 90%+ of problems.
MATH: A benchmark of 12,500 competition-level mathematics problems across algebra, geometry, calculus, and number theory.
ARC (AI2 Reasoning Challenge): A benchmark testing scientific reasoning with grade-school science questions. ARC-Challenge includes questions that require multi-step reasoning.
Chatbot Arena: A live platform where users compare model outputs head-to-head. Elo-style ratings from Arena are widely regarded as one of the most reliable measures of overall model quality.