
AI Glossary: 100+ Artificial Intelligence Terms Defined

Comprehensive glossary of artificial intelligence terms from agents to zero-shot learning. Updated for 2026.

A — Agents to Attention

AI Agent: An autonomous system that perceives its environment, makes decisions, and takes actions to achieve goals. Modern AI agents can browse the web, write code, and interact with APIs. Examples include Claude Computer Use, AutoGPT, and Devin.

Alignment: The challenge of ensuring AI systems act in accordance with human values and intentions. Misalignment occurs when an AI optimizes for unintended objectives.

Attention Mechanism: A neural network component that allows models to focus on relevant parts of the input. Self-attention is the core of the transformer architectures used in GPT, Claude, and Gemini. The attention mechanism computes weighted relationships between all tokens in a sequence, enabling long-range dependencies.
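The weighted relationships described above can be sketched in a few lines of NumPy. This is a minimal, single-head version of scaled dot-product attention with toy vectors, not a production implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Score every token against every other token, softmax the scores,
    then return a weighted mix of the value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)             # row-wise softmax
    return w @ V                                      # weighted sum of values

# Three toy 4-dimensional token vectors
X = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0]])
out = scaled_dot_product_attention(X, X, X)  # self-attention: Q = K = V
print(out.shape)  # (3, 4): one mixed vector per input token
```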

B — Backpropagation to BERT

Backpropagation: The algorithm used to train neural networks by computing gradients of the loss function with respect to each weight. Gradients flow backward through the network via the chain rule, enabling weight updates through gradient descent.

Batch Size: The number of training examples processed before updating model weights. Larger batches provide more stable gradients but require more memory. Typical batch sizes range from 32 to 4096.

Benchmark: A standardized test used to evaluate AI model performance. Common benchmarks include MMLU (knowledge), HumanEval (coding), MATH (reasoning), and ARC (abstract reasoning).

BERT: Bidirectional Encoder Representations from Transformers. A 2018 Google model that processes text bidirectionally, understanding context from both left and right. BERT revolutionized NLP benchmarks and remains the foundation for many search and classification systems.
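Backpropagation reduces to the chain rule. As a sketch, here is the smallest possible case: one linear neuron and a squared-error loss, trained by hand (the numbers are illustrative):

```python
# Toy network: y = w * x, loss L = (y - t)^2.
# Chain rule: dL/dw = dL/dy * dy/dw = 2 * (y - t) * x.
x, t = 3.0, 6.0       # input and target
w = 0.5               # initial weight

for step in range(100):
    y = w * x                 # forward pass
    grad = 2 * (y - t) * x    # backward pass (chain rule)
    w -= 0.01 * grad          # gradient descent update

print(round(w, 3))  # 2.0 — converges, since 2.0 * 3 = 6 = target
```

Real frameworks (PyTorch, JAX) automate exactly this gradient computation across millions of weights.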

C — Chain-of-Thought to CUDA

Chain-of-Thought (CoT): A prompting technique where models are encouraged to show intermediate reasoning steps before producing a final answer. CoT dramatically improves performance on math, logic, and multi-step problems.

CLIP: Contrastive Language-Image Pre-training. An OpenAI model that learns visual concepts from natural language descriptions, enabling zero-shot image classification.

Context Window: The maximum number of tokens a language model can process in a single interaction. In 2026, context windows range from 8K (small models) to 1M+ tokens (Gemini). Larger contexts enable processing entire codebases or books.

CUDA: NVIDIA's parallel computing platform for GPU-accelerated computing. Nearly all AI training runs on CUDA-enabled GPUs. Alternatives include AMD ROCm and Intel oneAPI.
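A chain-of-thought prompt typically includes a worked exemplar that shows reasoning before the answer. A minimal sketch (the questions and wording are illustrative, not from any benchmark):

```python
# One worked exemplar teaches the model to reason step by step
# before committing to an answer.
exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)
question = ("Q: A baker makes 4 trays of 12 muffins and sells 30. "
            "How many are left?\nA:")
prompt = exemplar + question  # model continues with step-by-step reasoning
print(prompt.count("Q:"))  # 2: the exemplar plus the new question
```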

D — Deep Learning to Dropout

Deep Learning: A subset of machine learning using neural networks with multiple layers (hence "deep"). Deep learning powers most modern AI including language models, image generators, and robotics systems.

Diffusion Models: A class of generative models that learn to reverse a noise-adding process. Starting from pure noise, the model iteratively denoises to produce images, video, or audio. Stable Diffusion, DALL-E 3, and Midjourney all use diffusion architectures.

Distillation: The process of training a smaller model to mimic a larger one. The student model learns from the teacher's output probabilities rather than raw training data. Distillation enables deploying powerful models on edge devices.

Dropout: A regularization technique that randomly deactivates neurons during training to prevent overfitting. Typical dropout rates range from 10-50%.
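Dropout is simple enough to sketch directly. This is the standard "inverted dropout" formulation, where surviving activations are rescaled so their expected value is unchanged:

```python
import numpy as np

def dropout(x, rate, training=True, seed=0):
    """Zero out a fraction of activations during training and rescale
    the survivors by 1 / (1 - rate) to keep the expected value constant."""
    if not training or rate == 0.0:
        return x
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= rate   # keep each unit with prob 1 - rate
    return x * mask / (1.0 - rate)

activations = np.ones(10)
dropped = dropout(activations, rate=0.3)
print(dropped)  # a mix of zeros and survivors scaled to 1/0.7
```

At inference time (`training=False`) the input passes through unchanged, which is why the rescaling happens during training.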

E — Embeddings to Evaluation

Embeddings: Dense vector representations of data (text, images, audio) in a continuous space. Similar items have vectors that are close together. Embeddings enable semantic search, clustering, and recommendation systems. Popular embedding models include OpenAI text-embedding-3, Cohere Embed, and BGE.

Emergence: The phenomenon where large models exhibit capabilities not present in smaller versions. Examples include in-context learning, chain-of-thought reasoning, and tool use. Emergence remains one of the least understood aspects of scaling laws.

Evaluation: The process of measuring AI model performance against benchmarks and human judgments. Evaluation methods include automated benchmarks, human preference ratings (as collected for RLHF), red-teaming, and capability assessments.
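"Close together" is usually measured with cosine similarity. A minimal sketch with toy 3-dimensional vectors (real embedding models produce hundreds to thousands of dimensions, and the numbers below are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 = same direction, 0.0 = unrelated, -1.0 = opposite."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

cat    = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.85, 0.75, 0.2])
car    = np.array([0.1, 0.2, 0.95])

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```

Semantic search is exactly this comparison, run between a query embedding and every stored document embedding.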

F — Few-shot Learning to Foundation Models

Few-shot Learning: A model's ability to learn from just a few examples provided in the prompt, without weight updates. This emerged as a key capability of large language models starting with GPT-3.

Fine-tuning: Adapting a pre-trained model to a specific task by training on domain-specific data. Methods include full fine-tuning, LoRA (Low-Rank Adaptation), QLoRA, and prompt tuning. Fine-tuning can dramatically improve performance on specialized tasks while requiring far less compute than training from scratch.

Foundation Model: A large model trained on broad data that can be adapted to many downstream tasks. GPT-4, Claude, Gemini, and Llama are foundation models. The term emphasizes that these models serve as a base for building applications.
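A few-shot prompt is just labeled examples concatenated ahead of the query. A sketch for sentiment classification (the examples and labels are illustrative):

```python
# The task is "learned" from in-context examples alone — no weight updates.
examples = [
    ("The movie was fantastic", "positive"),
    ("Terrible service, never again", "negative"),
]
query = "I loved every minute of it"

prompt = "\n".join(f"Review: {text}\nSentiment: {label}"
                   for text, label in examples)
prompt += f"\nReview: {query}\nSentiment:"   # the model completes the label
print(prompt.count("Review:"))  # 3: two examples plus the query
```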

G — GANs to Guardrails

GAN (Generative Adversarial Network): A framework where two neural networks (generator and discriminator) compete, producing increasingly realistic outputs. GANs pioneered AI image generation before diffusion models.

GPU (Graphics Processing Unit): The primary hardware for AI training and inference. NVIDIA H100 and B200 GPUs dominate the market. A single GPT-4 scale training run may use 25,000+ GPUs for months.

Grounding: Connecting AI outputs to verifiable real-world information. Techniques include retrieval-augmented generation (RAG), tool use, and citation.

Guardrails: Safety mechanisms that constrain AI behavior. These include content filters, output validators, constitutional AI principles, and human-in-the-loop review systems.

H — Hallucination to Hyperparameters

Hallucination: When an AI model generates plausible-sounding but factually incorrect information. Hallucinations remain a fundamental challenge for language models. Mitigation strategies include RAG, fine-tuning on verified data, and chain-of-thought prompting with self-verification.

HITL (Human-in-the-Loop): A design pattern where human oversight is integrated into AI decision-making. Critical for high-stakes applications like medical diagnosis, legal analysis, and financial trading.

Hyperparameters: Configuration values set before training begins, as opposed to parameters learned during training. Key hyperparameters include learning rate, batch size, number of layers, attention heads, and context length. Hyperparameter tuning significantly impacts model quality.

I-L — Inference to LoRA

Inference: Running a trained model to generate predictions or outputs. Inference optimization focuses on reducing latency and cost through techniques like quantization, speculative decoding, and KV-caching.

Knowledge Graph: A structured representation of entities and their relationships. Knowledge graphs complement language models by providing verified facts. Google's Knowledge Graph contains billions of entities.

LLM (Large Language Model): A neural network trained on vast text corpora to understand and generate language. The largest LLMs in 2026 have over 1 trillion parameters.

LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method that adds small trainable matrices to frozen model weights. LoRA reduces fine-tuning memory by 10-100x while maintaining quality.
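The LoRA idea fits in a few lines: the frozen weight matrix W is augmented with a low-rank product of two small matrices. A NumPy sketch with illustrative dimensions (d = 512, rank r = 8):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                            # model dim and LoRA rank (r << d)

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((d, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d))                     # trainable up-projection, zero-init
                                         # so the adapter starts as a no-op
x = rng.standard_normal(d)
y = x @ W + x @ A @ B                    # frozen path + low-rank path

full = d * d                             # parameters in W
lora = d * r + r * d                     # parameters in A and B
print(f"trainable fraction: {lora / full:.3%}")  # 3.125% at r=8, d=512
```

Only A and B receive gradients during fine-tuning, which is where the memory savings come from.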

M-P — MCP to Prompt Engineering

MCP (Model Context Protocol): An open protocol by Anthropic for connecting AI models to external tools and data sources. MCP enables structured tool use with type-safe schemas.

Mixture of Experts (MoE): An architecture where only a subset of model parameters are activated for each input. MoE enables larger total parameter counts while keeping inference costs manageable. Mixtral uses an MoE architecture, and GPT-4 is widely reported to use one as well.

Parameter: A learned value in a neural network, typically a weight or bias. Modern LLMs have billions to trillions of parameters.

Prompt Engineering: The practice of crafting inputs to elicit desired outputs from AI models. Techniques include system prompts, few-shot examples, chain-of-thought, and structured output formats.
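MoE routing can be sketched with a learned gate that scores experts and activates only the top-k. This toy version uses random linear "experts" purely for illustration:

```python
import numpy as np

def moe_forward(x, experts, gate_w, k=2):
    """Route x to the top-k experts by gate score and mix their outputs.
    Only k experts run, so compute scales with k, not len(experts)."""
    logits = x @ gate_w                       # one score per expert
    top = np.argsort(logits)[-k:]             # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over selected experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 4, 8
# Each "expert" is a distinct random linear map (illustrative stand-in).
experts = [lambda x, M=rng.standard_normal((d, d)): x @ M
           for _ in range(n_experts)]
gate_w = rng.standard_normal((d, n_experts))

y = moe_forward(rng.standard_normal(d), experts, gate_w)
print(y.shape)  # (4,)
```

With 8 experts and k=2, only a quarter of the expert parameters touch any given input.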

Q-R — Quantization to RLHF

Quantization: Reducing the precision of model weights (e.g., from 32-bit to 4-bit) to decrease memory and compute requirements. Quantized models run faster and on smaller hardware with minimal quality loss.

RAG (Retrieval-Augmented Generation): A technique that retrieves relevant documents before generating a response, grounding outputs in real data. RAG reduces hallucinations and enables models to access up-to-date information.

Reinforcement Learning from Human Feedback (RLHF): A training approach where models are optimized based on human preference judgments. RLHF is used to align language models with human values and make outputs more helpful and safe. Constitutional AI (CAI) extends RLHF with AI-generated feedback.
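A simple symmetric 8-bit quantization round-trip shows where the memory savings and the (bounded) precision loss come from. This is a sketch of the basic idea, not any particular library's scheme:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric quantization: map floats onto [-127, 127] integers."""
    scale = np.abs(w).max() / 127.0      # one float step per integer step
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes, w.nbytes)               # 1000 4000 — 4x smaller than fp32
print(np.abs(w - w_hat).max() <= scale) # True: error bounded by one step
```

4-bit schemes push the same trade further, usually with per-block scales to keep the error small.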

S-T — Scaling Laws to Transformers

Scaling Laws: Empirical relationships between model size, training data, compute, and performance. Scaling laws predict that model quality improves predictably with increased resources, following power-law curves.

Self-Supervised Learning: Training on unlabeled data by creating pseudo-labels from the data itself. Next-token prediction in language models is the most successful self-supervised task.

Temperature: A parameter controlling the randomness of model outputs. Temperature 0 produces deterministic outputs; higher temperatures increase diversity.

Transformers: The neural network architecture behind modern LLMs. Transformers use self-attention to process sequences in parallel, enabling efficient training on massive datasets. Introduced in the 2017 paper "Attention Is All You Need."
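Temperature works by scaling logits before the softmax: dividing by a small temperature sharpens the distribution toward the top token, while a large temperature flattens it. A minimal sketch:

```python
import numpy as np

def token_distribution(logits, temperature):
    """Softmax with temperature: low T sharpens, high T flattens."""
    scaled = logits / max(temperature, 1e-8)   # guard against division by zero
    p = np.exp(scaled - scaled.max())          # subtract max for stability
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.5])             # raw model scores for 3 tokens
print(token_distribution(logits, 0.1).round(3))  # near one-hot on token 0
print(token_distribution(logits, 2.0).round(3))  # much flatter
```

Temperature 0 is usually implemented as a direct argmax rather than an actual division.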

U-Z — Vector Database to Zero-Shot

Vector Database: A specialized database for storing and querying high-dimensional embedding vectors. Used in RAG systems for semantic search. Popular options include Pinecone, Weaviate, Chroma, and pgvector.

Web Crawling: Automated browsing and indexing of web pages. AI companies operate large-scale crawlers (GPTBot, ClaudeBot, CCBot) to collect training data and keep knowledge bases current.

Zero-Shot Learning: A model's ability to perform tasks it was never explicitly trained on, using only a natural language description. Zero-shot capability is a hallmark of large foundation models and scales with model size.
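At its core, a vector database answers "which stored embeddings are most similar to this query?" A brute-force sketch of that query over random toy embeddings (real systems add approximate indexes like HNSW to avoid scanning everything):

```python
import numpy as np

def search(index, query, k=2):
    """Brute-force nearest neighbors by cosine similarity."""
    index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    sims = index_n @ query_n                   # cosine sim to every stored vector
    return np.argsort(sims)[::-1][:k]          # indices of the top-k matches

rng = np.random.default_rng(0)
index = rng.standard_normal((100, 64))         # 100 stored 64-dim embeddings
query = index[42] + 0.01 * rng.standard_normal(64)  # near-duplicate of item 42
print(search(index, query))  # item 42 ranks first
```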
