Data Science Glossary: 80+ Terms Every Data Scientist Should Know
Complete data science dictionary covering statistics, machine learning, data engineering, and analytics terms for 2026.
A — A/B Testing to Anomaly Detection
A/B Testing: A controlled experiment comparing two versions (A and B) to determine which performs better on a defined metric. Used extensively in product development, marketing, and ML model evaluation. Statistical significance is required before declaring a winner.

Anomaly Detection: Identifying data points that deviate significantly from expected patterns. Techniques include isolation forests, autoencoders, and statistical methods like Z-score analysis. Critical for fraud detection, system monitoring, and quality control.

ANOVA: Analysis of Variance — a statistical method for comparing means across multiple groups. One-way ANOVA compares one factor; two-way ANOVA examines interactions between two factors.
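The significance check mentioned under A/B Testing can be sketched as a two-proportion z-test. This is a minimal pure-Python version (the function name and example counts are illustrative, not from any library):

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_a - p_b) / se
    phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))            # standard normal CDF
    p_value = 2 * (1 - phi(abs(z)))                         # two-sided p-value
    return z, p_value

# Hypothetical experiment: variant A converts 200/1000, variant B 250/1000
z, p = two_proportion_z_test(200, 1000, 250, 1000)
significant = p < 0.05  # reject the null hypothesis at the 5% level
```

With these example numbers the lift from 20% to 25% is large enough to clear the conventional 5% threshold; in practice you would also fix the sample size in advance rather than peeking at the p-value as data arrives.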
B — Bagging to Bayesian Statistics
Bagging (Bootstrap Aggregating): An ensemble technique that trains multiple models on random subsets of the data and averages their predictions. Random Forest is the most well-known bagging algorithm. Reduces variance without increasing bias.

Batch Processing: Processing data in large groups rather than individually. Contrast with stream processing. Common in ETL pipelines using tools like Apache Spark, Hadoop, or dbt.

Bayesian Statistics: A framework that updates probability estimates as new data arrives using Bayes' theorem: P(A|B) = P(B|A) × P(A) / P(B). Bayesian methods are particularly useful when data is scarce or prior knowledge is available.
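The Bayes' theorem update above can be made concrete with the classic diagnostic-test example (the prevalence and error rates below are illustrative, not from the text):

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(A|B) via Bayes' theorem, e.g. P(disease | positive test).

    The evidence P(B) is expanded with the law of total probability:
    P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% prevalence, 99% sensitivity, 5% false-positive rate
posterior = bayes_posterior(prior=0.01, likelihood=0.99, false_positive_rate=0.05)
# posterior is about 0.167: even a positive result from an accurate test
# leaves the probability of disease low when the prior is small
```

This illustrates why the prior matters: the rare-event prior of 1% dominates the highly accurate test, which is the kind of situation the entry says Bayesian methods handle well.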
C — Classification to Cross-Validation
Classification: A supervised learning task where the model predicts discrete categories. Binary classification (spam/not-spam) and multi-class classification (image recognition) are the main types. Common algorithms: logistic regression, decision trees, SVMs, neural networks.

Clustering: Unsupervised learning that groups similar data points together. K-means, DBSCAN, and hierarchical clustering are popular methods. Used for customer segmentation, document grouping, and pattern discovery.

Cross-Validation: A technique for assessing model generalization by splitting data into multiple train/test sets. K-fold cross-validation (typically k=5 or k=10) is standard practice for model selection.
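The k-fold splitting described under Cross-Validation can be sketched in a few lines of pure Python (a simplified, unshuffled version of what libraries like scikit-learn's `KFold` do):

```python
def kfold_splits(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Each sample appears in exactly one test fold; fold sizes differ by at
    most one when n_samples is not divisible by k.
    """
    indices = list(range(n_samples))
    # Distribute the remainder across the first few folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(kfold_splits(10, k=5))  # 5 folds of 2 test samples each
```

In practice you would shuffle (or stratify) the indices first; this sketch keeps them ordered so the fold boundaries are easy to see.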
D — Data Lake to Dimensionality Reduction
Data Lake: A centralized repository storing raw data in its native format. Unlike data warehouses, data lakes accept structured, semi-structured, and unstructured data. Technologies: AWS S3, Azure Data Lake, Google Cloud Storage. Modern data lakes use formats like Delta Lake and Apache Iceberg.

Data Pipeline: An automated workflow that moves data from sources through transformations to destinations. ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are the two main paradigms. Tools: Apache Airflow, Prefect, Dagster.

Dimensionality Reduction: Techniques for reducing the number of features while preserving important information. PCA (Principal Component Analysis), t-SNE, and UMAP are widely used for visualization and preprocessing.
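The extract → transform → load flow in the Data Pipeline entry can be sketched as three composed functions. Everything here is illustrative (the records, the validation rule, and the in-memory "warehouse"); a real pipeline would read from actual sources and be scheduled by an orchestrator such as Airflow:

```python
def extract():
    # Stand-in for reading from a source system (API, database, files).
    return [
        {"user": "a", "amount": "10.5"},
        {"user": "b", "amount": "bad"},   # malformed record
        {"user": "c", "amount": "3.0"},
    ]

def transform(rows):
    # Clean and type-cast: drop rows whose amount does not parse as a float.
    cleaned = []
    for row in rows:
        try:
            cleaned.append({"user": row["user"], "amount": float(row["amount"])})
        except ValueError:
            continue  # skip rows that fail validation
    return cleaned

def load(rows, sink):
    # Stand-in for writing to a destination (warehouse table, data lake path).
    sink.extend(rows)
    return sink

warehouse = []
load(transform(extract()), warehouse)  # 2 of the 3 raw rows survive cleaning
```

Swapping the order of `transform` and `load` (load raw data first, transform inside the destination) is exactly the ETL-vs-ELT distinction the entry draws.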
E–F — Ensemble Methods to Feature Store
Ensemble Methods: Combining multiple models to produce better predictions than any single model. Three main types: bagging (Random Forest), boosting (XGBoost, LightGBM, CatBoost), and stacking. Ensemble methods frequently win ML competitions, especially on tabular data.

Feature Engineering: Creating new input features from raw data to improve model performance. Techniques include one-hot encoding, binning, polynomial features, and domain-specific transformations. Feature engineering often has more impact than model selection.

Feature Store: A centralized system for storing, versioning, and serving ML features. Enables feature reuse across teams and ensures consistency between training and inference. Tools: Feast, Tecton, Hopsworks.
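One-hot encoding, the first technique named under Feature Engineering, fits in a short pure-Python sketch (libraries like pandas `get_dummies` or scikit-learn's `OneHotEncoder` do this in production):

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Categories are sorted so the column order is deterministic; each value
    becomes a 0/1 vector with a single 1 at its category's position.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1
        vectors.append(vec)
    return categories, vectors

cats, encoded = one_hot(["red", "blue", "red"])
# cats == ["blue", "red"]; "red" encodes as [0, 1], "blue" as [1, 0]
```

Note that this toy version derives the category set from the data it sees; the training/inference consistency problem that creates is precisely what the Feature Store entry addresses.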
R — Regression to ROC Curve
Regression: A supervised learning task predicting continuous numerical values. Linear regression, polynomial regression, and neural network regression are common approaches. Evaluated using MSE, RMSE, MAE, and R-squared.

Regularization: Adding a penalty term to the loss function to prevent overfitting. L1 regularization (Lasso) encourages sparsity, L2 (Ridge) shrinks coefficients, and Elastic Net combines both. Critical for high-dimensional datasets.

ROC Curve: Receiver Operating Characteristic curve plots true positive rate vs false positive rate at various classification thresholds. AUC (Area Under the Curve) summarizes classifier performance — 1.0 is perfect, 0.5 is random.
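The AUC described in the ROC Curve entry has an equivalent probabilistic reading: the chance that a randomly chosen positive is scored above a randomly chosen negative. That makes it easy to compute for small examples with a pairwise sketch (O(n²), illustrative only; libraries such as scikit-learn's `roc_auc_score` use a faster ranking approach):

```python
def auc_score(labels, scores):
    """AUC as the probability a random positive outranks a random negative.

    Equivalent to the area under the ROC curve; tied scores count as half
    a win. labels are 0/1, scores are any real-valued model outputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

perfect = auc_score([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2])  # every positive outranks every negative
mixed = auc_score([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6])    # one of four pairs is mis-ordered
```

The first call returns 1.0 (the "perfect" end of the scale from the entry); the second returns 0.75, since three of the four positive/negative pairs are ordered correctly.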