Data Science Glossary: 80+ Terms Every Data Scientist Should Know
Complete data science dictionary covering statistics, machine learning, data engineering, and analytics terms for 2026.
A — A/B Testing to Anomaly Detection
A/B Testing: A controlled experiment comparing two versions (A and B) to determine which performs better on a defined metric. Used extensively in product development, marketing, and ML model evaluation. Statistical significance is required before declaring a winner.

Anomaly Detection: Identifying data points that deviate significantly from expected patterns. Techniques include isolation forests, autoencoders, and statistical methods like Z-score analysis. Critical for fraud detection, system monitoring, and quality control.

ANOVA: Analysis of Variance — a statistical method for comparing means across multiple groups. One-way ANOVA compares one factor; two-way ANOVA examines interactions between two factors.
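The significance check mentioned under A/B Testing can be sketched as a two-proportion z-test. This is a minimal pure-Python version (the function name and example counts are illustrative, not from any library):

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_a - p_b) / se
    phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))            # standard normal CDF
    p_value = 2 * (1 - phi(abs(z)))                         # two-sided p-value
    return z, p_value

# Hypothetical experiment: variant A converts 200/1000, variant B 250/1000
z, p = two_proportion_z_test(200, 1000, 250, 1000)
significant = p < 0.05  # reject the null hypothesis at the 5% level
```

With these example numbers the lift from 20% to 25% is large enough to clear the conventional 5% threshold; in practice you would also fix the sample size in advance rather than peeking at the p-value as data arrives.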
B — Bagging to Bayesian Statistics
Bagging (Bootstrap Aggregating): An ensemble technique that trains multiple models on random subsets of the data and averages their predictions. Random Forest is the most well-known bagging algorithm. Reduces variance without increasing bias.

Batch Processing: Processing data in large groups rather than individually. Contrast with stream processing. Common in ETL pipelines using tools like Apache Spark, Hadoop, or dbt.

Bayesian Statistics: A framework that updates probability estimates as new data arrives using Bayes' theorem: P(A|B) = P(B|A) × P(A) / P(B). Bayesian methods are particularly useful when data is scarce or prior knowledge is available.
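The Bayes' theorem update above can be made concrete with the classic diagnostic-test example (the prevalence and error rates below are illustrative, not from the text):

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(A|B) via Bayes' theorem, e.g. P(disease | positive test).

    The evidence P(B) is expanded with the law of total probability:
    P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% prevalence, 99% sensitivity, 5% false-positive rate
posterior = bayes_posterior(prior=0.01, likelihood=0.99, false_positive_rate=0.05)
# posterior is about 0.167: even a positive result from an accurate test
# leaves the probability of disease low when the prior is small
```

This illustrates why the prior matters: the rare-event prior of 1% dominates the highly accurate test, which is the kind of situation the entry says Bayesian methods handle well.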
C — Classification to Cross-Validation
Classification: A supervised learning task where the model predicts discrete categories. Binary classification (spam/not-spam) and multi-class classification (image recognition) are the main types. Common algorithms: logistic regression, decision trees, SVMs, neural networks.

Clustering: Unsupervised learning that groups similar data points together. K-means, DBSCAN, and hierarchical clustering are popular methods. Used for customer segmentation, document grouping, and pattern discovery.

Cross-Validation: A technique for assessing model generalization by splitting data into multiple train/test sets. K-fold cross-validation (typically k=5 or k=10) is standard practice for model selection.
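The k-fold splitting described under Cross-Validation can be sketched in a few lines of pure Python (a simplified, unshuffled version of what libraries like scikit-learn's `KFold` do):

```python
def kfold_splits(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Each sample appears in exactly one test fold; fold sizes differ by at
    most one when n_samples is not divisible by k.
    """
    indices = list(range(n_samples))
    # Distribute the remainder across the first few folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(kfold_splits(10, k=5))  # 5 folds of 2 test samples each
```

In practice you would shuffle (or stratify) the indices first; this sketch keeps them ordered so the fold boundaries are easy to see.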
D — Data Lake to Dimensionality Reduction
Data Lake: A centralized repository storing raw data in its native format. Unlike data warehouses, data lakes accept structured, semi-structured, and unstructured data. Technologies: AWS S3, Azure Data Lake, Google Cloud Storage. Modern data lakes use formats like Delta Lake and Apache Iceberg.

Data Pipeline: An automated workflow that moves data from sources through transformations to destinations. ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are the two main paradigms. Tools: Apache Airflow, Prefect, Dagster.

Dimensionality Reduction: Techniques for reducing the number of features while preserving important information. PCA (Principal Component Analysis), t-SNE, and UMAP are widely used for visualization and preprocessing.
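The extract → transform → load flow in the Data Pipeline entry can be sketched as three composed functions. Everything here is illustrative (the records, the validation rule, and the in-memory "warehouse"); a real pipeline would read from actual sources and be scheduled by an orchestrator such as Airflow:

```python
def extract():
    # Stand-in for reading from a source system (API, database, files).
    return [
        {"user": "a", "amount": "10.5"},
        {"user": "b", "amount": "bad"},   # malformed record
        {"user": "c", "amount": "3.0"},
    ]

def transform(rows):
    # Clean and type-cast: drop rows whose amount does not parse as a float.
    cleaned = []
    for row in rows:
        try:
            cleaned.append({"user": row["user"], "amount": float(row["amount"])})
        except ValueError:
            continue  # skip rows that fail validation
    return cleaned

def load(rows, sink):
    # Stand-in for writing to a destination (warehouse table, data lake path).
    sink.extend(rows)
    return sink

warehouse = []
load(transform(extract()), warehouse)  # 2 of the 3 raw rows survive cleaning
```

Swapping the order of `transform` and `load` (load raw data first, transform inside the destination) is exactly the ETL-vs-ELT distinction the entry draws.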
E–F — Ensemble Methods to Feature Store
Ensemble Methods: Combining multiple models to produce better predictions than any single model. Three main types: bagging (Random Forest), boosting (XGBoost, LightGBM, CatBoost), and stacking. Ensemble methods frequently win ML competitions, especially on tabular data.

Feature Engineering: Creating new input features from raw data to improve model performance. Techniques include one-hot encoding, binning, polynomial features, and domain-specific transformations. Feature engineering often has more impact than model selection.

Feature Store: A centralized system for storing, versioning, and serving ML features. Enables feature reuse across teams and ensures consistency between training and inference. Tools: Feast, Tecton, Hopsworks.
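One-hot encoding, the first technique named under Feature Engineering, fits in a short pure-Python sketch (libraries like pandas `get_dummies` or scikit-learn's `OneHotEncoder` do this in production):

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Categories are sorted so the column order is deterministic; each value
    becomes a 0/1 vector with a single 1 at its category's position.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1
        vectors.append(vec)
    return categories, vectors

cats, encoded = one_hot(["red", "blue", "red"])
# cats == ["blue", "red"]; "red" encodes as [0, 1], "blue" as [1, 0]
```

Note that this toy version derives the category set from the data it sees; the training/inference consistency problem that creates is precisely what the Feature Store entry addresses.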
R — Regression to ROC Curve
Regression: A supervised learning task predicting continuous numerical values. Linear regression, polynomial regression, and neural network regression are common approaches. Evaluated using MSE, RMSE, MAE, and R-squared.

Regularization: Adding a penalty term to the loss function to prevent overfitting. L1 regularization (Lasso) encourages sparsity, L2 (Ridge) shrinks coefficients, and Elastic Net combines both. Critical for high-dimensional datasets.

ROC Curve: Receiver Operating Characteristic curve plots true positive rate vs false positive rate at various classification thresholds. AUC (Area Under the Curve) summarizes classifier performance — 1.0 is perfect, 0.5 is random.
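The AUC described in the ROC Curve entry has an equivalent probabilistic reading: the chance that a randomly chosen positive is scored above a randomly chosen negative. That makes it easy to compute for small examples with a pairwise sketch (O(n²), illustrative only; libraries such as scikit-learn's `roc_auc_score` use a faster ranking approach):

```python
def auc_score(labels, scores):
    """AUC as the probability a random positive outranks a random negative.

    Equivalent to the area under the ROC curve; tied scores count as half
    a win. labels are 0/1, scores are any real-valued model outputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

perfect = auc_score([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2])  # every positive outranks every negative
mixed = auc_score([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6])    # one of four pairs is mis-ordered
```

The first call returns 1.0 (the "perfect" end of the scale from the entry); the second returns 0.75, since three of the four positive/negative pairs are ordered correctly.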