Open Datasets for AI: The Best Public Resources for Machine Learning
Curated list of the best open datasets for AI training, fine-tuning, and evaluation. Updated for 2026.
Why Open Datasets Matter
Open datasets are the foundation of reproducible AI research. They enable independent researchers and small companies to train competitive models without billion-dollar crawling infrastructure. The open data movement has produced landmarks like Common Crawl, The Pile, and RedPajama, which power models like Llama, Falcon, and Mistral. Open datasets also enable fair benchmarking: model comparisons are only meaningful when the training data is comparable.
Text Datasets
- Common Crawl: The largest open web dataset, providing petabytes of crawled HTML. Updated monthly and used as a foundation by most open-source models.
- The Pile: A curated 825GiB dataset by EleutherAI combining 22 diverse sources, including Wikipedia, ArXiv, GitHub, Stack Exchange, and books.
- SlimPajama: A cleaned and deduplicated version of RedPajama containing 627 billion tokens. Used for training models up to 7B parameters.
- FineWeb: A high-quality web dataset by Hugging Face, filtered using educational-content classifiers.
- Dolma: The dataset behind Allen AI's OLMo models, containing 3 trillion tokens from diverse sources with documented provenance.
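Cleaning pipelines like SlimPajama's combine quality filtering with deduplication. As a minimal sketch of the idea (exact-match dedup by hashing normalized text; production pipelines such as SlimPajama's also apply fuzzy MinHash deduplication, which this toy version does not):

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies hash alike.
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep only the first occurrence of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox.",
    "the  quick brown FOX.",   # near-identical copy, dropped by normalization
    "A different document.",
]
print(exact_dedup(corpus))  # keeps 2 of the 3 documents
```

At web scale the `seen` set is replaced by sharded Bloom filters or MinHash signatures, but the keep-first-occurrence logic is the same.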
Code Datasets
- The Stack v2: Over 900 million files covering 600+ programming languages, built on the Software Heritage archive with permissive-license filtering. Used to train the StarCoder models.
- CodeParrot: A cleaned Python code dataset scraped from GitHub.
- MBPP (Mostly Basic Python Problems): 974 programming tasks for evaluating code generation.
- CodeContests: Competitive programming problems from Codeforces, AtCoder, and similar platforms. Used to train AlphaCode.
- HumanEval: 164 hand-written Python programming problems with test cases; the standard benchmark for code generation.
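HumanEval-style evaluation samples n completions per problem, counts how many pass the unit tests, and reports pass@k. The unbiased estimator introduced alongside HumanEval, 1 - C(n-c, k)/C(n, k) for c passing samples, can be computed directly:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n total, c of them passing, passes."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples per problem, 30 of them passing the tests
print(round(pass_at_k(200, 30, 1), 3))   # 0.15
print(round(pass_at_k(200, 30, 10), 3))
```

Averaging this value over all 164 problems gives the reported HumanEval score.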
Instruction and Alignment Datasets
- FLAN Collection: Google's collection of 1,800+ NLP tasks formatted as instructions, used for instruction tuning.
- OpenAssistant Conversations (OASST): A crowd-sourced dataset of human-written assistant conversations in 35 languages.
- ShareGPT: Conversations shared by ChatGPT users, widely used for fine-tuning open models (though with unclear licensing).
- Anthropic HH-RLHF: Human preference data for training helpful and harmless assistants, used for RLHF and DPO training.
- UltraFeedback: AI-generated preference data across 64K prompts, used for training open RLHF models.
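Preference datasets like HH-RLHF and UltraFeedback store (prompt, chosen, rejected) triples. A toy scalar sketch of the per-pair DPO objective such data feeds into (the summed log-probs here are made-up inputs, not outputs of a real model, and a real trainer works on batched tensors):

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one preference pair, given summed token log-probs of the
    chosen (w) and rejected (l) responses under the trained policy (pi_*)
    and the frozen reference model (ref_*)."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy prefers the chosen response more than the reference does -> low loss
print(round(dpo_loss(-10.0, -14.0, -12.0, -12.0), 4))
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model.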
Multimodal Datasets
- LAION-5B: 5.85 billion image-text pairs from the internet. Used to train Stable Diffusion and other image generators. (Withdrawn in 2023 after CSAM was found in the index; a cleaned Re-LAION-5B was released in 2024.)
- CC12M: 12 million image-text pairs (Conceptual 12M), a cleaner alternative to LAION.
- WebVid: 10 million video-text pairs for training video generation models.
- AudioSet: Over 2 million labeled audio clips, the standard dataset for audio understanding.
- LLaVA-Instruct: Visual instruction-following data for training multimodal language models.
- ShareGPT4V: High-quality image descriptions generated by GPT-4V for training open multimodal models.
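Web-scale image-text sets like LAION are assembled by scoring candidate pairs with CLIP and dropping captions that don't match their image. A minimal sketch of that filtering step over precomputed similarity scores (the 0.28 threshold is illustrative; LAION's actual cutoffs varied by subset):

```python
def filter_pairs(pairs, min_clip_sim=0.28):
    """Keep image-text pairs whose precomputed CLIP image-text similarity
    clears the threshold; low scores usually mean captions unrelated to
    the image (ads, boilerplate, filenames)."""
    return [p for p in pairs if p["clip_sim"] >= min_clip_sim]

pairs = [  # hypothetical records in a LAION-like schema
    {"url": "https://example.com/cat.jpg", "caption": "a cat on a sofa", "clip_sim": 0.41},
    {"url": "https://example.com/ad.jpg", "caption": "SALE 50% OFF", "clip_sim": 0.12},
]
print(len(filter_pairs(pairs)))  # 1
```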
Evaluation Benchmarks
- MMLU: 15,908 questions across 57 subjects, testing broad knowledge. The most widely cited LLM benchmark.
- GSM8K: 8,500 grade-school math problems requiring multi-step reasoning. The standard math benchmark.
- ARC: The AI2 Reasoning Challenge, grade-school-level science questions testing scientific reasoning.
- TruthfulQA: 817 questions designed to test whether models give truthful answers on topics where human misconceptions are common.
- HellaSwag: A sentence-completion benchmark testing commonsense reasoning.
- GPQA: Graduate-level science questions testing expert knowledge; frontier models are approaching human-expert performance on this benchmark.
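Multiple-choice benchmarks such as MMLU, ARC, and HellaSwag ultimately reduce to exact-match accuracy over predicted choice letters; a minimal scorer:

```python
def accuracy(predictions, answers):
    """Fraction of exact matches between predicted and gold choice letters."""
    assert len(predictions) == len(answers), "one prediction per question"
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

# MMLU-style items: four options per question, gold answer is a letter A-D
gold = ["B", "D", "A", "C"]
preds = ["B", "D", "C", "C"]
print(accuracy(preds, gold))  # 0.75
```

Real harnesses add the tricky part this sketch omits: extracting the choice letter from free-form model output, or comparing per-option log-likelihoods instead.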
How to Choose a Dataset
Selecting the right dataset depends on your goal:
- Pre-training a general model: start with SlimPajama or FineWeb for text and add The Stack for code.
- Instruction fine-tuning: use FLAN for broad capabilities and OpenAssistant for conversational style.
- Alignment: use Anthropic HH-RLHF or UltraFeedback for preference learning.
- Evaluation: use a diverse benchmark suite (MMLU + HumanEval + GSM8K at minimum).
Always check licensing: many datasets have restrictions on commercial use. The Hugging Face Datasets Hub hosts over 100,000 datasets with standardized access via the datasets library.
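A simple pre-flight license check can catch obvious problems before you commit compute to a dataset. The allowlist and license IDs below are illustrative assumptions, not verified metadata, so always read the actual dataset card and terms:

```python
# Hypothetical allowlist of commonly commercial-friendly license IDs.
COMMERCIAL_OK = {"apache-2.0", "mit", "cc-by-4.0", "odc-by"}

def commercially_usable(license_id: str) -> bool:
    """Crude screen: is the declared license on our allowlist?"""
    return license_id.lower() in COMMERCIAL_OK

candidates = {  # example dataset -> license mappings, not verified metadata
    "cerebras/SlimPajama-627B": "apache-2.0",
    "some-org/research-only-set": "cc-by-nc-4.0",
}
usable = [name for name, lic in candidates.items() if commercially_usable(lic)]
print(usable)
```

This only screens the declared top-level license; composite corpora can contain sources under stricter terms, which is exactly why the dataset card matters.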