robots.txt vs AI Crawlers: The Rules Are Changing

How robots.txt is being used (and ignored) by AI training crawlers, and what website owners can do about it.

The Original Purpose of robots.txt

robots.txt was created in 1994 as a voluntary agreement between websites and search engine crawlers — a convention with no technical enforcement behind it. The idea was straightforward: website owners could tell crawlers which pages to skip. For 30 years, this worked because both sides benefited — search engines needed websites to cooperate, and websites needed search traffic. But AI training crawlers have fundamentally changed this dynamic.
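The mechanism itself is minimal: a plain-text file served from the site root, listing per-agent rules. A basic example (the paths here are hypothetical):

```
# robots.txt — served at the site root, e.g. /robots.txt
# "*" applies to any crawler that reads the file.
User-agent: *
Disallow: /private/
Disallow: /drafts/
```

Nothing stops a crawler from fetching /private/ anyway — the file only works when the crawler chooses to honor it, which is exactly the assumption AI crawlers have strained.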

How AI Crawlers Treat robots.txt

The major players in AI training each crawl under a published user-agent: GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl, the nonprofit whose corpus feeds many training sets), Bytespider (ByteDance), and Meta-ExternalAgent (Meta). Each organization claims to respect robots.txt, but enforcement is inconsistent. Some crawlers use generic user-agents that don't match their published bot names. Others crawl first and check robots.txt later. The fundamental issue: AI companies have no reciprocal relationship with websites — they extract training data but send no traffic back.

The Legal Landscape in 2026

Several lawsuits are testing whether robots.txt creates a legally binding contract. The New York Times sued OpenAI for training on its content despite robots.txt blocks. The ruling is still pending, but early signals suggest courts may treat robots.txt violations as a form of trespass. Meanwhile, the EU AI Act requires AI companies to document their training data sources, creating indirect pressure to respect robots.txt. In practice, most AI companies now respect explicit blocks for their named crawlers — but the unnamed crawlers remain a gray area.

What Actually Works for Blocking

Based on our testing: blocking by user-agent name works for the major crawlers that identify themselves (GPTBot, ClaudeBot, Bingbot). Rate limiting via server configuration catches aggressive crawlers regardless of user-agent. IP-based blocking works but requires maintaining updated IP ranges (OpenAI publishes theirs, others don't). CAPTCHAs and JavaScript challenges stop most crawlers but also hurt legitimate users. The most effective approach is a combination: named blocks in robots.txt plus server-side rate limiting for unknown agents.
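The rate-limiting half of that combination can be sketched as a token bucket keyed by client IP: each client gets a burst allowance that refills over time, and requests beyond it are throttled regardless of user-agent. The capacity and refill rate below are illustrative assumptions, not recommended values:

```python
import time
from collections import defaultdict

# Sketch: per-IP token-bucket rate limiter. Catches aggressive crawlers
# even when they spoof their user-agent. Thresholds are illustrative.
CAPACITY = 10        # burst allowance, in requests
REFILL_RATE = 1.0    # tokens restored per second

# ip -> (tokens remaining, timestamp of last update)
_buckets: dict[str, tuple[float, float]] = defaultdict(
    lambda: (CAPACITY, time.monotonic())
)

def allow_request(client_ip: str) -> bool:
    """Return True if the request is within limits, False to throttle."""
    tokens, last = _buckets[client_ip]
    now = time.monotonic()
    # Refill proportionally to elapsed time, capped at CAPACITY.
    tokens = min(CAPACITY, tokens + (now - last) * REFILL_RATE)
    if tokens >= 1.0:
        _buckets[client_ip] = (tokens - 1.0, now)
        return True
    _buckets[client_ip] = (tokens, now)
    return False
```

In production this state would live in the web server or a shared store (e.g. nginx's limit_req, or Redis behind a load balancer) rather than in-process memory, but the accounting logic is the same.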

The Opt-In Future

The industry is moving toward an opt-in model. Google's AI Overviews already use a separate permission layer beyond robots.txt. OpenAI offers a media partnership program where publishers grant crawling access in exchange for compensation. Anthropic's approach is to respect robots.txt strictly for ClaudeBot. The emerging standard is likely: robots.txt for search crawlers (the existing system), plus a separate AI-specific permission layer — possibly through something like llms.txt, a proposed convention for describing a site's AI-crawling preferences in one place.
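Today's robots.txt syntax can already express the first half of that split — different rules for search agents and named AI agents in one file:

```
# robots.txt — search crawlers keep access, named AI crawlers are opted out
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

What it cannot express is anything richer than allow/deny — licensing terms, compensation, or per-use permissions — which is the gap an AI-specific layer would fill.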

Our Approach: Observe Everything

At Global Chat, we take the opposite approach: we allow all crawlers and track everything they do. Rather than blocking AI bots, we study them. This gives us data on crawl patterns, frequency, behavior, and capabilities that helps the broader web community understand what AI crawlers actually do. Our bot detection middleware logs every visit without blocking, building what we believe is one of the most detailed public datasets on AI crawler behavior.
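A minimal version of such observe-don't-block middleware can be sketched as a WSGI wrapper that records each request and always passes it through. The field names are illustrative — the article doesn't publish the actual logging schema:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crawler-observer")

class ObserveCrawlersMiddleware:
    """WSGI middleware sketch: log every request's user agent, IP, and
    path without ever blocking. Field names are illustrative."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # Record the visit as one JSON line per request.
        log.info(json.dumps({
            "ts": time.time(),
            "ip": environ.get("REMOTE_ADDR", ""),
            "ua": environ.get("HTTP_USER_AGENT", ""),
            "path": environ.get("PATH_INFO", ""),
        }))
        # Always forward the request unchanged — observe, never block.
        return self.app(environ, start_response)
```

Because the middleware never alters the response, crawlers see normal behavior, which is what makes the resulting dataset an honest record of how they act when unimpeded.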
