How We Detect AI Bots: A Technical Deep-Dive
Technical breakdown of user-agent analysis, behavioral fingerprinting, and real-time bot detection for AI web crawlers.
The Problem
AI bots are crawling the web at unprecedented scale. GPTBot, ClaudeBot, Googlebot, and dozens of others visit millions of sites daily. Most site owners have no idea which bots visit their pages, how often they return, or what they do once there. We built a detection system to find out.
Layer 1: User-Agent Detection
The simplest approach: match user-agent strings against known bot signatures. We maintain a database of 30+ AI bot user-agents including GPTBot, ClaudeBot, CCBot, Bytespider, PetalBot, and others. This catches ~80% of known bots. The signatures are checked in Next.js middleware on every request, adding <1ms latency.
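A minimal sketch of that signature-matching layer, assuming a simple regex-per-bot table. The `BOT_SIGNATURES` list and the `detectBot()` helper are illustrative, not the project's actual code; in Next.js this check would run inside `middleware()` against `request.headers.get("user-agent")`.

```typescript
interface BotSignature {
  name: string;
  pattern: RegExp;
}

// A handful of entries from the kind of signature table described above;
// the real database covers 30+ bots.
const BOT_SIGNATURES: BotSignature[] = [
  { name: "GPTBot", pattern: /GPTBot/i },
  { name: "ClaudeBot", pattern: /ClaudeBot/i },
  { name: "CCBot", pattern: /CCBot/i },
  { name: "Bytespider", pattern: /Bytespider/i },
  { name: "PetalBot", pattern: /PetalBot/i },
];

// Returns the matched signature, or null for (apparently) human traffic.
function detectBot(userAgent: string): BotSignature | null {
  for (const sig of BOT_SIGNATURES) {
    if (sig.pattern.test(userAgent)) return sig;
  }
  return null;
}
```

Substring/regex matching like this is why the check stays under a millisecond: it is a fixed number of tests against one header, with no network or database calls on the request path.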
Layer 2: Behavioral Fingerprinting
Some bots disguise their user-agent. We detect these through behavior: request timing (bots arrive at far more regular intervals than humans), header patterns (bots often omit Accept-Language), TLS fingerprints (JA3/JA4), and navigation patterns (bots don't scroll, don't hover, and fire no mouse events). We track page transitions to build a per-visitor crawl graph.
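A sketch of how two of those signals might combine into a suspicion score. `scoreRequest()`, its weights, and the timing threshold are illustrative assumptions, not the production heuristics; TLS fingerprinting and the crawl graph would contribute additional signals.

```typescript
interface RequestSignals {
  // Header names lowercased, as most server runtimes expose them.
  headers: Record<string, string>;
  // Milliseconds between this visitor's recent requests.
  interRequestGapsMs: number[];
}

// Higher score = more bot-like. Weights here are arbitrary for exposition.
function scoreRequest(signals: RequestSignals): number {
  let score = 0;
  // Real browsers almost always send Accept-Language.
  if (!("accept-language" in signals.headers)) score += 2;
  // Chromium browsers send client hints; many bots omit them.
  if (!("sec-ch-ua" in signals.headers)) score += 1;
  // Highly regular timing (low coefficient of variation) suggests a scheduler,
  // since human navigation gaps vary widely.
  const gaps = signals.interRequestGapsMs;
  if (gaps.length >= 3) {
    const mean = gaps.reduce((a, b) => a + b, 0) / gaps.length;
    const variance =
      gaps.reduce((a, b) => a + (b - mean) ** 2, 0) / gaps.length;
    if (mean > 0 && Math.sqrt(variance) / mean < 0.1) score += 2;
  }
  return score;
}
```

In practice a score like this would feed a threshold or a weighted combination with the other layers rather than a hard yes/no.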
Layer 3: Capability Testing
The most interesting layer. We serve progressively harder challenges: can the bot follow JavaScript-rendered links? Can it fill out a form? Can it parse structured data? Can it read a crypto wallet address? Each test reveals different capability tiers — from basic crawlers to fully autonomous agents.
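One way to sketch the tiering logic: map which challenges a visitor has passed to a capability label, with the hardest pass winning. The tier names, challenge identifiers, and ordering below are assumptions for exposition, not the project's actual taxonomy.

```typescript
// Challenges roughly ordered by difficulty, per the progression described above.
const CHALLENGES = [
  "parse_structured_data", // read Schema.org / JSON-LD markup
  "follow_js_link",        // navigate a link rendered only by client-side JS
  "fill_form",             // complete and submit a form
] as const;

// The hardest challenge a visitor has passed determines its tier.
function capabilityTier(passed: Set<string>): string {
  if (passed.has("fill_form")) return "autonomous-agent";
  if (passed.has("follow_js_link")) return "js-capable-crawler";
  if (passed.has("parse_structured_data")) return "structured-data-crawler";
  return "basic-fetcher";
}
```

The ordering encodes the claim in the text: following a JavaScript-rendered link requires executing the page, and form interaction requires agent-like autonomy on top of that.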
Architecture
The system runs as Next.js middleware on Vercel Edge. Bot detection happens at the edge with zero cold start. Detections are logged to Supabase in the background using event.waitUntil() so they don't block the response. A daily cron aggregates per-bot statistics, path traversals, and funnel metrics.
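The non-blocking logging pattern can be sketched as follows. In real Next.js middleware, `waitUntil` comes from the `NextFetchEvent` argument; here it is modeled as a parameter so the sketch stays self-contained, and `logDetection()` is a hypothetical stand-in for the Supabase write.

```typescript
type WaitUntil = (promise: Promise<unknown>) => void;

// Hypothetical logger: in production this would INSERT a row into a
// Supabase table; stubbed here to keep the sketch dependency-free.
async function logDetection(botName: string, path: string): Promise<void> {
  void botName;
  void path;
}

// Stand-in for the middleware handler: the string return models the
// response (NextResponse.next() in real middleware).
function handleRequest(
  botName: string | null,
  path: string,
  waitUntil: WaitUntil
): string {
  if (botName !== null) {
    // Schedule the log write without awaiting it: the response returns
    // immediately while the runtime keeps the promise alive.
    waitUntil(logDetection(botName, path));
  }
  return "response";
}
```

The point of `waitUntil` is exactly what the paragraph describes: the edge runtime extends the function's lifetime until the logging promise settles, so detection data is captured without adding its latency to the visitor's response.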
What We Found
9 unique AI bots visit our site regularly. Googlebot is the most frequent (2-3x daily). GPTBot and ClaudeBot visit within hours of content changes. Most bots only crawl 1-2 pages per visit — crawl depth is surprisingly shallow. Schema.org structured data correlates with more frequent re-crawls. None of the crawlers has passed our form interaction test yet.
Open Source
The detection system is open source. The core bot detection runs in ~50 lines of TypeScript. The harder part is the analytics pipeline that makes the data useful. We're building toward a standard "bot capability benchmark" that any site can deploy.