What 12 Days of AI Crawler Logs Reveal: Who Actually Fetches Your Content (2026 First-Party Data)

7 min read · LumenGEO Research
We logged every AI crawler that touched one GEO-focused site for 12 days. The result: 2,746 fetches from 13 distinct AI bots. The single biggest slice was not training crawls but agent fetches (48.5%): pages pulled in real time while a person was mid-conversation with an AI assistant. OpenAI's three crawlers accounted for roughly 63% of all hits, and only 23% of traffic was training-only crawling. This is first-party server-log data, with the full methodology and limitations below.

Most writing about AI crawlers is either vendor marketing or guesswork from third-party panels. We wanted first-party numbers, so we instrumented our own server to log every verified AI bot that requested a page, classify it by purpose, and check the user agent against each company's published IP ranges. This is what 12 days of that log shows.

Last updated: June 2026

Data window: June 2026

Agent traffic, not training crawls, dominates. Across 2,746 fetches, live agent bots (ChatGPT-User, Perplexity-User, Claude-User) were 48.5% of all activity, retrieval bots that ground answers were another 15.3%, and search indexing was 13%. Training-only crawling was just 23.2%. For citations, the live and retrieval traffic is the part that matters, and it is the majority.

What we measured (read this first)

Every request to lumengeo.co from a known AI user agent is logged with its bot name, a purpose classification, the path, and whether the source IP falls inside that company's published crawler ranges. The window for this study is 14 to 25 June 2026: 12 days, one site, 2,746 logged AI-bot hits, averaging roughly 229 per day.

We classify each bot into one of four purposes:

  • Agent: fetched live, in-session, when a user asks an assistant something that pulls your page (ChatGPT-User, Perplexity-User, Claude-User).
  • Retrieval: fetched to ground or build an answer index (OAI-SearchBot, PerplexityBot).
  • Search: traditional search indexing that also feeds AI answers (Bingbot).
  • Training: bulk crawling for model training corpora (GPTBot, ClaudeBot, Bytespider, CCBot, Meta-ExternalAgent, Amazonbot).
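
The four-way classification above can be sketched as a simple lookup on the user-agent string. This is a minimal illustration, not our production logger: the bot-to-purpose mapping comes from this article (Claude-User is filed under agent, as in the leaderboard), while the function name `classify_user_agent` and the substring-matching approach are assumptions made for the example.

```python
from typing import Optional, Tuple

# Bot-to-purpose mapping as described in this article.
BOT_PURPOSE = {
    "ChatGPT-User": "agent",
    "Perplexity-User": "agent",
    "Claude-User": "agent",
    "OAI-SearchBot": "retrieval",
    "PerplexityBot": "retrieval",
    "Bingbot": "search",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Bytespider": "training",
    "CCBot": "training",
    "Meta-ExternalAgent": "training",
    "Amazonbot": "training",
}


def classify_user_agent(user_agent: str) -> Optional[Tuple[str, str]]:
    """Return (bot_name, purpose) for a known AI bot, or None otherwise.

    Hypothetical helper: matches bot tokens as case-insensitive substrings.
    """
    ua = user_agent.lower()
    # Try longer tokens first so the most specific name wins.
    for token in sorted(BOT_PURPOSE, key=len, reverse=True):
        if token.lower() in ua:
            return token, BOT_PURPOSE[token]
    return None
```

A user-agent match only says what a request claims to be. It proves nothing about the sender until the source IP has been checked as well.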

This is a small, single-site sample, not an internet-wide census. Treat the shape of the data as directional and the exact percentages as specific to one GEO-niche site. The methodology and limitations section at the end spells out what this does and does not prove.

Agent fetches, not training crawls, are the biggest slice

The common assumption is that AI bot traffic is mostly models hoovering up your content for training. In our log it is the opposite. Here is the split by purpose:

| Purpose | Share | Hits | What it means |
| --- | --- | --- | --- |
| Agent (live, in-session) | 48.5% | 1,331 | A user was actively asking an AI assistant something your page answered |
| Training (bulk crawl) | 23.2% | 638 | Content collected for model training corpora |
| Retrieval (answer grounding) | 15.3% | 420 | Bots building or refreshing the answer index |
| Search (index, feeds AI) | 13.0% | 357 | Bingbot indexing, which also backs ChatGPT and Copilot |

Agent and retrieval together are 63.8% of all AI-bot traffic. That is the citation-relevant majority: bots fetching your content to put an answer in front of a real person, right now. Training, the part most of the "AI is stealing my content" anxiety focuses on, is under a quarter.
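
For readers who want to reproduce the percentages, the arithmetic is short. The hit counts below are the ones reported in this article; the variable names are just for illustration.

```python
# Hit counts per purpose, as reported in this article.
hits_by_purpose = {
    "agent": 1331,
    "training": 638,
    "retrieval": 420,
    "search": 357,
}

total_hits = sum(hits_by_purpose.values())

# Share of each purpose, as a percentage rounded to one decimal place.
share_by_purpose = {
    purpose: round(100 * hits / total_hits, 1)
    for purpose, hits in hits_by_purpose.items()
}

# "Citation-relevant" traffic is agent plus retrieval.
citation_relevant_hits = hits_by_purpose["agent"] + hits_by_purpose["retrieval"]
citation_relevant_share = round(100 * citation_relevant_hits / total_hits, 1)

print(total_hits, share_by_purpose, citation_relevant_share)
```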

If you block AI crawlers with a blanket rule, you are mostly blocking the live traffic that gets you cited, not the training traffic you were worried about. The two are different bot classes and deserve different decisions.

One company accounts for roughly 63% of all AI-bot hits

OpenAI operates three of the crawlers we saw, spanning three of our four purpose classes: ChatGPT-User (agent), OAI-SearchBot (retrieval), and GPTBot (training). Together they made up 1,734 of 2,746 hits, about 63% of everything. No other company came close.

ChatGPT-User alone was 1,277 hits, about 47% of the entire log. That single user agent fetched our content more than every non-OpenAI bot combined. It is a blunt reminder of where AI-search attention is concentrated in 2026: optimising for ChatGPT visibility is not one option among many; it is most of the live demand.
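
The concentration figures can be checked directly from the reported counts. This is a small sketch using only numbers stated in this article; the variable names are illustrative.

```python
TOTAL_HITS = 2746  # all logged AI-bot hits in the 12-day window

# Hits for the three OpenAI crawlers, as reported in the leaderboard.
openai_hits = {"ChatGPT-User": 1277, "OAI-SearchBot": 302, "GPTBot": 155}

openai_total = sum(openai_hits.values())
openai_share = round(100 * openai_total / TOTAL_HITS, 1)

# Everything that did not come from an OpenAI crawler.
non_openai_total = TOTAL_HITS - openai_total

chatgpt_user_share = round(100 * openai_hits["ChatGPT-User"] / TOTAL_HITS, 1)
chatgpt_user_beats_rest = openai_hits["ChatGPT-User"] > non_openai_total

print(openai_total, openai_share, non_openai_total,
      chatgpt_user_share, chatgpt_user_beats_rest)
```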

The full bot leaderboard

Every AI crawler we saw in 12 days, ranked by activity, with the share of hits we could verify against published IP ranges:

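| Bot | Purpose | Hits | Verified |
| --- | --- | --- | --- |
| ChatGPT-User | Agent | 1,277 | 93% |
| Bingbot | Search | 357 | not checked |
| OAI-SearchBot | Retrieval | 302 | 66% |
| ClaudeBot | Training | 209 | not checked |
| Bytespider | Training | 171 | not checked |
| GPTBot | Training | 155 | 88% |
| PerplexityBot | Retrieval | 117 | 88% |
| Meta-ExternalAgent | Training | 91 | not checked |
| Perplexity-User | Agent | 35 | 91% |
| Claude-User | Agent | 19 | not checked |
| CCBot | Training | 9 | not checked |
| Amazonbot | Training | 3 | not checked |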

Verified versus unverified: what we could actually prove

User agents are trivially spoofable. Anyone can send a request that claims to be GPTBot. So, for the bots our check currently covers, we compared each hit's source IP against the ranges the operating company publishes for its crawlers. 60.5% of all hits (1,660 of 2,746) verified cleanly.

The unverified remainder is not all fake. Some genuine, well-behaved bots, including Bingbot and ClaudeBot, are not covered by our current check, so their hits land as unverified even when they are real. The honest reading is: verified hits you can trust, unverified hits you should treat with caution and confirm before acting on. If you are making robots.txt or analytics decisions based on bot traffic, verifying the IP is not optional; it is the whole game.
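
An IP-range check of this kind can be sketched with Python's standard `ipaddress` module. This is an illustration under stated assumptions, not our production verifier: the CIDR blocks below are placeholders from the reserved documentation ranges, not any company's real crawler ranges, and `verify_bot_ip` is a hypothetical helper. In practice you would load the ranges each operator publishes.

```python
import ipaddress
from typing import Dict, List

# PLACEHOLDER ranges (reserved documentation blocks), NOT real crawler ranges.
# Replace with the ranges each bot operator actually publishes.
PUBLISHED_RANGES: Dict[str, List[str]] = {
    "GPTBot": ["192.0.2.0/24", "2001:db8::/32"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def verify_bot_ip(bot: str, source_ip: str,
                  ranges: Dict[str, List[str]] = PUBLISHED_RANGES) -> str:
    """Return 'verified', 'unverified', or 'not checked' for one hit."""
    cidrs = ranges.get(bot)
    if cidrs is None:
        # No ranges loaded for this bot: we cannot say either way.
        return "not checked"
    try:
        ip = ipaddress.ip_address(source_ip)
    except ValueError:
        # A malformed address can never match a published range.
        return "unverified"
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr)
        # Compare only like with like (IPv4 vs IPv4, IPv6 vs IPv6).
        if ip.version == network.version and ip in network:
            return "verified"
    return "unverified"
```

The three-way result mirrors the leaderboard: "not checked" marks a bot with no ranges loaded, which is different from a hit that was checked and failed.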

What this means for your GEO and robots.txt strategy

Three practical takeaways from the data:

  1. Decide robots.txt per bot class, not with a blanket rule. Blocking training crawlers (GPTBot, CCBot) keeps you out of training corpora without touching live citations. Blocking agent and retrieval bots (ChatGPT-User, OAI-SearchBot, PerplexityBot) removes you from live AI answers, which is almost never what a brand actually wants. Our guide to robots.txt for AI crawlers breaks down the safe configuration.
  2. Treat ChatGPT as the centre of gravity, then diversify. With OpenAI bots at 63% of traffic, ChatGPT visibility is the highest-leverage place to start. Perplexity is the clear second live source. The rest are worth covering but should not set your priorities.
  3. Log your own traffic before you trust anyone's dashboard. A panel-based estimate cannot tell you which bots fetch your specific pages. First-party logging can, and it is the only way to connect a crawl to an eventual citation. This is the same first-party logging that powers our brand-citation tracking.
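
To make the first takeaway concrete, here is one way to check a per-class policy before deploying it, using Python's standard `urllib.robotparser`. The robots.txt text is an illustrative policy that blocks two training crawlers and allows everything else. It is an assumption for this example, not a recommended configuration, and `can_fetch_by_bot` and the example URL are hypothetical.

```python
from typing import Dict, Iterable
from urllib.robotparser import RobotFileParser

# Illustrative policy: block two training crawlers, allow all other agents.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch_by_bot(robots_txt: str, bots: Iterable[str],
                     url: str = "https://example.com/guide") -> Dict[str, bool]:
    """Report, per bot name, whether the given robots.txt permits the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


result = can_fetch_by_bot(
    ROBOTS_TXT,
    ["GPTBot", "CCBot", "ChatGPT-User", "OAI-SearchBot", "PerplexityBot"],
)
print(result)
```

Under this policy the training crawlers are disallowed while the agent and retrieval bots remain allowed. Keep in mind that robots.txt is advisory: it states your preference, and compliance is up to each bot's operator.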

Methodology and limitations

The data is from lumengeo.co server logs, 14 to 25 June 2026, capturing requests from known AI user agents. Purpose classification (agent, retrieval, search, training) is ours, based on each bot's documented role. Verification compares the source IP to published crawler IP ranges where a company provides them.

Limitations to keep in mind: this is one site in the GEO and AI-search niche, so the bot mix plausibly skews toward heavier AI crawling than a typical site would see. Twelve days is a short window. "Not checked" in the verified column means we do not yet run an IP check for that bot, not that the bot is unverifiable. And a crawl is not a citation: a fetch means a bot retrieved your page, not that an AI answer cited it. For how fetches turn into citations, see how to get cited by ChatGPT and our AI Citation Index.

We will refresh this study as the log grows. The direction we expect to hold: agent and retrieval traffic stays the majority, and the live bots, not the training bots, are the ones worth optimising for.