AI Crawler List 2026: Every AI Bot, User Agent, and How to Verify Them

June 27, 2026 | 10 min read | LumenGEO Research

Tags: AI crawlers, AI bot user agents, ChatGPT-User, GPTBot, PerplexityBot, robots.txt, first-party data, bot verification

Data source: LumenGEO first-party server logs, lumengeo.co, 14–28 June 2026 (3,392 hits, 13 bots)

Every AI crawler that actually showed up (14-day first-party log, 2026)

3,392 AI-crawler fetches logged in 14 days from one site, across 146 paths and 25 countries
13 distinct AI crawlers observed in under two weeks (~225 hits/day in total)
59.4% of hits were agent and retrieval fetches, the citation-relevant majority
58% came from OpenAI's three crawlers (ChatGPT-User, OAI-SearchBot, GPTBot)
1,987 of 3,392 hits (58.6%) verified against published bot IP ranges
67 hits (2.0%) failed verification: they claimed an identity but came from outside that IP range

Source for all figures: LumenGEO first-party logs, June 2026

Over fourteen days, one GEO-focused site logged 3,392 fetches from 13 distinct AI crawlers — and the single largest slice was not training. Live, in-session agent bots (ChatGPT-User, Perplexity-User, Claude-User) were 44.3% of all hits, and retrieval bots that ground AI answers added another 15.1%, so the citation-relevant majority — bots fetching a page to answer a real person right now — was 59.4%. Training crawls were 29.6%. This page is the directory those numbers come from: every AI crawler we saw, its company, its purpose, the exact user-agent string it sent, whether it obeys robots.txt, and how to verify it is genuine. It is built on first-party server logs, not a vendor panel.

Most "AI crawler list" articles are copied from each other and rarely say which bots actually show up, how often, or which claims of identity hold up under verification. This one is different in two ways. The directory below is checked against each company's official documentation. The hit counts are first-party: every request to lumengeo.co from a known AI user agent over a two-week window, classified by purpose and cross-checked against published bot IP ranges. Where the official record is contested — Perplexity's robots.txt behavior is the clearest case — we say so plainly instead of repeating a vendor claim.

Last updated: June 2026

Data window: June 2026

This is the companion reference to our most-fetched page, robots.txt for AI crawlers. That guide tells you what to allow and block; this one tells you exactly who is knocking.

The four kinds of AI crawler

The phrase "AI bot" collapses four very different jobs into one word. Get the categories right and every robots.txt and analytics decision downstream gets easier. Get them wrong and you will block the traffic that earns citations while leaving the traffic you were worried about untouched.

Agent (live, in-session). Fetched in real time because a person asked an assistant something that requires reading your page. ChatGPT-User, Perplexity-User, and Claude-User are agent bots. This is the highest-intent AI traffic there is — a potential reader, by proxy, right now.
Retrieval (answer grounding). Fetched to build or refresh the index an AI search engine draws from when it composes a cited answer. OAI-SearchBot, PerplexityBot, Claude-SearchBot, and DuckAssistBot are retrieval bots. This is the machinery that decides whether you can be cited at all.
Search (index). Traditional search crawling that also feeds AI answers. Bingbot indexes for Bing and, in doing so, supplies Microsoft Copilot and parts of ChatGPT search.
Training (bulk). Wide crawling that collects content into model-training corpora. GPTBot, ClaudeBot, CCBot, Bytespider, Meta-ExternalAgent, and Amazonbot are training bots. These send no users and produce no citations; the visit is pure cost.

The reason the split matters: agent and retrieval bots are the ones tied to being cited in live answers. Training bots are not. Blocking one class has nothing to do with the other.

AI crawlers do four different jobs: agent (live user fetches), retrieval (answer-index building), search (classic indexing that feeds AI), and training (bulk corpus collection). Only agent and retrieval traffic is tied to being cited in AI answers. Training crawling, the activity most "block AI" anxiety targets, has no effect on your citations either way.

The master AI crawler list

Every AI crawler worth knowing in 2026, grouped by company. The Bot column shows the robots.txt user-agent token verbatim — the token you write a Disallow rule against is the name in this column. Google-Extended and Applebot-Extended are the two exceptions: they are robots.txt tokens only, not crawlers, and make no HTTP requests. Rows marked "Yes" under "In our logs" are bots we actually observed; the rest are documented by their vendors but did not appear in our window.

Bot	Company	Type	Respects robots.txt	How to verify	In our logs
`GPTBot`	OpenAI	Training	Yes	Published IP ranges: `openai.com/gptbot.json`	Yes
`OAI-SearchBot`	OpenAI	Retrieval	Yes	Published IP ranges: `openai.com/searchbot.json`	Yes
`ChatGPT-User`	OpenAI	Agent	Yes	Published IP ranges: `openai.com/chatgpt-user.json`	Yes
`ClaudeBot`	Anthropic	Training	Yes	Published IP ranges: `claude.com/crawling/bots.json`	Yes
`Claude-User`	Anthropic	Agent	Yes	Published IP ranges: `claude.com/crawling/bots.json`	Yes
`Claude-SearchBot`	Anthropic	Retrieval	Yes	Published IP ranges: `claude.com/crawling/bots.json`	No
`PerplexityBot`	Perplexity	Retrieval	Stated — contested (see note)	Published IP ranges: `perplexity.ai/perplexitybot.json`	Yes
`Perplexity-User`	Perplexity	Agent	User-initiated; may fetch a user-supplied URL even if disallowed	Published IP ranges: `perplexity.ai/perplexity-user.json`	Yes
`bingbot`	Microsoft	Search (feeds Copilot)	Yes	Forward-confirmed reverse DNS to `*.search.msn.com` (Bing publishes no IP file)	Yes
`Meta-ExternalAgent`	Meta	Training	Stated — some non-compliance reported	No published IP file we check	Yes
`Meta-ExternalFetcher`	Meta	Agent	User-initiated; may bypass robots.txt	No published IP file we check	No
`Bytespider`	ByteDance	Training	Claimed — widely reported to ignore it	No published ranges; not verifiable by us	Yes
`Amazonbot`	Amazon	Training	Yes	Forward-confirmed reverse DNS (per Amazon's docs)	Yes
`CCBot`	Common Crawl	Training	Yes	User-agent + Common Crawl docs (no IP file we verify)	Yes
`DuckAssistBot`	DuckDuckGo	Retrieval	Yes	DuckDuckGo-listed IPs; reverse DNS	Yes
`Google-Extended`	Google	Training opt-out token	Yes (token only)	N/A — makes no requests	No
`Applebot-Extended`	Apple	Training opt-out token	Yes (token only)	N/A — makes no requests	No

A few notes the table can't carry. Perplexity documents that PerplexityBot obeys robots.txt, but in August 2025 Cloudflare reported that Perplexity was using stealth, undeclared crawlers to reach content that site owners had disallowed; Perplexity has acknowledged that a user-supplied URL may be fetched even when robots.txt would block it. Treat its compliance as stated-but-disputed. ByteDance publishes no documentation page for Bytespider, and the bot has repeatedly been observed fetching paths under Disallow: /, so if you actually need it gone, robots.txt alone is not enough: you need an edge or WAF rule. Anthropic historically said it did not publish IP ranges but has since released a verification IP list; reverse DNS remains a fallback. More generally, robots.txt is an honor system, so for any bot you must keep out, the firewall is the enforcement layer.

The robots.txt token for each AI crawler is its user-agent name — write a Disallow rule against GPTBot, ClaudeBot, or PerplexityBot directly. The only exceptions are Google-Extended and Applebot-Extended, which are robots.txt tokens with no crawler behind them: they exist solely so you can opt out of Google and Apple AI training without affecting normal search.

User-agent strings reference

These are the exact user-agent strings we observed in our logs, reproduced verbatim. Use them to build log filters and detection rules — but remember a user agent is just a text header, and anyone can send one that says GPTBot. The string is a starting point for identification, never proof. For proof you verify the IP, which is covered in how to verify AI crawlers.

Bot	Observed user-agent string
ChatGPT-User	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot`
OAI-SearchBot	`...; compatible; OAI-SearchBot/1.4; +https://openai.com/searchbot`
GPTBot	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.4; +https://openai.com/gptbot)`
PerplexityBot	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)`
Perplexity-User	`...; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user`
ClaudeBot	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)`
Claude-User	`Claude-User (claude-code/2.1.x; +https://support.anthropic.com/)`
Bingbot	`Mozilla/5.0 ... (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/...`
Bytespider	`Mozilla/5.0 (Linux; Android 5.0) ... (compatible; Bytespider; https://zhanzhang.toutiao.com/)`
Meta-ExternalAgent	`meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)`
CCBot	`CCBot/2.0 (https://commoncrawl.org/faq/)`
Amazonbot	`Mozilla/5.0 ... (compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/...`
DuckAssistBot	`DuckAssistBot/1.2; (+http://duckduckgo.com/duckassistbot.html)`

One detail worth flagging: Claude-User now arrives carrying a claude-code/2.1.x token, which suggests the fetch was triggered by a developer working inside Claude or Claude Code rather than by a consumer chatting on the web. The agent class is broadening beyond chat windows into coding tools and automations.
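For log filtering, the strings above can be reduced to the token each request claims. The following is a minimal Python sketch, not a vendor-endorsed matcher: the function name `claimed_bot` is hypothetical, the token list is copied from the directory table, and the result is only the claimed identity, since the header is trivially spoofed.

```python
import re
from typing import Optional

# Tokens copied from the directory table. Matching is case-insensitive
# because some bots send lowercase names (for example meta-externalagent).
AI_BOT_TOKENS = [
    "ChatGPT-User", "OAI-SearchBot", "GPTBot",
    "PerplexityBot", "Perplexity-User",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "bingbot", "Bytespider", "Meta-ExternalAgent", "Meta-ExternalFetcher",
    "CCBot", "Amazonbot", "DuckAssistBot",
]

# Require a non-alphanumeric boundary on both sides so a token is not
# matched inside a longer word.
_TOKEN_RE = re.compile(
    r"(?<![A-Za-z0-9-])("
    + "|".join(re.escape(token) for token in AI_BOT_TOKENS)
    + r")(?![A-Za-z0-9-])",
    re.IGNORECASE,
)

def claimed_bot(user_agent: str) -> Optional[str]:
    """Return the canonical token a user-agent string claims, or None."""
    match = _TOKEN_RE.search(user_agent)
    if match is None:
        return None
    lowered = match.group(1).lower()
    return next(t for t in AI_BOT_TOKENS if t.lower() == lowered)
```

Calling `claimed_bot` on the GPTBot string from the table returns `"GPTBot"`, and an ordinary browser user agent returns `None`. Treat a non-None result as a claim to verify by IP, not as proof.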

Which crawlers actually showed up (and how often)

Here is the part panels can't give you: who actually fetched our pages, in what volume, and how much of it survived verification. The window is 14–28 June 2026 — 3,392 logged AI-bot hits across 146 paths and 25 countries, roughly 225 per day.

Bot	Company	Type	Hits	Verification
ChatGPT-User	OpenAI	Agent	1,434	1,362 verified, 31 failed, 41 unchecked (749 distinct IPs)
Bytespider	ByteDance	Training	491	Not IP-verifiable by us
OAI-SearchBot	OpenAI	Retrieval	374	320 verified, 9 failed
Bingbot	Microsoft	Search	373	Verify via reverse DNS
ClaudeBot	Anthropic	Training	250	Not checked
GPTBot	OpenAI	Training	158	140 verified, 17 failed (10.8% — highest fail rate)
PerplexityBot	Perplexity	Retrieval	138	124 verified, 10 failed
Meta-ExternalAgent	Meta	Training	91	Not checked
Perplexity-User	Perplexity	Agent	48	41 verified, 0 failed
Claude-User	Anthropic	Agent	21	UA shows `claude-code/2.1.x`
CCBot	Common Crawl	Training	9	Not checked
Amazonbot	Amazon	Training	4	Reverse-DNS verifiable
DuckAssistBot	DuckDuckGo	Retrieval	1	Not checked

By purpose the split was Agent 1,503 (44.3%), Training 1,003 (29.6%), Retrieval 513 (15.1%), and Search 373 (11.0%). Agent and retrieval together — the citation-relevant traffic — came to 2,016 hits, or 59.4% of everything. OpenAI's three crawlers alone (ChatGPT-User, OAI-SearchBot, GPTBot) accounted for 1,966 hits, about 58% of the log. No other company was close.
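The split by purpose can be reproduced from the hit table with a few lines of arithmetic. This Python sketch copies the bot, company, type, and hit columns from the table; the variable and function names are illustrative only.

```python
from collections import Counter

# (bot, company, type, hits) copied from the hit table.
HITS = [
    ("ChatGPT-User", "OpenAI", "Agent", 1434),
    ("Bytespider", "ByteDance", "Training", 491),
    ("OAI-SearchBot", "OpenAI", "Retrieval", 374),
    ("Bingbot", "Microsoft", "Search", 373),
    ("ClaudeBot", "Anthropic", "Training", 250),
    ("GPTBot", "OpenAI", "Training", 158),
    ("PerplexityBot", "Perplexity", "Retrieval", 138),
    ("Meta-ExternalAgent", "Meta", "Training", 91),
    ("Perplexity-User", "Perplexity", "Agent", 48),
    ("Claude-User", "Anthropic", "Agent", 21),
    ("CCBot", "Common Crawl", "Training", 9),
    ("Amazonbot", "Amazon", "Training", 4),
    ("DuckAssistBot", "DuckDuckGo", "Retrieval", 1),
]

total = sum(hits for *_, hits in HITS)

by_type = Counter()
by_company = Counter()
for _bot, company, kind, hits in HITS:
    by_type[kind] += hits
    by_company[company] += hits

def share(count: int) -> float:
    """Percentage of all logged hits, rounded to one decimal place."""
    return round(100 * count / total, 1)

# Agent plus retrieval is the citation-relevant slice.
citation_relevant = by_type["Agent"] + by_type["Retrieval"]

for kind, count in by_type.most_common():
    print(f"{kind}: {count} ({share(count)}%)")
print(f"Citation-relevant: {citation_relevant} ({share(citation_relevant)}%)")
print(f"OpenAI: {by_company['OpenAI']} ({share(by_company['OpenAI'])}%)")
```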

The most useful behavioral finding is about re-crawling. ChatGPT-User did not visit our reference content once and move on. It re-fetched the robots.txt for AI crawlers article on 13 of roughly 14 days — essentially daily — and hit the homepage 946 times across the window. Explainer and reference pages get pulled again and again as live questions touch them, which is exactly why an evergreen, well-structured reference page is such durable GEO real estate. For how to watch this on your own site, see how to track AI search traffic and our 14-day AI crawler traffic study.

In 3,392 fetches, ChatGPT-User alone was 1,434 hits and re-fetched our main reference article on 13 of 14 days. Reference and explainer content is not crawled once — it is re-pulled almost daily as live AI queries touch the topic, which makes a single well-built reference page some of the most durable AI-search real estate you can own.
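To look for the same re-crawl pattern in your own logs, one option is to count the distinct calendar days on which each bot fetched each path. This is a minimal Python sketch that assumes the access log has already been parsed into (date, bot, path) records. The function name, the sample records, and the `/guide` path are hypothetical and are not LumenGEO data.

```python
from collections import defaultdict
from datetime import date

def recrawl_days(records):
    """Map (bot, path) to the number of distinct days with at least one fetch."""
    days_seen = defaultdict(set)
    for day, bot, path in records:
        days_seen[(bot, path)].add(day)
    return {key: len(days) for key, days in days_seen.items()}

# Hypothetical sample: three fetches of one path spread over two days,
# plus a single fetch of the homepage by a different bot.
sample = [
    (date(2026, 6, 14), "ChatGPT-User", "/guide"),
    (date(2026, 6, 14), "ChatGPT-User", "/guide"),
    (date(2026, 6, 15), "ChatGPT-User", "/guide"),
    (date(2026, 6, 15), "GPTBot", "/"),
]
counts = recrawl_days(sample)
```

A path whose day count approaches the length of the window is being re-pulled close to daily, which is the pattern described above.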

Verification is what separates a trustworthy bot log from a guess. Of the 3,392 hits, 1,987 (58.6%) were verified against the relevant company's published IP ranges. Sixty-seven (2.0%) actively failed: they claimed a checkable identity but came from an IP outside that company's ranges. The remaining 1,338 (39.4%) are simply unverifiable by us, because we do not run an IP check for every vendor and some legitimate bots don't expose a method we use. Read that honestly: the 67 failures are the provable floor of impostor traffic, not a ceiling, and "unverifiable" is not a synonym for "fake." The highest fail rate among checkable bots was GPTBot at 10.8% (17 of 158), a reminder that the most-spoofed identities are the most famous ones. We cover the impostor problem in fake AI bot traffic.
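The IP check itself is a plain range-membership test. The Python sketch below assumes a vendor range file shaped as a `prefixes` list with `ipv4Prefix` and `ipv6Prefix` keys; that shape is an assumption, so confirm it against the actual file before relying on it. The prefixes shown are reserved documentation addresses, not real crawler ranges, and the function names are hypothetical.

```python
import ipaddress
import json

# Assumed file shape. The prefixes are documentation ranges
# (RFC 5737 and RFC 3849), NOT real crawler addresses.
SAMPLE_RANGE_FILE = json.loads("""
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"},
              {"ipv4Prefix": "198.51.100.0/24"},
              {"ipv6Prefix": "2001:db8::/32"}]}
""")

def load_networks(range_file):
    """Turn the assumed JSON structure into a list of network objects."""
    networks = []
    for entry in range_file.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def verify_ip(ip, networks):
    """Return 'verified', 'failed', or 'unchecked' for one source IP."""
    if not networks:
        # No published ranges to test against, so no verdict either way.
        return "unchecked"
    address = ipaddress.ip_address(ip)
    return "verified" if any(address in net for net in networks) else "failed"

networks = load_networks(SAMPLE_RANGE_FILE)
```

The three return values mirror the three buckets in the log: inside a published range, outside it, or not checkable at all. Keeping "unchecked" separate from "failed" avoids counting unverifiable traffic as fake.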

Should you block them?

There is no single "block AI bots" decision. Each class carries a different consequence, so decide per class.

Training bots (GPTBot, ClaudeBot, CCBot, Bytespider, Meta-ExternalAgent, Amazonbot). Blocking these is a pure content-rights choice with zero citation cost. If you do not want your content in model-training corpora, disallow the training tokens and add the two opt-out tokens, Google-Extended and Applebot-Extended, which cover Gemini and Apple Intelligence training without touching your search visibility. The catch: Bytespider and, by some reports, Meta's crawlers do not reliably honor robots.txt, so enforcing a block on those requires an edge or WAF rule, not just a Disallow line.
Retrieval bots (OAI-SearchBot, PerplexityBot, Claude-SearchBot, DuckAssistBot). These build the indexes AI engines cite from. Block them and you remove yourself from that engine's cited answers. This is almost never what a brand actually wants, and it is the most common accidental self-inflicted wound in AI search.
Agent bots (ChatGPT-User, Perplexity-User, Claude-User, Meta-ExternalFetcher). Blocking these means refusing to load a page for a person who explicitly asked an assistant to read it. It is the highest-intent traffic on the list. Block it only with a very specific reason.
Search (Bingbot). Blocking Bingbot removes you from Bing — and, because Bing's index feeds Microsoft Copilot and parts of ChatGPT search, from those AI surfaces too. Block only if you genuinely want out of Bing entirely.

The principle to carry away: blocking training does not block citations, and blocking agent or retrieval bots removes you from live AI answers. They are different levers with opposite effects. For the exact copy-paste configurations, the CDN edge-block gotcha, and how to confirm your rules work, use robots.txt for AI crawlers.

Blocking AI training crawlers (GPTBot, ClaudeBot, CCBot) is a content-rights decision with no effect on whether you get cited. Blocking AI retrieval and agent crawlers (OAI-SearchBot, PerplexityBot, ChatGPT-User, Perplexity-User) is what actually removes you from live AI answers. Most sites that "block AI bots" with one blunt rule end up blocking the wrong class.
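One way to sanity-check a per-class policy before deploying it is to parse the rules locally and ask which tokens may fetch. This Python sketch uses the standard library's `urllib.robotparser`. The rules are an illustrative example of blocking training tokens while leaving retrieval and agent bots alone, not a recommended configuration; `example.com` and the `allowed` helper are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: disallow the training tokens and the two opt-out
# tokens, allow everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Bytespider
User-agent: Meta-ExternalAgent
User-agent: Amazonbot
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(token, url="https://example.com/any-page"):
    """True if the parsed rules let this user-agent token fetch the URL."""
    return parser.can_fetch(token, url)

for token in ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot", "ChatGPT-User"]:
    print(token, "allowed" if allowed(token) else "blocked")
```

This only shows how Python's parser reads the file. Each crawler applies its own matching logic, and a parsed "blocked" says nothing about bots that ignore robots.txt, so server logs remain the real confirmation.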

FAQ

What is the difference between ChatGPT-User and GPTBot?

They are two different OpenAI crawlers with opposite purposes. GPTBot is a training crawler that bulk-collects content for model training, and you can opt out of it without affecting anything else. ChatGPT-User is an agent crawler that fetches a specific page in real time because a person asked ChatGPT to read it. Blocking GPTBot keeps you out of training data but has no effect on citations; blocking ChatGPT-User refuses to load pages for live users. OpenAI's third bot, OAI-SearchBot, is separate again and handles search retrieval for ChatGPT.

How do I identify AI crawlers in my server logs?

Start with the user-agent string — most legitimate AI bots self-identify with a token like GPTBot, ClaudeBot, PerplexityBot, or ChatGPT-User and a URL pointing to the operator's documentation. But a user agent is just a header and is trivially spoofed, so the name alone is not proof. For confirmation, check the source IP against the company's published range file (for example openai.com/gptbot.json or perplexity.ai/perplexitybot.json) or run a forward-confirmed reverse DNS lookup for bots that use it, such as Bingbot resolving to search.msn.com. Verified hits you can trust; unverified ones you should treat with caution.
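A forward-confirmed reverse DNS check can be sketched in Python with the standard `socket` module: resolve the IP to a hostname, confirm the hostname ends in the operator's domain, then resolve the hostname back and confirm it returns the same IP. The function name is hypothetical, and the resolvers are passed in as parameters purely so the logic can be exercised without network access.

```python
import socket

def forward_confirmed_rdns(ip, allowed_suffixes,
                           reverse=socket.gethostbyaddr,
                           forward=socket.getaddrinfo):
    """True if ip -> hostname -> ip round-trips under an allowed domain suffix."""
    try:
        hostname = reverse(ip)[0]          # step 1: reverse (PTR) lookup
    except OSError:
        return False
    host = hostname.rstrip(".").lower()
    # Step 2: the hostname must sit under one of the operator's domains.
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        infos = forward(host, None)        # step 3: forward lookup
    except OSError:
        return False
    # Step 4: the original IP must be among the forward-resolved addresses.
    # Comparison is textual, so IPv6 addresses should be normalized first.
    return ip in {info[4][0] for info in infos}
```

For Bingbot, the call would look like `forward_confirmed_rdns(source_ip, [".search.msn.com"])`. The leading dot in the suffix matters: without it, a hostname such as `evilsearch.msn.com.example.net` would still fail, but `notsearch.msn.com` would wrongly pass.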

Do AI crawlers respect robots.txt?

Most documented crawlers from OpenAI, Anthropic, Microsoft, Amazon, and Common Crawl state that they honor robots.txt, and in practice they generally do. The notable exceptions are user-initiated agent fetches and a few bad actors: Perplexity has acknowledged that a user-supplied URL may be fetched even when robots.txt would block it, and Bytespider has repeatedly been observed crawling disallowed paths. Because robots.txt is an honor system, anything you truly need to keep out should be enforced at the firewall or CDN edge, not just declared in a text file.

Which AI crawler visits websites most often?

In our 14-day first-party log, ChatGPT-User was by far the most active at 1,434 of 3,392 total hits — more than every non-OpenAI bot combined. It is an agent crawler, fetching pages in real time as ChatGPT users ask questions, and it re-fetched our main reference article on 13 of about 14 days. OpenAI's three crawlers together made up roughly 58% of all AI-bot traffic, which is a strong signal that ChatGPT visibility carries outsized weight in 2026.

Does blocking AI crawlers stop me from being cited?

It depends entirely on which crawlers you block. Blocking training crawlers like GPTBot, ClaudeBot, and CCBot has no effect on citations — they only feed model training. Blocking retrieval and agent crawlers like OAI-SearchBot, PerplexityBot, and ChatGPT-User does stop citations, because those are the bots that build AI answer indexes and fetch your pages for live answers. A blanket "block all AI bots" rule is the common mistake: it removes you from AI search while saving only a little training bandwidth.