
robots.txt for AI Crawlers: The Complete 2026 Configuration Guide

18 min read | LumenGEO Research
Tags: robots.txt, AI crawlers, GPTBot, OAI-SearchBot, crawler access, AI search visibility

The single most important fact about robots.txt and AI search in 2026: blocking AI training scrapers does not block AI search citations, and blocking AI retrieval crawlers makes you invisible to AI search. These are two different classes of bot with two different effects, and most sites that try to "manage AI access" block the wrong one. A site can disallow GPTBot, ClaudeBot, and CCBot — protecting its content from model training — and still be fully cited by ChatGPT, Perplexity, and Copilot, because those citations are driven by a separate set of retrieval crawlers. The reverse is the silent failure: a blunt "block all AI bots" rule, or a Cloudflare edge toggle, that catches the retrieval crawlers and quietly removes you from AI search. This guide is the complete, copy-paste reference for getting it right.

robots.txt was designed for one job: telling search-engine crawlers which paths they may fetch. AI search broke that simple model by introducing several distinct kinds of bot that all "crawl" but do very different things. Treating them as one category — the thing the phrase "block AI bots" encourages — is the most common and most costly robots.txt mistake of 2026.

This guide covers the three-class crawler taxonomy, the full user-agent reference table, copy-paste robots.txt configurations for the common goals, the CDN edge-block gotcha that overrides robots.txt entirely, and how to verify your configuration actually works.

Last updated: May 2026

robots.txt now governs three different classes of AI bot. Training scrapers (GPTBot, ClaudeBot, CCBot, Google-Extended) feed model training. Retrieval crawlers (OAI-SearchBot, PerplexityBot, Claude-SearchBot, Bingbot) feed AI search and drive citations. User-triggered fetchers (ChatGPT-User, Perplexity-User, Claude-User) act on a live user request. Blocking the first class is a content-rights choice with no citation cost. Blocking the second class makes you invisible to AI search. Know which is which before you write a single Disallow line.

Why "block AI bots" is the wrong mental model

The phrase "AI bots" collapses three fundamentally different activities — training, search retrieval, and live user fetching — into one word, and robots.txt rules written against that collapsed category almost always produce the wrong outcome.

When site owners decide to "do something about AI crawlers," they usually reach for one of two blunt instruments: a wildcard block of every AI user-agent they can find in a list, or a one-click toggle in their CDN dashboard. Both treat AI bots as a single undifferentiated threat. Both are wrong, because the bots are not a single thing.

Consider what each class actually does when it visits your site:

  • A training scraper copies your content into a dataset that may be used to train a future model. It sends no users to you and produces no citations. Its visit is pure cost — bandwidth out, nothing back.
  • A retrieval crawler indexes your content so an AI search engine can find and cite it when a user asks a relevant question. Its visit is the entire mechanism by which you get cited in AI search. Block it and you remove yourself from that engine's answers.
  • A user-triggered fetcher loads a specific page because a real person, right now, asked an AI assistant a question that requires reading it. Its visit is a potential customer's proxy.

A rule that blocks "AI bots" indiscriminately hits all three. You save a little bandwidth from the training scrapers — and you simultaneously delete yourself from AI search results and refuse to load pages for live users. That is a catastrophic trade dressed up as a privacy decision.

The correct mental model is not "AI bots: block or allow." It is: which class, and what does blocking it actually cost me?

There is no such thing as "blocking AI bots" as a single decision. Every robots.txt rule you write affects a specific class with a specific consequence. Blocking training scrapers costs you nothing in citations. Blocking retrieval crawlers costs you all AI-search visibility for that engine. Conflating the two is the central robots.txt error of 2026.

The three-class AI crawler taxonomy

AI crawlers fall into three classes — training scrapers, retrieval/search crawlers, and user-triggered fetchers — and each class has a different operator intent, a different effect on AI citations, and therefore a different correct treatment in robots.txt.

This taxonomy is the framework the rest of the guide depends on. It is consistent with the three-class model in our AI-invisibility guide — same three classes, examined here at the level of individual robots.txt directives.

Class 1: Training scrapers

These crawlers collect public web content to build datasets for training future AI models. They are batch crawlers — they sweep broadly, on their own schedule, with no connection to any live user query. Examples: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google), CCBot (Common Crawl), Applebot-Extended (Apple), Meta-ExternalAgent (Meta).

Effect of blocking: none on AI-search citations. You remove your content from future training corpora — a legitimate content-rights decision — but you do not affect whether AI search engines can find and cite you today. Blocking Class 1 is the one place where "protect my content" and "stay visible in AI search" do not conflict.

The one nuance: Google-Extended also governs whether your content is eligible for Gemini grounding, not only training. Blocking it is therefore a slightly riskier choice than blocking a pure training scraper — see the FAQ.

Class 2: Retrieval and search crawlers

These crawlers build and maintain the search indexes that AI engines query in real time to answer questions. When ChatGPT, Perplexity, or Copilot produces an answer with citations, those citations come from content these crawlers indexed. Examples: OAI-SearchBot (powers ChatGPT search), PerplexityBot (Perplexity's index), Claude-SearchBot (Claude's search layer), and Bingbot (the Bing index, which feeds ChatGPT, Copilot, and other engines' retrieval).

Effect of blocking: you become invisible to that engine's AI search. Block OAI-SearchBot and ChatGPT search cannot cite you. Block Bingbot and you damage retrieval across every engine that draws on the Bing index. Class 2 crawlers must be allowed if you want AI-search citations. There is no upside to blocking them — they send users and citations toward you, not away.

Class 3: User-triggered fetchers

These bots fetch a specific URL because a live user just asked an AI assistant something that requires reading that page. Examples: ChatGPT-User (a ChatGPT user's request or agent action), Perplexity-User (a Perplexity user follows or triggers a fetch), Claude-User (a Claude user asks a question needing a page). They are not batch crawlers — each request maps to one human, right now.

Effect of blocking: the AI assistant cannot load your page on behalf of that user. For almost every site this is undesirable: the user is a real prospect, and the assistant is acting as their browser. Allow Class 3 unless you have a specific, deliberate reason not to.

Three classes, three intents, three treatments. Training scrapers (Class 1): block freely, no citation cost. Retrieval crawlers (Class 2): always allow — they are the citation mechanism. User-triggered fetchers (Class 3): allow by default — each request is a live user. The whole of correct robots.txt configuration is applying the right treatment to the right class.

The complete AI crawler user-agent reference table

Use this table to map every AI user-agent string you might see in your logs to its operator, its class, and the precise consequence of disallowing it.

User-agent tokens are matched case-insensitively in robots.txt, and a crawler obeys the most specific User-agent group that names it, falling back to the * group only when no named group matches. The strings below are the tokens to use in your robots.txt: they are the product-token names operators publish, not the full HTTP User-Agent header.

| User-agent token | Operator | Class | What blocking it does |
|---|---|---|---|
| GPTBot | OpenAI | 1 (Training) | Excludes your content from OpenAI model-training datasets. No effect on ChatGPT search citations. |
| ClaudeBot | Anthropic | 1 (Training) | Excludes your content from Anthropic model-training datasets. No effect on Claude search or user fetches. |
| Google-Extended | Google | 1 (Training + grounding) | Excludes content from Gemini training and can reduce Gemini grounding eligibility. No effect on classic Google Search. |
| CCBot | Common Crawl | 1 (Training) | Excludes content from the Common Crawl dataset, used by many model trainers. No direct citation effect. |
| Applebot-Extended | Apple | 1 (Training) | Excludes content from Apple's generative-model training. Applebot itself (Siri/Spotlight) is separate. |
| Meta-ExternalAgent | Meta | 1 (Training) | Excludes content from Meta's AI training crawls. |
| OAI-SearchBot | OpenAI | 2 (Retrieval) | Removes you from the ChatGPT search index. ChatGPT search can no longer cite you. |
| PerplexityBot | Perplexity | 2 (Retrieval) | Removes you from Perplexity's index. Perplexity can no longer cite you in answers. |
| Claude-SearchBot | Anthropic | 2 (Retrieval) | Removes you from Claude's search layer. Degrades Claude's ability to cite you. |
| Bingbot | Microsoft | 2 (Retrieval) | Removes you from the Bing index, which feeds Copilot and ChatGPT retrieval. High-impact block. |
| ChatGPT-User | OpenAI | 3 (User-triggered) | ChatGPT cannot fetch your page when a user (or its agent) requests it live. |
| Perplexity-User | Perplexity | 3 (User-triggered) | Perplexity cannot fetch your page on a live user request. |
| Claude-User | Anthropic | 3 (User-triggered) | Claude cannot fetch your page when a user asks a question that needs it. |

Two naming notes that trip people up. First, Anthropic uses three distinct tokens: ClaudeBot (training), Claude-SearchBot (retrieval), and Claude-User (user-triggered). Older guides reference Claude-Web and anthropic-ai; those are deprecated, and you should configure against the current three-token scheme. Second, OpenAI's ChatGPT-User is genuinely Class 3: it does not build a persistent index, it fetches per request. Do not confuse it with OAI-SearchBot, which is the actual ChatGPT-search index crawler.

The reference table is the source of truth: token, operator, class, consequence. The high-stakes rows are the Class 2 retrieval crawlers — OAI-SearchBot, PerplexityBot, Claude-SearchBot, Bingbot. Disallowing any one of them removes you from a major AI search engine. Bookmark the table and check every Disallow line against it before deploying.
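
The check described above (compare every Disallow target against the table) can be sketched as a small lookup. This is a minimal illustration, assuming the token-to-class mapping from the table; the names `CRAWLERS`, `canonical_token`, and `high_stakes_disallows` are hypothetical helpers invented for this example, not part of any tool.

```python
from typing import NamedTuple, Optional


class Crawler(NamedTuple):
    operator: str
    crawler_class: int  # 1 = training, 2 = retrieval, 3 = user-triggered


# Mapping transcribed from the reference table in this guide.
CRAWLERS = {
    "GPTBot": Crawler("OpenAI", 1),
    "ClaudeBot": Crawler("Anthropic", 1),
    "Google-Extended": Crawler("Google", 1),
    "CCBot": Crawler("Common Crawl", 1),
    "Applebot-Extended": Crawler("Apple", 1),
    "Meta-ExternalAgent": Crawler("Meta", 1),
    "OAI-SearchBot": Crawler("OpenAI", 2),
    "PerplexityBot": Crawler("Perplexity", 2),
    "Claude-SearchBot": Crawler("Anthropic", 2),
    "Bingbot": Crawler("Microsoft", 2),
    "ChatGPT-User": Crawler("OpenAI", 3),
    "Perplexity-User": Crawler("Perplexity", 3),
    "Claude-User": Crawler("Anthropic", 3),
}

# Lowercased index, because robots.txt tokens are matched case-insensitively.
_BY_LOWER = {name.lower(): name for name in CRAWLERS}


def canonical_token(token: str) -> Optional[str]:
    """Return the table's spelling of a token, or None if it is not listed."""
    return _BY_LOWER.get(token.strip().lower())


def high_stakes_disallows(disallowed_tokens):
    """Return the Class 2 (retrieval) crawlers found in a list of disallowed tokens."""
    found = []
    for token in disallowed_tokens:
        name = canonical_token(token)
        if name is not None and CRAWLERS[name].crawler_class == 2:
            found.append(name)
    return found


print(high_stakes_disallows(["gptbot", "OAI-SearchBot", "bingbot", "CCBot"]))
# ['OAI-SearchBot', 'Bingbot']
```

Anything the function returns is a retrieval crawler you are about to disallow, which is the row type the table flags as high-stakes.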

Copy-paste robots.txt configurations

Below are four ready-to-use robots.txt configurations, one for each common goal. Pick the one that matches your intent, paste it into the robots.txt file at your site root, and verify it afterward.

robots.txt lives at https://yourdomain.com/robots.txt and must be served as plain text. Rules are grouped by User-agent; a crawler obeys the most specific block that names it. Always include your sitemap.
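
The grouping rule can be seen in a short sketch using Python's standard-library `urllib.robotparser`. It is one parser among many, so treat its behavior as an approximation of how a given crawler reads the file rather than a guarantee. The robots.txt text and the `/pricing` URL are made up for the example, and the named group is written in lowercase on purpose.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named group plus a permissive default group.
ROBOTS_TXT = """\
User-agent: gptbot
Disallow: /

User-agent: *
Allow: /
"""


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt text in memory, with no network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = parse_robots(ROBOTS_TXT)
url = "https://yourdomain.com/pricing"

# Token matching ignores case: "GPTBot" is governed by the "gptbot" group.
gptbot_allowed = parser.can_fetch("GPTBot", url)

# A crawler with no named group falls back to the "*" group.
searchbot_allowed = parser.can_fetch("OAI-SearchBot", url)

print(gptbot_allowed, searchbot_allowed)  # False True
```

Pass the product token to `can_fetch`, not a full HTTP User-Agent header, since this parser compares only the part of the string before the first slash.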

Goal 1: Maximum AI-search visibility (recommended for most sites)

You want to be cited everywhere and you do not mind contributing to training. This is the right default for any business that wants AI-search traffic.

# Allow all AI crawlers — maximum AI-search visibility
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

A permissive User-agent: * block already allows every AI crawler. You do not need to name them. The most common cause of accidental AI invisibility is not an explicit AI block — it is a pre-existing restrictive Disallow under User-agent: * that was written years ago for traditional SEO and now silently catches retrieval crawlers too.

Goal 2: Stay cited in AI search, but opt out of model training

You want ChatGPT, Perplexity, and Copilot to cite you — but you do not want your content used to train models. This is the most popular deliberate configuration in 2026, and it is fully supported: OpenAI and Anthropic explicitly allow you to permit search while disallowing training.

# Allow AI search + user fetches, block AI model training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Retrieval + user-triggered crawlers stay allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

This is the configuration that proves the guide's central point: every training scraper is blocked, every retrieval and user crawler is allowed, and the result is full AI-search citation eligibility with zero training contribution. One caveat to weigh: blocking Google-Extended also reduces Gemini grounding eligibility — if Gemini visibility matters to you, consider removing the Google-Extended block.
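
If you maintain this file by hand, it is easy to put a token in the wrong half. The sketch below generates the same block-training, allow-search layout from two lists and then parses the result with Python's standard-library `urllib.robotparser` to confirm which tokens end up disallowed. The function name `render_goal2` and the test URL are hypothetical, and the stdlib parser only approximates how each real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

TRAINING = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended",
            "Applebot-Extended", "Meta-ExternalAgent"]
RETRIEVAL_AND_USER = ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot",
                      "ChatGPT-User", "Perplexity-User", "Claude-User"]


def render_goal2(sitemap_url: str) -> str:
    """Render a block-training, allow-search robots.txt as a string."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in TRAINING]
    groups += [f"User-agent: {token}\nAllow: /" for token in RETRIEVAL_AND_USER]
    groups.append("User-agent: *\nAllow: /")
    groups.append(f"Sitemap: {sitemap_url}")
    # Blank lines separate groups, matching the layout shown above.
    return "\n\n".join(groups) + "\n"


robots_txt = render_goal2("https://yourdomain.com/sitemap.xml")

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

page = "https://yourdomain.com/any-page"
blocked = [token for token in TRAINING + RETRIEVAL_AND_USER
           if not parser.can_fetch(token, page)]

print(blocked)  # only the six training tokens
```

Bingbot has no named group in this configuration, so it falls under the permissive `*` group and stays allowed.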

Goal 3: Allow AI crawlers site-wide, but keep all of them out of a private section

You want full visibility for public content but need to keep a specific path — say /account/ or /internal/ — out of all AI access.

# AI crawlers allowed site-wide except a private section
User-agent: GPTBot
Disallow: /account/
Disallow: /internal/

User-agent: OAI-SearchBot
Disallow: /account/
Disallow: /internal/

User-agent: PerplexityBot
Disallow: /account/
Disallow: /internal/

User-agent: Claude-SearchBot
Disallow: /account/
Disallow: /internal/

User-agent: ChatGPT-User
Disallow: /account/
Disallow: /internal/

User-agent: *
Disallow: /account/
Disallow: /internal/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

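Path-level rules are far more precise than all-or-nothing user-agent blocks. If your real concern is one section, scope the rule to that section rather than blocking a whole crawler. Keep in mind that robots.txt is advisory rather than access control, so genuinely private paths should also sit behind authentication.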
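
To see path scoping in action, the sketch below parses a shortened version of the Goal 3 file with Python's standard-library `urllib.robotparser` and asks, per path, whether a crawler may fetch it. The sample paths and the helper `access_map` are hypothetical, and the stdlib parser is only a stand-in for how real crawlers evaluate the rules.

```python
from urllib.robotparser import RobotFileParser

# Shortened Goal 3 file: one named group plus the default group.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Disallow: /account/
Disallow: /internal/

User-agent: *
Disallow: /account/
Disallow: /internal/
Allow: /
"""

BASE = "https://yourdomain.com"
PATHS = ["/", "/blog/post", "/account/settings", "/internal/wiki"]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def access_map(token: str) -> dict:
    """Map each sample path to True (fetch allowed) or False (disallowed)."""
    return {path: parser.can_fetch(token, BASE + path) for path in PATHS}


for token in ("OAI-SearchBot", "PerplexityBot"):
    print(token, access_map(token))
```

Both crawlers keep access to public pages and lose only the two private prefixes. PerplexityBot has no named group in this shortened file and gets the same result through the `*` group.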

Goal 4: Block everything AI (understand the cost first)

You want no AI access of any kind — no training, no search, no user fetches. This makes you completely invisible to AI search. It is a defensible choice only if AI-search traffic genuinely has no value to you.

# Block ALL AI access — removes you from AI search entirely
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

If you deploy this, do it knowingly. Most sites that have a configuration like this got there by accident — a "block AI" toggle they no longer remember enabling — and are losing AI-search traffic they would happily keep.

For most businesses, Goal 1 or Goal 2 is correct. Goal 2 — block training, allow retrieval and user crawlers — is the deliberate sweet spot: full AI-search citations, no training contribution. Goal 4 (block everything) is rarely the right call and is more often an accident than a decision. Whichever you pick, check it against the user-agent table before deploying.


The CDN edge-block gotcha: the rule that overrides robots.txt

A perfect robots.txt does not guarantee AI crawlers can reach you, because your CDN can block them at the network edge before robots.txt is ever read — and Cloudflare's bot-fight and managed AI-bot features do exactly this, often silently and by default.

robots.txt is a file a crawler requests after it connects to your site. The CDN edge layer is a gate that decides whether the crawler connects at all. The edge sits in front of robots.txt. If it blocks a crawler, the crawler never gets far enough to read your robots.txt, and your carefully written Allow: lines are irrelevant.

This is, per current GEO Playbook 3.0 research, the single most common silent cause of AI invisibility. Three specific Cloudflare behaviors cause it:

  • Block AI bots managed rule. Cloudflare offers a one-click "Block AI bots" rule on all plans. It blocks AI crawlers at the edge. Many site owners enabled it once and forgot.
  • Default-on blocking for new domains. Since mid-2025, Cloudflare has blocked AI crawlers by default on newly onboarded domains. A site can be edge-blocking AI crawlers from day one without anyone choosing it.
  • Bot Fight Mode. Cloudflare's Bot Fight Mode (and Super Bot Fight Mode) challenges or blocks traffic it classifies as automated. Legitimate AI retrieval crawlers can be caught in this net and served a challenge page instead of your content — which, to the crawler, looks like an empty or broken page.

The danger is that none of this shows up in robots.txt. A site owner opens robots.txt, sees Allow: /, and concludes AI access is fine. It is not — the block is one layer up, on a dashboard they have not opened.

If you use Cloudflare (or Fastly, Akamai, or any CDN with bot management), fixing AI access is a two-place job: robots.txt and the CDN bot-management settings. On Cloudflare, that means Security → Bots: confirm the "Block AI bots" rule is off (or scoped to training-only crawlers if your CDN supports that granularity), and confirm Bot Fight Mode is not challenging retrieval crawlers. Cloudflare also lets you allow verified AI crawlers explicitly — use that if available.

robots.txt is necessary but not sufficient. The CDN edge layer is checked first and can block AI crawlers before robots.txt is read. Cloudflare's "Block AI bots" rule, its mid-2025 default-on blocking for new domains, and Bot Fight Mode are the three usual culprits. If you fix robots.txt but not the edge, you have fixed nothing. Always verify both layers.
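
The ordering of the two layers can be modelled in a few lines. This is a deliberately simplified toy: a real CDN decides using IP reputation, bot verification, and challenge logic rather than a plain token list, and `effective_access` and `edge_blocked_tokens` are hypothetical names invented for this sketch. It shows only that the edge decision comes first and that a permissive robots.txt cannot undo it.

```python
from urllib.robotparser import RobotFileParser

PERMISSIVE_ROBOTS = "User-agent: *\nAllow: /\n"


def effective_access(token, url, robots_text, edge_blocked_tokens):
    """Return (reachable, reason) under a simplified two-layer model."""
    # Layer 1: the edge decides first. A blocked crawler never sees robots.txt.
    if token.lower() in {blocked.lower() for blocked in edge_blocked_tokens}:
        return False, "blocked at CDN edge"

    # Layer 2: only a crawler that got through the edge evaluates robots.txt.
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    if not parser.can_fetch(token, url):
        return False, "disallowed by robots.txt"

    return True, "reachable"


url = "https://yourdomain.com/your-key-page"

print(effective_access("OAI-SearchBot", url, PERMISSIVE_ROBOTS, ["OAI-SearchBot"]))
# (False, 'blocked at CDN edge')
print(effective_access("OAI-SearchBot", url, PERMISSIVE_ROBOTS, []))
# (True, 'reachable')
```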

How to verify your configuration actually works

Do not assume a robots.txt edit took effect. Verify access in two ways: fetch your pages with the actual AI crawler user-agent strings and confirm an HTTP 200 with real content, and confirm that the robots.txt you are serving contains the rules you intended.

Verification takes a few minutes. The fetch test catches edge-block problems that no amount of reading robots.txt would reveal, and the parse check catches mistakes in the rules themselves.

Step 1: Fetch as a retrieval crawler

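From a terminal, request your key pages using the Class 2 user-agent strings. A 200 means the edge and origin accept that user-agent; a 403, 429, or a challenge/CAPTCHA page means an edge or server rule is blocking it. curl does not read or obey robots.txt, so this step tests the edge and origin only; the robots.txt rules themselves are checked in Step 2.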

curl -A "OAI-SearchBot" -I https://yourdomain.com/your-key-page
curl -A "PerplexityBot" -I https://yourdomain.com/your-key-page
curl -A "Claude-SearchBot" -I https://yourdomain.com/your-key-page
curl -A "Bingbot" -I https://yourdomain.com/your-key-page

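Test several pages, not just the homepage, because edge rules sometimes treat paths differently. Drop the -I flag to fetch the full body and confirm you get real content rather than a challenge page disguised as a 200. Treat the result as an approximation: CDNs that verify crawlers by IP address may handle a spoofed user-agent from your machine differently from the real crawler.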
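
The same check can be scripted. The sketch below is a rough Python equivalent of the curl commands, using only the standard library. The `CHALLENGE_MARKERS` phrases are assumed heuristics for spotting a challenge page and will not match every CDN's wording, and `fetch_as` and `classify` are hypothetical helper names.

```python
import urllib.error
import urllib.request

# Assumed heuristics: phrases that often appear on bot-challenge pages.
CHALLENGE_MARKERS = ("just a moment", "captcha", "attention required")


def classify(status: int, body: str) -> str:
    """Label a response as 'reachable', 'blocked', 'challenge', or 'other'."""
    if status in (403, 429):
        return "blocked"
    if status == 200:
        lowered = body.lower()
        if any(marker in lowered for marker in CHALLENGE_MARKERS):
            # A 200 that carries a challenge page is not real content.
            return "challenge"
        return "reachable"
    return "other"


def fetch_as(token: str, url: str, timeout: float = 10.0):
    """GET a URL with the given User-Agent token; return (status, body)."""
    request = urllib.request.Request(url, headers={"User-Agent": token})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions; keep their status and body.
        return error.code, error.read().decode("utf-8", errors="replace")
```

For each Class 2 token, calling `fetch_as(token, "https://yourdomain.com/your-key-page")` and passing the result to `classify` gives a one-word verdict per crawler. Network failures such as DNS errors or timeouts are not handled here and will raise.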

Step 2: Confirm robots.txt parses as intended

Fetch robots.txt itself and read it as a crawler would: curl https://yourdomain.com/robots.txt. Check that no Class 2 retrieval token appears under a Disallow: /, and that no broad User-agent: * block has a restrictive Disallow that unintentionally catches retrieval crawlers. The robots.txt report in Google Search Console and the robots.txt tester in Bing Webmaster Tools both show how the file parses.
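
Both checks in this step can be automated with Python's standard-library `urllib.robotparser`. The sketch below reports which of your key URLs each retrieval crawler is disallowed from, whether through a named group or through an old `User-agent: *` rule. The sample file, the URLs, and the function name `audit_retrieval_access` are hypothetical, and the stdlib parser is an approximation of each crawler's own parser.

```python
from urllib.robotparser import RobotFileParser

RETRIEVAL_TOKENS = ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "Bingbot"]


def audit_retrieval_access(robots_text: str, urls: list) -> dict:
    """Map each retrieval token to the URLs robots.txt disallows for it."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    report = {}
    for token in RETRIEVAL_TOKENS:
        blocked = [url for url in urls if not parser.can_fetch(token, url)]
        if blocked:
            report[token] = blocked
    return report


# Hypothetical legacy file: an old wildcard rule plus one explicit block.
LEGACY_ROBOTS = """\
User-agent: *
Disallow: /blog/

User-agent: PerplexityBot
Disallow: /
"""

HOME = "https://yourdomain.com/"
POST = "https://yourdomain.com/blog/launch"

report = audit_retrieval_access(LEGACY_ROBOTS, [HOME, POST])
for token, blocked_urls in report.items():
    print(token, blocked_urls)
```

An empty report means no retrieval crawler is disallowed from the URLs you listed. To audit the live file instead of a string, `RobotFileParser` also offers `set_url()` followed by `read()`.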

Step 3: Check the CDN dashboard

If you use a CDN, open its bot-management settings (Cloudflare: Security → Bots) and confirm no AI-bot block or Bot Fight rule is challenging retrieval crawlers. This is the step that catches the invisible failures.

Step 4: Use a crawler-access checker

To do all of this without a terminal, LumenGEO's free AI crawler access checker fetches your site as each major AI crawler and reports, per class, whether you are reachable — surfacing both robots.txt and edge-layer blocks in one view.

Verification is non-optional. Fetch your pages with the real Class 2 user-agent strings and confirm a genuine 200; that test catches silent edge blocks. Then read robots.txt as a crawler sees it to catch rule errors, and check the CDN dashboard. A configuration you have not verified is a configuration you do not actually have.

Frequently Asked Questions

Does blocking GPTBot stop ChatGPT from citing my site?

No. GPTBot is a Class 1 training scraper — blocking it only excludes your content from OpenAI's model-training datasets. ChatGPT search citations are driven by OAI-SearchBot (the retrieval crawler) and ChatGPT-User (live user fetches). You can block GPTBot and remain fully citable in ChatGPT search, as long as OAI-SearchBot is allowed at both the robots.txt and CDN-edge layers.

Which AI crawlers should I never block if I want AI-search visibility?

The Class 2 retrieval crawlers: OAI-SearchBot, PerplexityBot, Claude-SearchBot, and Bingbot. These build the indexes AI search engines query to produce cited answers. Blocking any one of them removes you from that engine's AI search. The Class 3 user-triggered fetchers (ChatGPT-User, Perplexity-User, Claude-User) should also stay allowed for almost every site, since each request represents a live user.

What's the difference between OAI-SearchBot and ChatGPT-User?

OAI-SearchBot is OpenAI's retrieval crawler — it builds the search index that ChatGPT queries, and it is Class 2. ChatGPT-User is Class 3: it fetches a specific page in real time because a user (or a ChatGPT agent acting for a user) asked something that needs it. One builds a persistent index; the other is a per-request fetch. Both should be allowed if you want ChatGPT visibility, but they are different bots with different roles.

Can I allow AI search but block AI training?

Yes — this is fully supported and is the most popular deliberate configuration in 2026. Disallow the Class 1 training scrapers (GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent) and allow the Class 2 and Class 3 crawlers. OpenAI and Anthropic explicitly document that permitting search while disallowing training is a valid choice. See Goal 2 above for the copy-paste configuration.

Why is my site invisible to AI search even though my robots.txt allows everything?

Almost certainly the CDN edge layer. Cloudflare and similar CDNs can block AI crawlers at the network edge before robots.txt is ever read — via the "Block AI bots" managed rule, default-on blocking for domains onboarded since mid-2025, or Bot Fight Mode catching legitimate retrieval crawlers. A permissive robots.txt cannot override an edge block. Check your CDN's bot-management settings, not just robots.txt.

Do I need to list every AI crawler by name in robots.txt?

No. A simple User-agent: * with Allow: / already permits every AI crawler — you only need named blocks when you want to treat a specific crawler differently from the default. Naming retrieval crawlers explicitly is still useful as documentation and as a guard against a future restrictive User-agent: * rule, but it is not required for access. Conversely, the most common accidental block is an old restrictive Disallow under User-agent: *, not a missing Allow.

Should I block Google-Extended?

Be careful with this one. Google-Extended is mostly Class 1: blocking it opts your content out of Gemini training. It also governs Gemini grounding eligibility, so blocking it can reduce how often Gemini surfaces your content, which makes it the one training token with a potential citation cost. It has no effect on classic Google Search. If Gemini visibility matters to you, allow Google-Extended.

What user-agent strings does Anthropic use now?

Anthropic uses three current tokens: ClaudeBot (Class 1, training), Claude-SearchBot (Class 2, retrieval/search indexing), and Claude-User (Class 3, user-triggered fetches). Older guides reference Claude-Web and anthropic-ai; those are deprecated and you should configure against the current three-token scheme. Blocking ClaudeBot excludes you from training only — it does not affect Claude-SearchBot or Claude-User.

How do I test whether AI crawlers can actually reach my site?

From a terminal, fetch your key pages with the real retrieval-crawler user-agent strings: curl -A "OAI-SearchBot" -I https://yourdomain.com/page and the same for PerplexityBot, Claude-SearchBot, and Bingbot. An HTTP 200 with real content means the edge and server accept that user-agent; a 403, 429, or challenge page means something is blocking it. curl ignores robots.txt, so also read the robots.txt you are serving. Test several pages, check the CDN dashboard too, or use a free crawler-access checker to do all of it at once.

How long after fixing robots.txt will I be cited?

Fixing access removes the block — it does not instantly produce citations. Once retrieval crawlers can reach your pages, they become eligible for citation; re-crawl and re-indexing then take days to a few weeks. After that, content quality, structure, freshness, and third-party brand mentions determine whether you are actually cited. Correct crawler access is the prerequisite for AI-search visibility, not the finish line.