Why Your Site May Be Invisible to AI (The Cloudflare Block Most Sites Miss)

May 20, 2026 · 12 min read · LumenGEO Research

robots.txt · Cloudflare · AI crawlers · crawler access · technical GEO

Your site may be completely invisible to AI search engines without you knowing it — and the most common cause is not your content. Since mid-2025, Cloudflare and other CDNs block AI crawlers by default at the network edge, before your robots.txt is ever read. A site can have a perfectly permissive robots.txt and still be invisible to ChatGPT, Perplexity, and other AI engines because an edge rule is silently blocking them. This guide covers the two-layer crawler-access problem, the modern three-class crawler taxonomy, and how to diagnose whether AI engines can actually reach your site.

Most GEO advice assumes the AI engine can at least reach your content. Often it cannot — and the failure is invisible because it happens above the layer most people check. A brand can spend months on content optimization while every AI crawler is being turned away at the door. This article is the diagnostic.

Last updated: May 2026

Crawler access is a two-layer problem. Everyone checks robots.txt; almost no one checks the CDN edge layer above it. Since mid-2025, Cloudflare blocks AI crawlers by default at the edge — before robots.txt is read. A site with a permissive robots.txt can still be entirely invisible to AI engines. If you are not cited anywhere, check the edge layer before you touch your content.

The two-layer crawler-access problem

AI crawler access is controlled at two layers: the CDN/edge layer (Cloudflare and similar), which can block crawlers at the network level before any file is read, and the robots.txt layer, which the crawler reads only if it gets past the edge. A permissive robots.txt is worthless if the edge layer is blocking — and most site owners only ever check robots.txt.

When an AI crawler requests a page on your site, the request passes through layers before it reaches your content. Two of them control access:

Layer 1 — the CDN / edge layer. Most modern sites sit behind a CDN — Cloudflare, most commonly. The CDN inspects incoming requests at the network edge and can block them before they ever reach your server or your robots.txt. Cloudflare offers a "Block AI bots" managed rule, available on all plans, that does exactly this. And critically: since mid-2025, Cloudflare has blocked AI crawlers by default on new domains. Many site owners have this protection on and have no idea.

Layer 2 — the robots.txt layer. If a crawler gets past the edge, it then reads your robots.txt to see what it is allowed to access. This is the layer everyone knows about and checks.

The problem is the ordering. The edge layer sits above robots.txt. If Cloudflare blocks GPTBot at the edge, it does not matter what your robots.txt says — the crawler never gets far enough to read it. You can have a flawless, fully-permissive robots.txt and be completely invisible to AI search because of a single CDN toggle.

This is why so many sites are invisible to AI without knowing it. The site owner checks robots.txt, sees everything allowed, and concludes crawler access is fine. It is not. The block is one layer up, on a dashboard they never opened.

The edge layer sits above robots.txt and is checked first. Cloudflare blocks AI crawlers by default on new domains and offers a one-click block rule on all plans. A permissive robots.txt cannot override an edge block. Any AI-visibility diagnosis has to start at the edge layer — not robots.txt.
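
The ordering is easier to see as a small model. The sketch below is only an illustration of the two layers described above, not how any CDN is implemented; the `Site` fields and the outcome strings are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Site:
    # Hypothetical flags standing in for real configuration state.
    edge_blocks_ai_bots: bool    # is an edge rule blocking AI crawlers?
    robots_allows_crawler: bool  # does robots.txt allow this crawler?


def crawler_outcome(site: Site) -> str:
    """Model the order in which the two layers are evaluated."""
    # Layer 1: the edge decides first. On a block, robots.txt is never read.
    if site.edge_blocks_ai_bots:
        return "blocked at edge"
    # Layer 2: only reached when the edge lets the request through.
    if not site.robots_allows_crawler:
        return "disallowed by robots.txt"
    return "reachable"


# A fully permissive robots.txt does not help when the edge blocks first.
print(crawler_outcome(Site(edge_blocks_ai_bots=True, robots_allows_crawler=True)))
# prints: blocked at edge
```

The point of the model is that the second flag is never inspected when the first one is set, which is why the diagnosis has to start at the edge.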

The three-class crawler taxonomy

AI crawlers fall into three classes that should be treated differently: training scrapers (block if you want — they cost bandwidth and drive no referrals), retrieval agents (allow — these are what drive AI citations), and agentic browsers (situational). Blocking the wrong class either wastes bandwidth or makes you invisible to AI search.

"Block AI bots" sounds simple, but it is too blunt. AI crawlers are not one thing. The 2026 standard is a three-class taxonomy:

Class 1: Training scrapers — block if you choose

These crawlers collect content to train future AI models. They consume your bandwidth and return nothing directly — no referral traffic, no citations. Examples include GPTBot, CCBot, ClaudeBot, Meta-ExternalAgent, and Google-Extended. Blocking these is a defensible choice: it protects your content from training use at no citation cost. (One nuance: Google-Extended governs Gemini training and grounding — blocking it can reduce Gemini grounding, so if you want Gemini visibility, allowing it is the safer call.)

Class 2: Retrieval agents — always allow

These crawlers fetch content in real time to answer a user's query right now — and they are what produce AI citations. Examples: OAI-SearchBot and ChatGPT-User (ChatGPT), Claude-Web and anthropic-ai (Claude), PerplexityBot (Perplexity), Bingbot (Bing, and therefore ChatGPT and Copilot retrieval). Blocking any of these makes you invisible to that engine's citations. These must be allowed at both the edge and robots.txt layers — our guide to robots.txt for AI crawlers gives the exact allow directives for each one.

Class 3: Agentic browsers — situational

These are AI agents browsing on a user's behalf — ChatGPT-User running a task, Comet's agent. For most sites, allow them: an agent browsing your site is a potential customer's proxy. Block only if you have a specific reason.

The critical mistake the blunt "block AI bots" toggle makes is lumping Class 1 and Class 2 together. Block training scrapers if you want, but if the same rule catches the retrieval agents, you have traded a small bandwidth saving for total AI-search invisibility. Note also that Anthropic split its crawler: ClaudeBot is training (Class 1), Claude-Web and anthropic-ai are retrieval (Class 2), and Anthropic's current documentation also names Claude-SearchBot and Claude-User for search and user-initiated fetches. Treat the training and retrieval agents differently.

AI crawlers are three classes, not one. Block training scrapers if you choose; always allow retrieval agents (OAI-SearchBot, ChatGPT-User, PerplexityBot, Claude-Web, Bingbot) — they are what drive citations. The dangerous mistake is a blunt "block all AI bots" rule that catches the retrieval agents along with the training scrapers.
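
One way to keep the classes straight is a lookup table that you check any proposed block list against. The sketch below is a minimal illustration: the tokens and their classes are copied from the taxonomy above, while the function names and the two example rules are hypothetical. Real user-agent headers are longer strings, so matching is done by case-insensitive substring, which is an assumption of this example rather than how any CDN matches bots.

```python
# Tokens and classes are copied from the taxonomy above; verify each token
# against the vendor's own documentation before relying on it.
CRAWLER_CLASS = {
    "GPTBot": "training",
    "CCBot": "training",
    "ClaudeBot": "training",
    "Meta-ExternalAgent": "training",
    "Google-Extended": "training",
    "OAI-SearchBot": "retrieval",
    "ChatGPT-User": "retrieval",  # also listed above as an agentic browser
    "Claude-Web": "retrieval",
    "anthropic-ai": "retrieval",
    "PerplexityBot": "retrieval",
    "Bingbot": "retrieval",
}


def classify(user_agent: str) -> str:
    """Return the class of the first known token found in a User-Agent header."""
    lowered = user_agent.lower()
    for token, crawler_class in CRAWLER_CLASS.items():
        if token.lower() in lowered:
            return crawler_class
    return "unknown"


def retrieval_casualties(blocked_tokens) -> list:
    """List the retrieval agents that a proposed block list would catch."""
    return sorted(t for t in blocked_tokens if CRAWLER_CLASS.get(t) == "retrieval")


blunt_rule = list(CRAWLER_CLASS)  # stands in for "block all AI bots"
targeted_rule = ["GPTBot", "CCBot", "ClaudeBot", "Meta-ExternalAgent"]

print(retrieval_casualties(blunt_rule))     # every retrieval agent is caught
print(retrieval_casualties(targeted_rule))  # prints: []
```

A non-empty result from `retrieval_casualties` is the warning sign: the rule is costing you citations, not just saving bandwidth.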

How to diagnose AI invisibility

Diagnose AI invisibility in four steps: check the CDN edge layer for AI-bot blocking, fetch your key pages with retrieval-agent user agents to confirm a 200 response, audit robots.txt against the three-class taxonomy, and confirm your sitemap is current — AI crawlers now actively consume sitemaps.

If your brand is not cited anywhere across AI engines, run this diagnostic before touching content.

Step 1: Check the CDN edge layer

Open your CDN dashboard. On Cloudflare: Security → Bots. Look for any "Block AI bots" or AI-crawler managed rule, and any custom firewall rules targeting bot traffic. Confirm that the Class 2 retrieval agents are allowed. This is the step almost everyone skips, and it is the most common cause of total AI invisibility — so do it first.

Step 2: Fetch your pages as a retrieval agent

Test what an AI crawler actually sees. From a terminal, request your key pages with retrieval-agent user-agent strings and confirm you get an HTTP 200 with the real content:

curl -A "OAI-SearchBot" -I https://yoursite.com/your-key-page
curl -A "PerplexityBot" -I https://yoursite.com/your-key-page

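A 200 means the request got through to the page. A 403, 429, or a challenge page means something at the edge or on the server (a CDN bot rule, a firewall rule, a rate limiter) is blocking it. Two limits of this test are worth knowing. First, curl never consults robots.txt, so a robots.txt disallow will not show up here; that is what Step 3 is for. Second, -I requests headers only, so drop it when you want to confirm the body is real content rather than a challenge page. Some edge rules also verify crawlers by IP address, so a spoofed user agent from your own machine is an approximation, not proof. Test several pages, not just the homepage.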
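
If you have more than a handful of pages, the same check can be scripted. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, the user-agent tokens are the two used in the curl commands above, and the status interpretation simply mirrors the rule of thumb in this step. It looks at the status code only, so a challenge page served with a 200 would still need a manual look at the body.

```python
import urllib.error
import urllib.request

# Tokens taken from the curl commands above.
RETRIEVAL_AGENTS = ["OAI-SearchBot", "PerplexityBot"]


def interpret(status: int) -> str:
    """Map an HTTP status to the rule of thumb used in this step."""
    if status == 200:
        return "reachable"
    if status in (403, 429):
        return "blocked"
    return "inspect manually"


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send a GET with the given User-Agent and return the final HTTP status."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what we are after.
        return err.code


def check(urls, agents=RETRIEVAL_AGENTS) -> None:
    """Print one line per (page, agent) pair."""
    for url in urls:
        for agent in agents:
            status = fetch_status(url, agent)
            print(f"{agent:15} {status} {interpret(status)} {url}")
```

To run it, call `check(["https://yoursite.com/your-key-page"])` with your own URLs. Network failures such as DNS errors or timeouts are deliberately left unhandled so they surface as exceptions rather than being mistaken for a block.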

Step 3: Audit robots.txt against the three-class taxonomy

Read your robots.txt. Confirm no Class 2 retrieval agent is disallowed. Watch especially for a broad User-agent: * rule with restrictive Disallow directives that unintentionally catches AI crawlers. Name the key retrieval agents explicitly (OAI-SearchBot, ChatGPT-User, PerplexityBot, Claude-Web, Bingbot) with Allow: / so there is no ambiguity.
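
The audit can be automated with Python's built-in `urllib.robotparser`. The sketch below is illustrative: the helper name and the sample robots.txt are made up for this example, and the agent list is the one named in this step. One caveat: Python's parser applies the first matching rule within a group, which can differ from how individual crawlers resolve conflicting Allow and Disallow lines, so treat the result as a first pass.

```python
from urllib.robotparser import RobotFileParser

# Retrieval agents named in this step.
RETRIEVAL_AGENTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "Claude-Web", "Bingbot"]


def disallowed_agents(robots_txt: str, url: str, agents=RETRIEVAL_AGENTS) -> list:
    """Return the agents that this robots.txt text forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt: a restrictive wildcard group plus two explicit allows.
SAMPLE = """\
User-agent: *
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""

print(disallowed_agents(SAMPLE, "https://yoursite.com/your-key-page"))
# prints: ['ChatGPT-User', 'Claude-Web', 'Bingbot']
```

The sample shows the failure mode described above: the two agents with explicit groups are fine, while the three without one fall through to the restrictive `User-agent: *` group.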

Step 4: Confirm your sitemap is current

As of 2026, AI crawlers including GPTBot and ClaudeBot actively consume XML sitemaps for content discovery. A stale sitemap — missing new pages, listing dead URLs — now harms AI discovery, not just traditional indexing. Make sure sitemap.xml is current and that every important page is listed.
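
A quick way to catch missing pages is to compare the sitemap against a list of URLs you consider important. The sketch below assumes a standard `urlset` sitemap in the sitemaps.org namespace; the function names and sample data are hypothetical, and sitemap index files (a sitemap that lists other sitemaps) are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> set:
    """Extract every <loc> value from a urlset sitemap."""
    root = ET.fromstring(sitemap_xml)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def missing_pages(sitemap_xml: str, important_pages) -> list:
    """Return the important pages that the sitemap does not list."""
    listed = sitemap_urls(sitemap_xml)
    return sorted(page for page in important_pages if page not in listed)


# Hypothetical sitemap and page list for illustration.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yoursite.com/</loc></url>
  <url><loc>https://yoursite.com/pricing</loc></url>
</urlset>
"""

print(missing_pages(SAMPLE_SITEMAP, [
    "https://yoursite.com/pricing",
    "https://yoursite.com/new-guide",
]))
# prints: ['https://yoursite.com/new-guide']
```

Dead URLs are the other half of a stale sitemap. Those need an HTTP request per listed URL, which this sketch leaves out.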

One honest caveat

Cloudflare has documented Perplexity using undeclared crawlers that rotate IP addresses to bypass robots.txt. The practical takeaway is not "blocking is futile" but rather: blocking AI crawlers is unreliable in both directions, so the sound strategy is open, well-structured content rather than gating. If you want AI visibility, the goal is to make access easy and unambiguous.

The four-step diagnostic — edge layer, fetch-as-agent test, robots.txt audit, sitemap check — finds AI invisibility in minutes. Run it in that order: the edge layer is the most common and most overlooked cause. If retrieval agents get a 200 on your key pages and your sitemap is current, access is not your problem and you can move on to content.

Frequently asked questions

How can my site be invisible to AI if my robots.txt allows everything?

Because crawler access is controlled at two layers, and robots.txt is the second one. The CDN edge layer (Cloudflare and similar) sits above robots.txt and is checked first. If the edge layer blocks an AI crawler, the crawler never reaches your robots.txt — your permissive rules are irrelevant. Since mid-2025 Cloudflare blocks AI crawlers by default on new domains, so many sites are edge-blocked without their owners knowing.

What is the Cloudflare AI-bot block?

Cloudflare offers a managed "Block AI bots" rule, available on all plans, that blocks AI crawlers at the network edge. Since mid-2025 it is also enabled by default on new domains. It blocks crawlers before your robots.txt is read. To allow AI engines to reach your site, check Cloudflare → Security → Bots and confirm retrieval agents are not blocked.

Which AI crawlers should I allow?

Always allow the Class 2 retrieval agents, because they drive citations: OAI-SearchBot and ChatGPT-User (ChatGPT), PerplexityBot (Perplexity), Claude-Web and anthropic-ai (Claude), and Bingbot (Bing, which powers ChatGPT and Copilot retrieval). You may block Class 1 training scrapers (GPTBot, CCBot, ClaudeBot, Meta-ExternalAgent) if you want to protect content from training use. That is a defensible choice with no citation cost.

What's the difference between training scrapers and retrieval agents?

Training scrapers collect content to train future AI models — they cost bandwidth and return nothing directly. Retrieval agents fetch content in real time to answer a user's query right now, and they are what produce AI citations. Blocking training scrapers is optional and harmless to citations; blocking retrieval agents makes you invisible to that engine's AI search.

How do I test whether AI crawlers can reach my site?

From a terminal, fetch your key pages with retrieval-agent user-agent strings: curl -A "OAI-SearchBot" -I https://yoursite.com/page and curl -A "PerplexityBot" -I https://yoursite.com/page. An HTTP 200 with real content means the crawler can reach the page. A 403, 429, or challenge page means something is blocking it. Test several pages, not just the homepage.

Does my sitemap affect AI visibility?

Yes — as of 2026 it does. AI crawlers including GPTBot and ClaudeBot actively consume XML sitemaps for content discovery. A stale sitemap (missing new pages, listing dead URLs) now harms AI discovery, not just traditional Google indexing. Keep sitemap.xml current and ensure every important page is listed.

Should I block AI crawlers to protect my content?

It is a defensible choice for training scrapers (Class 1) — it protects content from training use at no citation cost. But never block retrieval agents (Class 2) if you want AI visibility. Also note: Cloudflare has documented some crawlers using undeclared, IP-rotating bots to bypass blocks, so blocking is unreliable anyway. If you want AI citations, open, well-structured content beats gating.

What is the single most common cause of AI invisibility?

The CDN edge layer blocking AI crawlers — most often Cloudflare's default or managed AI-bot block — while the site owner only ever checks robots.txt. The block is one layer above robots.txt, on a dashboard the owner has not opened. It is the first thing to check in any AI-visibility diagnosis.

Does Google-Extended need special handling?

Yes. Google-Extended governs Gemini's training and grounding. Blocking it does not affect classic Google Search, but it can reduce Gemini grounding — so if you want Gemini and AI Overviews visibility, allowing Google-Extended is the safer call. It is the one Class 1 crawler where blocking has a potential citation cost.

I fixed crawler access — how long until I'm cited?

Fixing access removes the block; it does not instantly produce citations. Once retrieval agents can reach your pages, they become eligible for citation — then content quality, structure, freshness, and brand mentions determine whether you are actually cited. Expect re-crawl and re-evaluation over days to a few weeks. Access is the prerequisite, not the finish line.

How does this connect to agent-readiness?

Directly. The same edge-layer and robots.txt access that retrieval crawlers need, AI agents browsing on a user's behalf also need. Two-layer crawler access is point five of the agent-readiness checklist. If your site is invisible to AI crawlers, it is also unusable by AI agents — fixing access serves both.
