How to Verify AI Crawler Traffic Is Real (and Catch Spoofed Bots)

June 27, 2026 · 8 min read · LumenGEO Research

AI crawlers, bot verification, GPTBot, PerplexityBot, reverse DNS, spoofed bots, robots.txt, first-party data

A user-agent string is just text the visitor sends: anyone can type GPTBot into it, and plenty do. We know because we logged it. Across 3,392 AI-bot hits to lumengeo.co in two weeks, 2.0% were provably spoofed, and a single IP claimed to be ChatGPT-User, GPTBot, OAI-SearchBot, and PerplexityBot within the same day. The only way to know an AI crawler is who it claims to be is to verify the source IP, either by matching it against the vendor's published IP ranges or, when no ranges exist, by forward-confirmed reverse DNS (FCrDNS). This guide gives the exact method per engine, with the official source for each.

Last updated: June 2026

Reviewed June 2026

There are exactly two trustworthy ways to verify an AI crawler. (1) Match the request's source IP against the vendor's published IP-range file (OpenAI, Anthropic, Perplexity, and Google publish these). (2) Where no list exists (Bingbot), use forward-confirmed reverse DNS. The user-agent string alone proves nothing — it is attacker-controlled input.

Why you can't trust the user agent

The user-agent header is a field the client fills in. Your server doesn't assign it; the visitor does. A scraper, a competitor, or a malicious bot can send User-Agent: Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot) just as easily as OpenAI's real crawler does. Nothing about the string is cryptographically bound to its claimed owner.
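
To make this concrete, here is a minimal Python sketch (standard library only) showing that the header is whatever the client chooses to set. The URL https://example.com/ is a placeholder and the request is built but never sent; the user-agent string is the one quoted above.

```python
from urllib.request import Request

# The GPTBot string quoted in the paragraph above, typed by hand.
CLAIMED_UA = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"

# Build (but do not send) a request to a placeholder URL.
# Nothing validates the header: the client simply sets it.
request = Request("https://example.com/", headers={"User-Agent": CLAIMED_UA})

# urllib stores header names capitalized as "User-agent".
spoofed_header = request.get_header("User-agent")
print(spoofed_header)
```

From the server's side this request would look identical, header for header, to one that claims the same identity legitimately, which is why the check has to move to the source IP.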

This is not theoretical. In our first-party study of fake AI bot traffic, we checked every AI-bot hit to our server against the relevant vendor's published IP ranges. Across 3,392 hits logged 14–28 June 2026, 2.0% failed verification: the source IP fell outside the published ranges of the vendor named in the user agent (OpenAI, Anthropic, or Perplexity). A further 40% were unverifiable, meaning the claimed bot doesn't expose a verification method we could check (more on that distinction below). The most blatant case: on 19 June, a single IP address rotated through four identities (ChatGPT-User, GPTBot, OAI-SearchBot, and PerplexityBot) in one day. No real crawler does that. That IP was a spoofer wearing whatever costume it thought would get past a filter.

The honest framing matters: unverifiable is not the same as fake. Some genuine, well-behaved bots don't publish IP ranges, so their real traffic lands as "unverifiable." Verification tells you which hits you can trust — not which hits are definitely malicious. If you're going to block, rate-limit, or report on AI traffic, verify first.
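
One way to keep that distinction honest in tooling is to make the result three-valued rather than a yes/no flag. The sketch below is illustrative only: the Verdict names, the classify function, and the shape of ranges_by_bot are assumptions, and the sample ranges are reserved documentation blocks, not real vendor data.

```python
import ipaddress
from enum import Enum


class Verdict(Enum):
    VERIFIED = "verified"
    SPOOFED = "spoofed"
    UNVERIFIABLE = "unverifiable"


def classify(claimed_bot, source_ip, ranges_by_bot):
    """Return a Verdict for one hit.

    ranges_by_bot maps a bot name to a list of CIDR strings. A bot that is
    missing from the mapping has no range list we can check against.
    """
    cidrs = ranges_by_bot.get(claimed_bot)
    if cidrs is None:
        # No published list to test against: unknown, not proven fake.
        return Verdict.UNVERIFIABLE
    address = ipaddress.ip_address(source_ip)
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr, strict=False)
        # Only compare prefixes of the same address family (IPv4 vs IPv6).
        if network.version == address.version and address in network:
            return Verdict.VERIFIED
    return Verdict.SPOOFED


# Placeholder ranges from the reserved documentation blocks.
EXAMPLE_RANGES = {"GPTBot": ["192.0.2.0/24", "2001:db8::/32"]}

print(classify("GPTBot", "192.0.2.10", EXAMPLE_RANGES))
print(classify("GPTBot", "203.0.113.5", EXAMPLE_RANGES))
print(classify("SomeOtherBot", "203.0.113.5", EXAMPLE_RANGES))
```

Keeping UNVERIFIABLE as its own bucket means a report can show it separately instead of folding it into either the trusted or the spoofed count.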

Method 1: Match the IP against published ranges

This is the strongest method, and it's the one to use wherever the vendor offers it. Each of these companies publishes a machine-readable file listing the CIDR ranges their crawlers operate from. The logic is simple: take the source IP of the request, and check whether it falls inside any published range for the bot it claims to be. If yes, it's genuine. If no, it's spoofed (or at least, not from that vendor).

The official files, confirmed against each vendor's own documentation as of June 2026:

Vendor	Crawler(s)	Published IP file
OpenAI	GPTBot	`https://openai.com/gptbot.json`
OpenAI	OAI-SearchBot	`https://openai.com/searchbot.json`
OpenAI	ChatGPT-User	`https://openai.com/chatgpt-user.json`
Anthropic	ClaudeBot, Claude-User, Claude-SearchBot	`https://claude.com/crawling/bots.json`
Perplexity	PerplexityBot	`https://www.perplexity.ai/perplexitybot.json`
Perplexity	Perplexity-User	`https://www.perplexity.ai/perplexity-user.json`
Google	Googlebot + common crawlers	`https://developers.google.com/static/crawling/ipranges/common-crawlers.json`

OpenAI documents these at developers.openai.com/api/docs/bots, with a separate file per crawler. Anthropic now publishes a single combined list and states plainly: "If a crawler has a source IP address on this list, it indicates that the crawler is coming from Anthropic." That is a reversal of its earlier "we don't publish IP ranges" position. Perplexity documents both files at docs.perplexity.ai and recommends combining user-agent matching with IP verification. Google publishes several IP-range files (common crawlers, special-case crawlers, and user-triggered fetchers), all in CIDR notation.

The matching logic, in pseudocode:

1. Read the request's source IP and its claimed user agent.
2. Pick the vendor's IP file for that user agent (e.g. GPTBot -> gptbot.json).
3. Fetch + cache the file (a JSON object whose "prefixes" array lists the CIDR blocks).
4. For each prefix, test: does the source IP fall inside this CIDR block?
5. If any prefix matches -> verified genuine.
   If none match -> treat as spoofed / not from that vendor.

A concrete check in shell, using grepcidr (or any CIDR library in your language of choice):

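# Verify a request that claims to be GPTBot
# (illustrative IP from a current GPTBot range; ranges change, so refresh the file)
IP="132.196.86.20"

# Save the published prefixes, one CIDR per line (IPv4 or IPv6).
# gptbot-ranges.txt is an arbitrary local cache file name.
curl -fsS https://openai.com/gptbot.json \
  | jq -r '.prefixes[] | .ipv4Prefix // .ipv6Prefix' > gptbot-ranges.txt

# grepcidr takes the CIDR list as its pattern (-f FILE) and the IP as input.
if echo "$IP" | grepcidr -f gptbot-ranges.txt > /dev/null; then
  echo "VERIFIED GPTBot"
else
  echo "SPOOFED: not in OpenAI's published range"
fi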
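
For readers who would rather use a library than grepcidr, the same matching logic can be sketched with Python's standard ipaddress module. This assumes the file has the shape the jq filter above implies (a prefixes array whose entries carry either ipv4Prefix or ipv6Prefix). The sample document uses reserved documentation ranges rather than real vendor data, and the function names are illustrative.

```python
import ipaddress
import json

# Sample document in the assumed shape; the ranges are reserved
# documentation blocks, not real crawler ranges.
SAMPLE_FILE = json.loads("""
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
""")


def load_networks(document):
    """Turn the 'prefixes' array into ip_network objects (IPv4 and IPv6)."""
    networks = []
    for entry in document.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def ip_in_published_ranges(source_ip, networks):
    """True if source_ip falls inside any of the given networks."""
    address = ipaddress.ip_address(source_ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed-family lists are safe to scan in one pass.
    return any(address in network for network in networks)


networks = load_networks(SAMPLE_FILE)
print(ip_in_published_ranges("192.0.2.77", networks))
print(ip_in_published_ranges("2001:db8::1", networks))
print(ip_in_published_ranges("198.51.100.9", networks))
```

In a real deployment the document would come from the vendor URL for the claimed bot rather than from an inline sample.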

Two operational notes. First, cache the files and refresh them on a schedule — vendors update ranges, and a stale list will start failing real bots. Perplexity explicitly recommends an automated refresh. Second, OpenAI, Anthropic, and Perplexity publish both IPv4 and IPv6 prefixes; make sure your matcher handles both address families, or IPv6 traffic will silently fail verification.
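
The first note can be sketched as a small time-based cache. Everything here is an assumption for illustration: the RangeCache name, the 24-hour default, and the choice to keep serving the previous copy when a refresh fails. The fetch function is injected, so the demo runs offline with a fake fetcher and a fake clock.

```python
import time


class RangeCache:
    """Cache one vendor file and re-fetch it after max_age_seconds."""

    def __init__(self, fetch, max_age_seconds=24 * 60 * 60, clock=time.monotonic):
        self._fetch = fetch              # callable returning the parsed file
        self._max_age = max_age_seconds  # assumed default: one day
        self._clock = clock              # injectable so tests can fake time
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._max_age
        if stale:
            try:
                self._value = self._fetch()
                self._fetched_at = now
            except OSError:
                # Network failure: keep serving the old copy if we have one.
                if self._value is None:
                    raise
        return self._value


# Offline demo with a fake fetcher and a fake clock.
calls = []
fake_now = [0.0]


def fake_fetch():
    calls.append(1)
    return {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}]}


cache = RangeCache(fake_fetch, max_age_seconds=3600, clock=lambda: fake_now[0])
cache.get()
cache.get()            # served from cache, no second fetch
fake_now[0] = 3600.0
cache.get()            # old enough to trigger a refresh
print(len(calls))
```

In production the injected fetch would download and parse the vendor file, for example with urllib.request.urlopen and json.load. Falling back to the previous copy trades a slightly stale list for not rejecting every real crawler during a brief outage.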

Published-IP matching is exact and tamper-proof: a spoofer can copy a user agent but cannot send packets from inside OpenAI's, Anthropic's, Perplexity's, or Google's network. Use this method first for any bot whose vendor publishes ranges, and refresh the lists on a cron so you don't start rejecting real crawlers when the ranges change.

Method 2: Reverse DNS and forward-confirmed reverse DNS (FCrDNS)

Some legitimate crawlers — most notably Bingbot — deliberately do not publish an IP list. Microsoft's reasoning is that its crawl IPs change frequently, so a hardcoded list would go stale. For these, the verification standard is forward-confirmed reverse DNS. It also works as a universal fallback for any bot when you can't find a published range.

FCrDNS is a three-step round trip, and the round trip is the whole point:

1. Reverse lookup the source IP. This returns a hostname (the PTR record).
2. Confirm the hostname belongs to the vendor's domain: for Bingbot it must end in .search.msn.com; for Googlebot it must end in .googlebot.com, .google.com, or .googleusercontent.com.
3. Forward lookup that hostname. The A/AAAA record it returns must resolve back to the same IP you started with.

Why all three steps? An attacker can set the reverse-DNS (PTR) record on an IP they control to say anything — including crawl-1.search.msn.com. What they cannot do is make the vendor's authoritative DNS forward-resolve that hostname back to their IP. Step 3 closes the loophole. If the forward lookup doesn't return your original IP, the reverse record was a lie.

Here's the full check for Bingbot, using host (works the same with dig):

# Step 1 — reverse lookup the IP from your logs
$ host 157.55.33.18
18.33.55.157.in-addr.arpa domain name pointer msnbot-157-55-33-18.search.msn.com.

# Step 2 — confirm the hostname ends in .search.msn.com  ✓

# Step 3 — forward lookup the hostname; must return the same IP
$ host msnbot-157-55-33-18.search.msn.com
msnbot-157-55-33-18.search.msn.com has address 157.55.33.18   ✓  VERIFIED

Both ends agree on 157.55.33.18 and the hostname is under search.msn.com, so this is genuine Bingbot. Google works identically — reverse-lookup the IP, confirm the domain is one of Google's three, then forward-confirm it resolves back. Google also publishes CIDR files if you'd rather do IP matching, but reverse DNS is the method that needs no list to maintain.
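
The same round trip can be sketched in Python with the standard socket module. The function names are illustrative, not an official API. By default the helpers do real DNS lookups through socket.gethostbyaddr and socket.getaddrinfo; the demo at the bottom swaps in fake lookup tables (built from the example above plus a made-up spoofer on a documentation IP) so it runs without network access.

```python
import ipaddress
import socket


def reverse_lookup(ip):
    """PTR lookup: return the hostname for ip, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """A/AAAA lookup: return the set of addresses the hostname resolves to."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return set()
    return {info[4][0] for info in infos}


def fcrdns_verified(ip, allowed_suffixes, reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check for one source IP."""
    # Step 1: reverse lookup.
    hostname = reverse(ip)
    if not hostname:
        return False
    hostname = hostname.rstrip(".").lower()
    # Step 2: dot-anchored suffix match, so "notsearch.msn.com" cannot pass.
    if not any(hostname.endswith("." + suffix) for suffix in allowed_suffixes):
        return False
    # Step 3: the hostname must resolve back to the original IP.
    target = ipaddress.ip_address(ip)
    return any(ipaddress.ip_address(addr) == target for addr in forward(hostname))


# Offline demo: fake PTR and A tables standing in for DNS.
FAKE_PTR = {
    "157.55.33.18": "msnbot-157-55-33-18.search.msn.com.",
    "203.0.113.7": "crawl-1.search.msn.com.",  # spoofer-controlled PTR
}
FAKE_A = {"msnbot-157-55-33-18.search.msn.com": {"157.55.33.18"}}


def fake_forward(hostname):
    return FAKE_A.get(hostname, set())


genuine = fcrdns_verified("157.55.33.18", ["search.msn.com"], FAKE_PTR.get, fake_forward)
spoofed = fcrdns_verified("203.0.113.7", ["search.msn.com"], FAKE_PTR.get, fake_forward)
print(genuine, spoofed)
```

The spoofed case passes steps 1 and 2 and fails only at step 3, which is the loophole the forward lookup exists to close.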

Per-engine cheat sheet

The fastest reference. Use IP-range matching where it exists; fall back to FCrDNS where it doesn't.

Bot	How to verify	Official source
GPTBot / OAI-SearchBot / ChatGPT-User (OpenAI)	Match source IP against the per-bot CIDR file	`openai.com/gptbot.json`, `/searchbot.json`, `/chatgpt-user.json`
ClaudeBot / Claude-User / Claude-SearchBot (Anthropic)	Match source IP against the published list	`claude.com/crawling/bots.json`
PerplexityBot / Perplexity-User (Perplexity)	Match user agent and source IP against the CIDR file	`perplexity.ai/perplexitybot.json`, `/perplexity-user.json`
Bingbot (Microsoft)	FCrDNS — reverse + forward confirm to `*.search.msn.com` (no IP list published)	Bing Webmaster Tools docs
Googlebot (Google)	FCrDNS to `googlebot.com` / `google.com` / `googleusercontent.com`, or match the published CIDR files	`developers.google.com/static/crawling/ipranges/common-crawlers.json`
Google-Extended (Google)	Not verifiable in logs — it's a robots.txt token, not a crawler (see note)	Google Search Central

About Google-Extended: it is a robots.txt control token, not a fetching crawler. It has no user-agent string of its own and no separate IP — Google's existing crawlers do the fetching, and the token only governs whether your content is eligible for AI training and Gemini/Vertex grounding. You will never see "Google-Extended" in your access logs, so there is nothing to verify. Treat it as a robots.txt directive only. We cover its behaviour in the AI crawler list and the robots.txt guide.
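
The cheat sheet translates naturally into a lookup table that picks a verification method from the claimed user agent. This is a sketch, not a complete bot registry: the VERIFICATION_PLAN and plan_for names are made up for illustration, the URLs and domains are copied from the tables in this article, and the simple substring match on the token is an assumption about how you parse user agents.

```python
IP_RANGES = "ip-ranges"
FCRDNS = "fcrdns"

# Claimed token -> (method, source). Google-Extended is deliberately absent:
# it is a robots.txt token and never appears as a user agent.
VERIFICATION_PLAN = {
    "GPTBot": (IP_RANGES, "https://openai.com/gptbot.json"),
    "OAI-SearchBot": (IP_RANGES, "https://openai.com/searchbot.json"),
    "ChatGPT-User": (IP_RANGES, "https://openai.com/chatgpt-user.json"),
    "ClaudeBot": (IP_RANGES, "https://claude.com/crawling/bots.json"),
    "Claude-User": (IP_RANGES, "https://claude.com/crawling/bots.json"),
    "Claude-SearchBot": (IP_RANGES, "https://claude.com/crawling/bots.json"),
    "PerplexityBot": (IP_RANGES, "https://www.perplexity.ai/perplexitybot.json"),
    "Perplexity-User": (IP_RANGES, "https://www.perplexity.ai/perplexity-user.json"),
    "Bingbot": (FCRDNS, ("search.msn.com",)),
    "Googlebot": (FCRDNS, ("googlebot.com", "google.com", "googleusercontent.com")),
}


def plan_for(user_agent):
    """Return (token, (method, source)) for a claimed bot, or None if unknown."""
    ua = user_agent.lower()
    for token, plan in VERIFICATION_PLAN.items():
        if token.lower() in ua:
            return token, plan
    return None


print(plan_for("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"))
print(plan_for("Mozilla/5.0 (compatible; bingbot/2.0)"))
print(plan_for("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"))
```

A None result means the hit does not claim to be any bot in the table, so it falls outside this verification flow entirely.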

Red flags that scream "spoofed"

Even before you run a verification check, certain patterns are giveaways. Each of these should drop your trust in a hit and push you to verify the IP:

One IP claiming multiple vendors. Our 19 June example — a single address posing as ChatGPT-User, GPTBot, OAI-SearchBot, and PerplexityBot in one day — is the clearest tell. Real crawlers from different companies never share an IP.
A "browser-driven" agent coming from a datacenter ASN. ChatGPT-User, Perplexity-User, and Claude-User represent a live human action. If the user agent says it's a user-triggered fetch but the IP belongs to a bulk hosting provider (DigitalOcean, OVH, Hetzner) rather than the vendor's range, be suspicious.
Reverse DNS that doesn't match — or doesn't forward-confirm. A PTR record pointing to a hostname outside the vendor's domain, or one that fails the forward-confirm step, is a manufactured identity.
Sudden volume spikes from a single ASN. Genuine crawlers spread load and ramp gradually. Thousands of "GPTBot" hits appearing from one autonomous system in an hour is far more likely to be a scraper borrowing the name.
IPv4-only claims for a vendor that publishes IPv6. Less common, but spoofers often forget that OpenAI and others operate IPv6 ranges too.

The single highest-value red flag is one IP wearing multiple vendor costumes. If you log only one anti-pattern, log that — it caught the most obvious spoofer in our two-week sample and requires no DNS lookups to spot.
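
That check needs nothing more than a pass over one day's log entries. The sketch below assumes each hit is already parsed into a (source IP, user agent) pair; the function name, the token-to-vendor table, and the sample hits (on reserved documentation IPs) are all illustrative.

```python
from collections import defaultdict

# Claimed token -> vendor, limited to the bots discussed in this article.
VENDOR_BY_TOKEN = {
    "ChatGPT-User": "OpenAI",
    "GPTBot": "OpenAI",
    "OAI-SearchBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "Claude-User": "Anthropic",
    "Claude-SearchBot": "Anthropic",
    "PerplexityBot": "Perplexity",
    "Perplexity-User": "Perplexity",
}


def ips_claiming_multiple_vendors(hits):
    """hits: iterable of (source_ip, user_agent) pairs for one day.

    Returns {ip: sorted vendor names} for every IP that claimed to be
    crawlers from two or more different vendors.
    """
    vendors_by_ip = defaultdict(set)
    for ip, user_agent in hits:
        ua = user_agent.lower()
        for token, vendor in VENDOR_BY_TOKEN.items():
            if token.lower() in ua:
                vendors_by_ip[ip].add(vendor)
    return {ip: sorted(v) for ip, v in vendors_by_ip.items() if len(v) > 1}


# Made-up sample hits on documentation IPs.
SAMPLE_HITS = [
    ("203.0.113.9", "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"),
    ("203.0.113.9", "Mozilla/5.0 (compatible; PerplexityBot/1.0)"),
    ("192.0.2.4", "Mozilla/5.0 (compatible; GPTBot/1.0)"),
    ("192.0.2.4", "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"),
]

flagged = ips_claiming_multiple_vendors(SAMPLE_HITS)
print(flagged)
```

An IP that only switches between bots of the same vendor (the second address in the sample) is not flagged here, since one vendor's crawlers sharing infrastructure is not in itself a spoofing signal.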

A note on Perplexity (verify carefully)

Perplexity's verification story is more contested than the others, so be precise. Perplexity does publish official IP-range files and recommends combining user-agent and IP checks. But in August 2025, Cloudflare published a report alleging Perplexity used "stealth, undeclared crawlers" to fetch pages that had blocked its declared crawler, rotating through IPs outside its published ranges and impersonating a generic Chrome-on-macOS browser. The practical implication for verification: a hit that matches Perplexity's published ranges is genuine, but the absence of a match does not by itself prove a fetch wasn't Perplexity-related activity. Verify against the published list, and treat undeclared browser-like traffic on its own merits rather than assuming it's covered by Perplexity's official ranges.

Do it without code

You don't need to build any of this to check a single suspicious hit. Two options:

Our free AI-crawler check tool lets you paste an IP and a claimed user agent and tells you whether it verifies against the vendor's published ranges — no setup, no log parsing.
Your CDN's verified-bot features. Cloudflare, Fastly, and others run the IP-range matching and FCrDNS checks for major bots automatically and expose a "verified bot" signal in their dashboards and firewall rules. If you're already behind a CDN, this is the lowest-effort way to get verification at the edge — though always confirm which bots a given provider verifies, since coverage varies.

For the bigger picture on turning verified crawls into measurable outcomes, see how to track AI search traffic and our ongoing AI crawler traffic study.

FAQ

Can AI crawler user agents be faked?

Yes, trivially. The user-agent string is supplied by the client, so any visitor can claim to be GPTBot, ClaudeBot, or PerplexityBot. In our first-party logs, 2.0% of AI-bot hits were provably spoofed — the source IP didn't belong to the vendor named in the user agent. Never make a blocking, allow-listing, or reporting decision on the user-agent string alone.

What's the most reliable way to verify an AI crawler?

Match the request's source IP against the vendor's published IP-range file. OpenAI, Anthropic, Perplexity, and Google all publish machine-readable CIDR lists. If the IP falls inside a published range for the bot it claims to be, it's genuine; if not, it's spoofed. This is tamper-proof because a spoofer can copy a string but cannot send traffic from inside the vendor's network.

How do I verify Bingbot if Microsoft doesn't publish IP ranges?

Use forward-confirmed reverse DNS. Reverse-lookup the source IP and confirm the hostname ends in .search.msn.com, then forward-lookup that hostname and confirm it resolves back to the same IP. Both ends must agree. Microsoft deliberately avoids publishing IPs because they change often, so FCrDNS is the official method for Bingbot. Googlebot can be verified the same way against googlebot.com, google.com, or googleusercontent.com.

Does an unverifiable hit mean the bot is fake?

No. Unverifiable means the claimed bot doesn't expose a verification method you can check, or your tooling doesn't yet support it — not that the hit is malicious. Some genuine, well-behaved crawlers fall into this bucket. Verification tells you which hits you can trust; treat the rest with caution and confirm before acting, rather than assuming they're fake.

How do I verify Google-Extended?

You can't, and you don't need to. Google-Extended is a robots.txt control token, not a crawler. It has no user-agent string and no separate IP — Google's normal crawlers do the fetching, and the token only controls whether your content is eligible for AI training and Gemini/Vertex grounding. It will never appear in your server logs, so there is nothing to verify; treat it purely as a robots.txt directive.