The State of AI Search Stability 2026: 150 Queries, 450 Samples, 30 Verticals
You cannot measure AI-search visibility with one check. We ran 150 commercial buyer queries three times each (450 samples), and roughly 1 in 9 of the domains the retrievable web surfaced changed between pulls. Mean overlap between repeated pulls was high (Jaccard 0.944), but 10.6% of domain appearances were sample-dependent: there one moment, gone the next. The top-3 leaders changed across samples in 11.3% of queries. So a single snapshot is wrong about a tenth of the time, and you have no way of knowing which tenth. The retrievable web is stochastic, and a one-time GEO Score reads that noise as if it were signal.
Most GEO measurement treats a single query as ground truth. You ask ChatGPT once, see who gets cited, and write it down. This study shows why that is a measurement error, not a measurement. The pool of pages AI search engines retrieve from is not fixed — it shifts between identical queries run minutes apart. Some of that shift is small. Some of it flips the leaderboard. And there is no asterisk on a one-shot read telling you which kind you got.
We set out to quantify exactly how unstable the retrievable web is for commercial queries, where the instability concentrates, and what it means for anyone trying to measure AI visibility honestly.
Last updated: June 2026
Across 150 commercial buyer queries sampled 3 times each (450 total samples), 89.4% of domain appearances were stable ("always" — present in all 3 samples) and 10.6% were "sometimes" (present in only 1 or 2). Mean pairwise top-10 overlap was 0.944, but the top-3 leader set changed in 11.3% of queries. Translation: a single AI-search check is right most of the time and silently wrong about 1 time in 9 — and a one-time GEO Score cannot tell the difference between a real result and a sampling fluke.
What we measured and why
This study measures the stability of the organic web-search retrieval landscape under repeated identical sampling — a proxy for how much the candidate pool AI search engines draw from varies between two otherwise-identical queries. It is not a direct log of AI-engine citations. We are stating that limit up front because it governs how every number below should be read.
AI search engines like ChatGPT, Perplexity, and Gemini answer a buyer question by retrieving a set of web pages, reading them, and synthesizing an answer with citations. The retrieval step pulls overwhelmingly from the organic search index. If that index returns a different set of pages each time you query it, then the candidate pool — the raw material every AI answer is built from — is itself moving. That movement is the floor on how stable any downstream AI citation can be. You cannot have a stable answer built on an unstable pool.
So we measured the pool's stability directly. On 2026-05-28 we selected 30 commercial verticals — a deliberate mix of B2B software (CRM, accounting, payroll, project management), considered-purchase services (life insurance, marketing agencies, web design), and physical/consumer categories (mattresses, running shoes, electric cars, smart-home security). For each vertical we ran 5 high-intent buyer queries. That is 150 queries. Then we ran each query three times and recorded the ranked domains returned each time. That is 450 total samples.
For every query we computed:
- Top-10 Jaccard overlap — how much the set of top-10 domains overlapped between repeated samples. 1.0 means identical; 0.0 means completely different.
- Share of answers — for every domain that appeared at all, whether it showed up in all 3 samples ("always") or only 1-2 ("sometimes").
- Top-3 churn — whether the set of top-3 leading domains changed across samples.
- Position volatility — when a domain reappeared, how far its rank moved.
This gives a clean, reproducible measurement of retrieval-layer noise. What it does not give is a record of which page a specific LLM actually cited in a generated answer. We treat the retrieval variance as a lower bound on citation variance — because AI engines add their own re-ranking and selection on top, which can only add noise, not remove it. The full methodology and limitations are at the bottom. Read them before quoting any figure.
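The four per-query metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual pipeline: the function names, the choice to classify "always" vs. "sometimes" within the top-k, and the choice to average rank shifts over every pair of pulls are all assumptions made for the example.

```python
from itertools import combinations
from statistics import mean

def jaccard(a, b):
    """Jaccard overlap of two domain lists, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def query_metrics(samples, k=10):
    """samples: ranked domain lists, one per repeated pull (needs at least 2)."""
    tops = [s[:k] for s in samples]
    pairs = list(combinations(tops, 2))
    # Top-k Jaccard overlap, averaged over every pair of pulls
    mean_jaccard = mean(jaccard(a, b) for a, b in pairs)
    # Share of answers: present in every pull vs. only some (assumed: within top-k)
    seen = set().union(*tops)
    always = {d for d in seen if all(d in t for t in tops)}
    sometimes = seen - always
    # Top-3 churn: did the set of leading domains differ in any pull?
    top3_churn = len({frozenset(s[:3]) for s in samples}) > 1
    # Position volatility: absolute rank change for domains present in both pulls of a pair
    shifts = [abs(a.index(d) - b.index(d)) for a, b in pairs for d in set(a) & set(b)]
    return {
        "mean_jaccard": mean_jaccard,
        "always": always,
        "sometimes": sometimes,
        "top3_churn": top3_churn,
        "mean_rank_shift": mean(shifts) if shifts else 0.0,
    }

# Hypothetical pulls for one query (made-up domains)
pulls = [
    ["a.com", "b.com", "c.com", "d.com"],
    ["a.com", "b.com", "c.com", "e.com"],
    ["a.com", "c.com", "b.com", "d.com"],
]
metrics = query_metrics(pulls)
print(metrics)
```

In this toy input the top-3 set never changes even though two domains are "sometimes", which shows why the metrics are tracked separately.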
The headline: stochasticity is real, but concentrated
Across all 450 samples, 89.4% of domain appearances were "always" (1,108 of 1,239) and 10.6% were "sometimes" (131 of 1,239). Mean pairwise top-10 Jaccard was 0.944, and 89.3% of queries had above-0.8 overlap. The retrievable web is mostly stable — but a real, measurable minority of it is not.
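Here is the share-of-answers split across every domain appearance in the study:
| Presence across 3 samples | Domain appearances | Share |
|---|---|---|
| Always (3 of 3) | 1,108 | 89.4% |
| Sometimes (1 or 2 of 3) | 131 | 10.6% |
Roughly 1 in 9 of the domains a single check surfaces is sample-dependent: there one pull, gone the next.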
And here is how query-level overlap was distributed:
| Top-10 Jaccard overlap | Queries | Share |
|---|---|---|
| Above 0.8 (highly stable) | 134 | 89.3% |
| 0.5 to 0.8 (moderately volatile) | 8 | 5.3% |
| Below 0.5 (highly volatile) | 8 | 5.3% |
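To make the bands concrete, it helps to translate a Jaccard value into a count of shared domains. The sketch below assumes two pulls that each return exactly `k` distinct domains (an assumption for illustration; real pulls may return fewer), in which case the overlap depends only on how many domains they share. The function name is hypothetical.

```python
def jaccard_from_shared(shared, k=10):
    """Jaccard of two lists of k distinct domains with `shared` domains in common.

    The union has k + k - shared members, so J = shared / (2k - shared).
    """
    return shared / (2 * k - shared)

for shared in range(10, 4, -1):
    print(shared, "shared ->", round(jaccard_from_shared(shared), 3))
```

Under that assumption, a single pair of pulls lands above 0.8 only when at least 9 of 10 domains match, and falls below 0.5 once 6 or fewer match. The study's per-query figure is a mean over three pairs, so a query's value averages three such numbers.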
The honest reading of this data is two-sided, and we will not flatten it into a scare headline.
First, the retrievable web is more stable than the "everything is chaos" framing suggests. Mean overlap of 0.944 is high. Nine in ten queries return nearly the same top-10 every time. If you only ever checked stable queries, a single snapshot would serve you reasonably well.
Second, and this is the part that breaks one-shot measurement, you do not know in advance which queries are the stable ones. 10.6% of domain appearances were sample-dependent, so whether you saw them came down to which pull you happened to get. 5.3% of queries had below-0.5 overlap, meaning the domains two identical pulls had in common were fewer than half of all the domains those pulls surfaced between them. A single check gives you one draw from a distribution and reports it as if it were the distribution. For the stable majority that is fine. For the volatile tail it is actively misleading, and the snapshot itself carries no warning label.
10.6% of domain appearances were sample-dependent and 10.7% of queries (16 of 150) scored 0.8 or lower on top-10 overlap, meaning roughly 1 in 9 of the things a single AI-search check tells you could change on the next identical check. The average is reassuring (Jaccard 0.944). The variance is the problem: one snapshot cannot distinguish a stable result from a volatile one, so it reports both with the same false confidence.
The leaders churn too — it is not just the long tail
The set of top-3 leading domains changed across samples in 11.3% of queries (17 of 150). When you check "who's winning" with a single query, you are getting it wrong about 1 time in 9 — and not at the bottom of the list, but at the very top where it matters most.
The top 3 is the part everyone actually reads — and it's exactly the part that moves. Screenshot "who does ChatGPT recommend first?" today, and roughly 1 in 9 times you've captured a leaderboard that wasn't there an hour ago and won't be again. Any decision — yours or a competitor's — built on a single screenshot of the leaders is built on a coin that lands the same way only 8 times in 9.
A common defense of one-shot measurement is that the churn lives in the long tail — positions 8, 9, 10 wobble, but the leaders are locked in. The data does not support that. Across 150 queries, the top-3 domain set changed in 17 of them. That is the single most consequential finding for anyone making decisions off a snapshot, because the top-3 is exactly what people read. "Who does ChatGPT recommend first?" is a top-3 question, and the top-3 answer is unstable in more than one out of ten cases.
There is a subtlety worth surfacing, because it reframes what "instability" means here. When a domain does reappear, it barely moves: mean absolute rank change was just 0.10 positions. Instability does not show up as stable members gently re-ordering — domain A and domain B politely swapping ranks 2 and 3. It shows up as domains entering and leaving the set entirely. A competitor is in your top-3 on one pull and absent on the next. That is a far more dangerous kind of noise than re-ordering, because a binary "are they cited?" check — the foundation of most GEO scoring — flips outright. Presence is the noisy variable, not position.
This is why a binary "cited / not cited" snapshot is the most fragile measurement of all. The thing it measures — in-or-out of the set — is precisely the thing that churns most.
Top-3 leaders changed in 11.3% of queries, but reappearing domains moved only 0.10 positions on average. The instability is members entering and leaving the set, not gentle re-ordering. That means a binary "are you cited?" check — the core of most GEO scores — is hitting the noisiest possible variable. Presence flips; rank barely moves.
Where the volatility lives: a few verticals swing hard
Instability is not spread evenly — it clusters. Electric cars (Jaccard 0.37) and smart-home security (0.50) were the volatile outliers, while most software and commercial verticals were near-perfectly stable (Jaccard 1.0). Your category's stability is not a given — it is something you have to measure, not assume.
[Figure: per-vertical mean top-10 Jaccard, sorted from most volatile to most stable. Bars scaled from Jaccard 0.30 (left) to 1.00 (right); exact value shown at right.]
Before you trust any AI-visibility number for your business, find out which bar you're standing on. In a near-1.00-Jaccard category like payroll software, one check is basically the truth. In a 0.37 category like electric cars, one check is a guess wearing a number. You can't tell which from the result itself, only by running the same query a few times and watching how much the results overlap.
The pattern is not random. The two genuinely volatile verticals — electric cars and smart-home security — are both aggregator-dominated, fast-moving consumer categories where new products, new reviews, and new roundups publish constantly and the retrieval pool reshuffles to match. The near-perfectly stable verticals skew toward established B2B software, where the candidate set is mature and the same handful of domains hold their positions every pull.
The operational lesson is not "ignore stable categories." It is that stability is a per-vertical property you have to verify, not a constant you can assume. If you sell payroll software (Jaccard 0.99), one careful check is nearly as good as ten. If you sell electric cars (Jaccard 0.37), a single check is close to meaningless — you'd need many samples to even estimate your true position. The same one-shot methodology that is roughly adequate in one vertical is dangerously wrong in another, and you cannot know which case you're in without sampling. The right content and measurement strategy depends on where your category sits.
Volatility clusters: electric cars (0.37) and smart-home security (0.50) swing hard while most B2B software sits at 1.0. Stability is a per-vertical property — in a 0.99-Jaccard category one check suffices, in a 0.37-Jaccard category one check is near-meaningless. You cannot tell which you're in without sampling repeatedly first.
The retrieval pool is intermediary-heavy on top of being noisy
Beyond instability, the pool itself is third-party-leaning. Across all 450 samples, brand-owned pages were 41.6% of results (1,534), aggregators/directories 36.8% (1,354), and editorial/listicle sources 21.6% (796) — so intermediaries (aggregators plus editorial) make up 58.4% of what gets surfaced, outweighing brands.
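Here is the source-type split across the full dataset:
| Source type | Results | Share |
|---|---|---|
| Brand-owned pages | 1,534 | 41.6% |
| Aggregators/directories | 1,354 | 36.8% |
| Editorial/listicle sources | 796 | 21.6% |
58.4% of results, the majority, is content you don't own. When AI answers "the best [category]," most of what it reads is third-party pages, not your site.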
Two things compound here, and the combination is the real story.
First, the majority of what AI search engines retrieve for commercial queries — 58.4% — is content you do not own. That alone means GEO is substantially an off-domain game, a finding that lines up with our companion study on the structure of the AI-citation landscape.
Second, and more relevant to measurement: the volatile verticals were exactly the aggregator-dominated ones. Electric cars and smart-home security, the two lowest-Jaccard outliers (0.37 and 0.50), both led with aggregators. That is not a coincidence. Aggregator and editorial pages churn faster than brand pages (new roundups publish, rankings get re-sorted, "best of 2026" lists get updated), and that churn propagates straight into the retrieval pool. So the more intermediary-heavy your category, the more sampling noise you should expect, and the less a single check is worth. The two findings, third-party-dominated and stochastic, are two sides of the same coin: the parts of the web that move fastest are also the parts AI leans on most.
Why a one-time GEO Score misleads
A single AI-search snapshot draws one sample from a distribution and reports it as a fact. With 10.6% of appearances sample-dependent and the top-3 changing 11.3% of the time, that snapshot carries an error rate it never discloses. The fix is not a better single check — it is repeated sampling.
Walk through what actually happens when you run a one-time check. You query ChatGPT for "best [your category]," see whether you're cited, and record a result. That result is one draw. If your query lives in the stable 89%, your draw is reliable. If it lives in the volatile tail, your draw is a coin flip — and nothing about the single result tells you which case you got. You will treat both with identical confidence, because a snapshot has no error bars.
This produces three specific, expensive failure modes:
False alarms. You check, you're not cited, you panic — and re-running would have shown you cited 2 of 3 times. You just diagnosed a problem that was sampling noise. Worse, you might "fix" content that was fine, then credit the fix when the next check happens to land on a good draw.
False comfort. You check, you're cited, you relax — but you were a "sometimes" domain all along, present in 1 of 3 pulls. Your real share of answers is a third of what your snapshot implied, and you're now under-investing against a gap you cannot see.
Phantom wins and losses over time. You check Monday, you're in. You check Friday, you're out. You conclude something changed — a competitor moved, an algorithm shifted, your content decayed — and you chase a cause that does not exist. The change was variance, not a trend. Distinguishing real movement from noise is impossible with one sample per checkpoint, and real movement is exactly what you're trying to track. (Citations genuinely do shift over time — see why AI citations decay — which makes separating true decay from sampling noise even harder, and even more important.)
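The size of these failure modes can be worked out with simple arithmetic. The sketch below assumes a domain has a fixed true share of answers and that each check is an independent draw. Both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
def single_check_risks(true_share):
    """Error rates for one-shot checks of a domain with a constant true share.

    Assumes each check is an independent draw with P(cited) = true_share.
    """
    miss = 1 - true_share                     # one check reports "not cited"
    flip = 2 * true_share * (1 - true_share)  # two separate single checks disagree
    return {"single_check_miss": miss, "two_checks_disagree": flip}

# A domain that is genuinely cited in 2 of every 3 pulls
risks = single_check_risks(2 / 3)
print(risks)
```

For a domain truly present in two thirds of pulls, one check reports "not cited" a third of the time, and a Monday check and a Friday check disagree about 44% of the time even though nothing changed.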
The statistical fix is not exotic. It is the same fix every noisy measurement uses: sample repeatedly and report a share, not a binary. Instead of "cited: yes/no," measure "cited in 2 of 3 checks — 67% share of answers." Instead of "rank 4," measure "mean rank 4.2 across 5 samples, present in 5 of 5." That converts a fragile point estimate into a stable distribution with a confidence signal attached. The 0.10-position rank stability we measured is the good news here: once you sample enough to confirm a domain is reliably present, its rank is trustworthy. The hard part is presence, and presence is exactly what repeated sampling pins down.
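One way to attach a confidence signal to a share of answers is a Wilson score interval. That choice is an assumption made here for illustration: the study does not say which interval, if any, it uses. The function name and the 95% default are likewise hypothetical.

```python
from math import sqrt

def share_of_answers(hits, n, z=1.96):
    """Return (share, low, high): observed share plus a Wilson score interval.

    hits: checks in which the domain was cited; n: total checks.
    z=1.96 corresponds to a roughly 95% interval.
    """
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)

print(share_of_answers(2, 3))  # cited in 2 of 3 checks
print(share_of_answers(5, 5))  # cited in 5 of 5 checks
```

With only three checks the interval for "2 of 3" runs from roughly 0.21 to 0.94, which is a reminder that a small sample narrows the uncertainty without removing it.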
This is the entire design principle behind continuous, multi-sample share-of-answers monitoring. A single audit is a useful diagnostic snapshot — it tells you roughly where you stand today, and it's the right way to start. But ongoing measurement that drives budget and content decisions has to average across repeated samples, or it is measuring noise. That is the difference between LumenGEO's free one-time audit (a snapshot, honestly labeled as one) and the monitoring subscription (repeated sampling that reports share of answers with the variance built in).
A one-time GEO Score is one draw from a distribution, reported with no error bars. It produces false alarms (noise read as a problem), false comfort (a "sometimes" domain read as stable), and phantom trends (variance read as change). The fix is repeated sampling reported as share-of-answers — "cited in 2 of 3 checks," not "cited: yes." Presence is the noisy variable; sampling is what pins it down.
What to do about it
Use a single audit to orient, then switch to repeated sampling for anything you act on. The volatility data tells you exactly how much sampling each category needs. Here is the practical sequence.
Start with a snapshot — but treat it as a snapshot
A one-time audit is the right first move. It tells you roughly where you stand, which competitors appear, and which queries to watch. Just don't over-trust a single result, especially a binary "not cited" — that's the most fragile reading in the entire dataset. Treat the first audit as a hypothesis, not a verdict.
Sample volatile categories harder than stable ones
If your vertical resembles the stable B2B-software cluster (Jaccard near 1.0), a few checks per query gets you a trustworthy read. If it resembles electric cars or smart-home security (Jaccard 0.37-0.50), you need many more samples before any number means anything. The cheapest way to find out which you are: run the same query three times and look at the overlap yourself. High overlap, sample lightly. Low overlap, sample heavily or don't trust point reads at all.
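The three-pull check described above can be sketched as follows. The 0.8 and 0.5 cut-offs reuse the bands from the overlap table earlier in this study; the middle "sample moderately" tier and the function names are assumptions added for the example.

```python
from itertools import combinations

def mean_pairwise_jaccard(pulls, k=10):
    """Mean Jaccard overlap of the top-k domains across every pair of pulls.

    Expects at least two non-empty pulls.
    """
    sets = [set(p[:k]) for p in pulls]
    scores = [len(a & b) / len(a | b) for a, b in combinations(sets, 2)]
    return sum(scores) / len(scores)

def sampling_tier(pulls, k=10):
    """Map the measured overlap to a rough sampling recommendation."""
    j = mean_pairwise_jaccard(pulls, k)
    if j > 0.8:
        return j, "sample lightly"
    if j >= 0.5:
        return j, "sample moderately"  # assumed middle tier
    return j, "sample heavily, or don't trust point reads"

# Hypothetical pulls (made-up domains)
stable = [["a.com", "b.com", "c.com"]] * 3
volatile = [["a.com", "b.com", "c.com"], ["d.com", "e.com", "f.com"], ["a.com", "x.com", "y.com"]]
print(sampling_tier(stable))
print(sampling_tier(volatile))
```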
Measure share of answers, not binary presence
Replace "are we cited?" with "in what fraction of checks are we cited?" A domain present in 5 of 5 samples is in a categorically stronger position than one present in 2 of 5, even though a single lucky check would score them identically. Share of answers is the metric that survives the stochasticity this study documents, and it is the metric LumenGEO's monitoring is built around. The same logic improves how you read competitors and how you structure pages for citation eligibility.
Separate signal from noise before you react
Before you conclude a citation was won or lost, re-sample. If the change holds across multiple fresh pulls, it's real and worth acting on. If it doesn't, it was variance and acting on it would have been a mistake. Continuous monitoring does this automatically: it's checking constantly, so a real change shows up as a sustained shift in share, while noise averages out. That is the core reason ongoing monitoring beats periodic manual checks: the value is not the extra checks for their own sake, but that repeated sampling is what lets you tell a trend from a tremor.
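One way to formalize "does the change hold?" is a two-sided Fisher exact test on cited/not-cited counts from two sampling windows. This is a sketch of one reasonable technique, not the method any particular monitoring product uses. It assumes pulls are independent, and the function name is hypothetical.

```python
from math import comb

def change_p_value(hits_before, n_before, hits_after, n_after):
    """Two-sided Fisher exact p-value for a change in share between two windows.

    A small p-value means the shift is unlikely to be sampling noise alone.
    """
    total_hits = hits_before + hits_after
    total = n_before + n_after

    def prob(x):
        # P(x of the hits fall in the 'before' window) if the true share never changed
        return comb(n_before, x) * comb(n_after, total_hits - x) / comb(total, total_hits)

    lo = max(0, total_hits - n_after)
    hi = min(n_before, total_hits)
    observed = prob(hits_before)
    # Sum every outcome at least as unlikely as the one observed
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))

print(change_p_value(1, 1, 0, 1))    # one check Monday (cited), one Friday (not cited)
print(change_p_value(9, 10, 2, 10))  # 9 of 10 cited before, 2 of 10 after
```

With one check per checkpoint the p-value is 1.0 no matter what you observe, so a single in/out flip carries no evidence of real change. With ten pulls per window, a drop from 9 of 10 to 2 of 10 is very unlikely to be noise.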
Methodology & limitations
This study measured the stability of the organic web-search retrieval landscape under repeated identical sampling, dated 2026-05-28. It is a multi-sample retrieval-landscape proxy for AI-engine retrieval variance — not direct AI-engine citation logging. We spell out the limitations fully because overclaiming would undermine every number above.
What we did. On 2026-05-28 we ran 150 commercial buyer queries (5 each across 30 verticals), and ran each query 3 times, for 450 total samples. For each sample we recorded the ranked domains returned by web-search retrieval. We then computed, per query, the pairwise top-10 Jaccard overlap between samples, whether each appearing domain was "always" (3/3) or "sometimes" (1-2/3), whether the top-3 domain set changed, and the mean absolute rank change of reappearing domains. Domains were classified into brand, aggregator/directory, and editorial/listicle source types.
Limitation 1: This is a proxy, not direct citation logging. We did not log which sources ChatGPT, Perplexity, or Gemini actually cited inside generated answers. We measured the variance of the web-search retrieval pool those engines draw from. We report retrieval-layer variance as a lower bound on downstream citation variance — because AI engines apply their own re-ranking, recency weighting, and selection on top of the organic pool, which can only add variance, not subtract it. Read every figure as "the retrievable pool varied this much," not "AI engine X's citations varied this much." The true citation-level instability is very likely higher than what we report.
Limitation 2: Three samples is a floor, not a ceiling. Three samples per query is enough to detect instability and classify always-vs-sometimes, but it is a coarse estimator of the true distribution. A domain we marked "always" on 3/3 samples could still be a "sometimes" domain we happened to catch three times; a "sometimes" at 1/3 could be more or less stable than that single observation implies. More samples would sharpen every estimate. Our numbers should be read as a conservative demonstration that meaningful stochasticity exists — not as a precise measurement of its exact magnitude.
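How coarse three samples are can be shown with a short calculation. The sketch assumes a domain has a fixed true presence rate and that pulls are independent, both simplifications for illustration; the function name is hypothetical.

```python
def three_sample_labels(true_share, n=3):
    """Probability of each label after n independent pulls of one query."""
    always = true_share ** n          # present in every pull
    never = (1 - true_share) ** n     # absent from every pull, so never recorded
    return {"always": always, "sometimes": 1 - always - never, "never": never}

labels = three_sample_labels(0.8)
print(labels)
```

Under those assumptions, a domain that is truly present in only 80% of pulls would still be labeled "always" about half the time (0.8 cubed is about 0.51), which is why the "always" share reported here should be read as an upper bound on true stability.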
Limitation 3: Snapshot in time, US/English, commercial intent. This is a single-day (2026-05-28) measurement of US-English commercial buyer queries across 30 verticals, 5 queries each. The existence and rough magnitude of retrieval stochasticity are the robust findings; the exact per-vertical Jaccard values will drift day to day, and other regions, languages, or query types may behave differently. The volatile verticals identified here (electric cars, smart-home security) are illustrative of where instability concentrates, not a permanent list.
Limitation 4: Source-type classification involves judgment. Brand vs. aggregator vs. editorial is a rules-based call that blurs at the margins — a brand domain with a strong editorial arm, or a comparison site that also publishes news. We applied the rubric consistently, but a different rubric would shift the percentages by a few points. Treat the direction (intermediary-heavy, 58.4%) as robust and the exact splits as approximate.
In short: trust that the retrievable web is stochastic and that the instability concentrates in fast-moving, aggregator-heavy verticals. Do not treat any single Jaccard value as a precise constant, and do not read these retrieval-variance numbers as exact AI-citation-variance numbers — they are a documented lower bound, measured once.
Frequently asked questions
Can I measure my AI-search visibility with a single check?
Not reliably. In this study, 10.6% of all domain appearances were sample-dependent — present in only 1 or 2 of 3 identical samples — and the top-3 leaders changed in 11.3% of queries. A single check gives you one draw from a distribution and reports it with no error bars. For stable queries it's roughly fine; for the volatile minority it's a coin flip, and the snapshot can't tell you which case you got. Repeated sampling is the only way to know whether a single result is signal or noise.
What is the Jaccard overlap and why does it matter?
Jaccard overlap measures how much two sets share: the number of items in both sets divided by the number of items in either. Here the sets are the top-10 retrieved domains from repeated samples of the same query. 1.0 means identical results every time; 0.0 means completely different. Our mean was 0.944, with 89.3% of queries above 0.8. High on average, but with a volatile tail: 5.3% of queries scored below 0.5, meaning repeated pulls shared fewer than half of the domains they surfaced between them. It matters because it quantifies how much the candidate pool, the raw material of every AI answer, moves on its own.
Does this study log actual ChatGPT or Perplexity citations?
No. We measured the variance of the organic web-search retrieval landscape across repeated samples, dated 2026-05-28 — the candidate pool AI engines draw from — not the citations they generate. We report retrieval variance as a lower bound on citation variance, because AI engines add their own re-ranking and selection on top, which can only increase variance. The real instability inside AI answers is likely higher than the numbers here.
How many times should I sample a query to get a reliable read?
It depends on the vertical's stability, which you have to measure. In near-perfectly-stable categories (Jaccard ~1.0, like most B2B software), a few checks per query gives a trustworthy read. In volatile categories (Jaccard 0.37-0.50, like electric cars or smart-home security), you need many more samples before any number is meaningful. The cheapest first step: run the same query three times and look at the overlap. High overlap means you can sample lightly; low overlap means point estimates can't be trusted.
Which verticals are most unstable?
In our data, electric cars (Jaccard 0.37) and smart-home security (0.50) were the volatile outliers — both fast-moving, aggregator-dominated consumer categories where roundups and rankings change constantly. Most B2B software verticals (accounting, CRM, payroll, website builders, and others) were near-perfectly stable at 1.0. The pattern: the more intermediary-heavy and fast-moving your category, the more sampling noise you should expect.
Is the retrievable web mostly stable or mostly unstable?
Both, depending on which part you look at. On average it's quite stable — 89.4% of domain appearances were "always" present and mean overlap was 0.944. But a real minority is unstable: 10.6% of appearances were sample-dependent and 5.3% of queries had below-0.5 overlap. The danger isn't the average; it's that a single check can't distinguish a stable result from a volatile one, so it reports both with the same false confidence.
Why is a binary "cited / not cited" check the most fragile measurement?
Because instability in this data showed up as domains entering and leaving the result set, not as gentle re-ordering — reappearing domains moved only 0.10 positions on average. Presence is the noisy variable; rank is not. A binary "are you cited?" check measures exactly the thing that churns most, so it flips outright between identical pulls. Measuring share of answers ("cited in 2 of 3 checks") instead of a binary is far more robust.
What is share of answers and why is it better than a GEO Score snapshot?
Share of answers is the fraction of repeated samples in which your brand appears — for example, "cited in 4 of 5 checks (80%)." It's better than a one-time snapshot because it has the stochasticity built in: a domain present in 5 of 5 samples is genuinely stronger than one present in 2 of 5, even though a single lucky check would score them identically. It converts a fragile point estimate into a distribution with a confidence signal, which is what survives the retrieval noise this study documents.
How does this change how I should monitor AI visibility?
Use a one-time audit to orient — it tells you roughly where you stand and which queries to watch — but treat it as a hypothesis, not a verdict. For anything you act on, switch to repeated sampling reported as share of answers, and sample volatile categories harder than stable ones. Before reacting to an apparent change, re-sample to confirm it's real and not variance. Continuous monitoring does this automatically: frequency is what lets you separate a real trend from a one-pull tremor.
Does retrieval instability mean my content optimization doesn't matter?
No — it means you have to measure its effect correctly. Optimization still moves your underlying citation eligibility, but if you measure the effect with single before-and-after checks, sampling noise can hide a real gain or fake one that isn't there. Measure the effect as a change in share of answers across many samples, not as a flip in a single binary check, and genuine improvements become visible while noise averages out.