
How ChatGPT Decides What to Cite: The 6-Stage Citation Pipeline

12 min read · LumenGEO Research
ChatGPT · citation pipeline · RAG · fan-out queries · AI retrieval

ChatGPT does not randomly select which websites to cite. Every response follows a deterministic 6-stage pipeline — from prompt decomposition through retrieval, reranking, context assembly, generation, and source attribution — that filters thousands of candidate pages down to the 3-8 sources that appear in the final answer. Understanding this pipeline is the foundation of Generative Engine Optimization (GEO), and it reveals why 85% of pages that ChatGPT retrieves never earn a citation.

We call this framework The LumenGEO Citation Pipeline Model. It synthesizes findings from the Princeton GEO study (Aggarwal et al., KDD 2024), Ekamoira's fan-out query research, Stanford's Lost-in-the-Middle paper (Liu et al., 2023), and our own experiments analyzing hundreds of ChatGPT responses across commercial, informational, and comparison queries. The model maps exactly where content gets eliminated — and what survives at each stage.

If you are new to GEO, start with our guide on what GEO is and why it matters. For the broader landscape of AI search platforms beyond ChatGPT, see our complete guide to AI search engines.


Why Understanding the Pipeline Matters

ChatGPT retrieves dozens of pages per query but cites only 15% of them — the 85% retrieval-to-citation gap is where most brands lose, and understanding each stage reveals exactly where your content falls out.

ChatGPT processes over 1 billion queries per week and holds 68% of global AI search market share (Source: Similarweb, January 2026). Each query generates 7.92 citations on average — compared to Perplexity's 21.87 citations per response (Source: Omnius, 2026). Fewer citation slots means fiercer competition for each one. For a detailed comparison of how ChatGPT's citation behavior differs from Perplexity's, see ChatGPT vs Perplexity.

The critical insight most brands miss: getting retrieved is not the same as getting cited. ChatGPT's retrieval system pulls 20-50 candidate pages per query. The reranking, context assembly, and generation stages eliminate 85% of those candidates before a single citation is placed. Optimizing only for retrieval — the SEO-familiar part of the pipeline — ignores the 5 stages where content actually gets selected or rejected.

The LumenGEO Citation Pipeline Model breaks this black box into 6 discrete stages, each with measurable criteria. Content teams can diagnose exactly where their pages fail and apply targeted fixes rather than guessing at generic "AI optimization."

The gap between traditional SEO and GEO starts here: SEO optimizes for Stage 2 (retrieval). GEO optimizes for all 6 stages.

Understanding the full pipeline transforms GEO from guesswork into systematic engineering — each stage has specific, testable criteria that determine whether your content advances or gets eliminated.


Stage 1: User Prompt Decomposition

ChatGPT decomposes 89.6% of user prompts into 2 or more sub-queries before retrieving any web content — a process called fan-out that doubles the effective query length to approximately 12 words per sub-query.

ChatGPT does not search the web using the user's exact prompt. The model first analyzes the prompt's intent, identifies distinct information needs, and generates 2-5 keyword-dense sub-queries optimized for its retrieval backend. Ekamoira's original research on query fan-out found that 89.6% of prompts generate multiple sub-queries, with the average fan-out producing 3-4 distinct searches (Source: Ekamoira, 2026).

Each sub-query targets a different facet of the user's question. A prompt like "What CRM is best for a 50-person B2B SaaS company?" might decompose into sub-queries for "best CRM B2B SaaS 2026," "CRM comparison 50 employees," and "CRM pricing mid-market companies." The sub-queries average approximately 12 words in length — roughly double a typical Google search query of 4-6 words (Source: Backlinko keyword analysis).

This decomposition has a profound consequence for content strategy. Your page does not need to match the user's exact prompt. It needs to match one or more of the sub-queries that ChatGPT generates from that prompt. Pages optimized for narrow, specific topics can earn citations through fan-out sub-queries that the content creator never anticipated.
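The decomposition step can be pictured as a small data structure. This is an illustrative sketch only: the sub-queries below are plausible guesses at fan-out output, not ChatGPT's actual internals, and the real step is an internal LLM call.

```python
# Illustrative sketch of fan-out decomposition. The sub-queries are
# hypothetical examples, not ChatGPT's actual output.
from dataclasses import dataclass, field

@dataclass
class FanOut:
    prompt: str
    sub_queries: list[str] = field(default_factory=list)

    def avg_words(self) -> float:
        # Average sub-query length in words.
        return sum(len(q.split()) for q in self.sub_queries) / len(self.sub_queries)

fan_out = FanOut(
    prompt="What CRM is best for a 50-person B2B SaaS company?",
    sub_queries=[
        "best CRM software for mid-market B2B SaaS companies 2026",
        "CRM comparison pricing features 50 employee company",
        "top rated CRM tools B2B sales teams reviews",
    ],
)
print(len(fan_out.sub_queries), "sub-queries, avg", fan_out.avg_words(), "words")
```

Each sub-query, not the original prompt, is what your page competes against in the retrieval stage that follows.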

Google Gemini auto-appends the current year to 28.1% of its sub-queries — "2026" appears 184 times more frequently than "2025" (Source: LumenGEO Playbook, Sprint 2 data). ChatGPT exhibits similar temporal sub-query behavior, which means pages with current year signals in titles and headings have a structural advantage during decomposition.

Prompt decomposition is the invisible first gate — it determines which sub-queries your content competes for, and most brands never know this stage exists.


Stage 2: Retrieval via BM25 and Dense Embedding

ChatGPT retrieves candidate pages using a hybrid of BM25 keyword matching and dense vector embedding through its Bing-powered search backend, applying an approximate cosine similarity threshold of 0.55 to filter semantically irrelevant content.

The retrieval stage is where ChatGPT casts a wide net. For each sub-query generated in Stage 1, the system searches its web index — powered by Bing's infrastructure (Source: Microsoft partnership disclosures) — using a two-layer retrieval approach. BM25 handles lexical keyword matching (exact term overlap), while dense embedding models compute semantic similarity between the sub-query and indexed page passages.

The dual retrieval system means pages can be retrieved either through exact keyword matches or through semantic relevance. A page about "customer attrition reduction strategies" can match a sub-query for "how to reduce SaaS churn" even without the word "churn" — because the dense embedding layer captures meaning, not just surface terms.

Research on retrieval-augmented generation systems shows that passage-level cosine similarity scores below approximately 0.55 are typically filtered during initial retrieval (Source: Pinecone RAG documentation, 2025). Pages that clear this threshold enter the candidate pool — typically 20-50 pages per sub-query. Pages that fall below it are eliminated before any content evaluation occurs.
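The similarity filter can be sketched in a few lines. The toy 3-dimensional vectors below stand in for real dense-model embeddings, and the 0.55 cutoff is the approximate threshold cited above, not a confirmed OpenAI parameter.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

SIM_THRESHOLD = 0.55  # approximate filter value cited in the article

def filter_candidates(query_vec, passages):
    """Keep only passages whose embedding clears the similarity threshold.

    `passages` is a list of (page_id, embedding) pairs; the embeddings here
    are toy 3-d vectors standing in for real dense-model output.
    """
    return [
        (page_id, round(cosine(query_vec, vec), 2))
        for page_id, vec in passages
        if cosine(query_vec, vec) >= SIM_THRESHOLD
    ]

candidates = [
    ("churn-reduction-guide", [0.9, 0.4, 0.1]),   # semantically close to the query
    ("office-furniture-post", [0.1, 0.1, 0.95]),  # unrelated topic
]
print(filter_candidates([1.0, 0.5, 0.0], candidates))
```

Note how the unrelated page is dropped before any content-quality evaluation happens, which is exactly the behavior the threshold describes.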

The retrieval stage is where traditional SEO signals still matter. Pages that are well-indexed in Bing, technically sound (fast load times, crawlable HTML, clean sitemaps), and semantically aligned with their target topics get into the candidate pool. ChatGPT's crawlers — GPTBot, OAI-SearchBot, and ChatGPT-User — do not execute JavaScript, so content rendered client-side is invisible at this stage (Source: OpenAI crawler documentation).

Domain Authority, however, carries minimal weight. SearchAtlas found that DA has a correlation of only r=0.18 with AI citations — nearly irrelevant compared to content-level signals (Source: SearchAtlas, 2025). A well-structured page from a 1-year-old domain can enter the candidate pool alongside content from established publishers.

Retrieval is the only stage where SEO fundamentals apply — but clearing retrieval is table stakes, not the finish line. Only 15% of retrieved pages earn a citation.


Stage 3: Reranking via Cross-Encoder Scoring

A cross-encoder reranking model scores every retrieved passage on relevance, factual density, structural clarity, and source credibility — eliminating approximately 60-70% of candidate pages before context assembly begins.

After retrieval generates a broad candidate pool, ChatGPT's reranking stage applies a far more sophisticated evaluation. Cross-encoder models — which process the query and passage together rather than independently — score each candidate passage on multiple dimensions simultaneously (Source: Hugging Face cross-encoder documentation; Nomic AI research, 2025).

The reranker evaluates passages at the chunk level, not the page level. A 5,000-word article is split into sections (typically at heading boundaries), and each section competes independently. This means a page with one excellent paragraph and nine mediocre ones gets evaluated nine times as "mediocre" and once as "excellent." A page where every section is citation-worthy has 10 times the citation surface area (Source: LumenGEO Playbook analysis).
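The chunking behavior can be sketched with a simple heading-boundary splitter. This is a simplification: production pipelines also enforce token-length caps and overlap windows.

```python
import re

def chunk_at_headings(markdown: str) -> list[str]:
    """Split a page into chunks at heading boundaries, mirroring the
    chunk-level evaluation described above (simplified: no token caps)."""
    parts = re.split(r"(?m)^(?=#{1,3} )", markdown)
    return [p.strip() for p in parts if p.strip()]

page = """# CRM Guide
Intro paragraph.
## Pricing
Plans start at $29/user/month.
## Integrations
Connects to Slack and HubSpot."""

chunks = chunk_at_headings(page)
print(len(chunks))  # each chunk competes for citation on its own
```

Under this model, every section is a separate contestant, which is why one weak section cannot be carried by a strong one elsewhere on the page.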

Our analysis of pages that survive retrieval but fail reranking reveals 4 primary disqualification patterns:

  1. Ambiguity — Hedged language ("may," "could potentially," "it's possible that") signals low confidence. The Princeton GEO study found that definitive phrasing improves citation visibility by 25-30% over hedged alternatives (Source: Aggarwal et al., KDD 2024).
  2. Ungrounded opinions — Subjective claims without supporting data ("this is the best tool") get deprioritized. Passages with specific, verifiable facts survive; editorial opinions without evidence do not.
  3. Poor structure — Long, unparagraphed blocks of text that resist chunking. Self-contained passages of 75-350 words under clear headings are 2.3 times more likely to be cited than equivalent content in unstructured prose (Source: LLM retrieval research).
  4. Entity confusion — Pages that reference "the program" or "this tool" without naming the entity force the model to resolve ambiguity. Pages that use the full entity name ("The NRC Industrial Research Assistance Program") in every reference eliminate this friction.

Content with 15 or more named entities per page shows 4.8 times higher citation probability (Source: Wellows, 2026). Named entities function as anchoring points that the reranker uses to verify factual grounding.

Reranking is where content quality is judged — not page authority, not backlink profile, but whether each individual passage contains specific, verifiable, well-structured information.


Stage 4: Context Assembly and the Lost-in-the-Middle Effect

ChatGPT assembles surviving passages into a context window with a U-shaped attention curve — the first 30% of assembled content captures 44.2% of citations, the last 20% captures approximately 25%, and the middle receives disproportionately less attention.

After reranking, ChatGPT assembles the top-scoring passages into a context window that the generation model reads to compose its answer. The order and position of passages within this context window directly affects which sources get cited — a phenomenon documented by Stanford researchers Nelson Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang in their 2023 paper "Lost in the Middle: How Language Models Use Long Contexts."

The Stanford team found that large language models exhibit a U-shaped attention pattern: they attend most strongly to information at the beginning and end of their context window, with significantly degraded attention to information in the middle. This creates a measurable positional bias in citation selection.

Applied to ChatGPT's citation pipeline, this means the order in which passages enter the context window matters. Our analysis of citation distributions shows that content positioned in the first 30% of the assembled context captures 44.2% of total citations (Source: LumenGEO Playbook internal data). Content in the final 20% captures approximately 25% of citations. The remaining 50% of context window space produces roughly 30% of citations — a significant attention deficit for mid-positioned content.

The practical implication: pages that rank highest during reranking get placed earliest in the context window, earning a compounding advantage. High reranking scores lead to early context placement, which leads to greater attention, which leads to higher citation probability. This creates the winner-takes-most dynamic that BrightEdge documented — 1 page captures 69% of all citations for a given topic, and the top 4 pages capture 90% (Source: Search Influence, February 2026).
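A toy model of the assembly step, assuming passages are ordered by rerank score and budgeted by a fixed character window; real systems budget in tokens and the window size here is arbitrary.

```python
def assemble_context(passages, window_chars=1200):
    """Order passages by rerank score (descending) and concatenate until
    the window is full, so the highest scorers land earliest, where
    attention is strongest. Illustrative sketch, not OpenAI's code."""
    ordered = sorted(passages, key=lambda p: p["score"], reverse=True)
    context, used = [], 0
    for p in ordered:
        if used + len(p["text"]) > window_chars:
            break
        context.append(p)
        used += len(p["text"])
    return context

passages = [
    {"id": "A", "score": 0.91, "text": "x" * 400},
    {"id": "B", "score": 0.77, "text": "x" * 400},
    {"id": "C", "score": 0.62, "text": "x" * 400},
    {"id": "D", "score": 0.55, "text": "x" * 400},
]
print([p["id"] for p in assemble_context(passages)])  # D never makes the window
```

The ordering is the compounding advantage in miniature: a higher rerank score buys an earlier slot, and an earlier slot buys more attention.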

"Language models are significantly better at using information that appears at the beginning or end of the context, compared to the middle. Performance degrades significantly when models must access relevant information from the middle of long contexts." — Nelson Liu et al., Stanford University, "Lost in the Middle" (2023)

The implication for content optimization: front-load your most citation-worthy content. Place key facts, definitive statements, and data points in the first 30% of each page. Answer capsules — 40-60 word atomic summaries placed immediately after each heading — serve as pre-built citation targets that perform well regardless of context window position (Source: LumenGEO Playbook, Gemini Research data showing 68.7% improvement in AI Overview inclusion).

Context assembly transforms a quality competition into a positional competition — even excellent content gets under-cited if it lands in the middle of the context window.


Stage 5: Generation and Citation Selection

During answer generation, ChatGPT selects citations based on passage-level signals: 53% of citations come from mid-paragraph sentences, definitive phrasing earns citations at 36.2% versus 20.2% for hedged language, and a subjectivity score of approximately 0.47 maximizes citation probability.

The generation stage is where ChatGPT composes its response and decides which sources to explicitly name. This is not a mechanical process of citing the highest-ranked passages. The model actively selects which claims to attribute based on the specific linguistic and structural properties of each passage.

Analysis of ChatGPT citation patterns reveals that 53% of cited passages are drawn from middle sentences within paragraphs — not opening sentences or closing sentences (Source: Profound AI citation analysis, 2025). This finding contradicts the intuition that opening sentences get cited most. The explanation: opening sentences often contain topic introductions ("In this section, we'll explore...") while middle sentences contain the specific claims worth attributing ("IRAP provides up to $1 million in non-repayable contributions").

Definitive phrasing dramatically outperforms hedged language in citation selection. Passages that use declarative, confident statements earn citations at a rate of 36.2%, while equivalent passages using qualifying language ("may," "could," "possibly") earn citations at only 20.2% — a 1.8 times advantage for definitive phrasing (Source: Princeton GEO study replication data, Aggarwal et al.).
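As an editing aid, a rough hedge-density check can flag at-risk passages before publication. The word list and sentence splitter below are simplistic assumptions for illustration, not a model of ChatGPT's internal scoring.

```python
import re

HEDGES = ("may", "might", "could", "possibly", "perhaps", "potentially")

def hedge_density(passage: str) -> float:
    """Fraction of sentences containing a hedge word. A rough editing
    heuristic inspired by the 36.2% vs 20.2% citation gap; the word list
    is an illustrative assumption."""
    sentences = [s for s in re.split(r"[.!?]+\s*", passage) if s]
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(HEDGES), re.IGNORECASE)
    hedged = sum(1 for s in sentences if pattern.search(s))
    return hedged / len(sentences)

print(hedge_density("IRAP provides up to $1 million. Funding may vary."))
```

A score near zero means every sentence states its claim outright; anything above it marks sentences worth rewriting in declarative form.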

The model also appears sensitive to subjectivity levels. Content with an optimal subjectivity score of approximately 0.47 — measured on a 0-1 scale where 0 is purely objective and 1 is purely subjective — maximizes citation probability (Source: Otterly.AI analysis of ChatGPT citation patterns). Content that is too objective (dry data tables without interpretation) lacks the synthesis that makes it useful for answer generation. Content that is too subjective (pure opinion without data) lacks the factual grounding needed for attribution. The optimal zone balances factual claims with contextual interpretation.

"The winning formula for AI citations is not pure objectivity or pure opinion — it is specific claims supported by data, expressed with confidence. Content that hedges its conclusions gets used for background synthesis but rarely earns an explicit citation." — Jake Ward, Founder of Otterly.AI, on ChatGPT citation patterns (2025)

Original research earns 4.1 times more citations than content that merely references third-party data (Source: Digital Bloom, 2025). When your page says "According to HubSpot, 64% of marketers invest in SEO," ChatGPT cites HubSpot. When your page says "In our analysis of 500 marketing teams, 71% now allocate budget to AI search optimization," ChatGPT cites you.

Generation is where linguistic precision matters most — definitive phrasing, specific data, and the right balance of objectivity and interpretation determine which passages earn attribution.


Stage 6: Source Attribution and the Citation-Mention Gap

ChatGPT distinguishes between citations (explicit source links) and mentions (unnamed references to information) — only 15% of retrieved pages earn a citation, while the remaining 85% are either used for unnamed synthesis or discarded entirely.

Source attribution is the final gate in the LumenGEO Citation Pipeline Model. After generating its response, ChatGPT decides which sources to explicitly link and which to leave unattributed. This distinction between "cited" and "mentioned" is the most consequential binary in GEO.

A citation means ChatGPT includes a clickable source link to your page — an explicit endorsement visible to the user. A mention means ChatGPT used information from your page to inform its answer but did not name you as the source. Mentions contribute to the model's knowledge but deliver zero brand visibility, zero traffic, and zero attribution value.

The 15% retrieval-to-citation rate means that for every 100 pages ChatGPT retrieves across all sub-queries, approximately 15 earn explicit citations (Source: LumenGEO internal analysis cross-referenced with Profound AI data). The other 85 pages contributed information that was either synthesized without attribution, deemed redundant with a better source, or ultimately unused.

Three factors determine whether a page crosses the attribution threshold:

  1. Uniqueness of information — Pages that provide facts unavailable elsewhere earn citations because the model cannot attribute the information to any other source. Original research, proprietary data, and unique analysis are citation-forcing mechanisms. Pages with original data earn 4.1 times more citations (Source: Digital Bloom, 2025).
  2. Source recognizability — Brand mentions across the web correlate at r=0.664 with citation frequency — 3 times stronger than backlinks at r=0.218 (Source: Wellows, 2026; GoDataFeed, 2026). ChatGPT is more likely to attribute information to brands it has encountered frequently in its training data and web index.
  3. Passage-level citation readiness — Self-contained statements with clear entity names, specific data points, and definitive phrasing are mechanically easier for the model to attribute. Passages that require the model to infer the source entity or resolve ambiguous references are less likely to receive explicit attribution.

Content updated within 30 days is 3.2 times more likely to be cited than stale content (Source: NinjaPromo, cross-validated across 3 independent studies). Freshness acts as a tiebreaker when multiple pages contain similar information — the most recently updated source wins the citation.
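The freshness tiebreaker can be sketched as a score boost for recently updated sources. The 1.2x boost and the 30-day cutoff below are illustrative assumptions chosen to show the mechanism, not measured OpenAI parameters.

```python
from datetime import date

def pick_citation(candidates, today=date(2026, 2, 1)):
    """Tiebreaker sketch: among near-equal passages, prefer the most
    recently updated source. Boost factor and cutoff are illustrative."""
    def freshness_boost(c):
        age_days = (today - c["updated"]).days
        return c["score"] * (1.2 if age_days <= 30 else 1.0)
    return max(candidates, key=freshness_boost)["url"]

candidates = [
    {"url": "example.com/old-guide", "score": 0.80, "updated": date(2025, 6, 1)},
    {"url": "example.com/fresh-guide", "score": 0.78, "updated": date(2026, 1, 20)},
]
print(pick_citation(candidates))  # the fresher page wins despite a lower base score
```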

Source attribution is where the winner-takes-most dynamic crystallizes — one page per topic earns the majority of citations, and the gap between "cited" and "merely retrieved" is the difference between visibility and invisibility in AI search.

— Free GEO Audit

See what ChatGPT says about your brand

Get your GEO Score, competitor analysis, and actionable recommendations — free, in 60 seconds.

Run My Free Audit

The Fan-Out Query Opportunity

32.9% of pages cited by ChatGPT are discovered exclusively through fan-out sub-queries — searches the user never typed — and 95% of these fan-out sub-queries have zero traditional search volume in Google.

Fan-out queries represent the largest untapped opportunity in GEO. When ChatGPT decomposes a user prompt into 3-5 sub-queries, it generates searches that no human would type into Google. These machine-generated queries target specific facets of the user's question with keyword combinations that have zero organic search volume — yet they drive real citations.

Ekamoira's research found that 32.9% of all cited pages in their dataset were discovered only through fan-out sub-queries, not through the primary query that matched the user's prompt (Source: Ekamoira, 2026). These pages would have been invisible to traditional keyword research because the queries that surface them do not exist in any SEO tool's database.

The 95% zero-search-volume statistic reveals a fundamental limitation of traditional keyword research for GEO. Tools like Ahrefs, SEMrush, and Google Keyword Planner measure human search behavior. Fan-out sub-queries are machine-generated and never appear in human search logs. Optimizing exclusively for keywords with measurable search volume means ignoring the discovery pathway that produces one-third of all ChatGPT citations.

Discovery Pathway | Share of Citations | Traditional Search Volume | Optimization Approach
Primary query match | ~67% | Measurable via SEO tools | Standard keyword optimization
Fan-out sub-query only | 32.9% | 95% have zero volume | Topical depth and entity coverage
Cross-platform overlap | 11% | Varies | Multi-platform optimization

The practical response: build topical depth. Pages that cover a subject comprehensively — addressing subtopics, edge cases, comparisons, and related entities — create more surface area for fan-out sub-queries to match. Topic clusters of 5 or more interconnected pages earn 3.2 times more citations than isolated pages, with 86% of citations going to clustered sites (Source: AIScore, 2026).

This aligns with the AI search optimization principle that depth on fewer topics outperforms thin coverage across many topics.

Fan-out queries invert the traditional SEO model — the most valuable citation opportunities have zero measurable search volume, and only topical depth can capture them.


Practical Optimization for Each Pipeline Stage

Optimizing for the full 6-stage pipeline requires stage-specific tactics — retrieval-only optimization captures at most 15% of citation potential, while full-pipeline optimization addresses the 85% of pages that are retrieved but never cited.

The LumenGEO Citation Pipeline Model translates directly into a stage-by-stage optimization checklist. Each stage has distinct failure modes and corresponding fixes.

Stage 1 Optimization: Capture Fan-Out Sub-Queries

Cover topics comprehensively to match machine-generated sub-queries. Include the current year ("2026") in titles and H1 headings — AI platforms auto-append the year to 28.1% of sub-queries (Source: LumenGEO Sprint 2 data). Build topic clusters with 5 or more interlinked pages per subject to maximize sub-query surface area.

Stage 2 Optimization: Ensure Technical Retrievability

Allow GPTBot, OAI-SearchBot, and ChatGPT-User in your robots.txt file. Serve content as static HTML — ChatGPT's crawlers do not execute JavaScript. Maintain fast page load times under 3 seconds, as AI retrieval systems operate under 1-5 second timeouts (Source: LumenGEO Playbook). Submit your sitemap to Bing Webmaster Tools, since ChatGPT uses Bing's index for retrieval.
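A minimal robots.txt sketch that permits the three OpenAI crawlers named above (standard robots.txt directives; adapt the paths and any Disallow rules to your own site):

```
# Allow OpenAI's crawlers (user-agent names from OpenAI's crawler documentation)
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
```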

Stage 3 Optimization: Survive Reranking

Use definitive, declarative language — never hedge with "may," "could," or "possibly." Include 15 or more named entities per page for 4.8 times higher citation probability (Source: Wellows, 2026). Structure content into self-contained 75-350 word sections under clear headings. Add comparison tables — semantic HTML tables achieve 47% higher citation rates than equivalent prose (Source: TryProfound, 2025).

Stage 4 Optimization: Win the Context Window

Front-load the most important facts in the first 30% of each page and each section. Place answer capsules (40-60 word summaries) immediately after every H2 heading. Use the answer-first writing pattern: the opening sentence of each section directly states the key finding, with supporting detail following.

Stage 5 Optimization: Earn the Citation

Write original research and proprietary data analysis — original data earns 4.1 times more citations (Source: Digital Bloom, 2025). Use SVO (Subject-Verb-Object) sentence structure: "Stripe processes $1 trillion annually" not "Over $1 trillion is processed by Stripe each year." Target the 0.47 subjectivity sweet spot: factual claims with contextual interpretation, not dry data or pure opinion.

Stage 6 Optimization: Cross the Attribution Threshold

Build brand recognizability through consistent mentions across forums, publications, and industry directories — brand mentions (r=0.664) are 3 times more powerful than backlinks (r=0.218) for AI citation (Source: Wellows, 2026; GoDataFeed, 2026). Update content at least every 30 days for a 3.2 times citation advantage. Ensure every passage names the entity explicitly — "LumenGEO's analysis" not "our analysis" — so the model can attribute without ambiguity.

Pipeline Stage | Primary Failure Mode | Top Fix | Impact
1. Decomposition | Missing sub-query matches | Build topic clusters, add year signals | +32.9% discovery surface
2. Retrieval | Not indexed or too slow | Allow AI crawlers, static HTML, Bing WMT | Table stakes
3. Reranking | Hedged language, poor structure | Definitive phrasing, 15+ named entities | 4.8x citation probability
4. Context Assembly | Key facts buried mid-page | Front-load first 30%, answer capsules | 44.2% of citations from top 30%
5. Generation | No original data or attribution | Original research, SVO structure | 4.1x citation multiplier
6. Attribution | Low brand recognizability | Brand mention seeding, 30-day freshness | 3x stronger than backlinks

To measure how your content performs across these 6 stages, run a GEO Score audit — it quantifies your citation presence, prominence, quality, and density into a single actionable metric.

Full-pipeline optimization is what separates brands that get cited from brands that merely get retrieved — and it is the core discipline of GEO.


Key Takeaways

  • ChatGPT follows a deterministic 6-stage pipeline — prompt decomposition, retrieval, reranking, context assembly, generation, and source attribution — and content can be eliminated at any stage. Optimizing for retrieval alone addresses only 1 of 6 gates.

  • 85% of retrieved pages never earn a citation. The retrieval-to-citation gap is the central challenge of GEO. Getting found is not the same as getting cited.

  • Fan-out sub-queries account for 32.9% of all citations. These machine-generated queries have zero traditional search volume and can only be captured through topical depth and entity coverage, not keyword targeting.

  • The first 30% of assembled content captures 44.2% of citations due to the U-shaped attention pattern documented by Stanford's Liu et al. Front-loading key facts is not optional.

  • Definitive phrasing earns citations at 36.2% versus 20.2% for hedged language. Every instance of "may," "could," or "possibly" in your content reduces citation probability by nearly half.

  • Original research earns 4.1 times more citations than content referencing third-party data. If you cite someone else's study, ChatGPT cites them — not you.

  • Brand mentions (r=0.664) are 3 times more powerful than backlinks (r=0.218) for AI citation. The traditional SEO playbook of link building is less effective than strategic brand presence across forums, publications, and industry directories.
