
The Citation Diversity Index (CDI), explained: why single-source dependence can wreck your AEO

A differentiating metric for GEO · the engineering definition of CDI and how to move it from 30 to 70+

May 4, 2026·17 min read·By Maxfound AI research team· Research × Data science
📊 TL;DR · 3-min read
  • 1. CDI (Citation Diversity Index) = Shannon entropy, operationalized · it measures whether your citations are spread across many source domains or concentrated on a single platform
  • 2. Extreme example: 80 citations all on one site → CDI 0; 80 citations spread across 8 sources → CDI 78. CDI 78 is a more realistic target than CDI 95
  • 3. Three risks of single-source dependence: platform policy changes / AI crawler rate-limiting or blocking / sudden topical spikes that silence you
  • 4. Five symptoms of CDI < 50, including: top source share > 60% / only 2-3 of 7+ source types used / long-tail citations concentrated on one domain
  • 5. A three-layer path from CDI 30 to 70+: identify the dominant source → find fuel domains → ship 8-12 owned + earned pieces a month

1. What CDI is · Shannon entropy, operationalized

The Citation Diversity Index (CDI) is a core health metric built by the Maxfound AI team. It measures not how many times you're cited (that's Citation Count) but how many independent sources your citations are spread across, and how evenly.

Why isn't Citation Count enough? Because 80 citations from a single publisher network are not the same as 80 citations spread across 8 independent sources. The former is a single point of failure; the latter is a healthy distribution, and an answer engine plausibly makes a similar judgment when it assesses your standing in a category.

The math behind CDI is Shannon entropy. Simplified: aggregate every citation by source domain, compute each domain's share of total citations p_i, then H = -Σ(p_i × log₂(p_i)). We normalize H onto a 0-100 scale — that's CDI.

Intuition: cited 100 times, all from one site → p = 1 → H = 0 → CDI = 0. Cited 100 times, evenly across 4 domains (25% each) → H = 2 → CDI ≈ 67. Cited 100 times, evenly across 16 domains (6.25% each) → H = 4 → CDI ≈ 100 (the theoretical ceiling).
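The entropy step can be sketched in a few lines of Python. This is a minimal illustration, not Maxfound's implementation: the function names are hypothetical, domains are grouped by plain hostname (subdomains count as separate sources, which is an assumption), and the sketch stops at H because the 0-100 normalization is not specified here.

```python
from collections import Counter
from math import log2
from urllib.parse import urlparse


def domain_counts(cited_urls):
    """Count citations per source domain (lowercased hostname, leading 'www.' dropped)."""
    counts = Counter()
    for url in cited_urls:
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host:  # skip URLs with no recognizable hostname
            counts[host] += 1
    return counts


def shannon_entropy(counts):
    """Shannon entropy in bits: H = -sum(p_i * log2(p_i)).

    Written as sum(p_i * log2(1 / p_i)), which is algebraically identical
    and returns a clean 0.0 when every citation comes from one domain.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) * log2(total / n) for n in counts.values() if n > 0)


# Hypothetical citation URLs, for illustration only.
urls = [
    "https://www.example.com/a",
    "https://example.com/b",
    "https://news.example.org/x",
    "https://forum.example.net/y",
]
counts = domain_counts(urls)
print(counts["example.com"], shannon_entropy(counts))  # 2 1.5
```

Feeding it the three intuition cases (one domain, 4 even domains, 16 even domains) gives H = 0, 2 and 4 bits respectively.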

ℹ Tip
Why does this metric matter so much for GEO? In English-language settings, answer engines naturally draw from diverse sources (Reddit, Quora, Wikipedia, Medium, niche communities), so single-source dependence is rarely severe. In markets where one platform dominates content, single-source dependence is the real risk for most brands.

2. 80 citations on one site = CDI 0 · 80 across 8 sources = CDI 78

Let's turn the formula into two concrete, visualizable scenarios.

Scenario A · single-point extreme: a new consumer brand has 80 answer-engine citations, 78 from one platform's columns and 2 from another. p_primary = 0.975 · p_other = 0.025 · H ≈ 0.17 · CDI ≈ 4. This brand's visibility might be high today, but the moment that one platform changes policy, nearly the whole AEO footprint can vanish overnight.

Scenario B · healthy distribution: another brand also has 80 citations, spread across 8 sources (18 / 14 / 12 / 10 / 8 / 7 / 6 / 5). The p distribution is far more even · H ≈ 2.88 · CDI ≈ 78. Even if one source vanishes overnight, this brand still has seven legs to stand on.
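To make the "legs to stand on" point concrete, here is a small sketch that computes the entropy of both scenarios straight from their citation counts, plus the share of citations that would survive if the largest source disappeared. The helper names are hypothetical, and the survival share is an illustrative measure added here, not part of the CDI definition.

```python
from math import log2


def entropy_bits(counts):
    """Shannon entropy in bits for a list of per-source citation counts."""
    total = sum(counts)
    return sum((n / total) * log2(total / n) for n in counts if n > 0)


def share_left_without_top(counts):
    """Fraction of today's citations that remain if the largest source vanishes."""
    total = sum(counts)
    return (total - max(counts)) / total


scenario_a = [78, 2]                        # 80 citations, two sources
scenario_b = [18, 14, 12, 10, 8, 7, 6, 5]   # 80 citations, eight sources

for name, counts in (("A", scenario_a), ("B", scenario_b)):
    print(name, round(entropy_bits(counts), 2), f"{share_left_without_top(counts):.1%}")
# A 0.17 2.5%
# B 2.88 77.5%
```

Losing the top source costs Scenario A 97.5% of its citations and Scenario B only 22.5%, which is the practical difference the entropy gap is pointing at.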

Why is CDI 78 more realistic than CDI 95? Because CDI 95+ implies a near-perfectly even distribution — almost impossible in reality. Every industry has its naturally preferred content platforms. Pushing CDI into the 70-85 band is achievable; pushing it past 95 is over-optimization.

⚠ Note
Practitioner note: CDI 70-85 is the healthy band · CDI > 90 usually means you're underweight on some category-core sources — a different kind of imbalance.

3. Three risks of single-source dependence

Let's break single-source dependence (CDI < 30) into three concrete risk scenarios — all common traps in GEO.

  1. Risk 1 · platform policy changes — major content platforms revise their policies regularly (tighter review, marketing-content classification, commercial-account throttling), and each revision can demote a batch of brands' citation libraries on that platform short-term. Brands betting entirely on one platform face a 1-2 week visibility cliff with every change
  2. Risk 2 · AI crawler rate-limiting / blocking: AI crawlers such as GPTBot have repeatedly been throttled or blocked by some large platforms. If the platform that hosts 80% of your citations blocks a given crawler one day, you simply disappear from the answer engine that crawler feeds
  3. Risk 3 · sudden topical spikes silence you — during holidays, industry black swans, or breaking negative events, a single platform's content-heat structure gets scrambled instantly. If you only have one leg, you're the first to fall behind
⚠ Note
We've seen at least three clients with CDI < 20 whose overall answer-engine visibility dropped from 60% to 12% in a week after one platform policy change — with zero buffer.
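For Risk 2, one rough early-warning signal is the dominant platform's robots.txt. The sketch below uses Python's standard `urllib.robotparser` against an inline sample file; the sample rules, the `platform.example` URL and the `crawler_allowed` helper are all hypothetical. Note that robots.txt only shows a platform's declared policy, so server-side throttling or blocking will not appear in it.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: blocks GPTBot site-wide, allows everyone else.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

page = "https://platform.example/columns/post-1"
print(crawler_allowed(SAMPLE_ROBOTS, "GPTBot", page))        # False
print(crawler_allowed(SAMPLE_ROBOTS, "SomeOtherBot", page))  # True
```

To check a live site instead of a pasted file, `RobotFileParser` also offers `set_url()` and `read()`, which fetch the robots.txt over the network.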

4. Five symptoms of CDI < 50

How can you roughly judge your CDI without opening a tool? Use this five-symptom self-check: if two or more apply, your CDI is very likely below 50.

  1. Top source share > 60% — aggregate every source domain that cites you across all answer-engine answers; if the #1 domain exceeds 60%, that's single-source dependence
  2. Only 2-3 of 7+ source types used — the standard source-type pool has 7-8 categories (community columns / social content / publishers / vertical trade media / encyclopedias / your own site / government / academic); using only 2-3 is an unhealthy diversity structure
  3. Long-tail citations concentrated on one domain: beyond the top source, even your long-tail citations are just different URLs on that same domain. This is the classic symptom of betting your entire SEO + GEO strategy on one platform
  4. New content only catches fire on one platform — content you ship gets picked up by engines only on platform X, with zero citations elsewhere — your distribution covers a single channel
  5. TTFC (time-to-first-citation) depends abnormally on one platform — every piece that earns a first citation within 7 days comes from the same domain — your fast lane is down to a single route
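If you can export your citations as (domain, source type) pairs, the first two symptoms are easy to check mechanically. The sketch below is a minimal version under assumptions: the `SOURCE_TYPES` labels are hypothetical names adapted from the source-type list in symptom 2, the input format is invented for illustration, and the thresholds simply restate the ones above.

```python
from collections import Counter

# Hypothetical labels, adapted from the source-type pool listed in symptom 2.
SOURCE_TYPES = {
    "community", "social", "publisher", "trade_media",
    "encyclopedia", "owned_site", "government_academic",
}


def symptom_check(citations):
    """Check symptoms 1 and 2 for a list of (domain, source_type) pairs."""
    if not citations:
        raise ValueError("need at least one citation")
    domains = Counter(domain for domain, _ in citations)
    total = sum(domains.values())
    top_domain, top_count = domains.most_common(1)[0]
    types_used = {kind for _, kind in citations if kind in SOURCE_TYPES}
    return {
        "top_domain": top_domain,
        "top_share": top_count / total,
        "top_share_over_60pct": top_count / total > 0.60,   # symptom 1
        "types_used": len(types_used),
        "few_types_used": len(types_used) <= 3,             # symptom 2
    }


# Illustrative data only: 10 citations, 7 of them from one platform.
example = (
    [("platform-x.example", "community")] * 7
    + [("trade.example", "trade_media")] * 2
    + [("brand.example", "owned_site")]
)
report = symptom_check(example)
print(report)
```

In this example both flags come back true (70% top share, 3 source types used), so the brand already hits the two-symptom bar.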

5. How to move from CDI 30 to 70+

Diagnosis done — here's the action plan. The standard three-layer path from CDI 30 to 70+ is the exact workflow we run with every low-CDI client.

  1. Layer 1 · identify the source you're already dominant on: first see clearly which single domain accounts for the bulk of your citations today. That's the "single point." Acknowledge it; don't pretend it isn't there
  2. Layer 2 · find your fuel domains — use the Citation Heist Map to see which source domains your top 5 competitors are cited on that you haven't covered yet. Those are your "fuel domains." Typical fuel: vertical trade media / government or association sites / peer podcast transcripts / encyclopedias and knowledge graphs (Wikipedia, Wikidata)
  3. Layer 3 · ship 8-12 owned + earned pieces a month: force channel diversity; stop betting it all on the dominant platform. Suggested mix at the top of that range (11-12 pieces): 4 owned (site + newsletter) + 4 earned (contributed to vertical trade media) + 2-3 PR (industry awards / media coverage) + 1 knowledge-graph (encyclopedia entry / Wikidata QID). Run this cadence for 3 months and CDI typically moves from 30 to 65-75
💡 Key point
Demo sandbox (fictional example, not real client data): a restaurant chain starts at CDI 28 (71% dominant on one platform); after a 3-month cadence of 12 pieces a month, CDI rises to 73 and overall Citation Rate climbs 1.6× — diversification pulls faster than deepening a single source.
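The Layer 2 idea (domains that cite your competitors but not you) boils down to a set difference. The sketch below is a generic illustration, not how the Citation Heist Map works internally: the function name, the input shape and the `min_competitors` threshold are all assumptions, and the domains are made up.

```python
from collections import Counter


def fuel_domains(own_domains, competitor_domains_by_brand, min_competitors=2):
    """Rank domains that cite at least `min_competitors` competitors but not us."""
    own = set(own_domains)
    seen = Counter()
    for domains in competitor_domains_by_brand.values():
        # Count each domain once per competitor, skipping ones we already cover.
        seen.update(set(domains) - own)
    ranked = [(domain, n) for domain, n in seen.items() if n >= min_competitors]
    # Most widely shared gaps first; ties broken alphabetically.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical data for illustration.
own = {"platform-x.example", "brand.example"}
competitors = {
    "rival-1": {"platform-x.example", "trade.example", "wiki.example"},
    "rival-2": {"trade.example", "assoc.example"},
    "rival-3": {"trade.example", "wiki.example", "platform-x.example"},
}
print(fuel_domains(own, competitors))
# [('trade.example', 3), ('wiki.example', 2)]
```

Requiring at least two competitors per domain is a design choice to filter out one-off placements; lowering it to 1 surfaces every uncovered domain.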

6. Evidence from the Princeton paper

CDI isn't invented out of thin air. The Princeton team's 2024 KDD GEO paper (Aggarwal et al.) showed that content-level changes such as adding quotations, statistics, and cited sources measurably raise a page's visibility in generative-engine answers. It supports CDI indirectly: the paper tests what makes a single page more citable, not how citations are spread across domains.

Core data from the paper: among the content levers it tested, adding direct quotations is the one usually cited for a lift of about +41%, which is the source of the 41% figure quoted so often in CDI discussions. The return-curve pattern often attached to that figure is a practitioner heuristic rather than a result from the paper, which is a benchmark study and not a longitudinal one: single-source deepening grows faster in the first 2 months but slows sharply by months 3-6, while a diversified play starts slower yet overtakes by 1.4-1.8× after 6 months.

That pattern aligns with our own observations: a diversified CDI footprint produces a steadier cumulative citation-return curve. The ROI of diversification isn't higher, it's steadier.

7. Next steps

If you want to act after reading this, three escalating moves:

  1. Free · 30-second check — go to maxfound.ai/check, enter your brand name, run a scan across leading answer engines, and get your current CR / CDI / TTFC baselines (no cost, no phone number, no login)
  2. 5 minutes · open the Citation Heatmap widget — start a trial account and open the Citation Heatmap widget in the command center; it tells you your top-3 source share, how many of the 7 source types you cover, and a candidate list of fuel domains
  3. 30 minutes · request a 1:1 demo — go to maxfound.ai/request-demo; the founding team and Customer Success join the call together and map a 3-month CDI strengthening roadmap for you live
