
Publishing Markdown mirrors of your web pages for AI search visibility is a waste of time. Here's why AI crawlers stick to HTML, and what you should focus on instead.
If you've spent any time in GEO or AEO circles lately, you've heard the advice: "Publish Markdown versions of your pages so AI can read them more easily." Maybe you've also seen people advocating for llms.txt files, a kind of robots.txt but specifically for language models.
The logic sounds solid on the surface. LLMs process text as tokens. Markdown is cleaner text than HTML. So give AI crawlers the clean version, and they'll reward you with citations. Right?
Not quite. And the reason tells you something useful about how AI search optimization actually works for HTML versus Markdown. Most of the advice out there has it backwards.
The hypothesis has a kernel of truth to it. Large language models do work with plain text during inference. When you paste a messy HTML page into ChatGPT, it handles the content just fine, but it's clearly working harder to parse navigation menus, script tags, and cookie banners away from the actual content.
So the leap seems reasonable: if you provide a clean Markdown version alongside your HTML page, AI systems should prefer it. Less noise, same information, easier to process.
But this conflates two different things. How an LLM processes text in a chat window is not the same as how an AI search engine discovers, retrieves, and selects sources to cite. They're separate systems with separate architectures. And that distinction matters more than I initially expected.
Recent controlled experiments have put this theory to the test. In one study, researchers published identical content in both HTML and Markdown formats across two scenarios: one with established pages, one with brand-new content. Both formats were equally discoverable, with identical URL structures and footer links pointing to each.
The results were pretty clear-cut.
Over 14 days, HTML pages received 7.4% of all AI bot traffic. The Markdown files received exactly zero percent. Not a trickle. Not a slow start. Zero visits from any AI crawler.
The breakdown by platform was consistent across the board: every AI crawler followed the same pattern, fetching the HTML pages and ignoring the Markdown files entirely.
And when it came to citations (the thing that matters for AI search visibility), not a single AI platform cited a .md URL. Every citation pointed to the HTML version.
This pattern holds beyond Markdown files too. A separate 90-day experiment tracking llms.txt adoption found that just 0.1% of AI bot traffic visited the /llms.txt file. Despite growing adoption of the standard (roughly 10% of domains now have one, according to SE Ranking), the crawlers aren't using it.
I'll be honest, when I first heard the Markdown-for-AI pitch, part of me thought there might be something to it. The logic is appealing if you understand how LLMs work internally. But the data told a different story.
Once you understand how AI search engines retrieve information, the Markdown results stop being surprising. Here's the pipeline that runs every time you ask ChatGPT, Perplexity, or Gemini a question that needs current information:
Step 1: You ask a question.
Step 2: The retrieval system searches the web. This is the part most people miss. AI search engines don't maintain their own separate index of .md files and llms.txt pages. They query existing web indexes. Indexes built on HTML.
Step 3: The system fetches the most relevant pages. These are HTML documents, because that's what the web serves.
Step 4: A content extraction layer strips out the noise (navigation, scripts, ads, boilerplate) and isolates the main content.
Step 5: The cleaned content goes into the LLM, which generates an answer with citations back to the source URLs.
This is Retrieval-Augmented Generation (RAG), and it's the architecture behind every major AI search product. The bit worth noting is Step 4: AI systems have already solved the "noisy HTML" problem. They don't need you to solve it for them by publishing a Markdown mirror. Their content extraction is good enough to pull clean text from HTML without help.
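Here's a minimal sketch of that pipeline. It's not any vendor's actual implementation: `search_web_index` and `generate_answer` are hypothetical stubs, and the extraction step uses BeautifulSoup to approximate what production extraction layers do.

```python
# Minimal sketch of the retrieval pipeline described above.
import requests
from bs4 import BeautifulSoup


def search_web_index(query: str) -> list[str]:
    # Step 2 (hypothetical stub): real systems query existing web indexes,
    # which are built from crawled HTML pages, not .md files.
    raise NotImplementedError


def extract_main_content(html: str) -> str:
    # Step 4: strip navigation, scripts, and boilerplate; keep the main content.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)


def generate_answer(question: str, contexts: list[str]) -> str:
    # Step 5 (hypothetical stub): the LLM call that writes the cited answer.
    raise NotImplementedError


def answer_with_citations(question: str) -> tuple[str, list[str]]:
    urls = search_web_index(question)                         # Step 2: search
    pages = [requests.get(u, timeout=10).text for u in urls]  # Step 3: fetch HTML
    contexts = [extract_main_content(p) for p in pages]       # Step 4: extract
    return generate_answer(question, contexts), urls          # Step 5: answer + cite
```

Notice that nothing in this flow ever requests a .md file. The index queried in Step 2 and the documents fetched in Step 3 deal in HTML from end to end.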
And HTML gives them something Markdown can't: metadata that signals relevance and authority.
Schema.org structured data tells an AI retrieval system what type of content a page contains. Is it an article, a FAQ, a product review, a how-to guide? Open Graph tags provide summary information. Canonical URLs prevent duplicate content confusion. The heading hierarchy in rendered DOM gives structural signals about topic coverage. Internal links provide authority and relationship context.
A Markdown file has none of this. It's just text. Clean text, sure, but text without the metadata layer that helps AI systems decide whether to cite you.
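To make that concrete, here's the kind of metadata layer a typical HTML page carries and a .md file can't (all URLs and values are illustrative):

```html
<head>
  <title>How AI Crawlers Read Your Pages</title>
  <!-- Canonical URL: prevents duplicate-content confusion -->
  <link rel="canonical" href="https://example.com/ai-crawlers" />
  <!-- Open Graph: summary information for systems that unfurl the page -->
  <meta property="og:title" content="How AI Crawlers Read Your Pages" />
  <meta property="og:type" content="article" />
  <meta property="og:description" content="What retrieval systems actually fetch, and why." />
  <!-- Schema.org: tells a retrieval system what type of content this is -->
  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "How AI Crawlers Read Your Pages",
    "datePublished": "2025-04-01"
  }
  </script>
</head>
```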
There's an analogy here that I reckon works. Asking "why don't AI crawlers read Markdown?" is a bit like asking "why doesn't my GPS use paper maps?" The paper map has the same geographical information. But the GPS needs structured data: coordinates, route metadata, real-time signals. Not a simpler format.
So if Markdown mirrors and llms.txt files aren't the answer, what is? The research points to three areas that seem to make a difference for generative engine optimization.
This is where the evidence is strongest, though it comes with caveats. An analysis of 73 websites found that those with properly implemented structured data were cited in AI responses 3.2 times more often than those without. Pages using FAQPage schema specifically achieved up to 2.7x higher citation rates. A BrightEdge study found that structured data combined with FAQ content blocks drove a 44% increase in AI search citations.
Both Google and Microsoft have confirmed this directly. Google stated in April 2025 that structured data gives pages a ranking advantage in AI Overviews. Microsoft confirmed in March 2025 that schema markup helps Copilot understand and cite content.
Fair warning on the evidence: most of these studies are industry case studies, not peer-reviewed research. A Search Atlas study from late 2024 found no correlation between schema coverage and citation rates. The field is young and the data is mixed. But the direction of evidence, and the platform confirmations, points toward structured data mattering.
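For reference, the FAQPage markup behind that 2.7x figure is small. A minimal sketch, with placeholder question and answer text:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "Do AI crawlers read Markdown mirrors?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "In controlled experiments, AI crawlers fetched the HTML pages and ignored the .md versions entirely."
    }
  }]
}
</script>
```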
We explored this broader pattern of how AI systems evaluate and select sources in our piece on how different AI research approaches handle the same source material. The way models weight evidence is directly connected to how retrieval systems select which pages to fetch in the first place.
AI retrieval systems prefer content that's already organised in ways they can extract directly. A study of 10,000 queries found that pages with structured lists, direct quotes, and cited statistics had 30-40% higher visibility in AI-generated responses.
What this means in practice:

- Lead with the direct answer, then support it with evidence.
- Break enumerable points into structured lists rather than burying them in paragraphs.
- Quote sources directly and cite specific statistics.
- Keep the heading hierarchy clean so each section maps to one extractable idea.
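In markup terms, an extraction-friendly page body might look something like this sketch (headings and copy are placeholders):

```html
<article>
  <h1>One clear topic per page</h1>
  <section>
    <h2>Lead with the direct answer</h2>
    <p>State the claim first, then the evidence, with the statistic cited inline.</p>
  </section>
  <section>
    <h2>Structure the supporting points</h2>
    <ul>
      <li>Lists for enumerable points</li>
      <li>Direct quotes from named sources</li>
      <li>Specific, cited statistics</li>
    </ul>
  </section>
</article>
```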
This is where the relationship between traditional SEO and AEO becomes clear. The same content principles that made pages rank well in Google (clear structure, evidence, topical depth) are what make pages citable in AI search. The difference is that AI systems understand these signals more deeply, not differently. We wrote about this intersection in our post on how to structure documents so AI actually understands them.
The less glamorous stuff matters. AI crawlers, particularly ChatGPT-User (which dominates AI bot traffic), are retrieval crawlers. They fetch specific pages in response to specific user queries, and they need those pages to load fast and serve rendered content.
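Part of those fundamentals is confirming your robots.txt isn't blocking the crawlers you want to reach you. The user-agent tokens below are the ones OpenAI, Anthropic, and Perplexity currently publish, but verify against each vendor's documentation before copying:

```
# Allow AI crawlers (user-agent tokens as published by the vendors)
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /
```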
Skip the Markdown mirrors. Here's where that time is probably better spent:
- Implement structured data. Article, FAQPage, or HowTo schema tells retrieval systems what type of content you're publishing.
- Structure for extraction. Clear headings, lists, direct quotes, and cited statistics are what retrieval systems lift into answers.
- Use semantic HTML. <article> and <section> elements give the content extraction layer clean signals to work with.
- Get the fundamentals right. Fast loads, server-rendered content, and AI crawlers left unblocked in robots.txt.

Markdown mirrors, llms.txt files, and other alternative-format approaches consistently show near-zero impact on AI search visibility. The data from multiple experiments points the same direction: AI search engines are built on web infrastructure, and they want well-structured HTML.
This probably shouldn't be surprising. The entire retrieval layer of AI search, the part that decides which pages to fetch and cite, runs on existing web indexes. Those indexes were built for HTML. The models that generate answers already have content extraction that handles HTML noise well enough. There's not really a gap for Markdown to fill.
The effort spent creating .md mirrors of your pages is better invested in structured data implementation, content quality, and technical fundamentals. Those are the levers that seem to move AI citations.
Start with a structured data audit. Check your robots.txt for AI crawler access. Review your content structure against the checklist above. That's where the real AEO work happens, not in file format experiments. We've been tracking these patterns across our own content too, and the results mirror what the broader research shows. Our deep dive on the AI verification triage (what to always check, what to spot-check, and what to trust) covers the other side of this coin: once AI does cite your content, how do you verify what it's saying?