
Publishing Markdown mirrors of your web pages for AI search visibility is a waste of time. Here's why AI crawlers stick to HTML, and what you should focus on instead.
If you've spent any time in GEO or AEO circles lately, you've heard the advice: "Publish Markdown versions of your pages so AI can read them more easily." Maybe you've also seen people advocating for llms.txt files, a kind of robots.txt but specifically for language models.
The logic sounds solid on the surface. LLMs process text as tokens. Markdown is cleaner text than HTML. So give AI crawlers the clean version, and they'll reward you with citations. Right?
Not quite. And the reason tells you something useful about how AI search optimization actually works for HTML versus Markdown. Most of the advice out there has it backwards.
The hypothesis has a kernel of truth to it. Large language models do work with plain text during inference. When you paste a messy HTML page into ChatGPT, it handles the content just fine, but it's clearly working harder to parse navigation menus, script tags, and cookie banners away from the actual content.
So the leap seems reasonable: if you provide a clean Markdown version alongside your HTML page, AI systems should prefer it. Less noise, same information, easier to process.
But this conflates two different things. How an LLM processes text in a chat window is not the same as how an AI search engine discovers, retrieves, and selects sources to cite. They're separate systems with separate architectures. And that distinction matters more than I initially expected.
Recent controlled experiments have put this theory to the test. In one study, researchers published identical content in both HTML and Markdown formats across two scenarios: one with established pages, one with brand-new content. Both formats were equally discoverable, with identical URL structures and footer links pointing to each.
The results were pretty clear-cut.
Over 14 days, HTML pages received 7.4% of all AI bot traffic. The Markdown files received exactly zero percent. Not a trickle. Not a slow start. Zero visits from any AI crawler.
The breakdown by platform was consistent across the board: every AI crawler followed the same pattern, fetching the HTML pages and ignoring the Markdown files entirely.
And when it came to citations (the thing that matters for AI search visibility), not a single AI platform cited a .md URL. Every citation pointed to the HTML version.
This pattern holds beyond Markdown files too. A separate 90-day experiment tracking llms.txt adoption found that just 0.1% of AI bot traffic visited the /llms.txt file. Despite growing adoption of the standard (roughly 10% of domains now have one, according to SE Ranking), the crawlers aren't using it.
I'll be honest, when I first heard the Markdown-for-AI pitch, part of me thought there might be something to it. The logic is appealing if you understand how LLMs work internally. But the data told a different story.
Once you understand how AI search engines retrieve information, the Markdown results stop being surprising. Here's the pipeline that runs every time you ask ChatGPT, Perplexity, or Gemini a question that needs current information:
Step 1: You ask a question.
Step 2: The retrieval system searches the web. This is the part most people miss. AI search engines don't maintain their own separate index of .md files and llms.txt pages. They query existing web indexes. Indexes built on HTML.
Step 3: The system fetches the most relevant pages. These are HTML documents, because that's what the web serves.
Step 4: A content extraction layer strips out the noise (navigation, scripts, ads, boilerplate) and isolates the main content.
Step 5: The cleaned content goes into the LLM, which generates an answer with citations back to the source URLs.
This is Retrieval-Augmented Generation (RAG), and it's the architecture behind every major AI search product. The bit worth noting is Step 4: AI systems have already solved the "noisy HTML" problem. They don't need you to solve it for them by publishing a Markdown mirror. Their content extraction is good enough to pull clean text from HTML without help.
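Here's a minimal sketch of that pipeline. It's not any vendor's actual implementation: `search_web_index` and `generate_answer` are hypothetical stubs, and the extraction step uses BeautifulSoup to approximate what production extraction layers do.

```python
# Minimal sketch of the retrieval pipeline described above.
import requests
from bs4 import BeautifulSoup


def search_web_index(query: str) -> list[str]:
    # Step 2 (hypothetical stub): real systems query existing web indexes,
    # which are built from crawled HTML pages, not .md files.
    raise NotImplementedError


def extract_main_content(html: str) -> str:
    # Step 4: strip navigation, scripts, and boilerplate; keep the main content.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)


def generate_answer(question: str, contexts: list[str]) -> str:
    # Step 5 (hypothetical stub): the LLM call that writes the cited answer.
    raise NotImplementedError


def answer_with_citations(question: str) -> tuple[str, list[str]]:
    urls = search_web_index(question)                         # Step 2: search
    pages = [requests.get(u, timeout=10).text for u in urls]  # Step 3: fetch HTML
    contexts = [extract_main_content(p) for p in pages]       # Step 4: extract
    return generate_answer(question, contexts), urls          # Step 5: answer + cite
```

Notice that nothing in this flow ever requests a .md file. The index queried in Step 2 and the documents fetched in Step 3 deal in HTML from end to end.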
And HTML gives them something Markdown can't: metadata that signals relevance and authority.
Schema.org structured data tells an AI retrieval system what type of content a page contains. Is it an article, a FAQ, a product review, a how-to guide? Open Graph tags provide summary information. Canonical URLs prevent duplicate content confusion. The heading hierarchy in rendered DOM gives structural signals about topic coverage. Internal links provide authority and relationship context.
A Markdown file has none of this. It's just text. Clean text, sure, but text without the metadata layer that helps AI systems decide whether to cite you.
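To make that concrete, here's the kind of metadata layer a typical HTML page carries and a .md file can't (all URLs and values are illustrative):

```html
<head>
  <title>How AI Crawlers Read Your Pages</title>
  <!-- Canonical URL: prevents duplicate-content confusion -->
  <link rel="canonical" href="https://example.com/ai-crawlers" />
  <!-- Open Graph: summary information for systems that unfurl the page -->
  <meta property="og:title" content="How AI Crawlers Read Your Pages" />
  <meta property="og:type" content="article" />
  <meta property="og:description" content="What retrieval systems actually fetch, and why." />
  <!-- Schema.org: tells a retrieval system what type of content this is -->
  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "How AI Crawlers Read Your Pages",
    "datePublished": "2025-04-01"
  }
  </script>
</head>
```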
There's an analogy here that I reckon works. Asking "why don't AI crawlers read Markdown?" is a bit like asking "why doesn't my GPS use paper maps?" The paper map has the same geographical information. But the GPS needs structured data: coordinates, route metadata, real-time signals. Not a simpler format.
So if Markdown mirrors and llms.txt files aren't the answer, what is? The research points to three areas that seem to make a difference for generative engine optimization.
This is where the evidence is strongest, though it comes with caveats. An analysis of 73 websites found that those with properly implemented structured data were cited in AI responses 3.2 times more often than those without. Pages using FAQPage schema specifically achieved up to 2.7x higher citation rates. A BrightEdge study found that structured data combined with FAQ content blocks drove a 44% increase in AI search citations.
Both Google and Microsoft have confirmed this directly. Google stated in April 2025 that structured data gives pages a ranking advantage in AI Overviews. Microsoft confirmed in March 2025 that schema markup helps Copilot understand and cite content.
Fair warning on the evidence: most of these studies are industry case studies, not peer-reviewed research. A Search Atlas study from late 2024 found no correlation between schema coverage and citation rates. The field is young and the data is mixed. But the direction of evidence, and the platform confirmations, points toward structured data mattering.
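For reference, the FAQPage markup behind that 2.7x figure is small. A minimal sketch, with placeholder question and answer text:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "Do AI crawlers read Markdown mirrors?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "In controlled experiments, AI crawlers fetched the HTML pages and ignored the .md versions entirely."
    }
  }]
}
</script>
```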
We explored this broader pattern of how AI systems evaluate and select sources in our piece on how different AI research approaches handle the same source material. The way models weight evidence is directly connected to how retrieval systems select which pages to fetch in the first place.
AI retrieval systems prefer content that's already organised in ways they can extract directly. A study of 10,000 queries found that pages with structured lists, direct quotes, and cited statistics had 30-40% higher visibility in AI-generated responses.
What this means in practice:

- Lead with the direct answer, then support it with evidence.
- Break enumerable points into structured lists rather than burying them in paragraphs.
- Quote sources directly and cite specific statistics.
- Keep the heading hierarchy clean so each section maps to one extractable idea.
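In markup terms, an extraction-friendly page body might look something like this sketch (headings and copy are placeholders):

```html
<article>
  <h1>One clear topic per page</h1>
  <section>
    <h2>Lead with the direct answer</h2>
    <p>State the claim first, then the evidence, with the statistic cited inline.</p>
  </section>
  <section>
    <h2>Structure the supporting points</h2>
    <ul>
      <li>Lists for enumerable points</li>
      <li>Direct quotes from named sources</li>
      <li>Specific, cited statistics</li>
    </ul>
  </section>
</article>
```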
This is where the relationship between traditional SEO and AEO becomes clear. The same content principles that made pages rank well in Google (clear structure, evidence, topical depth) are what make pages citable in AI search. The difference is that AI systems understand these signals more deeply, not differently. We wrote about this intersection in our post on how to structure documents so AI actually understands them.
The less glamorous stuff matters. AI crawlers, particularly ChatGPT-User (which dominates AI bot traffic), are retrieval crawlers. They fetch specific pages in response to specific user queries, and they need those pages to load fast and serve rendered content.
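Part of those fundamentals is confirming your robots.txt isn't blocking the crawlers you want to reach you. The user-agent tokens below are the ones OpenAI, Anthropic, and Perplexity currently publish, but verify against each vendor's documentation before copying:

```
# Allow AI crawlers (user-agent tokens as published by the vendors)
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /
```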
Skip the Markdown mirrors. Here's where that time is probably better spent:
- Implement structured data. Article, FAQPage, or HowTo schema tells retrieval systems what type of content you're publishing.
- Structure for extraction. Clear headings, lists, direct quotes, and cited statistics are what retrieval systems lift into answers.
- Use semantic HTML. <article> and <section> elements give the content extraction layer clean signals to work with.
- Get the fundamentals right. Fast loads, server-rendered content, and AI crawlers left unblocked in robots.txt.

Markdown mirrors, llms.txt files, and other alternative-format approaches consistently show near-zero impact on AI search visibility. The data from multiple experiments points the same direction: AI search engines are built on web infrastructure, and they want well-structured HTML.
This probably shouldn't be surprising. The entire retrieval layer of AI search, the part that decides which pages to fetch and cite, runs on existing web indexes. Those indexes were built for HTML. The models that generate answers already have content extraction that handles HTML noise well enough. There's not really a gap for Markdown to fill.
The effort spent creating .md mirrors of your pages is better invested in structured data implementation, content quality, and technical fundamentals. Those are the levers that seem to move AI citations.
Start with a structured data audit. Check your robots.txt for AI crawler access. Review your content structure against the checklist above. That's where the real AEO work happens, not in file format experiments. We've been tracking these patterns across our own content too, and the results mirror what the broader research shows. Our deep dive on the AI verification triage (what to always check, what to spot-check, and what to trust) covers the other side of this coin: once AI does cite your content, how do you verify what it's saying?