Topic Clustering

Topic clustering is the grouping of related prompts, AI answers, or content pieces into thematic clusters. In answer engine optimization (AEO), it is used to consolidate prompt-level visibility data into actionable narratives about how AI describes a category.

Why it matters

Prompt-level data is too granular to act on; brand-aggregate data is too coarse. Topic clustering is the middle layer that turns 'we got mentioned 47 times across 200 prompts' into 'AI describes us as fast but expensive — here's the prompt cluster where price comes up.'

How it works

Embeddings convert each prompt and each AI response into a vector representation. A clustering algorithm (k-means, HDBSCAN, hierarchical clustering) groups vectors by semantic similarity. The resulting clusters represent themes — buyer-intent groups, feature-comparison groups, problem-discovery groups.
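
The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not a production setup: the embed function is a toy stand-in (normalised word counts) for a real embedding model, the prompts are invented, and the k-means routine is a simplified version with deterministic farthest-first seeding so the result is reproducible.

```python
import math
from collections import Counter


def embed(text, vocab):
    """Toy stand-in for an embedding model: L2-normalised word counts."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    # Vectors are already unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Simplified k-means on cosine similarity with farthest-first seeding."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Seed with the vector least similar to the centroids chosen so far.
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each vector joins its most similar centroid.
        labels = [
            max(range(k), key=lambda i: cosine(v, centroids[i])) for v in vectors
        ]
        # Update step: each centroid becomes the normalised mean of its members.
        for i in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == i]
            if not members:
                continue  # keep the old centroid for an empty cluster
            mean = [sum(col) / len(members) for col in zip(*members)]
            norm = math.sqrt(sum(x * x for x in mean)) or 1.0
            centroids[i] = [x / norm for x in mean]
    return labels, centroids


# Hypothetical prompts, invented for illustration.
prompts = [
    "cheapest crm pricing plans",
    "crm pricing compared to competitors",
    "is the crm pricing worth it",
    "fastest crm setup time",
    "crm setup speed for small teams",
    "how fast is crm setup",
]
vocab = sorted({w for p in prompts for w in p.split()})
vectors = [embed(p, vocab) for p in prompts]
labels, centroids = kmeans(vectors, k=2)

for label, prompt in zip(labels, prompts):
    print(label, prompt)
```

With these six prompts the pricing questions land in one cluster and the setup-speed questions in the other. In practice the vectors would come from a real embedding model, and a library implementation of k-means or HDBSCAN would replace the hand-rolled loop.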

Why it beats pure aggregation

Without clustering, you have a flat list of prompts and answers. With clustering, you can ask:

Which themes drive most of our share-of-voice?
Which themes describe us most negatively?
Which themes do competitors dominate that we don't appear in at all?
How is theme distribution shifting over time?

These questions can't be answered from prompt-level data alone, and they're the questions that produce content priorities and PR strategy.
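
As a rough sketch of how the first three questions might be answered once each answer carries a theme label, the snippet below aggregates per theme. The record layout, brand names, and sentiment scores are all hypothetical; sentiment is assumed to be a score in [-1, 1] describing how our brand is portrayed, and share of voice is computed as the fraction of answers in a theme that mention the brand.

```python
from collections import defaultdict

# Hypothetical records: one per AI answer, already assigned to a theme.
# "sentiment" is an assumed score in [-1, 1]; None means we were not mentioned.
answers = [
    {"theme": "pricing", "brands": ["us", "rival_a"], "sentiment": -0.6},
    {"theme": "pricing", "brands": ["rival_a"], "sentiment": None},
    {"theme": "pricing", "brands": ["us"], "sentiment": -0.2},
    {"theme": "speed", "brands": ["us"], "sentiment": 0.8},
    {"theme": "speed", "brands": ["us", "rival_b"], "sentiment": 0.5},
    {"theme": "integrations", "brands": ["rival_b"], "sentiment": None},
    {"theme": "integrations", "brands": ["rival_a", "rival_b"], "sentiment": None},
]


def theme_report(answers, brand="us"):
    """Summarise share of voice and average sentiment for each theme."""
    by_theme = defaultdict(list)
    for a in answers:
        by_theme[a["theme"]].append(a)

    report = {}
    for theme, rows in by_theme.items():
        ours = [r for r in rows if brand in r["brands"]]
        scores = [r["sentiment"] for r in ours if r["sentiment"] is not None]
        report[theme] = {
            "answers": len(rows),
            "share_of_voice": len(ours) / len(rows),
            "avg_sentiment": sum(scores) / len(scores) if scores else None,
        }
    return report


report = theme_report(answers)

# Theme where we are described most negatively (among themes that mention us).
most_negative = min(
    (t for t in report if report[t]["avg_sentiment"] is not None),
    key=lambda t: report[t]["avg_sentiment"],
)

# Themes where competitors appear and we do not appear at all.
absent = [t for t, r in report.items() if r["share_of_voice"] == 0]

print(most_negative, absent)
```

Tracking how theme distribution shifts over time would follow the same pattern, with a measurement date added to each record and the report computed per period.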

Practical considerations

Cluster granularity — too few clusters lose detail; too many fragment the data. 10-30 clusters is typical for category-level analysis.
Stability over time — re-clustering at every measurement run produces noise. Lock the cluster definitions and just classify new prompts into them.
Human labelling — clusters need human-readable names to be actionable. Auto-generated labels from LLMs work as a starting point but need editorial review.
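
One way to keep clusters stable, sketched here under stated assumptions, is to store the centroids from an initial clustering run and assign each new prompt to its nearest one. The centroid values, cluster names, and the min_similarity threshold below are hypothetical; the three-dimensional vectors stand in for real embedding vectors. Returning None for prompts that fit no cluster well is an added design choice, so that outliers can be reviewed by a person instead of being forced into a theme.

```python
import math

# Hypothetical locked centroids saved from an earlier clustering run.
# The 3-dimensional vectors are placeholders for real embedding vectors.
LOCKED_CENTROIDS = {
    "pricing": [0.9, 0.1, 0.0],
    "speed": [0.1, 0.9, 0.0],
    "integrations": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign(vector, centroids=LOCKED_CENTROIDS, min_similarity=0.8):
    """Return the closest locked cluster, or None if nothing is close enough."""
    name, score = max(
        ((n, cosine(vector, c)) for n, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return name if score >= min_similarity else None


print(assign([0.8, 0.2, 0.1]))  # close to the "pricing" centroid
print(assign([0.1, 0.2, 0.7]))  # close to the "integrations" centroid
print(assign([1.0, 1.0, 1.0]))  # equally far from every centroid
```

Because the centroids never change between runs, a shift in theme counts reflects a change in the prompts and answers, not a change in the cluster boundaries.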

Related terms

Embeddings: Dense numerical representations of text (or other data) in a high-dimensional vector space, where similar meanings are placed closer together.
Semantic Search: A search technique that understands the meaning and intent behind queries rather than matching exact keywords, using vector embeddings to find conceptually relevant results even when different words are used.
Vector Database: A database optimized for storing, indexing, and querying high-dimensional vector embeddings, enabling fast similarity search at scale.
Share of Voice (SoV): Share of voice (in the context of AI search) is the percentage of AI-generated answers about a category that mention a given brand, measured against named competitors over the same prompt set.
AI Visibility: AI visibility is the measurable presence and accuracy of a brand inside AI-assistant responses, covering how often it's mentioned, in what tone, with what facts, and against which competitors.