Community Detection

Also known as: graph clustering, network community detection, modularity clustering

Community detection is a graph algorithm that identifies clusters of densely connected nodes inside a network or knowledge graph. In GraphRAG systems, it groups related entities and concepts so the model can summarize each cluster and answer questions that span multiple topics.


What It Is

A knowledge graph can hold thousands of entities — people, products, events, ideas — all linked by relationships. If a question asks something broad like “What are the main themes in this dataset?”, searching node by node misses the bigger picture. Community detection solves that problem. It scans the graph and groups nodes that talk to each other more than they talk to outsiders, giving you a map of natural topic clusters instead of a flat list of facts.

The algorithm looks at edges (the connections between nodes) and measures how tightly each subgroup is wired together. A community is a region where internal links are dense and external links are sparse. Think of it like detecting friend groups at a wedding: people in the same group exchange more glances and conversations with each other than with strangers across the room. Algorithms such as Leiden and Louvain do this mathematically, optimizing a score called modularity, which rises when a group holds more internal edges than a random rewiring with the same node degrees would produce.
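As a rough illustration of the score itself, here is a minimal Python sketch that computes modularity by hand for a toy graph: two triangles joined by a single bridge edge. The graph, the node names, and the `modularity` helper are all made up for this example. Leiden and Louvain search for a good partition heuristically; this snippet only scores partitions you hand it.

```python
from collections import defaultdict

def modularity(edges, communities):
    """Modularity of a partition of an undirected, unweighted graph.

    For each community: (share of edges that fall inside it) minus
    (the share expected if edges were rewired at random while keeping
    every node's degree).
    """
    m = len(edges)
    community_of = {n: i for i, group in enumerate(communities) for n in group}
    internal = defaultdict(int)      # edges with both ends in community i
    degree_sum = defaultdict(int)    # total degree of the nodes in community i
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(
        internal[i] / m - (degree_sum[i] / (2 * m)) ** 2
        for i in range(len(communities))
    )

# Hypothetical toy graph: triangle a-b-c, triangle d-e-f, bridge c-d.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

natural = [{"a", "b", "c"}, {"d", "e", "f"}]   # one community per triangle
mixed = [{"a", "b", "d"}, {"c", "e", "f"}]     # arbitrary split across triangles

print(round(modularity(EDGES, natural), 3))  # 0.357
print(round(modularity(EDGES, mixed), 3))    # -0.214
```

The split that follows the two triangles scores well above zero, while the arbitrary split scores below zero, meaning it holds fewer internal edges than random wiring would.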

Once communities are found, each one can be summarized. In GraphRAG, the system writes a short description of every cluster — for example, “supply chain disruptions in 2023” or “researchers studying transformer architectures.” When a user asks a global question, the model reads cluster summaries first, picks the relevant ones, and only dives deeper where it needs to. This turns a noisy graph into a layered table of contents.
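The "read summaries first, pick the relevant ones" step can be sketched in a few lines of Python. This is a deliberately naive stand-in: it ranks summaries by shared words, whereas a real GraphRAG system would more plausibly ask a language model or use embeddings to judge relevance. The community IDs, the third summary, and the `pick_communities` helper are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def pick_communities(question, summaries, top_k=2):
    """Return up to top_k community IDs whose summaries share words with the question."""
    question_words = tokenize(question)
    scored = []
    for community_id, summary in summaries.items():
        overlap = len(question_words & tokenize(summary))
        if overlap > 0:                      # skip summaries with nothing in common
            scored.append((overlap, community_id))
    # Highest overlap first; ties broken by ID so the result is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [community_id for _, community_id in scored[:top_k]]

# Hypothetical summaries, one per detected community.
SUMMARIES = {
    "c0": "supply chain disruptions in 2023",
    "c1": "researchers studying transformer architectures",
    "c2": "quarterly marketing budget planning",
}

print(pick_communities("What were the main supply chain disruptions in 2023?", SUMMARIES))
# ['c0']
```

Only the selected communities would then be expanded into their underlying nodes, which is what keeps a broad question from touching the whole graph.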

Communities are not fixed labels. Re-running detection after new nodes are added can shift boundaries, merge groups, or split them. Most production GraphRAG systems rebuild communities periodically rather than on every write, balancing freshness against the cost of re-indexing.

How It’s Used in Practice

When someone uses GraphRAG to query a corpus — say, internal company documents or research papers — community detection runs during the indexing stage, before any questions arrive. The system extracts entities, builds relationships, then partitions the graph into communities at multiple resolution levels. Microsoft’s GraphRAG implementation, for example, produces a hierarchy: small communities nested inside larger ones. At query time, the model walks down the hierarchy to find the level that best matches the question’s scope.
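One way to picture the nested hierarchy is as a tree of communities, where each depth is one zoom level. The sketch below is an assumed data model for illustration only, not Microsoft's; the `Community` class, the `communities_at_depth` helper, and the sample names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Community:
    name: str
    nodes: set
    children: list = field(default_factory=list)   # nested sub-communities

def communities_at_depth(root, depth):
    """Return the communities found `depth` levels below the root."""
    frontier = [root]
    for _ in range(depth):
        next_frontier = []
        for community in frontier:
            # A community with no sub-communities stays as it is.
            next_frontier.extend(community.children or [community])
        frontier = next_frontier
    return frontier

# Hypothetical hierarchy: the whole corpus, two topics, one topic split further.
corpus = Community("corpus", set("abcdef"), [
    Community("left", set("abc"), [
        Community("left-1", set("ab")),
        Community("left-2", set("c")),
    ]),
    Community("right", set("def")),
])

for depth in range(3):
    print(depth, [c.name for c in communities_at_depth(corpus, depth)])
# 0 ['corpus']
# 1 ['left', 'right']
# 2 ['left-1', 'left-2', 'right']
```

A broad question would read the summaries at a shallow depth; a narrower one would read a deeper level, where the same nodes are divided into smaller groups.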

This matters most for “global” questions — the kind a vector search struggles with. “What are the main risks discussed across these contracts?” cannot be answered by retrieving the top few most similar paragraphs. It needs an aggregated view. Community summaries give the model that view in a single read, then point it to the underlying nodes if more detail is required.

Pro Tip: Tune the resolution parameter before you tune your prompts. A lower resolution gives you a few large communities (good for executive-summary questions); a higher resolution gives many small ones (good for detail-heavy questions). Most teams ship one fixed setting and then wonder why broad questions feel shallow — running both levels and letting the query type pick is a cheap upgrade.
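To make the resolution trade-off concrete, the sketch below scores three hand-picked partitions of a toy graph using modularity with a resolution multiplier on the random-chance term. Everything here is an assumption for illustration: the toy graph, the `score` and `best_partition` helpers, and the specific resolution values. A real detector searches over partitions rather than choosing from a fixed list, and parameter names and scales vary by library.

```python
def score(edges, communities, resolution):
    """Modularity with a resolution multiplier on the expected-edges term."""
    m = len(edges)
    total = 0.0
    for group in communities:
        internal = sum(1 for u, v in edges if u in group and v in group)
        degree_sum = sum((u in group) + (v in group) for u, v in edges)
        total += internal / m - resolution * (degree_sum / (2 * m)) ** 2
    return total

# Hypothetical toy graph: triangle a-b-c, triangle d-e-f, bridge c-d.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

CANDIDATES = {
    "one big community": [set("abcdef")],
    "two triangles": [set("abc"), set("def")],
    "six singletons": [{n} for n in "abcdef"],
}

def best_partition(resolution):
    """Name of the candidate partition with the highest score at this resolution."""
    return max(CANDIDATES, key=lambda name: score(EDGES, CANDIDATES[name], resolution))

for resolution in (0.2, 1.0, 3.0):
    print(resolution, "->", best_partition(resolution))
# 0.2 -> one big community
# 1.0 -> two triangles
# 3.0 -> six singletons
```

The graph never changes; only the resolution does, and the preferred grouping moves from one coarse community to many fine ones.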

When to Use / When Not

Scenario | Use | Avoid
Answering broad “what are the main themes” questions | ✓ |
Single-fact lookups already covered well by vector search | | ✓
Summarizing a large document collection at multiple zoom levels | ✓ |
Real-time graphs that change every few seconds | | ✓
Multi-hop reasoning that crosses topical clusters | ✓ |

Common Misconception

Myth: Community detection finds the “true” structure of the data, so the clusters it returns are the right answer. Reality: Different algorithms — and even different runs of the same algorithm — produce different communities, especially near boundaries. The clusters are useful approximations, not ground truth, and re-indexing can reshuffle them.
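A small worked case shows why there may be no single right answer. On a ring of six nodes, two different ways of cutting the ring in half earn exactly the same modularity, so an optimizer has no reason to prefer one over the other. The ring graph and the `modularity` helper below are a hypothetical sketch written for this example.

```python
def modularity(edges, communities):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    total = 0.0
    for group in communities:
        internal = sum(1 for u, v in edges if u in group and v in group)
        degree_sum = sum((u in group) + (v in group) for u, v in edges)
        total += internal / m - (degree_sum / (2 * m)) ** 2
    return total

# Ring of six nodes: 0-1-2-3-4-5-0.
RING = [(i, (i + 1) % 6) for i in range(6)]

split_a = [{0, 1, 2}, {3, 4, 5}]
split_b = [{1, 2, 3}, {4, 5, 0}]   # same ring, boundary shifted by one node

print(round(modularity(RING, split_a), 4))  # 0.1667
print(round(modularity(RING, split_b), 4))  # 0.1667
```

Nodes 0 and 3 land in different groups under the two splits, yet the score cannot tell the splits apart. Real graphs are rarely this symmetric, but near-ties at community boundaries behave the same way.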

One Sentence to Remember

Community detection turns a tangled knowledge graph into navigable neighborhoods, which is what lets GraphRAG answer questions about the whole corpus rather than just the slice that matches a keyword.

FAQ

Q: How is community detection different from clustering? A: Conventional clustering, such as k-means, groups items by distance in a vector space. Community detection groups nodes by how their edges connect. The same data can land in very different groups depending on which method you choose.

Q: Which algorithm does Microsoft GraphRAG use? A: It uses the Leiden algorithm, which improves on the older Louvain method by guaranteeing that every community it returns is internally well connected. Leiden also runs hierarchically, producing nested communities at multiple resolution levels.

Q: Do I need community detection if I already have vector search? A: For narrow, fact-style questions, vector search alone is fine. Community detection helps when you need to summarize across topics, follow themes through a graph, or answer global questions that no single passage contains on its own.

Expert Takes

Community detection rests on a simple statistical idea — a community is a region of the graph where internal edge density exceeds what random chance would predict. Algorithms such as Leiden formalize this through modularity scores. The math doesn’t promise the correct partition; it promises a locally optimal one. That distinction matters. Two valid runs can split the same graph differently, and neither is wrong. Not ground truth. Useful structure.

When you specify a GraphRAG pipeline, community detection is one of the stages you cannot skip thinking about. Resolution level, refresh cadence, hierarchy depth: each one is a contract between your indexing job and your query layer. Skip the spec, and you ship a system where broad questions silently degrade. Treat community detection settings the way you treat schema migrations: written down, versioned, reviewed before deployment, and noted in the changelog.

Community detection is the feature that lets a knowledge graph scale past the toy demo. Vector search wins the easy queries; whoever owns the global-question layer owns the enterprise contract. That is where community summaries earn their keep. Teams shipping GraphRAG without thinking through the community layer are leaving strategic ground unclaimed. The companies that win the next phase of retrieval are the ones already running this in production.

A community is a choice masquerading as a discovery. The algorithm decides which connections matter, which nodes belong together, which voices get bundled into one summary. Whoever sets the resolution parameter is, in effect, drawing borders on the map of knowledge a model will later answer from. Who reviews those borders? Who notices when an inconvenient topic gets folded into a larger, blander cluster?