Locality Sensitive Hashing

Also known as: LSH, locality-sensitive hash, LSH algorithm

A family of randomized algorithms that map similar data points to the same hash buckets with high probability, enabling approximate nearest neighbor search in high-dimensional spaces without scanning every item — a key index structure in vector similarity search pipelines.

Locality Sensitive Hashing (LSH) is a family of algorithms that speed up similarity search by hashing similar data points into the same buckets, trading perfect accuracy for much faster lookups in high-dimensional spaces.

What It Is

When you search for something similar — a product image, a sentence meaning, or a customer profile — the system needs to compare your query against potentially millions of stored items. Checking every single one (brute force) works fine for small datasets, but becomes painfully slow as data grows. Locality Sensitive Hashing exists to solve this exact bottleneck in vector similarity search: finding similar items without checking everything.

Think of it like sorting books into themed shelves at a library. Instead of reading every book to find ones about cooking, you walk to the cooking section. LSH works the same way — it assigns each data point a short hash code designed so that items which are “nearby” in meaning or value land in the same bucket. When a query arrives, the system only checks items in the matching bucket, skipping vast portions of the dataset entirely.

In practice, LSH typically uses multiple hash tables rather than just one. Each table applies a different set of hash functions, creating multiple chances for similar items to land together. If two items share a bucket in any table, they become candidates for exact distance comparison. More tables mean higher recall (fewer missed neighbors) but also more memory and preprocessing time. This multi-table approach is what gives LSH its tunable accuracy-speed trade-off.
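The multi-table scheme can be sketched in a few lines of NumPy. This is a minimal illustration, not a production index: the dimension, table count, and bit width below are arbitrary, and random hyperplane hashing (the sign pattern of random projections) stands in for whichever hash family fits your metric.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_tables, n_bits = 32, 8, 12  # illustrative sizes, not tuned

# One set of random hyperplanes per table; each table yields an
# n_bits-long sign code (random hyperplane LSH for cosine similarity).
planes = [rng.standard_normal((n_bits, dim)) for _ in range(n_tables)]

def hash_vector(v, t):
    """Bucket key for table t: the sign pattern of v's projections."""
    return (planes[t] @ v > 0).tobytes()

# Index 1,000 random vectors: one dict of buckets per table.
data = rng.standard_normal((1000, dim))
tables = [{} for _ in range(n_tables)]
for i, v in enumerate(data):
    for t in range(n_tables):
        tables[t].setdefault(hash_vector(v, t), []).append(i)

def candidates(q):
    """Union of bucket members across all tables: sharing a bucket in
    any table makes an item a candidate for exact comparison."""
    out = set()
    for t in range(n_tables):
        out.update(tables[t].get(hash_vector(q, t), []))
    return out

# A slightly perturbed copy of data[0] should still retrieve item 0.
q = data[0] + 0.05 * rng.standard_normal(dim)
cands = candidates(q)
```

Raising `n_tables` gives near-duplicates more chances to land together (higher recall, more memory); raising `n_bits` shrinks buckets and the candidate set.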

According to Indyk & Motwani, who introduced LSH at STOC 1998, the technique achieves sublinear query time — meaning search speed grows much slower than your dataset size — after a one-time preprocessing step that scales roughly with the data. The core guarantee is probabilistic: similar items have a high probability of receiving the same hash, while dissimilar items are unlikely to collide. This is the opposite of traditional hash functions (like those used in databases), where any small change in input produces a completely different hash.

Several variants of LSH exist for different distance metrics. According to Wikipedia, common families include random hyperplane LSH for cosine similarity, p-stable LSH for Euclidean distance, and MinHash for Jaccard similarity. The choice depends on which distance metric your vector similarity search pipeline uses — a decision that directly shapes retrieval quality.
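As a concrete instance of one of these families, here is a minimal MinHash sketch for Jaccard similarity between sets. The toy sets and permutation count are made up for illustration; real implementations layer banding and tuned parameters on top of this core idea.

```python
import hashlib

def minhash(items, n_perms=128):
    """MinHash signature: for each of n_perms salted hash functions,
    keep the minimum hash value over the set's items."""
    sig = []
    for p in range(n_perms):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{p}:{x}".encode(), digest_size=8).digest(),
                "big")
            for x in items))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions is an unbiased
    estimate of the Jaccard similarity of the underlying sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two toy sets sharing 2 of 6 distinct items: true Jaccard = 1/3.
a = {"red", "green", "blue", "cyan"}
b = {"blue", "cyan", "magenta", "yellow"}
est = estimated_jaccard(minhash(a), minhash(b))
```

With 128 permutations the estimate typically lands within a few percentage points of 1/3; more permutations tighten it at the cost of longer signatures.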

How It’s Used in Practice

In the context of vector similarity search, LSH serves as one of the index structures that sits between raw embedding vectors and your application. When a vector database receives a query embedding, it needs a fast way to narrow down candidates before computing exact distances. LSH provides that first-pass filter — hashing the query vector and retrieving only the items that share its bucket.

According to FAISS Docs, LSH is available as IndexLSH in the FAISS library. Teams building search or recommendation systems sometimes start with LSH because it’s straightforward to implement and reason about. However, most production vector databases have shifted toward graph-based methods like HNSW, which tend to offer better recall-speed trade-offs for real-world workloads.

Pro Tip: If you’re evaluating index structures for a new vector search project, benchmark LSH against HNSW on your actual data before committing. LSH shines when memory is tight (hash codes are compact), but HNSW usually wins on recall at the same query speed.

When to Use / When Not

Use LSH:
- Large dataset with strict memory constraints
- Quick prototype for approximate nearest neighbor search
- Batch deduplication of documents or images

Avoid LSH:
- Production search needing highest possible recall
- Small dataset where brute force is fast enough
- Real-time search requiring sub-millisecond latency

Common Misconception

Myth: LSH guarantees finding the exact nearest neighbor, just faster. Reality: LSH is an approximate method. It trades perfect recall for speed — some true nearest neighbors may end up in different buckets and get missed. The “approximate” in approximate nearest neighbor search is the whole point: you accept a small accuracy loss to avoid scanning every item in your dataset.
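The recall gap is easy to demonstrate: index random vectors in a single small hash table and count how often a query's true nearest neighbor actually shares its bucket. All sizes here are arbitrary; with one table and 8 bits, recall falls well short of 1 by design, which is exactly why practical setups use many tables.

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n = 16, 2000
data = rng.standard_normal((n, dim))
planes = rng.standard_normal((8, dim))  # a single 8-bit hash table

def bucket(v):
    """Sign-pattern bucket key for one random-hyperplane table."""
    return (planes @ v > 0).tobytes()

buckets = {}
for i, v in enumerate(data):
    buckets.setdefault(bucket(v), []).append(i)

# For each query, does the true nearest neighbor share its bucket?
queries = rng.standard_normal((200, dim))
hits = 0
for q in queries:
    true_nn = int(np.argmin(np.linalg.norm(data - q, axis=1)))
    hits += true_nn in buckets.get(bucket(q), [])
recall = hits / len(queries)
```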

One Sentence to Remember

LSH is a shortcut for similarity search that groups similar items into the same bucket using special hash functions — a rough index that gets you close enough, fast enough, without checking everything in your vector store.

FAQ

Q: How does LSH differ from regular hashing? A: Regular hashing minimizes collisions — similar inputs get different hashes. LSH deliberately maximizes collisions for similar inputs so they land in the same bucket for fast similarity lookup.
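A quick contrast makes this concrete, using SHA-256 as the conventional hash and a random-hyperplane sign code as the locality-sensitive one (the input strings, vector size, and bit count are arbitrary examples):

```python
import hashlib
import numpy as np

# Conventional hashing: one changed character yields an unrelated digest.
d1 = hashlib.sha256(b"vector similarity search").hexdigest()
d2 = hashlib.sha256(b"vector similarity searcH").hexdigest()

# Locality-sensitive hashing: a near-duplicate vector yields a
# near-identical (often identical) sign code.
rng = np.random.default_rng(2)
planes = rng.standard_normal((16, 64))
v = rng.standard_normal(64)
w = v + 0.01 * rng.standard_normal(64)  # tiny perturbation of v
code_v = planes @ v > 0
code_w = planes @ w > 0
matching_bits = int((code_v == code_w).sum())
```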

Q: Is LSH still used in production vector databases? A: LSH remains available in libraries like FAISS, but most production systems have shifted to graph-based methods like HNSW, which typically achieve better recall at comparable query speeds.

Q: What types of similarity does LSH support? A: Different LSH variants handle different distance metrics: random hyperplane LSH for cosine similarity, p-stable LSH for Euclidean distance, and MinHash for Jaccard similarity between sets.

Expert Takes

LSH rests on a simple but fundamental mathematical property: hash functions can be designed so collision probability correlates with distance. The closer two points are in high-dimensional space, the more likely they receive the same hash. This probabilistic guarantee turns an exhaustive search into a bucket lookup. The theoretical elegance is that you get sublinear query time — a rare feat when dimensions grow large.
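For random hyperplane LSH that correlation has a closed form: two vectors at angle theta collide on a single hyperplane with probability 1 - theta/pi, which a quick Monte Carlo check confirms (the dimension and sample count below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.standard_normal(64)
y = rng.standard_normal(64)

# Angle between the two vectors.
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
predicted = 1 - theta / np.pi  # collision probability per hyperplane

# Empirical collision rate over many random hyperplanes: the fraction
# whose sign agrees for both vectors.
planes = rng.standard_normal((20000, 64))
observed = float(np.mean((planes @ x > 0) == (planes @ y > 0)))
```

The empirical rate matches the formula to within sampling noise, and concatenating k independent hyperplanes drives the collision probability to (1 - theta/pi)^k, which is what makes buckets selective.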

The practical question with LSH isn’t whether it works — it’s whether it’s the right index for your access pattern. LSH produces compact binary codes, which is great for memory. But tuning the number of hash tables and hash functions to hit your target recall takes experimentation. In most cases, HNSW gives you better recall out of the box with less parameter tuning. Pick LSH when memory budget is the binding constraint, not query speed.

Vector search is quietly becoming infrastructure that every product team depends on — recommendations, semantic search, content matching. If you’re evaluating index structures, LSH is worth understanding even if you end up choosing HNSW. The teams that understand the trade-off between hash-based and graph-based approaches make better architectural decisions. LSH’s memory efficiency still matters in edge deployment and cost-sensitive environments where every megabyte counts.

Approximate methods like LSH encode a decision: how much accuracy are you willing to sacrifice for speed? That’s a technical choice with real consequences. When a recommendation system misses a relevant result because the hash function placed it in the wrong bucket, someone doesn’t see content they should have. The “approximate” label sounds harmless, but in applications affecting what people see, read, or buy, it deserves scrutiny.