Vector Database Indexes: HNSW, IVF, PQ, and the Latency-Recall Frontier

Vector databases exist because “nearest neighbor” is easy to say and expensive to do at scale. The moment you have millions of vectors, high dimensionality, filters, and real latency targets, brute force similarity becomes a cost sink. Indexes are the bridge between semantic search as a concept and semantic search as a service.

The essential trade is not complicated to name, but it is complicated to manage:

  • Higher recall tends to cost more latency and more memory.
  • Lower latency tends to cost recall, especially on hard queries.
  • Better compression tends to cost accuracy unless the data is well behaved.
  • More filtering tends to cost performance unless the index was designed for it.

HNSW, IVF, and product quantization are three families of tools for negotiating that trade space. Understanding them at a systems level helps you choose an index that can survive growth rather than only pass a benchmark.

What an ANN index is really doing

Approximate nearest neighbor (ANN) search is often described as “finding close vectors quickly.” In practice, an ANN index is doing three things:

  • **Reducing the search space** so you do not evaluate every vector.
  • **Structuring memory access** so the CPU or GPU can stay busy instead of waiting on random reads.
  • **Providing tunable knobs** that let you pay more compute for more recall when you need it.

The knobs matter because workloads are not stable. Today’s traffic might be mostly short queries and tomorrow’s might be long questions that require broader recall. The best index choice is the one whose knobs map cleanly to your production constraints.
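To make "reducing the search space" concrete, here is the baseline an ANN index exists to beat: exact brute-force nearest neighbors, which scans every vector for every query. This is a minimal stdlib sketch; the dataset, dimensionality, and seed are illustrative.

```python
# Brute-force k-nearest-neighbor baseline: O(N * d) work per query,
# which is exactly the cost an ANN index avoids paying.
import heapq
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_knn(vectors, query, k):
    # Score every vector, then keep the k smallest (distance, id) pairs.
    scored = ((euclidean(v, query), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)

random.seed(0)
data = [[random.random() for _ in range(8)] for _ in range(1000)]
top3 = brute_force_knn(data, [0.5] * 8, 3)
```

Every index discussed below is a way to approximate this result while touching far fewer vectors.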

Index choice also interacts tightly with how you design hybrid retrieval and metadata filtering. A system that must filter aggressively on tenant, permissions, or content type needs an index strategy that respects structured constraints without turning every query into a slow path. This is why the architectural view in Index Design: Vector, Hybrid, Keyword, Metadata is a prerequisite for index tuning that actually sticks.

HNSW: graphs as a shortcut through space

Hierarchical Navigable Small World (HNSW) indexes build a graph over vectors. Search becomes a walk: start somewhere, move to neighbors that look closer, repeat. The “hierarchical” part adds layers that allow coarse navigation first and fine navigation later.
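The walk itself can be sketched in a few lines. This is a toy single-layer greedy search over a hand-built neighbor graph, not real HNSW: a real implementation builds the graph incrementally, keeps a candidate beam rather than a single node, and adds the hierarchical layers on top.

```python
# Greedy graph walk: hop to any neighbor closer to the query, stop at a
# local minimum. This is the base-layer intuition behind HNSW search.
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(vectors, neighbors, entry, query):
    current = entry
    best = dist(vectors[current], query)
    improved = True
    while improved:
        improved = False
        for n in neighbors[current]:
            d = dist(vectors[n], query)
            if d < best:
                current, best, improved = n, d, True
    return current, best

# Toy graph: five points on a line, chained neighbors.
vectors = [[0.0], [1.0], [2.0], [3.0], [4.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
node, d = greedy_search(vectors, neighbors, 0, [3.9])  # walks 0 -> 1 -> 2 -> 3 -> 4
```

The hierarchy exists because a pure greedy walk can get stuck in a local minimum; coarse upper layers give the walk a good entry point before the fine-grained bottom layer takes over.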

HNSW tends to feel good in practice because:

  • it offers strong recall at practical latencies
  • it supports incremental insertions reasonably well
  • it has intuitive knobs for build quality and search breadth

The cost is memory. Graphs have overhead. If you are operating in a memory-constrained environment, HNSW can become the wrong tool even if it “wins” in a recall benchmark.

A systems-level way to think about HNSW is that it buys latency by buying structure. You spend memory to avoid scanning.

Operational knobs that matter

HNSW tuning often comes down to two questions:

  • How much structure do you want to build?
  • How broad do you want to search at query time?

Build-time parameters control how connected the graph becomes. Query-time parameters control how much of that graph you actually explore. The best practice is to treat these as a budgeted policy, not as a one-time config. If your traffic spikes, you need a safe degradation path that preserves correctness even if recall drops.
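One way to express that budgeted policy is to derive the query-time search breadth (often exposed as something like `ef_search`) from current load, so pressure degrades recall gracefully instead of blowing latency targets. The function name and thresholds below are illustrative, not from any particular library.

```python
# Sketch of a budgeted search-breadth policy: under light load, spend
# compute on recall; under heavy load, shrink the candidate beam but
# never drop below a floor that keeps results correct, just less complete.
def choose_search_breadth(queue_depth, ef_max=256, ef_min=32):
    if queue_depth < 10:
        return ef_max
    if queue_depth < 100:
        return max(ef_min, ef_max // 4)
    return ef_min
```

The point is that the knob becomes a function of observed state, reviewed like any other capacity policy, rather than a constant buried in a config file.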

That degradation path is not only an index question. It is also a queuing and concurrency question. When many requests arrive at once, the index can be “fast” in isolation but still deliver slow outcomes because of contention. That is why the production framing in Scheduling, Queuing, and Concurrency Control belongs in the same mental model as “ANN search.”

IVF: clusters first, then search inside

Inverted file (IVF) approaches start by clustering vectors. At query time, you find the closest clusters and only search inside those partitions. IVF can be powerful because it forces structure onto the search space and turns one big problem into smaller ones.
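The mechanics fit in a short sketch. Real systems learn the centroids with k-means over a training sample; here they are given by hand so the partition-then-probe structure is visible. The parameter usually called `nprobe` controls how many partitions a query scans.

```python
# Toy IVF: bucket vectors by nearest centroid, then search only the
# nprobe closest buckets at query time.
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        c = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
        lists[c].append(vid)
    return lists

def ivf_search(vectors, centroids, lists, query, k, nprobe):
    # Rank centroids by distance to the query, scan only the top nprobe.
    order = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    candidates = [vid for c in order[:nprobe] for vid in lists[c]]
    return sorted(candidates, key=lambda vid: dist(vectors[vid], query))[:k]

centroids = [[0.0], [10.0]]
vectors = [[1.0], [2.0], [9.0], [11.0]]
lists = build_ivf(vectors, centroids)
hits = ivf_search(vectors, centroids, lists, [10.0], 2, 1)
```

Raising `nprobe` recovers recall lost at cluster boundaries, at the direct cost of scanning more partitions per query.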

IVF shines when:

  • you can build the index offline and rebuild periodically
  • the dataset is large enough that partitioning yields real wins
  • you can tolerate a bit of complexity in managing centroids and partitions

IVF also pairs naturally with compression because the partitions allow localized representations.

The main risk is cluster mismatch: if the query lands near the boundary between clusters, or if the clustering is not aligned with the query distribution, you can miss relevant points unless you search many clusters. That is the IVF version of the latency-recall frontier.

Product quantization: compression as an indexing tool

Product quantization (PQ) and related quantization techniques compress vectors so that similarity can be approximated with much cheaper math and much less memory. This is where “vector database” becomes a hardware story: memory bandwidth and cache behavior start to matter more than floating point throughput.
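The core trick is easy to show with hand-specified codebooks (real PQ learns each sub-codebook with k-means): split a d-dimensional vector into m subvectors, replace each with the index of its nearest code word, and approximate distances by comparing the raw query against code words per subspace.

```python
# Product-quantization sketch: storage drops from d floats per vector to
# m small integers, and distance becomes a sum of per-subspace lookups.
import math

def sub_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pq_encode(vector, codebooks):
    m = len(codebooks)
    sub_len = len(vector) // m
    codes = []
    for j in range(m):
        sub = vector[j * sub_len:(j + 1) * sub_len]
        codes.append(min(range(len(codebooks[j])),
                         key=lambda c: sub_dist(sub, codebooks[j][c])))
    return codes

def pq_adc(query, codes, codebooks):
    # Asymmetric distance computation: raw query vs. quantized database
    # vector, summing squared distances subspace by subspace.
    m = len(codebooks)
    sub_len = len(query) // m
    total = 0.0
    for j, c in enumerate(codes):
        sub = query[j * sub_len:(j + 1) * sub_len]
        total += sub_dist(sub, codebooks[j][c])
    return math.sqrt(total)

# Two subspaces, two code words each (illustrative values).
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]]
codes = pq_encode([0.9, 1.1, 0.1, 0.0], codebooks)
```

The distance returned is an approximation; everything the code words cannot express is error, which is exactly why aggressive compression can lose the truly best neighbors.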

Compression helps when:

  • memory is the limiting factor
  • you need to fit more of the working set into RAM
  • you need to reduce IO pressure on the hottest paths

The risk is that compression can erase subtle differences in embedding space, especially when the domain is dense and semantically fine-grained. If you compress too aggressively, you may keep “similar” items but lose the truly best items. The right way to handle this is to treat compression as a stage, not as the final word:

  • compressed retrieval for broad recall
  • higher-fidelity reranking for final ordering

That is why embedding strategy and index strategy are inseparable. If your embeddings are noisy or poorly matched to the domain, compression will amplify that weakness. See Embedding Selection and Retrieval Quality Tradeoffs for how embedding choices shape retrieval outcomes.

The latency-recall frontier in practice

The phrase “latency-recall frontier” is useful because it forces honesty. You are not searching for the “best index.” You are searching for the best point on a frontier given your constraints.

A practical way to evaluate an index is to produce curves and compare them:

  • Recall at various cutoffs
  • p50, p95, p99 latency under realistic concurrency
  • Memory footprint at target scale
  • Build time and rebuild cost
  • Update behavior (insertions, deletions, compactions)
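The recall side of those curves is a straightforward comparison against brute-force ground truth. A minimal version of the metric:

```python
# recall@k: what fraction of the true top-k neighbors (from exact
# brute-force search) did the approximate index actually return?
def recall_at_k(approx_ids, exact_ids, k):
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

score = recall_at_k([3, 7, 9, 1], [3, 9, 2, 7], 4)  # 3 of 4 true neighbors found
```

Sweep the index's query-time knobs, record recall@k against measured p95/p99 latency at each setting, and the frontier falls out as a curve you can compare across index types.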

The evaluation has to match your truth. If your system uses metadata filters heavily, a benchmark without filters is misleading. If your system uses hybrid scoring, a benchmark that only measures dense retrieval misses the operational reality. Use Retrieval Evaluation: Recall, Precision, Faithfulness as a measurement anchor so you do not confuse “nearest neighbor accuracy” with “retrieval quality that supports answers.”

Filtering, sharding, and multi-tenancy

In production, filters are not a feature. They are the boundary between a working system and a liability.

When you apply filters, you can end up with two different query worlds:

  • The “unfiltered” world where the ANN index is efficient.
  • The “filtered” world where the index degenerates because only a small subset is eligible.

There are several strategies to avoid degeneration:

  • Partition by tenant or major filter dimension so filters become routing rather than post-filtering.
  • Build separate indexes for different content types when the distributions differ sharply.
  • Use hybrid index designs where lexical or metadata-first retrieval narrows the candidate set before vector similarity.
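The difference between routing and post-filtering can be sketched directly. Here `search_fn` stands in for any ANN probe; the brute-force 1-D version below and all names are illustrative.

```python
# Filter-as-routing vs. post-filtering. Routing picks a per-tenant index
# up front, so every candidate is already eligible; post-filtering
# overfetches from a shared index and discards ineligible hits.
def routed_search(partitions, tenant, query, k, search_fn):
    return search_fn(partitions[tenant], query, k)

def post_filtered_search(index, tenant, tenant_of, query, k, search_fn, overfetch=4):
    hits = search_fn(index, query, k * overfetch)
    return [h for h in hits if tenant_of[h] == tenant][:k]

def brute(index, query, k):
    # Stand-in ANN probe: exact nearest neighbors over (id, value) pairs.
    return [i for i, v in sorted(index, key=lambda p: abs(p[1] - query))][:k]

parts = {"a": [(1, 0.1), (2, 0.5)], "b": [(3, 0.2)]}
routed = routed_search(parts, "a", 0.0, 1, brute)
```

The post-filtered path degenerates exactly when the filter is selective: if only 1% of the corpus is eligible, an overfetch factor of 4 is nowhere near enough, and the query becomes a slow path.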

Each of these strategies changes operational cost. Partitioning multiplies index management. Multiple indexes multiply build pipelines. Hybrid retrieval multiplies tooling. This is where cost becomes part of design, not a later concern. Operational Costs of Data Pipelines and Indexing frames the economics and operational costs you inherit when you choose an index strategy.

Choosing an index by workload shape

A useful way to decide is to classify the workload rather than the technology:

| Workload trait | What it pushes you toward | Why |
| --- | --- | --- |
| Frequent incremental updates | HNSW-like approaches | graph supports insertions more naturally |
| Huge static corpus | IVF + compression | rebuild offline, search partitions |
| Tight memory budget | PQ-heavy designs | reduce working set, reduce bandwidth |
| Heavy structured filtering | partitioning + hybrid routing | avoid filtered slow paths |
| Very low latency SLO | careful tuning + caching + concurrency control | tail latency becomes the enemy |

No row in this table is a guarantee. It is a reminder that index choice is a system choice.

Index maintenance as a reliability problem

Indexes are not static artifacts. They age as the corpus changes, as embedding distributions drift, and as metadata policies tighten. A production plan should include:

  • **Rebuild triggers** tied to measurable drift or distribution change, not only calendar time.
  • **Backfill strategies** so new documents become searchable without waiting for a full rebuild.
  • **Delete semantics** that match your retention rules, including tombstones and compaction policies.
  • **Snapshot and restore** procedures that can recover from corruption, bad deployments, or infrastructure failure.

These practices are easiest when the index format and its build pipeline are treated as first-class infrastructure. When they are treated as an internal detail, reliability incidents arrive as surprises.

A good rule is to assume that your index will fail at the worst time and to ensure the system has a graceful fallback: a smaller safety index, a lexical-only mode, or a cached result path that keeps critical queries alive long enough to repair the main service.
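That fallback chain is simple to encode as an ordered list of handlers, tried in degradation order. The handler names below are hypothetical; in practice each would wrap a real client call.

```python
# Graceful-fallback sketch: try the main ANN index, then each backup path
# in order, so critical queries keep returning something during an outage.
def search_with_fallback(query, handlers):
    # handlers: ordered (name, fn) pairs; the first one that succeeds wins.
    for name, fn in handlers:
        try:
            return name, fn(query)
        except Exception:
            continue
    return "none", []

def main_index(query):
    raise RuntimeError("index unavailable")  # simulated outage

source, results = search_with_fallback(
    "q", [("main", main_index), ("lexical", lambda q: ["doc1"])]
)
```

Returning the handler name alongside the results matters operationally: it lets you alert on how often traffic is being served by a degraded path.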
