When Your Vector Search Feels Slow — What’s Really Going On
Index choices, hardware trade-offs, and embedding effects that determine real-world latency
Vector databases get blamed for slow similarity search all the time, but the database is just one piece of the puzzle. This post walks through the practical levers you can tune to make vector search fast and predictable: index structure, hardware, distance metrics, and model choices. Expect clear rules of thumb, runnable snippets, and an actionable checklist. It reflects my own interpretation of cases I've seen other engineers run into.
Why “fast database” is an incomplete answer
Picking Redis vs. Milvus vs. Qdrant rarely explains why your production queries are slow. Latency emerges from the whole stack: indexing algorithm, memory layout, how you use hardware (CPU/GPU/SSD), distance metric, and the properties of your embeddings. Optimize just one layer and you’ll still be bottlenecked by another.
Index structures: what they give you (and what they cost)
HNSW — multi-layer graph navigation (fast, memory-hungry)
HNSW builds multiple graph layers with long-range links on top and dense local links below. Searches start high and greedily descend, which gives logarithmic-ish hops and excellent latency for small-to-medium datasets. Tradeoff: memory overhead (links per node × layers) can be tens of bytes per vector — it’s a memory-for-speed design.
Memory formula (conceptual):
total_bytes ≈ num_vectors × (d × 4 + avg_layers × M × 2 × sizeof(int))
The first term is the raw float32 vectors; the second is the graph links, roughly 2 × M neighbor IDs per layer. With typical defaults (M ≈ 16, 4-byte IDs), that is on the order of 100–150 bytes of link overhead per vector, which adds up quickly for very large corpora.
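As a back-of-the-envelope check, here is the same estimate in code (a rough sketch; exact per-node bookkeeping varies by implementation, and avg_layers is usually only slightly above 1):
def hnsw_memory_estimate(num_vectors, d, M=16, avg_layers=1.1, bytes_per_id=4):
    """Rough HNSW footprint: raw float32 vectors plus ~2*M neighbor IDs per layer."""
    raw_vectors = num_vectors * d * 4                         # float32 embeddings
    graph_links = num_vectors * avg_layers * M * 2 * bytes_per_id
    return raw_vectors + graph_links

# e.g. 1M x 1536-d vectors with M=16: ~6.1 GB of vectors plus ~0.14 GB of links
print(f"{hnsw_memory_estimate(1_000_000, 1536) / 1e9:.2f} GB")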
When to use: < 1M vectors and you need single-digit-ms latencies.
IVF — cluster-first probing (balanced)
IVF clusters vectors (k-means centroids), assigns vectors to buckets, then searches only a subset (nprobe) of clusters. This reduces examined items dramatically compared to brute force, but tuning nlist/nprobe matters and depends on your data distribution (uniform vs. clustered). A typical starting heuristic is sqrt(N) clusters for uniform data; clustered/text embeddings often benefit from a higher nlist.
import numpy as np

num_vectors = 1_000_000  # size of your corpus

# For uniformly distributed data
nlist_uniform = int(np.sqrt(num_vectors))        # ≈ 1,000 clusters

# For clustered data (common with text embeddings)
nlist_clustered = int(4 * np.sqrt(num_vectors))  # ≈ 4,000 clusters
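To make the probing step concrete, here is a toy, pure-NumPy version of an IVF query (illustrative only, not a library API; it assumes centroids and assignments were produced by k-means beforehand):
def ivf_search(query, vectors, centroids, assignments, nprobe=8, k=10):
    # 1. Pick the nprobe clusters whose centroids are closest to the query
    centroid_dists = np.linalg.norm(centroids - query, axis=1)
    probe_ids = np.argsort(centroid_dists)[:nprobe]
    # 2. Brute-force only the vectors assigned to those clusters
    candidate_ids = np.where(np.isin(assignments, probe_ids))[0]
    dists = np.linalg.norm(vectors[candidate_ids] - query, axis=1)
    # 3. Return the k nearest candidates as global indices
    order = np.argsort(dists)[:k]
    return candidate_ids[order], dists[order]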
Product Quantization (PQ) — compression for scale
PQ compresses vectors by splitting them into subspaces and mapping each subvector to a 1-byte centroid ID. Compression ratios can be extreme (e.g., 1536-d float32 → ~96 bytes) at the cost of some accuracy. PQ enables very large datasets on constrained RAM but increases complexity in tuning and search-quality tradeoffs.
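As a minimal sketch of what IVF+PQ looks like in practice, using FAISS (parameters are illustrative, not tuned; `vectors` and `queries` are assumed to be float32 NumPy matrices with d = 1536):
import faiss

d, nlist, m = 1536, 4096, 96                          # 96 sub-quantizers x 1 byte ≈ 96 bytes/vector vs 6144 raw
quantizer = faiss.IndexFlatL2(d)                      # coarse quantizer for the IVF layer
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per sub-quantizer code
index.train(vectors)                                  # needs a representative training sample
index.add(vectors)
index.nprobe = 16                                     # clusters probed at query time
distances, ids = index.search(queries, 10)            # top-10 approximate neighbors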
Decision framework (practical rules)
< 1M vectors, < 5ms requirement → HNSW in memory.
1M–100M vectors, 10–20ms OK → IVF with tuned nlist/nprobe.
100M+ with low RAM → IVF + PQ (or distributed PQ).
1B+ → distributed, specialized systems and sharding strategies.
Hardware: CPU vs GPU vs Disk — the real tradeoffs
The GPU paradox
GPUs shine for big batches. For single or low-concurrency queries the CPU (with vectorized instructions) can be faster because GPU overhead includes transfer time to/from device memory. Benchmarks show a typical breakeven at tens to low hundreds of concurrent queries — below that, a well-optimized CPU path often wins. Test with your workload before moving to GPU.
Example numbers (illustrative):
Single-query 1M dataset (1536d): CPU ≈ 2–3ms, GPU (including transfer) ≈ 5ms.
Batch performance flips in favor of GPU as concurrency grows.
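To find your own breakeven point, time both paths with realistic batch sizes. A rough sketch, assuming PyTorch with a CUDA device and brute-force dot-product search (warm up and average several runs for real numbers):
import time
import torch

def brute_force_topk(corpus_dev, queries, device, k=10):
    # Times one search including query transfer and copying results back to the host
    start = time.perf_counter()
    q = queries.to(device)
    scores = q @ corpus_dev.T
    _, ids = torch.topk(scores, k=k, dim=1)
    ids = ids.cpu()                            # forces GPU synchronization
    return time.perf_counter() - start

corpus = torch.randn(100_000, 768)             # illustrative corpus; use your own vectors
devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
for device in devices:
    corpus_dev = corpus.to(device)             # the index lives on the device, paid once
    for batch in (1, 16, 128, 1024):
        queries = torch.randn(batch, 768)
        t = brute_force_topk(corpus_dev, queries, device)
        print(f"{device:4s} batch={batch:5d}: {t * 1000:.1f} ms")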
Memory vs NVMe — don’t underestimate the gap
Random-access latency differences matter a lot for graph-based indexes: RAM (~100 ns) vs NVMe SSD (~100 µs) is ~1,000×. Properly tuned NVMe systems with prefetching and large pages can still reach acceptable latencies (20–30 ms for well-optimized million-scale searches) but random IO kills latency if your hot structures don’t fit in memory.
Tips:
Keep hot layers (e.g., HNSW upper layers) in RAM.
Use large OS page sizes and memory-mapped files with madvise (see the sketch after this list).
Prefetch aggressively for predictable access patterns.
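A Linux-only sketch of the memory-mapping tip (Python 3.8+; `index.bin` is a hypothetical file holding your on-disk index):
import mmap

with open("index.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)   # map the whole file read-only
    # Hint to the kernel that the first 64 MB (e.g. upper graph layers, centroids)
    # will be needed soon, so it can prefetch those pages
    hot_bytes = min(len(mm), 64 * 1024 * 1024)
    mm.madvise(mmap.MADV_WILLNEED, 0, hot_bytes)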
CPU vectorization — cheap win
SIMD (AVX2/AVX-512) speeds up distance ops by factors of 8–16× on modern x86. Use or validate libraries that expose SIMD kernels (or compile with the right flags). Example: scalar vs AVX-512 dot-product snippets below illustrate the pattern.
#include <immintrin.h>

// Baseline: scalar dot product, one multiply-add per element.
float dot_product_scalar(const float* a, const float* b, int d) {
    float sum = 0.0f;
    for (int i = 0; i < d; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

// AVX-512 version: processes 16 floats per iteration with fused multiply-add.
// Assumes d is a multiple of 16; handle the remainder separately otherwise.
float dot_product_avx512(const float* a, const float* b, int d) {
    __m512 sum = _mm512_setzero_ps();
    for (int i = 0; i < d; i += 16) {
        __m512 va = _mm512_loadu_ps(&a[i]);
        __m512 vb = _mm512_loadu_ps(&b[i]);
        sum = _mm512_fmadd_ps(va, vb, sum);
    }
    return _mm512_reduce_add_ps(sum);
}
Distance metrics & dimensionality — the math that bites
Cost per query scales with d, the vector dimension. Dot product, Euclidean, and cosine are all O(d); the differences are constant factors (an extra subtraction per element for Euclidean, extra norm computations for cosine). If you use cosine, pre-normalize vectors at insert time so that each comparison reduces to a plain dot product; this cuts the per-comparison work substantially.
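A small NumPy sketch of that insert-time normalization (assuming `corpus` is an (n, d) float array and `query` a (d,) vector):
import numpy as np

def normalize_rows(vectors, eps=1e-12):
    # L2-normalize once at insert time; cosine similarity then reduces to a dot product
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

corpus_unit = normalize_rows(corpus)              # done once, when indexing
query_unit = normalize_rows(query[None, :])[0]    # done once per query
cosine_scores = corpus_unit @ query_unit          # no per-comparison norm computations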
Curse of dimensionality: as d increases, distances concentrate (nearest/furthest ratios approach 1). High-dimensional spaces reduce index effectiveness and can hurt both quality and speed. Consider dimensionality reduction when intrinsic dimensionality is much lower than raw features (PCA, random projection, or lightweight autoencoders).
from sklearn.decomposition import PCA

# vectors: (n_samples, d) float array of embeddings
# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
pca.fit(vectors)
print(f"Intrinsic dimensionality: {pca.n_components_}")
Embeddings: the overlooked performance lever
Embedding choice sets d, distribution (sparsity/entropy), and numeric stability. For example, some hosted models emit 1536-d vectors while some open-source models use 384 or 768 — that’s a 2–4× cost and memory multiplier. Also analyze vector norms and sparsity; some models produce sparser vectors that can be exploited for specialized indexes. Test quantization/precision reduction: many models tolerate lower precision without practical quality loss.
import numpy as np

def analyze_vectors(vectors):
    # Spread of vector norms: a wide spread flags the need to pre-normalize for cosine
    norms = np.linalg.norm(vectors, axis=1)
    print(f"Avg norm: {np.mean(norms):.2f} ± {np.std(norms):.2f}")
    # Fraction of near-zero components; high sparsity can favor specialized indexes
    sparsity = np.sum(np.abs(vectors) < 0.01) / vectors.size
    print(f"Sparsity: {sparsity:.2%}")
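And a quick, illustrative way to test the precision-reduction claim on your own embeddings: compare top-k neighbors at float32 vs float16 (the helper name and threshold are mine, not from any library):
def topk_overlap_fp16(vectors, n_queries=100, k=10, seed=0):
    # Fraction of top-k neighbors preserved when embeddings are stored as float16
    v32 = vectors.astype(np.float32)
    v16 = vectors.astype(np.float16).astype(np.float32)   # simulate fp16 storage
    rng = np.random.default_rng(seed)
    query_ids = rng.choice(len(vectors), size=n_queries, replace=False)
    overlaps = []
    for qi in query_ids:
        top32 = set(np.argsort(v32 @ v32[qi])[-k:])
        top16 = set(np.argsort(v16 @ v16[qi])[-k:])
        overlaps.append(len(top32 & top16) / k)
    return float(np.mean(overlaps))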
Quick practical checklist (run this today)
Profile first: measure single-query and batch latencies, CPU/GPU times, and p50/p95/p99.
Decide index family: HNSW for low-latency small corpora; IVF(+PQ) for large-scale constrained-RAM.
Tune cluster parameters: experiment with nlist and nprobe based on data distribution.
Keep hot structures in RAM: upper HNSW layers, centroids, frequently-accessed codebooks.
Measure GPU breakeven: test small-vs-batch queries to see transfer overhead impact.
Pre-normalize if you use cosine; consider reducing dimensionality if intrinsic dim ≪ raw dim.
Enable SIMD / optimized kernels or use libraries that do.
🔍 TL;DR Summary
Vector DB speed is a systems problem: index algorithm + hardware + embeddings + metrics all matter.
HNSW = fastest latency but higher RAM overhead; IVF/PQ = better for huge datasets.
GPU helps for large batches; CPU + SIMD often wins for single or low-concurrency queries.
Pre-normalize for cosine similarity and consider dimensionality reduction to fight the curse of dimensionality.
Measure before you switch: benchmark with your embeddings and real query patterns — the breakeven points depend on your workload.