Vajra BM25 started as an experiment in using category theory to organize a search engine's code. The previous post covered the BM25 lexical search side of that story. Vajra also ships a native HNSW vector index, written in pure Python and NumPy, that I've been comparing against ZVec, Alibaba's high-performance vector database backed by the Proxima C++ engine.
The initial comparison produced numbers worth understanding. After six targeted fixes, the picture changed materially.
## ZVec: What C++ Disk-Backed HNSW Looks Like
ZVec (v0.2.0) uses Proxima, Alibaba's production vector search engine, as its C++ backend. Its architecture is worth understanding because it explains why the performance is so different from a Python implementation.
ZVec stores the vector index on disk using memory-mapped files. zvec.open() is essentially an mmap syscall: it maps the on-disk file into the process's virtual address space and returns in milliseconds. No data is loaded into RAM at open time. The OS page cache handles the rest: when a query accesses a region of the index for the first time, the kernel loads the corresponding disk page into RAM, making it available at memory speed for all subsequent accesses.
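The open-fast, fault-lazy behaviour is visible even from Python's standard library. A minimal sketch of the mmap pattern (the file name and layout here are made up for illustration; this is not ZVec's on-disk format):

```python
import mmap
import os
import struct
import tempfile

# Hypothetical toy "index file": 1,024 float32 values.
path = os.path.join(tempfile.mkdtemp(), "index.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1024f", *([0.5] * 1024)))

# Mapping returns immediately; no vector data is read into RAM yet.
f = open(path, "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# The first access to a page triggers a fault and a disk read; subsequent
# reads of the same page are served from the OS page cache.
first_value = struct.unpack_from("<f", mm, 0)[0]
```

The same principle applies at scale: mapping a multi-gigabyte index costs essentially nothing until queries start touching pages.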
The insert and HNSW construction pipeline is pure C++. Distance computations use SIMD (Single Instruction, Multiple Data) intrinsics: CPU instructions that apply the same operation to multiple values simultaneously. On modern x86-64 hardware with AVX2, a single SIMD instruction computes eight float32 dot-product contributions in parallel. Proxima's distance kernels (the low-level functions that compute cosine, L2, or inner product distances between vectors) are hand-tuned to issue these instructions in tight loops with no Python overhead.
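The magnitude of that difference can be felt even from NumPy, which dispatches dot products to SIMD-optimised BLAS kernels. A rough analogy in Python terms (not Proxima's actual kernels):

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.random(384, dtype=np.float32)
b = rng.random(384, dtype=np.float32)

# BLAS-backed dot product: one call, SIMD instructions under the hood.
fast = float(a @ b)

# The same reduction as an interpreted Python loop: hundreds of bytecode
# dispatches for work a handful of vector instructions can do.
slow = sum(float(x) * float(y) for x, y in zip(a, b))

assert abs(fast - slow) < 1e-2 * abs(slow)
```

Both compute the same number; the loop version is typically one to two orders of magnitude slower, which compounds across millions of distance calls.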
The warmup step in the benchmark discards the first 10 query results. When the index is freshly opened, the data pages have not yet been loaded into the OS page cache. The first few queries trigger page faults and disk reads, adding latency that does not represent steady-state performance. After 10 queries, the frequently accessed pages (entry point, upper-layer graph nodes, and the hot region of layer-0 neighbours) are resident in RAM. The post-warmup latency is what ZVec delivers to a running service.
On 10,000 Wikipedia documents (all-MiniLM-L6-v2, 384d, cosine, M=16, ef_construction=200):
| Metric | ZVec |
|---|---|
| Build time | 0.47 s |
| p50 query | 0.31 ms |
| p99 query | 0.53 ms |
| QPS | 3,058 |
| Recall@10 | 1.000 |
The build time includes all inserts, HNSW graph construction, and on-disk persistence. The query numbers are post-warmup, with the relevant pages in cache.
## Vajra's Vector Search
Vajra's NativeHNSWIndex implements HNSW in pure Python and NumPy. Its internal architecture uses a coalgebraic abstraction for search: HNSWNavigationCoalgebra unfolds a beam search from an HNSWSearchState, which is a frozen dataclass containing the candidates heap, results heap, and visited set. This keeps the search logic clean, composable, and testable in isolation.
The index stores vectors in a NumPy float32 array, supports cosine, L2, and inner product metrics, and serializes to disk with pickle. It integrates directly with Vajra's VajraVectorSearch and HybridSearchEngine (BM25 + vector fusion) classes.
The initial benchmark results against ZVec revealed four problems.
## The Baseline: Where Things Stood
Running the same 10,000-document benchmark:
| Metric | ZVec | Vajra (v0.4.1) | Gap |
|---|---|---|---|
| Build time | 0.47 s | 97.77 s | 207× |
| p50 query | 0.31 ms | 2.63 ms | 8.5× |
| p99 query | 0.53 ms | 4.30 ms | 8.1× |
| QPS | 3,058 | 374 | 8.2× |
| Recall@10 | 1.000 | 0.987 | −1.3% |
The 207× build gap was informative. A C++ HNSW implementation is not 200× more clever algorithmically: both run the same graph construction. A gap of that size points to something wrong in the Python implementation, not a fundamental language barrier. The 8× query gap was larger than a Python-vs-C++ interpreter comparison alone would predict, which pointed to the same conclusion. Profiling confirmed it: most of the gap came from specific implementation defects, not the algorithm.
## Root Causes: Four Problems
### Problem 1: O(N²) Array Growth During Build
The most damaging bug. _insert_vector() was called once per node during add(). Each call did this:
```python
# Called N times. Every call copies ALL existing vectors.
self.graph.vectors = np.vstack([self.graph.vectors, vector])
```
For N=10,000 with 384-dimensional float32 vectors (1,536 bytes per row), the total bytes copied are:

```python
sum(i * 1536 for i in range(10000))  # ≈ 77 GB
```

At 10 GB/s of effective NumPy copy throughput, 77 GB of copies costs roughly 8 seconds of memory bandwidth alone, and the interpreter and allocator overhead of 10,000 incremental vstack calls sits on top of that.
The fix: add() already receives all N vectors as a single array. Pre-allocate the full block once.
```python
n_new = len(ids)
existing = len(self.graph.ids)
if self.graph.vectors is None:
    self.graph.vectors = np.empty((n_new, self._dimension), dtype=np.float32)
else:
    self.graph.vectors = np.concatenate(
        [self.graph.vectors, np.empty((n_new, self._dimension), dtype=np.float32)]
    )
self.graph.vectors[existing : existing + n_new] = vectors
```
The insertion loop then calls _insert_node(existing + i, ids[i]), which reads from the pre-allocated array rather than extending it. N incremental vstack calls become one allocation and one bulk fill.
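The difference between the two growth strategies is easy to reproduce in isolation. A small sketch (sizes scaled down from the benchmark; function names are illustrative, not Vajra's):

```python
import numpy as np

def grow_vstack(rows):
    # O(N^2): every append copies all existing rows.
    buf = np.empty((0, rows.shape[1]), dtype=np.float32)
    for r in rows:
        buf = np.vstack([buf, r[None, :]])
    return buf

def grow_prealloc(rows):
    # O(N): one allocation, one bulk copy.
    buf = np.empty(rows.shape, dtype=np.float32)
    buf[:] = rows
    return buf

rows = np.random.rand(500, 384).astype(np.float32)
a = grow_vstack(rows)
b = grow_prealloc(rows)
assert np.array_equal(a, b)
```

Timing either function with time.perf_counter shows the quadratic version falling behind by orders of magnitude as the row count grows, while the outputs stay identical.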
### Problem 2: Redundant Norm Computation on Pre-Normalised Vectors
After fixing the vstack issue, build time dropped from 97 seconds to around 83 seconds. That improvement was smaller than expected, so profiling continued.
Running cProfile on a 2,000-document build and sorting by tottime (time spent inside each function, excluding callees) produced this:
```
  ncalls     tottime   filename:function
    2132      1.459s   _search_layer
28918263     57.3s     np.linalg.norm          ← 28 million calls
  313447      2.198s   score_batch_normalized
```
Twenty-eight million calls to np.linalg.norm during a 2,000-document build. Scaled to 10,000 documents, that accounts for almost the entire remaining build time.
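The profile above came from cProfile sorted by tottime. The same workflow applies to any callable; a minimal reproduction (build_index here is a trivial stand-in, not Vajra's actual build):

```python
import cProfile
import io
import pstats

def build_index():
    # Stand-in workload; profile the real build function the same way.
    total = 0.0
    for i in range(1000):
        total += i ** 0.5
    return total

pr = cProfile.Profile()
pr.enable()
build_index()
pr.disable()

# Sort by tottime (time inside each function, excluding callees) and
# print the top offenders.
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("tottime").print_stats(10)
report = s.getvalue()
```

Sorting by tottime rather than cumtime is what surfaces hot leaf functions like np.linalg.norm, which would otherwise hide inside their callers' cumulative totals.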
Tracing where these calls came from led to score_batch, the cosine distance function called inside the beam search:
```python
# Called ~14 million times during build at N=10k.
# Recomputes np.linalg.norm on vectors that are already L2-normalised.
def score_batch(self, query, vectors):
    query_norm = query / (np.linalg.norm(query) + self.eps)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    vectors_norm = vectors / (norms + self.eps)
    return (vectors_norm @ query_norm).astype(np.float32)
```
The general case for cosine similarity requires normalising both the query and the stored vectors before taking the dot product. score_batch does this correctly for the general case. The problem: add() already normalises all incoming vectors to unit length before inserting them into the graph. Every vector in self.graph.vectors is already L2-normalised. The np.linalg.norm(vectors, axis=1) line in score_batch was recomputing norms on data that was provably unit-length, throwing away the work done at insertion time on every distance call during construction.
The fix: when the metric is cosine, store the distance function as a specialised lambda that skips normalisation entirely. For two unit vectors, cosine similarity reduces to a plain dot product:
```
cos(a, b) = (a · b) / (||a|| × ||b||) = a · b    (when ||a|| = ||b|| = 1)
```
```python
# In __init__, for metric="cosine":
_scorer = CosineSimilarity()
self._distance_single = lambda a, b: float(1.0 - np.dot(a, b))
self._distance_batch = lambda q, vs: 1.0 - _scorer.score_batch_normalized(q, vs)
```
Where score_batch_normalized is a single line: return (vectors @ query).astype(np.float32, copy=False).
The copy=False matters: without it, each call creates a redundant array copy even when the result is already float32. At 1.5 million calls during a 10k build, those copies add up.
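A quick check that the shortcut is exact on unit vectors (synthetic data; this is a standalone verification, not Vajra's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal(384).astype(np.float32)
vs = rng.standard_normal((8, 384)).astype(np.float32)

# Normalise once, as add() does before insertion.
q /= np.linalg.norm(q)
vs /= np.linalg.norm(vs, axis=1, keepdims=True)

# General-case cosine: normalise both sides again, then dot.
full = (vs / np.linalg.norm(vs, axis=1, keepdims=True)) @ (q / np.linalg.norm(q))
# Specialised path: plain dot product on already-unit vectors.
fast = (vs @ q).astype(np.float32, copy=False)

assert np.allclose(full, fast, atol=1e-5)
```

The two paths agree to float32 precision; the only difference is that the specialised path skips the redundant norms entirely.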
### Problem 3: Per-Neighbour Python Dispatch in the Beam Search
The beam search in _search_layer originally computed one distance call per neighbour:
```python
for neighbor in self.graph.get_neighbors(current, level):
    if neighbor not in visited:
        dist = self._distance_single(query, self.graph.vectors[neighbor])
        # ... heap update
```
At ef_construction=200 with M=16 neighbours per node, each node insertion triggers roughly 200 × 16 = 3,200 individual Python function calls for distance computation. At N=10,000, that is 32 million individual calls.
The fix: collect all unvisited neighbours first, then compute distances in one batch call:
```python
unvisited = [n for n in layer_dict.get(current, []) if n not in visited]
if not unvisited:
    continue
visited.update(unvisited)
neighbor_dists = dist_fn(query, vectors[unvisited])
```
One NumPy matrix-vector multiply replacing 16 individual calls. BLAS vectorises the dot products; Python call overhead drops by 16×.
Note the layer_dict here: caching self.graph.layers[level] as a local dict avoids the get_neighbors() function call overhead on every beam step (~2 million calls at N=10k). Same pattern for vectors = self.graph.vectors and dist_fn = self._distance_batch.
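The batched call computes exactly the same distances as the per-neighbour loop. A self-contained check (synthetic unit vectors, using the cosine-as-dot shortcut described above):

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.standard_normal((1000, 384)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = vectors[0]

unvisited = [3, 17, 42, 256, 999]

# One call per neighbour: five Python dispatches.
per_call = np.array(
    [1.0 - float(np.dot(query, vectors[n])) for n in unvisited],
    dtype=np.float32,
)
# One batched call: a single BLAS matrix-vector multiply via fancy indexing.
batched = 1.0 - vectors[unvisited] @ query

assert np.allclose(per_call, batched, atol=1e-6)
```

With M=16 neighbours per expansion the batch replaces 16 dispatches; the savings scale linearly with the number of beam steps.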
### Problem 4: Immutable Coalgebra State in the Query Hot Path
The build problems explained the gap at construction time. The 8× query latency gap had a different root cause: the HNSWNavigationCoalgebra implementation.
To understand why, it helps to understand what the coalgebra is doing. In Vajra's design, HNSW search is modelled as a coalgebraic unfolding: starting from an initial state, a transition function produces the next state at each step until a termination condition is met. This is a clean, mathematical way to structure the beam search. The state at each step is an HNSWSearchState, a frozen Python dataclass with three fields:
- candidates: the priority queue of nodes to explore next (the "open set")
- results: the current best-ef results found so far
- visited: the set of nodes already evaluated
HNSWSearchState is frozen: a frozen dataclass cannot be modified after creation. This is intentional. Immutable state objects are composable, independently testable, and safe to reason about. You can snapshot any intermediate state during a search, replay it, or pass it to a separate function for analysis. This makes the coalgebra a good abstraction for the documented public interface.
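A minimal sketch of the frozen-state idea (field types simplified relative to the real HNSWSearchState; the values are made up):

```python
from dataclasses import FrozenInstanceError, dataclass, replace

@dataclass(frozen=True)
class HNSWSearchState:
    candidates: tuple
    results: tuple
    visited: frozenset

s0 = HNSWSearchState(
    candidates=((0.4, 7),), results=((-0.4, 7),), visited=frozenset({7})
)

# Advancing a step means building a new state; the old one is untouched
# and can be snapshotted or replayed.
s1 = replace(s0, visited=frozenset(s0.visited | {12}))

# In-place mutation is rejected, which is what makes the state safe to share.
try:
    s0.visited = frozenset()
except FrozenInstanceError:
    mutation_blocked = True
```

This is what makes intermediate states replayable and independently testable: every step is a plain value.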
The performance cost is in the transition step. To advance from step k to step k+1, the coalgebra creates a brand-new HNSWSearchState. The visited field is a frozenset, Python's immutable set type. Updating it to include the newly visited node requires constructing a completely new frozenset from scratch:
```python
# Called ~800 times per query (ef=50, M=16 neighbours per step).
return [HNSWSearchState(
    candidates=tuple(candidates),   # new tuple: O(ef) allocation
    results=tuple(results),         # new tuple: O(ef) allocation
    visited=frozenset(visited),     # new frozenset: O(|visited|) copy; grows each step
    ...
)]
```
At step k, frozenset(visited) has k elements to copy into the new object. The cost is O(k) per step. Over a full beam search with ef=50 and M=16 neighbours per expansion, the search runs for roughly 800 steps:
```
Total copy cost: sum(k for k in range(800)) = 319,600 element-copy operations per query
```
That is, the visited-set maintenance alone accounts for 319,600 integer copy operations per query, growing quadratically with the number of steps. Add 800 tuple allocations for candidates, 800 for results, and the Python memory allocator is doing significant work on every query call, entirely in overhead rather than in useful distance computation.
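The quadratic copy is easy to demonstrate in isolation. A sketch of the two visited-set strategies (standalone functions, not Vajra's code):

```python
def immutable_steps(n):
    # Each transition rebuilds `visited` as a new frozenset:
    # O(k) copying at step k, O(n^2) element copies in total.
    visited = frozenset()
    for k in range(n):
        visited = frozenset(visited | {k})
    return visited

def mutable_steps(n):
    # The mutable equivalent: O(1) amortised per step.
    visited = set()
    for k in range(n):
        visited.add(k)
    return visited

assert immutable_steps(800) == frozenset(mutable_steps(800))
assert sum(range(800)) == 319_600  # total element copies at 800 steps
```

Both produce the same set; only the copying cost differs, and it differs quadratically in the step count.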
The coalgebra is the right abstraction for Vajra's architecture. The fix was to add a parallel fast path that implements the identical algorithm with mutable state, bypassing the coalgebra in the hot query loop while keeping it available as the documented public interface for tests and extension.
_beam_search_fast() implements the same algorithm with mutable state:
```python
def _beam_search_fast(self, query, entry, ef):
    layer_0 = self.graph.layers[0] if self.graph.layers else {}
    vectors = self.graph.vectors
    dist_fn = self._distance_batch

    d0 = float(dist_fn(query, vectors[entry : entry + 1])[0])
    candidates = [(d0, entry)]   # min-heap
    results = [(-d0, entry)]     # max-heap, capped at ef
    visited = {entry}
    n_results = 1
    results_full = (n_results >= ef)
    worst = d0

    while candidates:
        dist, current = heapq.heappop(candidates)
        if results_full and dist > worst:
            break
        unvisited = [n for n in layer_0.get(current, []) if n not in visited]
        if not unvisited:
            continue
        visited.update(unvisited)
        dists = dist_fn(query, vectors[unvisited])
        for n, nd_raw in zip(unvisited, dists):
            nd = float(nd_raw)
            heapq.heappush(candidates, (nd, n))
            if results_full:
                if nd < worst:
                    heapq.heapreplace(results, (-nd, n))
                    worst = -results[0][0]
            else:
                heapq.heappush(results, (-nd, n))
                n_results += 1
                worst = -results[0][0]
                if n_results >= ef:
                    results_full = True

    return [(idx, -d) for d, idx in sorted(results, reverse=True)]
```
The n_results counter and results_full flag also eliminate ~19 million len(results) calls that showed up in the pre-fix profiler (the len() check ran on every iteration of the inner loop; once the results heap is full it stays full, so a boolean flag is sufficient).
search() was rewritten to use _greedy_search_layer for the upper-layer descent and _beam_search_fast for the layer-0 beam search. The coalgebra is still constructed and stored in self.coalgebra after every add() call; it is just not used in the query hot path.
### Additional Fix: Inline _prune_connections
One smaller fix worth mentioning: _prune_connections() was a separate method called for every selected neighbour during node insertion, roughly 330,000 times at N=10k. Each call recomputed neighbour distances to check whether pruning was needed, even though most nodes did not need pruning.
Inlining the pruning logic directly in _insert_node eliminated the function-call overhead and allowed caching layers[l] as a local reference:
```python
layer = self.graph.layers[l]
for neighbor_idx, _ in selected:
    nbrs = layer[neighbor_idx]
    if len(nbrs) > M:
        dists = self._distance_batch(vectors[neighbor_idx], vectors[nbrs])
        keep = np.argsort(dists)[:M]
        layer[neighbor_idx] = [nbrs[i] for i in keep]
```
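The pruning step itself is just an argsort-and-slice: keep the M closest neighbours, drop the rest. A tiny illustration with made-up distances:

```python
import numpy as np

M = 3
nbrs = [10, 11, 12, 13, 14]
dists = np.array([0.9, 0.1, 0.5, 0.3, 0.7], dtype=np.float32)

# Indices of the M smallest distances, i.e. the M closest neighbours.
keep = np.argsort(dists)[:M]
pruned = [nbrs[i] for i in keep]

assert set(pruned) == {11, 13, 12}
```

np.argsort over a handful of neighbours is cheap; the cost that mattered here was the 330,000 Python function calls around it, not the sort itself.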
## Results After All Six Fixes
The same 10,000-document benchmark, version 0.5.0:
| Metric | ZVec | Vajra v0.4.1 | Vajra v0.5.0 | Improvement |
|---|---|---|---|---|
| Build time | 0.47 s | 97.77 s | 17.03 s | 5.7× faster |
| p50 query | 0.31 ms | 2.63 ms | 0.51 ms | 5.2× faster |
| p95 query | 0.42 ms | 3.77 ms | 0.71 ms | 5.3× faster |
| p99 query | 0.53 ms | 4.30 ms | 0.86 ms | 5.0× faster |
| QPS | 3,058 | 374 | 1,911 | 5.1× higher |
| Recall@10 | 1.000 | 0.987 | 0.997 | +1.0% |
The gap to ZVec:
- Build: 207× → 36×. The remaining 36× is the Python-vs-C++ interpreter cost for 10,000 HNSW insertions. Getting below ~12 seconds in Python would require Numba JIT on the inner beam search loop.
- Query: 8.5× → 1.6×. The remaining 1.6× reflects Python's heapq overhead. At ef=50 with M=16, each query runs roughly 800 heap operations. CPython's heapq.heappush costs ~300 ns per call on an M1; 800 × 300 ns ≈ 240 µs of irreducible heap overhead per query. ZVec's C++ std::priority_queue at ~10 ns per op does the same 800 operations in about 8 µs.
The query gap is now at the CPython interpreter floor for this algorithm, not an implementation defect.
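The heap-cost estimate can be sanity-checked with timeit. Absolute numbers vary by machine; ~300 ns per operation is an order of magnitude on recent hardware, not a guarantee:

```python
import heapq
import timeit

def run_query_heap(n=800):
    # Push/pop traffic similar in shape to one beam search's heap usage.
    h = []
    for k in range(n):
        heapq.heappush(h, (float(k % 97), k))
    while h:
        heapq.heappop(h)

seconds = timeit.timeit(run_query_heap, number=50) / 50
per_op_ns = seconds / (2 * 800) * 1e9  # 800 pushes + 800 pops per run
```

Multiplying the measured per-op cost by the ~800 operations per query gives the interpreter floor for this search configuration on the machine at hand.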
## What This Demonstrates
None of these six changes required restructuring the algorithm, adding Numba, writing a C extension, or compromising the categorical abstractions that make Vajra's code clean. They were all standard Python engineering: profile first, find the real bottleneck, fix the specific thing.
The original 8× query gap looked like a fundamental Python-vs-C++ problem. It was mostly a frozenset copy in the wrong place. The 207× build gap looked like an algorithmic problem. It was a single np.vstack call inside a loop.
Vajra v0.5.0 is on PyPI. The benchmark code is at github.com/aiexplorations/vajra_bm25 and the ZVec comparison scripts are at github.com/aiexplorations/zvec_vajra_benchmark.