Vajra-BM25 v0.5.0 was a major improvement over the first Python vector implementation: query-time behavior improved markedly, and build time dropped substantially. But once we started running larger Wikipedia slices repeatedly, HNSW construction was still spending too much time in Python-level hot loops.
After a lot of deliberation, I decided to move on from the Python-only implementation of vajra-bm25 to a Rust-based backend for the Vajra Search package. The resulting vajra-search is now at v0.2.1 and published on PyPI. It is the next step toward a performant search package that can perform lexical, vector, and hybrid search, all while maintaining the underlying category-theory abstractions we set out to build Vajra with. That said, I didn't want to rock the boat by abandoning Python: the Python API stays stable, while the heavy Approximate Nearest Neighbours (ANN) path moves into Rust. This post explains what changed in practice, with short code snippets and benchmark evidence.
Release Status
- Package line: vajra-search 0.2.1
- PyPI release workflow: tag v0.2.1 -> build wheels/sdist -> publish
- If PyPI still shows 0.2.0, publication is pending from the release pipeline, not that the codebase is still on 0.2.0.
The Journey So Far
flowchart LR
A["Vajra BM25 benchmark work"] --> B["Vajra vector v0.4.1\nPython baseline"]
B --> C["Vajra vector v0.5.0\nprofiling-driven fixes"]
C --> D["Vajra Rust backend v0.2.x\ncustom HNSW + PyO3"]
C --> E["ZVec comparison showed\nremaining structural cost"]
E --> D
Earlier posts in this series:
- Vajra BM25: Building a Search Engine with Category Theory (Dec 24, 2025)
- Re-engineering Vajra's vector search with inspiration from ZVec (Feb 22, 2026)
- From Vajra v0.4.1 to v0.5.0 to Rust backend (Feb 23, 2026)
Why Rust After v0.5.0
As you may recall from the earlier blog posts, we performed a number of optimizations to bring Vajra-BM25 from 0.4.1 to 0.5.0. These improved build time while keeping latency and recall@k tractable. After the v0.5.0 improvements, though, three costs were still dominant:
- Build path cost in graph construction. HNSW insertion (required for vector and hybrid search modes) does repeated expansion, scoring, and pruning; this runs many times per vector.
- Heap-heavy query traversal. Candidate frontier updates are branch-heavy and frequent.
- Distance math in tight loops. Small inefficiencies multiply quickly.
Rust was selected for three practical reasons: predictable memory layout, low-overhead mutation in tight loops, and compiled distance kernels. Its mature Python interop and fine-grained control over memory make all three problems far more tractable. The objective was not a wholesale language rewrite, though, since I needed the package to remain usable by Pythonistas. The objective was to move the critical path while keeping Python usage unchanged.
Python and Rust Integration
Python is still the developer-facing API. Rust handles index mutation, search traversal, and persistence.
flowchart TB
subgraph PY["Python surface"]
P1["vajra_search.vector.NativeHNSWIndex"]
P2["VajraVectorSearch / HybridSearchEngine"]
end
subgraph FFI["PyO3 boundary"]
F1["PyHnswIndex <-> HnswIndex"]
end
subgraph RS["Rust core"]
R1["index.rs"]
R2["graph.rs"]
R3["beam_search_fast + greedy_descent"]
R4["distance.rs"]
R5["coalgebra.rs (reference/parity path)"]
R6["bincode save/load"]
end
P2 --> P1 --> F1 --> R1
R1 --> R2
R1 --> R3
R3 --> R4
R5 -. "parity checks" .-> R3
R1 --> R6
This split keeps performance work isolated in Rust while preserving a stable Python-facing API.
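Conceptually, the boundary is a thin Python facade over a compiled handle. A minimal pure-Python sketch of that shape (class and method names here are illustrative stand-ins, not the actual vajra_search API, and the "core" is plain Python where the real package has a PyO3 extension):

```python
class CoreStub:
    """Stand-in for the compiled HNSW handle (the PyO3-backed object).
    Uses brute-force search where the real core runs beam search."""
    def __init__(self, dim):
        self.dim = dim
        self.vectors = []

    def insert(self, vec):
        assert len(vec) == self.dim
        self.vectors.append(vec)
        return len(self.vectors) - 1

    def search(self, query, k):
        def l2_sq(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        order = sorted(range(len(self.vectors)),
                       key=lambda i: l2_sq(self.vectors[i], query))
        return order[:k]


class IndexFacade:
    """Python-facing wrapper: validation and ergonomics live here;
    all heavy work is delegated across the FFI boundary."""
    def __init__(self, dim):
        self._core = CoreStub(dim)

    def add(self, vec):
        return self._core.insert([float(x) for x in vec])

    def query(self, vec, k=10):
        return self._core.search([float(x) for x in vec], k)


idx = IndexFacade(dim=2)
for v in [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]:
    idx.add(v)
print(idx.query([0.9, 0.9], k=2))  # -> [1, 0]
```

The point of the split is that the facade can evolve (validation, defaults, convenience methods) without touching the compiled core, and the core can be optimized without breaking callers.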
How Indexing Works (The Build Critical Path)
When a new vector is inserted, the index does this:
flowchart TD
A["Insert vector + assign level"] --> B["Greedy descent from entry point"]
B --> C["Layer-wise beam expansion"]
C --> D["Neighbor selection"]
D --> E["Bidirectional edge wiring"]
E --> F["Prune overfull neighbor lists"]
F --> G["Update entry point if needed"]
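The steps above can be sketched in miniature. The toy below is a single-layer navigable graph in pure Python, with simplified distance-only pruning; it is not the Rust implementation (which adds the level hierarchy and a diversity-aware selection heuristic), but the insert loop has the same shape: entry, beam expansion, neighbor selection, bidirectional wiring, pruning:

```python
import heapq

def l2_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class TinyGraph:
    """Toy single-layer graph illustrating the HNSW insert path."""
    def __init__(self, m=4, ef_construction=8):
        self.m = m                 # max graph degree
        self.ef = ef_construction  # beam width during construction
        self.vecs = []
        self.edges = []            # adjacency lists

    def _beam_search(self, query, ep, ef):
        # Best-first expansion of a bounded frontier (the ef beam).
        visited = {ep}
        frontier = [(l2_sq(query, self.vecs[ep]), ep)]
        results = list(frontier)
        while frontier:
            d, node = heapq.heappop(frontier)
            if len(results) >= ef and d > max(r[0] for r in results):
                break  # frontier can no longer improve the result set
            for nb in self.edges[node]:
                if nb not in visited:
                    visited.add(nb)
                    item = (l2_sq(query, self.vecs[nb]), nb)
                    heapq.heappush(frontier, item)
                    results.append(item)
        return sorted(results)[:ef]

    def insert(self, vec):
        nid = len(self.vecs)
        self.vecs.append(vec)
        self.edges.append([])
        if nid == 0:
            return nid  # first vector becomes the entry point
        # Beam expansion from the entry point, then pick m closest.
        cand = self._beam_search(vec, 0, self.ef)
        neighbors = [n for _, n in cand[: self.m]]
        # Bidirectional wiring, pruning overfull lists back to m.
        for nb in neighbors:
            self.edges[nid].append(nb)
            self.edges[nb].append(nid)
            if len(self.edges[nb]) > self.m:
                self.edges[nb].sort(key=lambda o: l2_sq(self.vecs[nb], self.vecs[o]))
                self.edges[nb] = self.edges[nb][: self.m]
        return nid

g = TinyGraph(m=2, ef_construction=8)
for v in [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    g.insert(v)
print([n for _, n in g._beam_search([0.9, 0.1], 0, 2)])  # -> [1, 0]
```

Even in this toy, it is visible why construction dominates: every insert reruns the beam search and the pruning loop, so their per-call overhead is multiplied by the corpus size.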
In practice, build time mostly depends on:
- frontier width during construction (`ef_construction`),
- graph degree (`M`),
- per-step distance and pruning cost.
For that reason, quality, fast, and instant are explicit operating modes with measurable trade-offs, not naming-only presets.
What Changed in v0.2.1 (Detailed)
Most changes were concentrated in the HNSW construction path:
| Area | Change | Mechanism |
|---|---|---|
| Beam search | scratch reuse (BeamSearchScratch) | avoids repeated heap/set allocation in insertion-time search |
| Neighbor selection | bounded candidate windows + scratch output buffers | reduces pruning work and temporary Vec churn |
| Distance kernels | unrolled dot_f32 / l2_sq_f32 | better compiler vectorization behavior |
| Build scheduling | snapshot planning in micro-batches/epochs | parallelizes planning work while preserving deterministic commit |
| Profile tuning | non-heuristic construction caps in fast path | lowers unnecessary construction beam cost |
| Build config | target-cpu=native in local benchmark builds | ensures SIMD-relevant codegen on host CPU |
Three small examples capture the spirit of the work.
1) Reuse search scratch instead of allocating every call
Before, beam-search internals rebuilt heap/visited structures repeatedly during insertion-time searches.
Now, the buffers are reused:
let candidates = beam_search_fast_with_scratch(
&self.graph,
vector,
ep,
ef_c,
lc,
&self.metric,
&mut self.build_scratch.beam,
);
This reduces allocator pressure where the same operation runs thousands of times per batch.
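The same idea can be shown in miniature in Python: keep one scratch object alive across calls and clear it cheaply, instead of rebuilding the heap every time. This is illustrative only; the actual change lives in the Rust crate, and the names below are invented for the sketch:

```python
import heapq

class BeamScratch:
    """Reusable buffer for repeated nearest-neighbor searches
    (mirrors the idea of BeamSearchScratch: allocate once, reset per call)."""
    def __init__(self):
        self.heap = []

    def reset(self):
        self.heap.clear()  # cheap clear, no reallocation

def nearest_with_scratch(points, query, k, scratch):
    # Same algorithm as an allocate-per-call version; only the
    # lifetime of the working buffer changes.
    scratch.reset()
    for i, p in enumerate(points):
        d = sum((x - y) ** 2 for x, y in zip(p, query))
        heapq.heappush(scratch.heap, (d, i))
    return [heapq.heappop(scratch.heap)[1] for _ in range(k)]

scratch = BeamScratch()  # created once, reused across thousands of calls
pts = [[0.0], [2.0], [0.5]]
print(nearest_with_scratch(pts, [0.4], 2, scratch))  # -> [2, 0]
```

In CPython the allocator is rarely the bottleneck, but in Rust this pattern removes per-call heap traffic on a path that runs once per inserted vector per layer.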
2) Keep neighbor selection outputs in reusable buffers
Instead of allocating fresh output vectors for every prune/select call, the code writes into preallocated buffers:
select_neighbors_heuristic_into(
&self.graph,
&self.metric,
candidates,
m,
&mut self.build_scratch.heuristic_selected,
&mut self.build_scratch.heuristic_out,
);
The algorithm is unchanged. The overhead around it is smaller.
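A Python rendition of the "write into a caller-owned buffer" pattern (a sketch of the idea only: the real select_neighbors_heuristic_into also applies a diversity heuristic, while this version just keeps the m closest):

```python
def select_neighbors_into(candidates, m, out):
    """Pick the m closest candidates, writing node ids into a
    caller-owned list instead of allocating a fresh one."""
    out.clear()
    for _, node in sorted(candidates)[:m]:
        out.append(node)

out = []  # allocated once, reused across many prune/select calls
select_neighbors_into([(0.5, 7), (0.1, 3), (0.9, 2)], 2, out)
print(out)  # -> [3, 7]
select_neighbors_into([(0.3, 9), (0.2, 4)], 2, out)
print(out)  # -> [4, 9]
```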
3) Make planning parallel, keep commit deterministic
Batch insertion now performs snapshot planning in parallel, then commits in order:
let plans: Vec<PlannedInsertion> = (chunk_start..chunk_end)
.into_par_iter()
.map(|i| plan_insertion_on_snapshot(...))
.collect();
This gives parallel speedup without giving up deterministic graph mutation order.
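In Python terms, the pattern looks like the sketch below, with concurrent.futures standing in for rayon's into_par_iter and plan_insertion as a hypothetical stand-in for plan_insertion_on_snapshot:

```python
from concurrent.futures import ThreadPoolExecutor

def plan_insertion(i, snapshot):
    # Stand-in for plan_insertion_on_snapshot: read-only work against a
    # frozen snapshot, so it is safe to run in parallel.
    return {"id": i, "neighbors": sorted(snapshot)[:2]}

def insert_batch(graph, ids, snapshot):
    # Phase 1: plan in parallel against the snapshot.
    with ThreadPoolExecutor() as pool:
        plans = list(pool.map(lambda i: plan_insertion(i, snapshot), ids))
    # Phase 2: commit strictly in input order (pool.map preserves order),
    # so graph mutation stays deterministic regardless of thread timing.
    for plan in plans:
        graph[plan["id"]] = plan["neighbors"]
    return graph

g = insert_batch({}, [10, 11, 12], snapshot=[3, 1, 2])
print(g)  # -> {10: [1, 2], 11: [1, 2], 12: [1, 2]}
```

The key design property is that only the read-only planning phase is concurrent; the mutating commit phase is single-threaded and ordered, which is what keeps rebuilt indexes byte-for-byte reproducible.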
Why these changes target build time
At 50k vectors, construction cost is dominated by repeated execution of:
- layer-0 beam expansion (`ef_construction` frontier maintenance),
- distance evaluation over candidate neighbors,
- bidirectional edge wiring + overfull-node pruning.
The implemented changes do not change HNSW semantics. They reduce overhead around the same operations.
Implementation note: what did not change
The public Python API remained stable (NativeHNSWIndex, VajraVectorSearch, hybrid APIs).
The work focused on internals in crates/vajra-hnsw, with PyO3 bindings unchanged at the usage level.
50k build-time change from this pass:
| Profile | Before (s) | After (s) | Gain |
|---|---|---|---|
| quality | 78.62 | 52.17 | 33.6% |
| fast | 24.46 | 18.44 | 24.6% |
| instant | 5.94 | 4.22 | 29.0% |
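The gain column is simply the relative build-time reduction; a quick check against the numbers above:

```python
def gain(before, after):
    """Relative build-time reduction, as a percentage."""
    return round(100 * (before - after) / before, 1)

for profile, before, after in [("quality", 78.62, 52.17),
                               ("fast", 24.46, 18.44),
                               ("instant", 5.94, 4.22)]:
    print(profile, gain(before, after))
# -> quality 33.6
#    fast 24.6
#    instant 29.0
```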
These changes materially reduce build time, but build latency remains the main gap versus ZVec.
ZVec vs Vajra Rust on Wikipedia Slices
Benchmarks were run on 1k, 10k, 20k, and 50k slices with fixed embedding and metric settings.
At 50k:
| Engine/Profile | Build (s) | p50 (ms) | QPS | Recall@10 |
|---|---|---|---|---|
| ZVec | 2.628 | 0.796 | 1251.4 | 0.999 |
| Vajra quality | 53.075 | 0.205 | 4736.8 | 0.998 |
| Vajra fast | 18.308 | 0.170 | 5721.9 | 0.912 |
| Vajra instant | 4.186 | 0.071 | 12938.8 | 0.671 |
Key observations from this table:
- ZVec is still much faster to build.
- Vajra is faster on measured query latency and throughput in this setup.
- Recall follows profile aggressiveness, exactly as expected.
Scaling plots: (figures not reproduced here)
Lexical, Vector, and Hybrid from One Python API
The package exposes BM25 lexical, vector ANN, and hybrid retrieval from the same Python surface.
At 50k documents (deterministic mode benchmark):
| Mode | Build (s) | p50 (ms) | QPS |
|---|---|---|---|
| lexical | 0.241 | 0.499 | 1122.1 |
| vector | 1.947 | 0.011 | 87301.0 |
| hybrid | 2.190 | 0.585 | 986.0 |
This benchmark is intentionally narrow and controlled. For cross-engine comparisons, the Wikipedia + MiniLM setup is more representative.
Build Profile Semantics
The three profiles are intended as operational modes:
- `quality`: maximize recall stability at higher construction cost.
- `fast`: reduce construction cost while retaining useful recall.
- `instant`: minimize startup/build latency, accepting larger recall loss.
In production terms, this is a control-plane decision:
start quickly with instant or fast, then promote to quality when rebuild completes.
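A sketch of that control-plane pattern follows. The parameter values per profile are purely illustrative here, not the package's actual presets, and build_index is a hypothetical stand-in:

```python
# Hypothetical construction presets; the real quality/fast/instant
# values live inside vajra-search and differ from these numbers.
PROFILES = {
    "quality": {"ef_construction": 200, "m": 16},
    "fast":    {"ef_construction": 64,  "m": 12},
    "instant": {"ef_construction": 16,  "m": 8},
}

def build_index(vectors, profile):
    params = PROFILES[profile]
    # ...run HNSW construction with params (elided)...
    return {"profile": profile, "params": params, "size": len(vectors)}

# Serve quickly with the cheap profile, rebuild with the expensive one
# in the background, then swap once the rebuild completes.
serving = build_index(range(1000), "instant")
rebuilt = build_index(range(1000), "quality")
serving = rebuilt  # atomic swap to the higher-recall index
print(serving["profile"])  # -> quality
```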
What the current numbers imply
For 50k vectors in the current setup:
- Build latency is still the principal gap versus ZVec.
- Query latency/QPS are already competitive for Vajra in these runs.
- Recall behavior matches design intent: `quality` near reference, `fast` in the middle, `instant` lowest.
The practical takeaway is straightforward: the next phase should continue to focus on construction-path parallelization and pruning efficiency, because that is where the largest remaining gap sits.
Reproducibility: Where the Wikipedia Data Comes From
The data process follows the same benchmark discipline used in vajra_bm25:
- Load Wikipedia corpora through `ir_datasets` (WikIR sets like `wikir/en78k`, with fallbacks).
- Normalize documents into stable JSONL snapshots.
- Generate and lock embeddings (`all-MiniLM-L6-v2`, 384d).
- Reuse fixed slices (1k, 10k, 20k, 50k) across all runs.
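Locking a slice can be as simple as hashing the normalized snapshot and comparing digests before each run, so a silently changed file can never contaminate a comparison. A minimal sketch of that discipline (not the repository's actual script):

```python
import hashlib
import json

def snapshot_digest(docs):
    """Stable digest of a normalized JSONL snapshot: hash each record's
    canonical JSON form so any change to the slice is detected."""
    h = hashlib.sha256()
    for doc in docs:
        h.update(json.dumps(doc, sort_keys=True).encode())
        h.update(b"\n")
    return h.hexdigest()[:12]

slice_docs = [{"id": i, "text": f"doc {i}"} for i in range(3)]
d1 = snapshot_digest(slice_docs)
d2 = snapshot_digest(slice_docs)
assert d1 == d2  # same slice, same digest: runs stay comparable
print(d1 != snapshot_digest(slice_docs[:-1]))  # changed slice -> changed digest
```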
Useful references:
- vajra_bm25/benchmarks/download_wikipedia.py
- ir_benchmark_data
- zvec_vajra_benchmark
- vajra_search_engine/reproduction.md
Concluding Remarks
This phase does not close the full build-time gap, but it does establish a clear and useful baseline:
- Python remains the interface layer.
- Rust is the execution layer for HNSW build/search kernels.
- Profile selection is an explicit control for build-latency/recall trade-offs.
That is a good place to continue from: a stable API, measurable profile trade-offs, and a construction path that is now much easier to optimize incrementally. I also spent time building out the test pipeline, so recall and other performance metrics can be measured reproducibly; that pipeline has become a repeatable asset in its own right.