Vajra-BM25 v0.5.0 was a major improvement over the first Python vector implementation: query-time behavior improved markedly, and build time dropped substantially. But once we started running larger Wikipedia slices repeatedly, HNSW construction was still spending too much time in Python-level hot loops.
After a lot of deliberation, I decided to move on from the Python-only implementation of vajra-bm25 to a Rust-based backend for the Vajra Search package. The resulting vajra-search is now at v0.2.1 and published on PyPI. It is the next step toward a performant search package that can perform lexical, vector, and hybrid search, all while maintaining the underlying category-theory abstractions we set out to build Vajra with. That said, I didn't want to rock the boat by abandoning Python: the Python API stays stable, while the heavy Approximate Nearest Neighbours (ANN) path moves into Rust. This post explains what changed in practice, with short code snippets and benchmark evidence.
Release Status
- Package line: vajra-search 0.2.1
- PyPI release workflow: tag v0.2.1 -> build wheels/sdist -> publish
- If PyPI still shows 0.2.0, publication is pending from the release pipeline, not that the codebase is still on 0.2.0.
The Journey So Far
flowchart LR
A["Vajra BM25 benchmark work"] --> B["Vajra vector v0.4.1\nPython baseline"]
B --> C["Vajra vector v0.5.0\nprofiling-driven fixes"]
C --> D["Vajra Rust backend v0.2.x\ncustom HNSW + PyO3"]
C --> E["ZVec comparison showed\nremaining structural cost"]
E --> D
Earlier posts in this series:
- Vajra BM25: Building a Search Engine with Category Theory (Dec 24, 2025)
- Re-engineering Vajra's vector search with inspiration from ZVec (Feb 22, 2026)
- From Vajra v0.4.1 to v0.5.0 to Rust backend (Feb 23, 2026)
Why Rust After v0.5.0
As you may recall from the earlier blog posts, we performed a number of optimizations to bring Vajra-BM25 from 0.4.1 to 0.5.0. These improved build time while keeping latency and recall@k tractable. After the v0.5.0 improvements, though, three costs were still dominant:
- Build path cost in graph construction. HNSW insertion (required for vector and hybrid search modes) does repeated expansion, scoring, and pruning; this runs many times per vector.
- Heap-heavy query traversal. Candidate frontier updates are branch-heavy and frequent.
- Distance math in tight loops. Small inefficiencies multiply quickly.
Rust was selected for three practical reasons: predictable memory layout, low-overhead mutation in tight loops, and compiled distance kernels. Its mature Python interop and fine-grained control over memory make all three problems far more tractable. The objective was not a wholesale language rewrite, though, since I needed the package to remain usable by Pythonistas. The objective was to move the critical path while keeping Python usage unchanged.
Python and Rust Integration
Python is still the developer-facing API. Rust handles index mutation, search traversal, and persistence.
flowchart TB
subgraph PY["Python surface"]
P1["vajra_search.vector.NativeHNSWIndex"]
P2["VajraVectorSearch / HybridSearchEngine"]
end
subgraph FFI["PyO3 boundary"]
F1["PyHnswIndex <-> HnswIndex"]
end
subgraph RS["Rust core"]
R1["index.rs"]
R2["graph.rs"]
R3["beam_search_fast + greedy_descent"]
R4["distance.rs"]
R5["coalgebra.rs (reference/parity path)"]
R6["bincode save/load"]
end
P2 --> P1 --> F1 --> R1
R1 --> R2
R1 --> R3
R3 --> R4
R5 -. "parity checks" .-> R3
R1 --> R6
This split keeps performance work isolated in Rust while preserving a stable Python-facing API.
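Conceptually, the boundary is a thin Python facade over a compiled handle. A minimal pure-Python sketch of that shape (class and method names here are illustrative stand-ins, not the actual vajra_search API, and the "core" is plain Python where the real package has a PyO3 extension):

```python
class CoreStub:
    """Stand-in for the compiled HNSW handle (the PyO3-backed object).
    Uses brute-force search where the real core runs beam search."""
    def __init__(self, dim):
        self.dim = dim
        self.vectors = []

    def insert(self, vec):
        assert len(vec) == self.dim
        self.vectors.append(vec)
        return len(self.vectors) - 1

    def search(self, query, k):
        def l2_sq(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        order = sorted(range(len(self.vectors)),
                       key=lambda i: l2_sq(self.vectors[i], query))
        return order[:k]


class IndexFacade:
    """Python-facing wrapper: validation and ergonomics live here;
    all heavy work is delegated across the FFI boundary."""
    def __init__(self, dim):
        self._core = CoreStub(dim)

    def add(self, vec):
        return self._core.insert([float(x) for x in vec])

    def query(self, vec, k=10):
        return self._core.search([float(x) for x in vec], k)


idx = IndexFacade(dim=2)
for v in [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]:
    idx.add(v)
print(idx.query([0.9, 0.9], k=2))  # -> [1, 0]
```

The point of the split is that the facade can evolve (validation, defaults, convenience methods) without touching the compiled core, and the core can be optimized without breaking callers.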
How Indexing Works (The Build Critical Path)
When a new vector is inserted, the index does this:
flowchart TD
A["Insert vector + assign level"] --> B["Greedy descent from entry point"]
B --> C["Layer-wise beam expansion"]
C --> D["Neighbor selection"]
D --> E["Bidirectional edge wiring"]
E --> F["Prune overfull neighbor lists"]
F --> G["Update entry point if needed"]
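The steps above can be sketched in miniature. The toy below is a single-layer navigable graph in pure Python, with simplified distance-only pruning; it is not the Rust implementation (which adds the level hierarchy and a diversity-aware selection heuristic), but the insert loop has the same shape: entry, beam expansion, neighbor selection, bidirectional wiring, pruning:

```python
import heapq

def l2_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class TinyGraph:
    """Toy single-layer graph illustrating the HNSW insert path."""
    def __init__(self, m=4, ef_construction=8):
        self.m = m                 # max graph degree
        self.ef = ef_construction  # beam width during construction
        self.vecs = []
        self.edges = []            # adjacency lists

    def _beam_search(self, query, ep, ef):
        # Best-first expansion of a bounded frontier (the ef beam).
        visited = {ep}
        frontier = [(l2_sq(query, self.vecs[ep]), ep)]
        results = list(frontier)
        while frontier:
            d, node = heapq.heappop(frontier)
            if len(results) >= ef and d > max(r[0] for r in results):
                break  # frontier can no longer improve the result set
            for nb in self.edges[node]:
                if nb not in visited:
                    visited.add(nb)
                    item = (l2_sq(query, self.vecs[nb]), nb)
                    heapq.heappush(frontier, item)
                    results.append(item)
        return sorted(results)[:ef]

    def insert(self, vec):
        nid = len(self.vecs)
        self.vecs.append(vec)
        self.edges.append([])
        if nid == 0:
            return nid  # first vector becomes the entry point
        # Beam expansion from the entry point, then pick m closest.
        cand = self._beam_search(vec, 0, self.ef)
        neighbors = [n for _, n in cand[: self.m]]
        # Bidirectional wiring, pruning overfull lists back to m.
        for nb in neighbors:
            self.edges[nid].append(nb)
            self.edges[nb].append(nid)
            if len(self.edges[nb]) > self.m:
                self.edges[nb].sort(key=lambda o: l2_sq(self.vecs[nb], self.vecs[o]))
                self.edges[nb] = self.edges[nb][: self.m]
        return nid

g = TinyGraph(m=2, ef_construction=8)
for v in [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    g.insert(v)
print([n for _, n in g._beam_search([0.9, 0.1], 0, 2)])  # -> [1, 0]
```

Even in this toy, it is visible why construction dominates: every insert reruns the beam search and the pruning loop, so their per-call overhead is multiplied by the corpus size.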
In practice, build time mostly depends on:
- frontier width during construction (`ef_construction`),
- graph degree (`M`),
- per-step distance and pruning cost.
For that reason, quality, fast, and instant are explicit operating modes with measurable trade-offs, not naming-only presets.
What Changed in v0.2.1 (Detailed)
Most changes were concentrated in the HNSW construction path:
| Area | Change | Mechanism |
|---|---|---|
| Beam search | scratch reuse (BeamSearchScratch) | avoids repeated heap/set allocation in insertion-time search |
| Neighbor selection | bounded candidate windows + scratch output buffers | reduces pruning work and temporary Vec churn |
| Distance kernels | unrolled dot_f32 / l2_sq_f32 | better compiler vectorization behavior |
| Build scheduling | snapshot planning in micro-batches/epochs | parallelizes planning work while preserving deterministic commit |
| Profile tuning | non-heuristic construction caps in fast path | lowers unnecessary construction beam cost |
| Build config | target-cpu=native in local benchmark builds | ensures SIMD-relevant codegen on host CPU |
Three small examples capture the spirit of the work.
1) Reuse search scratch instead of allocating every call
Before, beam-search internals rebuilt heap/visited structures repeatedly during insertion-time searches.
Now, the buffers are reused:
let candidates = beam_search_fast_with_scratch(
&self.graph,
vector,
ep,
ef_c,
lc,
&self.metric,
&mut self.build_scratch.beam,
);
This reduces allocator pressure where the same operation runs thousands of times per batch.
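The same idea can be shown in miniature in Python: keep one scratch object alive across calls and clear it cheaply, instead of rebuilding the heap every time. This is illustrative only; the actual change lives in the Rust crate, and the names below are invented for the sketch:

```python
import heapq

class BeamScratch:
    """Reusable buffer for repeated nearest-neighbor searches
    (mirrors the idea of BeamSearchScratch: allocate once, reset per call)."""
    def __init__(self):
        self.heap = []

    def reset(self):
        self.heap.clear()  # cheap clear, no reallocation

def nearest_with_scratch(points, query, k, scratch):
    # Same algorithm as an allocate-per-call version; only the
    # lifetime of the working buffer changes.
    scratch.reset()
    for i, p in enumerate(points):
        d = sum((x - y) ** 2 for x, y in zip(p, query))
        heapq.heappush(scratch.heap, (d, i))
    return [heapq.heappop(scratch.heap)[1] for _ in range(k)]

scratch = BeamScratch()  # created once, reused across thousands of calls
pts = [[0.0], [2.0], [0.5]]
print(nearest_with_scratch(pts, [0.4], 2, scratch))  # -> [2, 0]
```

In CPython the allocator is rarely the bottleneck, but in Rust this pattern removes per-call heap traffic on a path that runs once per inserted vector per layer.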
2) Keep neighbor selection outputs in reusable buffers
Instead of allocating fresh output vectors for every prune/select call, the code writes into preallocated buffers:
select_neighbors_heuristic_into(
&self.graph,
&self.metric,
candidates,
m,
&mut self.build_scratch.heuristic_selected,
&mut self.build_scratch.heuristic_out,
);
The algorithm is unchanged. The overhead around it is smaller.
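A Python rendition of the "write into a caller-owned buffer" pattern (a sketch of the idea only: the real select_neighbors_heuristic_into also applies a diversity heuristic, while this version just keeps the m closest):

```python
def select_neighbors_into(candidates, m, out):
    """Pick the m closest candidates, writing node ids into a
    caller-owned list instead of allocating a fresh one."""
    out.clear()
    for _, node in sorted(candidates)[:m]:
        out.append(node)

out = []  # allocated once, reused across many prune/select calls
select_neighbors_into([(0.5, 7), (0.1, 3), (0.9, 2)], 2, out)
print(out)  # -> [3, 7]
select_neighbors_into([(0.3, 9), (0.2, 4)], 2, out)
print(out)  # -> [4, 9]
```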
3) Make planning parallel, keep commit deterministic
Batch insertion now performs snapshot planning in parallel, then commits in order:
let plans: Vec<PlannedInsertion> = (chunk_start..chunk_end)
.into_par_iter()
.map(|i| plan_insertion_on_snapshot(...))
.collect();
This gives parallel speedup without giving up deterministic graph mutation order.
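In Python terms, the pattern looks like the sketch below, with concurrent.futures standing in for rayon's into_par_iter and plan_insertion as a hypothetical stand-in for plan_insertion_on_snapshot:

```python
from concurrent.futures import ThreadPoolExecutor

def plan_insertion(i, snapshot):
    # Stand-in for plan_insertion_on_snapshot: read-only work against a
    # frozen snapshot, so it is safe to run in parallel.
    return {"id": i, "neighbors": sorted(snapshot)[:2]}

def insert_batch(graph, ids, snapshot):
    # Phase 1: plan in parallel against the snapshot.
    with ThreadPoolExecutor() as pool:
        plans = list(pool.map(lambda i: plan_insertion(i, snapshot), ids))
    # Phase 2: commit strictly in input order (pool.map preserves order),
    # so graph mutation stays deterministic regardless of thread timing.
    for plan in plans:
        graph[plan["id"]] = plan["neighbors"]
    return graph

g = insert_batch({}, [10, 11, 12], snapshot=[3, 1, 2])
print(g)  # -> {10: [1, 2], 11: [1, 2], 12: [1, 2]}
```

The key design property is that only the read-only planning phase is concurrent; the mutating commit phase is single-threaded and ordered, which is what keeps rebuilt indexes byte-for-byte reproducible.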
Why these changes target build time
At 50k vectors, construction cost is dominated by repeated execution of:
- layer-0 beam expansion (`ef_construction` frontier maintenance),
- distance evaluation over candidate neighbors,
- bidirectional edge wiring + overfull-node pruning.
The implemented changes do not change HNSW semantics. They reduce overhead around the same operations.
Implementation note: what did not change
The public Python API remained stable (NativeHNSWIndex, VajraVectorSearch, hybrid APIs).
The work focused on internals in crates/vajra-hnsw, with PyO3 bindings unchanged at the usage level.
50k build-time change from this pass:
| Profile | Before (s) | After (s) | Gain |
|---|---|---|---|
| quality | 78.62 | 52.17 | 33.6% |
| fast | 24.46 | 18.44 | 24.6% |
| instant | 5.94 | 4.22 | 29.0% |
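The gain column is simply the relative build-time reduction; a quick check against the numbers above:

```python
def gain(before, after):
    """Relative build-time reduction, as a percentage."""
    return round(100 * (before - after) / before, 1)

for profile, before, after in [("quality", 78.62, 52.17),
                               ("fast", 24.46, 18.44),
                               ("instant", 5.94, 4.22)]:
    print(profile, gain(before, after))
# -> quality 33.6
#    fast 24.6
#    instant 29.0
```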
These changes materially reduce build time, but build latency remains the main gap versus ZVec.
ZVec vs Vajra Rust on Wikipedia Slices
Benchmarks were run on 1k, 10k, 20k, and 50k slices with fixed embedding and metric settings.
At 50k:
| Engine/Profile | Build (s) | p50 (ms) | QPS | Recall@10 |
|---|---|---|---|---|
| ZVec | 2.628 | 0.796 | 1251.4 | 0.999 |
| Vajra quality | 53.075 | 0.205 | 4736.8 | 0.998 |
| Vajra fast | 18.308 | 0.170 | 5721.9 | 0.912 |
| Vajra instant | 4.186 | 0.071 | 12938.8 | 0.671 |
Key observations from this table:
- ZVec is still much faster to build.
- Vajra is faster on measured query latency and throughput in this setup.
- Recall follows profile aggressiveness, exactly as expected.
Scaling plots: (figures not reproduced here)
Lexical, Vector, and Hybrid from One Python API
The package exposes BM25 lexical, vector ANN, and hybrid retrieval from the same Python surface.
At 50k documents (deterministic mode benchmark):
| Mode | Build (s) | p50 (ms) | QPS |
|---|---|---|---|
| lexical | 0.241 | 0.499 | 1122.1 |
| vector | 1.947 | 0.011 | 87301.0 |
| hybrid | 2.190 | 0.585 | 986.0 |
This benchmark is intentionally narrow and controlled. For cross-engine comparisons, the Wikipedia + MiniLM setup is more representative.
Build Profile Semantics
The three profiles are intended as operational modes:
- `quality`: maximize recall stability at higher construction cost.
- `fast`: reduce construction cost while retaining useful recall.
- `instant`: minimize startup/build latency, accepting larger recall loss.
In production terms, this is a control-plane decision:
start quickly with instant or fast, then promote to quality when rebuild completes.
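A sketch of that control-plane pattern follows. The parameter values per profile are purely illustrative here, not the package's actual presets, and build_index is a hypothetical stand-in:

```python
# Hypothetical construction presets; the real quality/fast/instant
# values live inside vajra-search and differ from these numbers.
PROFILES = {
    "quality": {"ef_construction": 200, "m": 16},
    "fast":    {"ef_construction": 64,  "m": 12},
    "instant": {"ef_construction": 16,  "m": 8},
}

def build_index(vectors, profile):
    params = PROFILES[profile]
    # ...run HNSW construction with params (elided)...
    return {"profile": profile, "params": params, "size": len(vectors)}

# Serve quickly with the cheap profile, rebuild with the expensive one
# in the background, then swap once the rebuild completes.
serving = build_index(range(1000), "instant")
rebuilt = build_index(range(1000), "quality")
serving = rebuilt  # atomic swap to the higher-recall index
print(serving["profile"])  # -> quality
```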
What the current numbers imply
For 50k vectors in the current setup:
- Build latency is still the principal gap versus ZVec.
- Query latency/QPS are already competitive for Vajra in these runs.
- Recall behavior matches design intent: `quality` near reference, `fast` in the middle, `instant` lowest.
The practical takeaway is straightforward: the next phase should continue to focus on construction-path parallelization and pruning efficiency, because that is where the largest remaining gap sits.
Reproducibility: Where the Wikipedia Data Comes From
The data process follows the same benchmark discipline used in vajra_bm25:
- Load Wikipedia corpora through `ir_datasets` (WikIR sets like `wikir/en78k`, with fallbacks).
- Normalize documents into stable JSONL snapshots.
- Generate and lock embeddings (`all-MiniLM-L6-v2`, 384d).
- Reuse fixed slices (1k, 10k, 20k, 50k) across all runs.
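Locking a slice can be as simple as hashing the normalized snapshot and comparing digests before each run, so a silently changed file can never contaminate a comparison. A minimal sketch of that discipline (not the repository's actual script):

```python
import hashlib
import json

def snapshot_digest(docs):
    """Stable digest of a normalized JSONL snapshot: hash each record's
    canonical JSON form so any change to the slice is detected."""
    h = hashlib.sha256()
    for doc in docs:
        h.update(json.dumps(doc, sort_keys=True).encode())
        h.update(b"\n")
    return h.hexdigest()[:12]

slice_docs = [{"id": i, "text": f"doc {i}"} for i in range(3)]
d1 = snapshot_digest(slice_docs)
d2 = snapshot_digest(slice_docs)
assert d1 == d2  # same slice, same digest: runs stay comparable
print(d1 != snapshot_digest(slice_docs[:-1]))  # changed slice -> changed digest
```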
Useful references:
- vajra_bm25/benchmarks/download_wikipedia.py
- ir_benchmark_data
- zvec_vajra_benchmark
- vajra_search_engine/reproduction.md
Concluding Remarks
This phase does not close the full build-time gap, but it does establish a clear and useful baseline:
- Python remains the interface layer.
- Rust is the execution layer for HNSW build/search kernels.
- Profile selection is an explicit control for build-latency/recall trade-offs.
That is a good place to continue from: a stable API, measurable profile trade-offs, and a construction path that is now much easier to optimize incrementally. I also spent time building out the test pipeline, so recall and other performance metrics can be measured reproducibly; that pipeline has become a repeatable asset in its own right.