Vajra-BM25 v0.5.0 was a major improvement over the first Python vector implementation. Query latency improved markedly, and build time dropped substantially. But once we started running larger Wikipedia slices repeatedly, HNSW construction was still spending too much time in Python-level hot loops.

After a lot of deliberation, I decided to move on from the Python-only implementation of vajra-bm25 to a Rust-based backend for the Vajra Search package. The resulting vajra-search is now at v0.2.1 and published on PyPI. It is the next step toward a performant search package that supports lexical, vector, and hybrid search, all while maintaining the underlying category theory abstractions we set out to build Vajra with. That said, I didn't want to abandon Python: the Python API stays stable, and only the heavy Approximate Nearest Neighbours (ANN) path moves into Rust. This post explains what changed in practice, with short code snippets and benchmark evidence.

Release Status

  • Package line: vajra-search 0.2.1
  • PyPI release workflow: tag v0.2.1 -> build wheels/sdist -> publish
  • If PyPI still shows 0.2.0, publication is pending in the release pipeline; the codebase itself is already on 0.2.1.

The Journey So Far

```mermaid
flowchart LR
    A["Vajra BM25 benchmark work"] --> B["Vajra vector v0.4.1\nPython baseline"]
    B --> C["Vajra vector v0.5.0\nprofiling-driven fixes"]
    C --> D["Vajra Rust backend v0.2.x\ncustom HNSW + PyO3"]

    C --> E["ZVec comparison showed\nremaining structural cost"]
    E --> D
```

Earlier posts in this series:

  1. Vajra BM25: Building a Search Engine with Category Theory (Dec 24, 2025)
  2. Re-engineering Vajra's vector search with inspiration from ZVec (Feb 22, 2026)
  3. From Vajra v0.4.1 to v0.5.0 to Rust backend (Feb 23, 2026)

Why Rust After v0.5.0

As you may recall from the earlier blog posts, we performed a number of optimizations between v0.4.1 and v0.5.0 of Vajra-BM25. These improved build time while keeping latency and recall@k tractable. Even after the v0.5.0 improvements, though, three costs were still dominant:

  1. Build path cost in graph construction. HNSW insertion (required for vector and hybrid search modes) does repeated expansion, scoring, and pruning; this runs many times per vector.
  2. Heap-heavy query traversal. Candidate frontier updates are branch-heavy and frequent.
  3. Distance math in tight loops. Small inefficiencies multiply quickly.
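The second and third of these costs are easiest to see in code. The toy sketch below is a minimal single-layer beam search over a hypothetical adjacency list (not Vajra's actual `beam_search_fast`): every step pops a heap, probes a visited set, and evaluates a distance, and that loop runs constantly during both build and query.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashSet};

// Distances scaled to integers so they can live in a BinaryHeap
// without a float-ordering wrapper.
type Dist = u64;

/// Minimal single-layer beam search over an adjacency list.
/// Every iteration pushes/pops heap entries and probes a hash set --
/// exactly the branch-heavy work that dominated the Python version.
fn beam_search(adj: &[Vec<usize>], dist: impl Fn(usize) -> Dist, entry: usize, ef: usize) -> Vec<usize> {
    let mut visited = HashSet::from([entry]);
    let mut frontier = BinaryHeap::from([Reverse((dist(entry), entry))]); // min-heap of candidates
    let mut best: BinaryHeap<(Dist, usize)> = BinaryHeap::from([(dist(entry), entry)]); // max-heap of top-ef
    while let Some(Reverse((d, node))) = frontier.pop() {
        if best.len() >= ef && d > best.peek().unwrap().0 {
            break; // closest frontier entry is worse than the worst kept result
        }
        for &nb in &adj[node] {
            if visited.insert(nb) {
                let nd = dist(nb);
                if best.len() < ef || nd < best.peek().unwrap().0 {
                    frontier.push(Reverse((nd, nb)));
                    best.push((nd, nb));
                    if best.len() > ef {
                        best.pop(); // evict current worst
                    }
                }
            }
        }
    }
    let mut out: Vec<usize> = best.into_iter().map(|(_, n)| n).collect();
    out.sort_by_key(|&n| dist(n));
    out
}

fn main() {
    // Toy 6-node ring graph; "distance" of node i to the query is |i - 3|.
    let adj = vec![vec![1, 5], vec![0, 2], vec![1, 3], vec![2, 4], vec![3, 5], vec![4, 0]];
    let dist = |i: usize| (i as i64 - 3).unsigned_abs();
    let found = beam_search(&adj, dist, 0, 2);
    println!("{:?}", found); // prints [3, 2]
}
```

In Python, each of those heap and set operations carries interpreter overhead; in Rust they compile down to tight, predictable code, which is the core argument for moving this path.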

Rust was selected for three practical reasons: predictable memory layout, low-overhead mutation in tight loops, and compiled distance kernels. Its mature Python interop (via PyO3) and fine-grained control over memory make the problems above tractable. The objective was not a wholesale language rewrite, though, since the package had to remain usable by Pythonistas. The objective was to move the critical path into Rust while keeping Python usage unchanged.

Python and Rust Integration

Python is still the developer-facing API. Rust handles index mutation, search traversal, and persistence.

```mermaid
flowchart TB
    subgraph PY["Python surface"]
      P1["vajra_search.vector.NativeHNSWIndex"]
      P2["VajraVectorSearch / HybridSearchEngine"]
    end

    subgraph FFI["PyO3 boundary"]
      F1["PyHnswIndex <-> HnswIndex"]
    end

    subgraph RS["Rust core"]
      R1["index.rs"]
      R2["graph.rs"]
      R3["beam_search_fast + greedy_descent"]
      R4["distance.rs"]
      R5["coalgebra.rs (reference/parity path)"]
      R6["bincode save/load"]
    end
    P2 --> P1 --> F1 --> R1

    R1 --> R2
    R1 --> R3
    R3 --> R4
    R5 -. "parity checks" .-> R3
    R1 --> R6
```

This split keeps performance work isolated in Rust while preserving a stable Python-facing API.

How Indexing Works (The Build Critical Path)

When a new vector is inserted, the index does this:

```mermaid
flowchart TD
    A["Insert vector + assign level"] --> B["Greedy descent from entry point"]
    B --> C["Layer-wise beam expansion"]
    C --> D["Neighbor selection"]
    D --> E["Bidirectional edge wiring"]
    E --> F["Prune overfull neighbor lists"]
    F --> G["Update entry point if needed"]
```

In practice, build time mostly depends on:

  1. frontier width during construction (ef_construction),
  2. graph degree (M),
  3. per-step distance and pruning cost.

For that reason, quality, fast, and instant are explicit operating modes with measurable trade-offs, not presets in name only.
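These three knobs compose multiplicatively. A back-of-envelope cost model shows why profile choice dominates build time; the ef_construction/M values and the throughput figure below are illustrative assumptions, not Vajra's actual profile settings.

```rust
/// Back-of-envelope model of HNSW construction cost at layer 0:
/// each insert expands ~ef_construction frontier nodes, scores up to
/// M neighbors per expansion, and each score is a d-dimensional
/// distance. Constants are illustrative, not Vajra's real profiles.
fn flops_per_insert(ef_construction: u64, m: u64, dim: u64) -> u64 {
    // ~2 flops per dimension for a dot product / L2 term
    ef_construction * m * dim * 2
}

fn main() {
    let dim = 384; // all-MiniLM-L6-v2 embedding width
    for (name, ef_c, m) in [("quality", 200, 32), ("fast", 100, 16), ("instant", 40, 8)] {
        let flops = flops_per_insert(ef_c, m, dim) as f64;
        // 50k vectors, assuming a nominal 5 GFLOP/s of effective scalar throughput
        let secs = 50_000.0 * flops / 5e9;
        println!("{name}: ~{flops:.0} flops/insert, ~{secs:.1}s of distance math at 50k");
    }
}
```

Even this crude model reproduces the order-of-magnitude spread between the profiles: shrinking the frontier and the degree shrinks build cost roughly linearly in each.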

What Changed in v0.2.1 (Detailed)

The two implementation plans were:

  1. rust_index_build_improvement.md
  2. build_improvement_plan_claude.md

Most changes were concentrated in the HNSW construction path:

| Area | Change | Mechanism |
|---|---|---|
| Beam search | scratch reuse (`BeamSearchScratch`) | avoids repeated heap/set allocation in insertion-time search |
| Neighbor selection | bounded candidate windows + scratch output buffers | reduces pruning work and temporary Vec churn |
| Distance kernels | unrolled `dot_f32` / `l2_sq_f32` | better compiler vectorization behavior |
| Build scheduling | snapshot planning in micro-batches/epochs | parallelizes planning work while preserving deterministic commit |
| Profile tuning | non-heuristic construction caps in fast path | lowers unnecessary construction beam cost |
| Build config | `target-cpu=native` in local benchmark builds | ensures SIMD-relevant codegen on host CPU |
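The "Distance kernels" row deserves a concrete illustration. A four-accumulator unrolled dot product in the spirit of `dot_f32` (a sketch of the technique, not the actual kernel in distance.rs) looks like this:

```rust
/// Four-accumulator dot product: independent partial sums break the
/// serial dependency chain and give the compiler and CPU room to
/// vectorize and pipeline. A sketch of the idea behind `dot_f32`,
/// not the exact kernel shipped in distance.rs.
pub fn dot_f32_unrolled(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let chunks = a.len() / 4;
    let (mut s0, mut s1, mut s2, mut s3) = (0.0f32, 0.0, 0.0, 0.0);
    for i in 0..chunks {
        let j = i * 4;
        s0 += a[j] * b[j];
        s1 += a[j + 1] * b[j + 1];
        s2 += a[j + 2] * b[j + 2];
        s3 += a[j + 3] * b[j + 3];
    }
    // Scalar tail for lengths not divisible by 4.
    let mut tail = 0.0f32;
    for j in chunks * 4..a.len() {
        tail += a[j] * b[j];
    }
    (s0 + s1) + (s2 + s3) + tail
}

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0, 5.0];
    let b = [5.0f32, 4.0, 3.0, 2.0, 1.0];
    println!("{}", dot_f32_unrolled(&a, &b)); // prints 35
}
```

With `target-cpu=native`, this shape of loop is exactly what the compiler can turn into SIMD lanes on the host CPU.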

Three small examples capture the spirit of the work.

1) Reuse search scratch instead of allocating every call

Before, beam-search internals rebuilt heap/visited structures repeatedly during insertion-time searches.
Now, the buffers are reused:

```rust
let candidates = beam_search_fast_with_scratch(
    &self.graph,
    vector,
    ep,
    ef_c,
    lc,
    &self.metric,
    &mut self.build_scratch.beam,
);
```

This reduces allocator pressure where the same operation runs thousands of times per batch.
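The underlying pattern is simple enough to show in isolation. Below is a reduced, hypothetical scratch type illustrating the technique, not the real `BeamSearchScratch`:

```rust
use std::collections::HashSet;

/// The reuse pattern behind BeamSearchScratch, reduced to its core:
/// hold the working buffers in a long-lived struct and clear() them
/// per call, so repeated searches stop hitting the allocator.
/// (Hypothetical reduced type, not the actual Vajra struct.)
#[derive(Default)]
struct SearchScratch {
    visited: HashSet<usize>,
    candidates: Vec<(f32, usize)>,
}

impl SearchScratch {
    /// clear() keeps each container's capacity, so after warm-up
    /// the search performs zero heap allocations per call.
    fn reset(&mut self) {
        self.visited.clear();
        self.candidates.clear();
    }
}

/// Stand-in for an insertion-time search: deduplicates a neighbor
/// list into the reusable candidate buffer and reports its size.
fn search_with_scratch(neighbors: &[usize], scratch: &mut SearchScratch) -> usize {
    scratch.reset();
    for &n in neighbors {
        if scratch.visited.insert(n) {
            scratch.candidates.push((n as f32, n));
        }
    }
    scratch.candidates.len()
}

fn main() {
    let mut scratch = SearchScratch::default();
    // Thousands of insertion-time searches reuse the same buffers.
    for _ in 0..3 {
        let unique = search_with_scratch(&[1, 2, 2, 3], &mut scratch);
        println!("{unique}"); // prints 3 each call
    }
}
```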

2) Keep neighbor selection outputs in reusable buffers

Instead of allocating fresh output vectors for every prune/select call, the code writes into preallocated buffers:

```rust
select_neighbors_heuristic_into(
    &self.graph,
    &self.metric,
    candidates,
    m,
    &mut self.build_scratch.heuristic_selected,
    &mut self.build_scratch.heuristic_out,
);
```

The algorithm is unchanged. The overhead around it is smaller.

3) Make planning parallel, keep commit deterministic

Batch insertion now performs snapshot planning in parallel, then commits in order:

```rust
let plans: Vec<PlannedInsertion> = (chunk_start..chunk_end)
    .into_par_iter()
    .map(|i| plan_insertion_on_snapshot(...))
    .collect();
```

This gives parallel speedup without giving up deterministic graph mutation order.
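The same pattern can be sketched with only the standard library (the real code uses rayon's `into_par_iter`; here `std::thread::scope` stands in, and `plan` is a toy stand-in for `plan_insertion_on_snapshot`): planning is read-only against an immutable snapshot, so it fans out freely, while commits stay in index order.

```rust
use std::thread;

/// Toy stand-in for plan_insertion_on_snapshot: read-only work
/// against an immutable snapshot of the graph state.
fn plan(snapshot: &[u32], i: usize) -> u32 {
    snapshot[i] * 10
}

/// Plan in parallel, return plans in index order. Each worker writes
/// into a disjoint &mut slot, so no locking is needed, and the caller
/// commits the plans sequentially for a deterministic mutation order.
fn plan_parallel(snapshot: &[u32]) -> Vec<u32> {
    let mut plans = vec![0u32; snapshot.len()];
    thread::scope(|s| {
        for (i, slot) in plans.iter_mut().enumerate() {
            s.spawn(move || *slot = plan(snapshot, i));
        }
    });
    plans
}

fn main() {
    // Commit order follows index order, regardless of thread timing.
    let graph: Vec<u32> = plan_parallel(&[1, 2, 3, 4]);
    println!("{:?}", graph); // prints [10, 20, 30, 40] every run
}
```

Spawning one thread per item is only for illustration; the real batching groups work into micro-batches so thread overhead stays amortized.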

Why these changes target build time

At 50k vectors, construction cost is dominated by repeated execution of:

  1. layer-0 beam expansion (ef_construction frontier maintenance),
  2. distance evaluation over candidate neighbors,
  3. bidirectional edge wiring + overfull-node pruning.

The implemented changes do not change HNSW semantics. They reduce overhead around the same operations.

Implementation note: what did not change

The public Python API remained stable (NativeHNSWIndex, VajraVectorSearch, hybrid APIs).
The work focused on internals in crates/vajra-hnsw, with PyO3 bindings unchanged at the usage level.

50k build-time change from this pass:

| Profile | Before (s) | After (s) | Gain |
|---|---|---|---|
| quality | 78.62 | 52.17 | 33.6% |
| fast | 24.46 | 18.44 | 24.6% |
| instant | 5.94 | 4.22 | 29.0% |

Build before vs after (50k)

These changes materially reduce build time, but build latency remains the main gap versus ZVec.

ZVec vs Vajra Rust on Wikipedia Slices

Benchmarks were run on 1k, 10k, 20k, and 50k slices with fixed embedding and metric settings.

At 50k:

| Engine/Profile | Build (s) | p50 (ms) | QPS | Recall@10 |
|---|---|---|---|---|
| ZVec | 2.628 | 0.796 | 1251.4 | 0.999 |
| Vajra quality | 53.075 | 0.205 | 4736.8 | 0.998 |
| Vajra fast | 18.308 | 0.170 | 5721.9 | 0.912 |
| Vajra instant | 4.186 | 0.071 | 12938.8 | 0.671 |

Key observations from this table:

  1. ZVec is still much faster to build.
  2. Vajra is faster on measured query latency and throughput in this setup.
  3. Recall follows profile aggressiveness, exactly as expected.

Scaling plots:

Build scaling

Latency scaling

Recall scaling

Lexical, Vector, and Hybrid from One Python API

The package exposes BM25 lexical, vector ANN, and hybrid retrieval from the same Python surface.

At 50k documents (deterministic mode benchmark):

| Mode | Build (s) | p50 (ms) | QPS |
|---|---|---|---|
| lexical | 0.241 | 0.499 | 1122.1 |
| vector | 1.947 | 0.011 | 87301.0 |
| hybrid | 2.190 | 0.585 | 986.0 |

Python modes at 50k

This benchmark is intentionally narrow and controlled. For cross-engine comparisons, the Wikipedia + MiniLM setup is more representative.

Build Profile Semantics

The three profiles are intended as operational modes:

  • quality: maximize recall stability at higher construction cost.
  • fast: reduce construction cost while retaining useful recall.
  • instant: minimize startup/build latency, accepting larger recall loss.

In production terms, this is a control-plane decision:
start quickly with instant or fast, then promote to quality when rebuild completes.
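That promotion story is easier to operate when the profiles are an explicit type rather than strings. A sketch of such a mapping follows; the m/ef_construction values are illustrative placeholders, not Vajra's shipped defaults.

```rust
/// Operating modes as an explicit config, mirroring the
/// quality/fast/instant semantics. The (m, ef_construction) values
/// are illustrative placeholders, not Vajra's actual defaults.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BuildProfile {
    Quality,
    Fast,
    Instant,
}

#[derive(Debug, PartialEq)]
struct HnswParams {
    m: usize,
    ef_construction: usize,
}

impl BuildProfile {
    fn params(self) -> HnswParams {
        match self {
            // Higher frontier + degree: slower build, recall near reference.
            BuildProfile::Quality => HnswParams { m: 32, ef_construction: 200 },
            // Middle ground between build cost and recall.
            BuildProfile::Fast => HnswParams { m: 16, ef_construction: 100 },
            // Minimal construction cost; recall loss is accepted.
            BuildProfile::Instant => HnswParams { m: 8, ef_construction: 40 },
        }
    }
}

fn main() {
    // Control-plane promotion: serve on Instant first, rebuild on Quality.
    for profile in [BuildProfile::Instant, BuildProfile::Quality] {
        println!("{:?} -> {:?}", profile, profile.params());
    }
}
```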

What the current numbers imply

For 50k vectors in the current setup:

  1. Build latency is still the principal gap versus ZVec.
  2. Query latency/QPS are already competitive for Vajra in these runs.
  3. Recall behavior matches design intent: quality near reference, fast middle, instant lowest.

The practical takeaway is straightforward: the next phase should continue to focus on construction-path parallelization and pruning efficiency, because that is where the largest remaining gap sits.

Reproducibility: Where the Wikipedia Data Comes From

The data process follows the same benchmark discipline used in vajra_bm25:

  1. Load Wikipedia corpora through ir_datasets (WikIR sets like wikir/en78k, with fallbacks).
  2. Normalize documents into stable JSONL snapshots.
  3. Generate and lock embeddings (all-MiniLM-L6-v2, 384d).
  4. Reuse fixed slices (1k, 10k, 20k, 50k) across all runs.

Useful references:

  1. vajra_bm25/benchmarks/download_wikipedia.py
  2. ir_benchmark_data
  3. zvec_vajra_benchmark
  4. vajra_search_engine/reproduction.md

Concluding Remarks

This phase does not close the full build-time gap, but it does establish a clear and useful baseline:

  • Python remains the interface layer.
  • Rust is the execution layer for HNSW build/search kernels.
  • Profile selection is an explicit control for build-latency/recall trade-offs.

That is a good place to continue from: a stable API, measurable profile trade-offs, and a construction path that is now much easier to optimize incrementally. I also spent time building out the test pipeline, so recall and other performance metrics can be measured reproducibly; that pipeline is now a repeatable asset.

References

  1. Vajra BM25 benchmark report
  2. Vajra BM25 architecture post (Dec 24, 2025)
  3. Vajra vector search v0.4.1 -> v0.5.0 post (Feb 22, 2026)
  4. Vajra Rust architecture post (Feb 23, 2026)
  5. ZVec repository
  6. Databricks glossary: Lambda Architecture
  7. Rust build-improvement plans