Building Vajra 3D Shape Search: Text Queries over Indexed Point Clouds

I have been thinking about search as a primitive for a while. Most search systems start with documents. That is natural enough: documents already have words, sections, titles, and a structure that maps easily to lexical search. But a lot of the physical world does not arrive as prose. It arrives as shapes, parts, assemblies, scans, meshes, drawings, and point clouds.

If I have a repository of 3D objects, I do not want to browse it like a filesystem. I want to type something like hex bolt, donut shaped object, flat washer, chair with a back, or gear-like part, and get the matching shape records back. I may then want to inspect the result visually, rotate it, compare it to neighbors, or use it as a starting point for a CAD or PLM workflow. I have described this in the first person, but the use case is general: retrieving existing 3D objects is a far more everyday problem than generating new point clouds or meshes. It is understandable that many people want to use these models for generation, but for most companies, large or small, retrieval is the more important problem. Hence Vajra 3D Shape Search.

And that is an important distinction here. Vajra 3D is not a 3D generator. It does not take text and hallucinate a mesh. It searches a repository of already-indexed 3D shapes. The demo then renders the retrieved objects as point clouds so that the result is easy to inspect in the browser.

That distinction matters, because generation is an act of synthesis: you take a latent representation of the user's prompt and try to generate geometry from it. Search is an act of retrieval: you take text input and look for relevant objects in a repository of known 3D objects. For engineering workflows, retrieval is often the safer and more useful operation. If a vendor sends a part, a PLM system has a part number, and a geometry repository has known assets, the first question is not, "Can we invent a part?" The question is, "Can we find the right existing object?"

The Formulation Shift

The hard part is not drawing a shape. The hard part is retrieving the right indexed shape from language. Behind this simple requirement sit a number of interesting ideas about how to do text-based 3D object retrieval with a search system like Vajra.

To build Vajra 3D, I framed the problem this way:

Each object has a point cloud, which is a collection of (x,y,z) coordinates that together represent a 3D object.
Each object also has labels, aliases, descriptions, and metadata. This is crucial, and Vajra relies on these details.
A model maps point clouds and text into a shared embedding space. Side note: forming the core hypothesis around this model, and wiring the model into the demo I built for this post, was the key work done before writing this post.
Vajra indexes the object embeddings with HNSW.
Vajra indexes the metadata text with BM25.
Hybrid search combines the two rankings.
The frontend renders the selected result as a point cloud.

That gives us something very practical: a search box for a 3D object repository.

The user experiences this as a search field and a viewer. Underneath it, the system is doing both semantic retrieval and lexical retrieval over the same shape repository.
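As a minimal sketch of the kind of record this framing implies, the dataclass below holds the point cloud, the text fields, and the embedding side by side. The class name, field names, and the `lexical_text` helper are hypothetical illustrations, not Vajra's actual schema or API.

```python
from dataclasses import dataclass, field

Point = tuple[float, float, float]


@dataclass
class ShapeRecord:
    """One searchable object. Field names are illustrative, not Vajra's schema."""
    object_id: str
    label: str
    aliases: list[str] = field(default_factory=list)
    description: str = ""
    metadata: dict[str, str] = field(default_factory=dict)
    points: list[Point] = field(default_factory=list)     # (x, y, z) samples
    embedding: list[float] = field(default_factory=list)  # filled in by the model

    def lexical_text(self) -> str:
        """Flatten the text fields into one document for a BM25-style index."""
        parts = [self.label, *self.aliases, self.description, *self.metadata.values()]
        return " ".join(p for p in parts if p)


record = ShapeRecord(
    object_id="torus-0",
    label="torus",
    aliases=["ring", "donut shaped object"],
    description="A ring-shaped solid with a hole through the middle.",
    metadata={"family": "primitive"},
)
print(record.lexical_text())
```

The same `object_id` would key both the dense and the lexical index, which is what lets the two rankings be fused later.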

What The Demo Does

The public demo has two Vajra pages on rajeshrs.in:

Vajra Docs Search, which searches a documentation corpus.
Vajra 3D Shape Search, which searches indexed synthetic 3D shapes.

The 3D page is intentionally direct. You type a query, choose a mode, and inspect the results. The selected result is rendered as a point cloud with mouse-based rotation and zoom.

The three modes are:

Mode	What it searches	Why it is useful
`dense`	Text query embedding against point-cloud embeddings	Finds shapes by semantic similarity in the model space
`lexical`	Labels, aliases, descriptions, and metadata with BM25	Finds exact or near-exact terms like `hex bolt` or `washer`
`hybrid`	RRF fusion of dense and lexical ranks	Uses both signals, which is usually what a search UI should do

The visual point cloud is not the model output. It is the representation of the retrieved object. That is the right mental model for this demo: search first, visualization second.

Building The Model

For this first version, I wanted something small enough to train and reason about, but real enough to test the whole retrieval loop. The model is a compact text-to-point-cloud dual encoder. One side consumes text. The other side consumes point clouds. Both sides produce 128-dimensional embeddings in the same space.
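To make the shared space concrete, here is a small sketch of what dense retrieval reduces to once both encoders exist: compare the query vector with every shape vector by cosine similarity. The toy 3-dimensional vectors stand in for the 128-dimensional embeddings, the function names are hypothetical, and the exhaustive scan is only a stand-in for an approximate index such as HNSW.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dense_search(query_vec, shape_vecs, k=3):
    """Exhaustive nearest-neighbour search over all shape vectors."""
    scored = [(cosine(query_vec, vec), object_id) for object_id, vec in shape_vecs.items()]
    scored.sort(reverse=True)
    return [(object_id, score) for score, object_id in scored[:k]]


# Toy vectors: assumed outputs of the point-cloud encoder.
shape_vecs = {
    "torus-0": [0.9, 0.1, 0.0],
    "washer-0": [0.8, 0.3, 0.1],
    "hex-bolt-0": [0.0, 0.2, 0.9],
}
# Assumed output of the text encoder for a query such as "ring".
query_vec = [1.0, 0.2, 0.0]

results = dense_search(query_vec, shape_vecs, k=2)
for object_id, score in results:
    print(object_id, round(score, 3))
```

The search code never sees text or geometry, only vectors and IDs. All the cross-modal work happens in the encoders.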

The current model was trained on synthetic shape data. The catalog has 148 shape classes across primitive shapes, household-like objects, fasteners, and mechanical components. Each class has text aliases and descriptions. That matters because users do not all type the canonical label. Someone might type torus, ring, or donut shaped object, and the system should still have a chance of landing near the same class.

The point clouds are generated procedurally for this demo, and the same procedural data was used to train the model. A real-world fine-tune on the same architecture could involve a lot of data engineering in addition to synthetic data like I have used here. For training, the model consumes normalized point clouds. For display, the demo uses a denser point cloud so the browser rendering is easier to see. In the current demo setup:

Component	Current choice
Shape classes	148
Public demo corpus	296 searchable records
Embedding dimension	128
Model point input	1024 normalized points
Display point cloud	2048 points
Model status	Closed source at this time
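The sketch below shows one way a procedural class like a torus could yield both a 2048-point display cloud and a 1024-point normalized model input. The post does not say how the demo normalizes or subsamples, so the centroid-centering, unit-sphere scaling, random subsampling, and all function names here are assumptions.

```python
import math
import random


def sample_torus(n, major=1.0, minor=0.3, seed=0):
    """Sample n points on a torus surface.

    Angles are drawn uniformly, which is simple but not exactly uniform in area.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        u = rng.uniform(0.0, 2.0 * math.pi)  # angle around the main ring
        v = rng.uniform(0.0, 2.0 * math.pi)  # angle around the tube
        r = major + minor * math.cos(v)
        points.append((r * math.cos(u), r * math.sin(u), minor * math.sin(v)))
    return points


def normalize(points):
    """Center on the centroid and scale so the farthest point has distance 1."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    centered = [(x - cx, y - cy, z - cz) for x, y, z in points]
    scale = max(math.sqrt(x * x + y * y + z * z) for x, y, z in centered) or 1.0
    return [(x / scale, y / scale, z / scale) for x, y, z in centered]


display_cloud = sample_torus(2048, seed=0)           # denser cloud for the viewer
subsample = random.Random(1).sample(display_cloud, 1024)
model_input = normalize(subsample)                   # what the encoder would consume
print(len(display_cloud), len(model_input))
```

Normalizing before embedding removes absolute position and scale, so two washers of different sizes can land near each other in the embedding space.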

Which brings me to something important. The model details are deliberately not the star of the post. The core idea is the retrieval system around it. Still, the model is doing the key translation: it gives text and shape a shared coordinate system. Without that, text search over geometry would collapse back to keyword matching.

There is another design choice here that I think is important. The model artifact is not public at this time. It is baked into the backend container for the demo, not served as a downloadable file. The reason is simple: this is a showcase of the Vajra 3D retrieval path and the deployed demo, not a model release. A public model release would need a separate model card, dataset statement, evaluation suite, and license decision. I'm not ready with all this quite yet.

Building The Index

Once the shape repository exists, indexing is straightforward in concept.

For each shape record, the system builds two retrieval views:

A dense vector view from the point-cloud embedding.
A lexical document view from metadata text.

The dense view goes into Vajra's HNSW index. The lexical view goes into Vajra's BM25 engine. These are separate retrieval channels over the same object IDs.

This split is useful because 3D search has two kinds of user intent.

Sometimes the query is named and precise: hex bolt, flat washer, torus. BM25 is excellent at that. Sometimes the query is descriptive: round object with a hole, long cylindrical fastener, box-like object with a top surface. The embedding model has a better chance there.

Hybrid search is the practical compromise. It lets exact names help when they are present, but it does not require every useful query to match a phrase in the metadata.
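The two kinds of intent can be seen in a tiny BM25 sketch. This is a textbook Okapi BM25 scorer with the common defaults `k1=1.5` and `b=0.75`, written from scratch for illustration. It is not Vajra's BM25 engine, and the class name, tokenizer, and sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split hyphens, and split on whitespace."""
    return text.lower().replace("-", " ").split()


class TinyBM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.ids = list(docs)
        self.tfs = {i: Counter(tokenize(t)) for i, t in docs.items()}
        self.lengths = {i: sum(tf.values()) for i, tf in self.tfs.items()}
        self.avg_len = sum(self.lengths.values()) / len(docs)
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for tf in self.tfs.values() for term in tf)

    def score(self, query, doc_id):
        tf, n = self.tfs[doc_id], len(self.ids)
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            length_norm = 1 - self.b + self.b * self.lengths[doc_id] / self.avg_len
            total += idf * tf[term] * (self.k1 + 1) / (tf[term] + self.k1 * length_norm)
        return total

    def search(self, query, k=3):
        scored = sorted(((self.score(query, i), i) for i in self.ids), reverse=True)
        return [(i, s) for s, i in scored[:k] if s > 0]


docs = {
    "hex-bolt-0": "hex bolt hexagonal head bolt threaded fastener",
    "washer-0": "flat washer thin disc with a center hole",
    "torus-0": "torus ring donut shaped object",
}
index = TinyBM25(docs)
named = index.search("hex bolt")            # exact name: lexical search finds it
descriptive = index.search("circular loop")  # no shared terms: nothing comes back
print(named, descriptive)
```

The empty result for the descriptive query is the gap the embedding channel is meant to cover.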

How Vajra Works Here

Vajra started with lexical search and then grew into vector and hybrid retrieval. I have written about the earlier stages in a few posts:

The 3D demo uses the same retrieval philosophy, but the object being searched is different. Instead of document chunks, the result is a shape record.

At a high level:

BM25 scores how well the query terms match the object's text metadata.
HNSW searches nearby object vectors in the embedding space.
RRF combines the rank lists without pretending BM25 scores and vector distances are naturally calibrated.

That last point is easy to miss. BM25 scores and cosine similarities are not the same kind of number. Reciprocal Rank Fusion avoids forcing them into one artificial scale. It asks a more stable question: which objects are ranked highly by one or both retrieval systems?
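Reciprocal Rank Fusion itself is only a few lines. In this sketch each object receives `1 / (k + rank)` from every list it appears in, and the contributions are summed. The constant `k = 60` is the value commonly used in the RRF literature; whether Vajra uses the same constant is an assumption, and `rrf_fuse` is a hypothetical name.

```python
def rrf_fuse(rankings, k=60, top_n=10):
    """Fuse ranked lists of object IDs with Reciprocal Rank Fusion.

    Only ranks are used, so raw BM25 scores and cosine similarities
    never have to share a scale.
    """
    scores = {}
    for ranking in rankings:
        for rank, object_id in enumerate(ranking, start=1):
            scores[object_id] = scores.get(object_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


dense_ranking = ["torus-0", "washer-0", "gear-0"]
lexical_ranking = ["washer-0", "hex-bolt-0"]

fused = rrf_fuse([dense_ranking, lexical_ranking])
for object_id, score in fused:
    print(object_id, round(score, 5))
```

Here `washer-0` wins because it appears in both lists, even though it is not first in the dense ranking.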

This is why Vajra is a good fit for the 3D problem. The search engine does not need to know that an embedding came from a point cloud. It needs a vector, an ID, metadata, and a ranking strategy. The model handles the modality bridge. Vajra handles retrieval.

Deployment

The deployed demo is intentionally small. The site is static and hosted through Netlify. The Vajra backend runs in a Railway container. The same backend now serves both the documentation search API and the 3D search API.

The public 3D corpus is capped at 296 records: two deterministic variants for each of the 148 classes. That is enough to demonstrate the search mechanics without turning the container into a data warehouse. The object embeddings and display point clouds are precomputed, so the runtime does not need to generate training data or run training jobs.
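One way to get deterministic variants is to derive each variant's random seed from its class name and variant index, as sketched below. The seeding scheme, the placeholder class names, and the single jitter parameter are assumptions for illustration; the post does not describe how the demo generates its variants.

```python
import hashlib
import random


def variant_seed(class_name, variant):
    """Stable seed from class name and variant index.

    Python's built-in hash() is salted per process for strings,
    so a cryptographic digest is used instead.
    """
    digest = hashlib.sha256(f"{class_name}:{variant}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_corpus(class_names, variants_per_class=2):
    records = []
    for name in class_names:
        for v in range(variants_per_class):
            rng = random.Random(variant_seed(name, v))
            # A real generator would draw shape parameters here;
            # one jitter value stands in for them.
            scale_jitter = rng.uniform(0.9, 1.1)
            records.append((f"{name}-{v}", scale_jitter))
    return records


class_names = [f"class_{i:03d}" for i in range(148)]  # placeholder names
corpus = build_corpus(class_names)
print(len(corpus))  # 148 classes x 2 variants
```

Because every seed depends only on the class name and variant index, rebuilding the container reproduces the same records.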

This is also why I did not use Hugging Face for the model in this demo. Hugging Face is useful when the goal is model distribution. Here, the goal is to showcase a protected application behavior. The model is an implementation detail inside the backend container.

That does not make the model magically impossible to copy. Any deployed software has a threat model. But it does mean the demo does not publish the artifact, does not expose embeddings through the API, and does not provide a download path. The public surface is search results and capped preview point clouds.

What I Learned

The most useful lesson is similar to what I saw while building Vidai: the formulation matters more than the model size.

With Vidai, the key move was to stop asking the neural network to do arithmetic. The neural network only had to parse mathematical notation; exact computation could be delegated to symbolic code.

With Vajra 3D, the key move is to stop asking the model to be the whole search engine. The model only has to map text and shape into a useful shared space. Retrieval, fusion, metadata, serving, security, and visualization are separate engineering problems.

Perhaps these lessons showcase something vital about the purposeful use of AI: problem framing and searching the problem space help us figure out where to use specific, simpler-to-build capabilities. We don't need to throw big models and agents at all problems.

Problem	Common framing	Inversion of framing
Text to 3D	Generate a shape from text	Retrieve existing indexed shape records
3D model	Build the entire product around the model	Use the model only to produce embeddings, and build the product around retrieval over those embeddings
Ranking	Trust one score	Fuse lexical and dense rank lists

That separation gives the system room to grow. A better point-cloud model can replace the current one. A CAD ingestion pipeline can replace synthetic generation. More metadata can enrich BM25. Vajra's HNSW and hybrid retrieval path stays the spine of the application. These are some of the current and future targets for this work; more on that below.

What Comes Next

This version is a proof of the pipeline, not the end state.

One interesting next step is ingestion of real 3D assets. Someone who wants to use Vajra may bring STEP, IGES, STL, GLB, OBJ, or point-cloud files. The pipeline should normalize those assets, sample point clouds, attach PLM metadata, run the embedding model, and index the resulting records in Vajra.
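For mesh formats such as STL or OBJ, the point-cloud sampling step could look like the sketch below: pick triangles with probability proportional to their area, then pick a uniformly random point inside each one. This is a standard technique rather than anything the demo implements today. File parsing is left out, and the function names are hypothetical.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of a 3D triangle: half the length of the edge cross product."""
    ux, uy, uz = (b[i] - a[i] for i in range(3))
    vx, vy, vz = (c[i] - a[i] for i in range(3))
    cross = (uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx)
    return 0.5 * math.sqrt(sum(component * component for component in cross))


def sample_mesh(vertices, faces, n, seed=0):
    """Sample n surface points, area-weighted across triangles."""
    rng = random.Random(seed)
    triangles = [(vertices[i], vertices[j], vertices[k]) for i, j, k in faces]
    areas = [triangle_area(*t) for t in triangles]
    chosen = rng.choices(triangles, weights=areas, k=n)
    points = []
    for a, b, c in chosen:
        # The square root makes the barycentric sample uniform over the triangle.
        s = math.sqrt(rng.random())
        r2 = rng.random()
        wa, wb, wc = 1.0 - s, s * (1.0 - r2), s * r2
        points.append(tuple(wa * a[i] + wb * b[i] + wc * c[i] for i in range(3)))
    return points


# A unit square in the z = 0 plane, split into two triangles.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2), (0, 2, 3)]
cloud = sample_mesh(vertices, faces, 1024)
print(len(cloud))
```

Area weighting matters for CAD-derived meshes, where triangle sizes vary widely and sampling triangles uniformly would over-represent finely tessellated regions.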

That would move the system from synthetic demonstration to industrial retrieval.

There are several open directions:

CAD and B-rep aware ingestion for STEP and IGES.
Better point-cloud models, possibly distilled from stronger open 3D-language systems.
Larger shape corpora with meaningful intra-class variation.
Search evaluation against real user queries.
PLM-aware metadata ranking for part numbers, materials, suppliers, and assemblies.
Similarity search from a query point cloud, not just from text.

The current demo is deliberately smaller than all of that; it was built this weekend in a few hours of spare time. But it proves the important thing first: text can retrieve indexed 3D objects, and Vajra can sit underneath that retrieval loop. And there is a lot that is possible in the 3D space!

Try It

You can try the Vajra 3D search demo here:

Vajra 3D Shape Search

Note that there is also a regular search demo on this site for documentation. This demonstrates Vajra Search's basic capabilities (lexical, vector, hybrid search).

The Python package for the search engine is here:

vajra-search on PyPI

The 3D embedding model itself is closed source at this time. The public demo exposes the search behavior, not the model artifact.

If you want the background for the search engine side, these are the relevant earlier posts:

Vajra 3D is the same search philosophy pointed at a different kind of object. Documents were the first corpus. Shapes are the next one, and perhaps this bridges the bits-to-atoms gap somewhat. See you in my next post!