Research has a privacy problem that nobody talks about. Your ArXiv search history reveals strategic research directions. Your questions to AI assistants expose knowledge gaps. Your reading patterns show what you're building before you build it. And all of this data flows through cloud services that store, analyze, and potentially monetize your intellectual journey.

When I started building Praval Deep Research, I had a simple question: could we build a production-quality AI research assistant where all your data stays on your infrastructure? Not as a privacy theater exercise, but as a genuinely useful tool that treats privacy as a first-class architectural constraint?

The answer turned out to be yes—but the architectural decisions required were fascinating.

The Constraint: Local-First as Architecture

Local-first isn't just about running docker-compose up. It's a fundamentally different architectural stance that cascades through every design decision.

Cloud-based research tools optimize for scale and telemetry. They can instrument every interaction, build ML models from aggregate usage patterns, and spin up infrastructure elastically. Privacy comes as an afterthought—usually through compliance theater like "we encrypt data at rest" while still mining it for training data.

Local-first inverts this. Your infrastructure becomes the hard constraint. You get:
- Complete data sovereignty: Papers, embeddings, conversations, insights—all on your PostgreSQL, Qdrant, MinIO, Redis stack
- Zero telemetry by default: No usage tracking, no analytics, no phone home
- Offline-capable core: External calls only for ArXiv downloads and OpenAI API (for now—Ollama integration planned for fully air-gapped operation)

But you lose elastic scale, centralized monitoring, and the ability to learn from aggregate patterns across users. The architecture must accommodate these tradeoffs.

Multi-Agent Architecture: Why Six Specialized Agents?

The core architectural question was: monolithic pipeline vs. multi-agent orchestration?

A monolithic pipeline would be simpler: search → download → extract → embed → query → respond. Linear, predictable, easy to reason about. But it fails on several dimensions:

Lack of specialization: A generalist LLM handling paper discovery, semantic analysis, summarization, and Q&A does none of them excellently. Each task has different prompt requirements, different model strengths, different failure modes.

No learning: A stateless pipeline processes paper N exactly like paper 1. No memory of what you care about, no improvement over time, no personalization.

Brittleness: One component failure crashes the entire pipeline. If embedding generation hits an API rate limit, the whole system blocks.

Multi-agent architecture solves these problems through specialization and autonomy. I built six agents, each with distinct identity, memory, and purpose:

Agent Architecture Overview

[Figure: Praval Deep Research architecture. Six specialized agents coordinate through the Reef message substrate, each maintaining memory and learning from interactions.]

1. Paper Discovery Agent: Searches ArXiv with domain-specific ranking. Remembers which papers you selected historically to improve future relevance scoring. Uses recall memory to learn your research focus areas.

2. Document Processor Agent: Downloads PDFs, extracts text (handling equations, multi-column layouts, figures), chunks intelligently with 1000-char windows and 200-char overlap, generates embeddings. Optimizes chunking strategy based on paper type (theory-heavy vs. empirical).

3. Semantic Analyzer Agent: Identifies themes and connections across your entire knowledge base. Builds conceptual maps showing how papers relate. Uses episodic memory to track which connections you've explored.

4. Summarization Agent: Creates comprehensive syntheses of individual papers. Adapts verbosity based on your historical preferences (do you want dense technical summaries or high-level overviews?).

5. Q&A Specialist Agent: Answers questions using retrieved context, citing specific papers with relevance scores. Personalizes responses based on conversation history and your background (inferred from questions).

6. Research Advisor Agent: Provides strategic guidance, suggests unexplored areas, identifies research gaps. Proactively generates insights like trending topics, research area clustering, and recommended next steps.

These agents don't follow a fixed workflow. They communicate via Praval's Reef substrate—an event-driven message bus where agents broadcast and subscribe to "spores" (structured messages). This creates emergent behavior:

  • Document Processor completes → broadcasts paper_indexed spore
  • Semantic Analyzer receives paper_indexed → updates theme graph → broadcasts themes_updated
  • Research Advisor receives themes_updated → invalidates cached insights → regenerates proactive recommendations

No orchestrator. No central coordinator. Just autonomous agents reacting to events.
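
To make the pattern concrete, here is a minimal in-process sketch of spore-style coordination. The names Reef, Spore, subscribe, and broadcast are illustrative stand-ins, not Praval's actual API, and the real system routes spores through RabbitMQ rather than a Python dictionary.

# Hypothetical sketch of event-driven agent coordination.
# "Reef", "Spore", and this subscription API are illustrative,
# not the actual Praval interfaces.
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

@dataclass
class Spore:
    topic: str
    payload: dict

class Reef:
    """In-process stand-in for the message substrate."""
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[Spore], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Spore], None]) -> None:
        self._subscribers[topic].append(handler)

    def broadcast(self, spore: Spore) -> None:
        for handler in self._subscribers[spore.topic]:
            handler(spore)

reef = Reef()

def on_paper_indexed(spore: Spore) -> None:
    # Semantic Analyzer: update the theme graph, then announce it.
    print(f"analyzing themes for {spore.payload['paper_id']}")
    reef.broadcast(Spore("themes_updated", {"source": "semantic_analyzer"}))

def on_themes_updated(spore: Spore) -> None:
    # Research Advisor: invalidate cached insights, regenerate recommendations.
    print("invalidating cached insights, regenerating recommendations")

reef.subscribe("paper_indexed", on_paper_indexed)
reef.subscribe("themes_updated", on_themes_updated)

# Document Processor finishes a paper and broadcasts; the rest cascades.
reef.broadcast(Spore("paper_indexed", {"paper_id": "2401.00001"}))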

The Storage Strategy: Four Databases, Four Purposes

Local-first doesn't mean "dump everything in SQLite." Different data has different access patterns, different consistency requirements, different performance needs. I use four storage systems, each for architectural reasons:

[Figure: Storage architecture. Hybrid storage strategy: relational for conversations, vector for semantic search, object for PDFs, cache for insights.]

PostgreSQL (Relational Integrity): Stores conversations and messages with CASCADE delete integrity. When you delete a conversation, all associated messages vanish atomically. ACID guarantees ensure chat history never corrupts even if the system crashes mid-write.

Qdrant (Semantic Search): Stores 1536-dimensional OpenAI embeddings for paper chunks. Optimized for nearest-neighbor search across millions of vectors. Retrieval takes <100ms even with 10,000+ papers indexed.
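
A minimal retrieval sketch using the qdrant-client and openai libraries. The collection name, payload fields, and embedding model are assumptions; the project's actual names may differ.

# Sketch of a nearest-neighbor lookup against Qdrant.
# "paper_chunks", the payload fields, and the embedding model are
# assumptions, not the project's confirmed names.
from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()
qdrant = QdrantClient(host="localhost", port=6333)

def search_chunks(question: str, limit: int = 5):
    # 1536-dimensional embedding for the query, matching the indexed chunks.
    query_vector = openai_client.embeddings.create(
        model="text-embedding-3-small",  # assumption: any 1536-dim model
        input=question,
    ).data[0].embedding
    # Nearest-neighbor search over the chunk vectors.
    return qdrant.search(
        collection_name="paper_chunks",
        query_vector=query_vector,
        limit=limit,
    )

for hit in search_chunks("How does multi-head attention scale?"):
    print(hit.score, hit.payload.get("paper_title"))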

MinIO (Object Storage): Stores PDF files with S3-compatible API. Streaming proxy serves PDFs to browser without loading entire file into memory. Handles multi-GB knowledge bases efficiently.

Redis (Performance Cache): Caches research insights (trending topics, theme clusters) with 1-hour TTL. First generation takes 35 seconds (LLM analysis of entire knowledge base). Subsequent requests: instant. Cache invalidation happens automatically when papers are added/removed.

Why not just use PostgreSQL for everything? Access patterns differ fundamentally:

  • Conversations: relational queries with foreign keys
  • Embeddings: nearest-neighbor similarity search (not relational)
  • PDFs: streaming binary objects (not queryable)
  • Insights: ephemeral computed data with TTL (not durable)

Each database excels at its specific access pattern. PostgreSQL doing vector similarity search would be orders of magnitude slower than Qdrant. Qdrant storing relational conversation threads would be architecturally absurd.

The Processing Pipeline: Async, Event-Driven, Resilient

When you click "Index Selected Papers," here's what happens architecturally:

Frontend → API (FastAPI) → RabbitMQ → Document Processor Agent
                                              ↓
                                      (async processing)
                                              ↓
PDF download → Text extract → Chunk → Embed → Qdrant insert
      ↓             ↓            ↓        ↓         ↓
    MinIO        (PyPDF2)    (1000/200) (OpenAI) (vector DB)
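
The chunking step itself is simple in principle. Here is a minimal sketch of fixed character windows with overlap, using the 1000/200 parameters above; the real processor adapts its strategy per paper type.

# Fixed-size character windows with overlap, per the 1000/200 strategy above.
def chunk_text(text: str, window: int = 1000, overlap: int = 200) -> list[str]:
    step = window - overlap  # 800-char stride keeps 200 chars of shared context
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + window]
        if chunk.strip():
            chunks.append(chunk)
        if start + window >= len(text):
            break
    return chunks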

The critical architectural decision: RabbitMQ for async job processing.

Why not just process in-request? Papers take 30-60 seconds each to process. HTTP requests time out. Users close browser tabs. Synchronous processing fails in production.

RabbitMQ decouples request from execution. The API returns immediately with job ID. The Document Processor Agent consumes jobs from the queue at its own pace. Server-Sent Events (SSE) stream progress updates back to the frontend in real-time.

Benefits:
- Resilience: If the agent crashes mid-processing, RabbitMQ redelivers the job
- Scalability: Deploy multiple Document Processor agents to parallelize (currently single instance, but architecture supports horizontal scaling)
- Backpressure: Queue depth reveals system load; can throttle new submissions if processing lags
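
A sketch of that decoupling with pika. The queue name and payload shape are assumptions; the key point is that the API publishes a durable job and returns, while the agent acknowledges only after processing succeeds, which is what makes redelivery after a crash possible.

# Sketch of async job handoff via RabbitMQ (pika). Queue name and payload
# fields are assumptions, not the project's actual names.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="index_papers", durable=True)

# API side: enqueue the job and return immediately with its id.
job = {"job_id": "abc123", "arxiv_id": "2401.00001"}
channel.basic_publish(
    exchange="",
    routing_key="index_papers",
    body=json.dumps(job),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)

def process_paper(job: dict) -> None:
    # Placeholder for download → extract → chunk → embed → index.
    print(f"processing {job['arxiv_id']}")

# Agent side: ack only after the paper is fully processed, so a crash
# mid-processing leaves the job in the queue for redelivery.
def handle_job(ch, method, properties, body):
    process_paper(json.loads(body))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)  # backpressure: one job at a time
channel.basic_consume(queue="index_papers", on_message_callback=handle_job)
channel.start_consuming()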

The Chat Architecture: Persistent Conversations with LLM-Generated Titles

Most chat interfaces are stateless. Every question starts from scratch. Context comes from stuffing previous messages into prompts, burning tokens and losing history beyond context windows.

Praval Deep Research makes conversations durable and first-class:

PostgreSQL schema:

conversations:
  - id (UUID primary key)
  - title (generated by LLM)
  - created_at, updated_at

messages:
  - id (UUID primary key)
  - conversation_id (foreign key CASCADE)
  - role (user | assistant)
  - content (text)
  - sources (JSON array of cited papers)
  - created_at
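
Expressed as ORM models, the CASCADE relationship might look like the sketch below. This assumes SQLAlchemy, which the project may or may not use; the column names mirror the schema above, everything else is illustrative.

# Sketch of the schema above as SQLAlchemy models (assumption: SQLAlchemy).
import uuid
from datetime import datetime
from sqlalchemy import JSON, Column, DateTime, ForeignKey, String, Text
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Conversation(Base):
    __tablename__ = "conversations"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    title = Column(String, nullable=True)  # filled in later by the LLM
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
    # Deleting a conversation deletes its messages atomically.
    messages = relationship("Message", cascade="all, delete-orphan", backref="conversation")

class Message(Base):
    __tablename__ = "messages"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    conversation_id = Column(
        UUID(as_uuid=True),
        ForeignKey("conversations.id", ondelete="CASCADE"),
        nullable=False,
    )
    role = Column(String, nullable=False)  # "user" | "assistant"
    content = Column(Text, nullable=False)
    sources = Column(JSON, nullable=True)  # cited papers
    created_at = Column(DateTime, default=datetime.utcnow)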

When you ask a question:

  1. Q&A Specialist retrieves relevant paper chunks from Qdrant (semantic similarity search)
  2. Constructs prompt with retrieved context + conversation history (last N messages)
  3. LLM generates answer with source citations
  4. System saves user message + assistant response to PostgreSQL
  5. If the conversation lacks a title, the Research Advisor generates one using GPT-4o-mini (e.g., "Transformer attention mechanisms" instead of "Conversation 1")
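
A condensed sketch of steps 2, 3, and 5. The prompt wording, history window, and the use of gpt-4o-mini for answering are illustrative assumptions; only the gpt-4o-mini title generation is confirmed above.

# Sketch of prompt assembly and title generation. Prompt wording, history
# window size, and the answer model are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def answer_question(question: str, chunks: list[dict], history: list[dict]) -> str:
    # Step 2: retrieved context plus the last N conversation turns.
    context = "\n\n".join(
        f"[{c['paper_title']}] (score {c['score']:.2f})\n{c['text']}" for c in chunks
    )
    messages = [
        {"role": "system", "content": "Answer using the provided paper excerpts and cite them."},
        *history[-6:],
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    # Step 3: the LLM generates an answer grounded in the retrieved chunks.
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content

def generate_title(first_question: str) -> str:
    # Step 5: a short descriptive title instead of "Conversation 1".
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Give a 3-6 word title for a research chat that starts with: {first_question}"}],
    )
    return response.choices[0].message.content.strip()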

Architectural benefit: Conversations are durable data, not ephemeral prompt state. You can:
- Resume conversations weeks later with full context
- Delete conversations with CASCADE integrity (all messages vanish)
- Export conversations for sharing (with citations intact)
- Analyze conversation patterns to understand research focus

Contrast with stateless chat: every new session forgets you exist. No learning, no personalization, no continuity.

Proactive Insights: Cached Intelligence at 35-Second Latency

The Research Advisor agent does something unusual: it analyzes your entire knowledge base unprompted and generates strategic insights.

What it produces:
- Research area clustering: "Your papers cluster into 3 themes: transformer architectures (12 papers), attention mechanisms (8 papers), multimodal learning (7 papers)"
- Trending topics: Extracted keywords ranked by frequency across recent papers
- Research gaps: Areas mentioned frequently but not deeply explored
- Personalized next steps: Recommendations based on chat history and indexed papers

The architectural challenge: This analysis is expensive. Scanning 50 papers, extracting themes, clustering, gap detection—35 seconds with GPT-4o-mini. Unacceptable for interactive UI.

Solution: Redis caching with smart invalidation.

First request: Compute insights (35s) → Cache in Redis with 1-hour TTL → Return to user
Subsequent requests: Return cached insights (instant)
Cache invalidation: When papers are added/removed, invalidate cache → Next request recomputes

Users get instant insights after the first generation. The 1-hour TTL ensures freshness without constant recomputation. If you're actively adding papers, you'll see updated insights hourly.
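
A sketch of the cache-aside pattern with redis-py. The key name and compute_insights() are illustrative assumptions; the shape of the flow matches the description above.

# Cache-aside with a 1-hour TTL. Key name and compute_insights() are
# illustrative assumptions.
import json
import redis

r = redis.Redis(host="localhost", port=6379)
INSIGHTS_KEY = "research:insights"

def compute_insights() -> dict:
    # Placeholder for the Research Advisor's full knowledge-base analysis (~35s).
    return {"themes": [], "trending": [], "gaps": []}

def get_insights() -> dict:
    cached = r.get(INSIGHTS_KEY)
    if cached:
        return json.loads(cached)  # instant path
    insights = compute_insights()  # expensive LLM analysis
    r.setex(INSIGHTS_KEY, 3600, json.dumps(insights))  # 1-hour TTL
    return insights

def invalidate_insights() -> None:
    # Called whenever papers are added or removed.
    r.delete(INSIGHTS_KEY)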

The Frontend: React + TypeScript with Real-Time Updates

Building for local-first meant building for your infrastructure. No CDN. No edge caching. No serverless functions. Just Docker containers running on your machine.

The frontend architecture reflects this constraint:

Multi-stage Docker build:

Stage 1 (Node.js): npm install → npm run build → generate static assets
Stage 2 (Nginx): Copy static assets → Serve on port 3000

Why multi-stage? The final image ships only the compiled artifacts, not the entire Node.js toolchain. Smaller image, faster startup, less attack surface.

Real-time progress tracking: Server-Sent Events (SSE) stream processing updates from backend to frontend without polling. When Document Processor indexes a paper, it emits progress events that flow through the API to your browser in <100ms.
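
On the backend, SSE is just a streaming response with the text/event-stream media type. A minimal FastAPI sketch follows; the endpoint path and the fake progress ticks are assumptions, since the real events come from the Document Processor.

# Minimal SSE endpoint sketch. Path and progress source are illustrative.
import asyncio
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def progress_events(job_id: str):
    # In the real system these events come from the Document Processor;
    # here we just emit a few fake progress ticks.
    for pct in (10, 45, 80, 100):
        yield f"data: {json.dumps({'job_id': job_id, 'progress': pct})}\n\n"
        await asyncio.sleep(0.5)

@app.get("/jobs/{job_id}/events")
async def job_events(job_id: str):
    return StreamingResponse(progress_events(job_id), media_type="text/event-stream")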

PDF streaming: Clicking "View PDF" doesn't download the entire file. MinIO streams chunks through a FastAPI proxy. Browser renders as data arrives. Works even with 50MB papers.
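
The PDF proxy amounts to forwarding MinIO's object stream chunk by chunk. A sketch using the minio Python client; bucket name, credentials (MinIO defaults), and chunk size are assumptions, and production code would also close and release the connection when the stream ends.

# Sketch of the streaming PDF proxy. Bucket, credentials, and object
# naming are assumptions.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from minio import Minio

app = FastAPI()
minio_client = Minio("localhost:9000", access_key="minioadmin",
                     secret_key="minioadmin", secure=False)

@app.get("/papers/{paper_id}/pdf")
def stream_pdf(paper_id: str):
    obj = minio_client.get_object("papers", f"{paper_id}.pdf")
    # Forward 64 KB chunks as they arrive; the browser renders progressively.
    return StreamingResponse(obj.stream(64 * 1024), media_type="application/pdf")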

What I Learned Building This

1. Agent Identity Matters More Than I Expected

Early versions had generic agents: "processor", "analyzer", "responder". Naming them around identity—Paper Discovery, Research Advisor, Q&A Specialist—changed how I designed their behavior. An agent that "is" a Research Advisor proactively generates insights. An agent that "is" a Document Processor obsesses over text extraction edge cases.

Identity-driven design (core to Praval framework) made each agent better at its specialized role.

2. Local-First Forces Architectural Discipline

Cloud services let you be sloppy. Out of memory? Scale up. Slow query? Add a cache layer. Lock contention? Throw read replicas at it.

Local-first removes escape hatches. You have 8GB RAM and 4 cores. That's it. This forced me to:
- Optimize chunking strategy (1000/200 was empirically tuned, not guessed)
- Use Redis caching strategically (not everywhere)
- Pre-aggregate insights rather than computing on-demand
- Choose Qdrant over PostgreSQL pgvector for better performance characteristics

Constraint breeds clarity.

3. Embeddings Quality Dominates Everything Else

I spent weeks optimizing retrieval—hybrid search, reranking, query expansion. Impact: marginal.

Then I improved chunking strategy (semantic boundaries instead of fixed-length splits). Impact: transformative. Retrieval accuracy jumped 40%.

The lesson: embeddings are lossy compressions. If you compress garbage, retrieval finds better garbage. Fix the source.

4. Users Want Proactive Intelligence

I built proactive insights (trending topics, research gaps) as an afterthought. Expected usage: 5%. Actual usage: 40% of sessions engage with insights, a higher rate than Q&A itself.

People don't know what questions to ask. Showing them "here are the themes in your corpus" sparks exploration. Reactive Q&A is necessary. Proactive insights are transformative.

5. Multi-Agent Coordination Is Still Fragile

Event-driven architecture is elegant until it isn't. When agents communicate via async spores, debugging failures becomes archaeological:

"Why didn't Research Advisor update insights?"
→ Check if Semantic Analyzer emitted themes_updated spore
→ Check if RabbitMQ delivered the spore
→ Check if Research Advisor subscribed to correct topic
→ Check if cache invalidation logic triggered

Four layers of distributed debugging for what would be a single function call in monolithic code.

Worth it for resilience and scalability. But not free.

What's Next

Ollama Integration: Replace OpenAI API with local LLMs for fully offline operation. Architectural challenge: Ollama embeddings differ from OpenAI's (different dimensionality, different semantic spaces). Migration path needs careful design.

Multi-User Support: Current architecture assumes single user. PostgreSQL schema needs user_id foreign keys, Qdrant collections need per-user isolation, MinIO buckets need access control. Not technically hard, but changes security model significantly.

Citation Graph Visualization: Papers cite each other. This creates a graph. Visualizing this would expose intellectual lineages. Architectural challenge: extracting citations from PDFs reliably (many formats, inconsistent parsing).

Collaborative Annotations: Let users mark interesting passages, add notes. Store annotations in PostgreSQL tied to specific paper chunks. Surfaced during Q&A retrieval to provide personalized context.

Try It Yourself

Praval Deep Research is open source. If you're curious about local-first AI research tools or multi-agent architectures built with Praval, take a look at the codebase.

The codebase demonstrates production-grade multi-agent systems: health checks, structured logging, async job processing, event-driven coordination, hybrid storage strategies.

If you build something with it or have architecture questions, I'd love to hear from you. This is very much a living system—evolving as I learn what works and what doesn't in practice.