The factory floor has a dashboard problem. Not a "dashboards don't exist" problem—they're everywhere, meticulously crafted by BI teams, updated nightly via ETL jobs. The problem is that nobody uses them.

Quality engineers walking production lines don't carry laptops. They have phones and questions: "Why did Line A's defect rate spike?" Plant managers in weekly reviews don't navigate five-level dashboard hierarchies. They want to ask "Compare this week's OEE to last month" and get an answer, not a navigation exercise.

Traditional BI assumes users sit at desks, know which dashboard to consult, and can translate business questions into filter combinations. Reality is messier. When I built Praval Analytics, the architectural question wasn't "how do we build better dashboards?" It was "can we eliminate dashboards entirely?"

The answer: yes, but only if you rethink the entire architecture around conversation as the primary interface.

The Brittleness of Traditional BI

Before diving into the architecture, it's worth understanding why traditional BI fails in manufacturing environments.

Schema Fragility: Manufacturing databases evolve constantly. Suppliers add new material grades. Equipment gets upgraded. New defect types emerge. In traditional BI, each schema change cascades through ETL pipelines, breaks dashboard queries, and triggers emergency fixes by data teams.

Static Workflow Assumptions: BI tools assume questions follow predictable paths. "Start at Line Performance dashboard → drill into defect analysis → filter by shift → export to Excel." But real questions are messy: "Show me springback defects for door panels from Supplier B during night shifts after die changeovers." That's not a dashboard path. That's a sentence.

Maintenance Bottlenecks: Every business change requires specialized teams to update pipelines, rebuild transformations, and modify dashboards. Want to add a new KPI? Submit a ticket, wait two weeks, get a dashboard update. By then, the question has changed.

The fundamental mismatch: BI delivers static visualizations to users who think in conversations.

The DiRC Framework: Discover, Reason, Coordinate

I structured Praval Analytics around DiRC—Discover-Reason-Coordinate—to replace ETL with agent-driven intelligence.

Discover: Autonomous Schema Understanding

Instead of manually mapping source databases, AI agents autonomously explore them.

Traditional approach: Data engineers study source systems, document schemas, write transformation SQL, build data models. Weeks of manual work. When schemas change, rinse and repeat.

DiRC approach: Discovery agents scan source databases through PostgreSQL Foreign Data Wrappers, identify tables and columns, recognize semantic patterns (def_cnt → defect count, prs_ln_a → Press Line A), and build a conceptual model automatically.

When a new production system comes online, discovery agents detect it. When columns rename, agents update their semantic mappings. No human intervention.
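To make the pattern-recognition step concrete, here is a minimal Python sketch of abbreviation expansion. The glossary and function names are hypothetical; the real discovery agents combine this kind of lookup with LLM-backed inference and confirmation against the live schema.

```python
import re

# Hypothetical abbreviation glossary a discovery agent might accumulate;
# real mappings would be learned and verified against actual column data.
ABBREVIATIONS = {
    "def": "defect", "cnt": "count", "prs": "press",
    "ln": "line", "qty": "quantity", "mtl": "material",
}

def expand_column_name(raw: str) -> str:
    """Map a cryptic source column name to a human-readable concept."""
    tokens = re.split(r"[_\s]+", raw.lower())
    expanded = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(expanded)

print(expand_column_name("def_cnt"))   # defect count
print(expand_column_name("prs_ln_a"))  # press line a
```

A dictionary lookup alone obviously cannot resolve ambiguous abbreviations; in practice this is the cheap first pass before anything semantic.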

Why this matters architecturally: Schema changes no longer break the system. Agents adapt. The semantic layer (Cube.js) provides a stable API that agents query, while discovery agents keep the underlying mappings current.

Reason: Context-Aware Intelligence

Generic chatbots hallucinate when asked "What's our OEE?" Praval Analytics agents understand OEE—that it's Availability × Performance × Quality, that comparing Monday shifts requires accounting for weekend maintenance, that springback defects correlate with material grades.
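That OEE identity is worth pinning down, since everything downstream depends on it. A minimal sketch (inputs are fractions, not percentages):

```python
def oee(availability: float, performance: float, quality: float) -> float:
    """Canonical OEE: Availability x Performance x Quality, all in [0, 1]."""
    return availability * performance * quality

# e.g. 92% availability, 94% performance, 90% quality rate:
print(round(oee(0.92, 0.94, 0.90), 3))  # 0.778
```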

This requires domain-specialized agents:

Manufacturing Advisor Agent: Knows production processes, equipment relationships, quality standards. Translates "springback issues" into "search defect_type='springback' + correlate with material_grade + check die_condition".

Analytics Specialist Agent: Understands metrics definitions (OEE, first-pass yield, cycle time), knows which Cube.js cubes serve which questions, constructs optimized queries with appropriate measures and dimensions.

Quality Inspector Agent: Performs statistical process control analysis, identifies anomalies, suggests root causes based on pattern correlation (defect spike + material change + die wear = likely cause chain).

These aren't generic LLMs. They're specialized agents with manufacturing domain knowledge encoded in their system prompts and memory.

Coordinate: Event-Driven Collaboration

The architectural innovation: no central orchestrator.

When you ask "Why did defect rates spike on Line A?", here's what happens:

Agent Coordination Flow
Five agents collaborate through event-driven Spore messages without central orchestration

  1. Manufacturing Advisor receives question → enriches with domain context ("Line A" = 800T press, produces door panels) → broadcasts domain_enriched_request Spore

  2. Analytics Specialist receives Spore → queries Cube.js for defect trends, shift data, material correlations → broadcasts data_ready Spore

  3. Simultaneously (parallel execution):
     - Visualization Specialist receives data_ready → prepares time-series defect chart + shift comparison
     - Quality Inspector receives data_ready → performs anomaly detection → identifies springback defect correlation with Material Grade HC340LA from Supplier B

  4. Report Writer receives outputs from all agents → synthesizes into narrative: "Line A defect rates increased 23% due to springback issues on door outer left panels, correlated with HC340LA coils from Supplier B. Recommend reviewing coil certification and inspecting Die 002 for wear."

Total time: 3 seconds. Agents work in parallel, not sequentially.

Critical architectural point: This only works because agents communicate via Praval's Reef substrate—an event bus where Spores (structured messages) flow without central routing. If one agent fails, others continue functioning. No single point of failure.
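The coordination model can be illustrated with a toy in-memory stand-in for the Reef. This is not the actual Praval API — the class and topic names are invented for illustration — but it shows the two properties that matter: no central router, and a failing subscriber doesn't block the others.

```python
from collections import defaultdict
from typing import Callable

class ToyReef:
    """Toy stand-in for Praval's Reef substrate: topic-based Spore
    delivery with no central orchestrator (illustrative only)."""
    def __init__(self) -> None:
        self._subs: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable) -> None:
        self._subs[topic].append(handler)

    def broadcast(self, topic: str, spore: dict) -> None:
        # Every subscriber gets the Spore; one handler failing does
        # not stop delivery to the rest (graceful degradation).
        for handler in self._subs[topic]:
            try:
                handler(spore)
            except Exception:
                pass  # a real system would log and continue

reef = ToyReef()
results = []
reef.subscribe("data_ready", lambda s: results.append(("viz", s["line"])))
reef.subscribe("data_ready", lambda s: results.append(("quality", s["line"])))
reef.broadcast("data_ready", {"line": "Line_A"})
print(results)  # both agents received the same Spore independently
```

In production the same topology runs over RabbitMQ, so subscribers also get durability and true parallelism across processes.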

The Data Architecture: Bridging Legacy and Semantic Layers

Manufacturing companies don't greenfield their data infrastructure. You inherit decades of accumulated systems: ERP databases, SCADA historians, MES platforms, quality management systems. Each with different schemas, different naming conventions, different update frequencies.

Data Architecture
Foreign Data Wrappers connect source systems to unified warehouse, dbt transforms to analytics-ready marts, Cube.js provides semantic API

Layer 1: Source Databases (The Reality)

In the Praval Analytics demo, I simulated this with four PostgreSQL databases:
- Press Line A DB: Door outer panel production (800T press)
- Press Line B DB: Bonnet outer panel production (1200T press)
- Die Management DB: Die changeover events, condition assessments
- Material Tracking DB: 126 coils from 3 suppliers

Real manufacturing environments have dozens of source systems. The architecture must handle heterogeneity.

Layer 2: Data Warehouse with Foreign Data Wrappers

Architectural decision: Use PostgreSQL Foreign Data Wrappers instead of traditional ETL.

Why? Foreign Data Wrappers let you query remote databases as if they were local tables. No data duplication. No complex ETL orchestration. No staleness from batch updates.

-- Column list must mirror the remote table (names here are illustrative)
CREATE FOREIGN TABLE press_line_a_production (
  part_id bigint, defect_type text, produced_at timestamptz
)
SERVER press_line_a_fdw
OPTIONS (schema_name 'public', table_name 'production');

Now agents can query press_line_a_production directly. When source data changes, queries see fresh data instantly. No ETL lag.

Tradeoff: Query performance depends on source database responsiveness. For real-time dashboards, this could be problematic. But for conversational analytics where 3-second response time is acceptable, it works.

Layer 3: dbt Transformation Layer

Raw source data isn't analysis-ready. It needs cleaning, joining, aggregating. dbt (data build tool) handles this as version-controlled SQL transformations:

4 staging models: Clean source data (handle nulls, standardize formats)
2 intermediate models: Business logic (calculate OEE, join dimensions)
3 mart models: Analytics-ready fact tables optimized for specific query patterns

Why dbt instead of stored procedures? Version control. Testability. Documentation as code. When transformations change, you see exactly what changed in git diff. Tests run automatically. Documentation generates from model definitions.

Agents don't query raw source data. They query marts. This insulates them from schema changes in source systems.

Layer 4: Cube.js Semantic Layer

This is the architectural keystone. Cube.js sits between agents and data, providing a consistent API regardless of underlying schema changes.

Three cubes:

PressOperations: Production-level grain (one row per part produced)
- Measures: OEE, defect counts, costs, cycle time, tonnage
- Dimensions: Part family, press line, die, material, shift, operator, defect type

PartFamilyPerformance: Aggregated by part type
- Measures: First-pass yield, rework rate, total costs
- Dimensions: Part family, material grade

PressLineUtilization: Aggregated by line
- Measures: Overall OEE, shift productivity
- Dimensions: Press line, shift

Agents query Cube.js using semantic names ("weekly defect rate trends"). Cube.js translates to optimized SQL, handles joins, manages pre-aggregations for performance.

Why agents need this: Without a semantic layer, agents would have to know table structures, join keys, aggregation logic. Every schema change would require updating agent prompts. Cube.js decouples agents from database details.
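As a sketch of what the Analytics Specialist's query step might look like from Python: the cube, measure, and dimension names below are the ones this article assumes, and the REST path is Cube.js's standard load endpoint. Treat the whole thing as illustrative, not as Praval's actual agent code.

```python
import json
import urllib.request

def build_defect_query(part_prefix: str, days: int = 7) -> dict:
    """Construct a Cube.js query for defect trends on one part family.
    Cube/measure names follow this article's schema (assumptions)."""
    return {
        "measures": ["PressOperations.defectRate", "PressOperations.defectCount"],
        "dimensions": ["PressOperations.materialGrade"],
        "filters": [{
            "member": "PressOperations.partFamily",
            "operator": "startsWith",
            "values": [part_prefix],
        }],
        "timeDimensions": [{
            "dimension": "PressOperations.productionDate",
            "granularity": "day",
            "dateRange": f"last {days} days",
        }],
    }

def run_query(api_url: str, token: str, query: dict) -> dict:
    """POST the query to Cube.js's REST load endpoint."""
    req = urllib.request.Request(
        f"{api_url}/cubejs-api/v1/load",
        data=json.dumps({"query": query}).encode(),
        headers={"Content-Type": "application/json", "Authorization": token},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

query = build_defect_query("Door_Outer")
print(query["timeDimensions"][0]["dateRange"])  # last 7 days
```

The point is that the agent constructs semantic names, never table joins — the SQL is entirely Cube.js's problem.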

The Five-Agent Architecture

Each agent has a specialized role. This isn't arbitrary—specialization emerged from production failures in early versions.

Manufacturing Advisor: The Domain Expert

Early mistake: I had a generic "input processor" agent that blindly forwarded user questions to the analytics agent. Results were terrible. "Show me springback issues" got treated like a generic text search.

Solution: Manufacturing Advisor agent with domain knowledge.

System prompt includes:
- Equipment hierarchies (Line A = 800T press, produces doors)
- Defect types (springback, wrinkle, splits, scratches)
- Material grades (HC340LA, DC06, SPCC)
- Manufacturing relationships (defect patterns correlate with material + die condition)

When user asks "springback issues on Line A", Manufacturing Advisor:
1. Recognizes "springback" as a sheet metal defect caused by elastic recovery
2. Knows Line A produces door outer panels (more susceptible to springback than bonnets)
3. Enriches query: "Search defect_type='springback' for part_family='Door_Outer*' from press_line='Line_A', correlate with material_grade and die_condition"
4. Broadcasts enriched request as Spore

This agent doesn't retrieve data. It translates human language to manufacturing concepts.
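A stripped-down sketch of that enrichment step, with hypothetical lookup tables standing in for the domain knowledge that actually lives in the agent's system prompt and memory:

```python
# Hypothetical domain tables; the real agent's knowledge is prompt-encoded.
EQUIPMENT = {
    "Line_A": {"press": "800T", "part_family": "Door_Outer"},
    "Line_B": {"press": "1200T", "part_family": "Bonnet_Outer"},
}
DEFECT_CORRELATES = {"springback": ["material_grade", "die_condition"]}

def enrich(question: str) -> dict:
    """Turn a raw question into a domain_enriched_request Spore payload."""
    context: dict = {}
    q = question.lower()
    for line, info in EQUIPMENT.items():
        if line.replace("_", " ").lower() in q:
            context.update({"press_line": line, **info})
    for defect, correlates in DEFECT_CORRELATES.items():
        if defect in q:
            context["defect_type"] = defect
            context["correlate_with"] = correlates
    return {"type": "domain_enriched_request",
            "original_question": question, "context": context}

spore = enrich("Show me springback issues on Line A")
print(spore["context"]["press_line"], spore["context"]["correlate_with"])
```

In the real agent the table lookups and the LLM complement each other: the lookups ground the entities, the LLM handles phrasing the lookups can't anticipate.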

Analytics Specialist: The Query Translator

Receives domain-enriched requests from Manufacturing Advisor. Has deep knowledge of:
- Cube.js schema (which cubes, which measures, which dimensions)
- Query optimization (pre-aggregations, time-series patterns)
- When to use PressOperations (detail-level queries) vs PartFamilyPerformance (aggregated)

Constructs Cube.js queries:

{
  measures: ['PressOperations.defectRate', 'PressOperations.defectCount'],
  dimensions: ['PressOperations.defectType', 'PressOperations.materialGrade'],
  filters: [
    { member: 'PressOperations.defectType', operator: 'equals', values: ['springback'] },
    { member: 'PressOperations.partFamily', operator: 'startsWith', values: ['Door_Outer'] }
  ],
  timeDimensions: [
    { dimension: 'PressOperations.productionDate', granularity: 'day', dateRange: 'last 7 days' }
  ]
}

Executes query, receives results, broadcasts data_ready Spore.

Visualization Specialist: The Chart Selector

Receives: Structured data from Analytics Specialist
Decides: What chart type communicates insights best

  • Defect trends over time → line chart
  • Press line comparison → side-by-side bar chart
  • Shift productivity → stacked area chart showing contribution
  • Material grade correlation → scatter plot with regression

Architectural constraint: Mobile-first design. Charts must work on factory floor phones, not just desktop monitors. This means simplified layouts, large touch targets, minimal clutter.

Outputs chart specification (Recharts format for React frontend).
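The selection rules above can be approximated as a simple decision function. The field names are hypothetical and the real agent weighs more signals (data volume, screen size, prior user preferences), so this is a sketch of the shape of the logic, not the logic itself.

```python
def select_chart(spec: dict) -> str:
    """Rule-of-thumb chart selection mirroring the bullets above."""
    if spec.get("time_dimension"):
        # trends over time; stacked area when contribution matters
        return "stacked_area" if spec.get("show_contribution") else "line"
    if spec.get("compare_categories"):
        return "bar"            # e.g. press line comparison
    if spec.get("correlation"):
        return "scatter_with_regression"  # e.g. material grade vs defects
    return "table"              # safe fallback

print(select_chart({"time_dimension": "productionDate"}))
print(select_chart({"correlation": ("material_grade", "defect_rate")}))
```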

Quality Inspector: The Anomaly Detective

Runs in parallel with Visualization Specialist (this is why event-driven matters—parallel execution).

Receives: Same data as Visualization Specialist
Performs:
- Statistical process control analysis (detect outliers beyond 3σ)
- Pattern correlation (do defect spikes correlate with material changes? die wear? shift changes?)
- Root cause hypotheses based on domain rules

For example, detecting:
- Springback defects + Material Grade HC340LA from Supplier B → supplier material quality issue
- Defect spike after die changeover → setup/alignment problem
- Gradual quality degradation → die wear

Output: List of anomalies with suggested root causes, ranked by confidence.
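A minimal version of the 3σ check, computing control limits from an in-control baseline and flagging new points beyond them. Real SPC individuals charts typically estimate limits from moving ranges, so treat this as a sketch of the idea rather than the Quality Inspector's actual implementation.

```python
from statistics import mean, stdev

def spc_outliers(baseline: list[float], new_points: list[float],
                 sigma: float = 3.0) -> list[float]:
    """Limits from an in-control baseline; flag new points beyond them."""
    mu, s = mean(baseline), stdev(baseline)
    lo, hi = mu - sigma * s, mu + sigma * s
    return [x for x in new_points if not (lo <= x <= hi)]

# Hypothetical daily defect rates (%) for an in-control period:
baseline = [2.1, 2.3, 1.9, 2.2, 2.0, 2.1, 2.2, 1.8]
print(spc_outliers(baseline, [2.0, 2.4, 8.7]))  # [8.7]
```

Computing limits from a known-good baseline matters: if you include the spike itself when estimating σ, the inflated spread can mask the very anomaly you're hunting.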

Report Writer: The Synthesizer

Receives: Outputs from all agents (chart spec, data, anomalies, root causes)
Produces: Narrative explanation in plain language

Knows:
- Engineers want root causes, not just symptoms
- Mobile consumption requires concise prose (not verbose paragraphs)
- Recommendations must be specific and actionable
- Follow-up questions drive continuous exploration

Example output:

"Line A defect rates increased 23% yesterday due to springback issues affecting door outer left panels. Analysis shows strong correlation with Material Grade HC340LA from Supplier B. Quality Inspector detected this pattern across 8 production runs. Recommend: (1) Review coil certification for affected batches, (2) Inspect Die 002 for wear, (3) Consider die maintenance if issue persists."

Follow-up questions:
- "Show me all Supplier B material defect history"
- "Compare Die 002 condition to other dies"
- "What's the cost impact of this defect spike?"

Architectural benefit: Report Writer is the only agent that talks to users. All others communicate via Spores. This separation makes it easy to swap Report Writer implementations (e.g., different verbosity levels, different languages) without touching other agents.

Why No Orchestrator? The Case for Event-Driven

Traditional architectures would use an orchestrator: a central service that calls agents sequentially, waits for responses, coordinates flow.

I explicitly avoided this. Here's why:

Single point of failure: If orchestrator crashes, entire system stops. With event-driven, individual agents can fail without cascading.

Sequential latency: Orchestrator calls Analytics Specialist → waits for data → calls Visualization Specialist → waits for chart → calls Quality Inspector → waits for analysis. Total: 8+ seconds. Event-driven: Analytics broadcasts data, Viz + Quality run in parallel. Total: 3 seconds.

Tight coupling: Orchestrator needs to know all agents, their APIs, their expected inputs/outputs. Adding a new agent requires updating orchestrator logic. Event-driven: new agent subscribes to relevant Spores. No central changes needed.

No graceful degradation: If Quality Inspector fails in orchestrated flow, whole response fails. In event-driven, Report Writer synthesizes from available outputs. Missing one agent's input reduces answer quality but doesn't crash the system.

The tradeoff: Debugging distributed event flows is harder than tracing orchestrated calls. When something goes wrong, you're hunting through RabbitMQ message logs across multiple agents. Worth it for the resilience and performance benefits, but not free.

The Real-Time Question Pipeline

Let's trace a complete user interaction architecturally:

User (phone, factory floor): "Why did Line A OEE drop yesterday?"

Frontend → Backend API (FastAPI):

POST /analytics/query
{
  "question": "Why did Line A OEE drop yesterday?",
  "conversation_id": "<uuid>"
}

Manufacturing Advisor Agent:
- Receives question via Reef subscription
- Domain enrichment: "Line A" = 800T press, OEE = Availability × Performance × Quality
- Broadcasts Spore:

{
  "type": "domain_enriched_request",
  "original_question": "Why did Line A OEE drop yesterday?",
  "context": {
    "press_line": "Line_A",
    "equipment": "800T press",
    "part_family": "Door outer panels",
    "metrics": ["OEE", "availability", "performance", "quality_rate"],
    "time_range": "yesterday"
  }
}

Analytics Specialist Agent:
- Receives domain_enriched_request Spore
- Queries Cube.js PressLineUtilization cube:
- Measures: OEE, availability, performance, quality_rate
- Filters: press_line='Line_A', date=yesterday
- Comparison: vs. 7-day average
- Receives data: OEE 78.3% vs avg 82.1% (-3.8 pts)
- Breakdown: availability 92% (↓2%), performance 94% (↓1%), quality 90% (↓0.8%)
- Broadcasts Spore:

{
  "type": "data_ready",
  "data": { ... },
  "insights": "OEE declined primarily due to availability drop (2% reduction)"
}

Simultaneously:

Visualization Specialist:
- Receives data_ready
- Selects chart type: time-series line chart (OEE trend) + bar chart (component breakdown)
- Generates chart spec for frontend

Quality Inspector:
- Receives data_ready
- Correlates availability drop with production events
- Finds: 3 die changeovers yesterday vs typical 1.5 per week
- Each changeover: ~30 min downtime
- Root cause hypothesis: Increased changeover frequency reduced available production time

Report Writer:
- Waits for all agents to respond (3-second timeout, uses whatever's available)
- Receives: data, chart spec, root cause analysis
- Synthesizes narrative:

"Line A OEE dropped to 78.3% yesterday, down 3.8 points from the 82.1% weekly average. The decline was primarily driven by reduced availability (92% vs typical 94%), caused by 3 die changeovers compared to the usual 1.5 per week. Each changeover resulted in approximately 30 minutes of downtime."

"Performance and quality rates remained relatively stable, indicating the issue is operational (changeover frequency) rather than equipment degradation or quality problems."

Recommendations:
- Review changeover schedule: Were 3 changeovers necessary or could they be consolidated?
- If changeovers were planned, consider optimizing changeover procedure to reduce downtime
- Monitor availability over next week to see if pattern continues

Follow-up questions:
- "Show me changeover frequency trends over the past month"
- "Compare Line A and Line B changeover downtime"
- "What's the typical changeover time for each die?"

Backend → Frontend: Returns JSON with narrative + chart specs
Frontend: Renders response with charts + follow-up question buttons

Total latency: 3 seconds from question to answer.
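The Report Writer's "use whatever's available" behavior can be sketched with asyncio. The agent stubs are hypothetical and the production system gathers Spores over RabbitMQ rather than awaiting coroutines, but the deadline semantics are the same: collect what arrives in time, degrade gracefully on the rest.

```python
import asyncio

async def gather_with_timeout(tasks: dict, timeout: float = 3.0) -> dict:
    """Collect whichever agent outputs arrive before the deadline;
    a missing output degrades the answer instead of failing it."""
    results: dict = {}

    async def capture(name, coro):
        results[name] = await coro

    pending = [asyncio.create_task(capture(n, c)) for n, c in tasks.items()]
    done, not_done = await asyncio.wait(pending, timeout=timeout)
    for t in not_done:
        t.cancel()  # give up on stragglers past the deadline
    return results

async def fast_agent():
    await asyncio.sleep(0.01)
    return "chart spec"

async def slow_agent():
    await asyncio.sleep(10)  # simulates a stalled Quality Inspector
    return "anomalies"

out = asyncio.run(gather_with_timeout(
    {"viz": fast_agent(), "quality": slow_agent()}, timeout=0.1))
print(out)  # only the fast agent's output made the cut
```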

The Technology Stack: Why These Choices?

Every technology decision was architectural, not arbitrary.

PostgreSQL for source databases: Manufacturing data is inherently relational. Equipment hierarchies, bill of materials, quality inspection workflows—these are relational concepts. NoSQL would fight the problem domain.

Foreign Data Wrappers over ETL: Real-time data access without duplication. ETL introduces lag and staleness. Manufacturing decisions happen on factory floors, not in nightly batch cycles.

dbt for transformations: Version-controlled SQL beats stored procedures for maintainability. Tests run automatically. Documentation generates from code. Changes are traceable.

Cube.js for semantic layer: Decouples agents from database schemas. Pre-aggregations provide sub-100ms query performance even on complex joins. REST API integrates easily with FastAPI backend.

Praval framework for agents: Built for identity-driven, memory-enabled, event-driven multi-agent systems. Reef substrate handles Spore routing. Memory APIs let agents learn from interactions. OpenTelemetry integration provides observability.

RabbitMQ for message bus: Reliable, durable message delivery. Agents can fail and restart without losing in-flight Spores. Scales horizontally (add more consumer agents for parallel processing).

FastAPI for backend: Async by default, perfect for event-driven architecture. Type hints provide auto-generated API docs. SSE support for real-time streaming to frontend.

Next.js for frontend: React with server-side rendering. TypeScript for type safety. API routes for simple backend-for-frontend pattern.

What I Learned Building This

1. Domain Knowledge Is the Unlock

Generic LLMs can't reason about manufacturing without context. Early versions produced plausible-sounding nonsense: "OEE declined due to reduced efficiency" (circular), "Consider optimizing production" (meaningless).

Manufacturing Advisor agent with domain knowledge transformed results. It knows springback is a sheet metal defect, that it correlates with material grade and tonnage, that it affects certain part geometries more than others.

Lesson: Multi-agent systems need at least one agent that deeply understands the problem domain. Generalist approaches fail.

2. Parallel Agent Execution Matters Enormously

Sequential processing (Advisor → Analytics → Viz → Quality → Report): 8+ seconds
Parallel processing (Analytics broadcasts, Viz + Quality run simultaneously): 3 seconds

For conversational interfaces, 3 seconds feels instant. 8 seconds feels broken. Users abandon queries.

Event-driven architecture enables parallelism naturally. Orchestrators impose sequencing.

3. Semantic Layers Prevent Agent Drift

Without Cube.js enforcing metric definitions, different agents calculated OEE differently:
- Manufacturing Advisor: "OEE from press_operations table"
- Analytics Specialist: "OEE = (output / target) * quality_rate"
- Quality Inspector: "OEE = availability * performance * quality"

Three different calculations → contradictory insights → user confusion.

Cube.js enforces one canonical definition. All agents query the same semantic API. Consistency guaranteed.

4. Users Want Root Causes, Not Just Data

Early versions returned: "Defect rate increased 23%." Factually correct. Utterly useless.

After adding Quality Inspector's root cause analysis: "Defect rate increased 23% due to springback issues correlated with Material Grade HC340LA from Supplier B."

Engagement tripled. Users care about why, not just what.

5. Event-Driven Coordination Is More Complex Than Orchestration

Debugging: "Why didn't Report Writer include quality analysis?"
- Check if Quality Inspector emitted analysis Spore
- Check if RabbitMQ delivered it
- Check if Report Writer subscribed to correct topic
- Check timeout values (did Report Writer time out before Quality Inspector responded?)

With orchestration, you'd just step through function calls.

Worth the complexity for resilience and parallelism. But not free.

What's Next

Proactive Anomaly Alerts: Quality Inspector currently runs reactively (user asks question → analysis runs). Next: continuous monitoring mode. Quality Inspector subscribes to production_data_updated events, scans for anomalies, broadcasts alerts proactively. Plant managers get notifications before they ask.

Multi-Turn Reasoning: Current architecture handles single-turn queries well. Complex questions ("Compare Line A to Line B, identify which part family drives the difference, then show material grade impact") require multi-turn agent coordination. Planning: first agent breaks question into subqueries, subsequent agents handle each sequentially.

Agent Explanation Mode: Users can't see why agents chose specific chart types or analysis approaches. Planning: each agent includes reasoning in Spores. Frontend exposes "Why this chart?" button that shows Visualization Specialist's decision logic.

Feedback Loop: Agents currently don't learn from user corrections. If user corrects an insight ("Actually that material grade is fine, we verified it"), agents should remember and adjust future analysis. Planning: store corrections in agent episodic memory, surface during similar future queries.

Extended Domain Support: Currently manufacturing-specific. The architecture (domain agent → semantic layer → specialized analysts → synthesis) applies to other domains: retail (sales analytics), healthcare (patient outcome analysis), finance (risk assessment). Planning: make Manufacturing Advisor pluggable, define domain-specific semantic layers.

Try It Yourself

Praval Analytics is open source and fully containerized.

The codebase demonstrates:
- Multi-agent event-driven coordination via Praval Reef
- Hybrid storage (PostgreSQL + Cube.js semantic layer)
- Domain-specialized agents with memory
- Production observability (structured logging, health checks)
- Real-world manufacturing data model (press lines, defects, materials)

If you're exploring conversational analytics, multi-agent architectures, or Praval framework for production systems, I'd love to hear what you build.

The future of business intelligence isn't better dashboards. It's eliminating dashboards entirely and replacing them with conversations that understand context, reason about causality, and deliver insights at the speed of questions.