I was reading about Alexander Grothendieck recently, and there was an anecdote that despite being one of the premier mathematicians of his era, contributing a vast and influential body of work in algebra, he often struggled with arithmetic. While I am no Grothendieck, being neither a mathematician nor very algebraically accomplished in the mathematician's sense of the term, I can claim to have had the same kind of deficiency. Growing up, I was never the sharpest knife in the drawer with respect to arithmetic, much to the chagrin of my father and some relatives who basked in the "centum" glory often associated with kids of an older generation taught the three Rs: Reading, wRiting, and aRithmetic.

Now that I have gotten that out of the way, let me come to the purpose of this post: introducing Vidai, a neural engine for mathematics. Vidai (விடை) is Tamil for "answer," reminiscent of the middle school experiences that lead to sweaty palms for kids learning arithmetic, who also have to contend with the occasional test or exam demanding these answers. Fortunately for such kids, it is also a running joke that ChatGPT, of all things, cannot do mathematics. "Count the number of Rs in Strawberry" has long stumped ChatGPT, for instance, and routinely does the rounds whenever a new SOTA model is released. Overall, it looks like ChatGPT is not quite there with the three Rs (pun intended).

But why build such a neural engine now, you might ask me, and the answer to that lies in the specific triggers I had during the new year's break. On one of Machine Learning Street Talk's recent episodes (this one), I was watching Petar Veličković from Google DeepMind, among other things, discussing graph neural networks and category theory for neural networks. Petar casually pointed out a paradox that stuck with me: language models often get basic multiplications wrong even though, in the process of generating any given response, they perform millions of multiplications internally and quite routinely. A weird analogue indeed to how our own brains may be performing such mathematics routinely across vast neuronal connections, all the while getting basic arithmetic sums wrong! Nature works in strange ways. The irony of all this was not lost on me. Here are systems executing vast amounts of arithmetic correctly at the matrix level while failing at the basic, symbolic level. It got me thinking about what kind of model architecture would actually work for arithmetic.

Vidai is a purpose-built encoder-decoder model for mathematical expressions that sidesteps this problem entirely. I spent part of the new year's break trying to build a transformer model that could perform arithmetic operations. This post discusses how I went about this problem.

The Structure of Mathematical Expressions

Starting from an understanding of how transformers work, which has now become more common knowledge than it was a couple of years ago, I started poking at the problem by looking at the structure of mathematical expressions. When we evaluate a mathematical expression, we implicitly solve a graph-shaped problem: values sit at the leaves of an expression tree, operators apply unary, binary, and higher-arity operations to these values, and the result emerges at the root of the tree.

Conventional transformer LLMs don't explicitly learn this structure. They're trained to be large-scale pattern learners that solve a wider array of language understanding tasks from noisy internet-scale data. So I asked the question: since mathematical expressions, especially arithmetic, are likely to be less noisy and easier to parse, can we build a sequence-to-sequence model that learns the patterns in arithmetic expressions?

Don't Train What You Already Know

The operations of division, addition, multiplication, and so on, are not in and of themselves subject to change unless we invent new number systems and new kinds of mathematics. We just need the operations in the trees we infer from text to be computed using known rules. In other words, there is no point in trying to train addition using gradient descent. Doing something like that is not just wrong, it is beyond wrong, if there is such a thing. (Then again, SOTA LLMs often encode numbers as tokens, but let's not go there. I am tempted to say that this is not just stupid, but beyond stupid.)

During the time I built Vidai, I was also teaching my four-year-old son some simple sums, the kind toddlers learn to do. Watching him struggle, I noticed something: the bulk of the challenge kids have with arithmetic at that age is the parsing and the conceptual understanding of addition, subtraction, and other operations. What does it mean to "add" two things? Which number goes where? Would it matter which way the operation was done? Does the mathematics map somehow to something in the real world? The kids then learn the rules, but struggle at a different level: having learned the rules of mathematical operations, they struggle to interpret a given sum so that the numbers are "all lined up in a row" and the simple, comfortable, now-familiar work of addition or subtraction can be executed. So, parsing and structure understanding first, and computation second. This is perhaps the same split in function that neural networks need.

This observation led me to separate what neural networks are good at from what they are bad at:

  • Parsing requires understanding that (3 + 5) * 2 means addition inside the parentheses with the result becoming an operand of multiplication. This involves operator precedence, parenthesis matching, implicit conventions, and the various ways humans write the same mathematical idea. This is pattern recognition, and neural networks excel at it. The noise is in how we write expressions, and this noise is something LLMs can deal with in their current paradigm.

  • Computation, once you know the structure, is trivial. The rules of addition and multiplication are fixed, known, and implemented correctly in every programming language. No learning from data is required.

Vidai trains the model only on parsing tasks: converting expressions into prefix notation trees that make structure explicit. A separate module with zero learned parameters walks the resulting tree and executes operations using exact arithmetic.

Input: "2x² + 3xy - √(x+1)"
    ↓
VIDAI (learned parsing, 44M parameters)
    ↓
Prefix: "+ * 2 ** x 2 - * 3 * x y sqrt + x 1"
    ↓
SymPy (deterministic computation, 0 parameters)
    ↓
Structured: 2*x**2 + 3*x*y - sqrt(x + 1)

Ergo, the neural network never learns that 3 plus 5 equals 8. It learns that when humans write 3 + 5, they intend an addition operation with 3 and 5 as operands. Importantly, it learns to represent this as a tree of operations.

Why Parse Math When Calculators Exist?

When you type 2x + 3y into SymPy, it fails because 2x is not valid Python syntax and the multiplication must be communicated explicitly. Greek letters like θ² + φ require symbol declarations. Scanned equations from textbooks arrive with OCR artifacts, inconsistent spacing, and notation variants that no existing system handles gracefully.

Calculators and computer algebra systems are powerful at computation but brittle at interpretation. They demand that humans translate notation into rigid syntax before engaging with it. Vidai occupies the space between human notation and symbolic computation, accepting expressions written the way people actually write them.
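As a quick illustration of that brittleness (a minimal sketch using SymPy's standard sympify, not part of Vidai):

from sympy import sympify, SympifyError

try:
    sympify("2x + 3y")              # implicit multiplication is not valid syntax
except SympifyError as exc:
    print("SymPy rejects it:", exc)

print(sympify("2*x + 3*y"))         # explicit '*' parses fine: 2*x + 3*y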

The Architecture

Character-Level Input Encoding

The encoder uses character-level tokenization: each character maps directly to its code point in the range 0-255. No vocabulary, no subword tokenization, nothing. The string "x^2 + 3y" becomes the sequence [120, 94, 50, 32, 43, 32, 51, 121].

def encode_input(text: str, max_len: int = 256) -> list[int]:
    """Encode input as ASCII codes."""
    ids = [ord(c) if ord(c) < 256 else ord('?') for c in text]
    return (ids + [0] * max_len)[:max_len]  # Pad or truncate

This approach has a useful property: any symbol works without vocabulary changes. Unicode characters like √, θ, or ² just become their character codes. The model learns to recognize these patterns from training data rather than requiring explicit vocabulary entries.

Prefix Notation as Tree Representation

The key insight of the architecture: prefix notation (also known as Polish notation) makes tree structure explicit in a flat sequence. The expression (3 + 5) * 2 has this tree structure:

        *
       / \
      +   2
     / \
    3   5

In prefix notation, the operator comes before its operands: * + 3 5 2. Reading left to right, each operator "claims" the next N operands (2 for binary operators, 1 for unary). No parentheses needed; the structure is unambiguous.

A more complex example, 2x² + 3y:

Infix:   2*x^2 + 3*y
Tree:
            +
           / \
          *   *
         / \ / \
        2  ^ 3  y
          / \
         x   2

Prefix:  + * 2 ** x 2 * 3 y

The model's job is to learn this transformation: given arbitrary mathematical text, output the prefix notation that represents its structure.
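To make the "claiming" behaviour concrete, here is a minimal sketch (not Vidai's code) of how a prefix string can be evaluated by letting each operator consume the operands that follow it:

# Minimal prefix evaluator: each operator claims the next N operands recursively
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "**": lambda a, b: a ** b}

def eval_prefix(tokens):
    """Consume one subtree from the front of the token list and evaluate it."""
    head = tokens.pop(0)
    if head in OPS:                          # binary operator: claim two operands
        left = eval_prefix(tokens)
        right = eval_prefix(tokens)
        return OPS[head](left, right)
    return float(head)                       # leaf: a number

print(eval_prefix("* + 3 5 2".split()))      # (3 + 5) * 2 -> 16.0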

The Encoder-Decoder Architecture

The system uses a standard transformer encoder-decoder, but with specific design choices:

ContextEncoder (19M parameters):
- 6 transformer layers with 8 attention heads
- 256-dimensional embeddings
- Character embeddings + positional embeddings + depth embeddings
- The depth embeddings encode tree structure hints during training

class ContextEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embedding = nn.Embedding(256, config.d_model)  # ASCII code points
        self.position_embedding = nn.Embedding(config.max_seq_len, config.d_model)
        self.depth_embedding = nn.Embedding(config.max_depth, config.d_model)
        self.transformer = nn.TransformerEncoder(...)

    def forward(self, input_ids, tree_depths, attention_mask):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.token_embedding(input_ids)
        x = x + self.position_embedding(positions)
        x = x + self.depth_embedding(tree_depths)
        return self.transformer(x, mask=attention_mask)

SymbolicParserDecoder (25M parameters):
- 4 transformer decoder layers
- Autoregressive generation with cross-attention to encoder output
- Output vocabulary of ~60 tokens: operators (+, -, *, /, **), functions (sin, cos, sqrt, etc), variables (x, y, theta), digits, and special tokens

class SymbolicParserDecoder(nn.Module):
    def generate(self, encoder_memory, max_len=64):
        """Greedy autoregressive generation of prefix notation (simplified)."""
        output_ids = [BOS_TOKEN]
        for _ in range(max_len):
            logits = self.forward(output_ids, encoder_memory)
            next_token = logits[:, -1, :].argmax(dim=-1).item()  # greedy pick, batch of 1
            output_ids.append(next_token)
            if next_token == EOS_TOKEN:
                break
        return output_ids

The TreeComputeModule: Zero Learned Parameters

Once we have prefix notation, computation is deterministic. The TreeComputeModule has exactly zero learned parameters. It walks the prefix notation tree and executes operations using exact arithmetic:

import torch
from torch import Tensor

# These are the actual operations - no neural network, no gradients
def add_op(left: Tensor, right: Tensor) -> Tensor:
    return left + right

def mul_op(left: Tensor, right: Tensor) -> Tensor:
    return left * right

# The remaining ops referenced in ARITHMETIC_OPS, sketched here for completeness:
def sub_op(left: Tensor, right: Tensor) -> Tensor:
    return left - right

def div_op(left: Tensor, right: Tensor) -> Tensor:
    return left / right

def mod_op(left: Tensor, right: Tensor) -> Tensor:
    return torch.remainder(left, right)

def pow_op(left: Tensor, right: Tensor) -> Tensor:
    # Use exact repeated multiplication for small non-negative integer exponents
    right_int = torch.round(right)
    is_small_int = (right == right_int) & (right_int >= 0) & (right_int <= 10)
    result = torch.ones_like(left)
    for i in range(10):
        result = torch.where(right_int > i, result * left, result)
    return torch.where(is_small_int, result, torch.pow(left, right))

ARITHMETIC_OPS = {
    'add': add_op, 'sub': sub_op, 'mul': mul_op,
    'div': div_op, 'pow': pow_op, 'mod': mod_op,
}

For symbolic expressions (containing variables), the prefix notation is passed to SymPy which handles the algebraic manipulation.
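As a sketch of that handoff (standard SymPy calls; the helper and its operator table are illustrative, not Vidai's internals):

import sympy

SYMPY_OPS = {"+": sympy.Add, "*": sympy.Mul, "**": sympy.Pow,
             "sqrt": sympy.sqrt, "sin": sympy.sin}
ARITY = {"+": 2, "*": 2, "**": 2, "sqrt": 1, "sin": 1}

def prefix_to_sympy(tokens):
    """Rebuild a SymPy expression from a prefix token list."""
    head = tokens.pop(0)
    if head in SYMPY_OPS:
        args = [prefix_to_sympy(tokens) for _ in range(ARITY[head])]
        return SYMPY_OPS[head](*args)
    if head.lstrip("-").isdigit():
        return sympy.Integer(int(head))      # numeric leaf
    return sympy.Symbol(head)                # variable leaf

expr = prefix_to_sympy("+ ** x 2 * 3 y".split())
print(expr)                                  # x**2 + 3*y
print(expr.subs({"x": 3, "y": 4}))           # 21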

Summary

| Component | Parameters | Role |
|---|---|---|
| ContextEncoder | 19M | Character-level encoding with tree depth signals |
| SymbolicParserDecoder | 25M | Autoregressive prefix notation generation |
| TreeComputeModule | 0 | Deterministic arithmetic via hardcoded operations |

Total: 44.6M parameters, all in the encoder-decoder; as discussed earlier, computation is deterministic and involves no machine learning.

The Precedence Bug

When I trained the above transformer model on the initial dataset, performance plateaued at 68% sequence accuracy (despite 98% token accuracy). Token accuracy is easier to learn than sequence accuracy, and during training we would cross 80% token accuracy long before sequence accuracy approached a similar number. The 30-point gap pointed directly at the problem: operator precedence ambiguity in my training data.

The data generator I used to produce the expression data, stored as text, built random expression trees and converted them to infix notation without parentheses. But a string like 76 + 25 * 67 already has a fixed meaning under standard precedence, which may not match the tree that produced it. The training data therefore contained contradictory labels for identical inputs, and 68% accuracy was the best the model could achieve when trying to satisfy contradictory constraints.

After fixing the data to use explicit parentheses, accuracy reached 95% within the first epoch.

| Metric | Before Fix | After Fix |
|---|---|---|
| Token Accuracy | 98% | 99%+ |
| Sequence Accuracy | 68% | 95%+ |
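For reference, the two metrics are computed roughly like this (a hypothetical sketch over padded ID tensors, not the project's evaluation code):

import torch

def token_and_sequence_accuracy(pred_ids, target_ids, pad_id=0):
    """pred_ids, target_ids: (batch, seq_len) integer tensors."""
    mask = target_ids != pad_id
    token_acc = ((pred_ids == target_ids) & mask).sum() / mask.sum()
    seq_acc = ((pred_ids == target_ids) | ~mask).all(dim=1).float().mean()
    return token_acc.item(), seq_acc.item()

A sequence counts as correct only if every non-pad token matches, which is why sequence accuracy lags token accuracy whenever errors are spread thinly across many sequences.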

Test Results

I built a systematic test suite covering 11 categories of mathematical patterns. The test script (scripts/eval/test_finetune.py) runs inference against hand-crafted test cases:

| Category | Test Cases | V3 Baseline | V4 After Fine-tuning |
|---|---|---|---|
| Trig functions (sin, cos, tan) | 8 | 0% | 100% |
| Log/Exp functions | 6 | 0% | 100% |
| Left associativity (a - b - c) | 5 | ~50% | 100% |
| Left associativity (a / b / c) | 4 | ~50% | 100% |
| Operator precedence (no parens) | 7 | ~67% | 86% |
| Standalone negatives | 8 | ~62% | 100% |
| Subscript variables (x1, x2) | 7 | ~70% | 86% |
| Unicode sqrt (√) | 6 | 0% | 100% |
| Modulo (%) | 4 | N/A | 100% |
| Implicit multiplication (2x) | 4 | 80% | 100% |
| Extended variables (r, c, d, v, g) | 6 | 0% | 91% |

Overall on targeted patterns: 90.8%

The validation set accuracy (92%) masked critical failures on patterns like extended variables and Unicode. Testing on held-out patterns, not just held-out samples, revealed gaps the validation set didn't cover.

The CLI and Testing Process

Vidai ships with a CLI for parsing expressions:

# Parse expression to prefix notation
vidai parse "x^2 + 3*y"
# Output: + ** x 2 * 3 y

# Parse and evaluate with substitution
vidai parse "x^2 + y" --eval x=3 y=4
# Output: + ** x 2 y = 13

# Pure arithmetic evaluation
vidai parse "3 + 5 * 2" --eval
# Output: + 3 * 5 2 = 13

# System info (available models, GPU, etc.)
vidai info

The test suite runs inference against hand-crafted test cases organized by category:

python scripts/eval/test_finetune.py \
    --checkpoint models/finetune_v1_best.pt \
    --verbose

Each category tests specific patterns:
- Trig functions: sin(x), cos(theta), tan(z)
- Left associativity: a - b - c must produce - - a b c, not - a - b c
- Unicode: √x must produce sqrt x

The --verbose flag shows individual failures, which proved essential for diagnosing systematic issues.

The Power of Synthetic Data

Every training example is generated programmatically. No human labeling. No scraping math from the web.

| Aspect | Synthetic | Human-Labeled |
|---|---|---|
| Cost | ~$0 | ~$20K for 2M examples |
| Speed | 35,000 samples/second | ~100 samples/hour |
| Quality | Perfect by construction | Error-prone |
| Distribution control | Complete | Limited |

We build the tree first, then render it as text. The prefix notation label is correct by construction, no human judgment required.

# Generate 2M training samples (~60 seconds)
python scripts/data/generate_parser_data.py \
    --output-dir data/parser_v4 \
    --samples 2000000 \
    --mixed
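
The core generation idea, tree first and text second, looks roughly like this (an illustrative sketch; names and distributions are not the actual generator's):

import random

OPS = ["+", "-", "*"]

def random_tree(depth=0):
    """Build a small random expression tree."""
    if depth >= 2 or random.random() < 0.3:
        return str(random.randint(0, 99))                  # leaf: a number
    return (random.choice(OPS), random_tree(depth + 1), random_tree(depth + 1))

def to_prefix(node):
    if isinstance(node, str):
        return node
    op, left, right = node
    return f"{op} {to_prefix(left)} {to_prefix(right)}"

def to_infix(node):
    if isinstance(node, str):
        return node
    op, left, right = node
    return f"( {to_infix(left)} {op} {to_infix(right)} )"  # explicit parentheses

tree = random_tree()
print(to_infix(tree), "->", to_prefix(tree))               # e.g. ( ( 3 + 5 ) * 2 ) -> * + 3 5 2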

Dataset Evolution Across Versions

The datasets evolved significantly across four major versions, each addressing specific failures discovered in evaluation:

| Version | Examples | Key Change | Result |
|---|---|---|---|
| V1 | 500K | Baseline, no explicit parentheses | 68% (precedence bug) |
| V2 | 1M | Added explicit parentheses | 78% |
| V3 | 1M | Mixed notation formats | 92% validation, 78% categories |
| V4 | 1M + 74K fine-tune | Extended variables, Unicode, trig | 90.8% |

V1: The Precedence Bug

The initial dataset generated random expression trees and rendered them to infix notation without parentheses. The tree Mul(Add(76, 25), 67) became 76 + 25 * 67, but standard precedence parses that as 76 + (25 * 67). The training data contained contradictory labels for identical inputs.

V2: Explicit Parentheses

After discovering the bug, I regenerated all data with explicit parentheses: ((76 + 25) * 67). This made tree structure unambiguous. Accuracy jumped from 68% to 78% immediately.

V3/V4: Mixed Notation Formats

Real mathematical notation varies. To train robustness, V3 and V4 used a carefully designed mix that includes 30% parentheses-free expressions to teach the model operator precedence:

| Format | Proportion | Example |
|---|---|---|
| Explicit parentheses, spaces | 30% | ( ( 3 + 5 ) * 2 ) |
| Explicit parentheses, no spaces | 15% | ((3+5)*2) |
| No parentheses, spaces | 20% | 3 + 5 * 2 |
| No parentheses, no spaces | 10% | 3+5*2 |
| Negative numbers | 8% | -5, -3.14 |
| High precision decimals | 5% | 3.14159 |
| Unicode sqrt | 5% | √x, √(x+1) |
| Long chains (4+ terms) | 5% | a + b + c + d |
| Edge cases mixed | 2% | Various |

The 30% parentheses-free data was critical for teaching operator precedence. Without it, the model would only learn to copy structure from parentheses rather than understanding that * binds tighter than +.
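A tiny illustration of how such a distribution can be enforced at generation time (weights taken from the table above; the format names are made up for this sketch):

import random

FORMAT_WEIGHTS = {
    "parens_spaced": 30, "parens_tight": 15, "no_parens_spaced": 20,
    "no_parens_tight": 10, "negatives": 8, "high_precision": 5,
    "unicode_sqrt": 5, "long_chains": 5, "edge_cases": 2,
}

fmt = random.choices(list(FORMAT_WEIGHTS), weights=FORMAT_WEIGHTS.values())[0]
print(fmt)   # rendering style chosen for the next sample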

V3 achieved 92% on the validation set, but when I tested on extended variables (r, c, d, v, g), accuracy was 0%. The validation set only tested interpolation, not generalization to unseen patterns.

V4: Targeted Fine-tuning

Rather than retrain from scratch, I generated 74K targeted examples covering:
- Extended variables in diverse contexts
- Unicode symbols (√, ×, ÷)
- Trigonometric functions (sin, cos, tan)
- Left associativity cases (a - b - c)

Fine-tuning for 3,500 steps (about $3 on Runpod) brought accuracy on these patterns from 0% to 90%+.

Cloud Training: From Modal to Runpod

Local training on my MacBook Pro (M4 Pro with the Metal Performance Shaders, or MPS, backend) runs at ~1.2 iterations/second. For 50,000 steps, that's 12+ hours. Being GPU poor in this specific sense, I turned to Modal and then to Runpod.

I started with Modal, which offers a clean Python-native API for serverless GPU compute. You define your training function, decorate it, and Modal handles containerization and scheduling:

@app.function(gpu="A100", timeout=3600)
def train_model(config: dict):
    # Training code runs on A100
    ...

Modal worked well, but at $3.19/hour for A100, costs added up during iteration. I didn't want to get the $250 team plan for this small hobby project. I switched to Runpod for the next few training runs.

| Platform | GPU | Throughput | Cost/Hour | 50K Steps Cost |
|---|---|---|---|---|
| Local (MacBook with M4 Pro) | MPS | 1.2 it/s | — | ~12 hours |
| Modal | A100 40GB | 50 it/s | $3.19 | ~$0.90 |
| Runpod | A100 80GB | 50 it/s | $1.89 | ~$0.54 |
| Runpod | RTX 4090 | 30 it/s | $0.44 | ~$0.25 |

Runpod's serverless API is straightforward, but I stumbled through a few bits, such as the SSH proxy, which strangely did not allow file copies, and the platform's own CLIs, runpod and runpodctl. Anyhow, the workflow is below:

export RUNPOD_API_KEY="your_key"

# Start training pod
python scripts/train/runpod_train.py --start --github-token $GITHUB_TOKEN

# Check status
python scripts/train/runpod_train.py --status --job-id <job_id>

# Download results
python scripts/train/runpod_train.py --download --pod-id <pod_id>

For a 44M parameter model, an RTX 4090 (24GB VRAM) is sufficient and dramatically cheaper than an A100. Total training cost for V4 (pre-training + fine-tuning): approximately $12. However, the RTX 4090 was not available, as everyone else had the same bright idea, and I was left with the A100 to train this one. A little more money, but I am glad I could take a shot at doing this.

One lesson from my Runpod experience: don't use Runpod's SSH proxy (ssh.runpod.io). It blocks PTY allocation and breaks SCP/rsync. Always use the direct IP:port from the Runpod API. This was a drag, and at one point it kept a pod running long after it was needed.

What I Learned

Each plateau during development had a clear cause unrelated to model capacity:

| Problem | Root Cause | Fix | Impact |
|---|---|---|---|
| 82% ceiling | Neural net memorizing arithmetic | Separate parsing from computation | +10pts |
| 68% sequence accuracy | Contradictory training labels | Explicit parentheses in data | +27pts |
| 92% validation, 0% on r, c, d | Incomplete variable coverage | Extended variable list | +91pts (subset) |
| 0% on trig functions | Missing training examples | Targeted fine-tuning | +100pts (subset) |

The fix was never a better model. It was a clearer formulation of what the model should be learning.

Key lessons:

  1. Data quality > model size > training time: The jump from 68% to 95% came from fixing the data.
  2. The gap between token and sequence accuracy is diagnostic: A 30-point gap (98% token, 68% sequence) pointed directly at precedence ambiguity.
  3. Pre-training + fine-tuning is powerful even at small scale: 74K fine-tuning examples fixed patterns that 1M pre-training samples missed.
  4. Tokenizer coverage matters: Missing ln from the vocabulary means the model cannot output it, regardless of training.
  5. Test on held-out patterns, not just held-out samples: 92% on the validation set masked 0% on unseen variable names.

Several approaches have explored tree-based representations for mathematical expressions in neural networks. It's worth comparing Vidai to these methods.

MathGPT modifies GPT-2 by linearizing operator trees (OPTs) via depth-first traversal, adding tree position embeddings (binary representations of sibling indices) plus symbol type embeddings. It uses constrained decoding to ensure valid tree output. On equation extraction tasks, MathGPT achieves 52.4% tree match accuracy versus GPT-2's 47.8%.

Tree Decoders (seq2tree, Graph2Tree) generate expression trees directly for Math Word Problems (MWPs). These approaches use graph-based encoders for input text and tree-structured decoders for output, achieving state-of-the-art results on benchmarks like Math23K.

Skip-tree training masks subtrees in formal math corpora and trains LLMs to predict missing parts, yielding strong logical reasoning capabilities.

How does Vidai differ?

| Aspect | MathGPT / Tree Decoders | Vidai |
|---|---|---|
| Goal | Learn math end-to-end | Learn parsing only |
| Input encoding | Token-level with tree position | Character-level ASCII |
| Output format | OPT via constrained decoding | Prefix notation (unconstrained) |
| Computation | Neural (learned) | Deterministic (SymPy) |
| Architecture | Decoder-only (GPT-based) | Encoder-decoder |
| Training data | Math word problems, proofs | Synthetic expression pairs |

The key philosophical difference: MathGPT and tree decoders attempt to learn mathematical reasoning end-to-end. Vidai deliberately avoids this. The neural network learns only to parse human notation into trees; the mathematics is delegated to symbolic engines that are correct by construction.

This separation has trade-offs:
- Vidai's advantage: 100% computation accuracy, exact fractions, symbolic algebra via SymPy. No risk of the model "hallucinating" that 3 + 5 = 9.
- Vidai's limitation: Cannot learn new mathematical operations or solve novel problem types without extending the symbolic engine. MathGPT can potentially generalize to new patterns if trained on enough examples.

For tasks where correctness is non-negotiable (financial calculations, engineering, scientific computing), Vidai's approach is safer. For exploratory mathematical reasoning where approximate or heuristic solutions are acceptable, end-to-end approaches like MathGPT may be more flexible.

The tree representation insight is shared: both MathGPT and Vidai recognize that mathematical expressions are fundamentally trees, not sequences. MathGPT encodes this through tree position embeddings; Vidai encodes it through prefix notation output. The difference is what happens after the tree is extracted.

Known Issues

What Works Without Parentheses

Testing with the CLI shows that simple two-term precedence works correctly:

$ vidai parse "3 + 5 * 2"
+ 3 * 5 2 = 13.0     (correctly parsed as 3 + (5*2))

$ vidai parse "10 - 2 * 3"
- 10 * 2 3 = 4.0     (correctly parsed as 10 - (2*3))

$ vidai parse "a + b * c"
+ a * b c            (correctly parsed as a + (b*c))

What Fails Without Parentheses

Complex expressions with implicit multiplication or 3+ terms still struggle:

$ vidai parse "x^2 + 3*y"
* + ** x 2 3 y       Got (x² + 3) * y, expected x² + (3*y)

$ vidai parse "a + b + c * d"
+ a + b * c d        Got a + (b + (c*d)), expected (a + b) + (c*d)

The model handles basic operator precedence but struggles with:
1. Implicit multiplication combined with other operations
2. Left-associativity in chains of 3+ terms

Despite having 30% parentheses-free data in training, these edge cases remain problematic. The workaround is to use explicit parentheses for complex expressions.

Other Issues

ln function missing from tokenizer

Input:  ln(y)
Expected: ln y
Got:      <unk>n y

The output vocabulary has log but not ln. Fix requires vocabulary expansion and retraining.

What's Next: Higher Mathematics as Trees

The same architecture extends naturally to higher mathematics. The key observation: differentiation, integration, differential equations, and other advanced operations are themselves tree transformations. The neural network's job remains parsing; the symbolic engine handles the mathematical heavy lifting.

Calculus Operations

Differentiation and integration are operators that take an expression and a variable. Consider the derivative of x³ + 2x with respect to x:

Input:  "d/dx(x³ + 2x)"

Tree:
              diff
             /    \
            +      x
           / \
          ^   *
         / \ / \
        x  3 2  x

Prefix: diff + ** x 3 * 2 x x

The diff operator sits at the root, with the expression as its left child and the differentiation variable as its right child. The expression subtree is identical to what we saw in arithmetic.

Integration follows the same pattern:

Input:  "∫ sin(x) dx"

Tree:
          integrate
           /     \
         sin      x
          |
          x

Prefix: integrate sin x x

The neural network learns to parse the many notational variants ("d/dx", "∂/∂x", "f'(x)", "dy/dx", prime notation) into this canonical tree form.
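Once the canonical tree is in hand, the execution side is ordinary SymPy (standard API, shown here just to make the handoff concrete):

import sympy as sp

x = sp.Symbol("x")
print(sp.diff(x**3 + 2*x, x))        # 3*x**2 + 2, the diff tree above
print(sp.integrate(sp.sin(x), x))    # -cos(x), the integrate tree above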

Physics: The Kinematics Equation

Consider a classic physics problem: an airplane accelerating down a runway. The displacement equation is s = ut + ½at². Given initial velocity u, acceleration a, and time t, solve for s:

Input:  "s = u*t + (1/2)*a*t²"

Tree:
                  =
                 / \
                s   +
                   / \
                  *   *
                 / \ / \
                u  t *  ^
                    / \ / \
                   ÷   a t  2
                  / \
                 1   2

Prefix: = s + * u t * * / 1 2 a ** t 2

The entire equation, including the equality, is a tree. Once parsed, SymPy can solve for any variable, substitute values, or derive related equations. The neural network's job is to understand that "½", "1/2", and "0.5" all mean the same thing, that "t²" means t**2, and that the equation represents a relationship between physical quantities. The network can already parse Unicode strings and understand fractions, so this kind of equation may be handled end-to-end in some future version of Vidai.
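The corresponding symbolic handoff is again plain SymPy (standard calls, shown for illustration):

import sympy as sp

s, u, a, t = sp.symbols("s u a t")
eq = sp.Eq(s, u*t + sp.Rational(1, 2)*a*t**2)
print(sp.solve(eq, s))               # [a*t**2/2 + t*u]
print(eq.subs({u: 5, a: 2, t: 3}))   # Eq(s, 24)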

Linear Algebra: Systems of Equations

A system of linear equations is also a tree structure. Consider solving:

2x + 3y = 7
x - y = 1
Input:  "2x + 3y = 7, x - y = 1"

Tree:
              system
              /    \
             =      =
            / \    / \
           +   7  -   1
          / \    / \
         *   *  x   y
        / \ / \
       2  x 3  y

Prefix: system = + * 2 x * 3 y 7 = - x y 1

SymPy's solve function takes this tree and returns {x: 2, y: 1}. The neural network handles the parsing: commas versus newlines, implicit multiplication, various equation separator conventions.
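That call is standard SymPy (shown for illustration):

import sympy as sp

x, y = sp.symbols("x y")
print(sp.solve([sp.Eq(2*x + 3*y, 7), sp.Eq(x - y, 1)], [x, y]))   # {x: 2, y: 1}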

Determinants and Matrices

Matrix operations extend the same idea. A determinant computation:

Input:  "det([[a, b], [c, d]])"

Tree:
            det
             |
           matrix
            / \
          row  row
          / \  / \
         a  b c  d

Prefix: det matrix row a b row c d

The symbolic engine computes ad - bc. For numerical matrices, NumPy handles the computation. For symbolic matrices, SymPy does this.
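For the symbolic case, that is a one-liner in SymPy (standard API):

import sympy as sp

a, b, c, d = sp.symbols("a b c d")
print(sp.Matrix([[a, b], [c, d]]).det())   # a*d - b*c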

The Key Point

All of these, from simple arithmetic to differential equations to linear algebra, are written as text but can be represented as trees. The encoder-decoder model learns the mapping from human notation to a tree structure from training data. The trees are then executed by symbolic engines whose operations are well-defined, fixed, and correct by construction.

The neural network never learns calculus or linear algebra itself. Instead, it learns that when humans write "d/dx", they mean differentiation, and that when they write "det", they mean determinant. The mathematics is handled by code that does not require gradient descent.

The General Pattern

| Domain | Parse Target | Execution Engine |
|---|---|---|
| Arithmetic | Prefix notation tree | Exact fraction arithmetic |
| Algebra | Expression tree | SymPy simplify/solve |
| Calculus | Diff/integrate trees | SymPy diff/integrate |
| ODEs | Equation + conditions | SymPy dsolve |
| PDEs | Equation + boundary | SymPy pdsolve |
| Linear Algebra | Matrix expressions | NumPy/SymPy |

Each domain follows the same template: the neural network converts human notation into a canonical tree representation, and a symbolic engine executes the mathematics. The parsing problem scales with notational complexity; the computation problem is already solved.

This extends beyond mathematics. Code (parse to AST, execute with interpreter), proofs (parse to inference steps, verify with proof checker), chemistry (parse to molecular structure, simulate with physics engine). The pattern is general: neural network for parsing noisy human notation, symbolic engine for deterministic computation.

Conclusion

Building Vidai taught me something that seems obvious in retrospect: the hardest part of applying neural networks to mathematical computation is not the computation itself, but understanding what computation is being asked for. Once you frame the problem as parsing rather than calculating, the architecture of the neural network becomes clear. I foresee Vidai becoming an important bridge, integrating with frameworks like Praval in the future and thereby allowing agents to parse mathematical input natively.

The 27-point accuracy gain from fixing the precedence bug, compared to negligible gains from model scaling, reinforced a lesson that applies broadly: for structured tasks, data quality dominates model size. At some level we have known this since the "Textbooks Are All You Need" paper. The model was never the bottleneck. The formulation of the problem was.

Vidai is far from complete. Complex precedence chains still fail. The ln token is missing from the vocabulary. Implicit multiplication combined with other operations confuses the parser. But the architecture is sound, and each failure points to a clear fix in the data or vocabulary rather than a fundamental limitation.

Perhaps most importantly, this project reminded me why I find machine learning compelling. Watching my son struggle with arithmetic, I saw the same parsing challenges that trip up neural networks. The structure must be understood before the computation can proceed. For humans and machines alike, the answer (விடை) comes only after the question is properly understood.

Try It Yourself

You can try Vidai directly in your browser. Enter a mathematical expression and see how the model parses it into prefix notation, then evaluates it using SymPy:

Tips for best results:
- Simple expressions work reliably: 3 + 5 * 2, sin(pi/2), sqrt(16)
- Use parentheses for complex expressions: (x^2) + (3*y) instead of x^2 + 3*y
- Variable substitution: enter x=3, y=4 in the substitutions field


Resources:
- Model: aiexplorations/vidai on HuggingFace
- Interactive Demo: aiexplorations/vidai-demo on HuggingFace Spaces
- Source Code: aiexplorations/vidai on GitHub

References

Lample, G., & Charton, F. (2020). Deep learning for symbolic mathematics. ICLR 2020. arXiv:1912.01412

Liu, T. (2023). Goat: Fine-tuned LLaMA outperforms GPT-4 on arithmetic tasks. arXiv:2305.14201

Nye, M., et al. (2021). Show your work: Scratchpads for intermediate computation with language models. arXiv:2112.00114

Trask, A., Hill, F., Reed, S., Rae, J., Dyer, C., & Blunsom, P. (2018). Neural arithmetic logic units. NeurIPS 2018. arXiv:1808.00508

Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS 2022. arXiv:2201.11903

Google DeepMind. (2024). AI achieves silver-medal standard solving International Mathematical Olympiad problems.

Veličković, P. et al (2025). Category theory for neural networks. Machine Learning Street Talk.

https://arxiv.org/html/2411.16993v1