Engineering · April 28, 2026 · 9 min read

Building a rules engine that doesn't fight the user

How we ship a deterministic, debuggable, no-magic categorization system that correctly handles 200+ vendors — and what we learned building it.

Jonah Park

The most common complaint about finance apps isn't the price. It's that the categorization is wrong, and there's no obvious way to understand why or fix it. "Why did my Netflix charge get categorized as Software?" The app doesn't tell you. It just quietly does the wrong thing.

We decided early that our parsing pipeline had to be explainable. Every decision should have a reason that a user could understand — not a confidence vector from a model that nobody can debug.

Try the parser

Before the explanation, a demonstration. Here's a simplified version of what our pipeline does when an email arrives:

Sample input email:

From

netflix@mailer.netflix.com

Subject

Your Netflix membership — $15.49 charged

Snippet

Your monthly membership has been renewed...
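
As a rough, non-interactive stand-in for the demo, the Python sketch below pulls the charged amount out of the sample email. The regex, the currency table, and the function name are assumptions made for illustration, not the production rules.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

# Assumed pattern: a currency symbol or ISO code immediately before the number.
AMOUNT_RE = re.compile(
    r"(?P<currency>[$€£]|USD|EUR|GBP)\s?(?P<value>\d+(?:,\d{3})*(?:\.\d{2})?)"
)
SYMBOL_TO_CODE = {"$": "USD", "€": "EUR", "£": "GBP"}


def extract_amount(text: str) -> Optional[Tuple[Decimal, str]]:
    """Return (amount, currency code) for the first currency-tagged number, or None."""
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    # Decimal avoids float rounding surprises on money values.
    amount = Decimal(match.group("value").replace(",", ""))
    currency = match.group("currency")
    return amount, SYMBOL_TO_CODE.get(currency, currency)


sample = {
    "from": "netflix@mailer.netflix.com",
    "subject": "Your Netflix membership \u2014 $15.49 charged",
    "snippet": "Your monthly membership has been renewed...",
}

print(extract_amount(sample["subject"]))  # (Decimal('15.49'), 'USD')
print(extract_amount(sample["snippet"]))  # None
```

Requiring a currency marker next to the number is what keeps order numbers and dates from being mistaken for amounts.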

The pipeline design

The core insight is that most financial emails are highly structured. Vendors have to follow email deliverability conventions, which keeps their sender domains consistent, and transactional subject lines tend to follow predictable patterns. We exploit that structure before reaching for anything clever.

1. Sender domain lookup. Match the From address against 200+ known vendor domains. Highest-confidence signal.

2. Subject line regex. Pattern match on phrases such as "invoice #", "receipt for", "your subscription", and "amount due". Handles 85% of cases.

3. Body extraction. Amount: regex with currency context. Date: ISO, named month, or relative. Merchant: first proper noun near the amount.

4. Gemini classification. For ambiguous emails: LLM with structured JSON output and confidence scoring. Fallback only.

5. Conflict resolution. Duplicate detection by (user_id, message_id). Confidence threshold gating at 0.65.
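
To make the ordering concrete, here is a minimal Python sketch of the first two stages above: a sender domain lookup, then a subject regex, each returning a reason alongside its result. The vendor table, the regexes, the confidence values (taken as plausible per-stage floors), and all names are hypothetical, not our actual rule set.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical excerpt; the real table covers 200+ vendor domains.
VENDOR_DOMAINS = {
    "netflix.com": "Netflix",
    "spotify.com": "Spotify",
    "github.com": "GitHub",
}

# Phrases named in the subject regex stage; the exact regexes are assumptions.
SUBJECT_PATTERNS = [
    ("invoice #", re.compile(r"\binvoice\s*#", re.IGNORECASE)),
    ("receipt for", re.compile(r"\breceipt for\b", re.IGNORECASE)),
    ("your subscription", re.compile(r"\byour subscription\b", re.IGNORECASE)),
    ("amount due", re.compile(r"\bamount due\b", re.IGNORECASE)),
]


@dataclass(frozen=True)
class Classification:
    stage: str         # which stage made the decision
    label: str         # vendor name or matched pattern
    confidence: float  # assumed per-stage value
    reason: str        # human-readable explanation


def match_vendor_domain(from_address: str) -> Optional[str]:
    """Return the known domain that equals the sender domain or is a parent of it."""
    labels = from_address.rsplit("@", 1)[-1].lower().split(".")
    # Try mailer.netflix.com, then netflix.com; never the bare TLD.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in VENDOR_DOMAINS:
            return candidate
    return None


def classify(from_address: str, subject: str) -> Optional[Classification]:
    """Run the deterministic stages in order; None means later stages must decide."""
    domain = match_vendor_domain(from_address)
    if domain is not None:
        return Classification(
            "sender_domain", VENDOR_DOMAINS[domain], 0.95,
            f"sender domain matched {domain}",
        )
    for name, pattern in SUBJECT_PATTERNS:
        if pattern.search(subject):
            return Classification(
                "subject_regex", name, 0.85, f"subject matched '{name}'",
            )
    return None  # body extraction and the LLM fallback would run here


print(classify("netflix@mailer.netflix.com", "Your Netflix membership"))
```

Matching on whole domain labels rather than substrings matters: a sender at notnetflix.com should not be treated as Netflix.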

The LLM (Gemini) is a last resort, not the foundation. For 96% of emails in our dataset, deterministic rules handle the classification at higher confidence than any model we tested. LLMs hallucinate vendor names, misread amounts in edge cases, and are expensive to run at scale. Rules are fast, cheap, and debuggable.
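
Because the fallback's output is the least trustworthy input in the pipeline, it is worth validating before anything downstream sees it. The sketch below shows one way to do that in Python. The field names (vendor, category, confidence) are assumptions about the JSON contract, and the model call itself is deliberately left out.

```python
import json
from typing import Optional

# Assumed shape of the model's structured reply.
REQUIRED_FIELDS = {"vendor": str, "category": str, "confidence": (int, float)}


def parse_llm_result(raw: str) -> Optional[dict]:
    """Validate a model's JSON reply; return None rather than trust malformed output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected in REQUIRED_FIELDS.items():
        value = data.get(field)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return None
    if not 0.0 <= data["confidence"] <= 1.0:
        return None
    # Drop any extra keys the model may have invented.
    return {field: data[field] for field in REQUIRED_FIELDS}


good = '{"vendor": "Netflix", "category": "Streaming", "confidence": 0.68}'
print(parse_llm_result(good))
print(parse_llm_result("Sure! Here is the JSON you asked for"))  # None
```

A reply that fails validation can simply be treated as "no answer", which sends the email to the review queue instead of producing a guess.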

Confidence distribution in production

After running the pipeline across a large corpus of real forwarded emails, here's how the confidence scores distribute:

Confidence distribution across parsed emails

Confidence    Source                  Share of emails
0.95–1.0      Sender domain match     62%
0.85–0.95     Subject regex match     21%
0.70–0.85     Body extraction         11%
0.65–0.70     LLM classification      4%
Below 0.65    Needs review flag       2%

62% of emails are classified at the highest confidence tier — pure sender domain match. These are the easy ones: Netflix, Spotify, Notion, GitHub. Their From addresses are deterministic and we've mapped them all.

The 2% that fall below our 0.65 threshold get flagged for review rather than silently classified. We'd rather show a user "we're not sure about this one" than confidently get it wrong.
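
The gating and duplicate check can be sketched in a few lines of Python. The 0.65 threshold and the (user_id, message_id) key come from the pipeline description; the class, the in-memory set, and the status strings are illustrative assumptions (a real system would presumably persist the keys).

```python
from typing import Set, Tuple

REVIEW_THRESHOLD = 0.65  # below this, flag for review instead of classifying


class Intake:
    """Drops duplicate messages and gates low-confidence results."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()

    def accept(self, user_id: str, message_id: str, confidence: float) -> str:
        key = (user_id, message_id)
        if key in self._seen:
            return "duplicate"
        self._seen.add(key)
        if confidence >= REVIEW_THRESHOLD:
            return "classified"
        return "needs_review"


intake = Intake()
print(intake.accept("u1", "m1", 0.97))  # classified
print(intake.accept("u1", "m1", 0.97))  # duplicate
print(intake.accept("u1", "m2", 0.40))  # needs_review
```

Keying on the pair rather than on message_id alone means the same message forwarded by two different users is processed once for each of them.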

What "doesn't fight the user" actually means

A rules engine fights the user when it's a black box. You see the output, you don't understand the input, and you have no way to influence the result. That's almost every categorization system in consumer finance today.

Our system is designed so that every classification can be traced: "This was categorized as a subscription because the sender domain matched netflix.com in our vendor database, and the subject contained the string 'membership.'" Users can see that. They can override it. And when they do, we learn.

The rules are also deterministic: given the same email, the system produces the same output every time. There's no probabilistic drift, no model retraining causing silent reclassification. If you ask us to put Netflix in Entertainment instead of Streaming, it stays there.
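
One way to picture that guarantee is a lookup in which a user's override always wins over the default rule, and every answer carries its reason. This Python sketch is a simplified illustration with hypothetical names and an in-memory store, not a description of our storage layer.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Decision:
    category: str
    reason: str  # shown to the user alongside the category


class Categorizer:
    def __init__(self, defaults: Dict[str, str]) -> None:
        self._defaults = dict(defaults)  # vendor -> category from the rules
        self._overrides: Dict[Tuple[str, str], str] = {}  # (user_id, vendor) -> category

    def override(self, user_id: str, vendor: str, category: str) -> None:
        """Record a user's choice; it applies only to that user."""
        self._overrides[(user_id, vendor)] = category

    def categorize(self, user_id: str, vendor: str) -> Decision:
        """Pure lookup: the same inputs and stored overrides give the same answer."""
        chosen = self._overrides.get((user_id, vendor))
        if chosen is not None:
            return Decision(chosen, f"you set {vendor} to {chosen}")
        category = self._defaults.get(vendor, "Uncategorized")
        return Decision(category, f"{vendor} maps to {category} in the vendor database")


rules = Categorizer({"Netflix": "Streaming"})
print(rules.categorize("u1", "Netflix").category)  # Streaming
rules.override("u1", "Netflix", "Entertainment")
print(rules.categorize("u1", "Netflix").category)  # Entertainment
```

Nothing in the lookup depends on time, sampling, or model state, so a category only changes when a rule or an override changes.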

That predictability is what calm software requires. Magic is the enemy of trust.
