Building a rules engine that doesn't fight the user
How we ship a deterministic, debuggable, no-magic categorization system that correctly handles 200+ vendors — and what we learned building it.
The most common complaint about finance apps isn't the price. It's that the categorization is wrong, and there's no obvious way to understand why or fix it. "Why did my Netflix charge get categorized as Software?" The app doesn't tell you. It just quietly does the wrong thing.
We decided early that our parsing pipeline had to be explainable. Every decision should have a reason that a user could understand — not a confidence vector from a model that nobody can debug.
Try the parser
Before the explanation, a demonstration. Here's a simplified version of what our pipeline does when an email arrives:
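The sketch below is a minimal Python illustration of that flow, not the production code. The vendor table, the amount pattern, and the `parse_email` function are hypothetical stand-ins; the point is only that the sender domain and a simple pattern carry most of the signal.

```python
import re
from email.utils import parseaddr

# Hypothetical vendor table: sender domain -> (merchant, category).
VENDORS = {
    "netflix.com": ("Netflix", "Streaming"),
    "spotify.com": ("Spotify", "Streaming"),
    "github.com": ("GitHub", "Software"),
}

# Matches dollar amounts such as "$15.49", "$1299.00", or "$1,299.00".
AMOUNT_RE = re.compile(r"\$(\d[\d,]*(?:\.\d{2})?)")


def sender_domain(from_header: str) -> str:
    """Return the lowercased domain of a From header, or '' if there is none."""
    _, address = parseaddr(from_header)
    if "@" not in address:
        return ""
    return address.rpartition("@")[2].lower()


def parse_email(from_header: str, body: str) -> dict:
    """Extract merchant, category, and amount using fixed rules only."""
    parts = sender_domain(from_header).split(".")
    vendor = None
    # Walk up subdomains so "mailer.netflix.com" also matches "netflix.com".
    for i in range(len(parts) - 1):
        vendor = VENDORS.get(".".join(parts[i:]))
        if vendor:
            break
    match = AMOUNT_RE.search(body)
    amount = float(match.group(1).replace(",", "")) if match else None
    merchant, category = vendor if vendor else (None, None)
    return {"merchant": merchant, "category": category, "amount": amount}


receipt = parse_email(
    "Netflix <info@mailer.netflix.com>",
    "You were charged $15.49 for your membership.",
)
```

Here `receipt` comes back with the merchant, category, and amount filled in, and an unknown sender simply yields `None` fields instead of a guess.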
The pipeline design
The core insight is that most financial emails are highly structured. Vendors have to follow email deliverability conventions, so their sender domains stay consistent, and because receipts are generated from templates, their subjects follow predictable patterns. We exploit that structure before reaching for anything clever.
The LLM (Gemini) is a last resort, not the foundation. For 96% of emails in our dataset, deterministic rules handle the classification at higher confidence than any model we tested. LLMs hallucinate vendor names, misread amounts in edge cases, and are expensive to run at scale. Rules are fast, cheap, and debuggable.
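One way to express "last resort" in code is an ordered list of rule functions, with the model reached only when every rule declines. The following is a sketch under assumptions: the rule names, the confidence values, and the `llm_classify` callable are all hypothetical, and nothing here calls Gemini or any real API.

```python
from typing import Callable, Optional

Result = tuple[str, float, str]  # (category, confidence, source)
Rule = Callable[[dict], Optional[Result]]


def domain_rule(email: dict) -> Optional[Result]:
    # Hypothetical mapping; the confidence value is illustrative only.
    known = {"netflix.com": "Streaming", "github.com": "Software"}
    category = known.get(email["domain"])
    return (category, 0.99, "sender_domain") if category else None


def subject_rule(email: dict) -> Optional[Result]:
    subject = email["subject"].lower()
    if "receipt" in subject or "invoice" in subject:
        return ("Purchase", 0.80, "subject_pattern")
    return None


# Order matters: cheaper, higher-confidence rules run first.
RULES: list[Rule] = [domain_rule, subject_rule]


def classify(email: dict, llm_classify: Rule) -> Optional[Result]:
    """Try each deterministic rule in order; call the model only if all decline."""
    for rule in RULES:
        result = rule(email)
        if result is not None:
            return result
    return llm_classify(email)
```

Because the model is passed in as a plain callable, it can be swapped for a stub in tests, and it is never invoked for emails a rule already handles.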
Confidence distribution in production
After running the pipeline across a large corpus of real forwarded emails, here's how the confidence scores distribute:
62% of emails are classified at the highest confidence tier: a pure sender domain match. These are the easy ones, such as Netflix, Spotify, Notion, and GitHub. Their From addresses are stable, and we've mapped them all.
The 2% that fall below our 0.65 threshold get flagged for review rather than silently classified. We'd rather show a user "we're not sure about this one" than confidently get it wrong.
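The routing itself can be very small. This sketch assumes a hypothetical `Decision` record and `decide` function; only the 0.65 cutoff comes from the text above.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.65  # the cutoff quoted above


@dataclass(frozen=True)
class Decision:
    category: str
    confidence: float
    needs_review: bool


def decide(category: str, confidence: float) -> Decision:
    """Flag anything strictly below the threshold instead of classifying silently."""
    return Decision(category, confidence, needs_review=confidence < REVIEW_THRESHOLD)
```

A result at exactly 0.65 passes in this version; whether the boundary should be inclusive is a product decision the sketch does not settle.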
What "doesn't fight the user" actually means
A rules engine fights the user when it's a black box. You see the output, you don't understand the input, and you have no way to influence the result. That's almost every categorization system in consumer finance today.
Our system is designed so that every classification can be traced: "This was categorized as a subscription because the sender domain matched netflix.com in our vendor database, and the subject contained the string 'membership.'" Users can see that. They can override it. And when they do, we learn.
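A traceable classification can be modeled as a result that carries its own reasons. The class and function names below are hypothetical, and the two rules simply mirror the Netflix example; a real rule set would be data-driven.

```python
from dataclasses import dataclass, field


@dataclass
class Classification:
    category: str
    reasons: list[str] = field(default_factory=list)

    def explain(self) -> str:
        """Render the recorded reasons as one user-facing sentence."""
        if not self.reasons:
            return f"Categorized as {self.category}: no rule matched."
        return f"Categorized as {self.category} because " + " and ".join(self.reasons) + "."


def classify_with_trace(domain: str, subject: str) -> Classification:
    # Each rule that fires appends the reason it fired.
    result = Classification(category="Uncategorized")
    if domain == "netflix.com":
        result.category = "Subscription"
        result.reasons.append(f"the sender domain matched {domain}")
    if "membership" in subject.lower():
        result.category = "Subscription"
        result.reasons.append("the subject contained 'membership'")
    return result
```

Calling `classify_with_trace("netflix.com", "Your membership has renewed").explain()` produces a sentence a user can read directly, which is the property the rest of this section depends on.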
The rules are also deterministic: given the same email, the system produces the same output every time. There's no probabilistic drift, no model retraining causing silent reclassification. If you ask us to put Netflix in Entertainment instead of Streaming, it stays there.
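In code, that guarantee amounts to a lookup where user overrides always win over defaults. A minimal sketch, assuming a hypothetical `Categorizer` class and an in-memory store (a real system would persist overrides per user):

```python
DEFAULT_CATEGORIES = {"netflix.com": "Streaming"}  # hypothetical defaults


class Categorizer:
    def __init__(self) -> None:
        self.overrides: dict[str, str] = {}

    def set_override(self, domain: str, category: str) -> None:
        """Record a user's choice for a sender domain."""
        self.overrides[domain] = category

    def categorize(self, domain: str) -> str:
        # User overrides take precedence over the built-in defaults.
        if domain in self.overrides:
            return self.overrides[domain]
        return DEFAULT_CATEGORIES.get(domain, "Uncategorized")
```

There is no randomness and no learned state in `categorize`, so repeated calls with the same input and the same overrides return the same category.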
That predictability is what calm software requires. Magic is the enemy of trust.
See it in action
Parsed receipts, not guessed ones
Forward a receipt and watch Spendbox extract merchant, amount, and renewal date with 96%+ accuracy. No magic, just good rules.