An end-to-end personal-finance agent for solopreneurs — perception, reasoning, memory, tools, and an eval harness — shipped as a product, not a demo.

TALLOW set out to build a 'personal CFO' agent for solopreneurs: it ingests bank feeds, classifies transactions, surfaces decisions, and answers natural-language questions about runway. The hard part was making it reliable enough to trust with money.

stack → Next.js + tRPC · Claude 4.6 · Plaid · Postgres + pgvector · LangGraph · Inngest queues · Eval harness (custom)

SCHEMATIC An abstract view of the TALLOW engagement — not a literal product screenshot. Built to communicate engineering shape, not surface design.

▷ outcomes

184 · Golden traces gating runtime changes

Unauthorized financial actions in beta

T+714h · Closed beta launch

Solopreneurs onboarded in week 1

[ §01 ] the cycle

How 720 hours actually ran.

Day 01 — 05

Agent shape + boundary

scope.agent drafted the agent's contract first — not the UI. What it can know, what it cannot do without permission, what it must refuse to answer. The boundary doc was longer than the UI spec and was the most consequential artefact of the cycle.

↳boundary.md ↳tool catalogue ↳refusal policy
Day 06 — 17

Runtime build

build.agent shipped the agent runtime: planner (Claude-backed), tool layer (8 tools), memory store (pgvector + structured ledger), refusal layer. Every tool call is logged with provenance and is replayable from the audit trail.
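
As an illustration of what "logged with provenance and replayable" can mean in practice, here is a minimal sketch. The names `AuditEntry`, `AuditLog`, and `replayAudit`, and every field in them, are assumptions made for this example rather than TALLOW's actual schema. Replay is only meaningful for deterministic tools, which the toy `runwayTool` stands in for.

```typescript
// Hypothetical audit-entry shape; field names are assumptions, not TALLOW's schema.
type AuditEntry = {
  seq: number;                     // position in the trail
  tool: string;                    // e.g. "runway.compute"
  args: Record<string, unknown>;   // inputs exactly as the planner sent them
  result: unknown;                 // output returned to the planner
  provenance: { sessionId: string; plannerStep: number };
};

type ToolFn = (args: Record<string, unknown>) => unknown;

class AuditLog {
  private entries: AuditEntry[] = [];

  // Run a tool and record the call together with its provenance.
  call(
    tool: string,
    fn: ToolFn,
    args: Record<string, unknown>,
    provenance: AuditEntry["provenance"],
  ): unknown {
    const result = fn(args);
    this.entries.push({ seq: this.entries.length, tool, args, result, provenance });
    return result;
  }

  all(): readonly AuditEntry[] {
    return this.entries;
  }
}

// Re-execute every logged call and return the seq numbers whose result changed.
function replayAudit(trail: readonly AuditEntry[], tools: Record<string, ToolFn>): number[] {
  const mismatches: number[] = [];
  for (const entry of trail) {
    const fn = tools[entry.tool];
    const again = fn ? fn(entry.args) : undefined;
    if (JSON.stringify(again) !== JSON.stringify(entry.result)) mismatches.push(entry.seq);
  }
  return mismatches;
}

// Usage with a toy, deterministic stand-in for runway.compute.
const runwayTool: ToolFn = (a) => ({ months: Number(a["cash"]) / Number(a["burn"]) });
const auditLog = new AuditLog();
auditLog.call("runway.compute", runwayTool, { cash: 12000, burn: 3000 }, { sessionId: "s1", plannerStep: 0 });

const unchanged = replayAudit(auditLog.all(), { "runway.compute": runwayTool });
const changed = replayAudit(auditLog.all(), { "runway.compute": () => ({ months: 0 }) });
```

Here `unchanged` comes back empty because the replayed result matches the logged one, while `changed` flags entry 0 because the swapped-in tool returns a different answer.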

↳runtime live ↳8 tools ↳replayable audit
Day 18 — 24

Eval corpus + golden traces

qa.agent built 184 golden conversation traces covering common asks ('what's my runway') and edge cases ('move money to savings' — refused). Each trace has expected tool calls, expected memory writes, expected refusals.
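
A golden trace with those three expectations could be checked roughly as follows. This is a sketch under stated assumptions: `GoldenTrace`, `AgentRun`, `checkTrace`, and `gateOnTraces` are hypothetical names, and the expected tool sequence for the runway question is invented for illustration.

```typescript
// Hypothetical golden-trace shape; fields mirror the three expectations described above.
type GoldenTrace = {
  id: string;
  userMessage: string;
  expectedToolCalls: string[];    // tool names, in order
  expectedMemoryWrites: string[]; // memory keys the run should write
  expectRefusal: boolean;
};

// What one agent run is assumed to expose to the harness.
type AgentRun = { toolCalls: string[]; memoryWrites: string[]; refused: boolean };

function sameList(a: string[], b: string[]): boolean {
  return a.length === b.length && a.every((x, i) => x === b[i]);
}

// Returns human-readable failures; an empty list means the trace is green.
function checkTrace(trace: GoldenTrace, run: AgentRun): string[] {
  const failures: string[] = [];
  if (!sameList(trace.expectedToolCalls, run.toolCalls)) {
    failures.push(`${trace.id}: tool calls were [${run.toolCalls.join(", ")}]`);
  }
  if (!sameList(trace.expectedMemoryWrites, run.memoryWrites)) {
    failures.push(`${trace.id}: memory writes were [${run.memoryWrites.join(", ")}]`);
  }
  if (trace.expectRefusal !== run.refused) {
    failures.push(`${trace.id}: expected refused=${trace.expectRefusal}, got ${run.refused}`);
  }
  return failures;
}

// Gate: a runtime change passes only if every trace is green.
function gateOnTraces(traces: GoldenTrace[], runAgent: (message: string) => AgentRun): string[] {
  const failures: string[] = [];
  for (const t of traces) failures.push(...checkTrace(t, runAgent(t.userMessage)));
  return failures;
}

// Two illustrative traces; the expected tool sequence is an assumption.
const sampleTraces: GoldenTrace[] = [
  { id: "runway-01", userMessage: "what's my runway", expectedToolCalls: ["accounts.list", "runway.compute"], expectedMemoryWrites: [], expectRefusal: false },
  { id: "move-01", userMessage: "move money to savings", expectedToolCalls: [], expectedMemoryWrites: [], expectRefusal: true },
];

// A deliberately broken agent: it calls a tool instead of refusing the move.
const brokenAgent = (message: string): AgentRun =>
  message.indexOf("move money") >= 0
    ? { toolCalls: ["savings.move_intent"], memoryWrites: [], refused: false }
    : { toolCalls: ["accounts.list", "runway.compute"], memoryWrites: [], refused: false };

const gateFailures = gateOnTraces(sampleTraces, brokenAgent);
```

With the broken agent, `gateFailures` holds two entries, both for `move-01`: one for the unexpected tool call and one for the missing refusal.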

↳184 traces ↳tool-call gating ↳refusal tests
Day 25 — 30

Closed beta + observability

deploy.agent shipped to 60 closed-beta solopreneurs with monitor.agent watching tool-call patterns, refusal rates, and an offline 'shadow eval' that re-runs new sessions against the golden set as background regression detection.

↳60 beta users ↳shadow evals ↳60d on-call

[ §02 ] agent log · selected

What the loop looked like.

verbatim · 6 lines · timestamps redacted

cycle-log · tallow

archived

T+120h [ OK ] scope.agent boundary doc signed · 14 refusal cases enumerated · founder signed off

T+240h [ >> ] build.agent tool layer shipped · 8 tools · all writes require explicit user confirm

T+360h [WARN] qa.agent regression: 'move money' tool returning result instead of confirm prompt

T+480h [ OK ] build.agent patched · all financial writes now route through confirm flow

T+600h [ OK ] qa.agent 184/184 golden traces green · refusal rate within tolerance

T+720h [ OK ] monitor.agent shadow eval baseline established · drift detection armed

[ §03 ] notes from the cycle

TALLOW is the engagement where we get to be most explicit about what agent-first development actually means. Most “AI agency” engagements ship a prompt and a chat surface. TALLOW shipped a runtime.

The boundary doc

Before any UI work, before the planner, before the tool layer — the boundary doc. What the agent is allowed to know. What it is allowed to do without explicit confirmation. What it must refuse to answer. What it must escalate. Where it must store provenance.

For a money-handling agent, the boundary doc is more important than the UI mockups. The UI is downstream of the boundary; if the boundary is wrong, the UI is shipping the wrong product.
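
One way to make a boundary doc enforceable is to express it as data the runtime consults before acting. The sketch below assumes a hypothetical `Boundary` type and invented action names; it does not reproduce the real boundary.md or its 14 refusal cases. Escalating anything the boundary does not name is an assumed default.

```typescript
type Decision = "allow" | "confirm" | "refuse" | "escalate";

// Hypothetical boundary; the action names below are illustrative only.
type Boundary = {
  allow: string[];   // safe without confirmation (reads)
  confirm: string[]; // needs explicit user confirmation
  refuse: string[];  // must never be attempted or answered
};

const sampleBoundary: Boundary = {
  allow: ["read.transactions", "compute.runway"],
  confirm: ["move.to_savings", "cancel.subscription"],
  refuse: ["give.investment_advice"],
};

// Refusals are checked first so a mis-listed action fails closed.
// Anything the boundary does not name escalates to a human (an assumed default).
function decide(boundary: Boundary, action: string): Decision {
  if (boundary.refuse.indexOf(action) >= 0) return "refuse";
  if (boundary.confirm.indexOf(action) >= 0) return "confirm";
  if (boundary.allow.indexOf(action) >= 0) return "allow";
  return "escalate";
}
```

Keeping the boundary as plain data means each refusal case can be covered by a test, and a change to the boundary shows up as a reviewable diff.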

Tool catalogue

Eight tools, narrow scope each:

accounts.list — read-only
transactions.list — read-only, scoped to date range
transactions.classify — write to memory only, not to bank
runway.compute — read-only, returns explanation
savings.move_intent — drafts an intent, requires explicit confirm to execute
recurring.detect — read-only, returns flagged subscriptions
recurring.cancel_intent — drafts intent, requires confirm
escalate.human — surfaces to a human reviewer for ambiguous asks

No “general-purpose tool” that lets the planner do anything. Each tool’s contract is narrow, audited, and individually evaluated.
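
The intent pattern in the catalogue can be sketched as a small state machine: a tool drafts an intent, a separate user action confirms it, and only then may the side effect run. The tool names come from the list above; `ToolKind`, `IntentQueue`, and the status values are assumptions made for this example.

```typescript
// The eight tools from the catalogue, tagged by the kind of effect they may have.
type ToolKind = "read" | "memory_write" | "intent" | "escalate";

const CATALOGUE: Record<string, ToolKind> = {
  "accounts.list": "read",
  "transactions.list": "read",
  "transactions.classify": "memory_write",
  "runway.compute": "read",
  "savings.move_intent": "intent",
  "recurring.detect": "read",
  "recurring.cancel_intent": "intent",
  "escalate.human": "escalate",
};

type IntentStatus = "drafted" | "confirmed" | "executed";
type Intent = { id: number; tool: string; args: Record<string, unknown>; status: IntentStatus };

class IntentQueue {
  private nextId = 1;
  private intents: Record<number, Intent> = {};

  // Drafting never touches the bank; it only records what the user might approve.
  draft(tool: string, args: Record<string, unknown>): Intent {
    if (CATALOGUE[tool] !== "intent") throw new Error(`${tool} is not an intent tool`);
    const intent: Intent = { id: this.nextId++, tool, args, status: "drafted" };
    this.intents[intent.id] = intent;
    return intent;
  }

  // Meant to be called only from an explicit user action, never by the planner (assumed wiring).
  confirm(id: number): void {
    const intent = this.get(id);
    if (intent.status !== "drafted") throw new Error(`intent ${id} is ${intent.status}`);
    intent.status = "confirmed";
  }

  // The side effect runs only for confirmed intents.
  execute(id: number, effect: (intent: Intent) => void): void {
    const intent = this.get(id);
    if (intent.status !== "confirmed") throw new Error(`intent ${id} is not confirmed`);
    effect(intent);
    intent.status = "executed";
  }

  private get(id: number): Intent {
    const intent = this.intents[id];
    if (!intent) throw new Error(`unknown intent ${id}`);
    return intent;
  }
}

// Usage: executing before confirmation is rejected; after confirmation it runs once.
const intentQueue = new IntentQueue();
const draftMove = intentQueue.draft("savings.move_intent", { amount: 500 });

let executedEarly = true;
try {
  intentQueue.execute(draftMove.id, () => {});
} catch (_err) {
  executedEarly = false;
}

intentQueue.confirm(draftMove.id);
let movedAmount = 0;
intentQueue.execute(draftMove.id, (i) => { movedAmount = Number(i.args["amount"]); });
```

Because `draft` rejects any tool that is not tagged `intent`, a read-only tool such as `accounts.list` cannot be used to reach the execute path.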

The shadow eval that runs forever

After launch, every real user session is replayed (anonymized, with consent) against the golden trace corpus as a background regression check. If a new prompt or model version changes the agent’s behaviour on the golden set, monitor.agent surfaces it before users see drift.

This isn’t an evaluation we run before launch. It’s an evaluation that runs forever. Kaedax leaves the harness behind; TALLOW owns it.
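
A drift check of this kind can be sketched as a comparison between the current agent version and a candidate on the same golden prompts. Everything named here (`Behaviour`, `AgentVersion`, `shadowEval`, the tolerance parameters) is a hypothetical stand-in, not the harness TALLOW runs.

```typescript
// Observable behaviour on one golden prompt; the fields are assumptions.
type Behaviour = { toolCalls: string[]; refused: boolean };
type AgentVersion = (message: string) => Behaviour;

type DriftReport = { drifted: string[]; refusalRate: number; refusalRateOk: boolean };

// Stable string form of a behaviour, used for equality checks.
function fingerprint(b: Behaviour): string {
  return JSON.stringify([b.toolCalls, b.refused]);
}

// Compare a candidate (new prompt or model) against the baseline version on every
// golden message, and check the candidate's refusal rate against a tolerance.
function shadowEval(
  golden: string[],
  baselineAgent: AgentVersion,
  candidateAgent: AgentVersion,
  expectedRefusalRate: number,
  tolerance: number,
): DriftReport {
  const drifted: string[] = [];
  let refusals = 0;
  for (const message of golden) {
    const before = baselineAgent(message);
    const after = candidateAgent(message);
    if (fingerprint(before) !== fingerprint(after)) drifted.push(message);
    if (after.refused) refusals += 1;
  }
  const refusalRate = golden.length === 0 ? 0 : refusals / golden.length;
  const refusalRateOk = Math.abs(refusalRate - expectedRefusalRate) <= tolerance;
  return { drifted, refusalRate, refusalRateOk };
}

// Toy versions: v2 has stopped refusing the money-movement request.
const goldenMessages = ["what's my runway", "move money to savings"];
const agentV1: AgentVersion = (m) =>
  m.indexOf("move money") >= 0
    ? { toolCalls: [], refused: true }
    : { toolCalls: ["runway.compute"], refused: false };
const agentV2: AgentVersion = (m) =>
  m.indexOf("move money") >= 0
    ? { toolCalls: ["savings.move_intent"], refused: false }
    : { toolCalls: ["runway.compute"], refused: false };

const stableReport = shadowEval(goldenMessages, agentV1, agentV1, 0.5, 0.1);
const driftReport = shadowEval(goldenMessages, agentV1, agentV2, 0.5, 0.1);
```

In this toy run, `stableReport` shows no drift, while `driftReport` lists the money-movement prompt and reports a refusal rate outside tolerance.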

from the founder

"I came in thinking the hard part was the model. Kaedax showed me the hard part was the refusal layer, the audit trail, and the eval set. That's the difference between a demo and a product."

— Founder · TALLOW

