An AI agent module dropped into an existing B2B SaaS, behind a feature flag, with an eval harness that gated every PR.

A workflow SaaS with 1,200 enterprise customers wanted to ship an AI assistant inside their product without disrupting the existing surface or risking regression. We built the agent module, the eval corpus, and the rollout plan — and merged it behind a flag on day 19.

stack → Next.js (existing) · Claude 4.6 · pgvector · LangGraph · Inngest · Datadog · LaunchDarkly

SCHEMATIC An abstract view of the AURORA engagement — not a literal product screenshot. Built to communicate engineering shape, not surface design.

▷ outcomes

Regression incidents in existing surface

312 · Eval cases gating every PR

T+672h · Design-partner rollout

94% · Eval threshold cleared at merge

[ §01 ] the cycle

How 720 hours actually ran.

Day 01 — 04

Audit + spec

scope.agent read the existing codebase end-to-end and produced an audit memo: where the agent could live without entangling the existing surface, which APIs to extend, which to leave untouched. The spec was a 12-page document with eight signed ADRs.

↳audit.md ↳8 ADRs ↳module boundary diagram

Day 05 — 14

Module build

build.agent shipped the agent module: planner, tool layer, memory store, response formatter. Forty-one PRs, every one reviewed by their head of engineering. Zero changes to the existing surface code outside three extension points.
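One way to keep a "zero changes outside three extension points" rule honest is a small check over each PR's changed files. What follows is a minimal sketch, not the check used on the engagement. It is written in Python for brevity even though the host application is Next.js, and the three extension-point paths are invented for illustration (the source only states that the module lives in /agent/* and that there are three extension points).

```python
from pathlib import PurePosixPath

# Assumption: the agent namespace is the top-level "agent" directory.
AGENT_NAMESPACE = PurePosixPath("agent")

# Hypothetical extension-point files; the real ones are not named in the text.
EXTENSION_POINTS = frozenset({
    "app/api/workflows/hooks.ts",
    "app/components/WorkflowSidebar.tsx",
    "lib/events/bus.ts",
})


def boundary_violations(changed_files):
    """Return changed paths that are neither inside the agent namespace
    nor one of the declared extension points."""
    violations = []
    for raw in changed_files:
        path = PurePosixPath(raw.lstrip("/"))
        inside_agent = AGENT_NAMESPACE in path.parents
        if not inside_agent and str(path) not in EXTENSION_POINTS:
            violations.append(str(path))
    return violations


pr_files = [
    "agent/planner/index.ts",
    "agent/tools/search.ts",
    "lib/events/bus.ts",        # declared extension point
    "app/billing/invoice.ts",   # existing surface, outside the boundary
]
print(boundary_violations(pr_files))
```

A CI step could fail the build whenever the returned list is non-empty, which turns the module boundary from a convention into something a reviewer does not have to police by hand.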

↳41 PRs ↳module isolated ↳ext points × 3

Day 15 — 22

Eval corpus + threshold gating

qa.agent built a 312-example eval corpus drawn from anonymized customer workflows. CI gate: 4 metrics, each with thresholds the agent had to hit before any PR could merge. Two PRs were blocked and rewritten.

↳312 eval cases ↳4 scoring axes ↳CI gate live

Day 23 — 30

Flagged rollout + monitoring

deploy.agent provisioned the feature flag tree (cohort-based, per-tenant). Soft rollout to 12 design-partner accounts on day 27, with monitor.agent on a custom alert harness watching for response quality drift.

↳flag tree ↳12 design partners live ↳drift alerts armed
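A cohort-based, per-tenant flag tree can be read as a simple resolution order. The sketch below assumes tenant overrides win over cohort rules, which win over a global default. That ordering, and every cohort and tenant name in it, is an assumption for illustration. It is plain Python and does not use or imitate the LaunchDarkly SDK.

```python
from dataclasses import dataclass, field


@dataclass
class FlagTree:
    """Assumed resolution order: tenant override, then cohort rule, then default."""
    default: bool = False
    cohorts: dict = field(default_factory=dict)           # cohort name -> bool
    tenant_overrides: dict = field(default_factory=dict)  # tenant id -> bool

    def is_enabled(self, tenant_id, cohort=None):
        if tenant_id in self.tenant_overrides:
            return self.tenant_overrides[tenant_id]
        if cohort in self.cohorts:
            return self.cohorts[cohort]
        return self.default


# Hypothetical cohort and tenant identifiers.
flags = FlagTree(
    default=False,
    cohorts={"design-partner": True},
    tenant_overrides={"tenant-042": False},  # per-tenant kill switch
)

print(flags.is_enabled("tenant-007", cohort="design-partner"))
print(flags.is_enabled("tenant-042", cohort="design-partner"))
print(flags.is_enabled("tenant-900"))
```

Keeping the default off means a tenant outside every cohort never sees the agent, and the per-tenant override gives a way to pull a single design partner out without touching the rest of the cohort.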

[ §02 ] agent log · selected

What the loop looked like.

verbatim · 6 lines · timestamps redacted

cycle-log · aurora

archived

T+120h [ OK ] scope.agent audit complete · 3 safe extension points identified · 0 risky entanglements

T+240h [ >> ] build.agent agent module isolated in /agent/* · existing routes untouched

T+360h [WARN] qa.agent eval threshold missed on cohort B (legal vertical) · paged human

T+480h [ OK ] qa.agent added domain-specific tool layer for legal cohort · threshold cleared

T+600h [ >> ] deploy.agent feature flag tree provisioned · 12 design partners cohorted in

T+720h [ OK ] monitor.agent response quality drift monitor armed · 7d baseline established
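The last log line mentions a drift monitor armed against a 7-day baseline. The source does not say how drift is measured, so the following is only one plausible shape: flag a day whose quality score falls more than k standard deviations below the baseline mean. The scores, the value of k, and the function name are all invented for illustration.

```python
from statistics import mean, stdev


def drift_alert(baseline_scores, current_score, k=3.0):
    """Return True when current_score is more than k sample standard
    deviations below the mean of the baseline window."""
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline points")
    mu = mean(baseline_scores)
    sigma = stdev(baseline_scores)
    return current_score < mu - k * sigma


# Invented daily quality scores standing in for a 7-day baseline.
baseline = [0.94, 0.95, 0.93, 0.94, 0.96, 0.95, 0.94]

print(drift_alert(baseline, 0.93))  # inside normal variation for this baseline
print(drift_alert(baseline, 0.85))  # far below the baseline mean
```

A one-sided test is used here because the concern is quality dropping, not improving. A production monitor would likely also want a minimum sample size per day before trusting the daily score.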

[ §03 ] notes from the cycle

AURORA is the engagement shape we get asked about most often: an established B2B SaaS with paying customers, real revenue, and a leadership team that wants to ship an AI feature without becoming an AI company by accident.

The dependency we agreed to up front

Existing code stays untouched outside three explicit extension points. The agent module lives in its own namespace, behind a feature flag, and talks to the host application only through those three points. This isn’t an architectural preference; it’s a risk posture. The existing surface has 1,200 customers depending on it. The agent module starts at zero.

What the eval harness actually does

Every PR runs the agent module against 312 anonymized customer workflows, scored on four axes: task completion, format compliance, tool-call correctness, and tone fidelity. Below threshold on any axis, the PR is blocked. The corpus and the thresholds are now AURORA’s permanent property — they will gate every future AI feature their team ships.
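The gating rule described above (four axes, block on any axis below threshold) can be sketched in a few lines. The axis names come from the text. The threshold values, the use of a per-axis mean over all cases, and the two sample cases are assumptions, since the source does not state how scores are aggregated or what the thresholds are.

```python
# Axis names are from the text; the threshold values are placeholders.
THRESHOLDS = {
    "task_completion": 0.90,
    "format_compliance": 0.95,
    "tool_call_correctness": 0.90,
    "tone_fidelity": 0.85,
}


def axis_means(results):
    """Average each axis over all eval cases. `results` is a list of dicts
    mapping axis name -> score in [0, 1], one dict per case."""
    return {
        axis: sum(case[axis] for case in results) / len(results)
        for axis in THRESHOLDS
    }


def gate(results):
    """Return (passed, failures). `failures` maps every axis whose mean
    fell below its threshold to that mean."""
    means = axis_means(results)
    failures = {a: m for a, m in means.items() if m < THRESHOLDS[a]}
    return (not failures, failures)


# Two invented cases standing in for the 312-case corpus.
results = [
    {"task_completion": 1.0, "format_compliance": 1.0,
     "tool_call_correctness": 1.0, "tone_fidelity": 0.9},
    {"task_completion": 1.0, "format_compliance": 1.0,
     "tool_call_correctness": 0.5, "tone_fidelity": 0.9},
]

passed, failures = gate(results)
print(passed, failures)
```

In CI, the script would exit non-zero when `passed` is false so the merge is blocked, and printing `failures` tells the PR author which axis to look at first.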

This is the part of an AI build that consultancies skip and product teams discover late. We make it the artefact you walk away with.

What we explicitly didn’t build

A user-facing chat surface in the product. AURORA already had a clear workflow product; adding a generic chat would have weakened the existing UX. The agent surfaces as small inline interventions inside the existing workflow — a recommendation card, an inline auto-complete, a “fix this” button — never as a separate chat panel.

That call was made on day three, by their head of product, after a 30-minute discussion of scope.agent’s draft. It’s the kind of judgement call that compounds into a product that doesn’t feel grafted.

from the founder

"We've been quoted nine months by larger shops for this exact piece of work. Kaedax did it in thirty days and the eval harness they built is the artefact we now reuse on every AI feature."

— Head of Engineering · AURORA
