kaedax

An AI agent module dropped into an existing B2B SaaS, behind a feature flag, with an eval harness that gated every PR.

A workflow SaaS with 1,200 enterprise customers wanted to ship an AI assistant inside their product without disrupting the existing surface or risking regression. We built the agent module, the eval corpus, and the rollout plan — and merged it behind a flag on day 19.

stack → Next.js (existing) · Claude 4.6 · pgvector · LangGraph · Inngest · Datadog · LaunchDarkly
[Schematic: EXISTING SURFACE (workflow.app · v8.4: Projects, Tasks, Calendar, Reports, Inbox, Members) connected through three extension points (ext.intent, ext.context, ext.action) to AGENT MODULE · isolated: PLANNER (claude-4.6), TOOL LAYER (8 tools), MEMORY (pgvector · 14k), EVAL HARNESS (312 cases · CI), FEATURE FLAG (12 design partners). CASE · AURORA · BLUEPRINT AI · EMBEDDED FEATURE]
SCHEMATIC An abstract view of the AURORA engagement — not a literal product screenshot. Built to communicate engineering shape, not surface design.

outcomes

01 · Regression incidents in existing surface: 0

02 · Eval cases gating every PR: 312

03 · Design-partner rollout: T+672h

04 · Eval threshold cleared at merge: 94%

[ §01 ] the cycle

How 720 hours actually ran.

  1. Day 01 — 04

    Audit + spec

    scope.agent read the existing codebase end-to-end and produced an audit memo: where the agent could live without entangling the existing surface, which APIs to extend, which to leave untouched. The spec was a 12-page document with eight signed ADRs.

    audit.md · 8 ADRs · module boundary diagram
  2. Day 05 — 14

    Module build

    build.agent shipped the agent module: planner, tool layer, memory store, response formatter. Forty-one PRs, every one reviewed by the client's head of engineering. Zero changes to the existing surface code outside the three extension points.

    41 PRs · module isolated · ext points × 3
  3. Day 15 — 22

    Eval corpus + threshold gating

    qa.agent built a 312-example eval corpus drawn from anonymized customer workflows. CI gate: four scoring axes, each with a threshold the agent had to clear before any PR could merge. Two PRs were blocked and rewritten.

    312 eval cases · 4 scoring axes · CI gate live
  4. Day 23 — 30

    Flagged rollout + monitoring

    deploy.agent provisioned the feature flag tree (cohort-based, per-tenant). Soft rollout to 12 design-partner accounts on day 27, with monitor.agent on a custom alert harness watching for response quality drift.

    flag tree · 12 design partners live · drift alerts armed
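A cohort-based, per-tenant flag tree can be sketched in a few lines. This is a minimal illustration, not the provisioned configuration: the `FlagNode` shape, the cohort names, the tenant IDs, and the "most specific rule wins" ordering are all assumptions, and the sketch uses a plain in-memory resolver rather than any LaunchDarkly SDK call.

```typescript
// Hypothetical sketch: none of these names come from the AURORA codebase
// or from the LaunchDarkly SDK.
type Cohort = "design-partner" | "general";

interface Tenant {
  id: string;
  cohort: Cohort;
}

interface FlagNode {
  defaultOn: boolean;                        // fallback for every tenant
  cohorts: Partial<Record<Cohort, boolean>>; // cohort-level rule
  tenants: Record<string, boolean>;          // per-tenant override
}

// Assumed resolution order: tenant override, then cohort rule, then default.
function isEnabled(flag: FlagNode, tenant: Tenant): boolean {
  const tenantRule = flag.tenants[tenant.id];
  if (tenantRule !== undefined) return tenantRule;
  const cohortRule = flag.cohorts[tenant.cohort];
  if (cohortRule !== undefined) return cohortRule;
  return flag.defaultOn;
}

// Example tree: off by default, on for design partners,
// with one (made-up) tenant switched off individually.
const agentModuleFlag: FlagNode = {
  defaultOn: false,
  cohorts: { "design-partner": true },
  tenants: { "tenant-042": false },
};
```

The per-tenant override doubles as a kill switch: a single account can be pulled out of the rollout without touching the cohort rule.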

[ §02 ] agent log · selected

What the loop looked like.

cycle-log · aurora
archived
T+120h [ OK ] scope.agent audit complete · 3 safe extension points identified · 0 risky entanglements
T+240h [ >> ] build.agent agent module isolated in /agent/* · existing routes untouched
T+360h [WARN] qa.agent eval threshold missed on cohort B (legal vertical) · paged human
T+480h [ OK ] qa.agent added domain-specific tool layer for legal cohort · threshold cleared
T+600h [ >> ] deploy.agent feature flag tree provisioned · 12 design partners cohorted in
T+720h [ OK ] monitor.agent response quality drift monitor armed · 7d baseline established
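
The last log line mentions a drift monitor armed against a 7-day baseline. The alert harness itself is not described here, so the following is only one plausible shape for such a check: compare the current quality score with the baseline mean and flag it when it sits more than `zLimit` standard deviations away. The function names, the daily-score granularity, and the default limit of 3 are assumptions.

```typescript
// Hypothetical drift check; the real alert harness is not described in the source.
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Population standard deviation of the baseline window.
function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) * (x - m))));
}

interface DriftResult {
  drifted: boolean;
  zScore: number;
}

// baseline: one quality score per day (7 entries for a 7d baseline).
// zLimit: assumed alert threshold, in standard deviations.
function checkDrift(baseline: number[], current: number, zLimit = 3): DriftResult {
  if (baseline.length === 0) throw new Error("baseline must not be empty");
  const sd = stdDev(baseline);
  const delta = current - mean(baseline);
  let zScore: number;
  if (sd === 0) {
    // Perfectly flat baseline: any change at all counts as unbounded deviation.
    zScore = delta === 0 ? 0 : delta > 0 ? Infinity : -Infinity;
  } else {
    zScore = delta / sd;
  }
  return { drifted: Math.abs(zScore) > zLimit, zScore };
}
```

A z-score rule is the simplest option; a production monitor would more likely look at a rolling window or a per-cohort baseline, as the legal-vertical miss at T+360h suggests.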

[ §03 ] notes from the cycle

AURORA is the engagement shape we get asked about most often: an established B2B SaaS with paying customers, real revenue, and a leadership team that wants to ship an AI feature without becoming an AI company by accident.

The constraint we agreed to up front

Existing code stays untouched outside three explicit extension points, the only places where the agent module talks to the host application. The module itself lives in its own namespace, behind a feature flag. This isn’t an architectural preference; it’s a risk posture. The existing surface has 1,200 customers depending on it. The agent module starts at zero.
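
To make that risk posture concrete, here is a minimal sketch of what a single guarded extension point might look like on the host side. Everything in it is illustrative: `AgentModule`, `Suggestion`, and `guardedSuggest` are hypothetical names, and the real extension points are not specified in this write-up. The idea it shows is that the flag check and the error boundary sit in one wrapper, so the host falls back to its old behaviour whenever the flag is off or the agent fails.

```typescript
// Hypothetical shapes; the real extension-point signatures are not given in the source.
interface Suggestion {
  kind: "recommendation" | "autocomplete" | "fix";
  text: string;
}

interface AgentModule {
  suggest(tenantId: string, taskTitle: string): Suggestion | null;
}

// Host-side wrapper: the only place the existing surface references the agent.
function guardedSuggest(
  flagOn: (tenantId: string) => boolean,
  agent: AgentModule,
): (tenantId: string, taskTitle: string) => Suggestion | null {
  return (tenantId, taskTitle) => {
    // Flag off: the host renders exactly what it rendered before.
    if (!flagOn(tenantId)) return null;
    try {
      return agent.suggest(tenantId, taskTitle);
    } catch (_err) {
      // An agent failure never propagates into the existing surface.
      return null;
    }
  };
}
```

Returning `null` rather than throwing means the host's rendering code needs only one extra branch: show the card if there is a suggestion, otherwise render as usual.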

What the eval harness actually does

Every PR runs the agent module against 312 anonymized customer workflows, scored on four axes: task completion, format compliance, tool-call correctness, and tone fidelity. Below threshold on any axis, the PR is blocked. The corpus and the thresholds are now AURORA’s permanent property — they will gate every future AI feature their team ships.
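
The gating rule described above (average each axis over the corpus, block on any axis below its threshold) can be sketched as follows. The four axis names come from the text; the threshold values, the 0-to-1 score range, the use of a simple mean, and every identifier are assumptions made for illustration.

```typescript
// The four axes named in the text; scores are assumed to lie in [0, 1].
type Axis = "taskCompletion" | "formatCompliance" | "toolCallCorrectness" | "toneFidelity";
type Scores = Record<Axis, number>;

const AXES: Axis[] = ["taskCompletion", "formatCompliance", "toolCallCorrectness", "toneFidelity"];

// Placeholder thresholds: the source does not list per-axis values.
const THRESHOLDS: Scores = {
  taskCompletion: 0.9,
  formatCompliance: 0.9,
  toolCallCorrectness: 0.9,
  toneFidelity: 0.9,
};

interface GateResult {
  pass: boolean;
  failing: Axis[];
  means: Scores;
}

// Mean score per axis across all eval cases (assumed aggregation).
function meanScores(cases: Scores[]): Scores {
  if (cases.length === 0) throw new Error("eval corpus is empty");
  const means: Scores = {
    taskCompletion: 0,
    formatCompliance: 0,
    toolCallCorrectness: 0,
    toneFidelity: 0,
  };
  for (const c of cases) {
    for (const axis of AXES) means[axis] += c[axis] / cases.length;
  }
  return means;
}

// A PR passes only if every axis clears its threshold.
function gate(cases: Scores[], thresholds: Scores): GateResult {
  const means = meanScores(cases);
  const failing = AXES.filter((axis) => means[axis] < thresholds[axis]);
  return { pass: failing.length === 0, failing, means };
}
```

In CI, a step would call `gate(...)` on the scored corpus and exit non-zero when `pass` is false, listing `failing` so the author knows which axis to fix.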

This is the part of an AI build that consultancies skip and product teams discover late. We make it the artefact you walk away with.

What we explicitly didn’t build

A user-facing chat surface in the product. AURORA already had a clear workflow product; adding a generic chat would have weakened the existing UX. The agent surfaces as small inline interventions inside the existing workflow — a recommendation card, an inline auto-complete, a “fix this” button — never as a separate chat panel.

That call was made on day three, by AURORA’s head of product, after a 30-minute discussion of scope.agent’s draft. It’s the kind of judgement call that compounds into a product that doesn’t feel grafted.

from the client

"We've been quoted nine months by larger shops for this exact piece of work. Kaedax did it in thirty days and the eval harness they built is the artefact we now reuse on every AI feature."

— Head of Engineering · AURORA