How We Rebuilt Workforce Wave Without Stopping the World
The classic rewrite mistake is the big bang: freeze the old system, build the new one, cut over, discover all the things you forgot about the old one. It fails because the old system's behavior is never fully documented, the cut-over date keeps moving, and the team is maintaining two codebases simultaneously without getting the benefits of either.
We needed to add an entirely new API surface to WFW — a /v2/ namespace designed for AI orchestrators — without breaking the existing admin UI that practices were using every day. This is the story of how we did it incrementally, what decisions we'd make the same way again, and what's still pending.
The Starting Point
The original WFW codebase had one consumer: a human admin using the wfw-admin dashboard. The architecture reflected that. Business logic lived in route handlers. Session-based auth was assumed throughout. The pattern looked like this:
// Original pattern — business logic in routes
app.post('/agents/create', requireSession, async (req, res) => {
  const { businessUrl, name } = req.body;
  // validation
  if (!businessUrl) return res.status(400).json({ error: 'businessUrl required' });
  // business logic — directly in the route
  const scoutJob = await triggerScoutProvision({ businessUrl, name, userId: req.session.userId });
  await db.insert(ww_agents).values({ ... });
  return res.json({ agentId: scoutJob.agentId });
});
This works fine when your only consumer is a human with a session. It's incompatible with AI orchestrators for three reasons:
- req.session.userId assumes a browser session. Bot callers authenticate with OAuth 2.1 client credentials — no session.
- The response shape ({ agentId: ... }) doesn't follow the consistent envelope that machine consumers need.
- The synchronous response assumes Workforce Wave provisioning is fast. It isn't — 75–120 seconds isn't viable in a synchronous HTTP response.
The existing routes couldn't be changed without breaking the admin dashboard. The admin dashboard was in production. We needed a path that didn't require freezing either.
The Decision: New Namespace, Shared Service Layer
The core architectural decision was to add a /v2/ namespace that shares a service layer with the existing routes — rather than duplicating business logic or wrapping the existing routes.
┌─────────────────────────────────────────────────────┐
│ Route Layer │
│ │
│ /admin/* (session auth) │ /v2/* (OAuth 2.1) │
│ ───────────────────────────────────────────────── │
│ Shared Service Layer │
│ AgentService │ CallService │ ScoutService │
│ ─────────────────────────────────────────────────── │
│ Data Layer │
│ ww_agents │ ww_calls │ ww_scout_jobs │ ... │
└─────────────────────────────────────────────────────┘
Business logic moves to service classes. Both route surfaces call the same service methods. The session-based route passes a session-derived context; the v2 route passes an OAuth-derived context. The services don't care which surface called them.
The bridging abstraction is ActorContext:
// ActorContext — the shared identity model
type ActorContext = {
  actorId: string;    // userId for human sessions, serviceAccountId for bot callers
  actorType: 'user' | 'service_account';
  businessId: string; // the business this actor is operating on behalf of
  scopes: string[];   // what this actor is allowed to do
};

// Session surface creates ActorContext from session
function sessionToActorContext(session: Session): ActorContext {
  return {
    actorId: session.userId,
    actorType: 'user',
    businessId: session.businessId,
    scopes: ['agents:write', 'agents:read', 'calls:read', 'calls:write'] // all scopes for humans
  };
}

// v2 surface creates ActorContext from OAuth token
function tokenToActorContext(token: OAuthToken): ActorContext {
  return {
    actorId: token.serviceAccountId,
    actorType: 'service_account',
    businessId: token.businessId,
    scopes: token.scopes // explicitly granted at token issuance
  };
}
Service methods take an ActorContext as their first argument. Authorization checks run against context.scopes. Audit log entries record context.actorId and context.actorType. The business logic doesn't change — only the identity of who's calling it.
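To make the pattern concrete, here is a minimal sketch of a service method in that style. The `createAgent` body, `requireScope` helper, and stubbed return value are illustrative stand-ins, not WFW's actual service code:

```typescript
// Mirrors the ActorContext type shown above, repeated here so the sketch is self-contained.
type ActorContext = {
  actorId: string;
  actorType: 'user' | 'service_account';
  businessId: string;
  scopes: string[];
};

// In-memory stand-in for the real audit log sink.
const auditLog: Array<{ actorId: string; actorType: string; action: string }> = [];

// Throws when the caller lacks a scope; services call this before acting.
function requireScope(ctx: ActorContext, scope: string): void {
  if (!ctx.scopes.includes(scope)) {
    throw new Error(`missing scope: ${scope}`);
  }
}

// Illustrative service method: identity and authorization come entirely from ctx,
// so it works identically for session callers and OAuth callers.
function createAgent(ctx: ActorContext, input: { businessUrl: string; name: string }): { agentId: string } {
  requireScope(ctx, 'agents:write');
  auditLog.push({ actorId: ctx.actorId, actorType: ctx.actorType, action: 'agent_create' });
  // ...trigger provisioning, insert the row, etc. (stubbed result below)
  return { agentId: 'agent_stub' };
}
```

Because the scope check and audit entry both key off `ctx`, a service method never needs to know whether a human or a bot is on the other end.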
The Extraction Process
Extracting logic from routes to services is the most tedious part of this kind of refactor. The steps for each route:
- Identify the logic in the route handler that isn't an HTTP-layer concern (request parsing, response status codes, auth middleware)
- Write a service method that takes the extracted parameters and returns a typed result
- Replace the route handler body with a call to the service method
- Write a new v2 route that calls the same service method with an OAuth-derived ActorContext
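After step 3, the route handler collapses to a thin adapter: parse the request, call the service, shape the response. A sketch of what that looks like, with `Req`, `Res`, and the `agentService` stub as illustrative stand-ins rather than WFW's actual code:

```typescript
// Minimal stand-ins for the framework's request/response objects.
type Req = { body: { businessUrl?: string; name?: string }; session: { userId: string; businessId: string } };
type Res = { status: number; json: unknown };

// Stub for the shared service method produced in step 2.
const agentService = {
  createAgent(ctx: { actorId: string; businessId: string }, input: { businessUrl: string; name: string }) {
    return { agentId: 'agent_stub' };
  },
};

// The route body after extraction: only HTTP-layer concerns remain here.
function handleCreateAgent(req: Req): Res {
  const { businessUrl, name } = req.body;
  if (!businessUrl || !name) {
    return { status: 400, json: { error: 'businessUrl and name required' } }; // HTTP concern stays in the route
  }
  const ctx = { actorId: req.session.userId, businessId: req.session.businessId };
  const result = agentService.createAgent(ctx, { businessUrl, name }); // business logic lives in the service
  return { status: 200, json: result };
}
```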
The tricky part is hidden dependencies. Route handlers that looked like they contained only business logic often had subtle dependencies on req that weren't obvious:
// This looks like business logic...
const ipAddress = req.headers['x-forwarded-for'] || req.socket.remoteAddress;
await logAuditEvent({ userId: req.session.userId, action: 'agent_create', ipAddress });
// ...but it needs the request object. Service layer can't have that.
// Solution: include audit-relevant data in ActorContext or pass as explicit param.
Each discovered dependency was either added to ActorContext (if it was identity-related) or passed as an explicit parameter to the service method (if it was call-context data like IP address).
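The explicit-parameter option can be sketched like this. The `deleteAgent` signature and `CallMeta` type are hypothetical, shown only to illustrate where call-context data ends up:

```typescript
// Call-context data that isn't identity: it travels as an explicit parameter,
// not inside ActorContext.
type CallMeta = { ipAddress?: string };

// In-memory stand-in for the real audit sink.
const auditEvents: Array<{ actorId: string; action: string; ipAddress?: string }> = [];

function logAuditEvent(e: { actorId: string; action: string; ipAddress?: string }): void {
  auditEvents.push(e);
}

// Identity comes from the context object; the IP address is passed explicitly,
// so the service never needs to see the request object.
function deleteAgent(ctx: { actorId: string }, agentId: string, meta: CallMeta): { deleted: string } {
  logAuditEvent({ actorId: ctx.actorId, action: 'agent_delete', ipAddress: meta.ipAddress });
  return { deleted: agentId };
}
```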
What Stayed the Same
The database schema didn't change. Both route surfaces read and write the same tables — ww_agents, ww_calls, ww_scout_jobs, ww_users. The wfw-admin dashboard database is entirely separate and wasn't touched.
The existing admin routes stayed in place. The admin dashboard continues to work as before — session auth, same response shapes, same behavior. We didn't change any of that. We added a parallel surface.
The Workforce Wave provisioning pipeline stayed the same. The service layer calls the same provisioning job trigger the original routes called. What changed: the v2 route returns a 202 Accepted with an operation_id immediately, while the original admin route waited for Workforce Wave to complete before responding. Same underlying pipeline, different response patterns at the HTTP layer.
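The split in response patterns can be sketched as follows. The `startProvisioning` stub and `operation_id` format are illustrative assumptions, not WFW's actual job-trigger code:

```typescript
type Operation = { operationId: string; status: 'pending' | 'complete' };

// In-memory operation store standing in for the real job tracker.
const operations = new Map<string, Operation>();

// Stand-in for the shared provisioning trigger both surfaces call.
function startProvisioning(): Operation {
  const op: Operation = { operationId: `op_${operations.size + 1}`, status: 'pending' };
  operations.set(op.operationId, op);
  return op;
}

// v2 surface: kick off the job and return 202 immediately with an id to poll.
// (The original admin route instead awaits completion before responding.)
function v2CreateAgent(): { status: number; body: { operation_id: string; status: string } } {
  const op = startProvisioning();
  return { status: 202, body: { operation_id: op.operationId, status: op.status } };
}
```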
What Changed
Business logic moved out of route handlers into service classes over the course of the rebuild. Not all of it — this is still in progress (see below) — but the core operations (agent create, call initiate, transcript fetch, KB sync trigger) are now in service classes that both surfaces share.
The v2 routes add async-first response patterns, the consistent response envelope, machine-readable errors, cursor pagination, and idempotency key support. None of these exist on the original admin routes — they weren't needed for a human-operated dashboard.
Authentication is genuinely bifurcated. The admin routes use session middleware. The v2 routes use OAuth 2.1 token validation middleware. There's no code shared between them except the ActorContext translation functions.
The Deployment Story
Every stage of the extraction was deployed independently. No feature flags, no shadow mode — just incremental extraction with the existing routes staying unchanged until a service method was stable.
The sequence for a typical extraction:
1. Write the service method
2. Deploy it — not connected to any route yet
3. Update the existing admin route to call the service method (internal behavior change, no external API change)
4. Deploy and verify the admin dashboard still works
5. Write the v2 route
6. Deploy and verify v2 behavior
The existing route acts as an integration test for the service method in step 3. If the admin dashboard breaks, the service method has a bug. Fix it before writing the v2 route. This sequencing meant we never had a period where both surfaces were broken simultaneously.
What's Still Pending
Several route handlers still have logic that hasn't been extracted to services. The rule is: extract when you need to add a v2 surface for that operation, or when you need to write a test for the logic. Don't extract preemptively.
As of the current sprint, the extraction backlog is approximately 14 route handlers covering edge-case operations (bulk agent configuration, historical reporting, partner account management). These are low-traffic routes where the cost of the extraction isn't justified by the current call volume from bot consumers.
Sprint 23 is earmarked for clearing the extraction backlog. At that point, every operation will be available on both surfaces and the route layer will be a thin HTTP adapter over the service layer. We're not there yet, and that's fine — the incremental approach means every stage shipped working software.
Next in this series: Building a Reliable Event Bus on Serverless Infrastructure — the SSE polling-with-flush pattern, Redis replay buffer, and failure recovery behind WFW's real-time event system.