Dual-Mode Voice: How One Phone Number Serves Both Humans and AI Agents
One phone number. Two optimal response formats. Less than 500 milliseconds to figure out which one to use.
This is the dual-mode detection problem, and it's one of the more interesting pieces of engineering in the WFW stack. The core challenge: a human caller and an AI caller both connect to the same phone number, but the right response to each is completely different. A human gets a warm voice greeting and a conversational flow. An AI gets a structured JSON acknowledgment and expects to exchange structured messages, not speech.
Getting this wrong in either direction has a real cost. A human caller who gets a JSON response has a broken, confusing experience. An AI caller that gets TTS audio has to parse speech to extract structured data — that's slow, fragile, and wasteful. The detection has to be accurate and it has to be fast.
Here's how the detection stack works.
The 500ms Budget
The total latency budget for caller detection is 500ms from call connect to the first response. This is the number that determines how many layers you can run and how fast each needs to be.
Why 500ms? Two reasons:
Human experience: A human caller expects to hear something within half a second of connecting. Beyond that, the silence feels like a dead line and they may hang up or re-dial.
Bot experience: An AI caller is waiting for its acknowledgment before it sends its first structured message. 500ms is long enough to run meaningful detection; extending it significantly would slow down AI-to-AI interactions that need to be fast.
The budget is allocated across three detection layers:
| Layer | Mechanism | Budget | Confidence if triggered |
|---|---|---|---|
| 1 | SIP header inspection | <10ms | High |
| 2 | Pre-negotiated token validation | <50ms | Definitive |
| 3 | First-utterance pattern analysis | <400ms | Medium-High |
The layers run in order and short-circuit. Layer 1 decides whether there is a token to validate; if there is, Layer 2 validates it, and a valid token means Layer 3 never runs. Layer 3 only runs when the earlier layers haven't resolved detection, and it gets whatever remains of the 500ms total.
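In pseudocode terms, the orchestration is just a short-circuiting loop over the layers with a shared deadline. Here's a minimal Python sketch; the names and types are illustrative, not the actual WFW internals:

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Detection:
    mode: str    # "bot" or "human"
    source: str  # which layer resolved it

# A layer receives the call context plus the time remaining in the
# budget, and returns a Detection if it resolved the caller type.
Layer = Callable[[dict, float], Optional[Detection]]

BUDGET_S = 0.500  # 500ms total, from call connect to first response

def detect_caller(call: dict, layers: list[Layer]) -> Detection:
    deadline = time.monotonic() + BUDGET_S
    for layer in layers:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # budget exhausted, stop probing
        result = layer(call, remaining)
        if result is not None:
            return result  # short-circuit: later layers never run
    # No layer resolved detection: human mode is always the fallback
    return Detection(mode="human", source="default")
```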
Layer 1: SIP Header Inspection
Modern SIP clients can include custom headers in INVITE messages. Well-behaved AI callers will self-identify in headers:
User-Agent: WFW-Orchestrator/1.0 (ai-caller)
X-Caller-Type: automated-system
X-WFW-Bot-Token: Bearer eyJhbGciOiJSUzI1NiJ9...
Header inspection happens at the ingress layer, before the call is routed to the agent. It takes under 10ms and requires no LLM involvement.
If recognized header patterns are present — specifically, if X-WFW-Bot-Token is present, or if X-Caller-Type contains automated-system or ai-caller — the inspection layer marks the call for token validation (Layer 2) rather than voice response.
If the token is embedded in the SIP headers (the preferred pattern for registered callers), the call skips Layer 3 entirely.
The limitation: SIP header injection requires that the calling system knows to include these headers. Standard PSTN callers — including humans calling from cell phones and many older automated systems — don't include custom SIP headers. Layer 1 resolves calls from well-configured AI callers quickly; it misses everyone else.
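Here's what the Layer 1 check reduces to, as a sketch. It assumes the ingress layer hands us the INVITE headers as a dict; the header names mirror the examples above, everything else is illustrative:

```python
BOT_CALLER_TYPES = ("automated-system", "ai-caller")

def inspect_sip_headers(headers: dict) -> str:
    """Layer 1: returns "validate_token" or "no_bot_signals"."""
    # SIP header names are case-insensitive; normalize before matching
    h = {k.lower(): v for k, v in headers.items()}

    if "x-wfw-bot-token" in h:
        return "validate_token"  # token present: hand off to Layer 2

    caller_type = h.get("x-caller-type", "").lower()
    if any(t in caller_type for t in BOT_CALLER_TYPES):
        return "validate_token"  # bot signal: validate token if present

    return "no_bot_signals"  # fall through to greeting + Layer 3
```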
Layer 2: Pre-Negotiated Token Validation
Registered callers (AI systems that have been pre-registered via POST /v2/callers/register) receive a caller token that can be presented in one of two ways:
SIP header (preferred): included in the X-WFW-Bot-Token header of the SIP INVITE. Detected in Layer 1 and validated in Layer 2 in under 50ms total.
DTMF sequence: in the first 2–4 seconds of the call, before any speech, the calling system sends a DTMF tone sequence that encodes the token ID. The agent's ingress layer listens for DTMF before triggering the voice greeting.
The DTMF path exists for callers that can't set SIP headers — for example, a calling system that uses a standard PSTN gateway and can't modify SIP headers directly. It's slower (the DTMF sequence itself takes 1–3 seconds) but still faster than voice processing.
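The ingress timing for the DTMF path looks roughly like this sketch, where wait_for_dtmf is a hypothetical stand-in for the telephony stack's tone listener:

```python
import asyncio
from typing import Optional

DTMF_WINDOW_S = 4.0  # matches the 2-4 second window described above

async def dtmf_gate(call) -> Optional[str]:
    """Listen for a DTMF token sequence before playing the greeting.

    Returns the decoded token ID, or None if no tones arrive in time.
    """
    try:
        # wait_for_dtmf is a stand-in for the tone listener
        return await asyncio.wait_for(call.wait_for_dtmf(), DTMF_WINDOW_S)
    except asyncio.TimeoutError:
        return None  # no DTMF: proceed to the voice greeting / Layer 3
```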
Token validation is a JWT signature check plus a scope check:
// Valid bot token payload
{
  "sub": "caller_acme_orch_001",
  "client_id": "cli_acme_001",
  "type": "caller_token",
  "scopes": ["structured_requests", "appointment_reschedule"],
  "issued_at": 1750000000,
  "jti": "tok_unique_id_abc"
}
A valid token with the appropriate scopes triggers immediate bot mode. No further detection needed.
A token that fails validation (expired, revoked, wrong signature) falls through to Layer 3 rather than failing the call. We don't block calls on token failures — we degrade to human mode. A human who somehow ends up with a token in their SIP headers gets a normal voice greeting.
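The validation itself is standard JWT machinery. A sketch using PyJWT; the key handling, required scope, and revocation hook are assumptions, but the fall-through-to-Layer-3 behavior mirrors the text:

```python
import jwt  # PyJWT

REQUIRED_SCOPES = {"structured_requests"}

def validate_caller_token(token: str, public_key: str) -> bool:
    """Layer 2: True routes to bot mode, False falls through to Layer 3."""
    try:
        payload = jwt.decode(token, public_key, algorithms=["RS256"])
    except jwt.InvalidTokenError:
        # Expired, malformed, or wrong signature: never fail the call
        return False
    if payload.get("type") != "caller_token":
        return False
    # A revocation lookup keyed on payload["jti"] would go here
    # (assumed, not shown).
    # Scope check: the token must grant structured-request capability
    return REQUIRED_SCOPES.issubset(payload.get("scopes", []))
```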
Layer 3: First-Utterance Pattern Analysis
When Layers 1 and 2 don't resolve detection, the agent plays a short greeting and listens to the first utterance. The utterance is classified by a lightweight pattern analyzer (not a full LLM inference call) against a set of features that distinguish AI callers from human callers.
Features the classifier uses:
1. Preamble and silence pattern. Human callers almost always have a brief pause after the greeting before they start speaking — they're processing the greeting, formulating a response. AI callers typically start speaking immediately after the greeting ends, with no ambient noise and minimal pause.
2. Structural declaration without social preamble. Humans say "Hi, um, I was calling about..." or "Yeah hi, I wanted to..." AI callers tend to state their purpose immediately and directly: "This is an automated scheduling system. Request type: appointment_reschedule. Patient ID: p-8820..." The absence of social preamble is a strong signal.
3. Utterance vocabulary and structure. AI callers often include identifiers, reference numbers, or structured data early in their utterance. Humans rarely open a call with an appointment ID.
4. Audio characteristics. TTS-generated speech has consistent acoustic properties — uniform speaking rate, minimal disfluency, absence of breath sounds and background noise. The classifier includes audio-level features alongside text-level ones.
The classifier runs in under 400ms on the first utterance (typically 3–8 seconds of audio). It outputs a confidence score (0.0–1.0) and a direction (human or bot).
Confidence thresholds:
confidence >= 0.80 → route to detected mode
confidence < 0.80 → fall back to human mode
The asymmetry is intentional. We need high confidence to route to bot mode. We fall back to human mode on uncertainty. A human who gets mistaken for a bot has a confusing, broken experience. A bot that gets mistaken for a human has to navigate a voice conversation — annoying but recoverable.
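The routing rule is small enough to show in full. In this sketch, only the threshold logic reflects the behavior described above; the classifier producing the inputs is a black box here:

```python
BOT_THRESHOLD = 0.80

def route_from_utterance(direction: str, confidence: float) -> str:
    # Bot mode requires both a bot verdict and high confidence
    if direction == "bot" and confidence >= BOT_THRESHOLD:
        return "bot"
    # Everything else, including all uncertainty, lands in human mode:
    # the cheaper mistake to recover from
    return "human"
```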
The Decision Tree Combined
Call connects
  ↓
[Layer 1] SIP header check (<10ms)
  → X-WFW-Bot-Token present? → [Layer 2] Validate token
      → Valid? → BOT MODE
      → Invalid? → [Layer 3]
  → X-Caller-Type: ai-caller? → [Layer 2] Validate token (if present)
  → No bot signals → Play greeting, listen for first utterance
      ↓
[Layer 3] Pattern analysis (<400ms from first word)
  → Confidence >= 0.80, direction=bot → BOT MODE
  → Otherwise → HUMAN MODE (default)
The key property of this tree: HUMAN MODE is always the fallback. There is no detection failure state that routes an unknown caller to bot mode. When in doubt, human mode.
What Happens After Detection
Human path:
The agent greets the caller conversationally (if not already done during Layer 3), routes the call to the appropriate intent handler, and proceeds with the configured voice agent behavior — TTS responses, conversational turns, escalation logic, appointment booking tools.
Bot path:
The agent suppresses TTS output entirely and returns a structured acknowledgment:
{
  "mode": "bot",
  "agent_id": "agt_rldnt_a8b2c1",
  "session_id": "sess_9f2b1c",
  "accepted_request_types": [
    "appointment_schedule",
    "appointment_reschedule",
    "availability_query"
  ],
  "auth": "verified",
  "protocol_version": "wfw-a2a-1.0"
}
The calling AI parses this acknowledgment and sends its first structured request. From here, the exchange is pure JSON over the audio channel — no speech synthesis, no audio parsing. The phone network is just the transport layer.
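From the calling system's side, bot mode starts by parsing that acknowledgment and confirming the agent accepts the request type you're about to send. A sketch, with transport abstracted away; field names follow the acknowledgment above:

```python
import json

def handle_mode_ack(raw: str, request_type: str) -> dict:
    """Parse the agent's acknowledgment and gate the first request."""
    ack = json.loads(raw)
    if ack.get("mode") != "bot":
        raise RuntimeError("agent did not enter bot mode")
    if request_type not in ack.get("accepted_request_types", []):
        raise RuntimeError(f"agent does not accept {request_type!r}")
    # Carry the session_id on every later message in the exchange
    return {"session_id": ack["session_id"],
            "protocol_version": ack["protocol_version"]}
```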
Structured Request and Response Format
A complete bot-mode exchange for appointment rescheduling:
Caller sends:
{
  "request_type": "appointment_reschedule",
  "appointment_id": "apt_22891",
  "patient_id": "p_8820",
  "preferred_window": {
    "day_of_week": "tuesday",
    "weeks_out": 1
  },
  "authorization_token": "pat_auth_cc_approved_8820_0412"
}
Agent responds:
{
  "status": "pending_confirmation",
  "available_slots": [
    {
      "slot_id": "slot_9943",
      "datetime": "2026-07-15T14:00:00",
      "duration_minutes": 60,
      "provider": "Dr. Reeves"
    }
  ],
  "requires_patient_confirmation": false,
  "authorization_validated": true
}
Caller confirms:
{
  "action": "confirm",
  "slot_id": "slot_9943",
  "session_id": "sess_9f2b1c"
}
Agent confirms:
{
  "status": "confirmed",
  "appointment_id": "apt_22891",
  "new_datetime": "2026-07-15T14:00:00",
  "confirmation_number": "RLD-2026-9930",
  "pms_updated": true,
  "patient_sms_sent": true
}
Total round-trip from call connect to confirmed reschedule: under 5 seconds. No audio synthesized or parsed.
Registering a Calling System
To get the Layer 2 benefits (definitive detection, token-based auth), register your calling system:
curl -X POST https://api.workforcewave.com/v2/callers/register \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Acme Orchestrator v2",
    "description": "Healthcare coordination platform",
    "request_types": ["appointment_reschedule", "availability_query"],
    "sip_header_capable": true
  }'

The response:

{
  "caller_id": "caller_acme_orch_001",
  "token": "eyJhbGciOiJSUzI1NiJ9...",
  "sip_header_name": "X-WFW-Bot-Token",
  "expires_at": "2027-06-29T00:00:00Z"
}
Include the returned token in your X-WFW-Bot-Token SIP header on every call. Layer 1 will detect it; Layer 2 will validate it; you'll be in bot mode before the first audio frame is processed.
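What that looks like in code depends on your SIP stack, but it reduces to three header lines merged into the outbound INVITE. A sketch; the User-Agent value is illustrative:

```python
def wfw_bot_headers(caller_token: str) -> dict:
    """Custom SIP headers to merge into the outbound INVITE."""
    return {
        "User-Agent": "Acme-Orchestrator/2.0 (ai-caller)",
        "X-Caller-Type": "automated-system",
        "X-WFW-Bot-Token": f"Bearer {caller_token}",
    }
```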
Why This Architecture
The three-layer design exists because no single layer works universally.
SIP header inspection is fast and definitive but only works for well-configured callers. Pre-negotiated tokens are the most reliable mechanism but require pre-registration, which not every caller has done. First-utterance analysis works for any caller but is slower and probabilistic.
Stacking them in order of speed and certainty — with human mode as the fallback — gives the best aggregate behavior: registered AI callers get fast, definitive detection; unregistered AI callers have a chance to be detected via patterns; unknown callers default to human mode.
The 500ms budget disciplines the design. Every layer that takes longer has to prove it's worth the latency. The classifier in Layer 3 runs on a purpose-built model, not a general-purpose LLM, specifically because general-purpose inference can't reliably hit sub-400ms in a real-time voice context.
The phone number is just the address. What happens after connection is the interesting part.
This concludes the Developer Deep Dives series, Volume 1. Volume 2 covers the Workforce Wave prompt optimization pipeline, KB versioning and diff semantics, and the extraction confidence calibration system.