Caller-Type Detection at 500ms: How to Tell a Human from an AI Mid-Call
When a call arrives at a WFW-powered phone number, the first question the platform answers isn't "how can I help you?" It's "who am I talking to?"
That distinction matters because the optimal responses to an AI caller and a human caller are completely different. A human caller gets a warm voice greeting and natural conversation. An AI caller should get a direct JSON response, zero TTS latency, and structured data instead of spoken language. Same underlying agent service. Different output mode.
The catch: the detection decision has to happen in under 500ms. The agent generates its first utterance immediately after the call connects. If we haven't identified the caller type by then, we either delay the human greeting (bad user experience) or we commit to human mode and handle an AI caller incorrectly.
Here's the three-signal detection system we built.
Signal 1: SIP Headers (Zero Latency)
SIP (Session Initiation Protocol) calls carry metadata headers before audio ever flows. This is the fastest signal — we get it at the moment the INVITE lands, before the call is even answered.
AI callers that have pre-registered with us can include an X-WFW-Caller-Type header:
INVITE sip:+18434567890@wfw.sip.twilio.com SIP/2.0
X-WFW-Caller-Type: bot
X-WFW-Bot-Token: eyJhbGciOiJIUzI1NiJ9...
X-WFW-Client-Id: client_abc123
Twilio exposes these custom SIP headers through its webhook payload, which we receive at call start:
// lib/telephony/call-handler.ts

interface TwilioCallWebhookBody {
  CallSid: string;
  From: string;
  To: string;
  // Custom SIP headers are passed as SipHeader_* fields
  SipHeader_X_WFW_Caller_Type?: string;
  SipHeader_X_WFW_Bot_Token?: string;
  SipHeader_X_WFW_Client_Id?: string;
  // ... other Twilio fields
}

/**
 * Extract caller type from SIP headers.
 * Returns "bot" only if both the header is present AND the token is valid.
 * A header alone is not sufficient — anyone can set a SIP header.
 */
async function detectFromSIPHeaders(
  body: TwilioCallWebhookBody
): Promise<"bot" | "unknown"> {
  const callerTypeHeader = body.SipHeader_X_WFW_Caller_Type;
  const botToken = body.SipHeader_X_WFW_Bot_Token;
  const clientId = body.SipHeader_X_WFW_Client_Id;

  if (callerTypeHeader !== "bot" || !botToken || !clientId) {
    return "unknown";
  }

  // Validate the bot token against the client's service account
  const isValid = await validateBotSIPToken(botToken, clientId);
  return isValid ? "bot" : "unknown";
}
SIP header detection has zero latency (we get it before audio) and zero false positives (it requires a valid cryptographic token). It covers all AI-to-AI calls in our system where the caller is a known WFW partner agent.
For third-party AI callers that haven't pre-registered, we fall through to the next signals.
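At its core, token validation is signature verification. Here is a minimal sketch of the idea, assuming the bot token is an HS256 JWT signed with a per-client shared secret; the real validateBotSIPToken would first resolve that secret from the client's service account, and the helper names (verifyBotTokenSignature, b64url) are illustrative.

```typescript
// Sketch only: verify an HS256 JWT signature given a known shared secret.
// In the real flow, the secret would be looked up by clientId first.
import { createHmac, timingSafeEqual } from "node:crypto";

const b64url = (buf: Buffer): string =>
  buf.toString("base64").replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");

export function verifyBotTokenSignature(token: string, secret: string): boolean {
  const parts = token.split(".");
  if (parts.length !== 3) return false;
  const [header, payload, signature] = parts;

  // Recompute the HMAC over header.payload and compare in constant time
  const expected = b64url(
    createHmac("sha256", secret).update(`${header}.${payload}`).digest()
  );
  if (expected.length !== signature.length) return false;
  return timingSafeEqual(Buffer.from(expected), Buffer.from(signature));
}
```

A signature check alone only proves the caller holds the client secret; expiry and claim checks (exp, binding the token to the clientId header) would sit on top of this.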
Signal 2: DTMF Token (300ms Window)
AI callers can send a specific DTMF sequence in the first 300ms of the call as a pre-negotiated machine handshake. We listen for the sequence ##7823 during the initial audio window.
// lib/telephony/dtmf-detector.ts

/** The pre-negotiated DTMF sequence that identifies WFW bot callers */
const WFW_BOT_DTMF_TOKEN = "##7823";

/**
 * Listen for the WFW bot DTMF token during the call setup window.
 * We give the caller 300ms to send it before falling through to voice analysis.
 *
 * Using TwiML <Gather> with a 300ms timeout:
 * - If DTMF token received → bot confirmed
 * - If timeout → unknown (proceed to voice analysis)
 */
export function buildDTMFDetectionTwiML(): string {
  return `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather
    input="dtmf"
    numDigits="6"
    timeout="0.3"
    action="/api/v2/telephony/dtmf-result"
    method="POST"
  >
    <!-- Silent: no audio plays during detection window -->
  </Gather>
  <!-- If no DTMF received, fall through to voice analysis -->
  <Redirect>/api/v2/telephony/voice-analysis</Redirect>
</Response>`;
}

/**
 * Handle the DTMF result from Twilio.
 * Called by Twilio after the <Gather> either receives digits or times out.
 */
export function processDTMFResult(
  digits: string | undefined,
  callSid: string
): "bot" | "unknown" {
  if (digits === WFW_BOT_DTMF_TOKEN) {
    // Confirmed bot — zero ambiguity
    return "bot";
  }
  // Either no digits or wrong sequence — proceed to voice analysis
  return "unknown";
}
DTMF token detection is deterministic. If the sequence matches, it's a bot. Period. No probability, no threshold. The tradeoff is that it requires the calling AI to be programmed to send it — which works for known callers but not unknown ones.
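For transports that deliver digits one at a time (a media-stream connection rather than a single <Gather> callback), the same match can run incrementally. A sketch under that assumption; the class below is illustrative, not part of the production handler:

```typescript
// Sketch: incremental matcher for digit-at-a-time DTMF delivery.
// Bails out of the detection window as soon as a match becomes impossible.
export class DTMFMatcher {
  private static readonly TOKEN = "##7823";
  private buffer = "";

  /** Feed one digit; returns "bot" once the full token has been received. */
  push(digit: string): "bot" | "pending" | "failed" {
    this.buffer += digit;
    if (this.buffer === DTMFMatcher.TOKEN) return "bot";
    // If the collected prefix can no longer match the token, stop early
    if (!DTMFMatcher.TOKEN.startsWith(this.buffer)) return "failed";
    return "pending";
  }
}
```

Feeding it digits as they arrive lets the handler abandon detection on the first non-matching digit instead of waiting out the full 300ms window.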
Signal 3: First Utterance Pattern Analysis
For callers that didn't send SIP headers or DTMF, we analyze the first utterance. Human speech and AI agent speech have measurably different patterns.
Humans: "Hi, um, I wanted to ask about getting my, uh, furnace checked before winter?"
AI agents: "SCHEDULE_APPOINTMENT customer_id:cust_8823 service:HVAC_MAINTENANCE date_preference:next_week"
The distinguishing features aren't about content — they're about structure:
- Disfluency rate: humans use filler words (um, uh, like, so). Structured AI calls have zero fillers.
- Parse structure: AI callers often use key:value notation or structured prefixes, even in natural-language mode.
- Onset latency: humans have a brief hesitation after the greeting. AI agents respond at near-zero latency.
- Token density: AI utterances have high information density in the first 5 words. Human greetings have low density ("Hi, I was just wondering if...").
We run a lightweight classifier on the first 5 words of the utterance transcribed at 300ms:
// lib/telephony/voice-classifier.ts

interface UtteranceFeatures {
  firstFiveWords: string[];
  hasDisfluencies: boolean;     // "um", "uh", "like", "so"
  hasStructuredTokens: boolean; // key:value, CAPS_IDENTIFIERS, numeric IDs
  wordCount: number;
  onsetLatencyMs: number;       // time from call answer to first word
}

/**
 * Binary classifier: returns probability that caller is a bot.
 * Threshold: > 0.7 → route as bot. Otherwise → route as human.
 * We use a conservative threshold — a false negative (treating a bot as
 * human) is less harmful than a false positive (treating a human as a bot).
 *
 * Accuracy on our test set: 94% (human), 91% (bot)
 * False positive rate: 2% (humans incorrectly classified as bots)
 */
export function classifyCallerType(features: UtteranceFeatures): {
  classification: "bot" | "human";
  confidence: number;
} {
  let botScore = 0;

  // High-weight signal: structured tokens strongly indicate a bot
  if (features.hasStructuredTokens) botScore += 0.5;

  // High-weight signal: zero disfluencies + low onset latency
  if (!features.hasDisfluencies && features.onsetLatencyMs < 100) botScore += 0.3;

  // Medium-weight signal: high word density in opening
  const wordDensity = features.wordCount / 5;
  if (wordDensity > 0.8) botScore += 0.15;

  // Medium-weight signal: disfluencies present strongly suggest a human
  if (features.hasDisfluencies) botScore -= 0.4;

  // Clamp to [0, 1]; this is the estimated probability the caller is a bot
  const confidence = Math.min(Math.max(botScore, 0), 1);
  const classification = confidence > 0.7 ? "bot" : "human";

  return { classification, confidence };
}
This classifier gets to 94% accuracy on human callers and 91% on bot callers — good but not perfect. That's why it's the last-resort signal, not the primary one.
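For completeness, here is a sketch of how the UtteranceFeatures could be derived from a raw transcript snippet. The disfluency list and the structured-token regex are illustrative assumptions, and the interface is repeated so the sketch stands alone:

```typescript
// Sketch: feature extraction from a transcript snippet. The disfluency set
// and STRUCTURED_TOKEN regex are assumptions, not the production values.
interface UtteranceFeatures {
  firstFiveWords: string[];
  hasDisfluencies: boolean;
  hasStructuredTokens: boolean;
  wordCount: number;
  onsetLatencyMs: number;
}

const DISFLUENCIES = new Set(["um", "uh", "like", "so", "hmm", "er"]);
// key:value pairs, ALL_CAPS identifiers, or ID-style tokens like cust_8823
const STRUCTURED_TOKEN = /(\w+:\w+)|([A-Z][A-Z_]{3,})|([a-z]+_\d+)/;

export function extractFeatures(
  transcript: string,
  onsetLatencyMs: number
): UtteranceFeatures {
  const words = transcript.trim().split(/\s+/).filter(Boolean);
  const firstFiveWords = words.slice(0, 5);
  return {
    firstFiveWords,
    hasDisfluencies: firstFiveWords.some((w) =>
      DISFLUENCIES.has(w.toLowerCase().replace(/[.,?!]/g, ""))
    ),
    hasStructuredTokens: firstFiveWords.some((w) => STRUCTURED_TOKEN.test(w)),
    wordCount: words.length, // words captured in the transcription window
    onsetLatencyMs,
  };
}
```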
The Routing Decision
The three signals combine into a priority-ordered decision:
// lib/telephony/caller-routing.ts

export type CallerType = "human" | "bot";

export interface RoutingResult {
  callerType: CallerType;
  detectionMethod: "sip_header" | "dtmf_token" | "voice_classifier" | "default";
  confidence: number;
  /** The actor context to use for agent dispatch — differs between human/bot */
  actorContext: ActorContext;
}

/**
 * Determine caller type and return routing decision.
 * Priority: SIP header → DTMF token → voice classifier → default (human)
 * Errors conservatively: if uncertain, treat as human.
 */
export async function resolveCallerType(
  sipDetection: "bot" | "unknown",
  dtmfDetection: "bot" | "unknown",
  voiceFeatures: UtteranceFeatures | null,
  callSid: string,
  agentConfig: AgentConfig
): Promise<RoutingResult> {
  // SIP header detection: highest confidence, zero latency
  if (sipDetection === "bot") {
    return {
      callerType: "bot",
      detectionMethod: "sip_header",
      confidence: 1.0,
      actorContext: await buildBotActorContext(callSid),
    };
  }

  // DTMF token: deterministic, no probability
  if (dtmfDetection === "bot") {
    return {
      callerType: "bot",
      detectionMethod: "dtmf_token",
      confidence: 1.0,
      actorContext: await buildBotActorContext(callSid),
    };
  }

  // Voice classifier: probabilistic
  if (voiceFeatures) {
    const { classification, confidence } = classifyCallerType(voiceFeatures);
    if (classification === "bot" && confidence > 0.7) {
      return {
        callerType: "bot",
        detectionMethod: "voice_classifier",
        confidence,
        actorContext: await buildBotActorContext(callSid),
      };
    }
  }

  // Default: human. False negative (bot treated as human) is less harmful
  // than false positive (human treated as bot) — humans get a slightly slower
  // response, bots get a voice greeting they'll ignore.
  return {
    callerType: "human",
    detectionMethod: "default",
    confidence: 0.5,
    actorContext: buildHumanCallerContext(callSid),
  };
}
What Changes Between Modes
Once the routing decision is made, the two modes diverge:
Human mode:
- Greet with TTS-rendered voice message
- Run natural conversation loop
- Tool calls trigger voice-formatted responses
- Session ends on caller hangup or agent completion
Bot mode:
- Skip TTS entirely — send JSON directly
- Enforce sub-500ms response latency (no excuse for TTS delay)
- Validate bearer token from SIP header before processing any tool calls
- Return { outcome, data, event_id } on each exchange instead of voice output
- Fire a call.bot_completed webhook when done, rather than recording a voice transcript
// The dual-mode response type
export type CallResponse =
  | { mode: "voice"; twimlResponse: string }
  | { mode: "bot"; outcome: string; data: Record<string, unknown>; eventId: string };
Both modes use the same underlying AgentService — the same tools, the same knowledge base, the same compliance rules. Only the serialization layer changes.
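That serialization layer can be sketched as a single branch over the caller type. The AgentResult shape and the inline TwiML rendering below are assumptions for illustration, not the production types:

```typescript
// Sketch: the only mode-dependent step is serialization.
// AgentResult is a hypothetical shape carrying both renderings of a result.
type CallResponse =
  | { mode: "voice"; twimlResponse: string }
  | { mode: "bot"; outcome: string; data: Record<string, unknown>; eventId: string };

interface AgentResult {
  outcome: string;
  data: Record<string, unknown>;
  eventId: string;
  spokenText: string; // natural-language rendering of the same result
}

export function serializeResponse(
  result: AgentResult,
  callerType: "human" | "bot"
): CallResponse {
  if (callerType === "bot") {
    // Bot mode: structured data, no TTS step at all
    const { outcome, data, eventId } = result;
    return { mode: "bot", outcome, data, eventId };
  }
  // Human mode: wrap the spoken text in TwiML for TTS playback
  const escaped = result.spokenText
    .replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
  return {
    mode: "voice",
    twimlResponse: `<Response><Say>${escaped}</Say></Response>`,
  };
}
```

Keeping the branch at the very edge of the pipeline is what lets everything upstream of it stay mode-agnostic.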
The caller_detection_mode Configuration
Partners can configure how their agents handle detection. Some agents should never accept bot calls (a customer service line where bot-initiated calls would be unusual). Others are built specifically for bot-to-bot communication.
export type CallerDetectionMode =
  | "human_only" // Reject calls that detect as bot
  | "bot_only"   // Reject calls that detect as human (B2B integrations)
  | "dual";      // Accept both, route appropriately

// In agent config
interface AgentConfig {
  caller_detection_mode: CallerDetectionMode;
  // ...
}
human_only agents hang up on confidently-detected bot calls. bot_only agents reject human callers with a recorded message directing them to the web. dual agents route dynamically — most agents in production use dual.
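The gate itself reduces to a small pure function over the detection result. A sketch, with the reject reasons as illustrative names:

```typescript
// Sketch: apply the agent's caller_detection_mode to a detection result.
// The GateDecision shape and reason strings are illustrative assumptions.
type CallerDetectionMode = "human_only" | "bot_only" | "dual";
type CallerType = "human" | "bot";

type GateDecision =
  | { action: "accept"; callerType: CallerType }
  | { action: "reject"; reason: string };

export function applyDetectionMode(
  mode: CallerDetectionMode,
  callerType: CallerType
): GateDecision {
  if (mode === "human_only" && callerType === "bot") {
    // Confident bot detection on a human-only line: hang up
    return { action: "reject", reason: "bot_calls_not_accepted" };
  }
  if (mode === "bot_only" && callerType === "human") {
    // Played to the caller as a recorded message pointing them to the web
    return { action: "reject", reason: "human_calls_redirected" };
  }
  return { action: "accept", callerType };
}
```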
The 500ms constraint is real and binding. We measured it. Human callers notice a pause longer than 500ms before the first greeting — it reads as a dead line. The entire detection pipeline — SIP header parsing (0ms), DTMF collection (300ms timeout), voice transcription snippet (100ms) — is designed to fit inside that window, with 100ms of margin.