Caller-Type Detection at 500ms: How to Tell a Human from an AI Mid-Call
When a call arrives at a WFW-powered phone number, the first question the platform answers isn't "how can I help you?" It's "who am I talking to?"
That distinction matters because the optimal responses to an AI caller and a human caller are completely different. A human caller gets a warm voice greeting and natural conversation. An AI caller should get a direct JSON response, zero TTS latency, and structured data instead of spoken language. Same underlying agent service. Different output mode.
The catch: the detection decision has to happen in under 500ms. The agent generates its first utterance immediately after the call connects. If we haven't identified the caller type by then, we either delay the human greeting (bad user experience) or we commit to human mode and handle an AI caller incorrectly.
Here's the three-signal detection system we built.
Signal 1: SIP Headers (Zero Latency)
SIP (Session Initiation Protocol) calls carry metadata headers before audio ever flows. This is the fastest signal — we get it at the moment the INVITE lands, before the call is even answered.
AI callers that have pre-registered with us can include an X-WFW-Caller-Type header:
INVITE sip:+18434567890@wfw.sip.twilio.com SIP/2.0
X-WFW-Caller-Type: bot
X-WFW-Bot-Token: eyJhbGciOiJIUzI1NiJ9...
X-WFW-Client-Id: client_abc123
Twilio exposes these custom SIP headers through its webhook payload, which we receive at call start:
// lib/telephony/call-handler.ts

interface TwilioCallWebhookBody {
  CallSid: string;
  From: string;
  To: string;
  // Custom SIP headers are passed as SipHeader_* fields
  SipHeader_X_WFW_Caller_Type?: string;
  SipHeader_X_WFW_Bot_Token?: string;
  SipHeader_X_WFW_Client_Id?: string;
  // ... other Twilio fields
}

/**
 * Extract caller type from SIP headers.
 * Returns "bot" only if both the header is present AND the token is valid.
 * A header alone is not sufficient — anyone can set a SIP header.
 */
async function detectFromSIPHeaders(
  body: TwilioCallWebhookBody
): Promise<"bot" | "unknown"> {
  const callerTypeHeader = body.SipHeader_X_WFW_Caller_Type;
  const botToken = body.SipHeader_X_WFW_Bot_Token;
  const clientId = body.SipHeader_X_WFW_Client_Id;

  if (callerTypeHeader !== "bot" || !botToken || !clientId) {
    return "unknown";
  }

  // Validate the bot token against the client's service account
  const isValid = await validateBotSIPToken(botToken, clientId);
  return isValid ? "bot" : "unknown";
}
SIP header detection has zero latency (we get it before audio) and zero false positives (it requires a valid cryptographic token). It covers all AI-to-AI calls in our system where the caller is a known WFW partner agent.
For third-party AI callers that haven't pre-registered, we fall through to the next signals.
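At its core, token validation is signature verification. Here is a minimal sketch of the idea, assuming the bot token is an HS256 JWT signed with a per-client shared secret; the real validateBotSIPToken would first resolve that secret from the client's service account, and the helper names (verifyBotTokenSignature, b64url) are illustrative.

```typescript
// Sketch only: verify an HS256 JWT signature given a known shared secret.
// In the real flow, the secret would be looked up by clientId first.
import { createHmac, timingSafeEqual } from "node:crypto";

const b64url = (buf: Buffer): string =>
  buf.toString("base64").replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");

export function verifyBotTokenSignature(token: string, secret: string): boolean {
  const parts = token.split(".");
  if (parts.length !== 3) return false;
  const [header, payload, signature] = parts;

  // Recompute the HMAC over header.payload and compare in constant time
  const expected = b64url(
    createHmac("sha256", secret).update(`${header}.${payload}`).digest()
  );
  if (expected.length !== signature.length) return false;
  return timingSafeEqual(Buffer.from(expected), Buffer.from(signature));
}
```

A signature check alone only proves the caller holds the client secret; expiry and claim checks (exp, binding the token to the clientId header) would sit on top of this.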
Signal 2: DTMF Token (300ms Window)
AI callers can send a specific DTMF sequence in the first 300ms of the call as a pre-negotiated machine handshake. We listen for the sequence ##7823 during the initial audio window.
// lib/telephony/dtmf-detector.ts

/** The pre-negotiated DTMF sequence that identifies WFW bot callers */
const WFW_BOT_DTMF_TOKEN = "##7823";

/**
 * Listen for the WFW bot DTMF token during the call setup window.
 * We give the caller 300ms to send it before falling through to voice analysis.
 *
 * Using TwiML <Gather> with a 300ms timeout:
 * - If DTMF token received → bot confirmed
 * - If timeout → unknown (proceed to voice analysis)
 */
export function buildDTMFDetectionTwiML(): string {
  return `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather
    input="dtmf"
    numDigits="6"
    timeout="0.3"
    action="/api/v2/telephony/dtmf-result"
    method="POST"
  >
    <!-- Silent: no audio plays during detection window -->
  </Gather>
  <!-- If no DTMF received, fall through to voice analysis -->
  <Redirect>/api/v2/telephony/voice-analysis</Redirect>
</Response>`;
}

/**
 * Handle the DTMF result from Twilio.
 * Called by Twilio after the <Gather> either receives digits or times out.
 */
export function processDTMFResult(
  digits: string | undefined,
  callSid: string
): "bot" | "unknown" {
  if (digits === WFW_BOT_DTMF_TOKEN) {
    // Confirmed bot — zero ambiguity
    return "bot";
  }
  // Either no digits or wrong sequence — proceed to voice analysis
  return "unknown";
}
DTMF token detection is deterministic. If the sequence matches, it's a bot. Period. No probability, no threshold. The tradeoff is that it requires the calling AI to be programmed to send it — which works for known callers but not unknown ones.
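For transports that deliver digits one at a time (a media-stream connection rather than a single <Gather> callback), the same match can run incrementally. A sketch under that assumption; the class below is illustrative, not part of the production handler:

```typescript
// Sketch: incremental matcher for digit-at-a-time DTMF delivery.
// Bails out of the detection window as soon as a match becomes impossible.
export class DTMFMatcher {
  private static readonly TOKEN = "##7823";
  private buffer = "";

  /** Feed one digit; returns "bot" once the full token has been received. */
  push(digit: string): "bot" | "pending" | "failed" {
    this.buffer += digit;
    if (this.buffer === DTMFMatcher.TOKEN) return "bot";
    // If the collected prefix can no longer match the token, stop early
    if (!DTMFMatcher.TOKEN.startsWith(this.buffer)) return "failed";
    return "pending";
  }
}
```

Feeding it digits as they arrive lets the handler abandon detection on the first non-matching digit instead of waiting out the full 300ms window.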
Signal 3: First Utterance Pattern Analysis
For callers that didn't send SIP headers or DTMF, we analyze the first utterance. Human speech and AI agent speech have measurably different patterns.
Humans: "Hi, um, I wanted to ask about getting my, uh, furnace checked before winter?"
AI agents: "SCHEDULE_APPOINTMENT customer_id:cust_8823 service:HVAC_MAINTENANCE date_preference:next_week"
The distinguishing features aren't about content — they're about structure:
- Disfluency rate: humans use filler words (um, uh, like, so). Structured AI calls have zero fillers.
- Parse structure: AI callers often use key:value notation or structured prefixes, even in natural-language mode.
- Onset latency: humans have a brief hesitation after the greeting. AI agents respond at near-zero latency.
- Token density: AI utterances have high information density in the first 5 words. Human greetings have low density ("Hi, I was just wondering if...").
We run a lightweight classifier on the first 5 words of the utterance transcribed at 300ms:
// lib/telephony/voice-classifier.ts

interface UtteranceFeatures {
  firstFiveWords: string[];
  hasDisfluencies: boolean;     // "um", "uh", "like", "so"
  hasStructuredTokens: boolean; // key:value, CAPS_IDENTIFIERS, numeric IDs
  wordCount: number;
  onsetLatencyMs: number;       // time from call answer to first word
}

/**
 * Binary classifier: returns probability that caller is a bot.
 * Threshold: > 0.7 → route as bot. Otherwise → route as human.
 * We use a conservative threshold — a false negative (treating a bot as
 * human) is less harmful than a false positive (treating a human as a bot).
 *
 * Accuracy on our test set: 94% (human), 91% (bot)
 * False positive rate: 2% (humans incorrectly classified as bots)
 */
export function classifyCallerType(features: UtteranceFeatures): {
  classification: "bot" | "human";
  confidence: number;
} {
  let botScore = 0;

  // High-weight signal: structured tokens strongly indicate a bot
  if (features.hasStructuredTokens) botScore += 0.5;

  // High-weight signal: zero disfluencies + low onset latency
  if (!features.hasDisfluencies && features.onsetLatencyMs < 100) botScore += 0.3;

  // Medium-weight signal: high word density in opening
  const wordDensity = features.wordCount / 5;
  if (wordDensity > 0.8) botScore += 0.15;

  // Medium-weight signal: disfluencies present strongly suggest a human
  if (features.hasDisfluencies) botScore -= 0.4;

  // Clamp to [0, 1]; this is the estimated probability the caller is a bot
  const confidence = Math.min(Math.max(botScore, 0), 1);
  const classification = confidence > 0.7 ? "bot" : "human";

  return { classification, confidence };
}
This classifier gets to 94% accuracy on human callers and 91% on bot callers — good but not perfect. That's why it's the last-resort signal, not the primary one.
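For completeness, here is a sketch of how the UtteranceFeatures could be derived from a raw transcript snippet. The disfluency list and the structured-token regex are illustrative assumptions, and the interface is repeated so the sketch stands alone:

```typescript
// Sketch: feature extraction from a transcript snippet. The disfluency set
// and STRUCTURED_TOKEN regex are assumptions, not the production values.
interface UtteranceFeatures {
  firstFiveWords: string[];
  hasDisfluencies: boolean;
  hasStructuredTokens: boolean;
  wordCount: number;
  onsetLatencyMs: number;
}

const DISFLUENCIES = new Set(["um", "uh", "like", "so", "hmm", "er"]);
// key:value pairs, ALL_CAPS identifiers, or ID-style tokens like cust_8823
const STRUCTURED_TOKEN = /(\w+:\w+)|([A-Z][A-Z_]{3,})|([a-z]+_\d+)/;

export function extractFeatures(
  transcript: string,
  onsetLatencyMs: number
): UtteranceFeatures {
  const words = transcript.trim().split(/\s+/).filter(Boolean);
  const firstFiveWords = words.slice(0, 5);
  return {
    firstFiveWords,
    hasDisfluencies: firstFiveWords.some((w) =>
      DISFLUENCIES.has(w.toLowerCase().replace(/[.,?!]/g, ""))
    ),
    hasStructuredTokens: firstFiveWords.some((w) => STRUCTURED_TOKEN.test(w)),
    wordCount: words.length, // words captured in the transcription window
    onsetLatencyMs,
  };
}
```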
The Routing Decision
The three signals combine into a priority-ordered decision:
// lib/telephony/caller-routing.ts

export type CallerType = "human" | "bot";

export interface RoutingResult {
  callerType: CallerType;
  detectionMethod: "sip_header" | "dtmf_token" | "voice_classifier" | "default";
  confidence: number;
  /** The actor context to use for agent dispatch — differs between human/bot */
  actorContext: ActorContext;
}

/**
 * Determine caller type and return routing decision.
 * Priority: SIP header → DTMF token → voice classifier → default (human)
 * Errors conservatively: if uncertain, treat as human.
 */
export async function resolveCallerType(
  sipDetection: "bot" | "unknown",
  dtmfDetection: "bot" | "unknown",
  voiceFeatures: UtteranceFeatures | null,
  callSid: string,
  agentConfig: AgentConfig
): Promise<RoutingResult> {
  // SIP header detection: highest confidence, zero latency
  if (sipDetection === "bot") {
    return {
      callerType: "bot",
      detectionMethod: "sip_header",
      confidence: 1.0,
      actorContext: await buildBotActorContext(callSid),
    };
  }

  // DTMF token: deterministic, no probability
  if (dtmfDetection === "bot") {
    return {
      callerType: "bot",
      detectionMethod: "dtmf_token",
      confidence: 1.0,
      actorContext: await buildBotActorContext(callSid),
    };
  }

  // Voice classifier: probabilistic
  if (voiceFeatures) {
    const { classification, confidence } = classifyCallerType(voiceFeatures);
    if (classification === "bot" && confidence > 0.7) {
      return {
        callerType: "bot",
        detectionMethod: "voice_classifier",
        confidence,
        actorContext: await buildBotActorContext(callSid),
      };
    }
  }

  // Default: human. False negative (bot treated as human) is less harmful
  // than false positive (human treated as bot) — humans get a slightly slower
  // response, bots get a voice greeting they'll ignore.
  return {
    callerType: "human",
    detectionMethod: "default",
    confidence: 0.5,
    actorContext: buildHumanCallerContext(callSid),
  };
}
What Changes Between Modes
Once the routing decision is made, the two modes diverge:
Human mode:
- Greet with TTS-rendered voice message
- Run natural conversation loop
- Tool calls trigger voice-formatted responses
- Session ends on caller hangup or agent completion
Bot mode:
- Skip TTS entirely — send JSON directly
- Enforce sub-500ms response latency (no excuse for TTS delay)
- Validate bearer token from SIP header before processing any tool calls
- Return { outcome, data, event_id } on each exchange instead of voice output
- Fire a call.bot_completed webhook when done, rather than recording a voice transcript
// The dual-mode response type
export type CallResponse =
  | { mode: "voice"; twimlResponse: string }
  | { mode: "bot"; outcome: string; data: Record<string, unknown>; eventId: string };
Both modes use the same underlying AgentService — the same tools, the same knowledge base, the same compliance rules. Only the serialization layer changes.
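That serialization layer can be sketched as a single branch over the caller type. The AgentResult shape and the inline TwiML rendering below are assumptions for illustration, not the production types:

```typescript
// Sketch: the only mode-dependent step is serialization.
// AgentResult is a hypothetical shape carrying both renderings of a result.
type CallResponse =
  | { mode: "voice"; twimlResponse: string }
  | { mode: "bot"; outcome: string; data: Record<string, unknown>; eventId: string };

interface AgentResult {
  outcome: string;
  data: Record<string, unknown>;
  eventId: string;
  spokenText: string; // natural-language rendering of the same result
}

export function serializeResponse(
  result: AgentResult,
  callerType: "human" | "bot"
): CallResponse {
  if (callerType === "bot") {
    // Bot mode: structured data, no TTS step at all
    const { outcome, data, eventId } = result;
    return { mode: "bot", outcome, data, eventId };
  }
  // Human mode: wrap the spoken text in TwiML for TTS playback
  const escaped = result.spokenText
    .replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
  return {
    mode: "voice",
    twimlResponse: `<Response><Say>${escaped}</Say></Response>`,
  };
}
```

Keeping the branch at the very edge of the pipeline is what lets everything upstream of it stay mode-agnostic.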
The caller_detection_mode Configuration
Partners can configure how their agents handle detection. Some agents should never accept bot calls (a customer service line where bot-initiated calls would be unusual). Others are built specifically for bot-to-bot communication.
export type CallerDetectionMode =
  | "human_only" // Reject calls that detect as bot
  | "bot_only"   // Reject calls that detect as human (B2B integrations)
  | "dual";      // Accept both, route appropriately

// In agent config
interface AgentConfig {
  caller_detection_mode: CallerDetectionMode;
  // ...
}
human_only agents hang up on confidently-detected bot calls. bot_only agents reject human callers with a recorded message directing them to the web. dual agents route dynamically — most agents in production use dual.
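The gate itself reduces to a small pure function over the detection result. A sketch, with the reject reasons as illustrative names:

```typescript
// Sketch: apply the agent's caller_detection_mode to a detection result.
// The GateDecision shape and reason strings are illustrative assumptions.
type CallerDetectionMode = "human_only" | "bot_only" | "dual";
type CallerType = "human" | "bot";

type GateDecision =
  | { action: "accept"; callerType: CallerType }
  | { action: "reject"; reason: string };

export function applyDetectionMode(
  mode: CallerDetectionMode,
  callerType: CallerType
): GateDecision {
  if (mode === "human_only" && callerType === "bot") {
    // Confident bot detection on a human-only line: hang up
    return { action: "reject", reason: "bot_calls_not_accepted" };
  }
  if (mode === "bot_only" && callerType === "human") {
    // Played to the caller as a recorded message pointing them to the web
    return { action: "reject", reason: "human_calls_redirected" };
  }
  return { action: "accept", callerType };
}
```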
The 500ms constraint is real and binding. We measured it. Human callers notice a pause longer than 500ms before the first greeting — it reads as a dead line. The entire detection pipeline — SIP header parsing (0ms), DTMF collection (300ms timeout), voice transcription snippet (100ms) — is designed to fit inside that window, with 100ms of margin.