Glossary
ASR (Automatic Speech Recognition)
The technology that converts spoken audio into machine-readable text, often in real time.
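Automatic Speech Recognition (ASR), also called Speech-to-Text (STT), is the technology that converts spoken audio into written text. In voice agents it typically runs in streaming mode, producing transcripts in real time as the caller speaks. ASR is the first stage of a typical AI voice agent pipeline, so its output feeds every later step.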
How ASR Works
Modern ASR systems use deep neural networks trained on hundreds of thousands to millions of hours of speech data. When a caller speaks, the ASR model splits the audio into short frames, extracts acoustic features from each frame, matches them against patterns learned in training, and outputs the most likely text hypothesis. Leading ASR engines such as Google Speech-to-Text and OpenAI Whisper can reach 95% or higher word accuracy (a word error rate of roughly 5% or lower) on clear audio.
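Accuracy figures like the one above are usually reported through word error rate (WER): the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that calculation, assuming simple lowercase whitespace tokenization; the function name and the sample sentences are illustrative, not taken from any particular engine.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (assumed for this example).
reference = "i would like to book an appointment for next tuesday"
hypothesis = "i would like to book an appointment for next thursday"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}, word accuracy: {1 - wer:.0%}")
```

One substituted word in a ten-word reference gives a WER of 0.10, or 90% word accuracy. Production evaluations usually also normalize punctuation and numerals before scoring, which this sketch leaves out.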
Factors Affecting ASR Accuracy
- Audio quality: background noise, echo, and poor phone line quality all degrade accuracy.
- Accents and dialects: models trained on diverse speakers perform better across them.
- Domain knowledge: models adapted to industry jargon (medical terms, product names), for example through custom vocabularies or fine-tuning, recognize those terms more reliably.
- Real-time constraints: some ASR systems buffer audio to gain extra context and improve accuracy, at the cost of added latency.
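The buffering trade-off in the last item can be made concrete with a little arithmetic. The sketch below is a simplified model, not a description of any specific engine: it assumes 8 kHz telephone audio, fixed-size chunks, and a recognizer that waits for a configurable number of look-ahead chunks before emitting text. The names `buffered_chunks` and `min_buffering_latency_ms` are hypothetical.

```python
from typing import Iterator, Sequence


def buffered_chunks(
    samples: Sequence[int], sample_rate_hz: int, chunk_ms: int
) -> Iterator[Sequence[int]]:
    """Yield fixed-duration chunks of audio samples, as a streaming client might."""
    chunk_size = sample_rate_hz * chunk_ms // 1000
    if chunk_size <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def min_buffering_latency_ms(chunk_ms: int, lookahead_chunks: int = 0) -> int:
    """Lower bound on delay added by buffering alone (assumed model).

    The recognizer cannot process a chunk until it has filled, and it waits
    for `lookahead_chunks` more chunks of future context before responding.
    Network and model inference time are not included.
    """
    return chunk_ms * (1 + lookahead_chunks)


# One second of silent 8 kHz audio, split into 100 ms chunks.
one_second = [0] * 8000
chunks = list(buffered_chunks(one_second, sample_rate_hz=8000, chunk_ms=100))
print(len(chunks), "chunks of", len(chunks[0]), "samples")
print("buffering latency floor:", min_buffering_latency_ms(100, lookahead_chunks=2), "ms")
```

Under these assumptions, 100 ms chunks with two chunks of look-ahead add at least 300 ms before any text can be returned, which is why larger buffers tend to help accuracy but hurt conversational responsiveness.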
ASR Impact on Voice Agent Performance
If ASR accuracy drops to 85%, roughly one word in seven is transcribed incorrectly, and those errors cascade into downstream intent detection and response generation. High-quality voice agents mitigate this with domain-specialized ASR models, multi-pass refinement of the transcript, and escalation to a human agent when recognition confidence is low.
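A confidence-based fallback of this kind can be sketched as a small routing function. This is an illustration under stated assumptions, not a specific product's logic: it assumes the ASR engine returns a confidence score between 0.0 and 1.0 with each transcript, and the `Transcript` type, the 0.6 threshold, and the two-strike escalation rule are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Transcript:
    text: str
    confidence: float  # assumed range: 0.0 (no confidence) to 1.0 (certain)


def route(
    transcript: Transcript,
    low_streak: int,
    confidence_threshold: float = 0.6,  # hypothetical threshold
    max_low_streak: int = 2,            # hypothetical escalation rule
) -> Tuple[str, int]:
    """Decide the next action and return it with the updated low-confidence streak."""
    if transcript.confidence >= confidence_threshold:
        # Trust the transcript and reset the streak.
        return "continue", 0
    low_streak += 1
    if low_streak >= max_low_streak:
        # Repeated low confidence: hand the call to a person.
        return "escalate_to_human", low_streak
    # First low-confidence turn: ask the caller to repeat.
    return "ask_to_repeat", low_streak


# Simulated call with two unclear turns in a row.
streak = 0
for turn in [
    Transcript("i need to reschedule", 0.92),
    Transcript("my uh point mint", 0.41),
    Transcript("the a point men", 0.38),
]:
    action, streak = route(turn, streak)
    print(f"{turn.confidence:.2f} -> {action}")
```

Asking the caller to repeat once before escalating keeps a single noisy utterance from ending the automated conversation, while still limiting how long a caller is stuck with an agent that cannot understand them.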