Executive Summary
- Human conversational tolerance for a response gap is roughly 500ms; anything slower starts to feel like talking over a walkie-talkie.
- Traditional REST and bulk WebSocket architectures cannot move STT, LLM inference, and TTS audio fast enough to stay under that threshold.
- WebRTC transport coupled with streaming LLM tokens closes most of the gap to human-parity response speed.
This document benchmarks the architecture a voice AI needs in order to perceive, think, and begin speaking within that window.
1. The Speed Bottleneck
In a standard voice pipeline, speech-to-text (STT) takes roughly 200ms, LLM inference takes 800ms, and text-to-speech (TTS) takes 400ms. Processed sequentially, the user waits 1.4 seconds before hearing the first syllable of a response.
[Figure: Pipeline latency breakdown in milliseconds, sequential vs. token-level streaming]
Token-level streaming collapses this wait by overlapping the stages: TTS begins synthesizing as soon as the LLM emits its first few tokens, so the user hears audio after STT plus the model's time-to-first-token plus the first TTS chunk, rather than after the full pipeline.
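The difference can be sketched with a simple latency model. The sequential stage figures (200/800/400ms) come from the breakdown above; the streaming figures (time-to-first-token, first TTS chunk) are illustrative assumptions, not measured values.

```python
# Illustrative latency model for a voice pipeline.
# Sequential stage times are from the text; streaming
# times are assumed values for the sketch.

STT_MS = 200         # full speech-to-text pass
LLM_TOTAL_MS = 800   # full LLM generation
TTS_TOTAL_MS = 400   # full text-to-speech render

LLM_FIRST_TOKEN_MS = 150  # assumed time-to-first-token when streaming
TTS_FIRST_CHUNK_MS = 100  # assumed time to first audio chunk

def sequential_latency() -> int:
    """Each stage waits for the previous one to finish completely."""
    return STT_MS + LLM_TOTAL_MS + TTS_TOTAL_MS

def streaming_time_to_first_audio() -> int:
    """TTS starts on the first streamed tokens, so the user hears
    audio long before the full response has been generated."""
    return STT_MS + LLM_FIRST_TOKEN_MS + TTS_FIRST_CHUNK_MS

print(sequential_latency())            # 1400
print(streaming_time_to_first_audio()) # 450
```

Under these assumptions, perceived latency drops from 1.4s to well under the 500ms conversational threshold, even though total generation time is unchanged.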
2. Interruption Handling (Barge-in)
If the user interrupts the bot mid-sentence, Voice Activity Detection (VAD) must trigger within 50ms, kill the outbound audio stream instantly, and inject a 'user interrupted' flag into the LLM context so the model knows it was cut off.
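A minimal asyncio sketch of that barge-in path, under assumed interfaces: `play_tts` stands in for an outbound audio stream, and the message list stands in for the LLM context. The function names and the exact flag wording are hypothetical.

```python
import asyncio

async def play_tts(audio_chunks):
    """Stream TTS audio chunks out; cancellable mid-sentence."""
    try:
        for chunk in audio_chunks:
            await asyncio.sleep(0.01)  # stand-in for sending one audio chunk
    except asyncio.CancelledError:
        pass  # stream killed by barge-in; stop immediately

async def handle_barge_in(tts_task, messages):
    """On VAD trigger: cancel playback and flag the interruption
    so the LLM knows its last turn was cut off."""
    tts_task.cancel()
    try:
        await tts_task
    except asyncio.CancelledError:
        pass
    messages.append({"role": "system", "content": "user interrupted"})

async def main():
    messages = [{"role": "assistant", "content": "Let me explain..."}]
    task = asyncio.create_task(play_tts(["chunk"] * 100))
    await asyncio.sleep(0.05)  # VAD fires ~50ms into playback
    await handle_barge_in(task, messages)
    return messages

messages = asyncio.run(main())
print(messages[-1]["content"])  # user interrupted
```

The key design point is that cancellation and context injection happen together: killing the audio without flagging the context leaves the model believing it finished its sentence.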
3. Enterprise Use Cases
WebRTC architecture enables real-time drive-thru automation, large-scale outbound sales infrastructure, and live multilingual translation, with latency comparable to an analog phone call.
