May 2026 · Voice AI · 12 min read

Building voice AI with sub-2-second latency taught me where every millisecond hides

The first version of my voice assistant sounded smart and felt terrible. You asked a question, waited four or five seconds, then got an answer. Technically correct. Emotionally dead. Getting that latency below two seconds changed everything about how I design AI systems.

The Moment It Felt Broken

The first time I called my own voice assistant, I thought I had built something impressive. Whisper handled speech-to-text. A language model generated the response. Text-to-speech turned it back into audio. Twilio handled the phone layer. End-to-end, the system worked.

Then I asked a simple question and waited.

One second. Two seconds. Three. Then the voice replied with a perfectly reasonable answer that already felt too late. It was the kind of delay you can tolerate in a demo and never accept in a real conversation.

That was the moment I understood a hard truth about voice AI: intelligence is not enough. If the system does not answer fast enough, users stop experiencing it as conversation and start experiencing it as a broken interface.

In text chat, a two-second pause is normal. In voice, two seconds feels awkward. Four seconds feels like failure. Human conversation has rhythm, interruption, overlap, tiny acknowledgments, and almost invisible timing rules. If you miss those rules, people do not say your architecture is bad. They just stop using the thing.

Why Voice Is Harder Than Chat

Developers who have only built text products often underestimate this. In chat, latency is annoying. In voice, latency is the product. Or at least a massive part of it.

A voice system has to do four jobs in sequence. First, capture audio cleanly enough that speech recognition can work. Second, transcribe quickly without waiting forever for the speaker to fully finish. Third, generate a response with enough intelligence to be useful. Fourth, synthesize audio and stream it back with as little dead air as possible.

Each stage is individually manageable. The problem is that the delays stack. Three hundred milliseconds here, seven hundred there, another second in generation, half a second in audio startup. Add network jitter and a few smaller frictions on top and you are at three or four seconds, and the experience feels robotic no matter how good the answer is.

That is why I stopped thinking about voice AI as one feature and started treating it like a latency budget. Every component had to justify its milliseconds.
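To make "latency budget" concrete, here is a minimal sketch of what one could look like for a single conversational turn. The stage names and targets are illustrative, not measurements from my system.

```python
# Illustrative per-stage latency budget for one conversational turn (milliseconds).
# Stage names and targets are examples, not measured values from production.
TURN_BUDGET_MS = {
    "audio_capture_and_chunking": 200,
    "speech_to_text": 400,
    "llm_first_token": 500,
    "tts_first_audio": 300,
    "telephony_buffering": 300,
}

def check_budget(budget: dict[str, int], target_ms: int = 2000) -> None:
    """Print how the stage budgets add up against the end-to-end target."""
    total = sum(budget.values())
    status = "within" if total <= target_ms else "over"
    print(f"Total budget: {total} ms ({status} the {target_ms} ms target)")
    for stage, ms in budget.items():
        print(f"  {stage}: {ms} ms ({ms / total:.0%} of total)")

if __name__ == "__main__":
    check_budget(TURN_BUDGET_MS)
```

The exact numbers matter less than the discipline of writing them down: once every stage has a number, every stage has something to justify.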

The Actual Pipeline

The system I built is simple in concept. Twilio receives the phone call and streams audio chunks. Whisper handles speech-to-text. The transcript goes into an LLM that has the assistant context, tools, and persona. The response is sent to a TTS engine and streamed back to the caller. On paper, that is a neat four-box diagram.
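In code, the naive version of that four-box flow is roughly one sequential function. The helper names below are placeholders for whatever STT, LLM, and TTS clients you wire in, not the actual calls in my system.

```python
# A deliberately naive, sequential version of the four-box pipeline.
# transcribe(), generate_reply(), and synthesize() are hypothetical stand-ins
# for real Whisper, LLM, and TTS calls; each stage waits for the previous one.

def transcribe(audio: bytes) -> str:
    """Placeholder for speech-to-text (e.g. a Whisper call)."""
    raise NotImplementedError

def generate_reply(transcript: str, context: dict) -> str:
    """Placeholder for the LLM turn, including persona and tools."""
    raise NotImplementedError

def synthesize(text: str) -> bytes:
    """Placeholder for text-to-speech (e.g. an Edge TTS call)."""
    raise NotImplementedError

def handle_turn(audio: bytes, context: dict) -> bytes:
    transcript = transcribe(audio)                 # wait for the full transcript
    reply = generate_reply(transcript, context)    # wait for the full response
    return synthesize(reply)                       # wait for the full audio
```

Every one of those waits shows up as dead air on the call, which is what the rest of this post is about removing.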

In practice, each box hides more complexity than the diagram admits. Audio arrives in chunks that do not align neatly with sentence boundaries. Callers pause mid-thought. They restart sentences. They speak over the assistant. Network conditions vary. A perfect pipeline in a local terminal behaves very differently when a real human calls from a mobile network while walking outside.

I use Whisper for transcription because it is reliable, local, and good enough for real speech. For synthesis, I started with Edge TTS because it is fast and easy to integrate. The language model does the reasoning and can also call tools when needed. The whole thing is glued together by a small voice server that keeps state per call.

What matters is not that any one component is magical. What matters is how quickly the handoff happens between them.

My First Mistake: Waiting For Perfect Transcripts

The earliest version waited too long before transcribing. I wanted clean input, so I buffered large chunks of audio and looked for clear pauses before sending anything to Whisper. This improved transcript quality slightly and destroyed the conversational feel completely.

I was optimizing for correctness in the wrong place. A voice assistant does not need a courtroom-grade transcript before it starts thinking. It needs a good enough understanding fast enough to respond naturally.

The fix was partial transcription. Instead of waiting for a long, comfortable silence, I started processing shorter segments and updating the transcript incrementally. That alone cut a noticeable amount of dead time. More importantly, it changed my mental model. Voice systems should prefer progressive certainty over delayed perfection.
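A rough sketch of that incremental approach, assuming a hypothetical transcribe_segment() wrapper around the STT engine and a cheap silence check instead of full voice-activity detection:

```python
SEGMENT_SECONDS = 0.6    # transcribe short segments instead of waiting for a long pause

def transcribe_segment(audio: bytes) -> str:
    """Placeholder for a short Whisper call on a small audio segment."""
    raise NotImplementedError

def is_probably_silence(audio: bytes) -> bool:
    """Placeholder: a cheap energy check, not full voice-activity detection."""
    raise NotImplementedError

def incremental_transcript(audio_segments):
    """Yield a growing transcript as short segments arrive,
    instead of buffering the whole utterance first."""
    transcript = ""
    for segment in audio_segments:
        if is_probably_silence(segment):
            continue
        transcript = (transcript + " " + transcribe_segment(segment)).strip()
        yield transcript   # downstream stages can start working on partial text
```

The partial transcripts are sometimes slightly wrong mid-utterance, and that is fine: the point is that the rest of the pipeline no longer sits idle while the caller finishes a thought.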

This is a pattern I keep seeing in AI engineering. The technically best output is not always the best product outcome. In voice, users forgive small transcription imperfections far more easily than they forgive silence.

My Second Mistake: Letting The Model Talk Too Much

Large models love being thorough. That is often useful in writing and dangerous in voice. My assistant initially answered like an overachieving consultant. Good explanations, complete thoughts, elegant wording, terrible pacing.

A spoken answer needs to be shorter than a written one. It needs to get to the point faster. It needs to sound like something a human would say out loud, not like a polished blog paragraph read into a microphone.

I changed the prompt to prioritize brevity, interruption tolerance, and verbal pacing. Short first sentence. Answer first, detail second. Avoid long lists unless asked. Prefer plain words over formal ones. The quality improved immediately, but the more interesting win was latency: shorter responses mean faster generation, faster synthesis, and faster turn-taking.
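The prompt itself is not exotic. A stripped-down, paraphrased version of the voice-specific instructions looks something like this, not the exact production wording:

```python
# Paraphrased voice-first system prompt; the production version differs in wording.
VOICE_SYSTEM_PROMPT = """\
You are a voice assistant on a live phone call.
- Answer in one or two short spoken sentences before adding any detail.
- Lead with the answer; explain only if the caller asks.
- Avoid lists, headings, and anything that only works in writing.
- Use plain, conversational words.
- Expect to be interrupted; never rely on finishing a long answer.
"""
```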

That was one of my favorite lessons from this project. Good UX and good performance often want the same thing.

Where The Milliseconds Actually Went

When people talk about AI latency, they usually blame the model. Sometimes that is fair. Often it is lazy. In my case, the model was only one part of the delay.

The real latency breakdown looked more like this:

  • Audio chunk accumulation before transcription
  • Speech-to-text processing time
  • LLM first-token delay
  • Text-to-speech startup time
  • Telephony stream buffering and network jitter

None of those were catastrophic alone. Together they were enough to ruin the experience. Once I measured them separately, the optimization path became obvious. Reduce chunk size. Start transcribing sooner. Use a faster model for voice turns. Keep responses concise. Begin TTS streaming as early as possible. Avoid anything that forces the system to wait for the full final response when it could already be speaking.
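Measuring the stages separately does not require anything fancy. A few timestamps per turn are enough to see where the milliseconds go; this is a minimal sketch, with the stage calls left as placeholders:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock time for one pipeline stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

# Usage inside one conversational turn (stage functions are placeholders):
# with timed("stt"):
#     transcript = transcribe(audio)
# with timed("llm_first_token"):
#     first_token = next(token_stream)
# with timed("tts_first_audio"):
#     first_chunk = next(audio_stream)
# print(sorted(timings.items(), key=lambda kv: -kv[1]))
```

Sorting the stages by cost at the end of each call made the worst offender obvious every single time.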

What got me under two seconds was not one breakthrough. It was removing ten small frictions.

The Architecture Shift That Helped Most

The biggest architectural change was moving from a batch mindset to a streaming mindset. At first, every stage waited for the previous stage to fully finish. Entire utterance in, complete transcript out, complete model response out, complete audio out. Clean. Predictable. Slow.

Streaming changes the shape of the system. Transcription starts while the user is still finishing the thought. The model can begin once enough intent is clear. TTS can begin speaking the first sentence before the last sentence is generated. The call feels alive because the pipeline overlaps instead of queues.
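A simplified sketch of that overlap, assuming the LLM client exposes a token stream and the TTS engine can synthesize one sentence at a time; both are placeholders here, not a specific library API:

```python
import re

async def stream_reply(token_stream, synthesize_sentence, play_audio):
    """Start speaking each sentence as soon as it is complete,
    instead of waiting for the whole response.

    token_stream: async iterator of text tokens from the LLM (placeholder)
    synthesize_sentence: async fn, text -> audio bytes (placeholder TTS call)
    play_audio: async fn, audio bytes -> None (placeholder telephony output)
    """
    buffer = ""
    async for token in token_stream:
        buffer += token
        # Flush on sentence boundaries so TTS can start before generation finishes.
        while match := re.search(r"(.+?[.!?])\s", buffer):
            sentence = match.group(1)
            buffer = buffer[match.end():]
            audio = await synthesize_sentence(sentence)
            await play_audio(audio)
    if buffer.strip():
        audio = await synthesize_sentence(buffer.strip())
        await play_audio(audio)
```

A production version would push sentences onto a queue so synthesis and playback overlap with generation even further, but even this naive form removes most of the dead air before the first spoken word.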

This makes implementation harder. State management gets messier. Interruptions matter. You need cancellation logic so the assistant stops speaking when the caller starts talking again. But the UX improvement is so dramatic that I think it is worth the extra complexity for any serious voice product.

If I had to summarize the entire project in one sentence, it would be this: sub-2-second voice AI is mostly an orchestration problem.

Interruptions Changed Everything

One subtle thing that separates a toy from a useful voice assistant is barge-in handling. Humans interrupt each other constantly. Not rudely, just naturally. We say "yeah," "right," "wait," "hold on," or ask a follow-up before the other person fully finishes.

My first implementation treated the assistant's audio output as sacred. Once it started speaking, it wanted to finish. That created a weird dynamic where the user felt trapped inside the assistant's sentence.

So I changed it. If the caller starts speaking, the assistant stops. Audio playback is cut, the new speech takes priority, and the conversation continues. Technically, this adds edge cases. Product-wise, it adds humanity.
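In asyncio terms, the simplest version of barge-in is just cancelling the playback task the moment new caller speech is detected. This sketch assumes the speech detection itself happens elsewhere and only shows the cancellation side:

```python
import asyncio
from typing import Optional

class TurnManager:
    """Tracks the assistant's current speaking task so barge-in can cancel it."""

    def __init__(self) -> None:
        self._speaking: Optional[asyncio.Task] = None

    def start_speaking(self, playback_coro) -> None:
        """Begin playing the assistant's reply as a cancellable task."""
        self._speaking = asyncio.create_task(playback_coro)

    async def on_caller_speech(self) -> None:
        """Called when the caller starts talking: stop playback immediately."""
        if self._speaking and not self._speaking.done():
            self._speaking.cancel()
            try:
                await self._speaking
            except asyncio.CancelledError:
                pass   # expected: barge-in cut the reply short
        self._speaking = None
```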

This also forced me to simplify answers further. If a user is likely to interrupt after the first useful sentence, then the first sentence had better contain the actual value.

The Quality Trade-Offs I Actually Accepted

I did not get low latency by pretending trade-offs do not exist. I got there by choosing the right ones.

I accepted that voice transcripts can be slightly messy as long as intent is preserved. I accepted that spoken answers should be shorter and less comprehensive than written ones. I accepted that for live calls, a fast model that is 95% as smart is often better than a slower model that is 100% as smart.

I did not accept random hallucinations, poor tool safety, or missing context. Speed is not an excuse for unreliability. If the assistant can schedule a meeting, read private notes, or act on behalf of a user, guardrails matter just as much as performance.

This balance is where a lot of AI demos fall apart in production. They optimize for wow-factor metrics and ignore operational trust. A voice system that answers in 900 milliseconds and says the wrong thing is still a bad system.

What Testing Voice AI Taught Me

Testing voice systems is different from testing chat. In chat, you can inspect the final text and be done. In voice, timing is part of correctness. A response that is semantically right but arrives too late is functionally wrong.

I started evaluating calls with questions like these: how long until the first audible response? How often does the assistant talk over the user? How often does it fail to stop when interrupted? Does it handle hesitation naturally? Does it recover from transcription ambiguity without sounding confused?
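Even a crude per-call record of those questions is more useful than a benchmark score. Something this simple is enough to spot regressions between changes; the field names are just my framing of the questions above, not any standard schema:

```python
from dataclasses import dataclass

@dataclass
class CallEvaluation:
    """One real test call, scored on conversational behaviour rather than accuracy alone."""
    time_to_first_audio_ms: int       # how long until the first audible response
    talked_over_user: int             # times the assistant spoke over the caller
    failed_to_stop_on_interrupt: int  # times barge-in did not cut playback
    handled_hesitation: bool          # did mid-sentence pauses confuse it?
    recovered_from_ambiguity: bool    # messy transcript, but the intent still answered?
```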

These are not benchmark questions. They are product questions. They are also much closer to how real people judge voice experiences.

The most useful test setup I found was not synthetic audio. It was repeated real phone calls from different networks, different rooms, different speaking styles, and deliberately messy conversations. Real users do not speak in benchmark-quality sentences. They mumble, pause, change direction, and interrupt themselves. Your system has to survive that.

Why Two Seconds Matters So Much

I would not claim that two seconds is a universal law. But it is a meaningful threshold. Above it, conversations start feeling sticky. Below it, people relax. They stop thinking about the software and start interacting with it.

The difference was obvious the first time I got the pipeline consistently near that range. The calls felt lighter. I found myself naturally asking follow-up questions instead of testing the system like a machine. That was the real milestone, not the stopwatch.

It reminded me of something I learned building other AI systems: users do not experience architectures. They experience waiting, friction, and trust. If you remove enough waiting, the intelligence gets a chance to matter.

What I Would Tell Anyone Building Voice AI

If you are building a voice assistant, start by measuring before optimizing. Break latency into stages. Do not treat it as one mysterious number. Then decide where you can stream, where you can shorten, and where you can safely trade perfect output for faster interaction.

Design prompts specifically for speech, not for text copied into speech. Build interruption handling early, not as a polish task. Test on bad networks and messy speech. And if you have to choose, optimize for responsiveness before eloquence. A fast good answer beats a slow beautiful one almost every time.

I went into this project thinking I was building a smart assistant. I came out of it thinking I was building a timing system with intelligence attached. That perspective changed how I work on every conversational product now.

In voice AI, users do not reward you for the smartest answer if it arrives too late. Conversation is a performance problem before it becomes an intelligence problem.
Igor Gawrys
AI Engineer & IT Consultant · Katowice, Poland