Skip to main content

Command Palette

Search for a command to run...

Voice Agent Architectures Explained: Cascading vs Native Multimodal Pipelines

Updated
8 min read
Voice Agent Architectures Explained: Cascading vs Native Multimodal Pipelines
R
Software engineer with experience building backend services, cloud infrastructure, and AI-powered applications. Over the past few years, I’ve worked across API design, distributed systems, DevOps, and AI workflows, helping take products from idea to production. I enjoy solving engineering problems end-to-end—whether that involves designing system architecture, building reliable backend services, automating infrastructure, or integrating AI where it creates real value. Interests: Backend Engineering, Platform Engineering, Cloud Infrastructure, Distributed Systems, and Applied AI.

Everyone wants to build “voice agents”, but that term hides two very different architectures.

The first is the classic cascading pipeline: speech-to-text → LLM → text-to-speech, all coordinated by your backend.

The second is the newer native multimodal realtime pipeline, where the client talks almost directly to a realtime model session capable of handling speech input/output in one loop.

I sketched both architectures to understand the tradeoffs more clearly. In this post, I’ll break down how each pipeline works, where latency comes from, what your backend is responsible for, and when I’d choose one over the other.

Why voice agent architecture matters?

Building a voice agent is not just about plugging an LLM into a microphone. The architecture you choose determines:

  • how fast the assistant feels

  • whether you can swap STT/TTS vendors independently

  • how easily you can add tools and business logic

  • how much conversational nuance the system can preserve

  • how much your backend stays in the critical path

  • how hard it is to debug latency spikes or broken turns

At a high level, most production systems today fall into one of two buckets:

  1. Cascading voice pipeline
    Separate STT, LLM, and TTS services connected by backend orchestration.

  2. Native multimodal realtime pipeline
    A realtime model session handles the conversational loop, while the backend acts more like a sidecar for auth, tools, and control.

1) Cascading voice pipeline

In the cascading design, the voice stack is decomposed into three separate stages:

  • Speech-to-Text (STT / ASR)
    Converts user audio into text

  • LLM reasoning/generation
    Takes the transcript + context and produces a response

  • Text-to-Speech (TTS)
    Converts the response text back into audio

The backend orchestrates all three stages and sits directly in the middle of the conversation loop.

High-level flow

  1. The client (browser / SIP endpoint / phone integration) streams audio to the backend

  2. The backend forwards audio to an STT provider

  3. The STT system emits partial/final transcripts

  4. The backend sends the transcript + conversation context to an LLM

  5. The LLM streams back response text

  6. The backend forwards the generated text to a TTS provider

  7. TTS returns synthesized audio

  8. The backend streams audio back to the client

Breaking down the components

Client

The client can be:

  • a browser using WebRTC/WebSocket audio streaming

  • a SIP/telephony system for phone calls

  • a mobile app streaming microphone audio

Its job is simple: capture audio, send it upstream, and play audio responses back with low latency.

Backend

The backend is the conductor of the entire pipeline.

It is responsible for:

  • managing websocket sessions

  • buffering audio chunks

  • forwarding audio to STT

  • maintaining conversation state

  • calling the LLM with the latest transcript + context

  • invoking TTS for response synthesis

  • streaming synthesized audio back to the client

  • enforcing business rules, auth, logging, and analytics

In a cascading pipeline, the backend is not optional infrastructure — it is the core orchestrator.

STT / ASR

The STT layer converts incoming speech into text. Typical providers include:

  • OpenAI Whisper / realtime transcription offerings

  • Azure Speech-to-Text

  • Deepgram

  • AssemblyAI

  • ElevenLabs speech recognition offerings where relevant

This stage often also handles:

  • voice activity detection (VAD) — detecting when the user starts/stops speaking

  • turn detection — deciding when the user’s utterance is “complete enough” to send to the LLM

  • partial transcript streaming

LLM

Once the backend has a usable transcript, it calls the LLM. This is where:

  • intent understanding happens

  • tool decisions happen

  • conversation memory is used

  • response text is generated

For voice systems, teams often prefer small, fast models for the conversational path because latency matters more than perfect reasoning depth for every turn.

TTS

Finally, response text is synthesized into speech using providers like:

  • ElevenLabs

  • Deepgram

  • AssemblyAI / others depending on the stack

The backend then streams the generated audio to the client.

Why teams still use cascading pipelines

Advantages

  • Modular and replaceable
    You can swap Deepgram for Whisper, or ElevenLabs for another TTS provider, without redesigning the whole system.

  • Maximum backend control
    The backend sees the transcript, the prompt, the tool calls, and the generated response. That’s valuable for compliance, debugging, analytics, and guardrails.

  • Tool orchestration is straightforward
    Since the LLM call already goes through your backend, executing tools, calling APIs, or injecting business logic is easy to control.

  • Best when speech is only one part of a larger workflow
    For example: contact-center agents, internal enterprise assistants, booking flows, or customer support systems with strict backend workflows.

Drawbacks

  • Higher latency because STT, LLM, and TTS are separate hops

  • Speech expressiveness can be lost when everything is collapsed into text between stages

  • Interruption handling is harder because the system has to coordinate stop-speaking / start-speaking behavior across multiple services

  • More moving parts means more operational complexity

2) Native multimodal realtime pipeline

The second architecture is structurally different.

Instead of routing every conversational turn through separate STT → LLM → TTS services, the client establishes a direct low-latency connection to a realtime multimodal model session.

The backend still exists, but it no longer sits in the middle of every audio token flowing through the system. Instead, it acts more like a control plane / sideband service.

High-level flow

  1. The client requests an ephemeral auth token from the backend

  2. The client uses that token to establish a direct websocket/realtime connection to the model provider

  3. Audio/video streams directly between the client and the realtime session

  4. The backend maintains a sideband channel to the session for:

    • tool/function execution

    • telemetry monitoring

    • session updates / injected instructions

    • business logic hooks

This changes the role of the backend significantly.

Breaking down the components

Client

In the native multimodal setup, the client becomes much more important.

It is responsible for:

  • establishing the realtime connection

  • sending live audio (and sometimes video)

  • receiving streamed speech responses directly from the model session

  • handling interruptions and playback state on the client side

Because the client is closer to the model, the interaction can feel more conversational and immediate.

Backend

The backend does not disappear. It just moves out of the direct audio path.

Its responsibilities now shift toward:

  • minting ephemeral session/auth tokens

  • maintaining a sideband websocket / control channel

  • intercepting and executing tool calls

  • injecting dynamic instructions or policy updates

  • collecting telemetry, usage, and observability data

  • enforcing app-specific business rules

This is a crucial mindset shift:

In a cascading pipeline, the backend is the conversation orchestrator.
In a realtime multimodal pipeline, the backend is often the session controller and tool executor.

Realtime multimodal model session

The model session now handles much more of the conversational loop natively:

  • streaming speech input

  • turn-taking behavior

  • response generation

  • speech output

  • sometimes multimodal context like audio + text + image/video

Instead of manually chaining STT + LLM + TTS, you’re delegating that conversational stack to the model provider.

Why the native multimodal pipeline feels faster

The biggest architectural win is that the system removes multiple explicit stages from the critical path.

Instead of:

client → backend → STT → backend → LLM → backend → TTS → backend → client

you get something closer to:

client ↔ realtime model session

with the backend participating only when needed.

That reduces:

  • extra network hops

  • orchestration overhead

  • cross-service coordination delays

It also enables more natural interaction patterns like:

  • faster interruptions / barge-in

  • lower response latency

  • smoother turn-taking

  • tighter speech-to-speech conversational flow

When I’d choose each architecture

I’d choose a cascading pipeline when…

  • I need full control over the stack

  • I want to independently choose the best STT, LLM, and TTS vendors

  • tool execution and business workflows are central to the product

  • I’m building for enterprise / contact-center / workflow-heavy use cases

  • I need strong logging, auditability, and stage-by-stage debugging

I’d choose a native multimodal realtime pipeline when…

  • the core product value is real-time conversation quality

  • low latency is the top priority

  • I want more natural interruptions and turn-taking

  • I’m building a voice assistant, AI tutor, interviewer, sales caller, or realtime support agent

  • I’m comfortable delegating more of the speech loop to the model provider

If I had to summarize the tradeoff in one line:

  • Cascading pipeline = control and composability (flexibility, low cost)

  • Native multimodal pipeline = speed and conversational fluidity (less flexible, high cost, low latency than cascading pipeline)