Foundry Ventures
Cloud Architecture

Real-Time Streaming with Amazon Nova Sonic: Architecture Deep Dive

March 5, 2026 • 7 min read

Contents

  • The Latency Budget
  • Bidirectional WebSocket Architecture
  • Simultaneous input and output
  • Streaming token generation
  • Barge-in detection
  • The Twilio Integration
  • Optimization Techniques
  • Connection pooling
  • Audio chunking
  • Response prefetching
  • Results

MDFit Nova-Sonic handles live phone calls, which means every millisecond of latency is perceptible. Here is how we architected the system for real-time streaming.

The Latency Budget

For natural conversation, total round-trip time from user speech to AI response must stay under 1 second. That budget breaks down as:

  • Speech-to-text: ~200ms
  • Intent classification + routing: ~100ms
  • LLM generation: ~400ms
  • Text-to-speech: ~200ms

Exceed 1.2 seconds and callers perceive the AI as slow. Exceed 2 seconds and they hang up.
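As a sanity check, the budget can be expressed in a few lines. The per-stage figures are the estimates above; the threshold labels are ours:

```python
# Per-stage latency estimates from the budget above (milliseconds).
BUDGET_MS = {
    "speech_to_text": 200,
    "intent_routing": 100,
    "llm_generation": 400,
    "text_to_speech": 200,
}

def total_latency_ms(budget: dict) -> int:
    """Total round-trip time from caller speech to AI response."""
    return sum(budget.values())

def caller_perception(total_ms: int) -> str:
    """Map round-trip time to the thresholds described above."""
    if total_ms <= 1000:
        return "within budget"
    if total_ms <= 1200:
        return "borderline"
    if total_ms <= 2000:
        return "perceived as slow"
    return "hang-up risk"

total = total_latency_ms(BUDGET_MS)  # 900 ms -- inside the 1 s budget
```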

Bidirectional WebSocket Architecture

Amazon Nova Sonic supports bidirectional streaming over WebSockets. This is critical because it allows:

Simultaneous input and output

The system can process incoming audio while generating a response. This enables natural turn-taking and even mid-sentence interruption handling.
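A minimal asyncio sketch of the idea, with in-memory queues standing in for the WebSocket (all names are ours, not Nova Sonic's): the receive loop keeps consuming caller audio while the send loop streams a response out.

```python
import asyncio

async def receive_audio(inbound: asyncio.Queue, transcript: list) -> None:
    """Keep consuming caller audio even while a response is playing."""
    while True:
        frame = await inbound.get()
        if frame is None:          # end-of-utterance sentinel
            return
        transcript.append(frame)

async def send_response(outbound: asyncio.Queue, sent: list) -> None:
    """Stream response chunks out concurrently with the receive loop."""
    for chunk in ("re-", "sponse"):
        sent.append(chunk)
        await outbound.put(chunk)
        await asyncio.sleep(0)     # yield control so receiving continues

async def duplex_demo():
    inbound, outbound = asyncio.Queue(), asyncio.Queue()
    for frame in ("hel-", "lo", None):
        inbound.put_nowait(frame)
    transcript, sent = [], []
    # Both directions run at once -- the point of bidirectional streaming.
    await asyncio.gather(
        receive_audio(inbound, transcript),
        send_response(outbound, sent),
    )
    return transcript, sent

transcript, sent = asyncio.run(duplex_demo())
```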

Streaming token generation

Instead of waiting for a complete response, tokens stream to the TTS engine as they are generated. The first syllable of the response begins playing while the rest is still being generated.
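The pattern can be sketched as a consumer that flushes to TTS at clause boundaries rather than waiting for the full reply. The token stream here is a hypothetical stand-in for the model output:

```python
import asyncio

async def llm_tokens():
    # Hypothetical token stream standing in for the model output.
    for tok in ("Thanks ", "for ", "calling. ", "How ", "can ", "I ", "help?"):
        yield tok
        await asyncio.sleep(0)

async def stream_to_tts():
    """Hand text to TTS as soon as a clause completes, not at end of reply."""
    synthesized = []          # what the TTS engine would receive, in order
    buffer = ""
    async for tok in llm_tokens():
        buffer += tok
        # Flush on clause boundaries so playback can begin immediately.
        if buffer.rstrip().endswith((".", "?", "!")):
            synthesized.append(buffer.strip())
            buffer = ""
    if buffer.strip():
        synthesized.append(buffer.strip())
    return synthesized

phrases = asyncio.run(stream_to_tts())
```

The first flushed phrase can start playing while the remaining tokens are still being generated.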

Barge-in detection

If a caller interrupts the AI mid-response, the system detects the barge-in, stops the current response, and processes the new input.
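In asyncio terms, barge-in maps naturally onto task cancellation. A simulated sketch (timings and names are illustrative only):

```python
import asyncio

async def play_response(played: list) -> None:
    """Simulated playback of response audio in small chunks."""
    try:
        for i in range(50):
            played.append(i)
            await asyncio.sleep(0.01)
    except asyncio.CancelledError:
        played.append("stopped")   # flush/stop the audio pipeline here
        raise

async def call_loop():
    played: list = []
    playback = asyncio.create_task(play_response(played))
    await asyncio.sleep(0.035)     # caller starts speaking mid-response
    playback.cancel()              # barge-in: abandon the current response
    try:
        await playback
    except asyncio.CancelledError:
        pass
    # ...then re-enter the listen/transcribe path with the new input.
    return played

played = asyncio.run(call_loop())
```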

The Twilio Integration

Twilio Media Streams provides the raw audio bridge between the phone network and our WebSocket server.

Caller → Twilio → Media Stream → WebSocket → Nova Sonic → WebSocket → Media Stream → Twilio → Caller

Each hop adds latency, so we colocate our WebSocket server in the same AWS region as the Nova Sonic endpoint.
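Media Streams delivers JSON frames over the WebSocket; `media` events carry base64-encoded 8 kHz mu-law audio. A minimal decoder for one frame might look like this (payload bytes are dummies):

```python
import base64
import json

def handle_twilio_frame(raw: str):
    """Decode one Twilio Media Streams message.

    'media' events carry base64-encoded 8 kHz mu-law audio in
    media.payload; 'start'/'stop'/'mark' events carry no audio.
    """
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return None
    return base64.b64decode(msg["media"]["payload"])

# A frame shaped like Twilio's 'media' event, with dummy audio bytes.
frame = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(b"\x7f\xff" * 80).decode()},
})
audio = handle_twilio_frame(frame)
```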

Optimization Techniques

Connection pooling

Nova Sonic WebSocket connections are expensive to establish. We maintain a warm pool of connections that are reused across calls.
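A warm pool can be as simple as an `asyncio.Queue` of pre-established connections. This sketch uses string IDs in place of real connections; `open_nova_connection` is our placeholder for the handshake:

```python
import asyncio
import itertools

_ids = itertools.count(1)

async def open_nova_connection() -> str:
    """Stand-in for the expensive WebSocket handshake (name is ours)."""
    await asyncio.sleep(0)         # handshake would happen here
    return f"conn-{next(_ids)}"

class WarmPool:
    """Pre-establish connections so each call skips the handshake."""

    def __init__(self, size: int) -> None:
        self._size = size
        self._ready: asyncio.Queue = asyncio.Queue()

    async def warm_up(self) -> None:
        for _ in range(self._size):
            self._ready.put_nowait(await open_nova_connection())

    async def acquire(self) -> str:
        return await self._ready.get()

    def release(self, conn: str) -> None:
        self._ready.put_nowait(conn)   # return for reuse by the next call

async def demo():
    pool = WarmPool(size=1)
    await pool.warm_up()               # pay the handshake cost up front
    call_one = await pool.acquire()    # no handshake on the call path
    pool.release(call_one)
    call_two = await pool.acquire()    # same warmed connection, reused
    return call_one, call_two

call_one, call_two = asyncio.run(demo())
```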

Audio chunking

We send audio in 20ms chunks — small enough for responsive processing, large enough to avoid excessive network overhead.
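The arithmetic behind the chunk size: phone audio from Twilio is 8 kHz mu-law at one byte per sample, so 20 ms is 160 bytes. A small sketch:

```python
def chunk_size_bytes(sample_rate_hz: int, bytes_per_sample: int, chunk_ms: int) -> int:
    """Bytes needed to carry chunk_ms of audio."""
    return sample_rate_hz * bytes_per_sample * chunk_ms // 1000

def iter_chunks(audio: bytes, size: int):
    """Slice a buffer into fixed-size chunks for streaming."""
    for i in range(0, len(audio), size):
        yield audio[i:i + size]

# 8 kHz mu-law, one byte per sample: a 20 ms chunk is 160 bytes.
CHUNK = chunk_size_bytes(8000, 1, 20)
chunks = list(iter_chunks(b"\x00" * 800, CHUNK))   # 100 ms -> 5 chunks
```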

Response prefetching

For common intents (greeting, hold, transfer), we pre-generate responses and cache the audio. This drops response time to under 200ms for predictable interactions.
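The fast path is a cache lookup in front of the full generation pipeline. In this sketch the byte strings stand in for pre-generated TTS audio, and `generate_audio` is our placeholder for the slow LLM + TTS path:

```python
import time

# Pre-generated TTS audio keyed by intent (byte strings stand in for audio).
PREFETCHED_AUDIO = {
    "greeting": b"<greeting.ulaw>",
    "hold": b"<hold.ulaw>",
    "transfer": b"<transfer.ulaw>",
}

def generate_audio(text: str) -> bytes:
    """Stand-in for the full LLM + TTS path -- the slow case."""
    time.sleep(0.01)               # simulated generation latency
    return text.encode()

def respond(intent: str, fallback_text: str) -> bytes:
    """Serve cached audio for predictable intents; generate otherwise."""
    cached = PREFETCHED_AUDIO.get(intent)
    if cached is not None:
        return cached              # fast path: no model or TTS call
    return generate_audio(fallback_text)

audio = respond("greeting", "Hello, thanks for calling!")
```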

Results

With these optimizations, our median response time is 780ms and our p95 is 1.1 seconds. Callers consistently rate the conversation flow as natural.
