src/oss/langchain/voice-agent.mdx
---
title: Build a voice agent with LangChain
sidebarTitle: Voice agent
---
## Overview
Chat interfaces have dominated how we interact with AI, but recent breakthroughs in multimodal AI are opening up exciting new possibilities. High-quality generative models and expressive text-to-speech (TTS) systems now make it possible to build agents that feel less like tools and more like conversational partners.
Voice agents are one example of this. Instead of typing inputs into an agent with a keyboard, you can interact with it using spoken words. This can be a more natural and engaging way to work with AI, and it is especially useful in hands-free contexts.
### What are voice agents?
Voice agents are [agents](/oss/langchain/agents) that can engage in natural spoken conversations with users. They combine speech recognition, natural language processing, generative AI, and text-to-speech technologies to create a seamless conversational experience.
They're suited for a variety of use cases, including:
- Customer support
- Personal assistants
- Hands-free interfaces
- Coaching and training
### How do voice agents work?
At a high level, every voice agent needs to handle three tasks:
1. **Listen** - capture audio and transcribe it
2. **Think** - interpret intent, reason, plan
3. **Speak** - generate audio and stream it back to the user
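The three tasks above can be sketched as plain functions. Everything below is an illustrative stub (the transcript, the reply rule, and the audio encoding are all made up), not a real STT, agent, or TTS stack:

```python
def listen(audio: bytes) -> str:
    """Capture audio and transcribe it (stubbed with a fixed transcript)."""
    return "one turkey sandwich, please"


def think(transcript: str) -> str:
    """Interpret intent and plan a response (stubbed with a simple rule)."""
    if "sandwich" in transcript:
        return "Sure! One turkey sandwich coming up."
    return "Sorry, could you repeat that?"


def speak(text: str) -> bytes:
    """Generate audio for the response (stubbed as UTF-8 bytes)."""
    return text.encode("utf-8")


def voice_turn(audio: bytes) -> bytes:
    """One conversational turn: listen, then think, then speak."""
    return speak(think(listen(audio)))
```

In a real agent, each stub would be replaced by a streaming integration with an STT model, an LLM-backed agent, and a TTS model.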
The difference lies in how these steps are sequenced and coupled. In practice, production agents follow one of two main architectures:
#### 1. STT > Agent > TTS Architecture (The "Sandwich")
The Sandwich architecture composes three distinct components: speech-to-text (STT), a text-based LangChain agent, and text-to-speech (TTS).
@@ -34,7 +52,7 @@ flowchart LR
- Additional complexity in managing the pipeline
- Conversion from speech to text loses information (e.g., tone, emotion)
#### 2. Speech-to-Speech Architecture (S2S)
Speech-to-speech uses a multimodal model that processes audio input and generates audio output natively.
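To contrast with the sandwich, the speech-to-speech contract reduces to a single audio-in, audio-out call with no text in between. `FakeS2SModel` below is a hypothetical stand-in to show the shape of that contract, not any provider's actual SDK:

```python
class FakeS2SModel:
    """Hypothetical speech-to-speech model: audio in, audio out."""

    def generate(self, audio_in: bytes) -> bytes:
        # A real multimodal model would reason over the audio natively;
        # here we just echo to illustrate the single-call contract.
        return b"response-audio-for:" + audio_in


model = FakeS2SModel()
audio_out = model.generate(b"user-audio")
```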
@@ -55,11 +73,11 @@ flowchart LR
- Less transparency in how audio is processed
- Reduced controllability and customization options
This guide demonstrates the **sandwich architecture** to balance performance, controllability, and access to modern model capabilities. The sandwich can achieve sub-700ms latency with some STT and TTS providers while maintaining control over modular components.
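Low end-to-end latency depends on streaming at every stage: each component consumes and emits chunks, so synthesis can begin before the agent has finished generating. A minimal sketch with stand-in stages (the token list and per-token synthesis are made up, not real provider calls):

```python
import asyncio
from typing import AsyncIterator


async def agent_tokens(transcript: str) -> AsyncIterator[str]:
    """Stand-in agent that streams its reply token by token."""
    for token in ["One", "turkey", "sandwich!"]:
        yield token


async def tts_chunks(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in TTS: synthesize each chunk as it arrives instead of
    waiting for the full reply."""
    async for token in tokens:
        yield token.encode("utf-8")


async def main() -> list:
    return [chunk async for chunk in tts_chunks(agent_tokens("order please"))]


chunks = asyncio.run(main())
```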
### Demo application overview
We'll walk through building a voice agent that manages orders for a sandwich shop. The application demonstrates all three components of the sandwich architecture, using [AssemblyAI](https://www.assemblyai.com/) for STT and [ElevenLabs](https://elevenlabs.io/) for TTS (although adapters can be built for most providers).
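One way to keep STT and TTS providers swappable, as the adapters aside suggests, is to code the pipeline against small interfaces. The `Protocol` definitions and echo implementations below are an illustrative sketch, not the demo repository's actual adapter API:

```python
from typing import Callable, Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class EchoSTT:
    """Toy adapter: treats the audio bytes as UTF-8 text."""

    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class EchoTTS:
    """Toy adapter: encodes the reply text as bytes."""

    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


def run_pipeline(stt: STT, agent: Callable[[str], str], tts: TTS, audio: bytes) -> bytes:
    """The sandwich: STT, then the agent, then TTS."""
    return tts.synthesize(agent(stt.transcribe(audio)))


reply = run_pipeline(EchoSTT(), lambda text: text.upper(), EchoTTS(), b"hi there")
```

Swapping in AssemblyAI or ElevenLabs then means writing an adapter that satisfies the same interface, without touching the rest of the pipeline.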
An end-to-end reference application is available in the [voice-sandwich-demo](https://github.com/langchain-ai/voice-sandwich-demo) repository. We will walk through that application here.
@@ -726,6 +744,7 @@ We use [RunnableGenerators](https://reference.langchain.com/python/langchain_cor