
Commit bb8f5aa
feat(oss/learn): changes to voice agent guide (#1762)
Authored by hntrl and ccurme
Co-authored-by: ccurme <chester.curme@gmail.com>
1 parent: 4ad224b

1 file changed: src/oss/langchain/voice-agent.mdx (28 additions, 9 deletions)
@@ -3,16 +3,34 @@ title: Build a voice agent with LangChain
 sidebarTitle: Voice agent
 ---
 
+## Overview
 
-Some applications, including customer support, personal assistants, and hands-free interfaces, can benefit from supporting real-time interactions through audio.
+Chat interfaces have dominated how we interact with AI, but recent breakthroughs in multimodal AI are opening up exciting new possibilities. High-quality generative models and expressive text-to-speech (TTS) systems now make it possible to build agents that feel less like tools and more like conversational partners.
 
-This tutorial demonstrates how to interact with LangChain [agents](/oss/langchain/agents) through voice channels. It follows the reference architecture implemented in the [LangChain repo](https://github.com/langchain-ai/voice-sandwich-demo).
+Voice agents are one example of this. Instead of relying on a keyboard and mouse to type inputs into an agent, you can use spoken words to interact with it. This can be a more natural and engaging way to interact with AI, and can be especially useful in certain contexts.
 
-## Concepts
+### What are voice agents?
 
-There are two common architectures for voice agents:
+Voice agents are [agents](/oss/langchain/agents) that can engage in natural spoken conversations with users. They combine speech recognition, natural language processing, generative AI, and text-to-speech technologies to create seamless, natural conversations.
 
-### 1. "The Sandwich"
+They're suited for a variety of use cases, including:
+
+- Customer support
+- Personal assistants
+- Hands-free interfaces
+- Coaching and training
+
+### How do voice agents work?
+
+At a high level, every voice agent needs to handle three tasks:
+
+1. **Listen** - capture audio and transcribe it
+2. **Think** - interpret intent, reason, plan
+3. **Speak** - generate audio and stream it back to the user
+
+What differs is how these steps are sequenced and coupled. In practice, production agents follow one of two main architectures:
+
+#### 1. STT > Agent > TTS Architecture (The "Sandwich")
 
 The Sandwich architecture composes three distinct components: speech-to-text (STT), a text-based LangChain agent, and text-to-speech (TTS).
 
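For illustration, the Listen / Think / Speak loop introduced in this hunk can be read as a streaming pipeline. Below is a minimal TypeScript sketch, assuming hypothetical `transcribe`, `runAgent`, and `synthesize` adapters; none of these names come from the demo repo.

```typescript
// Hypothetical adapter signatures -- placeholders, not the demo repo's API.
type Transcriber = (audio: AsyncIterable<Uint8Array>) => AsyncIterable<string>;
type Agent = (utterance: string) => Promise<string>;
type Synthesizer = (text: string) => AsyncIterable<Uint8Array>;

// One pass through the sandwich: Listen -> Think -> Speak.
async function* voicePipeline(
  audioIn: AsyncIterable<Uint8Array>,
  transcribe: Transcriber,
  runAgent: Agent,
  synthesize: Synthesizer,
): AsyncIterable<Uint8Array> {
  for await (const utterance of transcribe(audioIn)) {
    // "Think": the text-based agent only ever sees the transcript.
    const reply = await runAgent(utterance);
    // "Speak": stream synthesized audio back as it is generated.
    yield* synthesize(reply);
  }
}
```

The shape makes the sandwich's trade-off visible: the agent in the middle sees only text, which is what keeps the pipeline modular but is also where tone and emotion are lost.
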
@@ -34,7 +52,7 @@ flowchart LR
 - Additional complexity in managing the pipeline
 - Conversion from speech to text loses information (e.g., tone, emotion)
 
-### 2. Speech-to-Speech Architecture
+#### 2. Speech-to-Speech Architecture (S2S)
 
 Speech-to-speech uses a multimodal model that processes audio input and generates audio output natively.
 
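By contrast, a speech-to-speech agent keeps everything in the audio domain: one model session consumes and emits audio with no intermediate transcript. A rough sketch against a hypothetical `RealtimeSession` interface (illustrative only; no specific provider SDK is implied):

```typescript
// Hypothetical bidirectional session -- illustrative, not a real SDK.
interface RealtimeSession {
  sendAudio(chunk: Uint8Array): void;
  audioOut(): AsyncIterable<Uint8Array>;
}

async function s2sLoop(
  session: RealtimeSession,
  mic: AsyncIterable<Uint8Array>,
  play: (chunk: Uint8Array) => void,
): Promise<void> {
  // Uplink: forward microphone audio to the model as it arrives.
  const uplink = (async () => {
    for await (const chunk of mic) session.sendAudio(chunk);
  })();
  // Downlink: play model audio as it is generated -- no text stage in between.
  for await (const chunk of session.audioOut()) play(chunk);
  await uplink;
}
```
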
@@ -55,11 +73,11 @@ flowchart LR
 - Less transparency in how audio is processed
 - Reduced controllability and customization options
 
-## Demo application overview
-
 This guide demonstrates the **sandwich architecture** to balance performance, controllability, and access to modern model capabilities. The sandwich can achieve sub-700ms latency with some STT and TTS providers while maintaining control over modular components.
 
-The agent will manage orders for a sandwich shop. The application will demonstrate all three components of the sandwich architecture, using [AssemblyAI](https://www.assemblyai.com/) for STT and [ElevenLabs](https://elevenlabs.io/) for TTS (although adapters can be built for most providers).
+### Demo application overview
+
+We'll walk through building a voice-based agent using the sandwich architecture. The agent will manage orders for a sandwich shop. The application will demonstrate all three components of the sandwich architecture, using [AssemblyAI](https://www.assemblyai.com/) for STT and [ElevenLabs](https://elevenlabs.io/) for TTS (although adapters can be built for most providers).
 
 An end-to-end reference application is available in the [voice-sandwich-demo](https://github.com/langchain-ai/voice-sandwich-demo) repository. We will walk through that application here.
 
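The parenthetical about adapters is the extensibility point: any STT or TTS provider that can be wrapped in a streaming interface can slot into the pipeline. A hypothetical wrapper shape (the client types below are placeholders, not the actual AssemblyAI or ElevenLabs SDKs):

```typescript
// Placeholder provider clients -- stand-ins, not the real AssemblyAI/ElevenLabs SDKs.
interface StreamingSttClient {
  streamTranscripts(audio: AsyncIterable<Uint8Array>): AsyncIterable<string>;
}
interface StreamingTtsClient {
  streamSpeech(text: string): AsyncIterable<Uint8Array>;
}

// Adapters reduce each provider to the plain streaming shapes the pipeline needs,
// so swapping providers means swapping one wrapper, not rewriting the agent.
const makeTranscriber =
  (stt: StreamingSttClient) =>
  (audio: AsyncIterable<Uint8Array>): AsyncIterable<string> =>
    stt.streamTranscripts(audio);

const makeSynthesizer =
  (tts: StreamingTtsClient) =>
  (text: string): AsyncIterable<Uint8Array> =>
    tts.streamSpeech(text);
```
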
@@ -726,6 +744,7 @@ We use [RunnableGenerators](https://reference.langchain.com/python/langchain_cor
 
 :::js
 ```typescript
+// using https://hono.dev/
 app.get("/ws", upgradeWebSocket(async () => {
   const inputStream = writableIterator<Uint8Array>();
 
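The hunk is cut off after the first line of the handler. For orientation, a Hono `upgradeWebSocket` handler returns its event callbacks; the sketch below shows how a body like this might continue (the `writableIterator` helper's `push`/`end` methods are assumptions, not the demo's verbatim code):

```typescript
// Sketch only -- the demo's actual handler continues past the hunk shown above.
app.get("/ws", upgradeWebSocket(async () => {
  const inputStream = writableIterator<Uint8Array>();

  return {
    // Assumed wiring: push each binary frame into the pipeline's input stream.
    onMessage(event) {
      if (event.data instanceof ArrayBuffer) {
        inputStream.push(new Uint8Array(event.data));
      }
    },
    // Assumed: end the input stream so downstream consumers can flush and finish.
    onClose() {
      inputStream.end();
    },
  };
}));
```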