src/oss/langchain/voice-agent.mdx
---
title: Build a voice agent with LangChain
sidebarTitle: Voice agent
---
## Overview
Chat interfaces have dominated how we interact with AI, but recent breakthroughs in multimodal AI are opening up exciting new possibilities. High-quality generative models and expressive text-to-speech (TTS) systems now make it possible to build agents that feel less like tools and more like conversational partners.
Voice agents are one example of this. Instead of typing inputs into an agent with a keyboard, you can interact with it using spoken words. This can be a more natural and engaging way to work with AI, and it is especially useful in hands-free contexts.
### What are voice agents?
Voice agents are [agents](/oss/langchain/agents) that can engage in natural spoken conversations with users. They combine speech recognition, natural language processing, generative AI, and text-to-speech technologies to create a seamless conversational experience.
They're suited for a variety of use cases, including:
- Customer support
- Personal assistants
- Hands-free interfaces
- Coaching and training
### How do voice agents work?
At a high level, every voice agent needs to handle three tasks:
1. **Listen** - capture audio and transcribe it
2. **Think** - interpret intent, reason, plan
3. **Speak** - generate audio and stream it back to the user
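The three tasks above can be sketched as plain functions. Everything below is an illustrative stub (the transcript, the reply rule, and the audio encoding are all made up), not a real STT, agent, or TTS stack:

```python
def listen(audio: bytes) -> str:
    """Capture audio and transcribe it (stubbed with a fixed transcript)."""
    return "one turkey sandwich, please"


def think(transcript: str) -> str:
    """Interpret intent and plan a response (stubbed with a simple rule)."""
    if "sandwich" in transcript:
        return "Sure! One turkey sandwich coming up."
    return "Sorry, could you repeat that?"


def speak(text: str) -> bytes:
    """Generate audio for the response (stubbed as UTF-8 bytes)."""
    return text.encode("utf-8")


def voice_turn(audio: bytes) -> bytes:
    """One conversational turn: listen, then think, then speak."""
    return speak(think(listen(audio)))
```

In a real agent, each stub would be replaced by a streaming integration with an STT model, an LLM-backed agent, and a TTS model.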
The difference lies in how these steps are sequenced and coupled. In practice, production agents follow one of two main architectures:
#### 1. STT > Agent > TTS Architecture (The "Sandwich")
The Sandwich architecture composes three distinct components: speech-to-text (STT), a text-based LangChain agent, and text-to-speech (TTS).
@@ -34,7 +52,7 @@ flowchart LR
- Additional complexity in managing the pipeline
- Conversion from speech to text loses information (e.g., tone, emotion)
#### 2. Speech-to-Speech Architecture (S2S)
Speech-to-speech uses a multimodal model that processes audio input and generates audio output natively.
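To contrast with the sandwich, the speech-to-speech contract reduces to a single audio-in, audio-out call with no text in between. `FakeS2SModel` below is a hypothetical stand-in to show the shape of that contract, not any provider's actual SDK:

```python
class FakeS2SModel:
    """Hypothetical speech-to-speech model: audio in, audio out."""

    def generate(self, audio_in: bytes) -> bytes:
        # A real multimodal model would reason over the audio natively;
        # here we just echo to illustrate the single-call contract.
        return b"response-audio-for:" + audio_in


model = FakeS2SModel()
audio_out = model.generate(b"user-audio")
```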
@@ -55,11 +73,11 @@ flowchart LR
- Less transparency in how audio is processed
- Reduced controllability and customization options
This guide demonstrates the **sandwich architecture** to balance performance, controllability, and access to modern model capabilities. The sandwich can achieve sub-700ms latency with some STT and TTS providers while maintaining control over modular components.
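Low end-to-end latency depends on streaming at every stage: each component consumes and emits chunks, so synthesis can begin before the agent has finished generating. A minimal sketch with stand-in stages (the token list and per-token synthesis are made up, not real provider calls):

```python
import asyncio
from typing import AsyncIterator


async def agent_tokens(transcript: str) -> AsyncIterator[str]:
    """Stand-in agent that streams its reply token by token."""
    for token in ["One", "turkey", "sandwich!"]:
        yield token


async def tts_chunks(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in TTS: synthesize each chunk as it arrives instead of
    waiting for the full reply."""
    async for token in tokens:
        yield token.encode("utf-8")


async def main() -> list:
    return [chunk async for chunk in tts_chunks(agent_tokens("order please"))]


chunks = asyncio.run(main())
```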
### Demo application overview
We'll walk through building a voice agent that manages orders for a sandwich shop. The application demonstrates all three components of the sandwich architecture, using [AssemblyAI](https://www.assemblyai.com/) for STT and [ElevenLabs](https://elevenlabs.io/) for TTS (although adapters can be built for most providers).
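One way to keep STT and TTS providers swappable, as the adapters aside suggests, is to code the pipeline against small interfaces. The `Protocol` definitions and echo implementations below are an illustrative sketch, not the demo repository's actual adapter API:

```python
from typing import Callable, Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class EchoSTT:
    """Toy adapter: treats the audio bytes as UTF-8 text."""

    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class EchoTTS:
    """Toy adapter: encodes the reply text as bytes."""

    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


def run_pipeline(stt: STT, agent: Callable[[str], str], tts: TTS, audio: bytes) -> bytes:
    """The sandwich: STT, then the agent, then TTS."""
    return tts.synthesize(agent(stt.transcribe(audio)))


reply = run_pipeline(EchoSTT(), lambda text: text.upper(), EchoTTS(), b"hi there")
```

Swapping in AssemblyAI or ElevenLabs then means writing an adapter that satisfies the same interface, without touching the rest of the pipeline.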
An end-to-end reference application is available in the [voice-sandwich-demo](https://github.com/langchain-ai/voice-sandwich-demo) repository. We will walk through that application here.
@@ -726,6 +744,7 @@ We use [RunnableGenerators](https://reference.langchain.com/python/langchain_cor