The Echo Engine
The engine behind every conversation
Vought is the brain. ElevenLabs is the voice. Between them sits the Echo Engine — diarization, orchestration, retrieval, and streaming — tuned end-to-end for sub-second whispers.
The voice loop
Every leg is streamed, never batched. The line is moving toward your ear before the other person finishes their sentence.
01 WEBRTC
Your mic
16 kHz mono PCM streamed to the browser edge.
02 SPEECH ENGINE
ElevenLabs STT
Streaming transcription with end-of-turn detection.
03 DIARIZATION
diart sidecar
Separates you from them on a single mic in real time.
04 ORCHESTRATION
Echo Engine
Persona + playbook RAG + memory assembled into the prompt.
05 GPT-4o / CLAUDE
Streaming LLM
First token in ~250ms, cancelled on interruption.
06 CLONED VOICE
ElevenLabs TTS
Your voice, first audio byte in ~200ms, into your earbud.
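One way to picture "streamed, never batched" is as a chain of generators, where each stage hands its output downstream as soon as it has any. The sketch below is illustrative only: the stage functions are hypothetical stand-ins that pass strings around, not the Vought SDK or the real STT, LLM, or TTS services.

```python
from typing import Iterable, Iterator, List

def mic(log: List[str]) -> Iterator[str]:
    """Hypothetical mic source: records when each frame is captured."""
    for frame in ["HELLO", "THERE"]:
        log.append(f"mic:{frame}")
        yield frame

def transcribe(frames: Iterable[str]) -> Iterator[str]:
    """Stand-in for streaming STT: one word per incoming frame."""
    for frame in frames:
        yield frame.lower()

def respond(words: Iterable[str]) -> Iterator[str]:
    """Stand-in for a streaming LLM: one token per word, as it arrives."""
    for word in words:
        yield f"<{word}>"

def synthesize(tokens: Iterable[str]) -> Iterator[str]:
    """Stand-in for streaming TTS: one audio chunk per token."""
    for token in tokens:
        yield f"audio({token})"

events: List[str] = []
pipeline = synthesize(respond(transcribe(mic(events))))
for chunk in pipeline:
    events.append(f"out:{chunk}")

print(events)
```

Because every stage is lazy, the first output chunk is produced before the second mic frame is even read. That interleaving is the property the loop above depends on.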
Latency budget
Under one second, every turn
End-of-turn detection: ≤ 350 ms
LLM first token: ≤ 250 ms
TTS first byte: ≤ 200 ms
Network + buffer: ≤ 100 ms
Median end-to-end: 412 ms
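As a quick check on the budget, the per-stage ceilings can be added up. This sketch assumes the four stages run back to back and that their ceilings simply add, which is a simplification. The dictionary keys are illustrative names, not SDK fields.

```python
# Per-stage upper bounds in milliseconds, copied from the budget above.
BUDGET_MS = {
    "end_of_turn_detection": 350,
    "llm_first_token": 250,
    "tts_first_byte": 200,
    "network_and_buffer": 100,
}
CEILING_MS = 1000  # "under one second"

# Assumption: stages are sequential, so the worst case is the plain sum.
worst_case_ms = sum(BUDGET_MS.values())
headroom_ms = CEILING_MS - worst_case_ms

print(f"worst case: {worst_case_ms} ms, headroom: {headroom_ms} ms")
# prints "worst case: 900 ms, headroom: 100 ms"
```

Even with every stage at its ceiling, the turn stays under one second. The quoted 412 ms median sits well below that worst case.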
Responsible by default
Built for conversations that can’t leak
- Zero-retention mode on by default: no audio stored unless you turn it on.
- Explicit, revocable voice-clone consent. The clone is yours alone.
- The AI cancels mid-sentence within 200ms when you start to speak.
- Every whisper cites its source: playbook, chunk, and version.
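The mid-sentence cancellation in the list above can be sketched with a cancellable playback task. This is a minimal illustration using Python's standard `asyncio`, not Vought's implementation. `speak`, `run_turn`, and the timer standing in for a voice-activity signal are all hypothetical, and the sketch makes no claim about hitting the 200ms figure.

```python
import asyncio
from typing import List

async def speak(chunks: List[str], played: List[str]) -> None:
    """Stand-in for streaming playback: one chunk every 50 ms."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)

async def run_turn(chunks: List[str], user_speaks_after: float) -> List[str]:
    played: List[str] = []
    playback = asyncio.create_task(speak(chunks, played))

    # Hypothetical voice-activity signal: here it is just a timer.
    await asyncio.sleep(user_speaks_after)

    # Barge-in: stop the whisper at its next await point.
    playback.cancel()
    try:
        await playback
    except asyncio.CancelledError:
        pass  # expected when playback was interrupted
    return played

chunks = [f"chunk-{i}" for i in range(20)]  # about 1 s of playback
played = asyncio.run(run_turn(chunks, user_speaks_after=0.12))
print(f"played {len(played)} of {len(chunks)} chunks before barge-in")
```

Only the first few chunks play before the task is cancelled. The rest are never sent to the earbud.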
Build on Vought
One session call. Your voice on the line.
TypeScript
import { Vought } from "@vought/sdk";

// `user` and `earbud` are assumed to come from your own application.
const session = await Vought.connect({
  persona: "sales-discovery",
  voiceId: user.clonedVoiceId,
  onWhisper: (line) => earbud.play(line.audio),
});
Python
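from vought import Vought

# `user` and `earbud` are assumed to come from your own application.
session = Vought.connect(
    persona="sales-discovery",
    voice_id=user.cloned_voice_id,
)

# Play each whisper as it streams in.
for whisper in session.stream():
    earbud.play(whisper.audio)
cURL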
curl -N https://api.vought.com/v1/sessions \
  -H "Authorization: Bearer $VOUGHT_KEY" \
  -d persona=sales-discovery \
  -d voice_id=$VOICE_ID