The Echo Engine

The engine behind every conversation

Vought is the brain. ElevenLabs is the voice. Between them sits the Echo Engine — diarization, orchestration, retrieval, and streaming — tuned end-to-end for sub-second whispers.

The voice loop

Every leg is streamed, never batched. The line is moving toward your ear before the other person finishes their sentence.
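To make "streamed, never batched" concrete, here is a minimal sketch of a pipeline in which each stage hands a chunk downstream as soon as it has one. The stage names (`micFrames`, `transcribe`, `play`) are hypothetical and are not part of any Vought or ElevenLabs API. Synchronous generators stand in for the real network streams to keep the example short.

```typescript
// Hypothetical stages for illustration only; not a Vought or ElevenLabs API.
const events: string[] = [];

// Stage 1: the mic yields small audio frames one at a time.
function* micFrames(frames: string[]): Generator<string> {
  for (const frame of frames) {
    events.push(`mic:${frame}`);
    yield frame;
  }
}

// Stage 2: transcription emits a partial result per frame
// instead of waiting for the whole utterance.
function* transcribe(frames: Iterable<string>): Generator<string> {
  for (const frame of frames) {
    const partial = frame.toUpperCase();
    events.push(`stt:${partial}`);
    yield partial;
  }
}

// Stage 3: the consumer handles each piece as soon as it arrives.
function play(parts: Iterable<string>): void {
  for (const part of parts) {
    events.push(`play:${part}`);
  }
}

play(transcribe(micFrames(["a", "b"])));
console.log(events.join(" "));
// prints "mic:a stt:A play:A mic:b stt:B play:B"
```

The first frame reaches `play` before the second frame has left the mic. A batched pipeline would log both `mic` events first, then both `stt` events, then both `play` events.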

01 WEBRTC

Your mic

16kHz mono PCM streamed to the browser edge.

02 SPEECH ENGINE

ElevenLabs STT

Streaming transcription with end-of-turn detection.

03 DIARIZATION

diart sidecar

Separates you from them on a single mic in real time.

04 ORCHESTRATION

Echo Engine

Persona + playbook RAG + memory assembled into the prompt.

05 GPT-4o / CLAUDE

Streaming LLM

First token in ~250ms, cancelled on interruption.

06 CLONED VOICE

ElevenLabs TTS

Your voice, first audio byte in ~200ms, into your earbud.
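The orchestration step (persona, playbook RAG, and memory assembled into the prompt) could look roughly like the sketch below. Every type, field name, and the citation format are assumptions made for illustration; the page does not describe the actual prompt layout.

```typescript
// Illustrative types only; these are assumptions, not the Vought SDK.
interface PlaybookChunk {
  playbook: string;
  chunk: number;
  version: string;
  text: string;
}

interface PromptInput {
  persona: string;
  memory: string[];
  retrieved: PlaybookChunk[];
  lastUtterance: string;
}

// Tag a retrieved chunk with playbook, chunk index, and version
// so a suggested line can point back to where it came from.
function cite(c: PlaybookChunk): string {
  return `[${c.playbook}#${c.chunk}@${c.version}]`;
}

// Join persona, memory, and cited playbook context into one prompt string.
function assemblePrompt(input: PromptInput): string {
  const context = input.retrieved.map((c) => `${cite(c)} ${c.text}`);
  return [
    `Persona: ${input.persona}`,
    `Memory: ${input.memory.join("; ")}`,
    "Playbook context:",
    ...context,
    `They said: ${input.lastUtterance}`,
  ].join("\n");
}

const prompt = assemblePrompt({
  persona: "sales-discovery",
  memory: ["Prospect mentioned a Q3 deadline"],
  retrieved: [
    { playbook: "pricing", chunk: 3, version: "v2", text: "Anchor on annual plans." },
  ],
  lastUtterance: "What does this cost?",
});
console.log(prompt);
```

Keeping the citation tag next to each chunk is one simple way to let every suggested line carry its playbook, chunk, and version.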

Latency budget

Under one second, every turn

End-of-turn detection: ≤ 350ms
LLM first token: ≤ 250ms
TTS first byte: ≤ 200ms
Network + buffer: ≤ 100ms
Median end-to-end: 412ms
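As a quick check on the "under one second" claim, the sketch below adds up the four ceilings. It assumes the stages run strictly back to back, which is the pessimistic case. Stages that overlap in practice would finish sooner, which fits the lower median figure.

```typescript
// Ceilings copied from the latency budget above.
// Summing them assumes the stages run one after another (an assumption).
const budget: { stage: string; maxMs: number }[] = [
  { stage: "End-of-turn detection", maxMs: 350 },
  { stage: "LLM first token", maxMs: 250 },
  { stage: "TTS first byte", maxMs: 200 },
  { stage: "Network + buffer", maxMs: 100 },
];

const worstCaseMs = budget.reduce((sum, b) => sum + b.maxMs, 0);
const slackMs = 1000 - worstCaseMs; // room left under the one-second target

console.log(`worst case ${worstCaseMs}ms, slack ${slackMs}ms`);
// prints "worst case 900ms, slack 100ms"
```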

Responsible by default

Built for conversations that can’t leak

  • Zero-retention mode on by default — no audio stored unless you turn it on.
  • Explicit, revocable voice-clone consent. The clone is yours alone.
  • The AI cancels mid-sentence within 200ms when you start to speak.
  • Every whisper cites its source — playbook, chunk, and version.
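The mid-sentence cancel can be pictured as a flag checked between small output chunks. The sketch below is a minimal illustration with a hand-rolled `CancelToken` (a hypothetical helper, not an SDK class). In real asynchronous code the standard `AbortController` would be the more usual choice.

```typescript
// Hypothetical sketch: a hand-rolled cancellation flag, not a Vought API.
class CancelToken {
  private cancelled = false;
  cancel(): void {
    this.cancelled = true;
  }
  get isCancelled(): boolean {
    return this.cancelled;
  }
}

// Emit the whisper chunk by chunk, checking the token before each chunk.
function* speak(words: string[], token: CancelToken): Generator<string> {
  for (const word of words) {
    if (token.isCancelled) return; // stop mid-sentence
    yield word;
  }
}

const token = new CancelToken();
const spoken: string[] = [];
for (const word of speak(["Ask", "about", "their", "budget"], token)) {
  spoken.push(word);
  // Simulate the user starting to talk after the second word.
  if (spoken.length === 2) token.cancel();
}
console.log(spoken.join(" "));
// prints "Ask about"
```

Under this design, how fast a cancel takes effect depends on chunk length: the shorter each chunk, the sooner the flag is checked again.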

Build on Vought

One session call. Your voice on the line.

TypeScript
import { Vought } from "@vought/sdk";

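// `user` and `earbud` come from your own app (the signed-in user and the audio output).
const session = await Vought.connect({
  persona: "sales-discovery",
  voiceId: user.clonedVoiceId,
  // Play each whisper's audio as it arrives.
  onWhisper: (line) => earbud.play(line.audio),
});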
Python
from vought import Vought

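# `user` and `earbud` come from your own app (the signed-in user and the audio output).
session = Vought.connect(
    persona="sales-discovery",
    voice_id=user.cloned_voice_id,
)
# Play each whisper's audio as it arrives.
for whisper in session.stream():
    earbud.play(whisper.audio)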
cURL
curl -N https://api.vought.com/v1/sessions \
  -H "Authorization: Bearer $VOUGHT_KEY" \
  -d "persona=sales-discovery" \
  -d "voice_id=$VOICE_ID"

Build on the voice stack