The engine room
Vought turns a live conversation into a whisper in under a second. The instrument that makes it possible is the ElevenLabs Speech Engine — streaming speech-to-text with end-of-turn detection on the way in, Flash v2 text-to-speech in the operator’s own cloned voice on the way out.
Signal path
Seven stages. One second.
Every leg streams — nothing is batched. The ElevenLabs Speech Engine bookends the loop; the Echo Engine orchestrates the middle.
Engine config · live
Tuned for latency, not defaults
model_id "eleven_flash_v2"
turn_timeout 2
optimize_streaming_latency 3
privacy.zero_retention_mode true
overrides.first_message false
voice_id <operator clone>
turn_timeout: 2 cuts the end-of-turn wait from the 7s default to 2s; optimize_streaming_latency: 3 and Flash v2 push the first audio byte under 200ms; privacy.zero_retention_mode: true keeps the call ephemeral.
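As a rough illustration, the settings above could be held in a typed object and sanity-checked before a session starts. This is a minimal sketch, not the actual integration: the flat shape simply mirrors the keys shown on this page, `buildEngineConfig` and `validateEngineConfig` are hypothetical helpers, and the 0 to 4 range for optimize_streaming_latency is an assumption about the accepted values.

```typescript
// Hypothetical flat shape mirroring the keys shown on this page.
// The real API may nest or name these fields differently.
interface EngineConfig {
  model_id: string;
  turn_timeout: number; // seconds; described above as the end-of-turn wait
  optimize_streaming_latency: number; // assumed valid range: 0 to 4
  privacy: { zero_retention_mode: boolean };
  overrides: { first_message: boolean };
  voice_id: string;
}

// Default quoted in the text above.
const DEFAULT_TURN_TIMEOUT_S = 7;

// Builds the latency-tuned config for one operator's cloned voice.
function buildEngineConfig(voiceId: string): EngineConfig {
  return {
    model_id: "eleven_flash_v2",
    turn_timeout: 2,
    optimize_streaming_latency: 3,
    privacy: { zero_retention_mode: true },
    overrides: { first_message: false },
    voice_id: voiceId,
  };
}

// Returns a list of problems; an empty list means the config looks sane.
function validateEngineConfig(cfg: EngineConfig): string[] {
  const problems: string[] = [];
  if (cfg.turn_timeout <= 0) {
    problems.push("turn_timeout must be positive");
  }
  const level = cfg.optimize_streaming_latency;
  if (!Number.isInteger(level) || level < 0 || level > 4) {
    problems.push("optimize_streaming_latency must be an integer from 0 to 4");
  }
  if (cfg.voice_id.length === 0) {
    problems.push("voice_id is required");
  }
  return problems;
}
```

Validating up front means a mistyped latency level or a missing voice ID fails before the call connects rather than mid-conversation.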
Why it matters
The whisper has to beat the pause
- Humans notice a reply gap past ~1s. The entire loop holds a 412ms median, so the operator hears the line before the silence gets awkward.
- STT end-of-turn detection fires the engine the instant the other party stops, with no fixed timer to wait out.
- TTS streams the cloned voice byte-by-byte, so playback starts before the sentence finishes generating.
- Interruptions cancel the whole chain in <200ms, so Vought never overlaps a real voice.
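The streaming point in the list above can be shown with a toy model. In this sketch a lazy generator stands in for the TTS stream and a log records the order of events; `synthesize` and `playStreaming` are hypothetical names, and words stand in for audio bytes. It only illustrates the interleaving, not real audio playback or timing.

```typescript
// Hypothetical stand-in for a streaming TTS response: yields chunks lazily,
// so each chunk is only generated when the consumer asks for it.
function* synthesize(words: string[], log: string[]): Generator<string> {
  for (const word of words) {
    log.push(`generated:${word}`);
    yield word; // execution pauses here until the next chunk is requested
  }
}

// Plays each chunk as soon as it arrives instead of waiting for the full sentence.
function playStreaming(chunks: Iterable<string>, log: string[]): void {
  for (const chunk of chunks) {
    log.push(`played:${chunk}`);
  }
}

const events: string[] = [];
playStreaming(synthesize(["hold", "the", "price"], events), events);
// events now alternates generated/played entries: the first chunk is played
// before the second one has been generated.
```

A batched pipeline would log all three `generated` entries before the first `played` entry, which is the delay that streaming avoids.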
What the engine unlocks
Advanced use cases, live
Live whisper coaching
The next line streamed into the rep’s ear mid-call — rendered in their own cloned voice, before the moment passes.
Sub-200ms interruption
When the operator starts speaking, the LLM stream and TTS cancel within 200ms via AbortSignal. Never talks over a human.
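A minimal sketch of that cancellation pattern, assuming one AbortController per whisper turn whose signal is shared by both streams. The `Stage` handle, `attachStage`, `startTurn`, and `onOperatorSpeech` are hypothetical names for illustration. The sketch shows how a single abort reaches every stage; it does not demonstrate the 200ms bound.

```typescript
// Hypothetical handle for one cancellable stage (LLM stream or TTS stream).
interface Stage {
  name: string;
  cancelled: boolean;
}

// Registers a stage so that aborting the shared signal marks it cancelled.
function attachStage(name: string, signal: AbortSignal): Stage {
  const stage: Stage = { name, cancelled: signal.aborted };
  signal.addEventListener(
    "abort",
    () => {
      stage.cancelled = true;
    },
    { once: true }
  );
  return stage;
}

// One controller per whisper turn; both stages listen on its signal.
function startTurn(): { controller: AbortController; stages: Stage[] } {
  const controller = new AbortController();
  const stages = [
    attachStage("llm-stream", controller.signal),
    attachStage("tts-stream", controller.signal),
  ];
  return { controller, stages };
}

// Called when voice activity is detected on the operator's side.
function onOperatorSpeech(turn: { controller: AbortController }): void {
  turn.controller.abort(); // abort listeners run synchronously
}
```

In a real pipeline the same signal would typically be passed to the network calls themselves, for example `fetch(url, { signal })`, so that in-flight requests are torn down along with the local state.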
Single-mic diarization
A diart sidecar separates speakers on one microphone and gates the engine so it only ever whispers about the other party.
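The gating step might look roughly like the filter below. This is a sketch under stated assumptions: the `SpeakerSegment` shape and the speaker labels are hypothetical, since the sidecar's actual output format is not described here, and `gateSegments` is an illustrative name.

```typescript
// Hypothetical segment shape; the sidecar's real output format is not specified here.
interface SpeakerSegment {
  speaker: string; // diarization label, e.g. "speaker0"
  text: string; // transcript for this segment
}

// Keeps only segments spoken by someone other than the operator,
// so the engine never whispers about the operator's own words.
function gateSegments(
  segments: SpeakerSegment[],
  operatorSpeaker: string
): SpeakerSegment[] {
  return segments.filter((segment) => segment.speaker !== operatorSpeaker);
}
```

Filtering before the engine sees any text keeps the operator's own speech out of the prompt entirely, rather than relying on the model to ignore it.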
30-second voice clone
A private ElevenLabs voice model from 30s of audio, hot-swapped into the TTS resource per session. Raw audio discarded after.
Autonomous reception
The same engine answers inbound calls end-to-end — greet, qualify, book — on Flash v2 latency.
Zero-retention by default
Ephemeral processing. No audio stored unless explicitly enabled. SOC 2 / HIPAA / GDPR posture.
The team
Vought is a small team building the calm, precise voice layer for high-stakes conversations — from San Francisco and Bangalore. We chose ElevenLabs because the voice has to be indistinguishable from the operator’s own, and fast enough to land inside a live call.