
The engine room

Vought turns a live conversation into a whisper in under a second. The instrument that makes it possible is the ElevenLabs Speech Engine — streaming speech-to-text with end-of-turn detection on the way in, Flash v2 text-to-speech in the operator’s own cloned voice on the way out.

VOUGHT · ECHO ENGINE

STT END-OF-TURN      ≤350ms
LLM FIRST TOKEN      ≤250ms
TTS FIRST BYTE       ≤200ms
LOOP MEDIAN          412ms
INTERRUPT CANCEL     <200ms
AUDIO RETENTION      0 bytes

Signal path

Seven stages. One second.

Every leg streams — nothing is batched. The ElevenLabs Speech Engine bookends the loop; the Echo Engine orchestrates the middle.

Engine config · live

Tuned for latency, not defaults

model_id                     "eleven_flash_v2"
turn_timeout                 2
optimize_streaming_latency   3
privacy.zero_retention_mode  true
overrides.first_message      false
voice_id                     <operator clone>

turn_timeout: 2 cuts end-of-turn waiting from the default 7s to 2s; optimize_streaming_latency: 3 and Flash v2 push the first audio byte under 200ms; zero_retention_mode keeps the call ephemeral.
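
As a rough sketch, the settings above could be held in a typed object before being handed to whatever client is in use. The `EngineConfig` interface, its nested shape, and the `turnWaitSavedSeconds` helper are assumptions for illustration, not an ElevenLabs SDK type; the `voice_id` placeholder is left unfilled, as in the table.

```typescript
// Hypothetical typed mirror of the config table above; not an official SDK type.
interface EngineConfig {
  model_id: string;
  turn_timeout: number; // seconds
  optimize_streaming_latency: number;
  privacy: { zero_retention_mode: boolean };
  overrides: { first_message: boolean };
  voice_id: string;
}

// Default quoted in the text above.
const DEFAULT_TURN_TIMEOUT_S = 7;

const engineConfig: EngineConfig = {
  model_id: "eleven_flash_v2",
  turn_timeout: 2,
  optimize_streaming_latency: 3,
  privacy: { zero_retention_mode: true },
  overrides: { first_message: false },
  voice_id: "<operator clone>", // placeholder kept from the source
};

// Seconds of end-of-turn waiting removed relative to the stated default.
function turnWaitSavedSeconds(config: EngineConfig): number {
  return DEFAULT_TURN_TIMEOUT_S - config.turn_timeout;
}

console.log(turnWaitSavedSeconds(engineConfig)); // 5
```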

Why it matters

The whisper has to beat the pause

  • Humans notice a reply gap past ~1s. The entire loop holds a 412ms median — the operator hears the line before the silence gets awkward.
  • STT end-of-turn detection triggers the engine as soon as the other party stops speaking, rather than waiting on a fixed timer.
  • TTS streams the cloned voice chunk by chunk, so playback starts before the sentence finishes generating.
  • Interruptions cancel the whole chain in <200ms, so Vought never overlaps a real voice.
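
The latency argument in the points above comes down to arithmetic. A minimal sketch, using the per-stage ceilings and the 412ms median quoted on this page; the `StageBudget` type and the function names are illustrative. The three ceilings cover only the stages with published figures, so their sum is a partial bound rather than a full seven-stage total.

```typescript
interface StageBudget {
  stage: string;
  ceilingMs: number;
}

// Per-stage ceilings quoted on this page.
const budgets: StageBudget[] = [
  { stage: "STT end-of-turn", ceilingMs: 350 },
  { stage: "LLM first token", ceilingMs: 250 },
  { stage: "TTS first byte", ceilingMs: 200 },
];

const NOTICEABLE_GAP_MS = 1000; // the ~1s reply gap from the first bullet
const LOOP_MEDIAN_MS = 412;

// Sum of the stage ceilings: what the loop costs if every stage hits its limit.
function worstCaseMs(stages: StageBudget[]): number {
  return stages.reduce((sum, s) => sum + s.ceilingMs, 0);
}

// Time left before the gap becomes noticeable.
function headroomMs(latencyMs: number): number {
  return NOTICEABLE_GAP_MS - latencyMs;
}

console.log(worstCaseMs(budgets)); // 800
console.log(headroomMs(worstCaseMs(budgets))); // 200
console.log(headroomMs(LOOP_MEDIAN_MS)); // 588
```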

What the engine unlocks

Advanced use cases, live

Live whisper coaching

The next line streamed into the rep’s ear mid-call — rendered in their own cloned voice, before the moment passes.

Sub-200ms interruption

When the operator starts speaking, the LLM stream and TTS cancel within 200ms via AbortSignal. Never talks over a human.
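
A minimal sketch of this cancellation pattern, assuming the LLM and TTS streams both honor a standard `AbortSignal`. `streamChunks`, `whisper`, and `demo` are hypothetical stand-ins, not Vought or ElevenLabs APIs, and the sketch shows only the cancellation path, not the 200ms timing.

```typescript
type WhisperResult = "completed" | "cancelled";

// Hypothetical stand-in for a streaming LLM or TTS response.
async function* streamChunks(
  chunks: string[],
  signal: AbortSignal,
): AsyncGenerator<string> {
  for (const chunk of chunks) {
    if (signal.aborted) return; // stop producing once the turn is cancelled
    yield chunk;
    await Promise.resolve(); // let queued microtasks run between chunks
  }
}

// Plays chunks until the stream ends or the signal is aborted.
async function whisper(
  chunks: string[],
  signal: AbortSignal,
  play: (chunk: string) => void,
): Promise<WhisperResult> {
  for await (const chunk of streamChunks(chunks, signal)) {
    if (signal.aborted) break;
    play(chunk);
  }
  return signal.aborted ? "cancelled" : "completed";
}

// Usage: the operator starts speaking after two chunks have played.
async function demo(): Promise<{ result: WhisperResult; played: string[] }> {
  const controller = new AbortController();
  const played: string[] = [];
  const result = await whisper(
    ["Ask", "about", "their", "timeline"],
    controller.signal,
    (chunk) => {
      played.push(chunk);
      if (played.length === 2) controller.abort(); // simulated speech onset
    },
  );
  return { result, played };
}
```

Sharing one signal across both streams means a single `abort()` call stops the whole chain, which is the property the interruption guarantee depends on.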

Single-mic diarization

A diart sidecar separates speakers on one microphone and gates the engine so it only ever whispers about the other party.
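
The gating step can be sketched as a filter over diarized segments. This assumes the sidecar emits speaker-labelled transcript segments; the `Segment` shape, the speaker labels, and `gateForEngine` are hypothetical and do not reflect diart's actual output format.

```typescript
// Hypothetical shape of one diarized, transcribed segment.
interface Segment {
  speaker: string;
  startMs: number;
  endMs: number;
  text: string;
}

// Forward only segments that were not spoken by the operator.
function gateForEngine(segments: Segment[], operatorSpeaker: string): Segment[] {
  return segments.filter((s) => s.speaker !== operatorSpeaker);
}

const segments: Segment[] = [
  { speaker: "operator", startMs: 0, endMs: 1200, text: "Thanks for joining." },
  { speaker: "other", startMs: 1300, endMs: 3100, text: "We need this live by March." },
  { speaker: "operator", startMs: 3200, endMs: 4000, text: "Understood." },
];

// Only the other party's segment reaches the engine.
const forwarded = gateForEngine(segments, "operator");
console.log(forwarded.map((s) => s.text));
```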

30-second voice clone

A private ElevenLabs voice model from 30s of audio, hot-swapped into the TTS resource per session. Raw audio discarded after.

Autonomous reception

The same engine answers inbound calls end-to-end — greet, qualify, book — on Flash v2 latency.

Zero-retention by default

Ephemeral processing. No audio stored unless explicitly enabled. SOC 2 / HIPAA / GDPR posture.

The team

Vought is a small team building the calm, precise voice layer for high-stakes conversations — from San Francisco and Bangalore. We chose ElevenLabs because the voice has to be indistinguishable from the operator’s own, and fast enough to land inside a live call.