
The engine room

Vought turns a live conversation into a whisper in under a second. The instrument that makes it possible is the ElevenLabs Speech Engine — streaming speech-to-text with end-of-turn detection on the way in, Flash v2 text-to-speech in the operator’s own cloned voice on the way out.

STT END-OF-TURN    ≤350ms
LLM FIRST TOKEN    ≤250ms
TTS FIRST BYTE     ≤200ms
LOOP MEDIAN        412ms
INTERRUPT CANCEL   <200ms
AUDIO RETENTION    0 bytes
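
As a rough check, the three per-stage ceilings listed above can be summed against the one-second target. This is a minimal sketch: the variable names are illustrative, and it covers only the three stages the page reports a ceiling for, so it is a partial budget rather than a full accounting of the loop.

```typescript
// Per-stage latency ceilings in milliseconds, copied from the figures above.
// The names are illustrative; only the numbers come from the page.
const stageCeilings = [
  { name: "stt_end_of_turn", ceilingMs: 350 },
  { name: "llm_first_token", ceilingMs: 250 },
  { name: "tts_first_byte", ceilingMs: 200 },
];

// Target from the page: the whisper should land in under one second.
const loopBudgetMs = 1000;

// Worst case if every listed stage hits its ceiling at the same time.
const worstCaseMs = stageCeilings.reduce((sum, s) => sum + s.ceilingMs, 0);

// Headroom left for the stages that are not itemized above.
const headroomMs = loopBudgetMs - worstCaseMs;

console.log(`worst case ${worstCaseMs}ms, headroom ${headroomMs}ms`);
```

With these figures the listed ceilings sum to 800ms, leaving 200ms of headroom inside the one-second target. The reported 412ms median sits well below that worst case.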

Signal path

Seven stages. One second.

Every leg streams — nothing is batched. The ElevenLabs Speech Engine bookends the loop; the Echo Engine orchestrates the middle.

Engine config · live

Tuned for latency, not defaults

model_id                     "eleven_flash_v2"
turn_timeout                 2
optimize_streaming_latency   3
privacy.zero_retention_mode  true
overrides.first_message      false
voice_id                     <operator clone>

turn_timeout: 2 cuts end-of-turn waiting from the 7s default to 2s. optimize_streaming_latency: 3 and Flash v2 together push the first audio byte under 200ms. privacy.zero_retention_mode: true keeps the call ephemeral.
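
For illustration, the settings above could be carried in code as a typed object. This is a sketch under assumptions: the field names and values come from the listing, but the EngineConfig interface, the nesting of privacy and overrides, and the buildEngineConfig helper are hypothetical and are not taken from the ElevenLabs SDK.

```typescript
// Hypothetical shape. Field names mirror the config listing above;
// the interface itself is an illustration, not an ElevenLabs SDK type.
interface EngineConfig {
  model_id: string;
  turn_timeout: number; // seconds of end-of-turn waiting
  optimize_streaming_latency: number;
  privacy: { zero_retention_mode: boolean };
  overrides: { first_message: boolean };
  voice_id: string; // the operator's cloned voice
}

// Builds the latency-tuned config for one operator (hypothetical helper).
function buildEngineConfig(operatorVoiceId: string): EngineConfig {
  if (operatorVoiceId.trim() === "") {
    throw new Error("an operator voice id is required");
  }
  return {
    model_id: "eleven_flash_v2",
    turn_timeout: 2,
    optimize_streaming_latency: 3,
    privacy: { zero_retention_mode: true },
    overrides: { first_message: false },
    voice_id: operatorVoiceId,
  };
}

// "operator-clone-123" is a placeholder id, not a real voice.
const config = buildEngineConfig("operator-clone-123");
console.log(config.model_id, config.turn_timeout);
```

Keeping the voice id as the only argument reflects the listing, where every other value is fixed and only the operator clone changes per session.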

Why it matters

The whisper has to beat the pause

- Humans notice a reply gap past ~1s. The entire loop holds a 412ms median, so the operator hears the line before the silence gets awkward.
- STT end-of-turn detection fires the engine the instant the other party stops, with no fixed timer to wait out.
- TTS streams the cloned voice byte by byte, so playback starts before the sentence finishes generating.
- Interruptions cancel the whole chain in <200ms, so Vought never talks over a real voice.
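
The streaming point above can be shown with a toy model. This is a simplified, synchronous sketch, not the real TTS client: synthesize and the events log are hypothetical stand-ins, and each word plays the role of one audio chunk.

```typescript
// Ordered log of what happened, used to show the interleaving.
const events: string[] = [];

// Stand-in for a streaming TTS call (hypothetical): it hands over each
// chunk as soon as it is produced instead of returning the whole sentence.
function synthesize(text: string, onChunk: (chunk: string) => void): void {
  for (const word of text.split(" ")) {
    events.push(`generated:${word}`);
    onChunk(word);
  }
}

// Stand-in for the audio sink: plays each chunk the moment it arrives.
synthesize("ask about budget", (chunk) => {
  events.push(`played:${chunk}`);
});

console.log(events);
```

In the log, the first chunk is played before the last one is generated. That ordering is what lets playback start while the rest of the sentence is still being produced.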

What the engine unlocks

Advanced use cases, live

Live whisper coaching

The next line streamed into the rep’s ear mid-call — rendered in their own cloned voice, before the moment passes.

Sub-200ms interruption

When the operator starts speaking, an AbortSignal cancels the LLM stream and the TTS within 200ms, so Vought never talks over a human.
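
A minimal sketch of that cancellation pattern, assuming a single AbortController is shared by both stages of a turn. The WhisperTurn class and its method names are hypothetical, and the sketch shows only the fan-out of one abort to both stages, not the 200ms timing.

```typescript
// Hypothetical per-turn controller: one AbortController shared by the
// LLM stream and the TTS stream, so a single abort() stops both.
class WhisperTurn {
  private readonly controller = new AbortController();
  readonly cancelled: string[] = [];

  // Registers a stage and returns the shared signal for it to watch.
  // The listener records the stage name when the turn is aborted.
  attach(stage: string): AbortSignal {
    this.controller.signal.addEventListener(
      "abort",
      () => {
        this.cancelled.push(stage);
      },
      { once: true },
    );
    return this.controller.signal;
  }

  // Called when the operator's own voice is detected.
  interrupt(): void {
    this.controller.abort();
  }
}

const turn = new WhisperTurn();
const llmSignal = turn.attach("llm");
const ttsSignal = turn.attach("tts");

turn.interrupt();
console.log(turn.cancelled, llmSignal.aborted, ttsSignal.aborted);
```

In a real pipeline the returned signal would typically be passed to each network call, for example as the signal option of fetch, so that aborting also tears down the underlying request.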

Single-mic diarization

A diart sidecar separates speakers on one microphone and gates the engine so it only ever whispers about the other party.
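
The gating step might look like the sketch below. The SpeakerSegment shape, the speaker labels, and the gateForOtherParty helper are all assumptions made for illustration; the sidecar's actual output format is not shown here.

```typescript
// Hypothetical diarized segment; the sidecar's real output format may differ.
interface SpeakerSegment {
  speaker: string; // label assigned by the diarizer, e.g. "speaker0"
  text: string;
}

// Keeps only segments spoken by someone other than the operator,
// so the engine never generates a whisper about the operator's own words.
function gateForOtherParty(
  segments: SpeakerSegment[],
  operatorLabel: string,
): SpeakerSegment[] {
  return segments.filter((segment) => segment.speaker !== operatorLabel);
}

// Made-up sample input for illustration.
const segments: SpeakerSegment[] = [
  { speaker: "speaker0", text: "Thanks for taking the call." },
  { speaker: "speaker1", text: "What does onboarding cost?" },
  { speaker: "speaker0", text: "Good question." },
];

const forEngine = gateForOtherParty(segments, "speaker0");
console.log(forEngine.map((segment) => segment.text));
```

Here "speaker0" stands in for the operator, so only the other party's question reaches the engine.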

30-second voice clone

A private ElevenLabs voice model is built from 30s of audio and hot-swapped into the TTS resource per session. The raw audio is discarded afterwards.

Autonomous reception

The same engine answers inbound calls end-to-end — greet, qualify, book — on Flash v2 latency.

Zero-retention by default

Ephemeral processing. No audio stored unless explicitly enabled. SOC 2 / HIPAA / GDPR posture.

The team

Vought is a small team building the calm, precise voice layer for high-stakes conversations — from San Francisco and Bangalore. We chose ElevenLabs because the voice has to be indistinguishable from the operator’s own, and fast enough to land inside a live call.
