One socket, three jobs.
Speech Engine collapses STT, TTS, and turn detection into a single connection. No round-trips between a transcription vendor, a thinking layer, and a synthesis vendor — the integration tax is gone.
Vought is a thin layer of taste on top of an extraordinary stack. ElevenLabs is the engine. Vercel and Render keep it running. The source is public — read every line.
ElevenLabs didn’t just ship better TTS. They moved the entire field — voice quality past the uncanny valley, voice cloning from seconds of audio instead of hours, and now Speech Engine: the first primitive that puts STT, TTS, and turn detection on one socket. Every voice product built after them inherits the floor they raised.
We tried the obvious alternative — a Whisper transcription service, a frontier LLM, and a separate TTS vendor stitched together. End-to-end latency landed near two seconds. Voices drifted. Interrupting the AI mid-sentence required custom plumbing. Speech Engine collapsed all three into a single primitive and the loop dropped to well under a second on the first try.
Animated edges are live audio and streaming LLM tokens. Static edges are control signals. The diagram is faithful — these are the exact services and the exact direction of flow in production.
eleven_flash_v2 ships the first audio chunk faster than most pipelines finish thinking. That is the entire reason Vought feels like a whisper instead of a robot.
The product’s wow moment — the AI speaking in your voice — exists because cloning is a 30-second capture and a single API call, not a multi-week studio session.
No SIP gateway, no audio bridge, no native client. The same socket runs in a hackathon laptop browser and a production deployment unchanged.
Sessions are ephemeral. We opted into retention_days = -1 at engine creation so transcripts and audio never persist on their side. Compliance gets shorter, not longer.
When the operator interrupts the whisper, the same AbortController that cancels the LLM also closes the TTS stream cleanly. End-to-end interrupt in under 200 ms.
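The interrupt path described above can be sketched in isolation. This is a minimal illustration, not Vought's code or the ElevenLabs SDK: `sleep`, `fakeLlmTokens`, `speak`, and `runTurn` are hypothetical stand-ins, and the only real API used is the standard `AbortController` / `AbortSignal`. It shows how a single signal can stop both the token producer (the "LLM") and the consumer (the "TTS") in one step.

```typescript
// Hypothetical stand-ins for illustration only; none of these are SDK calls.

/** Resolves after `ms`, or rejects early if the signal aborts first. */
function sleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

/** Fake LLM: yields one token every 20 ms until the signal aborts. */
async function* fakeLlmTokens(signal: AbortSignal): AsyncGenerator<string> {
  for (const token of ["The", " price", " is", " negotiable", "."]) {
    await sleep(20, signal);
    yield token;
  }
}

/** Fake TTS sink: consumes tokens and reports whether it was cut short. */
async function speak(
  tokens: AsyncIterable<string>,
  signal: AbortSignal,
  spoken: string[],
): Promise<"finished" | "interrupted"> {
  try {
    for await (const token of tokens) spoken.push(token);
    return "finished";
  } catch (err) {
    // Only treat the failure as an interrupt if our own signal fired.
    if (signal.aborted) return "interrupted";
    throw err;
  }
}

/** Runs one turn; aborts it after `interruptAfterMs` if a value is given. */
async function runTurn(interruptAfterMs: number | null) {
  const controller = new AbortController();
  const spoken: string[] = [];
  const timer =
    interruptAfterMs === null
      ? undefined
      : setTimeout(() => controller.abort(), interruptAfterMs);
  // The same signal is threaded into the producer and the consumer.
  const outcome = await speak(
    fakeLlmTokens(controller.signal),
    controller.signal,
    spoken,
  );
  if (timer !== undefined) clearTimeout(timer);
  return { outcome, spoken };
}
```

Calling `runTurn(50)` aborts partway through the five-token reply, so the turn ends as interrupted with only part of the sentence spoken, while `runTurn(null)` runs to completion. The design point is that there is one cancellation source per turn, so the producer and the sink cannot drift out of sync when the operator cuts in.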
await elevenlabs.speechEngine.attach(SPEECH_ENGINE_ID, httpServer, '/ws', {
  onTranscript: async (transcript, signal, session) => {
    const stream = await llm.chat({ signal, ... }); // AbortSignal threaded
    await session.sendResponse(stream); // STT → LLM → TTS
  },
});
elevenlabs.io
Both Next.js apps (the marketing site at vought.com and the product app at app.vought.com) ship to Vercel on every push to main. Preview deployments are how the design review loop stays under five minutes.
vercel.com
The Node Echo Engine, the Python diart sidecar, and the managed Postgres + Redis instances all run on Render. One Dockerfile per service, deployed from the same monorepo with zero glue.
render.com
vought-os/
├── apps/
│   ├── web/                      → Vercel (marketing)
│   └── app/                      → Vercel (product)
├── services/
│   ├── echo-engine/              → Render (Node, Dockerfile)
│   └── diarization-sidecar/      → Render (Python, Dockerfile)
├── packages/
│   ├── design-system/            → tokens + global.css
│   ├── motion/                   → the four signature motions
│   └── ui/                       → Container, Grid, primitives
├── docker-compose.yml            → local Postgres + Redis
└── turbo.json                    → pipelines
Each service has its own Dockerfile, its own deployment target, and its own scaling story. The monorepo glue is Turborepo pipelines and a shared design-system package — nothing exotic.
The Echo Engine, the diarization sidecar, the live-call screen, the voice-clone onboarding, the marketing site — every file is on GitHub.
github.com/Sushant6095/vought-os