Robot Holmes and the Silenced Witness: A Noir Guide to Real-Time Voice AI
- Track: Machine Learning, NLP and CV
- Type: Talk (long session)
- Level: intermediate
- Duration: 45 minutes
Abstract
The hardships of building End-to-End Voice Assistants in the Wild
Robot Holmes is back in the mist-choked streets of MLington, but he isn’t working solo.
Meet Zintia, an intern from the Voice Assistant district. She’s helpful, hyper-efficient, and incredibly annoying, providing Holmes with data before he can lift a finger. But Zintia has a secret. The longer she’s on the case, the more of her "dark side" emerges. She’s not just hearing the truth; she’s deciding which parts Holmes is allowed to hear.
This is a story-driven, practical session for anyone tired of "Hello World" chatbots. We will move past the hype to look at what it actually takes to make End-to-End Voice Assistants work in the real world.
Our Investigation Includes:
- The Gear: How to use E2E speech models like gpt-realtime and integrate them into a production voice interface using FreeSWITCH and Pipecat.
- The Interrogation: Navigating the hardships of instruction-following and keeping the underlying LLMs on track through defined states and agentic flows.
- The Double-Cross: Identifying and mitigating "hidden agendas" - the hallucinations and safety guardrails that can make a voice assistant turn on its user.
Expect live demos, hard-won production lessons, a detective noir story, and a blueprint for building voice agents that are fast, fluid, and (mostly) law-abiding.