Beyond the Demo: Production Patterns for Streaming LLM Systems
- Track: Data Engineering and MLOps
- Type: Talk
- Level: Advanced
- Duration: 30 minutes
Abstract
Most LLM tutorials end where production begins. When OpenAI returns a 429, when Claude’s latency spikes 10x, or when your streaming response dies mid-generation—what happens to your users?
This talk covers battle-tested architecture patterns for production LLM streaming, moving beyond simple API calls to resilient systems. We will explore multi-provider failover chains (OpenAI → Anthropic → local), circuit breakers specifically configured for AI workloads, and token-aware rate limiting that protects both latency and cost.
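To give a flavor of the failover-plus-circuit-breaker pattern, here is a minimal Python sketch. The provider ordering, thresholds, and the `call_with_failover` helper are illustrative assumptions for this abstract, not code from the talk itself.

```python
# Minimal sketch: failover chain with per-provider circuit breakers.
# Thresholds, timeouts, and provider names are illustrative only.
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable

@dataclass
class CircuitBreaker:
    failure_threshold: int = 3      # consecutive failures before the circuit opens
    cooldown_seconds: float = 30.0  # how long the circuit stays open
    failures: int = 0
    opened_at: float | None = None

    def allow(self) -> bool:
        # Closed, or open but past the cooldown window (half-open retry).
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.cooldown_seconds

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

async def call_with_failover(
    providers: dict[str, Callable[[str], Awaitable[str]]],
    breakers: dict[str, CircuitBreaker],
    prompt: str,
) -> str:
    # Try providers in order (e.g. OpenAI -> Anthropic -> local),
    # skipping any whose circuit breaker is currently open.
    for name, call in providers.items():
        breaker = breakers[name]
        if not breaker.allow():
            continue
        try:
            result = await asyncio.wait_for(call(prompt), timeout=20.0)
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
    raise RuntimeError("all providers failed or are circuit-broken")
```

The key design choice this sketch hints at is tuning breaker thresholds and cooldowns for AI workloads, where a single slow generation is normal but a burst of 429s or timeouts is not.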
You will learn framework-agnostic Python patterns using asyncio and LiteLLM for provider abstraction. We will examine real incident patterns—including the December 2025 Anthropic outage—and the architectural decisions that separate 99.5% availability from 99.9%.
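As a rough preview of the provider-abstraction piece, the sketch below streams through LiteLLM's `acompletion` API and falls back to the next model when a call fails; the model names and fallback order are illustrative assumptions, not the talk's recommended chain.

```python
# Minimal sketch: provider-abstracted streaming via LiteLLM with a fallback chain.
# Model names and ordering are illustrative; chunk handling assumes LiteLLM's
# OpenAI-compatible streaming format.
import asyncio
from litellm import acompletion

FALLBACK_MODELS = ["gpt-4o", "claude-3-5-sonnet-20241022", "ollama/llama3"]

async def stream_with_fallback(prompt: str) -> str:
    last_error: Exception | None = None
    for model in FALLBACK_MODELS:
        try:
            stream = await acompletion(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                stream=True,
                timeout=30,
            )
            parts: list[str] = []
            async for chunk in stream:
                delta = chunk.choices[0].delta.content
                if delta:
                    parts.append(delta)
            return "".join(parts)
        except Exception as exc:  # 429s, timeouts, mid-stream disconnects
            last_error = exc
            continue
    raise RuntimeError("all providers in the fallback chain failed") from last_error

if __name__ == "__main__":
    print(asyncio.run(stream_with_fallback("Summarize circuit breakers in one line.")))
```

Note that restarting the chain after a mid-stream failure discards any partial output already sent to the user; whether to replay, stitch, or surface that gap is exactly the kind of architectural decision the talk examines.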