shuo 说
A voice agent framework in ~600 lines of Python.
python main.py +1234567890
Server starting on port 3040
Ready https://mature-spaniel-physically.ngrok-free.app
Calling +1234567890...
Call initiated SID: CA094f2e...
WebSocket connected
▶ Stream started SID: MZ8a3b1f...
Flux EndOfTurn "Hey, how's it going?"
LISTENING → RESPONDING
Start Agent "Hey, how's it going?"
Agent turn done
RESPONDING → LISTENING
How it works
Two abstractions, one pure function:
- Deepgram Flux: always-on STT + turn detection over a single WebSocket
- Agent: self-contained LLM → TTS → Player pipeline, owns conversation history
- process_event(state, event) → (state, actions): the entire state machine in ~30 lines
Everything streams. LLM tokens feed TTS immediately, TTS audio feeds Twilio immediately. If you interrupt (barge-in), the agent cancels everything and clears the audio buffer instantly.
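A minimal sketch of the streaming and barge-in idea, using only asyncio. It is not shuo's code: `fake_llm`, `fake_tts`, and `run_turn` are hypothetical stand-ins for the OpenAI, ElevenLabs, and Twilio pieces, and the outbound queue stands in for the audio buffer. Each stage is an async generator, so a chunk moves downstream as soon as it exists, and cancelling the turn drops whatever audio is still queued.

```python
import asyncio
from typing import AsyncIterator


async def fake_llm(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM: yields tokens one at a time."""
    for token in ["Doing ", "well, ", "thanks!"]:
        await asyncio.sleep(0)  # hand control back to the event loop between tokens
        yield token


async def fake_tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for streaming TTS: one audio chunk per token, no batching."""
    async for token in tokens:
        yield token.encode()


async def run_turn(prompt: str, outbox: "asyncio.Queue[bytes]") -> None:
    """Forward audio chunk by chunk; on cancellation, discard audio not yet sent."""
    try:
        async for chunk in fake_tts(fake_llm(prompt)):
            await outbox.put(chunk)
    except asyncio.CancelledError:
        # Barge-in: empty the pending audio, then let the cancellation propagate.
        while not outbox.empty():
            outbox.get_nowait()
        raise
```

In this sketch a barge-in is `task.cancel()` on the task running `run_turn`. In a real call the agent would presumably also need to tell Twilio to drop audio it has already buffered, which this example does not model.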
LISTENING ──EndOfTurn──→ RESPONDING ──Done──→ LISTENING
    ↑                        │
    └────StartOfTurn─────────┘ (barge-in)
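The transitions in the diagram can be sketched as a pure function. This is an illustration, not the contents of state.py: the class names (`Phase`, `State`, `EndOfTurn`, `StartOfTurn`, `AgentDone`, `StartAgent`, `CancelAgent`) are hypothetical, and ignoring events that do not match the current phase is an assumption.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto
from typing import List, Tuple, Union


class Phase(Enum):
    LISTENING = auto()
    RESPONDING = auto()


@dataclass(frozen=True)
class State:
    phase: Phase = Phase.LISTENING


# Events (hypothetical names mirroring the diagram labels)
@dataclass(frozen=True)
class EndOfTurn:
    transcript: str


@dataclass(frozen=True)
class StartOfTurn:
    pass


@dataclass(frozen=True)
class AgentDone:
    pass


# Actions the event loop would carry out (also hypothetical)
@dataclass(frozen=True)
class StartAgent:
    transcript: str


@dataclass(frozen=True)
class CancelAgent:
    pass


Event = Union[EndOfTurn, StartOfTurn, AgentDone]
Action = Union[StartAgent, CancelAgent]


def process_event(state: State, event: Event) -> Tuple[State, List[Action]]:
    """Return the next state and the side effects to perform; performs no I/O itself."""
    if state.phase is Phase.LISTENING and isinstance(event, EndOfTurn):
        return replace(state, phase=Phase.RESPONDING), [StartAgent(event.transcript)]
    if state.phase is Phase.RESPONDING and isinstance(event, StartOfTurn):
        # Barge-in: the caller started talking over the agent.
        return replace(state, phase=Phase.LISTENING), [CancelAgent()]
    if state.phase is Phase.RESPONDING and isinstance(event, AgentDone):
        return replace(state, phase=Phase.LISTENING), []
    return state, []  # assumption: any other combination is ignored
```

Because the function only maps values to values, it can be tested without a network connection or audio, which is consistent with a test suite that runs in a fraction of a second.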
Project structure
shuo/
  types.py            # Immutable state, events, actions
  state.py            # Pure state machine (~30 lines)
  conversation.py     # Main event loop
  agent.py            # LLM → TTS → Player pipeline
  log.py              # Colored logging
  server.py           # FastAPI endpoints
  services/
    flux.py           # Deepgram Flux (STT + turns)
    llm.py            # OpenAI GPT-4o-mini streaming
    tts.py            # ElevenLabs WebSocket streaming
    tts_pool.py       # TTS connection pool (warm spares)
    player.py         # Audio playback to Twilio
    twilio_client.py  # Outbound calls + message parsing
Setup
Requires Python 3.9+, ngrok, and API keys for Twilio, Deepgram, OpenAI, and ElevenLabs.
pip install -r requirements.txt
cp .env.example .env   # fill in your keys
ngrok http 3040        # in another terminal
python main.py +1234567890
Tests
python -m pytest tests/ -v   # runs in ~0.03s
License
MIT