When to use
- You want full control over the speech stack — STT, LLM, TTS, turn detection, tools, memory — and run it yourself.
- You already have a Pipecat, LiveKit Agents, custom Python/Node, or other voice agent and need to give it a face.
- The hosted engines (OpenAI Realtime, Cartesia) don’t fit your pipeline.
How it works
When you create a session with type: "external", Avaturn opens a WebSocket to the url you provide and exchanges:
- Binary — raw PCM16LE mono audio in both directions.
- JSON — small control protocol that frames every burst of avatar speech as a segment and propagates playback events back to your engine.
Prerequisites
- A reachable WebSocket endpoint, wss:// in production. Avaturn must reach it over the public internet.
- Avaturn API key (dashboard).
- A way to authenticate the incoming WebSocket (shared secret, signed token, IP allowlist — your choice; see Authentication).
1. Create the Avaturn session
conversation_engine fields:
| Field | Type | Notes |
|---|---|---|
| type | "external" | Required. |
| url | string | wss:// URL Avaturn opens. Must be reachable from Avaturn’s infra. |
| audio.user.sample_rate | 16000 \| 24000 | Sample rate of the user-mic stream Avaturn sends you. Default 24000. Use 24000 for speech-to-speech LLMs that consume audio natively (OpenAI Realtime, Gemini Live). Pick 16000 if you’d rather cut the upstream bitrate by a third (32 kB/s instead of 48 kB/s); most VAD and turn-detection models (Silero, Smart Turn) work at 16 kHz internally. |
| headers | Record<string,string> \| null | Optional. Forwarded on the WebSocket upgrade, typically Authorization: Bearer .... The values are stored only for the lifetime of the session. |
Other session fields: avatar_id, background, model (render model, default delta), user_absent_timeout (default 60s, min 10s), max_duration (default 3600s, min 60s, max 86400s). See the API reference.
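As a rough illustration of the fields above, the sketch below assembles a request body and checks the documented limits before anything is sent. The function name build_session_payload is hypothetical, and the exact nesting (avatar_id and the timeouts at the top level, audio.user.sample_rate as nested objects) is an assumption to confirm against the API reference.

```python
import json


def build_session_payload(engine_url, avatar_id, *, sample_rate=24000,
                          headers=None, user_absent_timeout=60,
                          max_duration=3600):
    """Build a session-create body that uses an external conversation engine."""
    if sample_rate not in (16000, 24000):
        raise ValueError("audio.user.sample_rate must be 16000 or 24000")
    if user_absent_timeout < 10:
        raise ValueError("user_absent_timeout must be at least 10 seconds")
    if not 60 <= max_duration <= 86400:
        raise ValueError("max_duration must be between 60 and 86400 seconds")
    return {
        # Assumed to be top-level session fields; confirm in the API reference.
        "avatar_id": avatar_id,
        "user_absent_timeout": user_absent_timeout,
        "max_duration": max_duration,
        "conversation_engine": {
            "type": "external",
            "url": engine_url,
            # The dotted name audio.user.sample_rate is read as nested objects.
            "audio": {"user": {"sample_rate": sample_rate}},
            "headers": headers,  # None serializes to JSON null
        },
    }


payload = build_session_payload(
    "wss://engine.example.com/ws",
    "YOUR_AVATAR_ID",  # placeholder, not a real avatar id
    sample_rate=16000,
    headers={"Authorization": "Bearer <secret>"},
)
print(json.dumps(payload, indent=2))
```

Validating locally keeps an out-of-range timeout or sample rate from turning into a failed API call at session start.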
Response:
- session_id: backend handle
- token: short-lived credential for the Web SDK
2. Connect from the frontend
3. Implement the WebSocket protocol
Once the session is created, Avaturn opens a WebSocket to your url (with your headers) and starts streaming the user’s microphone immediately.
Audio
| Direction | Format |
|---|---|
| Avaturn → engine | Binary PCM16LE mono @ audio.user.sample_rate |
| Engine → Avaturn | Binary PCM16LE mono @ 24 kHz (avatar speech) |
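Because both directions carry 16-bit mono samples, frame sizes follow directly from the sample rate. The helpers below are an illustrative sketch (the names chunk_size_bytes and floats_to_pcm16le are hypothetical) showing the arithmetic and one way to encode float samples as PCM16LE with the standard library.

```python
import struct

BYTES_PER_SAMPLE = 2  # PCM16 is 16 bits per sample; mono means one channel


def chunk_size_bytes(sample_rate, duration_ms):
    """Bytes of PCM16LE mono audio covering duration_ms at sample_rate."""
    return sample_rate * BYTES_PER_SAMPLE * duration_ms // 1000


def floats_to_pcm16le(samples):
    """Encode floats in [-1.0, 1.0] as little-endian signed 16-bit PCM."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)


print(chunk_size_bytes(16000, 20))  # 640 bytes per 20 ms of user audio at 16 kHz
print(chunk_size_bytes(24000, 20))  # 960 bytes per 20 ms of avatar audio at 24 kHz
```

One second of audio is 32,000 bytes at 16 kHz and 48,000 bytes at 24 kHz.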
Control messages (engine → Avaturn)
Every chunk of avatar audio must live inside an open segment. Open one with segment.create before pushing any bytes, then segment.close after the last chunk.
- segment_uid is your own correlation id (any string). Avaturn echoes it back on the corresponding playback events so you can match them up.
- avatar.speech.interrupt discards anything Avaturn has buffered for playback. Use it when your turn detector decides the user has barged in.
- sdk.message.send forwards an arbitrary JSON payload to the Web SDK over the data channel (see Web SDK events).
Pushing audio bytes outside an open segment returns an error frame; open the segment first.
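The ordering rule can be sketched as a generator that yields frames in wire order for one burst of speech. The JSON layout here (a type field plus segment_uid on both create and close) is an assumption inferred from the message names on this page, and control_frame and frames_for_utterance are hypothetical helper names; check the exact schema against the API reference before relying on it.

```python
import json
import uuid


def control_frame(msg_type, **fields):
    """Serialize a control message as a JSON text frame (assumed shape)."""
    return json.dumps({"type": msg_type, **fields})


def frames_for_utterance(pcm_chunks, segment_uid=None):
    """Yield the frames for one burst of avatar speech, in wire order.

    str items are JSON text frames; bytes items are binary audio frames.
    """
    uid = segment_uid or uuid.uuid4().hex
    yield control_frame("segment.create", segment_uid=uid)
    for chunk in pcm_chunks:
        yield chunk  # PCM16LE mono at 24 kHz
    yield control_frame("segment.close", segment_uid=uid)


silence_10ms = b"\x00\x00" * 240  # 240 samples = 10 ms at 24 kHz
frames = list(frames_for_utterance([silence_10ms, silence_10ms], "turn-1"))
for frame in frames:
    print(type(frame).__name__, len(frame))
```

With a WebSocket library, each str would go out as a text frame and each bytes object as a binary frame. A barge-in would be signalled with control_frame("avatar.speech.interrupt").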
Control messages (Avaturn → engine)
- segment_id is Avaturn’s id; segment_uid is the one you supplied. Use whichever is convenient.
- playback.started / playback.ended fire when the avatar actually starts/finishes lip-syncing the segment, which is useful for transcript timing.
- playback.interrupted fires after avatar.speech.interrupt or when a new user utterance pre-empts the current segment.
- sdk.message.receive carries messages sent from the browser via the Web SDK.
error frames are advisory — the WebSocket stays open. Most common subtypes:
| subtype | Fires when |
|---|---|
| avatar.speech.segment.error | You pushed audio bytes outside an open segment, or tried to create while another was still open. Open / close the segment as expected and retry. |
| message.type.error | An incoming JSON frame had an unknown type. Check spelling against the outgoing-message list. |
| json.parsing.error | An incoming text frame wasn’t valid JSON. |
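On the receiving side, an engine has to separate binary microphone audio from JSON control frames and route the latter by message type. The following is a minimal sketch, assuming each control frame carries type (and, for errors, subtype) as top-level JSON keys; dispatch and the handler table are illustrative names, not part of any Avaturn SDK.

```python
import json


def dispatch(frame, on_audio, on_control):
    """Route one incoming WebSocket frame.

    bytes frames are user-microphone audio; str frames are JSON control
    messages routed by their "type" key (assumed discriminator field).
    Unknown message types are ignored.
    """
    if isinstance(frame, (bytes, bytearray)):
        on_audio(bytes(frame))
        return
    message = json.loads(frame)
    handler = on_control.get(message.get("type"))
    if handler is not None:
        handler(message)


log = []
handlers = {
    "playback.started": lambda m: log.append(("started", m.get("segment_uid"))),
    "playback.ended": lambda m: log.append(("ended", m.get("segment_uid"))),
    "playback.interrupted": lambda m: log.append(("interrupted", m.get("segment_uid"))),
    "sdk.message.receive": lambda m: log.append(("sdk", m)),
    # Error frames are advisory, so record them and keep the socket open.
    "error": lambda m: log.append(("error", m.get("subtype"))),
}

dispatch(b"\x00" * 640, lambda pcm: log.append(("audio", len(pcm))), handlers)
dispatch('{"type": "playback.started", "segment_uid": "turn-1"}', None, handlers)
dispatch('{"type": "error", "subtype": "json.parsing.error"}', None, handlers)
print(log)
```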
Segment lifecycle
A correct turn looks like: segment.create, then one or more binary audio chunks, then segment.close. Only one segment can be open at a time. Attempting to create while another is open returns an error; close the current one first.
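One way to avoid avatar.speech.segment.error frames is to enforce the lifecycle locally before anything reaches the socket. The class below is a hypothetical guard, not part of the reference implementation; it assumes the same JSON shape (type plus segment_uid) as an unconfirmed guess at the wire format, and takes any callable that sends a str or bytes frame.

```python
import json


class SegmentGuard:
    """Tracks the single open segment and rejects out-of-order calls."""

    def __init__(self, send):
        self._send = send        # callable accepting a str or bytes frame
        self._open_uid = None    # uid of the currently open segment, if any

    def create(self, segment_uid):
        if self._open_uid is not None:
            raise RuntimeError(f"segment {self._open_uid!r} is still open")
        self._send(json.dumps({"type": "segment.create", "segment_uid": segment_uid}))
        self._open_uid = segment_uid

    def push(self, pcm):
        if self._open_uid is None:
            raise RuntimeError("no open segment; call create() first")
        self._send(pcm)

    def close(self):
        if self._open_uid is None:
            raise RuntimeError("no open segment to close")
        self._send(json.dumps({"type": "segment.close", "segment_uid": self._open_uid}))
        self._open_uid = None


sent = []
guard = SegmentGuard(sent.append)
guard.create("turn-1")
guard.push(b"\x00\x00" * 240)
guard.close()
print(len(sent))  # 3 frames: create, audio, close
```

Raising locally surfaces ordering bugs in your own pipeline at the call site instead of as an advisory error frame that arrives later.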
Authentication
Anything reachable on the public internet at a guessable URL is a free avatar, so set up auth before exposing the endpoint.
- Shared secret in headers. Pass {"Authorization": "Bearer <secret>"} when creating the session and check it in your WS upgrade handler. Simple and good enough for most deployments.
- Per-session signed token in the URL path. Mint a short-lived HMAC token at session-create time and bake it into the url (e.g. wss://engine.example.com/ws/<token>). The token is single-use and self-expiring, so the secret never leaves your infra. This is the pattern the reference demo uses for Pipecat Cloud.
- IP allowlist. Contact support@avaturn.me for the current egress range if you want network-level filtering in front of your service.
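The per-session token pattern can be sketched with the standard library alone. This is not the reference demo's actual scheme: the token layout (expiry.nonce.signature), the function names, and the in-memory set used for single-use tracking are all assumptions made for illustration.

```python
import hashlib
import hmac
import secrets
import time


def mint_token(secret, ttl_seconds=60, now=None):
    """Return 'expiry.nonce.signature', signed with HMAC-SHA256."""
    expiry = int(now if now is not None else time.time()) + ttl_seconds
    nonce = secrets.token_urlsafe(16)  # URL-safe alphabet, never contains '.'
    body = f"{expiry}.{nonce}"
    sig = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"


def verify_token(secret, token, used_nonces, now=None):
    """True only for an unexpired, correctly signed, never-seen token."""
    try:
        expiry_s, nonce, sig = token.split(".")
        expiry = int(expiry_s)
    except ValueError:
        return False
    expected = hmac.new(secret, f"{expiry_s}.{nonce}".encode(),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    current = now if now is not None else time.time()
    if current > expiry or nonce in used_nonces:
        return False
    used_nonces.add(nonce)  # enforce single use
    return True


SECRET = b"replace-with-a-real-secret"  # placeholder value
token = mint_token(SECRET, ttl_seconds=60)
engine_url = f"wss://engine.example.com/ws/{token}"
```

At session-create time you would pass engine_url as the conversation engine url; in the WebSocket upgrade handler you would extract the last path component and call verify_token before accepting. In a multi-process deployment the set of used nonces would need shared storage.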
Connection behavior
- Keep-alive. Avaturn sends WebSocket pings every ~75 seconds with a 30-second pong timeout. Most reverse proxies need an idle-timeout ≥ 180 seconds in front of your engine to avoid mid-conversation drops: bump proxy_read_timeout (nginx), idle timeout (ALB, Cloudflare), or the equivalent. You don’t need to send application-level pings yourself; Avaturn’s WebSocket-protocol pings are sufficient.
- Disconnect = session end. If your engine closes the socket, the Avaturn session ends. If Avaturn closes it (e.g. user disconnected, max_duration reached), recv() returns end-of-stream; drain and exit cleanly.
- No automatic reconnect. Avaturn does not retry failed upgrades or dropped connections inside an active session. Make sure your engine is up before the session starts.
Session lifecycle
A session ends on any of:
- Explicit DELETE /api/v1/sessions/{session_id}
- The conversation-engine WebSocket closing
- user_absent_timeout elapses with the user disconnected (default 60s)
- max_duration cap reached (default 3600s, max 86400s)
Call avatar.dispose() on the frontend to tear down the local SDK state. The backend session terminates as described above; dispose() does not directly close it.
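For the explicit DELETE path, a backend can build the request with the standard library. This sketch only constructs the request; the API base URL and the authentication header are not specified on this page, so both are passed in as parameters and the values shown are placeholders. build_end_session_request is a hypothetical helper name.

```python
import urllib.request
from urllib.parse import quote


def build_end_session_request(api_base, session_id, auth_headers):
    """Build (but do not send) DELETE /api/v1/sessions/{session_id}."""
    url = f"{api_base.rstrip('/')}/api/v1/sessions/{quote(session_id, safe='')}"
    return urllib.request.Request(url, method="DELETE", headers=dict(auth_headers))


req = build_end_session_request(
    "https://api.example.invalid",        # placeholder base URL
    "abc123",                             # session_id from the create response
    {"Authorization": "Bearer API_KEY"},  # assumed header scheme
)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req)
```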
Reference implementation
github.com/avaturn-live/pipecat-avaturn-live-demo is a full open-source reference. Two pipelines ship side-by-side, switchable via a single env var: speech-to-speech (OpenAI Realtime) and cascaded (STT → LLM → TTS). The same transport, serializer, and segment processor wrap both.
- pipecat_avaturn/serializer.py: bidirectional Pipecat ↔ Avaturn wire format. Read this first to see the protocol on the wire.
- pipecat_avaturn/segment_processor.py: TTSStartedFrame / TTSStoppedFrame → segment.create / segment.close.
- pipecat_avaturn/transport.py: the Pipecat FastAPI WebSocket transport with its default real-time pacing sleep disabled. This is the non-obvious gotcha for anyone building a streaming engine; see the “Don’t throttle output” note in Audio.
- pipecat_avaturn/broker.py: minimal client for POST /api/v1/sessions with type: "external".
- server.py: FastAPI app combining the session broker and the conversation engine in one process.
See also
- Web SDK integration guide
- Web SDK events
- Avaturn API reference
- OpenAI Realtime engine — hosted alternative.
- Cartesia engine — hosted alternative.