
When to use

  • Low-latency voice-to-voice with natural barge-in.
  • Prompts, voice, and turn detection configured inline per session via an ephemeral OpenAI client secret.
  • You don’t want to run a separate agent runtime.
For Cartesia Line-managed agents, see Cartesia. For backend-driven scripted speech, see Legacy text-echo.

Prerequisites

  • OpenAI API key with Realtime API access. Only the GA version is supported — the beta is not.
  • Avaturn API key (dashboard).

1. Mint a client secret

On your backend, exchange your OpenAI API key for a short-lived client secret. Avaturn uses this secret to open the Realtime WebSocket on the user’s behalf.
from openai import AsyncOpenAI

openai = AsyncOpenAI(api_key="<OPENAI_API_KEY>")
secret = await openai.realtime.client_secrets.create(
    expires_after={"seconds": 600, "anchor": "created_at"},
    session={"type": "realtime", "model": "gpt-realtime"},
)
client_secret = secret.value  # ek_...
Mint a fresh secret per user session. The default lifetime is 600 seconds (maximum 7200). The lifetime only limits how long the secret can be used to open a connection: an existing WebSocket keeps working after the secret expires.
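The lifetime bounds can be checked on your backend before calling OpenAI. The following is a minimal sketch, assuming a hypothetical helper named build_secret_request (it is not part of the OpenAI SDK) that returns keyword arguments for client_secrets.create(). The default of 600 and the ceiling of 7200 come from the text above; the lower bound of 1 second is an assumption.

```python
DEFAULT_TTL_SECONDS = 600   # default lifetime stated above
MAX_TTL_SECONDS = 7200      # maximum lifetime stated above


def build_secret_request(session: dict, ttl_seconds: int = DEFAULT_TTL_SECONDS) -> dict:
    """Return keyword arguments for client_secrets.create() (hypothetical helper)."""
    # Assumption: 1 second is the smallest useful lifetime; only the maximum is documented here.
    if not 1 <= ttl_seconds <= MAX_TTL_SECONDS:
        raise ValueError(
            f"ttl_seconds must be between 1 and {MAX_TTL_SECONDS}, got {ttl_seconds}"
        )
    return {
        "expires_after": {"seconds": ttl_seconds, "anchor": "created_at"},
        "session": session,
    }


kwargs = build_secret_request({"type": "realtime", "model": "gpt-realtime"})
# Usage on the backend:
#   secret = await openai.realtime.client_secrets.create(**kwargs)
```

Rejecting an out-of-range lifetime locally keeps the failure close to the code that chose the value, rather than surfacing it as an API error.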

2. Create an Avaturn session

import httpx

async with httpx.AsyncClient() as http:
    r = await http.post(
        "https://api.avaturn.live/api/v1/sessions",
        headers={"Authorization": "Bearer <AVATURN_API_KEY>"},
        json={
            "conversation_engine": {
                "type": "openai-realtime",
                "client_secret": client_secret,
            },
        },
    )
    r.raise_for_status()
    session = r.json()  # { "session_id": "...", "token": "..." }
Response:
  • session_id — backend handle (terminate, telemetry)
  • token — short-lived credential for the Web SDK
Optional session fields: avatar_id, background, render_model (avatar render preset, not the LLM), user_absent_timeout (default 60s, min 10s), max_duration (default 3600s, max 86400s). See the API reference.
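The documented limits on the optional fields can be validated before the request is sent. This is a sketch under stated assumptions: build_session_payload is a hypothetical helper, and placing the optional fields at the top level of the request body (next to conversation_engine) is an assumption to confirm against the API reference.

```python
def build_session_payload(
    client_secret: str,
    user_absent_timeout: int = 60,
    max_duration: int = 3600,
    **optional_fields: str,
) -> dict:
    """Build the JSON body for POST /api/v1/sessions (hypothetical helper)."""
    if user_absent_timeout < 10:
        raise ValueError("user_absent_timeout must be at least 10 seconds")
    # Assumption: max_duration must be positive; only the 86400s ceiling is documented.
    if not 0 < max_duration <= 86400:
        raise ValueError("max_duration must be positive and at most 86400 seconds")
    payload = {
        "conversation_engine": {
            "type": "openai-realtime",
            "client_secret": client_secret,
        },
        # Assumption: optional fields sit at the top level of the request body.
        "user_absent_timeout": user_absent_timeout,
        "max_duration": max_duration,
    }
    payload.update(optional_fields)  # e.g. avatar_id, background, render_model
    return payload


payload = build_session_payload(
    "ek_example", user_absent_timeout=30, avatar_id="<AVATAR_ID>"
)
```

The resulting dictionary can be passed as the json argument of the http.post() call shown in this step.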

3. Connect from the frontend

import { AvaturnHead } from "@avaturn-live/web-sdk";

const root = document.querySelector<HTMLDivElement>("#avaturn-video")!;
const avatar = new AvaturnHead(root, {
  sessionToken: session.token,
  audioSource: true, // required — voice-to-voice
});

await avatar.init();

Configuring the agent

The session object you pass to client_secrets.create() is applied to the WebSocket Avaturn opens on the user’s behalf, which gives you full control over instructions, voice, VAD, and transcription.

Instructions and voice

secret = await openai.realtime.client_secrets.create(
    expires_after={"seconds": 600, "anchor": "created_at"},
    session={
        "type": "realtime",
        "model": "gpt-realtime",
        "instructions": "You are a helpful assistant. Be concise and friendly.",
        "audio": {
            "output": {"voice": "marin"},
            "input": {
                "transcription": {"model": "whisper-1"},
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "silence_duration_ms": 500,
                },
            },
        },
    },
)
OpenAI currently recommends marin and cedar voices for best quality. Other supported values: alloy, ash, ballad, coral, echo, sage, shimmer, verse.
User transcripts require audio.input.transcription. Without it, OpenAI doesn’t emit transcription events and Avaturn has nothing to forward to the SDK. Avatar response transcripts (assistant side) flow regardless.
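Because a missing transcription block fails silently (no user transcripts arrive), it can help to check the session config before minting the secret. A minimal sketch, assuming a hypothetical helper named user_transcripts_enabled that inspects the same dictionary you pass as session:

```python
def user_transcripts_enabled(session_config: dict) -> bool:
    """Return True when audio.input.transcription is present and non-empty."""
    audio_input = session_config.get("audio", {}).get("input", {})
    return bool(audio_input.get("transcription"))


with_transcription = {
    "type": "realtime",
    "model": "gpt-realtime",
    "audio": {"input": {"transcription": {"model": "whisper-1"}}},
}
without_transcription = {"type": "realtime", "model": "gpt-realtime"}

print(user_transcripts_enabled(with_transcription))     # True
print(user_transcripts_enabled(without_transcription))  # False
```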

Stored prompts

secret = await openai.realtime.client_secrets.create(
    expires_after={"seconds": 600, "anchor": "created_at"},
    session={
        "type": "realtime",
        "model": "gpt-realtime",
        "prompt": {
            "id": "pmpt_abc123",
            "version": "6",
            "variables": {"company_name": "Acme", "tone": "professional"},
        },
    },
)
Full configuration surface (turn detection variants, transcription, audio params): OpenAI session reference.

Engine behavior

  • Audio. 24 kHz mono PCM in both directions.
  • Interruptions. OpenAI server VAD (or semantic VAD, if configured). When the user starts speaking, Avaturn discards in-flight avatar audio.
  • Transcripts. Assistant transcripts (response.output_audio_transcript.done) flow by default. User transcripts (conversation.item.input_audio_transcription.completed) flow only when audio.input.transcription is configured. Both are forwarded to the SDK via ce_events.realtime.*.
  • Tools. Tool definitions sent in the session config are parsed by OpenAI, but Avaturn neither surfaces response.function_call_arguments.* events to the Web SDK nor relays function results back. Tool calls won’t execute end-to-end, so avoid them at this layer until proper support lands.
  • GA only. Beta or mixed beta/GA usage causes a session_lifecycle_error with code openai-realtime-version-mismatch. See beta-to-GA migration.

Session lifecycle

A session ends on any of:
  • Explicit DELETE /api/v1/sessions/{session_id}
  • user_absent_timeout elapses with the user disconnected (default 60s)
  • max_duration cap reached (default 3600s, max 86400s)
# session_id comes from the create-session response: session["session_id"]
async with httpx.AsyncClient() as http:
    await http.delete(
        f"https://api.avaturn.live/api/v1/sessions/{session_id}",
        headers={"Authorization": "Bearer <AVATURN_API_KEY>"},
    )
Call avatar.dispose() on the frontend to tear down the local SDK state. The backend session terminates as described above — dispose() does not directly close it. Don’t try to resume a session after it ends; mint a new client secret and create a fresh session.
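The three end conditions can be modelled locally, for example to decide when your backend should stop treating a session as live. This is an illustrative sketch, not Avaturn's implementation: session_end_reason is a hypothetical function, and the order of the checks and the use of >= at the boundaries are assumptions.

```python
from typing import Optional


def session_end_reason(
    elapsed_s: float,
    absent_s: Optional[float],
    deleted: bool = False,
    user_absent_timeout: float = 60,
    max_duration: float = 3600,
) -> Optional[str]:
    """Return why the session would have ended, or None if it should still be live.

    elapsed_s: seconds since the session was created.
    absent_s: seconds the user has been disconnected, or None while connected.
    deleted: whether DELETE /api/v1/sessions/{session_id} was called.
    """
    if deleted:
        return "deleted"
    if absent_s is not None and absent_s >= user_absent_timeout:
        return "user_absent_timeout"
    if elapsed_s >= max_duration:
        return "max_duration"
    return None


print(session_end_reason(elapsed_s=120, absent_s=None))  # None
print(session_end_reason(elapsed_s=120, absent_s=75))    # user_absent_timeout
```

Any non-None result means the session should be treated as finished, and a new client secret and session are needed to continue.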

Reference