Duplex Session Management

DuplexSession Class

DuplexSession (duplex/lib/duplex-session.js) is the core session management class for duplex mode, encapsulating the WebSocket lifecycle, message protocol, state machine, and audio playback integration.

Constructor

const session = new DuplexSession({
    prefix: 'omni',                // Session ID prefix ('omni' | 'adx')
    getMaxKvTokens: () => 8192,    // Returns the KV cache token limit (auto-stop threshold)
    getPlaybackDelayMs: () => 200, // Returns the playback delay in ms
    outputSampleRate: 24000,       // Audio output sample rate in Hz
    getWsUrl: (sid) => `...`,      // Builds the WebSocket URL for session ID sid
});

Complete Method List

| Method | Description |
|---|---|
| start(systemPrompt, preparePayload, startMediaFn) | Start the session: connect WS → wait in queue (if queued) → prepare → start media capture |
| sendChunk(msg) | Send an audio chunk (automatically injects the force_listen flag) |
| pauseToggle() | Toggle pause/resume |
| toggleForceListen() | Toggle force listen mode |
| stop() | Stop the session |
| cancelQueue() | Cancel while waiting in the queue |
| cleanup() | Full cleanup (close WS + stop AudioPlayer) |

Callback Hooks

| Callback | Parameters | Trigger |
|---|---|---|
| onSystemLog(text) | Log text | System events (connect/disconnect/error) |
| onQueueUpdate(data) | {position, eta_seconds} | Queue status change |
| onQueueDone() | (none) | Left the queue; processing starts |
| onSpeakStart(text) | First text segment | AI starts speaking |
| onSpeakUpdate(handle, text) | Accumulated text | AI speaking text update |
| onSpeakEnd() | (none) | AI finishes speaking |
| onListenResult(result) | Complete result | Model is in the listening state |
| onExtraResult(result, recvTime) | Raw result, receive time | Every result (for metrics) |
| onPrepared() | (none) | Preparation complete |
| onCleanup() | (none) | Session cleanup complete |
| onMetrics(data) | Audio metrics | AudioPlayer metrics update |
| onRunningChange(running) | bool | Running state change |
| onPauseStateChange(state) | State string | Pause state change |
| onForceListenChange(active) | bool | Force listen state change |

Session Lifecycle

sequenceDiagram
    participant UI as User Interface
    participant DS as DuplexSession
    participant WS as WebSocket
    participant AP as AudioPlayer
    participant Media as Media Capture

    UI->>DS: start(systemPrompt, payload, startMediaFn)
    DS->>WS: connect(wsUrl)

    alt Queuing
        WS-->>DS: queued
        DS->>UI: onQueueUpdate
        loop Waiting
            WS-->>DS: queue_update
            DS->>UI: onQueueUpdate
        end
        WS-->>DS: queue_done
        DS->>UI: onQueueDone
    end

    DS->>WS: prepare (system_prompt + config)
    WS-->>DS: prepared
    DS->>UI: onPrepared
    DS->>Media: startMediaFn() Start audio/video capture

    loop Full-duplex Loop
        Media->>DS: sendChunk(audio + frame)
        DS->>WS: audio_chunk
        WS-->>DS: result

        alt is_listen=false (SPEAK)
            DS->>AP: beginTurn() / playChunk() / endTurn()
            DS->>UI: onSpeakStart → onSpeakUpdate → onSpeakEnd
        else is_listen=true (LISTEN)
            DS->>UI: onListenResult
        end
    end

    UI->>DS: stop()
    DS->>WS: stop
    DS->>AP: stopAll()
    DS->>Media: Stop capture
    DS->>UI: onCleanup
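The SPEAK/LISTEN branch of the loop in the diagram above can be sketched as a small turn tracker. This is a minimal illustration, not the code in duplex-session.js. It assumes that a speaking turn begins at the first result with is_listen=false, that each result's text is an incremental fragment concatenated into the "accumulated text" passed to onSpeakUpdate, that onSpeakStart returns the handle later passed to onSpeakUpdate, and that end_of_turn (or a switch back to listening) closes the turn. `createTurnTracker` and the `player` object shape are hypothetical; only the beginTurn/playChunk/endTurn and callback names come from this page.

```javascript
// Hypothetical sketch: route `result` messages into SPEAK / LISTEN turns.
// Assumptions (not confirmed by the source): text arrives as incremental
// fragments, onSpeakStart returns a UI handle, end_of_turn closes a turn.
function createTurnTracker(player, callbacks) {
  let speaking = false;
  let handle = null;
  let accumulated = '';

  function endSpeaking() {
    if (!speaking) return;
    player.endTurn();
    callbacks.onSpeakEnd();
    speaking = false;
    handle = null;
    accumulated = '';
  }

  return function onResult(result) {
    if (result.is_listen) {
      endSpeaking(); // a listen result closes any open speaking turn
      callbacks.onListenResult(result);
      return;
    }
    const text = result.text || '';
    if (!speaking) {
      // First SPEAK result of a turn.
      speaking = true;
      player.beginTurn();
      accumulated = text;
      handle = callbacks.onSpeakStart(text);
    } else {
      accumulated += text;
      callbacks.onSpeakUpdate(handle, accumulated);
    }
    if (result.audio_data) player.playChunk(result.audio_data);
    if (result.end_of_turn) endSpeaking();
  };
}
```

Keeping the turn state in a closure keeps the routing logic independent of the WebSocket and the DOM, so it can be exercised with plain objects.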

State Machine

Pause State Machine

stateDiagram-v2
    [*] --> active
    active --> pausing: pauseToggle()
    Note right of pausing: Send pause message\nWait for server confirmation + audio finish
    pausing --> paused: serverPauseConfirmed\n&& audio playback complete
    paused --> active: pauseToggle()\nSend resume
    active --> [*]: stop()
    pausing --> [*]: stop()
    paused --> [*]: stop()

The pausing → paused transition requires both of the following:
- The server has returned a paused confirmation message (serverPauseConfirmed = true)
- The AudioPlayer has no audio currently playing

When both conditions are met, _tryCompletePause() advances the state to paused.
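The pause rule can be sketched as a standalone state machine. This is a minimal sketch under assumptions: the `PauseStateMachine` class, the injected `send` and `isAudioPlaying` functions, and the `onServerPaused`/`onAudioIdle` hooks are hypothetical. Only the state names, serverPauseConfirmed, and _tryCompletePause() come from this page. The real pause message also carries a timeout field, which is omitted here because its value is not specified.

```javascript
// Hypothetical sketch of the active → pausing → paused state machine.
// `send` and `isAudioPlaying` are injected so it runs without a WebSocket
// or an AudioPlayer; the real DuplexSession wiring may differ.
class PauseStateMachine {
  constructor(send, isAudioPlaying, onPauseStateChange = () => {}) {
    this.send = send;
    this.isAudioPlaying = isAudioPlaying;
    this.onPauseStateChange = onPauseStateChange;
    this.state = 'active';
    this.serverPauseConfirmed = false;
  }

  _setState(state) {
    this.state = state;
    this.onPauseStateChange(state);
  }

  pauseToggle() {
    if (this.state === 'active') {
      this.serverPauseConfirmed = false;
      this.send({ type: 'pause' }); // real message also has a `timeout` field
      this._setState('pausing');
    } else if (this.state === 'paused') {
      this.send({ type: 'resume' });
      this._setState('active');
    }
    // Toggling while 'pausing' is ignored here (an assumption).
  }

  // Call when the server sends { type: 'paused' }.
  onServerPaused() {
    this.serverPauseConfirmed = true;
    this._tryCompletePause();
  }

  // Call when audio playback has drained (hypothetical hook).
  onAudioIdle() {
    this._tryCompletePause();
  }

  // Advance to 'paused' only when both conditions hold.
  _tryCompletePause() {
    if (this.state === 'pausing' && this.serverPauseConfirmed && !this.isAudioPlaying()) {
      this._setState('paused');
    }
  }
}
```

Both events funnel into the same check. This means the order in which the server confirmation and the end of playback arrive does not matter.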

Force Listen Mode

The forceListenActive flag is injected into the force_listen field of every message sent by sendChunk(). When it is enabled:
- On the worker side, duplex_generate(force_listen=True) forces the model to output <|listen|>
- AudioPlayer's stopAll() is called immediately to stop current playback
- Typical use: the user wants to interrupt the AI while it is speaking
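The flag injection can be sketched as follows. This is a minimal illustration, assuming the session holds a single boolean and copies it onto each outgoing message. `ForceListenChunkSender`, `send`, and `stopPlayback` are hypothetical names; `stopPlayback` stands in for AudioPlayer.stopAll().

```javascript
// Hypothetical sketch: stamp the current force_listen flag onto every chunk.
class ForceListenChunkSender {
  constructor(send, stopPlayback, onForceListenChange = () => {}) {
    this.send = send;                 // e.g. (msg) => ws.send(JSON.stringify(msg))
    this.stopPlayback = stopPlayback; // stands in for AudioPlayer.stopAll()
    this.onForceListenChange = onForceListenChange;
    this.forceListenActive = false;
  }

  toggleForceListen() {
    this.forceListenActive = !this.forceListenActive;
    if (this.forceListenActive) this.stopPlayback(); // cut the AI off immediately
    this.onForceListenChange(this.forceListenActive);
  }

  sendChunk(msg) {
    // Copy the caller's message so the flag is never written back into it.
    this.send({ ...msg, force_listen: this.forceListenActive });
  }
}
```

Injecting the flag at send time, rather than requiring callers to set it, means every chunk reflects the latest toggle state.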

KV Cache Auto-Stop

When a result message reports kv_cache_length >= maxKvTokens (the value returned by getMaxKvTokens()), DuplexSession automatically calls stop() to prevent KV cache overflow.
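The auto-stop check can be sketched as a guard applied to each result. This is a minimal sketch, assuming stop() should fire at most once even if more results arrive while the session shuts down. `createKvGuard` is a hypothetical helper.

```javascript
// Hypothetical sketch: stop the session once kv_cache_length reaches the limit.
function createKvGuard(getMaxKvTokens, stop) {
  let stopped = false;
  return function checkResult(result) {
    if (stopped || typeof result.kv_cache_length !== 'number') return false;
    // Read the limit on every check, since getMaxKvTokens is a function.
    if (result.kv_cache_length >= getMaxKvTokens()) {
      stopped = true; // stop at most once
      stop();
      return true;
    }
    return false;
  };
}
```

Because the constructor takes getMaxKvTokens as a function rather than a number, a guard like this picks up a changed limit without recreating the session.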


WebSocket Message Protocol

Client → Server

| Type | Fields | Description |
|---|---|---|
| prepare | system_prompt, config, ref_audio_base64, tts_ref_audio_base64, max_slice_nums, deferred_finalize | Initialize the duplex session |
| audio_chunk | audio_base64, frame_base64_list, force_listen, max_slice_nums | Send audio + video frames |
| pause | timeout | Pause request |
| resume | (none) | Resume request |
| stop | (none) | Stop the session |
| client_diagnostic | metrics | Client diagnostic information |
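Client messages can be assembled with small builder functions. This is a sketch under assumptions: it assumes each message is a JSON object whose `type` field holds the type name from the table. The builder names are hypothetical, and the value types (strings, arrays, objects, numbers) are guesses, since the table lists only field names. Optional fields are included only when provided.

```javascript
// Hypothetical builders for client → server messages. Field names follow the
// protocol table; value types and the `type` key are assumptions.
function buildPrepare({ systemPrompt, config = {}, refAudioBase64 = null,
                        ttsRefAudioBase64 = null, maxSliceNums, deferredFinalize }) {
  const msg = { type: 'prepare', system_prompt: systemPrompt, config };
  if (refAudioBase64) msg.ref_audio_base64 = refAudioBase64;
  if (ttsRefAudioBase64) msg.tts_ref_audio_base64 = ttsRefAudioBase64;
  if (maxSliceNums !== undefined) msg.max_slice_nums = maxSliceNums;
  if (deferredFinalize !== undefined) msg.deferred_finalize = deferredFinalize;
  return msg;
}

function buildAudioChunk({ audioBase64, frames = [], forceListen = false, maxSliceNums }) {
  const msg = {
    type: 'audio_chunk',
    audio_base64: audioBase64,
    frame_base64_list: frames, // base64-encoded video frames (empty for audio-only)
    force_listen: forceListen,
  };
  if (maxSliceNums !== undefined) msg.max_slice_nums = maxSliceNums;
  return msg;
}

// The timeout value and its units are not specified by the protocol table.
function buildPause(timeout) {
  return { type: 'pause', timeout };
}

// Usage in a browser: ws.send(JSON.stringify(buildAudioChunk({ audioBase64 })));
```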

Server → Client

| Type | Fields | Description |
|---|---|---|
| queued | ticket_id, position, eta_seconds | Enqueue notification |
| queue_update | position, eta_seconds | Position update |
| queue_done | (none) | Left the queue |
| prepared | prompt_length, recording_session_id | Preparation complete |
| result | is_listen, text, audio_data, end_of_turn, cost_*_ms, kv_cache_length | Single-step result |
| paused | timeout | Pause confirmation |
| resumed | (none) | Resume confirmation |
| stopped | (none) | Stopped |
| timeout | reason | Pause timeout |
| error | error | Error |
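Incoming messages can be routed to the callback hooks with a single dispatcher. This is a minimal sketch, not the actual DuplexSession handler. It assumes JSON messages with a `type` field and uses Date.now() as the receive time passed to onExtraResult, although the real recvTime clock is not specified. `createServerDispatcher` is hypothetical. Speak/listen routing and the KV check for result messages are left out to keep the focus on the message types.

```javascript
// Hypothetical dispatcher mapping server message types to callback hooks.
function createServerDispatcher(callbacks) {
  // Invoke a callback only if the caller registered it.
  const call = (name, ...args) => {
    if (typeof callbacks[name] === 'function') callbacks[name](...args);
  };

  return function dispatch(raw) {
    const msg = typeof raw === 'string' ? JSON.parse(raw) : raw;
    switch (msg.type) {
      case 'queued':
      case 'queue_update':
        call('onQueueUpdate', { position: msg.position, eta_seconds: msg.eta_seconds });
        break;
      case 'queue_done':
        call('onQueueDone');
        break;
      case 'prepared':
        call('onPrepared');
        break;
      case 'result':
        call('onExtraResult', msg, Date.now()); // every result, for metrics
        break;
      case 'paused':
      case 'resumed':
      case 'stopped':
        call('onSystemLog', `Server: ${msg.type}`);
        break;
      case 'timeout':
        call('onSystemLog', `Pause timeout: ${msg.reason}`);
        break;
      case 'error':
        call('onSystemLog', `Error: ${msg.error}`);
        break;
      default:
        call('onSystemLog', `Unknown message type: ${msg.type}`);
    }
  };
}
```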

Session Recording System

SessionRecorder (Stereo WAV)

Used in Audio Duplex mode; records a two-channel (stereo) WAV file:
- Left channel: user audio (PCM captured by an AudioWorklet)
- Right channel: AI audio (from the AudioPlayer's onRawAudio callback)
- Precise time alignment based on AudioContext timestamps
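Timestamp-based alignment can be sketched by converting each chunk's AudioContext time (in seconds) to a sample offset on a shared timeline. This is a minimal sketch, assuming both channels share one origin time and one sample rate. Gaps are left silent, and overlapping writes are summed, which is a design choice made here rather than something the source specifies. `AlignedTrack` is hypothetical, not SessionRecorder's code.

```javascript
// Hypothetical helper: place timestamped PCM chunks on a shared timeline.
class AlignedTrack {
  constructor(sampleRate, originTime) {
    this.sampleRate = sampleRate;
    this.originTime = originTime; // AudioContext time when recording started
    this.buffer = new Float32Array(0);
    this.length = 0;              // number of valid samples in `buffer`
  }

  write(chunk, startTime) {
    const offset = Math.max(0, Math.round((startTime - this.originTime) * this.sampleRate));
    const end = offset + chunk.length;
    if (end > this.buffer.length) {
      // Grow geometrically to avoid reallocating on every chunk.
      const grown = new Float32Array(Math.max(end, this.buffer.length * 2));
      grown.set(this.buffer);
      this.buffer = grown;
    }
    for (let i = 0; i < chunk.length; i++) this.buffer[offset + i] += chunk[i];
    this.length = Math.max(this.length, end);
  }

  samples() {
    return this.buffer.subarray(0, this.length);
  }
}
```

One track per channel (user on the left, AI on the right) keeps both channels on the same clock, so their sample indices line up when they are later interleaved.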

SessionVideoRecorder (Video + Audio)

Used in Omni mode; records video plus stereo audio.

Three-tier fallback strategy for obtaining the video stream:
1. videoEl.captureStream() (preferred approach)
2. srcObject clone (Safari compatibility)
3. Canvas drawImage loop (subtitle compositing mode)
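The fallback order can be sketched as a selector function. This is a sketch under assumptions: it assumes the canvas path is used whenever subtitles are enabled, because captureStream() and a cloned stream cannot carry overlaid text, and otherwise only when both stream-based approaches fail. HTMLMediaElement.captureStream(), MediaStream.clone(), and HTMLCanvasElement.captureStream(frameRate) are standard browser APIs. The `selectVideoSource` name, its options, and the 30 fps default are hypothetical.

```javascript
// Hypothetical selector implementing the three-tier fallback.
function selectVideoSource(videoEl, canvas, { subtitles = false, fps = 30 } = {}) {
  if (!subtitles) {
    // Tier 1: capture directly from the <video> element (not available in Safari).
    if (typeof videoEl.captureStream === 'function') {
      try {
        return { strategy: 'captureStream', stream: videoEl.captureStream() };
      } catch (err) {
        // Fall through to the next tier.
      }
    }
    // Tier 2: clone the MediaStream already attached to the element.
    if (videoEl.srcObject && typeof videoEl.srcObject.clone === 'function') {
      return { strategy: 'srcObjectClone', stream: videoEl.srcObject.clone() };
    }
  }
  // Tier 3: a canvas redrawn in a loop, which also allows subtitle overlays.
  return { strategy: 'canvas', stream: canvas.captureStream(fps) };
}
```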

Subtitle compositing:
- Draws video frames via Canvas drawImage
- Overlays the AI's speaking text (subtitles) on the frames
- Feeds the composited frame stream to MediaRecorder
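One compositing step can be sketched with standard CanvasRenderingContext2D calls (drawImage, measureText, strokeText, fillText). This is a minimal sketch: the font size, colours, bottom placement, and word-based wrapping are assumptions. Whitespace-based wrapping will not break text without spaces, such as CJK, and a real implementation would need per-character wrapping for that.

```javascript
// Hypothetical sketch: wrap subtitle text to fit a maximum pixel width.
function wrapText(ctx, text, maxWidth) {
  const lines = [];
  let line = '';
  for (const word of text.split(/\s+/).filter(Boolean)) {
    const candidate = line ? `${line} ${word}` : word;
    if (line && ctx.measureText(candidate).width > maxWidth) {
      lines.push(line);
      line = word;
    } else {
      line = candidate;
    }
  }
  if (line) lines.push(line);
  return lines;
}

// Draw the current video frame, then overlay bottom-aligned subtitle lines.
function drawCompositedFrame(ctx, videoEl, subtitle, width, height) {
  ctx.drawImage(videoEl, 0, 0, width, height);
  if (!subtitle) return;
  const fontSize = Math.round(height / 20);
  const lineHeight = fontSize * 1.3;
  ctx.font = `${fontSize}px sans-serif`;
  ctx.textAlign = 'center';
  ctx.textBaseline = 'bottom';
  ctx.lineWidth = Math.max(2, fontSize / 8);
  const lines = wrapText(ctx, subtitle, width * 0.9);
  lines.forEach((line, i) => {
    const y = height - (lines.length - 1 - i) * lineHeight - fontSize * 0.5;
    ctx.strokeStyle = 'black'; // outline keeps text readable on bright frames
    ctx.strokeText(line, width / 2, y);
    ctx.fillStyle = 'white';
    ctx.fillText(line, width / 2, y);
  });
}

// In a browser, redraw every animation frame:
// const loop = () => { drawCompositedFrame(ctx, video, currentText, w, h); requestAnimationFrame(loop); };
```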

Audio mixing:
- Uses the stereo-recorder-processor.js AudioWorklet
- Interleaves user and AI audio into stereo
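The interleaving step can be shown outside the worklet. The real processor runs inside AudioWorkletGlobalScope, so this covers only the per-buffer logic. In this minimal sketch, the user track goes to the left channel and the AI track to the right, and the shorter track is padded with silence. The 16-bit conversion is included because WAV data is commonly stored as 16-bit PCM. Both function names are hypothetical.

```javascript
// Hypothetical sketch: interleave two mono Float32 tracks into stereo
// frames [L0, R0, L1, R1, ...]; the shorter track is padded with zeros.
function interleaveStereo(left, right) {
  const frames = Math.max(left.length, right.length);
  const out = new Float32Array(frames * 2);
  for (let i = 0; i < frames; i++) {
    out[2 * i] = i < left.length ? left[i] : 0;      // user audio
    out[2 * i + 1] = i < right.length ? right[i] : 0; // AI audio
  }
  return out;
}

// Convert float samples in [-1, 1] to signed 16-bit PCM, clamping out-of-range values.
function floatTo16BitPCM(samples) {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}
```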

recording-settings.js

Recording settings panel. Configurable options:
- Video format (WebM / MP4)
- Video quality
- Whether to enable subtitles
- Whether to enable recording
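The settings can be mapped onto MediaRecorder options. This is a sketch under assumptions: the settings object shape, the quality preset names, and their bitrates are hypothetical, not project defaults. The mimeType and videoBitsPerSecond options and the static MediaRecorder.isTypeSupported() method are standard. The subtitles setting is not used here; it would select the canvas capture path instead.

```javascript
// Hypothetical mapping from recording settings to MediaRecorder options.
// Preset names and bitrates are illustrative assumptions.
const QUALITY_BITRATES = { low: 1000000, medium: 2500000, high: 5000000 };

// Preferred container/codec strings per format, best first.
const MIME_CANDIDATES = {
  webm: ['video/webm;codecs=vp9,opus', 'video/webm;codecs=vp8,opus', 'video/webm'],
  mp4: ['video/mp4'],
};

function buildRecorderOptions(settings, isTypeSupported) {
  if (!settings.enabled) return null; // recording disabled: no MediaRecorder
  const candidates = MIME_CANDIDATES[settings.format] || [];
  const mimeType = candidates.find((t) => isTypeSupported(t));
  const options = {
    videoBitsPerSecond: QUALITY_BITRATES[settings.quality] || QUALITY_BITRATES.medium,
  };
  if (mimeType) options.mimeType = mimeType; // otherwise let the browser choose
  return options;
}

// Browser usage:
// const opts = buildRecorderOptions(settings, (t) => MediaRecorder.isTypeSupported(t));
// if (opts) recorder = new MediaRecorder(stream, opts);
```

Injecting isTypeSupported keeps the mapping testable and lets an unsupported format (for example MP4 in some browsers) fall back to the browser's default container.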