Duplex Mode (Full-Duplex)
Overview
Duplex mode enables simultaneous input and output: the user can speak (and optionally show video) while the model generates audio responses at the same time. The system runs a fixed-cadence inference loop: every chunk_ms milliseconds (one second by default), the user's audio (and optionally a video frame) is fed to the model, which then decides autonomously whether to listen (stay silent and absorb input) or speak (generate text + audio output).
Two variants share the same WebSocket endpoint, differing only in input modalities:
| Variant | Input | What client sends each second |
|---|---|---|
| Omnimodal Full-Duplex | Voice + Camera | audio_chunk + video_frame |
| Audio Full-Duplex | Voice only | audio_chunk only |
The GPU Worker is exclusively occupied for the entire session. Unlike Half-Duplex, there is no explicit VAD — the model itself decides when to speak based on the audio content, using learned listen/speak probabilities.
Capabilities: Voice (+ Vision) input → Text + Voice output, autonomous listen/speak decisions, interrupt support, pause/resume, exclusive GPU Worker.
Lifecycle
sequenceDiagram
participant C as Client
participant S as Server
C->>S: WS /ws/duplex/{session_id}
S-->>C: queued (ticket_id, position, ETA)
S-->>C: queue_done
C->>S: prepare (system_prompt + DuplexConfig)
S-->>C: prepared
loop Per-second inference (every chunk_ms)
C->>S: audio_chunk (+video_frame for Omni)
S-->>C: result (is_listen / text + audio)
end
alt Pause / Resume
C->>S: pause
S-->>C: paused
C->>S: resume
S-->>C: resumed
end
alt Client stops
C->>S: stop
S-->>C: stopped (session_id)
else Pause timeout
S-->>C: timeout
end
Note over C,S: GPU released, session recorded
Phase 1 — Connection & Queue: Client connects to wss://host/ws/duplex/{session_id} with a unique session ID. The session ID prefix determines the variant: omni_* for Omnimodal, audio_duplex_* for Audio-only. The server enqueues the request and sends queued with ticket_id, position, and eta_seconds. While waiting, the client may receive periodic queue_update messages. When assigned, the client receives queue_done.
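The connection and queue phase can be sketched with two small helpers. This is a minimal sketch, not the project's client: `make_session_url`, `wait_for_queue_done`, and `VARIANT_PREFIXES` are hypothetical names, the prefixes are taken from the paragraph above, and the random suffix format is an assumption.

```python
import uuid

# Prefixes as described in Phase 1; the mapping name itself is hypothetical.
VARIANT_PREFIXES = {"omni": "omni_", "audio": "audio_duplex_"}


def make_session_url(host: str, variant: str) -> str:
    """Build wss://host/ws/duplex/{session_id} with a variant-specific prefix."""
    if variant not in VARIANT_PREFIXES:
        raise ValueError(f"unknown variant: {variant!r}")
    # Assumption: any unique suffix is acceptable as the rest of the session ID.
    session_id = VARIANT_PREFIXES[variant] + uuid.uuid4().hex[:12]
    return f"wss://{host}/ws/duplex/{session_id}"


def wait_for_queue_done(messages):
    """Consume decoded queue messages until queue_done; return the last position seen."""
    position = None
    for msg in messages:
        if msg["type"] in ("queued", "queue_update"):
            position = msg["position"]
        elif msg["type"] == "queue_done":
            return position
        elif msg["type"] == "error":
            raise RuntimeError(msg["message"])
    raise ConnectionError("stream ended before queue_done")
```

`wait_for_queue_done` takes any iterable of already-decoded messages, so the same logic can be driven by a WebSocket receive loop or by a recorded message list.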
Phase 2 — Preparation: Client sends prepare with the system prompt, DuplexConfig parameters, and optionally a reference audio path. The server initializes the duplex session: loads TTS, prefills the system prompt, and sets up internal state. The client receives prepared. Audio capture should now begin.
Phase 3 — Per-Second Inference Loop: The core of duplex mode. Every chunk_ms milliseconds (default 1000ms), the client sends an audio_chunk (and a video_frame for Omni). The server processes each chunk through a three-step pipeline:
- Prefill: The audio waveform (and video frames) are encoded and appended to the KV Cache.
- Generate: The model produces one generation step. It outputs either a listen decision (stay silent, continue absorbing input) or a speak decision (emit text tokens + audio).
- Finalize: Post-generation bookkeeping (update turn state, handle end-of-turn).
The client receives a result message for each step.
Startup protection (force_listen_count): For the first N steps (default 3), the model is forced to listen regardless of its internal state. This prevents the model from speaking before it has received enough context.
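Startup protection amounts to a simple gate on the step counter. The sketch below assumes steps are counted from zero; `should_force_listen` is a hypothetical helper, not a function from the server code.

```python
def should_force_listen(step_index: int, force_listen_count: int = 3) -> bool:
    """Return True while the session is still inside the startup window."""
    # Assumption: step_index starts at 0 for the first audio chunk.
    return step_index < force_listen_count


# With the default of 3, the first three steps are forced to listen.
flags = [should_force_listen(i) for i in range(5)]
```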
Listen/Speak behavior: When result.is_listen is true, the model heard the audio but chose to stay silent — text and audio_data will be empty. When is_listen is false, the model is speaking — text contains generated tokens and audio_data contains the corresponding audio. Multiple consecutive is_listen: false results form a continuous speaking turn. When end_of_turn becomes true, the model has finished its current speaking turn and transitions back to listening.
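On the client, these rules reduce to accumulating text across consecutive speaking results and closing the turn on end_of_turn. A minimal sketch follows; `TurnAccumulator` is a hypothetical helper that only reads the is_listen, text, and end_of_turn fields described above.

```python
from dataclasses import dataclass, field


@dataclass
class TurnAccumulator:
    """Collect per-step text into complete speaking turns."""
    current: str = ""
    turns: list = field(default_factory=list)

    def feed(self, result: dict) -> None:
        if result["is_listen"]:
            return  # listening steps carry no text or audio
        self.current += result.get("text", "")
        if result.get("end_of_turn"):
            self.turns.append(self.current)
            self.current = ""


acc = TurnAccumulator()
for r in [
    {"is_listen": True, "text": "", "end_of_turn": False},
    {"is_listen": False, "text": "Hello", "end_of_turn": False},
    {"is_listen": False, "text": " there", "end_of_turn": True},
    {"is_listen": True, "text": "", "end_of_turn": False},
]:
    acc.feed(r)
```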
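Interrupt (set_break): If the client detects that the user started speaking while the model is mid-speech (for example from input audio energy or client-side VAD), it simply keeps sending audio_chunk; the client message table defines no dedicated interrupt message. The model may naturally transition from speaking back to listening on the next step. This is the "barge-in" / interrupt behavior. On the server side, the interrupt flag is exposed as DuplexView.set_break() / clear_break() (see Processor Method Chain).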
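Client-side detection of user speech from input audio energy could look like the sketch below. It is illustrative only: `rms` and `user_is_speaking` are hypothetical helpers, and the 0.02 threshold is an arbitrary value that would need tuning per microphone. What the client does with the signal is up to the application, since the model still makes the listen/speak decision.

```python
import math


def rms(samples) -> float:
    """Root-mean-square level of a chunk of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def user_is_speaking(samples, threshold: float = 0.02) -> bool:
    """Crude energy gate; the threshold is an assumed, untuned value."""
    return rms(samples) >= threshold


# One second of silence and one second of a 440 Hz tone at 16 kHz.
silence = [0.0] * 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
```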
Phase 4 — Pause / Resume: Client can send pause to temporarily suspend the session (e.g., when the user switches tabs). The server responds with paused. No audio_chunk should be sent during pause. To resume, send resume; the server responds with resumed. If the session remains paused for longer than the pause_timeout (default 60 seconds), the server sends timeout and releases the GPU automatically.
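A client can mirror the pause state locally so that it stops sending audio while paused and can show how long remains before the timeout. The sketch below is an assumption-laden illustration: `PauseTracker` is a hypothetical class, the 60-second default comes from the paragraph above, and the server's own timer remains authoritative.

```python
import time


class PauseTracker:
    """Client-side mirror of the session's pause state."""

    def __init__(self, pause_timeout_s: float = 60.0, clock=time.monotonic):
        self.pause_timeout_s = pause_timeout_s
        self._clock = clock
        self._paused_at = None
        self.state = "active"

    def on_message(self, msg_type: str) -> None:
        """Update state from a server message type."""
        if msg_type == "paused":
            self.state, self._paused_at = "paused", self._clock()
        elif msg_type == "resumed":
            self.state, self._paused_at = "active", None
        elif msg_type in ("timeout", "stopped"):
            self.state, self._paused_at = "ended", None

    @property
    def can_send_audio(self) -> bool:
        return self.state == "active"

    def seconds_until_timeout(self):
        """Local estimate of the remaining pause budget, or None when not paused."""
        if self._paused_at is None:
            return None
        return max(0.0, self.pause_timeout_s - (self._clock() - self._paused_at))
```

Injecting the clock keeps the class testable without real waiting.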
Phase 5 — Termination: The session ends when:
- Client stop: Client sends stop. Server responds with stopped containing the session_id.
- Pause timeout: Server sends timeout after the pause timeout expires.
- Connection drop: If the WebSocket disconnects unexpectedly, the server cleans up the session.
After termination, GPU memory is released and the session recording is finalized.
WebSocket — wss://host/ws/duplex/{session_id}
Client → Server
| Message Type | Key Fields | When to send | Description |
|---|---|---|---|
| prepare | prefix_system_prompt, config, ref_audio_path | Once, after queue_done | Initialize the duplex session with system prompt and configuration |
| audio_chunk | audio (Base64) | Every chunk_ms ms, after prepared | Send one chunk of microphone audio (PCM float32, 16kHz). Chunk duration should match config.chunk_ms (default 1000 ms) |
| video_frame | frame (Base64 JPEG) | With each audio_chunk (Omni only) | Send a camera frame. Only for Omnimodal variant |
| pause | (none) | Any time during active session | Temporarily suspend the session. Stop sending audio_chunk |
| resume | (none) | After paused received | Resume a paused session. Start sending audio_chunk again |
| stop | (none) | Any time | Gracefully stop the session and release GPU |
| client_diagnostic | metrics | Periodically (optional) | Client-side diagnostic metrics for monitoring |
prepare example:
{
  "type": "prepare",
  "prefix_system_prompt": "You are a fun assistant.",
  "config": {
    "generate_audio": true,
    "chunk_ms": 1000,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "force_listen_count": 3,
    "max_new_speak_tokens_per_chunk": 20,
    "listen_prob_scale": 1.0,
    "ls_mode": "explicit",
    "sample_rate": 16000
  },
  "ref_audio_path": "assets/ref_audio/ref_minicpm_signature.wav"
}
DuplexConfig fields:
| Field | Type | Default | Description |
|---|---|---|---|
| generate_audio | bool | true | Generate audio output. When false, only text is produced |
| ls_mode | string | "explicit" | Listen/Speak decision mode. Controls how the model decides between listening and speaking |
| force_listen_count | int | 3 | Startup protection: force the model to listen for the first N steps, preventing premature speech before context is established |
| max_new_speak_tokens_per_chunk | int | 20 | Maximum speak tokens per inference step. Limits how much text is generated per step to maintain real-time pacing |
| temperature | float | 0.7 | Sampling temperature for text generation |
| top_k | int | 20 | Top-K sampling |
| top_p | float | 0.8 | Top-P (nucleus) sampling |
| listen_prob_scale | float | 1.0 | Scale factor for listen probability. Values > 1.0 make the model more likely to listen (less talkative); values < 1.0 make it more eager to speak |
| chunk_ms | int | 1000 | Audio chunk duration in milliseconds. Determines the inference loop cadence. Client must send audio chunks at this interval |
| sample_rate | int | 16000 | Expected audio sample rate in Hz |
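For clients that build the prepare message programmatically, the defaults above can be mirrored in a small dataclass. This is a client-side sketch only: `ClientDuplexConfig`, `samples_per_chunk`, and `build_prepare` are hypothetical names and do not describe the server's own configuration class.

```python
from dataclasses import dataclass, asdict


@dataclass
class ClientDuplexConfig:
    """Client-side mirror of the documented DuplexConfig defaults."""
    generate_audio: bool = True
    ls_mode: str = "explicit"
    force_listen_count: int = 3
    max_new_speak_tokens_per_chunk: int = 20
    temperature: float = 0.7
    top_k: int = 20
    top_p: float = 0.8
    listen_prob_scale: float = 1.0
    chunk_ms: int = 1000
    sample_rate: int = 16000

    def samples_per_chunk(self) -> int:
        """Number of audio samples the client should send per chunk."""
        return self.sample_rate * self.chunk_ms // 1000


def build_prepare(system_prompt: str, config: ClientDuplexConfig) -> dict:
    """Assemble a prepare message using the field names from the example above."""
    return {
        "type": "prepare",
        "prefix_system_prompt": system_prompt,
        "config": asdict(config),
    }
```

Deriving the chunk size from `sample_rate` and `chunk_ms` keeps the capture code consistent if either value is changed.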
audio_chunk example:
{
  "type": "audio_chunk",
  "audio": "<base64 PCM float32, 16kHz, 1s>"
}
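The audio field can be produced with the standard library alone. The sketch below packs float samples as float32 PCM and wraps them in an audio_chunk message; `encode_audio_chunk` and `decode_audio_field` are hypothetical helpers, and little-endian byte order is an assumption, since the format is only described as PCM float32.

```python
import base64
import json
import struct


def encode_audio_chunk(samples) -> str:
    """Serialize float samples as an audio_chunk JSON message."""
    # Assumption: little-endian float32, matching what typical platforms emit.
    pcm = struct.pack(f"<{len(samples)}f", *samples)
    return json.dumps({
        "type": "audio_chunk",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


def decode_audio_field(b64: str) -> list:
    """Inverse of the encoding above, useful for checking a payload."""
    raw = base64.b64decode(b64)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))
```

One second of 16 kHz audio is 16000 samples, or 64000 bytes before Base64 encoding.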
video_frame example:
{
  "type": "video_frame",
  "frame": "<base64 JPEG>"
}
Server → Client
Messages follow the lifecycle order. During the active loop, result messages arrive at the cadence of chunk_ms.
| Message Type | Key Fields | Lifecycle Phase | Description |
|---|---|---|---|
| queued | ticket_id, position, eta_seconds | Connection | Enqueued; waiting for GPU |
| queue_update | position, eta_seconds | Connection | Queue position changed |
| queue_done | (none) | Connection | GPU assigned. Client should send prepare |
| prepared | (none) | Preparation | Session ready. Client should begin sending audio_chunk |
| result | is_listen, text, audio_data, end_of_turn, timing fields | Active loop | Per-step inference result. See DuplexGenerateResult below |
| paused | (none) | Pause | Session paused |
| resumed | (none) | Resume | Session resumed |
| stopped | session_id | Termination | Session stopped; GPU released |
| timeout | (none) | Termination | Pause timeout expired; GPU released |
| error | message | Any | Error; connection will close |
DuplexGenerateResult fields (the result message payload):
| Field | Type | Description |
|---|---|---|
| is_listen | bool | true = model chose to listen (silent). false = model chose to speak (generating output) |
| text | string | Generated text tokens. Empty string when is_listen: true |
| audio_data | string | Base64 audio at 24kHz. Empty string when is_listen: true. Client should play this audio immediately |
| end_of_turn | bool | true when the model finishes its speaking turn and transitions back to listening. Only meaningful when is_listen: false |
| current_time | int | Cumulative session time in milliseconds |
| cost_llm_ms | float | LLM inference latency for this step (ms) |
| cost_tts_ms | float | TTS synthesis latency for this step (ms) |
| cost_all_ms | float | Total step latency including prefill + generate + finalize (ms). Should stay under chunk_ms for real-time performance |
| n_tokens | int | Number of LLM tokens generated in this step |
| n_tts_tokens | int | Number of TTS tokens generated in this step |
| server_send_ts | float | Server-side send timestamp (unix seconds). Used for client-side latency measurement |
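The timing fields above can be turned into a per-step health check. The sketch below is illustrative: `step_report` is a hypothetical helper, treating the remainder of cost_all_ms after LLM and TTS time as "other" work is an assumption, and the transit estimate is only meaningful if client and server clocks are synchronized.

```python
from typing import Optional


def step_report(result: dict, chunk_ms: int = 1000,
                client_recv_ts: Optional[float] = None) -> dict:
    """Summarize whether one result stayed within the real-time budget."""
    total = result["cost_all_ms"]
    report = {
        # The step keeps pace only if it finishes before the next chunk is due.
        "within_budget": total < chunk_ms,
        "headroom_ms": chunk_ms - total,
        # Assumption: whatever is not LLM or TTS time is prefill/finalize overhead.
        "other_ms": total - result["cost_llm_ms"] - result["cost_tts_ms"],
    }
    if client_recv_ts is not None:
        # Requires synchronized clocks; otherwise this includes clock offset.
        report["transit_ms"] = (client_recv_ts - result["server_send_ts"]) * 1000.0
    return report
```

A client might log a warning whenever `within_budget` is false for several consecutive steps, since sustained overruns mean the session is falling behind real time.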
result example (speaking):
{
  "type": "result",
  "is_listen": false,
  "text": "Hello",
  "audio_data": "<base64, 24kHz>",
  "end_of_turn": false,
  "current_time": 5000,
  "cost_llm_ms": 45.2,
  "cost_tts_ms": 12.3,
  "cost_all_ms": 78.5,
  "n_tokens": 3,
  "server_send_ts": 1708771200.123
}
result example (listening):
{
  "type": "result",
  "is_listen": true,
  "text": "",
  "audio_data": "",
  "end_of_turn": false,
  "current_time": 3000,
  "cost_llm_ms": 12.1,
  "cost_tts_ms": 0,
  "cost_all_ms": 35.4,
  "n_tokens": 1,
  "server_send_ts": 1708771197.456
}
Example: Full Lifecycle
JavaScript — Audio Duplex
const sessionId = 'adx_' + Date.now().toString(36);
const ws = new WebSocket(`wss://${location.host}/ws/duplex/${sessionId}`);
let currentText = '';

// Reference audio for voice cloning (base64 PCM float32, 16kHz).
// getRefAudioBase64() and playAudio() are application-supplied helpers not shown here.
const refAudioBase64 = getRefAudioBase64();

// Encode an ArrayBuffer as base64, in slices so the argument list stays small.
function arrayBufferToBase64(buffer) {
  const bytes = new Uint8Array(buffer);
  let binary = '';
  for (let i = 0; i < bytes.length; i += 0x8000) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + 0x8000));
  }
  return btoa(binary);
}

ws.onopen = () => console.log('Connected, waiting for queue...');

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  switch (msg.type) {
    case 'queued':
      console.log(`Queue #${msg.position}, ETA: ${msg.eta_seconds}s`);
      break;
    case 'queue_update':
      console.log(`Queue moved to #${msg.position}`);
      break;
    case 'queue_done':
      // GPU assigned: send prepare with ref audio for voice cloning.
      // ref_audio_base64 is used for both LLM system prompt embedding and TTS voice.
      // To use a different voice for TTS, set tts_ref_audio_base64 separately.
      ws.send(JSON.stringify({
        type: 'prepare',
        prefix_system_prompt: 'You are a fun assistant.',
        ref_audio_base64: refAudioBase64,
        config: {
          generate_audio: true,
          chunk_ms: 1000,
          temperature: 0.7,
          force_listen_count: 3,
        },
      }));
      break;
    case 'prepared':
      console.log('Session ready, starting audio capture');
      startPerSecondCapture();
      break;
    case 'result':
      // Per-second inference result: model decides to listen or speak
      if (msg.is_listen) {
        console.log(`[${msg.current_time}ms] Listening (${msg.cost_all_ms.toFixed(0)}ms)`);
      } else {
        currentText += msg.text;
        console.log(`[${msg.current_time}ms] Speaking: "${msg.text}" (${msg.cost_all_ms.toFixed(0)}ms)`);
        if (msg.audio_data) playAudio(msg.audio_data); // PCM float32, 24kHz
        if (msg.end_of_turn) {
          console.log(`Turn ended. Full text: "${currentText}"`);
          currentText = '';
        }
      }
      break;
    case 'paused':
      console.log('Session paused');
      break;
    case 'resumed':
      console.log('Session resumed');
      break;
    case 'stopped':
      console.log(`Session stopped: ${msg.session_id}`);
      break;
    case 'timeout':
      console.log('Pause timeout: session ended');
      break;
    case 'error':
      console.error('Error:', msg.message);
      break;
  }
};

async function startPerSecondCapture() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: { sampleRate: 16000 } });
  const ctx = new AudioContext({ sampleRate: 16000 });
  // capture-processor.js is a separate AudioWorklet module that is not shown here.
  await ctx.audioWorklet.addModule('capture-processor.js');
  const source = ctx.createMediaStreamSource(stream);
  const node = new AudioWorkletNode(ctx, 'capture-processor', {
    processorOptions: { chunkSize: 16000 } // 1 second of audio at 16kHz
  });
  source.connect(node);

  // AudioWorklet is event-driven, NOT timer-based:
  // The audio rendering thread accumulates mic samples in real-time and fires
  // 'chunk' exactly when 1 second of audio is ready. No sleep or setInterval needed.
  node.port.onmessage = (e) => {
    if (e.data.type === 'chunk' && ws.readyState === WebSocket.OPEN) {
      const msg = {
        type: 'audio_chunk',
        audio: arrayBufferToBase64(e.data.audio.buffer),
      };
      // For Omni variant, additionally attach video frames:
      // msg.frame_base64_list = [captureFrameAsJpegBase64()];
      ws.send(JSON.stringify(msg));
    }
  };
}

function pauseSession() { ws.send(JSON.stringify({ type: 'pause' })); }
function resumeSession() { ws.send(JSON.stringify({ type: 'resume' })); }
function stopSession() { ws.send(JSON.stringify({ type: 'stop' })); }
Python
import asyncio, json, base64, time
import numpy as np
import websockets


def read_mono_float32(path: str, sr: int = 16000) -> np.ndarray:
    """Read an audio file as mono float32 and check that it is sampled at `sr` Hz."""
    import soundfile as sf
    # sf.read() accepts `samplerate` only for RAW files, so read the file's
    # own rate and verify it instead of passing samplerate=...
    audio, file_sr = sf.read(path, dtype="float32")
    if file_sr != sr:
        raise ValueError(f"{path}: expected {sr} Hz audio, got {file_sr} Hz; resample it first")
    if audio.ndim > 1:
        audio = audio.mean(axis=1)  # down-mix multi-channel audio to mono
    return np.ascontiguousarray(audio, dtype=np.float32)


def load_ref_audio(path: str) -> str:
    """Load a WAV file and return base64-encoded PCM float32 at 16kHz."""
    return base64.b64encode(read_mono_float32(path).tobytes()).decode()


def audio_file_to_1s_chunks(path, sr=16000):
    """Read audio and yield 1-second float32 chunks as base64."""
    audio = read_mono_float32(path, sr)
    for i in range(0, len(audio), sr):
        yield base64.b64encode(audio[i:i + sr].tobytes()).decode()


async def duplex_session(
    audio_path: str,
    server="wss://localhost:8006",
    ref_audio_path: str | None = "ref.wav",
):
    session_id = f"adx_{int(time.time()*1000):x}"
    url = f"{server}/ws/duplex/{session_id}"

    async with websockets.connect(url) as ws:
        # 1. Wait for queue assignment
        while True:
            msg = json.loads(await ws.recv())
            if msg["type"] == "queue_done":
                break
            if msg["type"] == "error":
                raise RuntimeError(msg["message"])

        # 2. Prepare: attach ref audio for voice cloning.
        # ref_audio_base64: used for both LLM system prompt embedding and TTS voice.
        # To use a different TTS voice, set tts_ref_audio_base64 separately.
        prepare_msg = {
            "type": "prepare",
            "prefix_system_prompt": "You are a fun assistant.",
            "config": {
                "generate_audio": True,
                "chunk_ms": 1000,
                "temperature": 0.7,
                "force_listen_count": 3,
            },
        }
        if ref_audio_path:
            prepare_msg["ref_audio_base64"] = load_ref_audio(ref_audio_path)
        await ws.send(json.dumps(prepare_msg))
        msg = json.loads(await ws.recv())
        assert msg["type"] == "prepared"
        print("Session ready")

        # 3. Concurrently send audio and receive results
        async def send_audio():
            for chunk_b64 in audio_file_to_1s_chunks(audio_path):
                await ws.send(json.dumps({
                    "type": "audio_chunk",
                    "audio": chunk_b64,
                }))
                # Simulate real-time microphone cadence: in a browser, the
                # AudioWorklet fires chunk events driven by the audio rendering
                # thread, so no sleep is needed. Here we sleep because we're reading
                # from a file and need to match the server's per-second loop.
                await asyncio.sleep(1.0)
            # Allow server to finish processing the last chunks
            await asyncio.sleep(3)
            await ws.send(json.dumps({"type": "stop"}))

        async def recv_results():
            current_text = ""
            async for raw in ws:
                msg = json.loads(raw)
                if msg["type"] == "result":
                    t = msg["current_time"]
                    if msg["is_listen"]:
                        print(f"[{t}ms] Listening ({msg['cost_all_ms']:.0f}ms)")
                    else:
                        current_text += msg.get("text", "")
                        print(f"[{t}ms] Speaking: {msg.get('text', '')!r} ({msg['cost_all_ms']:.0f}ms)")
                        if msg["end_of_turn"]:
                            print(f"  Turn ended: {current_text!r}")
                            current_text = ""
                elif msg["type"] in ("stopped", "timeout"):
                    print(f"Session ended: {msg['type']}")
                    break

        await asyncio.gather(send_audio(), recv_results())


asyncio.run(duplex_session("test_audio.wav"))
Processor Method Chain
The internal processing pipeline for each second of a Duplex session:
| Phase | Method | Description |
|---|---|---|
| Init | UnifiedProcessor.set_duplex_mode() | Switch to Duplex mode (< 0.1ms), returns DuplexView |
| Prepare | DuplexView.prepare(system_prompt, ref_audio_path, prompt_wav_path) | Initialize session: prefill system prompt, load TTS reference audio |
| Each step | DuplexView.prefill(audio_waveform, frame_list, ...) | Encode and append 1-second audio (+ video frames) to KV Cache |
| Each step | DuplexView.generate(force_listen) | Run one generation step. Returns DuplexGenerateResult with listen/speak decision, text, audio, and timing |
| Each step | DuplexView.finalize() | Post-generation bookkeeping: update turn counters, handle deferred finalization if end_of_turn |
| Interrupt | DuplexView.set_break() / clear_break() | Set or clear the interrupt flag. When set, the model will transition to listening on the next step |
| Terminate | DuplexView.stop() | Signal the session to stop |
| Cleanup | DuplexView.cleanup() | Release GPU memory, clear KV Cache, finalize session state |
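The order of calls in the table can be illustrated with a stub. The sketch below is not the server implementation: `StubDuplexView` and `run_session` are hypothetical, the stub returns plain dicts rather than DuplexGenerateResult, and calling stop() and cleanup() from a finally block is an assumption about how a caller might guarantee release.

```python
class StubDuplexView:
    """Hypothetical stand-in exposing the method names from the table above."""

    def __init__(self):
        self.calls = []

    def prefill(self, audio_waveform, frame_list=None):
        self.calls.append("prefill")

    def generate(self, force_listen=False):
        self.calls.append("generate")
        # The stub speaks whenever it is not forced to listen.
        return {"is_listen": force_listen,
                "text": "" if force_listen else "hi",
                "end_of_turn": False}

    def finalize(self):
        self.calls.append("finalize")

    def stop(self):
        self.calls.append("stop")

    def cleanup(self):
        self.calls.append("cleanup")


def run_session(view, chunks, force_listen_count=3):
    """Run prefill -> generate -> finalize once per chunk, then release the view."""
    results = []
    try:
        for step, chunk in enumerate(chunks):
            view.prefill(chunk)
            results.append(view.generate(force_listen=step < force_listen_count))
            view.finalize()
    finally:
        view.stop()
        view.cleanup()
    return results
```

Running the stub over four chunks yields three forced listen steps followed by one speaking step, with stop and cleanup recorded last.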