# Duplex Mode Details
The Duplex endpoint (`/ws/duplex/{session_id}`) supports two full-duplex interaction modes:

- **Omnimodal Full-Duplex**: Sends `audio_chunk` + `video_frame` every second; the model processes both vision and voice simultaneously
- **Audio Full-Duplex**: Sends only `audio_chunk` every second; no visual input

Both modes share the same prefill-generate unit loop, differing only in whether video frames are provided.
## Basic Concepts
Duplex (full-duplex) mode delivers a real-time conversational experience similar to a phone call: while the user is speaking, the model can respond at any time without waiting for the user to finish.
Key differences from Streaming mode:
| | Streaming Mode | Duplex Mode |
|---|---|---|
| Interaction | Turn-based (user finishes → model responds) | Real-time full-duplex (listens and responds simultaneously) |
| Input processing | One-time prefill of the complete message | Per-second streaming prefill of audio/video |
| Worker occupation | Occupied only during inference, released upon completion | Exclusively occupied for the entire session |
| Use cases | Text/multimodal Q&A | Real-time voice/video conversation |
## Per-Second Unit Loop
The core of Duplex is a prefill-generate loop that executes once per second, with each iteration called a "unit":
```mermaid
sequenceDiagram
    participant C as Client
    participant W as Worker
    C->>W: prepare (system prompt + ref_audio)
    W->>W: Initialize: reset → prefill system prompt + audio reference
    loop One unit per second
        C->>W: audio_chunk (~1s) + video_frame
        W->>W: streaming_prefill()
        Note over W: 1. Feed ⟨unit⟩ token<br/>2. Encode image → feed vision embedding<br/>3. Encode audio → feed audio embedding<br/>4. Produce pending_logits
        W->>W: streaming_generate()
        Note over W: Decode based on pending_logits<br/>Output ⟨listen⟩ → continue listening<br/>Output text tokens → speak
        alt Model decides to speak
            W-->>C: text + audio_data
        else Model decides to listen
            W-->>C: is_listen=true
        end
    end
```
`streaming_prefill()` receives an audio segment (and optional video frames) each second, encodes them, and feeds them into the LLM's KV Cache:
```python
# MiniCPMO45/modeling_minicpmo_unified.py — DuplexCapability.streaming_prefill()
# Called once per second; selects the mode based on input: AUDIO / VISION / OMNI

# Step 1: Feed the <unit> token (marks the start of a new unit)
self.decoder.feed(self.decoder.embed_token(self.unit_token_id))

# Step 2: Encode image → feed vision embedding (only if video frames are present)
vision_hidden_states = self.model.get_vision_embedding(processed_frames)
self.decoder.feed(vision_hidden_states)

# Step 3: Encode audio → feed audio embedding
audio_embeds = self.model.get_audio_embedding(processed_audio)
self.decoder.feed(audio_embeds)

# → Produces pending_logits for use by generate
```
`streaming_generate()` decodes based on `pending_logits`, and the model autonomously decides whether to "listen" or "speak":
```python
# MiniCPMO45/modeling_minicpmo_unified.py — DuplexCapability.streaming_generate()
logits = self.pending_logits
for j in range(max_new_speak_tokens_per_chunk):
    last_id = self.decoder.decode(logits=logits, mode=decode_mode, ...)
    is_listen = last_id.item() == self.listen_token_id
    if last_id.item() in self.chunk_terminator_token_ids:
        break  # Current chunk ends (listen / chunk_eos / turn_eos)
    else:
        self.res_ids.append(last_id.item())  # Record spoken token
        self.speak_count += 1
        logits, hidden = self.decoder.feed(...)  # Feed token back, get next logits
```
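Taken together, one unit is simply these two calls in sequence. A minimal sketch of the per-second driver follows; the wrapper name `run_unit` and the result shape are assumptions for illustration (on the Worker, these calls are wrapped by `duplex_prefill`/`duplex_generate`):

```python
# Hypothetical per-second driver; `cap` stands in for a DuplexCapability instance.
def run_unit(cap, audio_chunk, video_frames=None):
    """One unit: prefill this second's inputs, then let the model decide."""
    # Encode + feed ~1 s of audio (and optional frames) into the KV cache
    cap.streaming_prefill(audio_chunk, video_frames)
    # Decode from pending_logits: a <listen> token or spoken text tokens
    return cap.streaming_generate()
```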
## Full Flow (with Queuing + Gateway Proxy)
```mermaid
sequenceDiagram
    participant C as Client
    participant G as Gateway
    participant Q as Queue
    participant W as Worker
    C->>G: WS /ws/duplex/{session_id}
    G->>Q: enqueue("omni_duplex" / "audio_duplex")
    Q->>W: Assign Worker (exclusive)
    C->>G: prepare (system prompt + config)
    G->>W: duplex_prepare()
    loop Full-duplex loop (once per second)
        C->>G: audio_chunk (+ video_frame)
        G->>W: duplex_prefill(audio, frames)
        W->>W: duplex_generate()
        alt Model decides to speak
            W-->>G: result (text + audio_data)
            G-->>C: Forward result
        else Model decides to listen
            W-->>G: result (is_listen=true)
        end
    end
    C->>G: stop
    G->>W: duplex_cleanup()
    G->>Q: release_worker()
```
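From the client's perspective, the full flow reduces to one WebSocket connection and a per-second send/receive loop. Below is a minimal sketch using the `websockets` library; the JSON message schema (`type`, `audio_base64`, field names in the prepare message) is an assumption based on the message names above, not a verified wire format:

```python
import asyncio
import base64
import json
import websockets  # pip install websockets

async def duplex_client(url: str, chunks):
    """Drive one audio-duplex session: prepare, stream ~1 s chunks, then stop."""
    async with websockets.connect(url) as ws:
        # Prepare phase (field names are assumptions based on the flow above)
        await ws.send(json.dumps({"type": "prepare", "system_prompt": "..."}))
        await ws.recv()  # wait for "prepared"

        for chunk in chunks:  # each `chunk` is ~1 s of 16 kHz float32 audio bytes
            await ws.send(json.dumps({
                "type": "audio_chunk",
                "audio_base64": base64.b64encode(chunk).decode(),
            }))
            result = json.loads(await ws.recv())
            if not result.get("is_listen"):
                print("model speaks:", result.get("text"))

        await ws.send(json.dumps({"type": "stop"}))

# asyncio.run(duplex_client("ws://localhost:8000/ws/duplex/demo", chunks=[]))
```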
## Omnimodal vs Audio Mode
The Gateway distinguishes between the two modes via `task_type` at enqueue time:
| | Omnimodal Full-Duplex | Audio Full-Duplex |
|---|---|---|
| Enqueue `task_type` | `omni_duplex` | `audio_duplex` |
| Frontend page | `omni.html` | `audio_duplex.html` |
| Sent per second | `audio_chunk` + `video_frame` | `audio_chunk` |
| `streaming_prefill()` mode | OMNI (audio + vision) | AUDIO (audio only) |
| Vision encoding | SigLIP → Resampler → feed | Skipped |
Both modes share the exact same code path on the Worker side. `DuplexCapability.streaming_prefill()` automatically selects the mode (AUDIO / VISION / OMNI) based on whether `frame_list` is provided.
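A sketch of what that selection could look like (the enum and dispatch below follow the mode names above; the actual implementation in `modeling_minicpmo_unified.py` may differ):

```python
from enum import Enum, auto

class PrefillMode(Enum):
    AUDIO = auto()   # audio only
    VISION = auto()  # frames only
    OMNI = auto()    # audio + vision

def select_mode(audio_chunk, frame_list) -> PrefillMode:
    # Mirrors the dispatch described above: the presence of frames/audio picks the mode
    if audio_chunk is not None and frame_list:
        return PrefillMode.OMNI
    if frame_list:
        return PrefillMode.VISION
    return PrefillMode.AUDIO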
## Worker-Side Processing Details
Duplex is the most complex exclusive mode: the Worker is occupied by a single session for its entire duration.
### Prepare Phase
- Set state → `DUPLEX_ACTIVE` (exclusive Worker)
- Decode the LLM ref_audio and the TTS ref_audio (they can differ):
  - LLM ref_audio → embedded in the system prompt
  - TTS ref_audio → initializes the vocoder
- `duplex_prepare(system_prompt, ref_audio, tts_audio)` initializes the duplex session (sketched after this list)
- Initialize `DuplexSessionRecorder` (optional)
- Send `prepared`
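A sketch of the prepare handler covering these steps; the `worker` object, the message field names, and the raw-float32 decoding are assumptions for illustration:

```python
import base64
import numpy as np

async def handle_prepare(worker, msg: dict) -> None:
    worker.state = "DUPLEX_ACTIVE"  # Worker is exclusive for the whole session
    # LLM and TTS voice references may differ; assume both arrive base64-encoded
    llm_ref = np.frombuffer(base64.b64decode(msg["ref_audio"]), dtype=np.float32)
    tts_ref = np.frombuffer(
        base64.b64decode(msg.get("tts_audio", msg["ref_audio"])), dtype=np.float32
    )
    # Embeds llm_ref in the system prompt and initializes the vocoder with tts_ref
    worker.duplex_prepare(msg["system_prompt"], llm_ref, tts_ref)
    await worker.send({"type": "prepared"})
```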
### Full-Duplex Loop
Each loop iteration processes one audio chunk (~1 second):
- Decode `audio_base64` → float32 audio waveform (16 kHz)
- Decode `frame_base64_list` → PIL Image list (Omni mode only)
- Wait for the previous round's finalize to complete (`asyncio.Event` barrier)
- Execute in a thread:
  - `duplex_prefill(audio, frames)` — prefill audio + video
  - `duplex_generate(force_listen)` — model decides to listen or speak
- Send `result` (with `is_listen`, `text`, `audio_data`, performance metrics, `kv_cache_length`)
- Deferred Finalize (enabled by default; see the sketch after this list):
  - Send the result to the client first (overlapping with network transmission)
  - Execute `duplex_finalize()` asynchronously (~37 ms: feed terminators + sliding-window maintenance)
  - Guaranteed to complete before the next round's prefill via the `asyncio.Event` barrier
  - Measured improvement: LISTEN wall clock reduced by ~30 ms, SPEAK by ~50 ms
### Pause and Resume
- `pause` → `DUPLEX_PAUSED` + start the timeout watchdog (sketched below)
- `resume` → `DUPLEX_ACTIVE` + cancel the watchdog
- Timeout (default 60 s) → automatically release the Worker and notify the client
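One way to implement the watchdog is as a cancellable asyncio task (a sketch under those assumptions; the actual timeout handling may differ):

```python
import asyncio

class PauseWatchdog:
    """Release the Worker if a paused session is not resumed in time."""

    def __init__(self, timeout_s: float = 60.0):
        self.timeout_s = timeout_s
        self._task: asyncio.Task | None = None

    def pause(self, on_timeout):
        # state → DUPLEX_PAUSED; arm the watchdog
        self._task = asyncio.get_running_loop().create_task(self._watch(on_timeout))

    def resume(self):
        # state → DUPLEX_ACTIVE; disarm the watchdog
        if self._task is not None:
            self._task.cancel()
            self._task = None

    async def _watch(self, on_timeout):
        try:
            await asyncio.sleep(self.timeout_s)
            await on_timeout()  # release the Worker, notify the client
        except asyncio.CancelledError:
            pass  # resumed in time
```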
### Stop and Resource Cleanup
- `stop` → `duplex_stop()`
- `finally` block (runs on both normal and abnormal termination; see the sketch below):
  - `duplex_stop()` stops generation
  - `duplex_cleanup()` releases GPU resources:
    - Releases the KV Cache, TTS caches, etc.
    - `gc.collect()` + `torch.cuda.empty_cache()`
    - Frees ~1.5 GB VRAM (diagnostic data: ~1,591 MB leaked after stop → ~48 MB remaining after cleanup)
  - Restore state → `IDLE`
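A sketch of the cleanup path in the session handler; the surrounding function and the `worker` object are assumptions, while `gc.collect()` and `torch.cuda.empty_cache()` are the real calls named above:

```python
import gc
import torch

async def run_duplex_session(worker, loop_fn):
    """Run the full-duplex loop and guarantee cleanup on any exit path."""
    try:
        await loop_fn()           # per-second prefill/generate loop
    finally:                      # normal stop or abnormal termination
        worker.duplex_stop()      # stop any in-flight generation
        worker.duplex_cleanup()   # drop KV Cache, TTS caches, etc.
        gc.collect()              # free Python-side references
        torch.cuda.empty_cache()  # return cached VRAM (~1.5 GB) to the driver
        worker.state = "IDLE"     # Worker is schedulable again
```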