Half-Duplex Mode (VAD-based Voice Conversation)

Overview

Half-Duplex mode implements hands-free voice conversation with automatic turn-taking. The server runs SileroVAD (Voice Activity Detection) to detect when the user starts and stops speaking. After the user finishes, the model generates a streaming reply. Once playback completes, the system resumes listening — like a phone call where each side takes turns.

Unlike Chat mode, Half-Duplex is a stateful, long-lived session. The GPU Worker is exclusively occupied for the entire session (default 3-minute timeout). KV Cache persists across turns, so the model accumulates context from the full conversation history without re-encoding previous turns.

Capabilities: Voice input → Text + Voice output, streaming output, multi-turn context accumulation, KV Cache persistence, exclusive GPU Worker.

Lifecycle

sequenceDiagram
    participant C as Client
    participant S as Server

    C->>S: WS /ws/half_duplex/{session_id}
    S-->>C: queued (position, ETA)
    S-->>C: queue_done
    C->>S: prepare (system_prompt + config)
    S-->>C: prepared (session_id, timeout_s)

    loop Voice conversation turns
        C->>S: audio_chunk (continuous, every 0.5s)
        S-->>C: vad_state: speaking=true
        Note over C,S: User is speaking...
        S-->>C: vad_state: speaking=false
        S-->>C: generating (speech_duration_ms)
        loop Streaming reply
            S-->>C: chunk (text_delta + audio_data)
        end
        S-->>C: turn_done (turn_index, full text)
        Note over C,S: Client plays audio, then resumes sending audio_chunk
    end

    alt Client stops
        C->>S: stop
        S-->>C: stopped
    else Session timeout
        S-->>C: timeout (elapsed_s)
    end
    Note over C,S: GPU released, session recorded

Phase 1 — Connection & Queue: Client connects to wss://host/ws/half_duplex/{session_id} with a unique session ID. The server places the request in the FIFO queue. The client receives queued with its position and estimated wait time. When a GPU Worker becomes available, the client receives queue_done.

Phase 2 — Preparation: Client sends prepare with the system prompt, VAD parameters, generation config, TTS settings, and optionally a reference audio for voice cloning. The server initializes: (1) loads the SileroVAD ONNX model, (2) prefills the system prompt into KV Cache, (3) initializes TTS with the reference audio, (4) starts the session recorder. The client receives prepared with the assigned session_id, timeout_s, and recording_session_id.

Phase 3 — Listening Loop: Client begins sending audio_chunk messages continuously (every 0.5 seconds, float32 PCM 16kHz). The server feeds each chunk into StreamingVAD. When speech is detected, the server sends vad_state: {speaking: true}. The client should display a "listening" indicator.

Cold start guard: For the first 0.5 seconds after prepared, all VAD results are ignored to filter out microphone initialization noise.
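The framing arithmetic can be checked in a few lines: at 16kHz, a 0.5s chunk is 8,000 float32 samples, i.e. 32,000 bytes before base64. A stdlib-only sketch of the encoding (the browser example's float32ToBase64 helper plays the same role):

```python
import base64
import struct

SAMPLE_RATE = 16_000            # Hz, per the protocol
CHUNK_SECONDS = 0.5
samples_per_chunk = int(SAMPLE_RATE * CHUNK_SECONDS)   # 8000 samples
bytes_per_chunk = samples_per_chunk * 4                # float32 -> 32000 bytes

def encode_chunk(samples) -> str:
    """Pack float32 samples into the base64 payload of an audio_chunk message."""
    raw = struct.pack(f"<{len(samples)}f", *samples)   # little-endian float32
    return base64.b64encode(raw).decode()

payload = encode_chunk([0.0] * samples_per_chunk)
# base64 expands 3 bytes -> 4 chars, so 32000 bytes -> 42668 chars
```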

Phase 4 — Speech End & Generation: When VAD detects sustained silence (configurable via min_silence_duration_ms, default 800ms), the accumulated speech segment is finalized. The server sends vad_state: {speaking: false} followed immediately by generating with speech_duration_ms. The speech segment is encoded as audio content and prefilled into KV Cache, then streaming generation begins. Each generated token produces a chunk message with text_delta and audio_data.
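The end-of-speech rule can be pictured as a small hangover state machine. The sketch below is illustrative only — the real server feeds SileroVAD probabilities through StreamingVAD — but with 1024-sample windows at 16kHz (64ms each), the default 800ms of silence works out to 12 consecutive non-speech windows:

```python
class SpeechEndDetector:
    """Toy end-of-speech state machine mirroring the Phase 4 logic:
    a segment ends only after min_silence_duration_ms of sustained silence.
    (Illustrative only -- the server works on SileroVAD probabilities.)"""

    def __init__(self, min_silence_ms: int = 800, frame_ms: int = 64):
        self.frames_needed = min_silence_ms // frame_ms   # 800 // 64 = 12
        self.silent_frames = 0
        self.in_speech = False

    def update(self, is_speech: bool) -> bool:
        """Feed one VAD frame; return True when a speech segment just ended."""
        if is_speech:
            self.in_speech = True
            self.silent_frames = 0
            return False
        if not self.in_speech:
            return False                   # silence before any speech: ignore
        self.silent_frames += 1
        if self.silent_frames >= self.frames_needed:
            self.in_speech = False         # segment finalized
            self.silent_frames = 0
            return True
        return False
```

Lowering min_silence_duration_ms shrinks frames_needed and makes turn-taking snappier, at the cost of cutting off mid-sentence pauses — the trade-off noted in the config table below.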

Phase 5 — Turn Completion: When generation finishes, the server sends turn_done with the turn_index (0-based counter) and the full text of this turn. The client should play back all received audio. During playback, the client must stop sending audio_chunk to prevent echo feedback (the model would otherwise hear its own voice). After playback completes, the client waits an additional ~800ms buffer, then resumes sending audio_chunk to start the next turn.

Phase 6 — Termination: The session ends in one of three ways:
- Client stop: Client sends stop. Server sends stopped and releases the GPU.
- Timeout: If no audio_chunk is received for timeout_s seconds (default 180), the server sends timeout with elapsed_s and releases the GPU.
- External stop: An HTTP POST /api/half_duplex/stop with the session_id forces generation to stop mid-turn.

After termination, the session recording is finalized and available for playback.

WebSocket — wss://host/ws/half_duplex/{session_id}

Client → Server

| Message Type | Key Fields | Description |
|---|---|---|
| prepare | system_prompt, config, ref_audio_base64, system_content | Initialize session; must be the first message after queue_done |
| audio_chunk | audio_base64 | Send microphone audio (float32 PCM 16kHz, ~0.5s per chunk). Must be sent continuously during the listening phase; must stop during AI audio playback to prevent echo |
| stop | — | Gracefully stop the session and release the GPU |

prepare example:

{
  "type": "prepare",
  "system_prompt": "You are a helpful assistant.",
  "config": {
    "vad": {
      "threshold": 0.8,
      "min_speech_duration_ms": 128,
      "min_silence_duration_ms": 800,
      "speech_pad_ms": 30
    },
    "generation": {
      "max_new_tokens": 256,
      "length_penalty": 1.1,
      "temperature": 0.7
    },
    "tts": {
      "enabled": true
    },
    "session": {
      "timeout_s": 180
    }
  },
  "ref_audio_base64": "<base64 reference audio>"
}

config fields:

| Category | Field | Default | Description |
|---|---|---|---|
| vad | threshold | 0.8 | Speech probability threshold. SileroVAD slides a 1024-sample window and outputs a probability per window; values >= threshold mark "speech started". Higher values reduce false triggers but may miss soft speech |
| vad | min_speech_duration_ms | 128 | Minimum speech duration to be considered valid. Segments shorter than this are discarded as noise |
| vad | min_silence_duration_ms | 800 | Sustained silence required to confirm end of speech. Lower values make turn-taking faster but risk cutting off pauses mid-sentence |
| vad | speech_pad_ms | 30 | Padding added to each side of the detected speech segment to avoid clipping word boundaries |
| generation | max_new_tokens | 256 | Maximum tokens per turn |
| generation | length_penalty | 1.1 | Length penalty coefficient (> 1.0 encourages longer responses) |
| generation | temperature | 0.7 | Sampling temperature |
| tts | enabled | true | Enable voice response. When false, only text is generated |
| session | timeout_s | 180 | Session timeout in seconds. The timer resets on each audio_chunk received |

audio_chunk example:

{
  "type": "audio_chunk",
  "audio_base64": "<base64 PCM float32, 16kHz, ~0.5s>"
}

Server → Client

Messages follow a strict lifecycle order within each turn:

| Message Type | Key Fields | Lifecycle Phase | Description |
|---|---|---|---|
| queued | position, estimated_wait_s | Connection | Request placed in queue |
| queue_done | — | Connection | Queue exited; GPU Worker assigned. Client should now send prepare |
| prepared | session_id, timeout_s, recording_session_id | Preparation | Session initialized: system prompt prefilled, VAD ready, TTS loaded. Client should begin sending audio_chunk |
| vad_state | speaking (bool) | Listening | VAD state transition: true = speech detected (user started talking), false = speech ended (user stopped talking) |
| generating | speech_duration_ms | Turn start | Server is processing the speech segment and starting generation |
| chunk | text_delta, audio_data | Generation | One streaming chunk. text_delta is incremental text; audio_data is the corresponding audio segment at 24kHz. Client should buffer and play audio in order |
| turn_done | turn_index, text | Turn end | Turn generation complete. text is the full response text for this turn. Client should finish playing buffered audio, then resume sending audio_chunk after a ~800ms delay |
| stopped | — | Termination | Graceful stop acknowledged after a client stop; GPU released |
| timeout | elapsed_s | Termination | Session timed out due to inactivity. Connection will close |
| error | error | Any | Error occurred. Connection will close |

turn_done example:

{
  "type": "turn_done",
  "turn_index": 2,
  "text": "Sure, I can help you with that."
}

REST — POST /api/half_duplex/stop

Force-stop an ongoing half-duplex generation from outside the WebSocket connection. Useful for implementing a "stop" button in the UI that operates independently of the audio stream.

Request Body:

{"session_id": "stream_abc123"}
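For reference, the same call from Python using only the standard library (the base URL is a placeholder for your deployment):

```python
import json
import urllib.request

def build_stop_request(session_id: str,
                       base_url: str = "http://localhost:8006") -> urllib.request.Request:
    """Build the POST /api/half_duplex/stop request (base_url is a placeholder)."""
    return urllib.request.Request(
        f"{base_url}/api/half_duplex/stop",
        data=json.dumps({"session_id": session_id}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def force_stop(session_id: str, base_url: str = "http://localhost:8006") -> int:
    """Send the stop request and return the HTTP status code."""
    with urllib.request.urlopen(build_stop_request(session_id, base_url)) as resp:
        return resp.status
```

Because this endpoint is independent of the WebSocket, a UI "stop" button can call it even while the audio stream is saturated or the client is mid-playback.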

Example: Full Lifecycle

JavaScript

const sessionId = 'hdx_' + Math.random().toString(36).slice(2, 10);
const ws = new WebSocket(`wss://${location.host}/ws/half_duplex/${sessionId}`);
let aiSpeaking = false;
let audioContext, captureNode;

// -- Reference audio for voice cloning (base64 PCM float32, 16kHz) --
const refAudioBase64 = getRefAudioBase64();

ws.onopen = () => console.log('Connected, waiting for queue...');

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  switch (msg.type) {
    case 'queued':
      console.log(`Queue position: #${msg.position}, ETA: ${msg.estimated_wait_s}s`);
      break;

    case 'queue_done':
      // GPU assigned — send prepare with system_content containing ref audio.
      // system_content follows the model's best practice: [text, audio, text].
      // The audio item embeds the reference voice used for both LLM context and TTS cloning.
      ws.send(JSON.stringify({
        type: 'prepare',
        system_content: [
          { type: 'text', text: 'Mimic the voice from the audio sample.' },
          { type: 'audio', data: refAudioBase64 },         // reference voice
          { type: 'text', text: 'You are a helpful assistant. Reply naturally.' },
        ],
        config: {
          vad: { threshold: 0.8, min_silence_duration_ms: 800 },
          generation: { max_new_tokens: 256, temperature: 0.7 },
          tts: { enabled: true },
          session: { timeout_s: 180 },
        },
      }));
      break;

    case 'prepared':
      console.log(`Session ready (timeout: ${msg.timeout_s}s)`);
      startMicCapture();  // begin sending audio_chunk
      break;

    case 'vad_state':
      // Server-side VAD detected speech start / end
      console.log(msg.speaking ? 'User speaking...' : 'User stopped');
      break;

    case 'generating':
      // Server is processing user speech and starting generation.
      // Stop sending audio to prevent echo feedback.
      aiSpeaking = true;
      console.log(`Generating (speech: ${msg.speech_duration_ms}ms)`);
      break;

    case 'chunk':
      // Streaming token: incremental text and/or audio segment
      if (msg.text_delta) appendTranscript(msg.text_delta);  // app-defined UI helper, like playAudio
      if (msg.audio_data) playAudio(msg.audio_data);  // PCM float32, 24kHz
      break;

    case 'turn_done':
      // Turn complete — resume mic after playback finishes + 800ms buffer
      // to avoid capturing the AI's own audio output.
      console.log(`\nTurn ${msg.turn_index} done: ${msg.text}`);
      setTimeout(() => { aiSpeaking = false; }, getPlaybackRemaining() + 800);
      break;

    case 'timeout':
      console.log(`Session timed out (${msg.elapsed_s}s)`);
      break;
    case 'error':
      console.error('Error:', msg.error);
      break;
  }
};

async function startMicCapture() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: { sampleRate: 16000 } });
  audioContext = new AudioContext({ sampleRate: 16000 });
  await audioContext.audioWorklet.addModule('capture-processor.js');
  const source = audioContext.createMediaStreamSource(stream);
  captureNode = new AudioWorkletNode(audioContext, 'capture-processor');
  source.connect(captureNode);

  // AudioWorklet is event-driven, NOT timer-based:
  // The audio rendering thread accumulates mic samples and fires 'chunk'
  // when the buffer reaches ~0.5s. No sleep or polling needed.
  captureNode.port.onmessage = (e) => {
    if (e.data.type === 'chunk' && !aiSpeaking && ws.readyState === WebSocket.OPEN) {
      ws.send(JSON.stringify({
        type: 'audio_chunk',
        audio_base64: float32ToBase64(e.data.audio),
      }));
    }
  };
}

function stopSession() {
  ws.send(JSON.stringify({ type: 'stop' }));
  ws.close();
}

Python

import asyncio, json, base64
import websockets

def load_ref_audio(path: str) -> str:
    """Load a WAV file and return base64-encoded PCM float32 at 16kHz."""
    import soundfile as sf
    audio, sr = sf.read(path, dtype="float32")  # sf.read returns the file's native rate
    assert sr == 16000, "reference audio must be 16kHz"
    return base64.b64encode(audio.tobytes()).decode()

def audio_file_to_chunks(path, chunk_duration=0.5, sr=16000):
    """Read a WAV file and yield ~0.5s float32 chunks as base64."""
    import soundfile as sf
    audio, file_sr = sf.read(path, dtype="float32")
    assert file_sr == sr, f"expected {sr}Hz audio, got {file_sr}Hz"
    chunk_size = int(sr * chunk_duration)
    for i in range(0, len(audio), chunk_size):
        yield base64.b64encode(audio[i:i + chunk_size].tobytes()).decode()

async def half_duplex_session(
    audio_path: str,
    server="wss://localhost:8006",
    ref_audio_path: str | None = "ref.wav",
):
    session_id = f"hdx_{id(object()):x}"
    url = f"{server}/ws/half_duplex/{session_id}"

    async with websockets.connect(url) as ws:
        # 1. Wait for queue assignment
        while True:
            msg = json.loads(await ws.recv())
            if msg["type"] == "queue_done":
                break
            print(f"Queued at #{msg.get('position')}")

        # 2. Prepare — system_content embeds reference audio for voice cloning.
        #    Format follows model best practice: [text, audio, text].
        ref_b64 = load_ref_audio(ref_audio_path) if ref_audio_path else None
        system_content = [
            {"type": "text", "text": "Mimic the voice from the audio sample."},
            {"type": "audio", "data": ref_b64},         # reference voice
            {"type": "text", "text": "You are a helpful assistant. Reply naturally."},
        ] if ref_b64 else None

        prepare_msg = {
            "type": "prepare",
            "config": {
                "vad": {"threshold": 0.8, "min_silence_duration_ms": 800},
                "generation": {"max_new_tokens": 256, "temperature": 0.7},
                "tts": {"enabled": True},
                "session": {"timeout_s": 60},
            },
        }
        if system_content:
            prepare_msg["system_content"] = system_content
        await ws.send(json.dumps(prepare_msg))

        msg = json.loads(await ws.recv())
        assert msg["type"] == "prepared"
        print(f"Session ready: {msg['session_id']}")

        # 3. Concurrently send audio and receive responses
        async def send_audio():
            for chunk_b64 in audio_file_to_chunks(audio_path):
                await ws.send(json.dumps({
                    "type": "audio_chunk",
                    "audio_base64": chunk_b64,
                }))
                # Simulate real-time microphone cadence: in a browser, the
                # AudioWorklet fires chunk events driven by the audio rendering
                # thread — no sleep needed. Here we sleep because we're reading
                # from a file and need to pace the chunks to match real time.
                await asyncio.sleep(0.5)
            # Wait for the server to finish generating the final turn
            await asyncio.sleep(5)
            await ws.send(json.dumps({"type": "stop"}))

        async def recv_messages():
            async for raw in ws:
                msg = json.loads(raw)
                if msg["type"] == "vad_state":
                    print("Speaking..." if msg["speaking"] else "Stopped speaking")
                elif msg["type"] == "generating":
                    print(f"Generating ({msg['speech_duration_ms']}ms speech)")
                elif msg["type"] == "chunk":
                    if msg.get("text_delta"):
                        print(msg["text_delta"], end="", flush=True)
                elif msg["type"] == "turn_done":
                    print(f"\n--- Turn {msg['turn_index']} done ---")
                elif msg["type"] in ("stopped", "timeout"):
                    break

        await asyncio.gather(send_audio(), recv_messages())

asyncio.run(half_duplex_session("test_audio.wav"))

Processor Method Chain

The internal processing pipeline for a Half-Duplex session:

| Phase | Method | Description |
|---|---|---|
| Init | UnifiedProcessor.set_half_duplex_mode() | Switch to Half-Duplex mode (< 0.1ms); returns HalfDuplexView |
| Init | HalfDuplexView.init_ref_audio(path) or init_ref_audio_from_data(ndarray) | Load reference audio for TTS voice cloning |
| Prepare | HalfDuplexView.prefill(request) | Prefill system prompt into KV Cache; creates a rollback snapshot |
| Each turn | HalfDuplexView.prefill(request) | Prefill user speech segment into KV Cache |
| Each turn | HalfDuplexView.generate(session_id, ...) | Streaming generation; yields StreamingChunk with text + audio |
| Recovery | HalfDuplexView.can_rollback() / rollback() | Check whether rollback is possible, then restore KV Cache to the last snapshot (e.g., on generation error) |
| Recovery | HalfDuplexView.clear_rollback_point() | Discard the snapshot after a successful turn |
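The recovery rows describe a plain snapshot-and-restore pattern over the KV Cache. A toy illustration with a list standing in for the cache — not the real HalfDuplexView API, just the shape of the flow:

```python
class ToyKVCache:
    """Minimal snapshot/rollback over a growing token cache, mirroring the
    can_rollback()/rollback()/clear_rollback_point() flow described above.
    (Illustrative stand-in -- the real cache holds attention key/value tensors.)"""

    def __init__(self):
        self.tokens = []
        self._snapshot_len = None   # rollback point, as a cache length

    def prefill(self, tokens):
        # Record a rollback snapshot before appending the new segment.
        self._snapshot_len = len(self.tokens)
        self.tokens.extend(tokens)

    def can_rollback(self) -> bool:
        return self._snapshot_len is not None

    def rollback(self):
        # Restore the cache to the last snapshot (e.g., after a failed turn).
        self.tokens = self.tokens[: self._snapshot_len]
        self._snapshot_len = None

    def clear_rollback_point(self):
        # Discard the snapshot after a successful turn.
        self._snapshot_len = None
```

Because the snapshot is just a cache length, a failed turn is undone in O(1) bookkeeping rather than by re-encoding the whole conversation.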