Chat Mode (Turn-based Chat)

Audio Format Specification

Direction	Format	Sample Rate	Channels	Encoding
Client → Server	PCM Float32	16000 Hz	Mono	Base64
Server → Client	PCM Float32	24000 Hz	Mono	Base64

Base64 encodes the raw bytes of the Float32 array directly (not WAV, not Opus).

Common Data Types

Message

{
  "role": "system | user | assistant",
  "content": "plain text string"
}

content can also be a multimodal content list:

{
  "role": "user",
  "content": [
    {"type": "text", "text": "Describe this image"},
    {"type": "image", "data": "<base64>"},
    {"type": "audio", "data": "<base64 PCM float32>", "sample_rate": 16000},
    {"type": "video", "data": "<base64 video file>", "stack_frames": 1}
  ]
}

Content Type	Required Fields	Optional Fields	Description
`text`	`text`	—	Text content
`image`	`data` (Base64)	—	Image content
`audio`	`data` (Base64 PCM float32)	`sample_rate` (default: 16000)	Audio content
`video`	`data` (Base64 video file)	`stack_frames` (default: 1)	Video content; frames and audio are auto-extracted

GenerationConfig

Field	Type	Default	Description
`max_new_tokens`	int	512	Maximum number of tokens to generate
`temperature`	float	0.7	Sampling temperature
`top_p`	float	0.8	Top-P (nucleus) sampling
`length_penalty`	float	1.0	Length penalty coefficient

TTSConfig

Field	Type	Default	Description
`enabled`	bool	true	Enable voice output
`mode`	string	`"default"`	TTS mode: `default`, `audio_assistant`, `omni`, `audio_roleplay`, `voice_cloning`
`ref_audio_path`	string	—	Server-side reference audio path
`ref_audio_data`	string	—	Base64-encoded reference audio (takes precedence over `ref_audio_path`)
`language`	string	—	Language hint

ImageConfig

Field	Type	Default	Description
`max_slice_nums`	int	null	Maximum image slices for high-resolution processing
`use_image_id`	bool	true	Assign image IDs in multi-image contexts

Overview

Turn-based Chat provides stateless multimodal dialogue with support for text, image, audio, and video inputs. It supports both streaming (token-by-token) and non-streaming (one-shot) generation.

Each request is stateless — the server performs a full prefill of all messages without reusing KV Cache from previous requests. The GPU is released immediately after inference completes, making it the most resource-efficient mode.

Capabilities: Text + Image + Audio + Video input, Text + Audio output, streaming output, multi-turn dialogue (client-managed history).

Lifecycle

sequenceDiagram
    participant C as Client
    participant S as Server

    C->>S: WS /ws/chat
    S-->>C: (queued → queue_done)
    S-->>C: prefill_done
    alt Streaming mode
        loop Until generation complete
            S-->>C: chunk (text_delta + audio_data)
        end
    end
    S-->>C: done (full text + audio)
    Note over C,S: Connection closed, GPU released

Phase 1 — Request & Queue: Client opens a WebSocket connection and sends the request. The server places it in a shared FIFO queue. When a GPU Worker becomes available, the request is dequeued and processing begins.

Phase 2 — Prefill: All messages are encoded and fed into the model in one shot. Upon completion, the server sends prefill_done with input_tokens count. This phase is the main latency contributor — it scales with total input size including images and audio.

Phase 3 — Generation: In streaming mode, the model generates tokens one by one, sending each as a chunk message containing text_delta and optionally audio_data. In non-streaming mode, all tokens are generated internally and only the final result is returned. Streaming mode provides much lower time-to-first-token.

Phase 4 — Completion: The server sends done with the full generated text, audio data, and token statistics. The WebSocket connection closes. The GPU Worker is returned to the pool.

Error handling: If any error occurs during processing, the server sends an error message and closes the connection. The GPU Worker is still released.

WebSocket — wss://host/ws/chat

Turn-based Chat via WebSocket, supporting both streaming and non-streaming modes. Delivers incremental results in real time.

Full Session Lifecycle

Client opens WebSocket connection to wss://host/ws/chat.
Client sends one JSON message containing the full request (messages, config, etc.).
Server enqueues the request. If the queue is not empty, the client waits.
Server performs prefill and sends prefill_done.
If streaming: true, server sends a sequence of chunk messages as tokens are generated.
Server sends done with the complete result.
Connection closes automatically.

If an error occurs at any stage, the server sends error and closes the connection.

Client → Server

One JSON message sent immediately after connection:

{
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "streaming": true,
  "generation": {"max_new_tokens": 256, "length_penalty": 1.1},
  "tts": {"enabled": true, "ref_audio_data": "<base64>"},
  "image": {"max_slice_nums": null},
  "omni_mode": false,
  "enable_thinking": false
}

Field	Type	Default	Description
`messages`	Message[]	—	Conversation messages
`streaming`	bool	true	Enable streaming output (token-by-token)
`generation`	GenerationConfig	—	Generation parameters
`tts`	TTSConfig	—	TTS configuration
`image`	ImageConfig	—	Image processing parameters
`omni_mode`	bool	false	Enable omni mode for video input
`enable_thinking`	bool	false	Enable thinking mode

Server → Client

Messages arrive in strict order: prefill_done → (chunk...) → done.

Message Type	Key Fields	Description
`prefill_done`	`input_tokens`	Prefill complete; generation starting
`chunk`	`text_delta`, `audio_data`	One streaming token (only when `streaming: true`). `text_delta` is the incremental text; `audio_data` is the corresponding audio segment (may be null for some chunks)
`done`	`text`, `generated_tokens`, `input_tokens`, `audio_data`, `recording_session_id`	Generation complete. `text` contains the full accumulated text; `audio_data` contains the full audio (in non-streaming mode) or the final segment (in streaming mode)
`error`	`error`	Error message; connection will close

chunk example:

{
  "type": "chunk",
  "text_delta": "Hello",
  "audio_data": "<base64, 24kHz>"
}

done example:

{
  "type": "done",
  "text": "Hello! How can I help you?",
  "generated_tokens": 15,
  "input_tokens": 42,
  "audio_data": "<base64, 24kHz>",
  "recording_session_id": "chat_abc123"
}

Example: Full Lifecycle

JavaScript

// -- Reference audio for voice cloning (base64 PCM float32, 16kHz) --
// Loaded from an uploaded file or the server default (/api/default_ref_audio).
const refAudioBase64 = getRefAudioBase64();

const ws = new WebSocket(`wss://${location.host}/ws/chat`);
let fullText = '';

ws.onopen = () => {
  const payload = {
    messages: [
      // System message uses content-list format: [text, audio, text].
      // The audio item embeds the reference voice into the LLM context.
      { role: 'system', content: [
        { type: 'text', text: 'Mimic the voice from the audio sample.' },
        { type: 'audio', data: refAudioBase64 },          // reference voice
        { type: 'text', text: 'You are a helpful assistant. Reply naturally.' },
      ]},
      { role: 'user', content: 'Hello!' },
    ],
    streaming: true,
    generation: { max_new_tokens: 256, temperature: 0.7 },
    tts: {
      enabled: true,
      mode: 'audio_assistant',
      ref_audio_data: refAudioBase64,  // same audio initializes TTS vocoder
    },
  };
  ws.send(JSON.stringify(payload));
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  switch (msg.type) {
    case 'prefill_done':
      // Server finished prefilling the prompt into KV Cache
      console.log(`Prefill done, ${msg.input_tokens} input tokens`);
      break;
    case 'chunk':
      // Streaming token: incremental text and/or audio segment
      if (msg.text_delta) {
        fullText += msg.text_delta;
        console.log('Streaming:', fullText);
      }
      if (msg.audio_data) playAudio(msg.audio_data);  // PCM float32, 24kHz
      break;
    case 'done':
      // Generation complete — full text and token stats
      console.log('Final text:', msg.text);
      console.log(`Tokens: ${msg.input_tokens} in, ${msg.generated_tokens} out`);
      ws.close();
      break;
    case 'error':
      console.error('Error:', msg.error);
      ws.close();
      break;
  }
};

ws.onerror = () => console.error('WebSocket error');
ws.onclose = () => console.log('Connection closed');

Python

import asyncio, json, base64
import numpy as np
import websockets

def load_ref_audio(path: str) -> str:
    """Load a WAV file and return base64-encoded PCM float32 at 16kHz."""
    import soundfile as sf
    audio, _ = sf.read(path, dtype="float32", samplerate=16000)
    return base64.b64encode(audio.tobytes()).decode()

async def chat_streaming(
    server_url="wss://localhost:8006/ws/chat",
    ref_audio_path: str | None = "ref.wav",
):
    async with websockets.connect(server_url) as ws:
        # Load reference audio for voice cloning
        ref_b64 = load_ref_audio(ref_audio_path) if ref_audio_path else None

        # System message uses content-list format: [text, audio, text].
        # The audio item embeds the reference voice into the LLM context.
        system_content = "You are a helpful assistant."
        if ref_b64:
            system_content = [
                {"type": "text", "text": "Mimic the voice from the audio sample."},
                {"type": "audio", "data": ref_b64},          # reference voice
                {"type": "text", "text": "You are a helpful assistant. Reply naturally."},
            ]

        # TTS config — ref_audio_data initializes the TTS vocoder separately
        tts_config = {"enabled": True, "mode": "audio_assistant"}
        if ref_b64:
            tts_config["ref_audio_data"] = ref_b64

        await ws.send(json.dumps({
            "messages": [
                {"role": "system", "content": system_content},
                {"role": "user", "content": "Hello!"},
            ],
            "streaming": True,
            "generation": {"max_new_tokens": 256, "temperature": 0.7},
            "tts": tts_config,
        }))

        full_text = ""
        audio_chunks = []

        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "prefill_done":
                print(f"Prefill done, {msg['input_tokens']} input tokens")
            elif msg["type"] == "chunk":
                # Incremental text token
                if msg.get("text_delta"):
                    full_text += msg["text_delta"]
                    print(f"Streaming: {full_text}")
                # Incremental audio segment (PCM float32, 24kHz)
                if msg.get("audio_data"):
                    audio_chunks.append(base64.b64decode(msg["audio_data"]))
            elif msg["type"] == "done":
                print(f"Final: {msg['text']}")
                print(f"Tokens: {msg['input_tokens']} in, {msg['generated_tokens']} out")
                break
            elif msg["type"] == "error":
                print(f"Error: {msg['error']}")
                break

        if audio_chunks:
            pcm = np.frombuffer(b"".join(audio_chunks), dtype=np.float32)
            print(f"Received {len(pcm)/24000:.1f}s of audio")

asyncio.run(chat_streaming())

Processor Method Chain

The internal processing pipeline for each Chat request:

Step	Method	Description
1	`UnifiedProcessor.set_chat_mode()`	Switch to Chat mode (< 0.1ms hot-switch), returns `ChatView`
2	`ChatView.prefill(session_id, msgs, ...)`	Encode all messages (text, image, audio, video) and fill into KV Cache in one shot
3a	`ChatView.streaming_generate(session_id, ...)`	Streaming path: yields `StreamingChunk` objects one by one, each containing a text token and optional audio segment
3b	`ChatView.generate(session_id, ...)`	Non-streaming path: runs HuggingFace `generate()` internally, then TTS, returns the complete result