Chat Mode Details
The core inference mode behind the Turn-based Chat page. It communicates over the `/ws/chat` WebSocket endpoint and supports both streaming and non-streaming response modes.
In One Sentence
Feed all history messages into the model at once (Prefill), then let the model generate a reply based on the existing context (Generate).
Overall Flow
Simplified View
Only two steps between the frontend and the model:
```mermaid
sequenceDiagram
    participant FE as Frontend
    participant MD as Model
    FE->>MD: 1. Prefill: all messages
    MD-->>FE: 2. Generate: reply (text + optional audio)
```
Detailed View
Full flow with the intermediate layers (Gateway passthrough proxy + Worker dispatch):
```mermaid
sequenceDiagram
    participant FE as Frontend
    participant GW as Gateway
    participant WK as Worker
    participant MD as Model
    FE->>GW: WebSocket /ws/chat
    GW->>WK: WebSocket passthrough
    FE->>WK: {messages, streaming, tts, ...}
    WK->>MD: Prefill (fill all messages at once)
    MD-->>WK: KV Cache ready
    WK-->>FE: prefill_done
    alt Streaming response
        loop chunk-by-chunk generation
            MD-->>WK: text chunk + audio chunk
            WK-->>FE: {type: "chunk", text_delta, audio_data}
        end
    else Non-streaming response
        MD-->>WK: complete text + complete audio
    end
    WK-->>FE: {type: "done", text, generated_tokens}
```
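The Gateway's role in this flow is purely a relay: it accepts the frontend connection and forwards frames in both directions to the Worker. A minimal sketch of such a bidirectional passthrough is shown below (using the `websockets` package; addresses, ports, and error handling are illustrative, not the actual Gateway code):

```python
# Illustrative bidirectional WebSocket passthrough (NOT the actual Gateway code).
import asyncio
import websockets

WORKER_WS = "ws://worker:8001/ws/chat"  # hypothetical Worker address

async def relay(src, dst):
    # Copy frames from one connection to the other until the source closes.
    async for message in src:
        await dst.send(message)

async def handle_client(client):
    # Open a matching connection to the Worker and pump frames both ways.
    async with websockets.connect(WORKER_WS) as worker:
        await asyncio.gather(
            relay(client, worker),   # frontend -> worker: the request JSON
            relay(worker, client),   # worker -> frontend: prefill_done / chunk / done / error
        )

async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 8000):
        await asyncio.Future()  # serve forever

# asyncio.run(main())
```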
Prefill Stage
Prefill fills the entire conversation history (system prompt + multi-turn user/assistant messages) into the model's KV Cache in one shot.
Key behaviors:
- Uses `apply_chat_template` to build the prompt, but without a generation prompt (`add_generation_prompt=False`)
- The generation prompt (`<|im_start|>assistant\n`) is injected during the Generate stage
- Supports multimodal input: text, images, audio, video
```python
# Pseudocode
prompt = tokenizer.apply_chat_template(msgs, add_generation_prompt=False)
inputs = processor(prompt, images, audios)
inputs_embeds = get_vllm_embedding(inputs)  # vision encoding
inputs_embeds = get_omni_embedding(inputs)  # audio encoding
# Fill KV Cache
outputs = llm(inputs_embeds=inputs_embeds, use_cache=True)
kv_cache = outputs.past_key_values
```
Generate Stage
Generate produces the reply from the filled KV Cache: it first injects the bos string, then branches into non-streaming or streaming generation.
Bos String Injection
```
<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n<|tts_bos|>
```
This is consistent with the model's behavior in `chat()` and `streaming_generate()`.
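Conceptually, injecting the bos string just extends the prefilled KV Cache by a few tokens before decoding starts. A rough sketch under that assumption (the real logic lives inside the model's `chat()` / `streaming_generate()`):

```python
# Rough sketch of bos-string injection (illustrative; the real logic is inside the model).
BOS_STRING = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n<|tts_bos|>"

bos_ids = tokenizer(BOS_STRING, return_tensors="pt", add_special_tokens=False).input_ids
bos_embeds = llm.get_input_embeddings()(bos_ids)

# Extend the prefilled KV Cache with the bos tokens; generation continues from here.
outputs = llm(inputs_embeds=bos_embeds, past_key_values=kv_cache, use_cache=True)
kv_cache = outputs.past_key_values
```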
Non-streaming Generation
Uses HuggingFace's standard `llm.generate()` API:
- bos string → embedding → KV Cache forward pass
- `llm.generate(past_key_values=kv_cache, ...)` generates the complete sequence
- Decode text
- Optional: call `_generate_speech_non_streaming()` for TTS audio
Best for: short replies, scenarios that don't require real-time feedback.
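A hedged sketch of this branch is shown below; `_generate_speech_non_streaming()` is named in the list above, but its signature and the exact `generate()` kwargs accepted by the model are assumptions:

```python
# Sketch of the non-streaming branch; kwargs and the TTS call signature are assumptions.
gen_out = llm.generate(
    inputs_embeds=bos_embeds,        # bos string embeddings (see above)
    past_key_values=kv_cache,        # KV Cache filled during Prefill
    max_new_tokens=256,
    return_dict_in_generate=True,
)
text = tokenizer.decode(gen_out.sequences[0], skip_special_tokens=True)

# Optional TTS pass over the complete text (method name from above, signature assumed).
if tts_enabled:
    waveform = model._generate_speech_non_streaming(text)
```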
Streaming Generation
Uses the model's internal `streaming_generate()` method:
- bos string is auto-injected
- Generates in groups of 10 tokens (`ChunkPrefillChunkGenerate`)
- Each token group is fed to TTS to produce an audio chunk
- Yields `(waveform_chunk, text_delta)` incrementally
Best for: long replies, scenarios that need real-time token-by-token display.
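For illustration, consuming the generator might look like the sketch below; only the yielded `(waveform_chunk, text_delta)` tuple is taken from the description above, while the argument names and the audio sink are assumptions:

```python
# Illustrative consumption of streaming_generate(); only the yielded tuple shape is
# taken from the description above, argument names and the audio sink are assumptions.
full_text = ""
for waveform_chunk, text_delta in model.streaming_generate(msgs, generation_config=gen_cfg):
    full_text += text_delta          # incremental, token-group-by-token-group text
    play_or_buffer(waveform_chunk)   # hypothetical audio sink for the TTS chunk
```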
WebSocket Protocol
Endpoint
```
wss://host/ws/chat
```
Request Format
Send one JSON message after connecting:
```json
{
  "messages": [
    {"role": "system", "content": [{"type": "text", "text": "..."}]},
    {"role": "user", "content": "Hello"}
  ],
  "streaming": true,
  "generation": {"max_new_tokens": 256, "length_penalty": 1.1},
  "tts": {"enabled": true, "ref_audio_data": "<base64>"},
  "image": {"max_slice_nums": null},
  "omni_mode": false
}
```
| Field | Description |
|---|---|
| `messages` | Complete message history (including system prompt) |
| `streaming` | `true` for streaming / `false` for non-streaming |
| `generation` | Generation parameters (`max_new_tokens`, `length_penalty`, etc.) |
| `tts` | Voice Response config (`enabled`, `ref_audio_data`) |
| `omni_mode` | Set to `true` for video input |
Response Messages
| Type | Fields | Description |
|---|---|---|
| `prefill_done` | `input_tokens` | Prefill complete, returns input token count |
| `chunk` | `text_delta`, `audio_data` | One streaming chunk (`streaming=true` only) |
| `done` | `text`, `generated_tokens`, `input_tokens`, `audio_data` | Generation complete |
| `error` | `error` | Error message |
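Putting the request and response formats together, a minimal client could look like this (Python with the `websockets` package; the host is a placeholder):

```python
# Minimal illustrative client for /ws/chat (host is a placeholder).
import asyncio
import json
import websockets

async def chat():
    async with websockets.connect("wss://host/ws/chat") as ws:
        await ws.send(json.dumps({
            "messages": [{"role": "user", "content": "Hello"}],
            "streaming": True,
            "generation": {"max_new_tokens": 256},
            "tts": {"enabled": False},
        }))
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "prefill_done":
                print("prefill done, input_tokens =", msg["input_tokens"])
            elif msg["type"] == "chunk":
                print(msg["text_delta"], end="", flush=True)
            elif msg["type"] == "done":
                print("\n[done]", msg["generated_tokens"], "tokens")
                break
            elif msg["type"] == "error":
                raise RuntimeError(msg["error"])

# asyncio.run(chat())
```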
Call Chain
```
Frontend turnbased.html
└─ WebSocket /ws/chat
   └─ Gateway (WS passthrough proxy)
      └─ Worker /ws/chat
         ├─ chat_prefill()
         │  └─ ChatView.prefill()
         │     └─ model.non_streaming_prefill()
         ├─ TTS init
         │  └─ model.init_token2wav_cache()
         └─ Generation
            ├─ Streaming: chat_streaming_generate()
            │  └─ ChatView.streaming_generate()
            │     └─ model.streaming_generate()
            └─ Non-streaming: chat_non_streaming_generate()
               └─ ChatView.generate()
                  └─ model.non_streaming_generate()
```
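In pseudocode, the Worker-side dispatch follows the same order as the chain above (function names are taken from the chain; their signatures, return values, and the `encode_wav()` helper are assumptions):

```python
# Simplified sketch of the Worker-side dispatch; function names come from the call chain
# above, while signatures, return values, and encode_wav() are assumptions.
async def ws_chat(websocket, req):
    kv_cache, input_tokens = chat_prefill(req["messages"])        # -> ChatView.prefill()
    await websocket.send_json({"type": "prefill_done", "input_tokens": input_tokens})

    if req.get("tts", {}).get("enabled"):
        model.init_token2wav_cache(req["tts"]["ref_audio_data"])  # TTS init

    if req["streaming"]:
        text = ""
        for wav_chunk, text_delta in chat_streaming_generate(kv_cache, req["generation"]):
            text += text_delta
            await websocket.send_json({"type": "chunk",
                                       "text_delta": text_delta,
                                       "audio_data": encode_wav(wav_chunk)})
    else:
        text, audio = chat_non_streaming_generate(kv_cache, req["generation"])

    # The real handler also reports generated_tokens / input_tokens here.
    await websocket.send_json({"type": "done", "text": text})
```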
Voice Response
The frontend provides a "Voice Response" toggle:
| State | Behavior |
|---|---|
| On | Generates voice reply + text; automatically suppresses special characters (Markdown *, #, code symbols, etc.) to ensure text is suitable for voice playback |
| Off | Text-only generation; no character suppression, so Markdown, code, etc. can be output freely |
Technical detail: the `suppress_forbidden_tokens` parameter in `ChunkPrefillChunkGenerate.chunk_generate()` is controlled by `generate_audio`.
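The general mechanism behind such suppression is masking the logits of unwanted tokens during decoding. The sketch below only illustrates that idea; the actual token list and wiring live inside `ChunkPrefillChunkGenerate.chunk_generate()`:

```python
# Illustration of logit masking for "forbidden" characters; the actual token list and
# wiring live inside ChunkPrefillChunkGenerate.chunk_generate().
import torch

FORBIDDEN_STRINGS = ["*", "#", "`"]        # example characters, not the real list

def suppress_forbidden_tokens(logits: torch.Tensor, tokenizer, enabled: bool) -> torch.Tensor:
    if not enabled:                        # Voice Response off -> no suppression
        return logits
    for s in FORBIDDEN_STRINGS:
        for tok_id in tokenizer(s, add_special_tokens=False).input_ids:
            logits[..., tok_id] = float("-inf")   # token can never be sampled
    return logits
```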
Video Input
Chat mode supports video input (VideoContent):
- Frontend uploads a video file → Base64 encoding
- Processor decodes to a temp file → `get_video_frame_audio_segments()` extracts frames and audio
- Frames and audio are interleaved (frame, audio, [stacked_frame], ...) and fed to the model
- Request automatically sets `omni_mode: true`, `max_slice_nums: 1`
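For illustration, a client could assemble a video request roughly as follows (the exact `VideoContent` schema inside `messages` is an assumption; `omni_mode` and `max_slice_nums` follow the description above):

```python
# Illustrative construction of a video request; the content schema for VideoContent is
# an assumption, omni_mode / max_slice_nums follow the description above.
import base64

with open("demo.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode()

request = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "video", "video": video_b64},   # assumed field name
            {"type": "text", "text": "Describe this video."},
        ],
    }],
    "streaming": True,
    "omni_mode": True,                # set automatically for video input
    "image": {"max_slice_nums": 1},   # set automatically for video input
}
```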