System Architecture Overview

Overall Architecture

The system adopts a Frontend - Gateway - Worker Pool three-tier architecture:

graph TB
    subgraph clientLayer [Client Layer]
        Browser["Browser (HTML/JS)"]
    end

    subgraph gatewayLayer [Gateway Layer]
        Gateway["Gateway (:8006, HTTPS)"]
        WPool["WorkerPool Scheduler"]
        Queue["FIFO Request Queue"]
        AppReg["AppRegistry"]
        RefAudio["RefAudioRegistry"]
        Gateway --> WPool
        Gateway --> Queue
        Gateway --> AppReg
        Gateway --> RefAudio
    end

    subgraph workerLayer [Worker Layer]
        W0["Worker 0 (GPU 0)\n:22400"]
        W1["Worker 1 (GPU 1)\n:22401"]
        W2["Worker N (GPU N)\n:22400+N"]
    end

    subgraph modelLayer [Model Layer]
        UP["UnifiedProcessor"]
        ChatV["ChatView"]
        HdxV["HalfDuplexView"]
        DuplexV["DuplexView"]
        UP --> ChatV
        UP --> HdxV
        UP --> DuplexV
    end

    Browser -->|"HTTPS / WSS"| Gateway
    WPool -->|"HTTP / WS (Internal)"| W0
    WPool -->|"HTTP / WS (Internal)"| W1
    WPool -->|"HTTP / WS (Internal)"| W2
    W0 --> UP

Responsibilities of Each Layer

| Layer | Component | Responsibilities |
|-------|-----------|------------------|
| Client Layer | Browser Frontend | Mode selection, audio/video capture, WebSocket communication, session recording |
| Gateway Layer | Gateway | Request routing & dispatch, WebSocket proxy, FIFO queuing, session affinity, ETA estimation |
| Worker Layer | Worker x N | Each Worker owns a dedicated GPU, performs model inference, manages KV Cache |
| Model Layer | UnifiedProcessor | Unified model loading, millisecond-level hot-switching between Chat / Half-Duplex / Duplex |
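
The Gateway's dispatch and FIFO queuing duties can be illustrated with a minimal sketch. This is not the project's actual scheduler: the `Worker` and `WorkerPool` classes, their method names, and the ETA formula (queue position divided by worker count, times an average service time) are assumptions made for illustration. Only the port rule `22400 + N` comes from the diagram above.

```python
from collections import deque
from dataclasses import dataclass, field

BASE_PORT = 22400  # Worker N listens on 22400 + N, per the architecture diagram


@dataclass
class Worker:
    index: int
    busy: bool = False

    @property
    def port(self) -> int:
        return BASE_PORT + self.index


@dataclass
class WorkerPool:
    """Hypothetical scheduler: one request per worker, FIFO queue for the rest."""
    workers: list
    queue: deque = field(default_factory=deque)  # waiting request ids, oldest first

    def acquire(self, request_id: str):
        """Return an idle worker, or enqueue the request and return None."""
        for worker in self.workers:
            if not worker.busy:
                worker.busy = True
                return worker
        self.queue.append(request_id)
        return None

    def release(self, worker: Worker):
        """Free a worker; if requests are waiting, hand it to the oldest one."""
        if self.queue:
            next_id = self.queue.popleft()
            return next_id, worker  # worker stays busy for the dequeued request
        worker.busy = False
        return None

    def eta_seconds(self, request_id: str, avg_service_s: float) -> float:
        """Assumed ETA heuristic: queue position / worker count * average service time."""
        position = list(self.queue).index(request_id) + 1
        return position / len(self.workers) * avg_service_s
```

With two workers, a third request waits in the queue and is handed the first worker that is released, which is the FIFO behavior the table describes.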

Four Interaction Modes

The system provides four interaction modes, sharing three WebSocket endpoints under the hood:

| Mode | Features | Input Modalities | Output Modalities | Interaction Paradigm | Endpoint |
|------|----------|------------------|-------------------|----------------------|----------|
| Turn-based Chat | Low-latency streaming interaction, reply triggered by button or VAD, strong base capabilities | Audio + Text + Image + Video | Audio + Text | Turn-based dialogue | ChatView |
| Half-Duplex Audio | VAD auto-detects speech boundaries, hands-free voice conversation | Voice | Text + Voice | Half-duplex | HalfDuplexView |
| Omnimodal Full-Duplex | Full-modality full-duplex, vision + voice input and voice output occur simultaneously | Vision + Voice | Text + Voice | Full-duplex | DuplexView |
| Audio Full-Duplex | Voice full-duplex, voice input and output occur simultaneously | Voice | Text + Voice | Full-duplex | DuplexView |

graph LR
    subgraph modes [Four Interaction Modes]
        TB["Turn-based Chat"]
        HD["Half-Duplex Audio"]
        OD["Omnimodal Full-Duplex"]
        AD["Audio Full-Duplex"]
    end

    subgraph apis [Three WebSocket Endpoints]
        ChatAPI["/ws/chat\nChatView"]
        HdxAPI["/ws/half_duplex\nHalfDuplexView"]
        DuplexAPI["/ws/duplex\nDuplexView"]
    end

    TB --> ChatAPI
    HD --> HdxAPI
    OD --> DuplexAPI
    AD --> DuplexAPI
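
The four-to-three mapping in the diagram can also be written as a lookup table. The sketch below is illustrative only: the `Mode` enum and `endpoint_for` helper are hypothetical names, while the endpoint paths are the ones given in this document.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    TURN_BASED_CHAT = "turn_based_chat"
    HALF_DUPLEX_AUDIO = "half_duplex_audio"
    OMNI_FULL_DUPLEX = "omni_full_duplex"
    AUDIO_FULL_DUPLEX = "audio_full_duplex"


# Four modes share three endpoint templates; both full-duplex modes use /ws/duplex.
ENDPOINTS = {
    Mode.TURN_BASED_CHAT: "/ws/chat",
    Mode.HALF_DUPLEX_AUDIO: "/ws/half_duplex/{session_id}",
    Mode.OMNI_FULL_DUPLEX: "/ws/duplex/{session_id}",
    Mode.AUDIO_FULL_DUPLEX: "/ws/duplex/{session_id}",
}


def endpoint_for(mode: Mode, session_id: Optional[str] = None) -> str:
    """Resolve the WebSocket path for a mode (hypothetical helper)."""
    template = ENDPOINTS[mode]
    if "{session_id}" in template:
        if session_id is None:
            raise ValueError(f"{mode.name} requires a session_id")
        return template.format(session_id=session_id)
    return template
```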

Chat Endpoint — Turn-based Chat

Turn-based Chat uses ChatView (/ws/chat WebSocket) to implement turn-based multimodal dialogue.

ChatView splits inference into prefill and generate stages: prefill fills all messages into the KV Cache in one shot, and generate supports both streaming and non-streaming modes. The frontend's Streaming switch selects between real-time token-by-token output and a one-shot response.
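
The prefill/generate split can be pictured with a toy stand-in. `ToyChatView` below is hypothetical and does not reflect the real ChatView interface: it "decodes" by echoing the words of the last message instead of running a model, and exists only to show how one prefill call can feed either a streaming or a one-shot generate call.

```python
from typing import Iterator, List, Union


class ToyChatView:
    """Hypothetical stand-in for ChatView; echoes text rather than running a model."""

    def __init__(self) -> None:
        self.kv_cache: List[str] = []

    def prefill(self, messages: List[str]) -> None:
        # One shot: every message enters the cache before generation starts.
        self.kv_cache.extend(messages)

    def _decode(self) -> Iterator[str]:
        # Placeholder decoding: yield the words of the most recent cached message.
        for token in self.kv_cache[-1].split():
            yield token

    def generate(self, streaming: bool) -> Union[Iterator[str], str]:
        tokens = self._decode()
        if streaming:
            return tokens  # caller consumes tokens one by one as they are produced
        return " ".join(tokens)  # caller receives the complete reply at once
```

In streaming mode the caller iterates over the returned tokens; in non-streaming mode it receives one string.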

See ChatView Mode Details for more information.

Half-Duplex Endpoint — Half-Duplex Audio

Half-Duplex Audio uses HalfDuplexView (/ws/half_duplex/{session_id} WebSocket) to implement VAD-based half-duplex voice conversation.

Server-side SileroVAD detects speech boundaries in real time. Once the user finishes speaking, it automatically triggers prefill followed by streaming generate. The Worker is exclusively occupied for the entire session (default timeout: 3 minutes). Frontend parameters (VAD threshold, generation params, etc.) are passed at session start.
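
The end-of-speech trigger can be sketched as follows. This is not SileroVAD: a plain energy threshold stands in for the neural detector, and the function name, `threshold`, and `min_silence_frames` are assumptions chosen for illustration. The point is only the control flow: speech frames accumulate, and a run of silence closes the utterance.

```python
from typing import Iterable, List


def detect_turns(
    frame_energies: Iterable[float],
    threshold: float = 0.5,
    min_silence_frames: int = 2,
) -> List[List[int]]:
    """Return, for each completed utterance, the indices of its speech frames.

    Energy thresholding is a simplified substitute for a real VAD model.
    """
    turns: List[List[int]] = []
    current: List[int] = []
    silence = 0
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            current.append(i)
            silence = 0
        elif current:
            silence += 1
            if silence >= min_silence_frames:
                # End of speech: this is where prefill + streaming generate would start.
                turns.append(current)
                current, silence = [], 0
    # Speech still in progress at the end of the input is not emitted as a turn.
    return turns
```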

See Half-Duplex Mode Details for more information.

Duplex Endpoint — Full-Duplex

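Omnimodal Full-Duplex and Audio Full-Duplex share the Duplex endpoint (/ws/duplex/{session_id}), differing only in whether video frames are sent: Omnimodal Full-Duplex sends video frames alongside audio, while Audio Full-Duplex sends audio only.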

Both share the same prefill-generate unit loop, and the Worker is exclusively occupied throughout the entire session.
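
A rough picture of the shared unit loop, under stated assumptions: the `Unit` dataclass, the `run_duplex` function, and the idea that each unit carries one audio chunk plus an optional video frame are all hypothetical, and the prefill and generate steps are replaced by a descriptive string. The sketch shows only that the same loop serves both modes, with the video frame being optional.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Unit:
    audio: bytes
    video_frame: Optional[bytes] = None  # None in Audio Full-Duplex


def run_duplex(units: List[Unit]) -> List[str]:
    """Process units in order; each unit is one prefill step and one generate step."""
    outputs = []
    for step, unit in enumerate(units):
        modalities = ["audio"]
        if unit.video_frame is not None:
            modalities.append("video")
        # Placeholder for prefill(unit) followed by generate(): no model is run here.
        outputs.append(
            f"unit {step}: prefilled {'+'.join(modalities)}, generated reply chunk"
        )
    return outputs
```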

See Duplex Mode Details for more information.