# Configuration & Deployment
## System Requirements

| Requirement | Minimum |
|---|---|
| GPU | NVIDIA GPU with > 28 GB VRAM |
| OS | Linux |
| Python | 3.10 |
| CUDA | Version compatible with PyTorch 2.8.0 |
| FFmpeg | Required for video frame extraction and inference-result visualization |
## Resource Consumption Reference

| Resource | Token2Wav (Default) |
|---|---|
| VRAM (per Worker, after initialization) | ~21.5 GB |
| Model loading time | ~16 s |
| Mode switching latency | < 0.1 ms |

Note: `torch.compile` mode incurs an additional ~60 s of compilation time on the first inference.
## Dependency Installation

### Using install.sh (Recommended)

```bash
# 1. Install Python 3.10 (miniconda recommended)
mkdir -p ./miniconda3_install_tmp
wget https://repo.anaconda.com/miniconda/Miniconda3-py310_25.11.1-1-Linux-x86_64.sh \
    -O ./miniconda3_install_tmp/miniconda.sh
bash ./miniconda3_install_tmp/miniconda.sh -b -u -p ./miniconda3
source ./miniconda3/bin/activate

# 2. One-command installation
bash ./install.sh
```
`install.sh` automatically performs the following steps:

1. Creates a Python venv virtual environment at `.venv/base`
2. Installs PyTorch 2.8.0 + torchaudio
3. Installs all dependencies from `requirements.txt`
4. Verifies the installation (see the sketch below)
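
If you want to reproduce step 4 by hand, a check along these lines is enough (a sketch; the exact checks `install.sh` runs may differ):

```python
# Manual stand-in for install.sh's verification step.
import torch
import torchaudio

print(torch.__version__)          # expect 2.8.0
print(torchaudio.__version__)     # expect 2.8.0
print(torch.cuda.is_available())  # expect True on a GPU machine
```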
### Manual Installation

```bash
source ./miniconda3/bin/activate
python -m venv .venv/base
source .venv/base/bin/activate
pip install "torch==2.8.0" "torchaudio==2.8.0"
pip install -r requirements.txt
```
### Python Dependency List

| Category | Package | Version |
|---|---|---|
| Core ML | transformers | 4.51.0 |
| | accelerate | 1.12.0 |
| | safetensors | >= 0.7.0 |
| MiniCPM-o | minicpmo-utils[all] | >= 1.0.5 |
| Web Service | fastapi | >= 0.128.0 |
| | uvicorn | >= 0.40.0 |
| | httpx | >= 0.28.0 |
| | websockets | >= 16.0 |
| | python-multipart | — |
| Data | pydantic | >= 2.11.0 |
| | numpy | >= 2.2.0 |
| Utilities | tqdm | >= 4.67.0 |
| Testing | pytest | >= 9.0.0 |
| | pytest-asyncio | >= 1.3.0 |
## Configuration

### config.json

All configuration is centralized in the `config.json` file at the project root. Copy from `config.example.json` for first-time setup:

```bash
cp config.example.json config.json
```

`config.json` is listed in `.gitignore` and will not be committed.
### Configuration Priority

CLI arguments > `config.json` > Pydantic defaults
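
The merge order can be pictured with a minimal sketch (a hypothetical loader using the documented `gateway_port` field and the Gateway's `--port` flag; not the project's actual code):

```python
import argparse
import json

from pydantic import BaseModel

class ServiceConfig(BaseModel):
    gateway_port: int = 8006  # Pydantic default: lowest priority

# 1. Pydantic defaults, 2. config.json overlay, 3. CLI overlay (highest priority).
cfg = ServiceConfig()

with open("config.json") as f:
    cfg = cfg.model_copy(update=json.load(f).get("service", {}))

parser = argparse.ArgumentParser()
parser.add_argument("--port", type=int)
args = parser.parse_args()
if args.port is not None:
    cfg = cfg.model_copy(update={"gateway_port": args.port})

print(cfg.gateway_port)
```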
### Complete Field Reference

#### model — Model Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| `model_path` | str | (required) | HuggingFace-format model directory or Hub ID |
| `pt_path` | str | null | Optional .pt weight override path |
| `attn_implementation` | str | "auto" | Attention implementation method |
#### audio — Audio Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| `ref_audio_path` | str | assets/ref_audio/ref_minicpm_signature.wav | Default TTS reference audio path |
| `playback_delay_ms` | int | 200 | Frontend audio playback delay (ms); higher values are smoother but add latency |
#### service — Service Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| `gateway_port` | int | 8006 | Gateway listening port |
| `worker_base_port` | int | 22400 | Worker base port (Worker N = base + N) |
| `max_queue_size` | int | 1000 | Maximum queued requests |
| `request_timeout` | float | 300.0 | Request timeout (seconds) |
| `compile` | bool | false | Enable torch.compile acceleration |
| `data_dir` | str | "data" | Data storage directory |
| `eta_chat_s` | float | 15.0 | Chat task baseline ETA (seconds) |
| `eta_streaming_s` | float | 20.0 | Streaming task baseline ETA (seconds) |
| `eta_audio_duplex_s` | float | 120.0 | Audio Duplex task baseline ETA (seconds) |
| `eta_omni_duplex_s` | float | 90.0 | Omni Duplex task baseline ETA (seconds) |
| `eta_ema_alpha` | float | 0.3 | ETA EMA smoothing coefficient |
| `eta_ema_min_samples` | int | 3 | ETA EMA minimum sample count |
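
The `eta_*` fields feed queue wait estimates: each task type starts from its baseline ETA, and once `eta_ema_min_samples` completions have been observed, actual durations are blended in with an exponential moving average weighted by `eta_ema_alpha`. A sketch of that update (standard EMA; the project's exact bookkeeping may differ):

```python
def update_eta(current_eta: float, observed_s: float, alpha: float = 0.3) -> float:
    """Blend one observed task duration into the running ETA estimate."""
    return alpha * observed_s + (1.0 - alpha) * current_eta

# Example: chat baseline eta_chat_s = 15.0; a task just completed in 25 s.
eta = update_eta(15.0, 25.0)  # 0.3 * 25 + 0.7 * 15 = 18.0
```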
#### duplex — Duplex Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| `pause_timeout` | float | 60.0 | Duplex pause timeout (seconds); the Worker is automatically released after the timeout expires |
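
The release-on-timeout behavior can be pictured with a small asyncio sketch (hypothetical names illustrating the described behavior, not the actual Worker lifecycle code):

```python
import asyncio

PAUSE_TIMEOUT = 60.0  # duplex.pause_timeout

async def schedule_release(release_worker) -> asyncio.Task:
    """Start a timer when a duplex session pauses; cancel the task on resume."""
    async def _timer():
        try:
            await asyncio.sleep(PAUSE_TIMEOUT)
            await release_worker()  # no resume within the timeout: free the Worker
        except asyncio.CancelledError:
            pass                    # session resumed in time: keep the Worker
    return asyncio.create_task(_timer())
```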
### Minimal Configuration

```json
{
  "model": {
    "model_path": "openbmb/MiniCPM-o-4_5"
  }
}
```
### Full Configuration Example

```json
{
  "model": {
    "model_path": "openbmb/MiniCPM-o-4_5",
    "pt_path": null,
    "attn_implementation": "auto"
  },
  "audio": {
    "ref_audio_path": "assets/ref_audio/ref_minicpm_signature.wav",
    "playback_delay_ms": 200,
    "chat_vocoder": "token2wav"
  },
  "service": {
    "gateway_port": 8006,
    "worker_base_port": 22400,
    "max_queue_size": 1000,
    "request_timeout": 300.0,
    "compile": false,
    "data_dir": "data",
    "eta_chat_s": 15.0,
    "eta_streaming_s": 20.0,
    "eta_audio_duplex_s": 120.0,
    "eta_omni_duplex_s": 90.0,
    "eta_ema_alpha": 0.3,
    "eta_ema_min_samples": 3
  },
  "duplex": {
    "pause_timeout": 60.0
  }
}
```
## Attention Backend

The `attn_implementation` field selects the attention implementation used for model inference.

| Value | Behavior | Use Case |
|---|---|---|
| `"auto"` (default) | Detects flash-attn → flash_attention_2; otherwise falls back to sdpa | Recommended |
| `"flash_attention_2"` | Forces Flash Attention 2 | When flash-attn is confirmed installed |
| `"sdpa"` | PyTorch built-in SDPA | When flash-attn cannot be compiled |
| `"eager"` | Naive attention | Debugging only |

Performance comparison (A100): flash_attention_2 is ~5-15% faster than sdpa; sdpa is several times faster than eager.

Note: the Audio (Whisper) submodule always uses SDPA (it is incompatible with flash_attention_2); the Vision / LLM / TTS submodules follow this setting.
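
The `"auto"` probe boils down to an import check, roughly like this (the usual pattern for this kind of detection; not necessarily the project's exact code):

```python
import importlib.util

def resolve_attn_implementation(requested: str = "auto") -> str:
    """Map "auto" onto a concrete backend based on whether flash-attn is importable."""
    if requested != "auto":
        return requested
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"
```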
## Starting & Stopping

### One-Command Start (start_all.sh)

```bash
# Use all available GPUs (here, a 4-GPU machine)
CUDA_VISIBLE_DEVICES=0,1,2,3 bash start_all.sh

# Use a subset of GPUs
CUDA_VISIBLE_DEVICES=0,1 bash start_all.sh

# Enable torch.compile (experimental)
bash start_all.sh --compile

# Downgrade to HTTP (not recommended; microphone/camera APIs require HTTPS)
bash start_all.sh --http
```
`start_all.sh` execution flow:

1. Parses command-line arguments (`--http`, `--compile`)
2. Reads port configuration from `config.py`
3. Detects the number of available GPUs
4. Launches one Worker process per GPU (via `nohup`)
5. Waits for all Workers to pass health checks (see the sketch below)
6. Starts the Gateway process
7. Outputs the access URL and log paths
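
A rough Python equivalent of the step-5 readiness wait, using `httpx` (already in the dependency list); the `/health` endpoint path is an assumption:

```python
import time

import httpx

def wait_for_workers(ports: list[int], timeout_s: float = 120.0) -> None:
    """Poll each Worker until it answers its health check or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    pending = set(ports)
    while pending and time.monotonic() < deadline:
        for port in sorted(pending):
            try:
                r = httpx.get(f"http://localhost:{port}/health", timeout=2.0)
                if r.status_code == 200:
                    pending.discard(port)
            except httpx.HTTPError:
                pass  # Worker still loading the model (~16 s, +~60 s with --compile)
        time.sleep(1.0)
    if pending:
        raise TimeoutError(f"Workers not ready on ports: {sorted(pending)}")

# wait_for_workers([22400, 22401])  # worker_base_port + N
```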
### Manual Start

```bash
# Worker (one per GPU)
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. .venv/base/bin/python worker.py \
    --worker-index 0 --gpu-id 0

# Gateway
PYTHONPATH=. .venv/base/bin/python gateway.py \
    --port 8006 --workers localhost:22400
```
### CLI Arguments

Worker arguments:

```bash
python worker.py \
    --model-path /path/to/model \
    --pt-path /path/to/weights.pt \
    --ref-audio-path /path/to/ref.wav \
    --worker-index 0 \
    --gpu-id 0 \
    --compile
```

Gateway arguments:

```bash
python gateway.py \
    --port 8006 \
    --workers localhost:22400,localhost:22401 \
    --http
```
### Stopping the Service

```bash
pkill -f "gateway.py|worker.py"
```
## Model Download

### Automatic Download (Default)

When `model_path` is set to `openbmb/MiniCPM-o-4_5`, the model is automatically downloaded from HuggingFace on first startup.
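
Under the hood this is the standard HuggingFace cache flow; the equivalent explicit call (a sketch of what the first startup triggers indirectly) is:

```python
from huggingface_hub import snapshot_download

# Downloads the model snapshot into the local HF cache on first use
# (subsequent calls reuse it) and returns the resolved local directory.
local_dir = snapshot_download("openbmb/MiniCPM-o-4_5")
print(local_dir)
```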
### Manual Download

HuggingFace CLI:

```bash
pip install -U huggingface_hub
huggingface-cli download openbmb/MiniCPM-o-4_5 --local-dir /path/to/MiniCPM-o-4_5
```

hf-mirror (China):

```bash
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download openbmb/MiniCPM-o-4_5 --local-dir /path/to/MiniCPM-o-4_5
```

ModelScope (China):

```bash
pip install modelscope
modelscope download --model OpenBMB/MiniCPM-o-4_5 --local_dir /path/to/MiniCPM-o-4_5
```
## Testing

### Schema Unit Tests (No GPU Required)

```bash
PYTHONPATH=. .venv/base/bin/python -m pytest tests/test_schemas.py -v
```

### Processor Tests (GPU Required)

```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. .venv/base/bin/python -m pytest \
    tests/test_chat.py tests/test_streaming.py tests/test_duplex.py -v -s
```

### API Integration Tests (Service Must Be Running)

```bash
PYTHONPATH=. .venv/base/bin/python -m pytest tests/test_api.py -v -s
```
### Test File Reference

| File | Description |
|---|---|
| `test_schemas.py` | Schema unit tests |
| `test_chat.py` | Chat inference tests |
| `test_streaming.py` | Streaming inference tests |
| `test_duplex.py` | Duplex inference tests |
| `test_api.py` | API integration tests |
| `test_queue.py` | Queue logic tests |
| `test_queue_stress.py` | Queue stress tests |
| `test_integration.py` | Integration tests |
| `test_e2e.py` | End-to-end tests |
| `bench_duplex_ws.py` | Duplex WebSocket performance benchmark |
| `mock_worker.py` | Mock Worker (for GPU-free testing) |
| `js/queue-scenario.test.js` | Frontend queue scenario tests (Vitest) |
| `js/countdown-timer.test.js` | Countdown component tests (Vitest) |
## Runtime Directory Structure

```
data/
├── sessions/                  # Session recording data
│   ├── omni_abc123/
│   │   ├── meta.json
│   │   ├── recording.json
│   │   ├── user_audio/
│   │   ├── ai_audio/
│   │   └── ...
│   └── ...
└── ref_audio/                 # Uploaded reference audios
    ├── registry.json
    └── *.wav

tmp/
├── gateway.pid                # Gateway process PID
├── gateway.log                # Gateway log
├── worker_0.pid               # Worker 0 PID
├── worker_0.log               # Worker 0 log
└── diag_omni_*.jsonl          # Diagnostic logs
```