Configuration & Deployment

System Requirements

| Requirement | Minimum |
|---|---|
| GPU | NVIDIA GPU with VRAM > 28 GB |
| OS | Linux |
| Python | 3.10 |
| CUDA | Compatible with PyTorch 2.8.0 |
| FFmpeg | For video frame extraction and inference result visualization |
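
To sanity-check a host against these requirements before installing, a short script along the following lines can help (an illustrative sketch, not part of the project):

```python
import shutil

import torch

# Preflight check mirroring the table above: a CUDA GPU with > 28 GB VRAM,
# and FFmpeg available on PATH. Adjust to your deployment as needed.
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
assert vram_gb > 28, f"GPU 0 has {vram_gb:.1f} GB VRAM; > 28 GB required"
assert shutil.which("ffmpeg") is not None, "FFmpeg not found on PATH"
print(f"OK: {props.name}, {vram_gb:.1f} GB VRAM")
```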

Resource Consumption Reference

| Resource | Token2Wav (Default) |
|---|---|
| VRAM (per Worker, after initialization) | ~21.5 GB |
| Model loading time | ~16s |
| Mode switching latency | < 0.1ms |

torch.compile mode incurs an additional ~60s compilation time on first inference.


Dependency Installation

# 1. Install Python 3.10 (miniconda recommended)
mkdir -p ./miniconda3_install_tmp
wget https://repo.anaconda.com/miniconda/Miniconda3-py310_25.11.1-1-Linux-x86_64.sh \
    -O ./miniconda3_install_tmp/miniconda.sh
bash ./miniconda3_install_tmp/miniconda.sh -b -u -p ./miniconda3
source ./miniconda3/bin/activate

# 2. One-command installation
bash ./install.sh

install.sh automatically performs the following steps:
1. Creates a Python venv virtual environment at .venv/base
2. Installs PyTorch 2.8.0 + torchaudio
3. Installs all dependencies from requirements.txt
4. Verifies the installation

Manual Installation

source ./miniconda3/bin/activate
python -m venv .venv/base
source .venv/base/bin/activate

pip install "torch==2.8.0" "torchaudio==2.8.0"
pip install -r requirements.txt
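
After either installation path, a quick check that the pinned versions are active (illustrative):

```python
import torch
import torchaudio

# Versions should match the pins above (torch==2.8.0, torchaudio==2.8.0).
assert torch.__version__.startswith("2.8.0"), torch.__version__
assert torchaudio.__version__.startswith("2.8.0"), torchaudio.__version__
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```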

Python Dependency List

| Category | Package | Version |
|---|---|---|
| Core ML | transformers | 4.51.0 |
| | accelerate | 1.12.0 |
| | safetensors | >= 0.7.0 |
| MiniCPM-o | minicpmo-utils[all] | >= 1.0.5 |
| Web Service | fastapi | >= 0.128.0 |
| | uvicorn | >= 0.40.0 |
| | httpx | >= 0.28.0 |
| | websockets | >= 16.0 |
| | python-multipart | |
| Data | pydantic | >= 2.11.0 |
| | numpy | >= 2.2.0 |
| Utilities | tqdm | >= 4.67.0 |
| Testing | pytest | >= 9.0.0 |
| | pytest-asyncio | >= 1.3.0 |

Configuration

config.json

All configuration is centralized in the config.json file at the project root. Copy from config.example.json for first-time setup:

cp config.example.json config.json

config.json is listed in .gitignore and will not be committed.

Configuration Priority

CLI arguments > config.json > Pydantic defaults
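
The layering behaves like a dict merge in ascending priority. A minimal sketch (illustrative; the real loader lives in the project, and the field name is borrowed from the service section below):

```python
from pydantic import BaseModel

class ServiceConfig(BaseModel):
    gateway_port: int = 8006  # Pydantic default: lowest priority

file_cfg = {"gateway_port": 9000}  # hypothetical value from config.json
cli_cfg = {"gateway_port": 9100}   # hypothetical value from a CLI argument

# Later sources override earlier ones: defaults < config.json < CLI arguments
merged = ServiceConfig(**{**file_cfg, **cli_cfg})
print(merged.gateway_port)  # 9100
```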

Complete Field Reference

model — Model Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| model_path | str | (required) | HuggingFace-format model directory or Hub ID |
| pt_path | str | null | Optional .pt weight override path |
| attn_implementation | str | "auto" | Attention implementation method |

audio — Audio Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| ref_audio_path | str | assets/ref_audio/ref_minicpm_signature.wav | Default TTS reference audio path |
| playback_delay_ms | int | 200 | Frontend audio playback delay (ms); higher values are smoother but add latency |
| chat_vocoder | str | "token2wav" | Chat vocoder backend (see Resource Consumption Reference) |

service — Service Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| gateway_port | int | 8006 | Gateway listening port |
| worker_base_port | int | 22400 | Worker base port (Worker N = base + N) |
| max_queue_size | int | 1000 | Maximum queued requests |
| request_timeout | float | 300.0 | Request timeout (seconds) |
| compile | bool | false | Enable torch.compile acceleration |
| data_dir | str | "data" | Data storage directory |
| eta_chat_s | float | 15.0 | Chat task baseline ETA (seconds) |
| eta_streaming_s | float | 20.0 | Streaming task baseline ETA (seconds) |
| eta_audio_duplex_s | float | 120.0 | Audio Duplex task baseline ETA (seconds) |
| eta_omni_duplex_s | float | 90.0 | Omni Duplex task baseline ETA (seconds) |
| eta_ema_alpha | float | 0.3 | ETA EMA smoothing coefficient |
| eta_ema_min_samples | int | 3 | ETA EMA minimum sample count |
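
The eta_* fields seed queue wait-time estimates, which are then smoothed with an exponential moving average. A sketch of how such an EMA behaves with eta_ema_alpha = 0.3 (the project's exact update rule may differ):

```python
def update_eta(previous_eta: float, observed_s: float, alpha: float = 0.3) -> float:
    # Standard EMA: new = alpha * observed + (1 - alpha) * previous
    return alpha * observed_s + (1.0 - alpha) * previous_eta

eta = 15.0  # eta_chat_s baseline
for observed in (12.0, 18.0, 14.0):  # at least eta_ema_min_samples = 3 samples
    eta = update_eta(eta, observed)
print(f"smoothed chat ETA: {eta:.2f}s")
```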

duplex — Duplex Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| pause_timeout | float | 60.0 | Duplex pause timeout (seconds); the Worker is automatically released after timeout |

Minimal Configuration

{
  "model": {
    "model_path": "openbmb/MiniCPM-o-4_5"
  }
}

Full Configuration Example

{
  "model": {
    "model_path": "openbmb/MiniCPM-o-4_5",
    "pt_path": null,
    "attn_implementation": "auto"
  },
  "audio": {
    "ref_audio_path": "assets/ref_audio/ref_minicpm_signature.wav",
    "playback_delay_ms": 200,
    "chat_vocoder": "token2wav"
  },
  "service": {
    "gateway_port": 8006,
    "worker_base_port": 22400,
    "max_queue_size": 1000,
    "request_timeout": 300.0,
    "compile": false,
    "data_dir": "data",
    "eta_chat_s": 15.0,
    "eta_streaming_s": 20.0,
    "eta_audio_duplex_s": 120.0,
    "eta_omni_duplex_s": 90.0,
    "eta_ema_alpha": 0.3,
    "eta_ema_min_samples": 3
  },
  "duplex": {
    "pause_timeout": 60.0
  }
}

Attention Backend

Controls the attention implementation used for model inference; it is set via the attn_implementation field.

| Value | Behavior | Use Case |
|---|---|---|
| "auto" (default) | Detects flash-attn → flash_attention_2; otherwise → sdpa | Recommended |
| "flash_attention_2" | Forces Flash Attention 2 | When flash-attn is confirmed installed |
| "sdpa" | PyTorch built-in SDPA | When flash-attn cannot be compiled |
| "eager" | Naive attention | Debug only |

Performance Comparison (A100): flash_attention_2 is ~5-15% faster than sdpa; sdpa is several times faster than eager.

Note: The Audio (Whisper) submodule always uses SDPA (incompatible with flash_attention_2). Vision / LLM / TTS follow the configuration.
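
A minimal sketch of what the "auto" resolution amounts to (illustrative; the project's actual detection code may differ):

```python
import importlib.util

def resolve_attn_implementation(requested: str = "auto") -> str:
    # "auto": use flash_attention_2 if the flash-attn package is importable,
    # otherwise fall back to PyTorch's built-in SDPA.
    if requested != "auto":
        return requested
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

print(resolve_attn_implementation())  # "flash_attention_2" or "sdpa"
```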


Starting & Stopping

One-Command Start (start_all.sh)

# Use all available GPUs (the script detects them automatically)
bash start_all.sh

# Specify GPUs
CUDA_VISIBLE_DEVICES=0,1 bash start_all.sh

# Enable torch.compile (experimental)
bash start_all.sh --compile

# Downgrade to HTTP (not recommended; microphone/camera APIs require HTTPS)
bash start_all.sh --http

start_all.sh execution flow:
1. Parses command-line arguments (--http, --compile)
2. Reads port configuration from config.py
3. Detects the number of available GPUs
4. Launches one Worker process per GPU (nohup)
5. Waits for all Workers to pass health checks (see the sketch below)
6. Starts the Gateway process
7. Outputs the access URL and log paths
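
As an illustration of step 5, a poll like the following waits for Workers to come up (a sketch only: the /health path and 200-OK convention are assumptions, not the documented Worker API):

```python
import time

import httpx  # already in the dependency list

def wait_for_workers(ports, timeout_s=120.0):
    # Poll each Worker until it responds with HTTP 200 or the deadline passes.
    deadline = time.time() + timeout_s
    pending = set(ports)
    while pending and time.time() < deadline:
        for port in list(pending):
            try:
                if httpx.get(f"http://localhost:{port}/health", timeout=2.0).status_code == 200:
                    pending.discard(port)
            except httpx.HTTPError:
                pass
        time.sleep(1.0)
    if pending:
        raise TimeoutError(f"Workers still unhealthy on ports: {sorted(pending)}")

wait_for_workers([22400, 22401])  # worker_base_port + N
```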

Manual Start

# Worker (one per GPU)
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. .venv/base/bin/python worker.py \
    --worker-index 0 --gpu-id 0

# Gateway
PYTHONPATH=. .venv/base/bin/python gateway.py \
    --port 8006 --workers localhost:22400

CLI Arguments

Worker Arguments:

python worker.py \
    --model-path /path/to/model \
    --pt-path /path/to/weights.pt \
    --ref-audio-path /path/to/ref.wav \
    --worker-index 0 \
    --gpu-id 0 \
    --compile

Gateway Arguments:

python gateway.py \
    --port 8006 \
    --workers localhost:22400,localhost:22401 \
    --http

Stopping the Service

pkill -f "gateway.py|worker.py"
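
For a more targeted shutdown, the PID files under tmp/ (see Runtime Directory Structure below) can be used instead of pattern matching. A sketch, assuming each .pid file holds a bare process ID:

```python
import os
import signal
from pathlib import Path

# Send SIGTERM to each recorded Gateway/Worker process.
for pid_file in Path("tmp").glob("*.pid"):
    pid = int(pid_file.read_text().strip())
    os.kill(pid, signal.SIGTERM)
    print(f"sent SIGTERM to {pid_file.stem} (pid {pid})")
```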

Model Download

Automatic Download (Default)

When model_path is set to openbmb/MiniCPM-o-4_5, the model is automatically downloaded from HuggingFace on first startup.
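
To pre-fetch the same snapshot programmatically, huggingface_hub's standard API can be used (optional; first startup does this for you):

```python
from huggingface_hub import snapshot_download

# Downloads the model repository, or reuses the local cache if present.
local_dir = snapshot_download("openbmb/MiniCPM-o-4_5")
print("Model available at:", local_dir)
```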

Manual Download

HuggingFace CLI:

pip install -U huggingface_hub
huggingface-cli download openbmb/MiniCPM-o-4_5 --local-dir /path/to/MiniCPM-o-4_5

hf-mirror (China):

export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download openbmb/MiniCPM-o-4_5 --local-dir /path/to/MiniCPM-o-4_5

ModelScope (China):

pip install modelscope
modelscope download --model OpenBMB/MiniCPM-o-4_5 --local_dir /path/to/MiniCPM-o-4_5

Testing

Schema Unit Tests (No GPU Required)

PYTHONPATH=. .venv/base/bin/python -m pytest tests/test_schemas.py -v
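
For orientation, a GPU-free schema test typically has this shape (a hypothetical stand-in model; the real schemas live in the project's code, not here):

```python
import pytest
from pydantic import BaseModel, ValidationError

class ChatRequest(BaseModel):  # stand-in for a real request schema under test
    prompt: str

def test_chat_request_requires_prompt():
    with pytest.raises(ValidationError):
        ChatRequest()  # missing required field -> ValidationError
```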

Processor Tests (GPU Required)

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. .venv/base/bin/python -m pytest \
    tests/test_chat.py tests/test_streaming.py tests/test_duplex.py -v -s

API Integration Tests (Service Must Be Running)

PYTHONPATH=. .venv/base/bin/python -m pytest tests/test_api.py -v -s

Test File Reference

| File | Description |
|---|---|
| test_schemas.py | Schema unit tests |
| test_chat.py | Chat inference tests |
| test_streaming.py | Streaming inference tests |
| test_duplex.py | Duplex inference tests |
| test_api.py | API integration tests |
| test_queue.py | Queue logic tests |
| test_queue_stress.py | Queue stress tests |
| test_integration.py | Integration tests |
| test_e2e.py | End-to-end tests |
| bench_duplex_ws.py | Duplex WebSocket performance benchmark |
| mock_worker.py | Mock Worker (for GPU-free testing) |
| js/queue-scenario.test.js | Frontend queue scenario tests (Vitest) |
| js/countdown-timer.test.js | Countdown component tests (Vitest) |

Runtime Directory Structure

data/
├── sessions/                  # Session recording data
│   ├── omni_abc123/
│   │   ├── meta.json
│   │   ├── recording.json
│   │   ├── user_audio/
│   │   ├── ai_audio/
│   │   └── ...
│   └── ...
└── ref_audio/                 # Uploaded reference audios
    ├── registry.json
    └── *.wav

tmp/
├── gateway.pid                # Gateway process PID
├── gateway.log                # Gateway log
├── worker_0.pid               # Worker 0 PID
├── worker_0.log               # Worker 0 log
└── diag_omni_*.jsonl          # Diagnostic logs
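
The diag_omni_*.jsonl files follow the JSONL convention of one JSON object per line; a minimal reader (illustrative; the event fields themselves are project-specific):

```python
import json
from pathlib import Path

# Print every diagnostic event recorded under tmp/.
for path in sorted(Path("tmp").glob("diag_omni_*.jsonl")):
    with path.open() as f:
        for line in f:
            print(path.name, json.loads(line))
```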