# torch.compile Acceleration

## Overview
On older-generation GPUs such as the A100 and RTX 4090, per-unit computation time in Omni Full-Duplex mode is approximately 0.9 s, close to the 1-second real-time threshold, which causes noticeable stuttering.

torch.compile uses Triton to compile core sub-modules into optimized GPU kernels, cutting per-unit computation time to approximately 0.5 s and meeting the real-time requirement for smooth, stutter-free interaction.
## How to Enable

Set in `config.json`:

```json
{
  "service": {
    "compile": true
  }
}
```
## Pre-compilation (Recommended)

First-time compilation (cold start) takes approximately 15 minutes. To avoid this wait on the first service start, run the pre-compilation script ahead of time:

```bash
CUDA_VISIBLE_DEVICES=0 TORCHINDUCTOR_CACHE_DIR=./torch_compile_cache .venv/base/bin/python precompile.py
```

The generated Triton kernel cache is saved to the `./torch_compile_cache` directory (configured via the `TORCHINDUCTOR_CACHE_DIR` environment variable in `start_all.sh`). The cache persists on disk and is automatically reused on all subsequent starts, with no need to recompile.

After pre-compilation, start the service normally:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 bash start_all.sh
```

Loading from cache takes approximately 5 minutes (compared to ~15 minutes for cold compilation).
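For orientation, here is a minimal sketch of what such a pre-compilation script might look like. The `load_model` import path and the reference-audio path are hypothetical; the `init_unified()` / `apply_torch_compile()` / `warmup_compile()` calls mirror the pipeline shown in the next section:

```python
import os

# Point Inductor at the persistent on-disk cache before importing torch,
# so the generated Triton kernels survive across restarts.
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "./torch_compile_cache")

import torch  # noqa: E402
from worker import load_model  # hypothetical import path

def main() -> None:
    model = load_model()
    model.init_unified()
    model.apply_torch_compile(mode="default", dynamic=True)
    # Run one real duplex session so every compiled sub-module's first
    # forward pass happens now, not on the first user request.
    model.warmup_compile(ref_audio_path="assets/ref.wav")  # hypothetical path
    torch.cuda.empty_cache()

if __name__ == "__main__":
    main()
```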
## Compilation Pipeline

```mermaid
flowchart TB
    subgraph workerStart [Worker Startup]
        Load["load_model()"]
        UP["UnifiedProcessor.__init__()"]
        InitUnified["model.init_unified()"]
        Apply["model.apply_torch_compile()"]
        Warmup["model.warmup_compile()"]
        Ready["Worker IDLE"]
    end
    Load --> UP --> InitUnified --> Apply --> Warmup --> Ready

    subgraph applyDetail [apply_torch_compile]
        VPM["torch.compile(vpm)"]
        LLM["torch.compile(llm.model)"]
        RES["torch.compile(resampler)"]
        TTS["torch.compile(tts.model)"]
        TF32["set_float32_matmul_precision('high')"]
    end
    Apply --> VPM
    Apply --> LLM
    Apply --> RES
    Apply --> TTS
    Apply --> TF32

    subgraph warmupDetail [warmup_compile — Real Duplex Session]
        Extract["Extract MP4 audio + frames"]
        Prepare["duplex.prepare()"]
        Loop["Per-chunk loop:\nprefill → generate → finalize"]
        TTSfb["TTS fallback\n(if not triggered)"]
        Clean["Cleanup + empty_cache()"]
    end
    Warmup --> Extract --> Prepare --> Loop --> TTSfb --> Clean
```
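Read as code, `apply_torch_compile` amounts to four `torch.compile` calls plus the TF32 switch. A sketch based on the diagram (the attribute paths are taken from the node labels above, not verified against the source):

```python
import torch

def apply_torch_compile(model, mode: str = "default", dynamic: bool = True):
    """Sketch of the compile step shown in the diagram above."""
    model.vpm = torch.compile(model.vpm, mode=mode, dynamic=dynamic)
    model.llm.model = torch.compile(model.llm.model, mode=mode, dynamic=dynamic)
    model.resampler = torch.compile(model.resampler, mode=mode, dynamic=dynamic)
    model.tts.model = torch.compile(model.tts.model, mode=mode, dynamic=dynamic)
    # Allow TF32 tensor cores for float32 matmuls (see the TF32 section below).
    torch.set_float32_matmul_precision("high")
    return model
```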
## Compilation Targets

### Compiled Sub-modules

| Sub-module | Original Class | Reason |
|---|---|---|
| `vpm` | `SiglipVisionTransformer` | Vision encoder, compute-intensive Transformer |
| `llm.model` | `Qwen3Model` | Core LLM backbone, primary inference bottleneck |
| `resampler` | `Resampler` | Visual feature resampling, Perceiver architecture |
| `tts.model` | `LlamaModel` | Core TTS backbone, audio token generation |

Only the inner backbone is compiled (e.g., `llm.model`), not the outer wrapper (e.g., `Qwen3ForCausalLM`): the outer layer contains Python control flow (the `generate()` loop), where compilation provides little benefit and easily causes graph breaks.
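A self-contained toy showing the pattern (stand-in modules, not project code): only the inner backbone is compiled, while the wrapper's `generate()` loop stays in eager Python.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):          # stands in for Qwen3Model / LlamaModel
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(*(nn.Linear(64, 64) for _ in range(4)))

    def forward(self, x):
        return self.layers(x)

class Wrapper(nn.Module):           # stands in for Qwen3ForCausalLM
    def __init__(self):
        super().__init__()
        self.model = Backbone()

    @torch.no_grad()
    def generate(self, x, steps=3):
        # Python control flow: deliberately left uncompiled to avoid graph breaks.
        for _ in range(steps):
            x = self.model(x)
        return x

wrapper = Wrapper()
# Compile only the inner backbone; generate() keeps running as plain Python,
# but every forward pass through the backbone uses the compiled kernels.
wrapper.model = torch.compile(wrapper.model, mode="default", dynamic=True)
out = wrapper.generate(torch.randn(2, 64))
```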
### Parts Not Compiled

| Sub-module | Reason |
|---|---|
| `apm` (Whisper audio encoder) | Special streaming behavior + dynamic shapes; low compilation benefit |
| `tts.audio_tokenizer` (Token2Wav/CosyVoice2) | External library, non-standard `nn.Module` |
| `MiniCPMO` outer layer | Heavy Python control flow (chat/streaming/duplex branches); low compilation benefit |
| `lm_head` | Inside the outer wrapper, called within the generate loop |
## Compilation Parameters

| Parameter | Default | Description |
|---|---|---|
| `mode` | `"default"` | Compilation mode |
| `dynamic` | `True` | Enable dynamic shape support |
### Compilation Modes

| Mode | Compilation Time | Runtime Speed | Use Case |
|---|---|---|---|
| `default` | Moderate | Faster | Recommended; balances compilation time and runtime speed |
| `reduce-overhead` | Moderate | Fastest | Uses CUDA Graphs; only suitable for static shapes |
| `max-autotune` | Very long | Fastest | Maximum optimization; compilation may take several minutes |
The project defaults to `mode="default"`, `dynamic=True` because sequence lengths, image sizes, and other dimensions vary during inference; `dynamic=True` avoids recompilation when shapes change.
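A small standalone demonstration of the effect (illustrative, not project code): with `dynamic=True`, one compiled artifact serves varying sequence lengths instead of recompiling per shape.

```python
import torch

@torch.compile(mode="default", dynamic=True)
def score(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for a shape-varying workload such as attention over a growing context.
    return torch.softmax(x @ x.transpose(-1, -2), dim=-1)

# The first call compiles; subsequent calls with different sequence lengths
# reuse the dynamic-shape kernels rather than triggering recompilation.
for seq_len in (128, 257, 512):
    score(torch.randn(1, seq_len, 64))
```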
## TF32 Precision Boost

`torch.set_float32_matmul_precision("high")` enables TF32 matrix multiplication, providing an additional ~5-10% speedup on Ampere and newer GPUs with negligible precision loss.
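It is a single global switch; a minimal illustration:

```python
import torch

# Let float32 matmuls run on TF32 tensor cores (Ampere and newer).
# Tensors stay float32 in memory; only the multiply's internal precision changes.
torch.set_float32_matmul_precision("high")

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b  # now eligible for TF32 tensor cores
```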
## warmup_compile() — Real Duplex Warmup

`torch.compile` only wraps the modules; actual Triton kernel compilation is triggered on the first forward pass. `warmup_compile()` runs a complete Omni Full-Duplex inference session using a real MP4 video (prepare → prefill → generate → finalize), triggering Triton compilation for all compiled sub-modules in their real execution context.
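The lazy-compilation behavior is easy to observe in isolation (illustrative timing harness; absolute numbers vary by GPU and cache state):

```python
import time
import torch

@torch.compile(mode="default", dynamic=True)
def f(x):
    return torch.nn.functional.gelu(x @ x)

x = torch.randn(1024, 1024, device="cuda")

t0 = time.perf_counter()
f(x)
torch.cuda.synchronize()
print(f"first call (triggers Triton compilation): {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
f(x)
torch.cuda.synchronize()
print(f"second call (compiled kernels): {time.perf_counter() - t0:.4f}s")
```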
### Warmup Steps

- Extract media from MP4 — ffmpeg extracts 16 kHz audio in 1 s chunks, one frame per second
- Load reference audio — used for TTS voice cloning
- Start duplex session — `duplex.prepare()` initializes StreamDecoder and prefills the system prompt + reference audio
- Per-chunk inference — each chunk runs `streaming_prefill` → `streaming_generate` → `finalize_unit`, triggering Triton compilation for vpm / resampler / llm / tts
- TTS fallback — if the model stayed in LISTEN throughout and TTS was never triggered, warms up `tts.model` with synthetic data
- Cleanup — resets duplex state, releases token2wav caches, calls `torch.cuda.empty_cache()`
### Warmup Logs

Each unit prints a detailed timing breakdown:

```text
[warmup] unit=0/10 | prefill: vis_proc=120ms vis_emb=8500ms vis_feed=350ms
  aud_proc=45ms aud_emb=1200ms aud_feed=180ms total=10500ms |
  generate: llm=2100ms tts_prep=0ms tts=0ms token2wav=0ms total=2200ms |
  decision=LISTEN | elapsed=125s remaining~875s
```
The first chunk takes significantly longer due to Triton compilation; subsequent chunks use the already-compiled kernels and are much faster.
## Cache Mechanism

PyTorch Inductor has built-in multi-layer caching. Compilation results persist in the `TORCHINDUCTOR_CACHE_DIR` directory (project default: `./torch_compile_cache`):
| Cache Layer | Purpose | Default State |
|---|---|---|
| Inductor Kernel Cache | Caches generated Triton kernel .so files | Enabled by default |
| FX Graph Cache | Caches compiled FX computation graphs | Enabled by default (PyTorch 2.8+) |
| Autotune Cache | Caches kernel autotuning results | Enabled by default |
The cache is invalidated and recompilation is required when:

- the PyTorch version is upgraded
- the CUDA / Triton version changes
- the model code structure changes (number of layers, architecture, etc.)
- the GPU architecture changes (e.g., switching from A100 to H100)
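If a stale or corrupted cache is suspected, deleting the cache directory forces a clean recompilation on the next start. A sketch, assuming the project-default path:

```python
import os
import shutil

cache_dir = os.environ.get("TORCHINDUCTOR_CACHE_DIR", "./torch_compile_cache")

# Removing the directory invalidates all three cache layers at once; the next
# service start pays the full cold-compilation cost (~15 min) and rebuilds it.
shutil.rmtree(cache_dir, ignore_errors=True)
```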
## Call Chain

```text
config.json: "service": { "compile": true }
    ↓
worker.py: WORKER_CONFIG["compile"] = cfg.compile
    ↓
MiniCPMOWorker.__init__(compile=True)
    ↓
UnifiedProcessor.__init__(compile=True)
    ↓
UnifiedProcessor._load_model():
    model.init_unified()
    model.apply_torch_compile(mode="default", dynamic=True)
    model.warmup_compile(ref_audio_path=...)
```
## DuplexCapability Automatically Benefits

DuplexCapability accesses the model's `llm`, `vpm`, `tts`, and other sub-modules by reference and does not hold independent copies. After `apply_torch_compile()`, Duplex inference automatically uses the compiled versions with no additional action required.
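A toy illustration of why this holds (stand-in names, not project code): because the capability resolves sub-modules through the shared model object at call time, swapping in a compiled module is immediately visible.

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.llm = nn.Linear(8, 8)   # stands in for the LLM sub-module

class DuplexCapabilitySketch:
    """Holds a reference to the model, not copies of its sub-modules."""
    def __init__(self, model: Model):
        self.model = model

    def step(self, x):
        # Resolved at call time, so this sees whatever model.llm currently is.
        return self.model.llm(x)

model = Model()
duplex = DuplexCapabilitySketch(model)

# apply_torch_compile-style in-place swap of the sub-module attribute...
model.llm = torch.compile(model.llm)
# ...and the duplex path automatically runs the compiled version.
out = duplex.step(torch.randn(2, 8))
```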
## Performance
| Metric | Without compile | With compile |
|---|---|---|
| Omni Full-Duplex per-unit latency (A100) | ~0.9s | ~0.5s |
| Additional startup time (cold compilation) | — | ~15 min |
| Additional startup time (cached) | — | ~5 min |
| Runtime VRAM usage | ~21.5 GB | ~21.5 GB |
## Known Limitations

- Certain extreme input shapes may still trigger recompilation
- Requires PyTorch 2.x; some older CUDA drivers may be incompatible with Triton
- Compiled code is difficult to step through in a debugger