MiniCPMO45 Model Module Details

MiniCPMO45 is the system's core model module, implementing multimodal large language model inference capabilities with support for text, image, audio, and video input, as well as text and audio output.

Module Structure

MiniCPMO45/
├── configuration_minicpmo.py       # Model configuration definitions
├── modeling_minicpmo.py            # Main model implementation
├── modeling_minicpmo_unified.py    # Unified model (supports hot-switching)
├── modeling_navit_siglip.py        # SigLIP vision encoder
├── processing_minicpmo.py          # Multimodal processor
├── tokenization_minicpmo_fast.py   # Fast tokenizer
├── utils.py                        # Utility functions
├── tokenizer_config.json           # Tokenizer configuration
├── generation_config.json          # Generation configuration
├── preprocessor_config.json        # Preprocessor configuration
├── special_tokens_map.json         # Special token mapping
└── added_tokens.json               # Extended tokens

Multimodal Architecture Overview

graph TB
    subgraph inputLayer [Input Layer]
        TextIn["Text Input"]
        ImageIn["Image Input"]
        AudioIn["Audio Input\n(16kHz)"]
        VideoIn["Video Input\n(auto-extract frames+audio)"]
    end

    subgraph encoderLayer [Encoder Layer]
        Tokenizer["Qwen2 Tokenizer"]
        VPM["SigLIP Vision Encoder\n(ViT)"]
        Resampler["Resampler\n(Perceiver)"]
        APM["Whisper Audio Encoder"]
        AudioProj["Audio Projection\nMultiModalProjector"]
    end

    subgraph fusionLayer [Fusion Layer]
        Embedding["Multimodal Embedding Fusion"]
    end

    subgraph llmLayer [Language Model]
        LLM["Qwen3 LLM Backbone\n(Causal LM)"]
    end

    subgraph outputLayer [Output Layer]
        TextOut["Text Output"]
        TTSBlock["TTS Module"]
        T2W["Token2Wav"]
        AudioOut["Audio Output\n(24kHz)"]
    end

    TextIn --> Tokenizer --> Embedding
    ImageIn --> VPM --> Resampler --> Embedding
    AudioIn --> APM --> AudioProj --> Embedding
    VideoIn -->|"frames"| VPM
    VideoIn -->|"audio segments"| APM
    Embedding --> LLM
    LLM --> TextOut
    LLM --> TTSBlock
    TTSBlock --> T2W --> AudioOut

Core Model Structure

MiniCPMO is the central class of the system (inherits from Qwen3PreTrainedModel), composed of six sub-modules, each responsible for a specific modality or function:

Each sub-module can be independently enabled or disabled via the init_vision / init_audio / init_tts configuration flags.
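The conditional-initialization pattern behind those flags can be sketched as follows. This is a toy stand-in, not the real class: the flag names come from the text above, but the sub-module placeholders are illustrative strings.

```python
# Minimal sketch of init_vision / init_audio / init_tts gating.
# The real model builds the actual encoder/decoder sub-modules here;
# strings stand in for them to keep the sketch self-contained.
class MiniCPMOSketch:
    def __init__(self, init_vision=True, init_audio=True, init_tts=True):
        self.vpm = "SigLIP" if init_vision else None      # vision encoder
        self.apm = "Whisper" if init_audio else None      # audio encoder
        self.tts = "MiniCPMTTS" if init_tts else None     # speech head

# Text + vision + audio input, but text-only output: skip the TTS weights.
model = MiniCPMOSketch(init_tts=False)
```

Disabling unused sub-modules this way avoids loading their weights at all, which reduces memory for deployments that only need a subset of modalities.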

Unified Model and Mode Switching

The MiniCPMO in modeling_minicpmo_unified.py extends the base model with unified mode management, supporting hot-switching between three modes via the ProcessorMode enum:

Switching is done via set_mode(mode), which only resets session state (KV Cache, Token2Wav cache, etc.) without reloading model weights, making mode transitions extremely lightweight.
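The reset-without-reload behavior of set_mode can be sketched like this. The enum name ProcessorMode and the method name come from the text; the specific mode members and cache fields here are illustrative stand-ins.

```python
from enum import Enum

class ProcessorMode(Enum):
    # Member names are illustrative; the text does not enumerate the modes.
    OFFLINE = "offline"
    STREAMING = "streaming"
    DUPLEX = "duplex"

class UnifiedModelSketch:
    def __init__(self):
        self.mode = ProcessorMode.OFFLINE
        self.kv_cache = []          # stand-in for the LLM KV Cache
        self.token2wav_cache = []   # stand-in for the Token2Wav cache

    def set_mode(self, mode):
        # Only session state is reset; weights are untouched, which is
        # what makes mode transitions lightweight.
        self.mode = mode
        self.kv_cache.clear()
        self.token2wav_cache.clear()

u = UnifiedModelSketch()
u.kv_cache.append("prefilled state")
u.set_mode(ProcessorMode.DUPLEX)    # state cleared, no weight reload
```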


Input Encoding

Text Encoding

Text input is tokenized by the Qwen2 Tokenizer, then converted to embedding vectors via llm.model.embed_tokens. An optional scale_emb scaling factor is applied after embedding.
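The lookup-then-scale step can be sketched in a few lines; the table size, embedding dimension, and scale value below are illustrative, with a random matrix standing in for llm.model.embed_tokens.

```python
import numpy as np

vocab, dim, scale_emb = 100, 8, 12.0
embed_tokens = np.random.randn(vocab, dim)   # stand-in for llm.model.embed_tokens
input_ids = np.array([3, 14, 15])            # output of the Qwen2 Tokenizer

# Embedding lookup, then the optional scale_emb factor.
inputs_embeds = embed_tokens[input_ids] * scale_emb
```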

Vision Encoding (Image and Video)

Image processing follows three steps:

  1. Image Slicing (MiniCPMVImageProcessor) — Large images are sliced into multiple patches according to MiniCPMVSliceConfig (up to max_slice_nums=9 patches, each 448x448), while retaining a global thumbnail. This enables the model to handle high-resolution images.
  2. VPM Encoding (SiglipVisionTransformer) — Each patch is encoded by the SigLIP ViT. The ViT consists of SiglipVisionEmbeddings (Conv2d patch embedding + positional encoding) and multi-layer SiglipEncoder (multi-head self-attention + FFN), with Flash Attention 2 support. The output is a variable-length patch feature sequence.
  3. Resampler Compression (Resampler) — Learnable query vectors (64 by default) perform cross-attention over the VPM output, compressing variable-length visual features into a fixed length. Positional information is injected via 2D sincos positional encoding. Output shape: (num_queries, embed_dim).
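The compression step can be sketched as single-head cross-attention in numpy. This omits the learned projections and the 2D sincos positional encoding of the real Resampler; the point is only that a fixed set of queries turns a variable-length patch sequence into a fixed-length output.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def resample(patch_feats, queries):
    # Queries cross-attend over the VPM patch features.
    d = queries.shape[-1]
    attn = softmax(queries @ patch_feats.T / np.sqrt(d))  # (num_queries, num_patches)
    return attn @ patch_feats                             # (num_queries, embed_dim)

num_queries, embed_dim = 64, 32                 # 64 queries, as in the text
queries = np.random.randn(num_queries, embed_dim)

small = resample(np.random.randn(197, embed_dim), queries)  # few patches
large = resample(np.random.randn(401, embed_dim), queries)  # many patches
# Both images yield the same fixed output shape regardless of patch count.
```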

Video is decomposed into a frame sequence + audio segments. Frames follow the vision encoding path; audio segments follow the audio encoding path.

Audio Encoding

Audio processing follows three steps:

  1. Mel Spectrogram Extraction (MiniCPMAAudioProcessor) — 16kHz audio input is converted to 80-dimensional mel spectrogram features.
  2. APM Encoding (MiniCPMWhisperEncoder) — A Whisper-based encoder that first downsamples via two 1D convolutions (Conv1 stride=1 → GELU → Conv2 stride=2), then processes through multi-layer Transformer encoder layers. Supports KV Cache for streaming audio encoding, with optional context overlap via prefix_extra_frames / suffix_extra_frames.
  3. Projection + Pooling — MultiModalProjector (Linear → ReLU → Linear) maps audio features to the LLM embedding dimension, followed by AvgPool1d (stride audio_pool_step=5) to further compress the sequence length.
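The sequence-length bookkeeping of this pipeline can be sketched numerically. The frame rate follows Whisper's standard 10 ms mel hop; the exact convolution output length may differ by a frame or two from this integer-division approximation due to padding.

```python
# Approximate length bookkeeping for the audio path (values illustrative).
mel_frames = 1000                     # ~10 s of 16 kHz audio at 100 mel frames/s
after_conv = mel_frames // 2          # Conv2 stride=2 roughly halves the length
pool_step = 5                         # audio_pool_step
audio_tokens = after_conv // pool_step

# 10 s of audio thus becomes on the order of 100 embedding positions.
```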

Multimodal Embedding Fusion

After each modality is encoded, they are merged into a unified embedding sequence through two steps:

  1. Vision Fusion (get_vllm_embedding) — Vision placeholder tokens are reserved in the text sequence. Using image_bound (which records the start and end positions of each image placeholder), the corresponding text embeddings are replaced with the Resampler's visual embeddings via a scatter operation.
  2. Audio Fusion (get_omni_embedding) — On the vision-fused sequence, audio_bounds (which records the start and end positions of audio placeholders) is used to replace the corresponding embeddings with the audio encoder's output embeddings.

The result is a unified inputs_embeds (containing text + vision + audio) that is fed into the LLM for causal reasoning.
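The placeholder-replacement mechanic shared by both fusion steps can be sketched with toy shapes: a bound records the (start, end) span of each placeholder run, and the text embeddings in that span are overwritten with the modality embeddings.

```python
import numpy as np

seq_len, dim, num_queries = 20, 8, 4
inputs_embeds = np.zeros((seq_len, dim))      # text embeddings (toy values)
visual_embeds = np.ones((num_queries, dim))   # stand-in for Resampler output
image_bound = [(5, 9)]                        # placeholder span, end exclusive

# Scatter-style replacement: the placeholder positions take the visual
# embeddings; the same pattern applies to audio_bounds with audio features.
for (start, end), vis in zip(image_bound, [visual_embeds]):
    inputs_embeds[start:end] = vis
```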


Language Model Inference

The fused inputs_embeds is fed into Qwen3ForCausalLM for autoregressive generation. The LLM is modality-agnostic — all modalities share the same embedding space after fusion.

Text generation supports two modes:


Output Generation

Text Output

The LLM directly outputs a token sequence, which is decoded into text by the Tokenizer.

Text-to-Speech (TTS)

When speech output is needed, the LLM's hidden states are converted to audio tokens by the MiniCPMTTS module, then synthesized into waveforms by a vocoder.

MiniCPMTTS Architecture:

The input layout is [Text BOS | Speaker Embedding | Text Tokens | Audio BOS | Audio Tokens...], and the model autoregressively predicts the audio token sequence.
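Assembling that layout can be sketched with placeholder tokens; the values below are purely illustrative labels, not real vocabulary ids or embeddings.

```python
# Toy construction of the TTS input layout described above.
text_bos = ["<text_bos>"]
speaker_embedding = ["<spk>"]            # slot for the speaker/voice embedding
text_tokens = ["hel", "lo"]              # text to be spoken
audio_bos = ["<audio_bos>"]
audio_tokens = ["a0", "a1", "a2"]        # predicted autoregressively after audio BOS

tts_input = text_bos + speaker_embedding + text_tokens + audio_bos + audio_tokens
```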

Four attention modes (configured via attention_type):

Token2Wav Vocoder — Converts the TTS audio tokens into 24kHz waveforms. Supports both streaming (chunk-by-chunk conversion) and non-streaming (batch conversion) modes.
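The streaming-versus-batch distinction can be sketched as follows. The token2wav function here is a stateless stand-in (so the two paths produce identical output); the real vocoder keeps cross-chunk context, which is why it maintains a cache in streaming mode.

```python
def token2wav(tokens):
    # Stand-in for the vocoder call; returns pretend waveform samples.
    return [f"wav({t})" for t in tokens]

audio_tokens = list(range(10))
chunk_size = 4

# Streaming mode: convert chunk by chunk as tokens arrive.
streamed = []
for i in range(0, len(audio_tokens), chunk_size):
    streamed += token2wav(audio_tokens[i:i + chunk_size])

# Non-streaming mode: one batch conversion over all tokens.
batch = token2wav(audio_tokens)
```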


Duplex Capability (DuplexCapability)

DuplexCapability is composed into the main model rather than inherited from it: it references the main MiniCPMO model's parameters via self.model and is accessed as model.duplex. It implements real-time listen-speak interaction.

Three-Step Workflow

  1. prepare — Initializes the duplex session. Prefills the system prompt into the KV Cache, loads TTS reference audio (for voice cloning), and registers special tokens (<|listen|>, <|speak|>, <|tts_bos|>, <|tts_eos|>, etc.).
  2. streaming_prefill — Chunk-by-chunk prefill. At each time step, audio features and/or video frames are encoded and fed into the KV Cache, keeping the model continuously "listening" to input.
  3. streaming_generate — Step-by-step generation. At each step, the model decides whether to continue "listening" (output listen token) or start "speaking" (output speak token followed by text and audio tokens). Generated audio tokens are converted to waveforms in real-time via Token2Wav.
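The control flow of the three steps can be sketched as a toy session object. The method names follow the text; the internals (and the trivial listen/speak decision rule) are stand-ins for the real KV Cache prefill and token-level generation.

```python
class DuplexSessionSketch:
    def prepare(self):
        # Step 1: prefill the system prompt into the (stand-in) KV Cache.
        self.kv = ["<system prompt>"]

    def streaming_prefill(self, chunk):
        # Step 2: feed each audio/video chunk in, keeping the model "listening".
        self.kv.append(chunk)

    def streaming_generate(self):
        # Step 3: decide to keep listening or start speaking.
        # Toy rule: speak once more than two entries have accumulated.
        return "<|speak|>" if len(self.kv) > 2 else "<|listen|>"

s = DuplexSessionSketch()
s.prepare()
decisions = []
for chunk in ["audio_0", "audio_1"]:
    s.streaming_prefill(chunk)
    decisions.append(s.streaming_generate())
```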

Sliding Window Strategies

During long duplex conversations, the KV Cache grows continuously. Sliding window strategies control memory usage:
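One common shape for such a strategy can be sketched on a list standing in for cache entries: keep the prefilled system-prompt prefix intact and drop the oldest conversation entries once a window budget is exceeded. The sizes and the exact trim rule here are illustrative, not the module's actual policy.

```python
def trim_kv(kv, prefix_len, window):
    # Keep the system-prompt prefix plus the most recent `window` entries.
    if len(kv) <= prefix_len + window:
        return kv
    return kv[:prefix_len] + kv[-window:]

kv = ["sys0", "sys1"] + [f"turn{i}" for i in range(10)]
trimmed = trim_kv(kv, prefix_len=2, window=4)
```

Keeping the prefix fixed matters because the system prompt anchors the model's behavior; only the conversational tail is eligible for eviction.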


Configuration Reference

MiniCPMOConfig

Inherits from Qwen3Config and contains four sub-configurations:

Key parameters:

MiniCPMTTSConfig