########################################
How to Support a New LLM: Step-by-Step
########################################

This guide walks you through adding a brand-new Large Language Model (LLM) to the **mllm** inference framework. We use **Qwen3** as a running example, but the same workflow applies to any transformer-style model.

*********************************************
Step 1: Acquire the Model Assets
*********************************************

1. Download the original model from Hugging Face (or any other reputable source). Typical files you need:

   * ``config.json``
   * ``tokenizer.json`` / ``tokenizer.model``
   * PyTorch / Safetensors checkpoints (``.bin``, ``.safetensors``)

2. Place everything under a single directory, e.g. ``~/models/Qwen3-0.6B``.

.. note::
   Models obtained from hosting platforms such as Hugging Face or ModelScope (via ``git clone`` or their official CLI) are already organized in a single directory that contains ``config.json``, ``tokenizer.json``, ``tokenizer.model``, checkpoint shards, etc.

   You can download Qwen3-0.6B from ModelScope with the following command:

   .. code-block:: bash

      git clone https://www.modelscope.cn/Qwen/Qwen3-0.6B.git

.. note::
   **About model versions:** most mllm models on Hugging Face use the v1 file format, which stores no tensor shapes and supports fewer data types. You can specify the model version when loading a model; mllm maintains compatibility with both the v1 and v2 formats. Use the v2 format for new models whenever possible.

*********************************************
Step 2: Convert to mllm Format
*********************************************

mllm ships a dedicated converter called ``mllm-convertor``. It translates Hugging Face / PyTorch checkpoints into the internal ``*.mllm`` format.

Install pymllm (Python bindings)
================================

.. code-block:: bash

   bash ./scripts/install_pymllm.sh

.. note::
   Once the ``mllm`` organisation is approved on PyPI you will be able to run:

   .. code-block:: bash

      pip install pymllm

Run the converter
=================

.. code-block:: bash

   mllm-convertor \
     --input_path ./Qwen3-0.6B/model.safetensors \
     --output_path ./Qwen3-0.6B/w4a32.mllm \
     --cfg_path ./Qwen3-0.6B/quant_config.json \
     --pipeline w4a32_kai_pipeline

For sharded checkpoints the converter automatically follows the ``*.index.json`` file—no manual merging required.

Custom quantization recipe (optional)
=====================================

Supply a JSON map if you need different quantization for specific layers. Each key is a regular expression matched against tensor names, and the ``hints`` object describes how matching tensors are quantized. Example snippet with a KAI entry:

.. code-block:: json

   {
     "^model\\.layers\\.\\d+\\.self_attn\\.q_proj.(bias|weight)": {
       "hints": {
         "quant_method": "kai",
         "kai_matmul_triplet": "f32_qai8dxp_qsi4c32p",
         "kai_matmul_layout": "mxk_nxk",
         "kai_matmul_tile_cfg": "qai8dxp1x8_qsi4c32p8x8_1x8x32",
         "shape": [2048, 1024],
         "replace": true
       }
     }
   }

The same key could instead select GGUF quantization:

.. code-block:: json

   {
     "^model\\.layers\\.\\d+\\.self_attn\\.q_proj.(bias|weight)": {
       "hints": {
         "quant_method": "gguf",
         "gguf_type": "Q4_0",
         "shape": [2048, 1024],
         "replace": true
       }
     }
   }

Save the file (e.g. ``quant_config.json``) and pass it with ``--cfg_path quant_config.json``. A quick way to test which tensors a key matches is sketched below; see the full ``quant_config.json`` in the Appendix.
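
Because the keys are ordinary regular expressions, you can sanity-check a recipe before running the converter. The following standalone sketch uses ``std::regex`` (the converter does the matching internally; this snippet is only illustrative):

.. code-block:: cpp

   #include <iostream>
   #include <regex>
   #include <string>
   #include <vector>

   int main() {
     // The JSON string "^model\\.layers\\.\\d+\\.self_attn\\.q_proj.(bias|weight)"
     // unescapes to the regex below. Note the unescaped '.' before "(bias|weight)".
     const std::regex key(R"(^model\.layers\.\d+\.self_attn\.q_proj.(bias|weight))");

     const std::vector<std::string> names = {
         "model.layers.0.self_attn.q_proj.weight",   // matches
         "model.layers.27.self_attn.q_proj.bias",    // matches
         "model.layers.0.self_attn.k_proj.weight",   // does not match
     };
     for (const auto& n : names) {
       std::cout << n << " -> " << (std::regex_search(n, key) ? "match" : "no match") << "\n";
     }
     return 0;
   }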
***************************************************************************
Step 3: (Optional) On-Device Quantization with mllm-quantizer
***************************************************************************

``mllm-convertor`` already embeds the quantization engine, but **mllm-quantizer** is still useful when:

* you target an Android handset and want to quantize **on the phone**, or
* the quantization kernel is only compiled for ARM.

.. code-block:: bash

   mllm-quantizer \
     -i ./Qwen3-0.6B/model.mllm \
     -c ./Qwen3-0.6B/quant_config.json \
     -iv v2 \
     -o ./Qwen3-0.6B/w4a32.mllm \
     -ov v2

.. note::
   In short: if you have no ARM development machine (e.g. a Mac with Apple silicon or an ARM PC) on which to quantize the model through pymllm with the KAI settings, use ``mllm-quantizer`` to quantize the model directly on an ARM device such as an Android phone.

Supported Quantization Types in mllm-quantizer
==============================================

The ``mllm-quantizer`` tool supports several quantization types, letting you trade off model size and inference speed for different hardware targets. The main types are:

**GGUF quantization types (CPU, cross-platform):**

- ``Q4_0``, ``Q8_0``, ``Q2_K``, ``Q3_K``, ``Q4_K``, ``Q6_K``, ``Q8_K``: GGUF per-group quantization.

**KAI quantization types (ARM, Apple Silicon):**

- ``KAI_fp16_fp16_fp16p_mxk_kxn``: FP16 result, FP16 activation, FP16 weight, packed in KAI format.
- ``KAI_f32_qai8dxp_qsi4c32p_mxk_nxk``: FP32 result, Int8 activation (asymmetric, per-token), Int4 weight (symmetric, per-group of 32, transposed layout).
- ``KAI_f16_qsi8d32p_qai4c32p_mxk_nxk``: FP16 result, Int8 activation (symmetric, per-group of 32), Int4 weight (asymmetric, per-group of 32, transposed layout).

**How to select a quantization type:**

- For general CPU deployment, use the GGUF types (``Q4_0``, ``Q8_0``, etc.).
- For ARM devices (Android, Apple Silicon), use the KAI types for the best performance and compatibility.
- Specify the quantization type in your config or pipeline when running ``mllm-quantizer``.

For more details on each quantization type and its configuration, refer to the quantization implementation in the source code or the KleidiAI documentation.
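
To build intuition for what per-group quantization does, here is a minimal self-contained sketch of symmetric int4 quantization with group size 32. It is illustrative only: the identifiers (``QuantGroup``, ``quantizeGroup``) are not mllm API, and the real ``Q4_0`` / ``qsi4c32`` kernels use their own packing and scale encodings.

.. code-block:: cpp

   #include <algorithm>
   #include <cmath>
   #include <cstdint>

   // Symmetric per-group int4 quantization, group size 32 (illustrative sketch).
   struct QuantGroup {
     float scale;        // dequantize: w ~= q * scale
     std::int8_t q[32];  // nominal 4-bit values in [-8, 7], kept unpacked here
   };

   inline QuantGroup quantizeGroup(const float* w) {
     // One scale per group, chosen from the largest magnitude in the group.
     float max_abs = 0.0f;
     for (int i = 0; i < 32; ++i) max_abs = std::max(max_abs, std::fabs(w[i]));

     QuantGroup g{};
     g.scale = max_abs / 7.0f;
     const float inv = (g.scale != 0.0f) ? 1.0f / g.scale : 0.0f;
     for (int i = 0; i < 32; ++i) {
       const int q = static_cast<int>(std::lround(w[i] * inv));
       g.q[i] = static_cast<std::int8_t>(std::clamp(q, -8, 7));
     }
     return g;
   }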
*********************************************
Step 4: Implement Core C++ Files
*********************************************

mllm mirrors the Hugging Face *config / tokenizer / model* split. Create three files under ``mllm/models/qwen3/``:

1. ``configuration_qwen3.hpp``
2. ``tokenization_qwen3.hpp``
3. ``modeling_qwen3.hpp``

All three follow the module pattern sketched below; see the full ``configuration_qwen3.hpp``, ``tokenization_qwen3.hpp``, and ``modeling_qwen3.hpp`` in the Appendix.
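
The sketch is a trimmed copy of ``Qwen3MLP`` from the Appendix listing, with explanatory comments added (no new API): submodules are registered by name with ``reg<T>()`` in the constructor, and ``forward`` maps a vector of input tensors to a vector of outputs.

.. code-block:: cpp

   class Qwen3MLP final : public nn::Module {
     nn::Linear gate_proj_, up_proj_, down_proj_;
     nn::SiLU silu_;

    public:
     Qwen3MLP(const std::string& name, const Qwen3Config& cfg) : nn::Module(name) {
       // reg<T>() creates a child module under "<name>.gate_proj" etc.; the
       // parameter names in the *.mllm file are resolved against these names.
       gate_proj_ = reg<nn::Linear>("gate_proj", cfg.hidden_size, cfg.intermediate_size,
                                    /*bias=*/false, cfg.linear_impl_type);
       silu_      = reg<nn::SiLU>("act");
       up_proj_   = reg<nn::Linear>("up_proj", cfg.hidden_size, cfg.intermediate_size,
                                    /*bias=*/false, cfg.linear_impl_type);
       down_proj_ = reg<nn::Linear>("down_proj", cfg.intermediate_size, cfg.hidden_size,
                                    /*bias=*/false, cfg.linear_impl_type);
     }

     std::vector<Tensor> forward(const std::vector<Tensor>& inputs,
                                 const std::vector<AnyValue>& args) override {
       // SwiGLU: down_proj(SiLU(gate_proj(x)) * up_proj(x))
       auto x = gate_proj_(inputs[0]);
       x = silu_(x);
       x = x * up_proj_(inputs[0]);
       return {down_proj_(x)};
     }
   };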
*********************************************
Step 5: Create an Example Application
*********************************************

Directory layout
================

.. code-block:: text

   examples/qwen3/
   ├── main.cpp
   └── config_0.6B_w4a32_kai.json

main.cpp
========

.. code-block:: cpp

   #include "mllm/mllm.hpp"
   #include "mllm/models/qwen3/modeling_qwen3.hpp"
   #include "mllm/models/qwen3/tokenization_qwen3.hpp"
   #include "mllm/models/qwen3/configuration_qwen3.hpp"

   int main(int argc, char* argv[]) {
     mllm::init();

     std::string config_path    = "...";
     std::string tokenizer_path = "...";
     std::string prompt         = "Once upon a time";

     auto cfg       = mllm::models::qwen3::Qwen3Config(config_path);
     auto tokenizer = mllm::models::qwen3::Qwen3Tokenizer(tokenizer_path);
     auto model     = mllm::models::qwen3::Qwen3ForCausalLM(cfg);

     auto inputs = tokenizer.convertMessage({.prompt = prompt});
     for (auto& step : model.chat(inputs)) {
       std::wcout << tokenizer.detokenize(step.cur_token_id) << std::flush;
     }

     return 0;
   }
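
The ``"..."`` paths above are placeholders for you to fill in. A hypothetical variant (plain C++, nothing mllm-specific; assumes ``<iostream>`` is available) reads them from the command line instead:

.. code-block:: cpp

   // Drop-in replacement for the hard-coded placeholders in main() above.
   if (argc < 3) {
     std::cerr << "usage: " << argv[0] << " <config.json> <tokenizer.json> [prompt]\n";
     return 1;
   }
   std::string config_path    = argv[1];
   std::string tokenizer_path = argv[2];
   std::string prompt         = argc > 3 ? argv[3] : "Once upon a time";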
config.json (example)
=====================

.. code-block:: json

   {
     "architectures": ["Qwen3ForCausalLM"],
     "bos_token_id": 151643,
     "eos_token_id": 151645,
     "attention_bias": false,
     "hidden_size": 1024,
     "head_dim": 128,
     "intermediate_size": 3072,
     "num_attention_heads": 16,
     "num_key_value_heads": 8,
     "num_hidden_layers": 28,
     "max_position_embeddings": 40960,
     "rms_norm_eps": 1e-06,
     "vocab_size": 151936,
     "max_cache_length": 2048,
     "rope_theta": 1000000.0,
     "tie_word_embeddings": true,
     "linear_impl_type": "KaiLinear_f32_qai8dxp_qsi4c32p_mxk_nxk_qai8dxp1x8_qsi4c32p8x8_1x8x32"
   }
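
A quick arithmetic check on these numbers also explains the ``shape`` hints in ``quant_config.json``:

.. code-block:: cpp

   // Projection sizes implied by the example config above.
   constexpr int hidden_size = 1024;
   constexpr int head_dim = 128;
   constexpr int num_attention_heads = 16;
   constexpr int num_key_value_heads = 8;

   // q_proj: hidden_size -> num_attention_heads * head_dim, i.e. 1024 -> 2048,
   // matching the [2048, 1024] "shape" hint for q_proj in quant_config.json.
   static_assert(num_attention_heads * head_dim == 2048, "q/o projection width");

   // k_proj / v_proj: hidden_size -> num_key_value_heads * head_dim, i.e.
   // 1024 -> 1024, matching the [1024, 1024] shape hints (grouped-query
   // attention: 16 query heads share 8 KV heads).
   static_assert(num_key_value_heads * head_dim == 1024, "k/v projection width");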
Build & run
===========

.. code-block:: bash

   python task.py tasks/build_<...>.py

******************************
Step 6: Open a Pull Request
******************************

1. Fork the official repository.
2. Create a feature branch: ``git checkout -b add-qwen3``.
3. Commit atomic changes with clear messages:

   .. code-block:: text

      [Qwen3] Add configuration loader
      [Qwen3] Implement Sentence-Piece tokenizer
      [Qwen3] Add CI unit test and 2-sample generation

4. Push and open a PR. A maintainer will review numerical correctness, coding style, and CI status before merging.

***************
Conclusion
***************

You have now walked through the complete life-cycle of integrating a new LLM into mllm:

1. Acquire the model assets → 2. Convert to the mllm format → 3. (Optionally) Quantize → 4. Implement the C++ files → 5. Example & test → 6. PR.

Following the checklist above helps ensure that your model loads efficiently, runs everywhere mllm runs, and remains maintainable by the community. Happy hacking!
***************
Appendix
***************

quant_config.json
=================

Note the ``lm_head.weight`` entry: it keeps the original tensor (``"replace": false``) and emits the quantized copy under the name ``lm_head_out.weight``, which is the name ``Qwen3ForCausalLM`` registers for its output projection when ``tie_word_embeddings`` is set.

.. code-block:: json

   {
     "^model\\.layers\\.\\d+\\.self_attn\\.q_proj.(bias|weight)": {
       "hints": {
         "quant_method": "kai",
         "kai_matmul_triplet": "f32_qai8dxp_qsi4c32p",
         "kai_matmul_layout": "mxk_nxk",
         "kai_matmul_tile_cfg": "qai8dxp1x8_qsi4c32p8x8_1x8x32",
         "shape": [2048, 1024],
         "replace": true
       }
     },
     "^model\\.layers\\.\\d+\\.self_attn\\.k_proj.(bias|weight)": {
       "hints": {
         "quant_method": "kai",
         "kai_matmul_triplet": "f32_qai8dxp_qsi4c32p",
         "kai_matmul_layout": "mxk_nxk",
         "kai_matmul_tile_cfg": "qai8dxp1x8_qsi4c32p8x8_1x8x32",
         "shape": [1024, 1024],
         "replace": true
       }
     },
     "^model\\.layers\\.\\d+\\.self_attn\\.v_proj.(bias|weight)": {
       "hints": {
         "quant_method": "kai",
         "kai_matmul_triplet": "f32_qai8dxp_qsi4c32p",
         "kai_matmul_layout": "mxk_nxk",
         "kai_matmul_tile_cfg": "qai8dxp1x8_qsi4c32p8x8_1x8x32",
         "shape": [1024, 1024],
         "replace": true
       }
     },
     "^model\\.layers\\.\\d+\\.self_attn\\.o_proj.(bias|weight)": {
       "hints": {
         "quant_method": "kai",
         "kai_matmul_triplet": "f32_qai8dxp_qsi4c32p",
         "kai_matmul_layout": "mxk_nxk",
         "kai_matmul_tile_cfg": "qai8dxp1x8_qsi4c32p8x8_1x8x32",
         "shape": [1024, 2048],
         "replace": true
       }
     },
     "^model\\.layers\\.\\d+\\.mlp\\.gate_proj.(bias|weight)": {
       "hints": {
         "quant_method": "kai",
         "kai_matmul_triplet": "f32_qai8dxp_qsi4c32p",
         "kai_matmul_layout": "mxk_nxk",
         "kai_matmul_tile_cfg": "qai8dxp1x8_qsi4c32p8x8_1x8x32",
         "shape": [3072, 1024],
         "replace": true
       }
     },
     "^model\\.layers\\.\\d+\\.mlp\\.up_proj.(bias|weight)": {
       "hints": {
         "quant_method": "kai",
         "kai_matmul_triplet": "f32_qai8dxp_qsi4c32p",
         "kai_matmul_layout": "mxk_nxk",
         "kai_matmul_tile_cfg": "qai8dxp1x8_qsi4c32p8x8_1x8x32",
         "shape": [3072, 1024],
         "replace": true
       }
     },
     "^model\\.layers\\.\\d+\\.mlp\\.down_proj.(bias|weight)": {
       "hints": {
         "quant_method": "kai",
         "kai_matmul_triplet": "f32_qai8dxp_qsi4c32p",
         "kai_matmul_layout": "mxk_nxk",
         "kai_matmul_tile_cfg": "qai8dxp1x8_qsi4c32p8x8_1x8x32",
         "shape": [1024, 3072],
         "replace": true
       }
     },
     "lm_head.weight": {
       "hints": {
         "quant_method": "kai",
         "kai_matmul_triplet": "f32_qai8dxp_qsi4c32p",
         "kai_matmul_layout": "mxk_nxk",
         "kai_matmul_tile_cfg": "qai8dxp1x8_qsi4c32p8x8_1x8x32",
         "shape": [151936, 1024],
         "replace": false,
         "rename": "lm_head_out.weight"
       }
     }
   }
configuration_qwen3.hpp
=======================

.. code-block:: cpp

   // Copyright (c) MLLM Team.
   // Licensed under the MIT License.
   #pragma once

   #include "mllm/core/aops/LinearOp.hpp"
   #include "mllm/engine/ConfigFile.hpp"

   namespace mllm::models::qwen3 {

   struct Qwen3Config : protected ConfigFile {
     Qwen3Config() = default;

     explicit Qwen3Config(const std::string& file_path) : ConfigFile(file_path) {
       // Init all
       attention_bias          = data()["attention_bias"];
       hidden_size             = data()["hidden_size"];
       intermediate_size       = data()["intermediate_size"];
       num_attention_heads     = data()["num_attention_heads"];
       num_key_value_heads     = data()["num_key_value_heads"];
       num_hidden_layers       = data()["num_hidden_layers"];
       max_position_embeddings = data()["max_position_embeddings"];
       rms_norm_eps            = data()["rms_norm_eps"];
       vocab_size              = data()["vocab_size"];
       head_dim                = data()["head_dim"];
       bos_token_id            = data()["bos_token_id"];
       eos_token_id            = data()["eos_token_id"];
       rope_theta              = data()["rope_theta"];
       tie_word_embeddings     = data()["tie_word_embeddings"];
       max_cache_length        = data()["max_cache_length"];
       linear_impl_type        = aops::str2LinearImplTypes(data()["linear_impl_type"]);
     }

     bool attention_bias = false;
     int32_t hidden_size = 1024;
     int32_t head_dim = 128;
     int32_t intermediate_size = 3072;
     int32_t num_attention_heads = 16;
     int32_t num_key_value_heads = 8;
     int32_t num_hidden_layers = 28;
     int32_t max_position_embeddings = 40960;
     float rms_norm_eps = 1e-06;
     int32_t vocab_size = 151936;
     int64_t bos_token_id = 151643;
     int64_t eos_token_id = 151645;
     float rope_theta = 1000000.0;
     bool tie_word_embeddings = true;
     int32_t max_cache_length = 2048;
     int32_t end_of_text_token_id = 151645;
     aops::LinearImplTypes linear_impl_type = aops::LinearImplTypes::kDefault;
   };

   }  // namespace mllm::models::qwen3

modeling_qwen3.hpp
==================
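
For reference, the RoPE helpers at the top of this listing (``makeRoPEInvFreq`` and ``makeRotaryPosEmbedding``) compute the standard rotary-embedding quantities: with head dimension :math:`d` (``head_dim``) and base :math:`\theta` (``rope_theta``),

.. math::

   \mathrm{inv\_freq}_i = \theta^{-2i/d}, \qquad i = 0, \ldots, d/2 - 1,

and a token at position :math:`p` receives the embeddings :math:`\sin(p \cdot \mathrm{inv\_freq}_i)` and :math:`\cos(p \cdot \mathrm{inv\_freq}_i)`, each value written to both halves of the head dimension (``emb = [freqs, freqs]``).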

.. code-block:: cpp

   // Copyright (c) MLLM Team.
   // Licensed under the MIT License.
   #pragma once

   #include "mllm/mllm.hpp"
   #include "mllm/nn/Module.hpp"
   #include "mllm/nn/Nn.hpp"
   #include "mllm/nn/Functional.hpp"
   #include "mllm/nn/lmcache/StaticCache.hpp"
   #include "mllm/models/qwen3/configuration_qwen3.hpp"
   #include "mllm/utils/Enumerate.hpp"
   #include "mllm/models/ARGeneration.hpp"

   namespace mllm::models::qwen3 {

   inline auto makeRoPEInvFreq(int output_dim, float rope_theta) -> Tensor {
     auto inv_freq = Tensor::empty({output_dim / 2}, kFloat32, kCPU).alloc();
     auto inv_freq_ptr = inv_freq.ptr<float>();
     for (int i = 0; i < output_dim / 2; i++) {
       inv_freq_ptr[i] = 1.0 / std::pow(rope_theta, 2.0 * i / output_dim);
     }
     return inv_freq;
   }

   inline auto makeRotaryPosEmbedding(Tensor& position_ids, const Tensor& inv_freq,
                                      float attention_scaling = 1.0f)
       -> std::pair<Tensor, Tensor> {
     auto batch_size = position_ids.shape()[0];
     auto seq_len = position_ids.shape()[1];
     auto inv_freq_len = inv_freq.shape()[0];
     auto dim = inv_freq_len * 2;

     // Create freqs tensor: position_ids @ inv_freq
     auto freqs = Tensor::empty({batch_size, seq_len, inv_freq_len}, kFloat32, kCPU).alloc();
     auto freqs_ptr = freqs.ptr<float>();
     auto position_ids_ptr = position_ids.ptr<int64_t>();
     auto inv_freq_ptr = inv_freq.ptr<float>();

     // Compute freqs = position_ids[:, :, None] @ inv_freq[None, :]
     for (int b = 0; b < batch_size; ++b) {
       for (int s = 0; s < seq_len; ++s) {
         auto pos = position_ids_ptr[b * seq_len + s];
         for (int d = 0; d < inv_freq_len; ++d) {
           freqs_ptr[b * seq_len * inv_freq_len + s * inv_freq_len + d] =
               static_cast<float>(pos) * inv_freq_ptr[d];
         }
       }
     }

     // Create sin and cos tensors with shape [batch_size, seq_len, dim]
     auto sin_emb = Tensor::empty({batch_size, seq_len, dim}, kFloat32, kCPU).alloc();
     auto cos_emb = Tensor::empty({batch_size, seq_len, dim}, kFloat32, kCPU).alloc();
     auto sin_ptr = sin_emb.ptr<float>();
     auto cos_ptr = cos_emb.ptr<float>();

     // Compute sin and cos embeddings: emb = [freqs, freqs]
     for (int b = 0; b < batch_size; ++b) {
       for (int s = 0; s < seq_len; ++s) {
         for (int d = 0; d < inv_freq_len; ++d) {
           auto freq = freqs_ptr[b * seq_len * inv_freq_len + s * inv_freq_len + d];
           auto sin_val = std::sin(freq) * attention_scaling;
           auto cos_val = std::cos(freq) * attention_scaling;

           // Store the same values in both halves: [freqs, freqs]
           sin_ptr[b * seq_len * dim + s * dim + d] = sin_val;
           sin_ptr[b * seq_len * dim + s * dim + d + inv_freq_len] = sin_val;
           cos_ptr[b * seq_len * dim + s * dim + d] = cos_val;
           cos_ptr[b * seq_len * dim + s * dim + d + inv_freq_len] = cos_val;
         }
       }
     }

     return {sin_emb, cos_emb};
   }

   class Qwen3MLP final : public nn::Module {
     nn::Linear gate_proj_;
     nn::Linear up_proj_;
     nn::Linear down_proj_;
     nn::SiLU silu_;

    public:
     Qwen3MLP() = default;

     Qwen3MLP(const std::string& name, const Qwen3Config& cfg) : nn::Module(name) {
       gate_proj_ = reg<nn::Linear>("gate_proj", cfg.hidden_size, cfg.intermediate_size, false,
                                    cfg.linear_impl_type);
       silu_ = reg<nn::SiLU>("act");
       up_proj_ = reg<nn::Linear>("up_proj", cfg.hidden_size, cfg.intermediate_size, false,
                                  cfg.linear_impl_type);
       down_proj_ = reg<nn::Linear>("down_proj", cfg.intermediate_size, cfg.hidden_size, false,
                                    cfg.linear_impl_type);
     }

     std::vector<Tensor> forward(const std::vector<Tensor>& inputs,
                                 const std::vector<AnyValue>& args) override {
       auto x = gate_proj_(inputs[0]);
       x = silu_(x);
       auto y = up_proj_(inputs[0]);
       x = x * y;
       x = down_proj_(x);
       return {x};
     }
   };

   class Qwen3Attention final : public nn::Module {
     nn::Linear q_proj_;
     nn::Linear k_proj_;
     nn::Linear v_proj_;
     nn::Linear o_proj_;
     nn::RMSNorm rms_norm_q_;
     nn::RMSNorm rms_norm_k_;
     nn::RoPE q_rope_;
     nn::RoPE k_rope_;
     nn::CausalMask mask_;
     nn::Softmax softmax_;
     int hidden_size_;
     int head_dim_;
     int num_attention_heads_;
     int num_key_value_heads_;
     int num_key_value_groups_;

    public:
     Qwen3Attention() = default;

     Qwen3Attention(const std::string& name, const Qwen3Config& cfg) : nn::Module(name) {
       hidden_size_ = cfg.hidden_size;
       num_attention_heads_ = cfg.num_attention_heads;
       num_key_value_heads_ = cfg.num_key_value_heads;
       head_dim_ = cfg.head_dim;
       num_key_value_groups_ = num_attention_heads_ / num_key_value_heads_;

       q_proj_ = reg<nn::Linear>("q_proj", hidden_size_, head_dim_ * num_attention_heads_,
                                 cfg.attention_bias, cfg.linear_impl_type);
       k_proj_ = reg<nn::Linear>("k_proj", hidden_size_, head_dim_ * num_key_value_heads_,
                                 cfg.attention_bias, cfg.linear_impl_type);
       v_proj_ = reg<nn::Linear>("v_proj", hidden_size_, head_dim_ * num_key_value_heads_,
                                 cfg.attention_bias, cfg.linear_impl_type);
       o_proj_ = reg<nn::Linear>("o_proj", head_dim_ * num_attention_heads_, hidden_size_,
                                 cfg.attention_bias, cfg.linear_impl_type);
       rms_norm_q_ = reg<nn::RMSNorm>("q_norm", cfg.rms_norm_eps);
       rms_norm_k_ = reg<nn::RMSNorm>("k_norm", cfg.rms_norm_eps);
       q_rope_ = reg<nn::RoPE>("q_rope", cfg.rope_theta, cfg.max_position_embeddings);
       k_rope_ = reg<nn::RoPE>("k_rope", cfg.rope_theta, cfg.max_position_embeddings);
       mask_ = reg<nn::CausalMask>("mask");
       softmax_ = reg<nn::Softmax>("softmax", -1);
     }

     std::vector<Tensor> forward(const std::vector<Tensor>& inputs,
                                 const std::vector<AnyValue>& args) override {
       auto x = inputs[0];
       auto llm_embedding_sin = inputs[1];
       auto llm_embedding_cos = inputs[2];
       auto past_kv_cache = args[0].get<nn::StaticCache*>();

       // [B, S, H * D]
       auto query_states = q_proj_(x);
       auto key_states = k_proj_(x);
       auto value_states = v_proj_(x);

       int B = inputs[0].shape()[0];
       int S = inputs[0].shape()[1];

       // [B, S, H, D]
       query_states = query_states.view({B, S, num_attention_heads_, head_dim_});
       key_states = key_states.view({B, S, num_key_value_heads_, head_dim_});
       value_states = value_states.view({B, S, num_key_value_heads_, head_dim_});

       // [B, S, H, D]
       query_states = rms_norm_q_(query_states);
       key_states = rms_norm_k_(key_states);

       // [B, H, S, D]
       query_states = query_states.transpose(1, 2);
       key_states = key_states.transpose(1, 2);
       value_states = value_states.transpose(1, 2);

       // [B, H, S, D]
       query_states = q_rope_(query_states, llm_embedding_sin, llm_embedding_cos);
       key_states = k_rope_(key_states, llm_embedding_sin, llm_embedding_cos);

       // [B, H, S, D]
       std::tie(key_states, value_states) =
           past_kv_cache->updateKVCache(layer_idx_, key_states, value_states);

       Tensor attn;
       if (key_states.dtype() == kFloat32) {
         // attention weight
         // [B, H, S, S]
         attn = nn::functional::matmul(query_states, key_states, false, true) *
                (1.f / sqrtf(head_dim_));
         attn = mask_(attn);
         attn = softmax_(attn);
       } else if (key_states.dtype() == kFloat16) {
         attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32),
                                       false, true) *
                (1.f / sqrtf(head_dim_));
         attn = mask_(attn);
         attn = softmax_(attn);
         attn = attn.to(kFloat16);
       }

       // attn output
       // [B, H, S, S] @ [B, H, S, D] -> [B, H, S, D]
       auto output = nn::functional::matmul(attn, value_states);

       // [B, H, S, D] -> [B, S, H, D] -> [B, S, H * D]
       output = output.transpose(1, 2).view({B, S, num_attention_heads_ * head_dim_});
       output = o_proj_(output);
       return {output};
     }

     int layer_idx_;
   };

   class Qwen3Decoder final : public nn::Module {
    public:
     Qwen3Attention self_attn_;
     Qwen3MLP mlp_;
     nn::RMSNorm input_layer_norm_;
     nn::RMSNorm post_attention_layer_norm_;

     Qwen3Decoder() = default;

     Qwen3Decoder(const std::string& name, const Qwen3Config& cfg) : nn::Module(name) {
       self_attn_ = reg<Qwen3Attention>("self_attn", cfg);
       mlp_ = reg<Qwen3MLP>("mlp", cfg);
       input_layer_norm_ = reg<nn::RMSNorm>("input_layernorm", cfg.rms_norm_eps);
       post_attention_layer_norm_ = reg<nn::RMSNorm>("post_attention_layernorm", cfg.rms_norm_eps);
     }

     std::vector<Tensor> forward(const std::vector<Tensor>& inputs,
                                 const std::vector<AnyValue>& args) override {
       auto llm_embedding_sin = inputs[1];
       auto llm_embedding_cos = inputs[2];
       auto& kv_cache = args[0];

       auto x = input_layer_norm_(inputs[0]);
       x = self_attn_(x, llm_embedding_sin, llm_embedding_cos, kv_cache)[0];
       auto tmp = x + inputs[0];
       x = post_attention_layer_norm_(tmp);
       x = mlp_(x)[0];
       x = x + tmp;
       return {x};
     }
   };

   class Qwen3Text final : public nn::Module {
     nn::ModuleList<Qwen3Decoder> decode_blocks_;
     nn::RMSNorm norm_;
     nn::Embedding embedding_;

    public:
     Qwen3Text() = default;

     Qwen3Text(const std::string& name, const Qwen3Config& cfg) : nn::Module(name) {
       decode_blocks_ = reg<nn::ModuleList<Qwen3Decoder>>("layers", cfg.num_hidden_layers, cfg);
       for (auto [idx, b] : enumerate(decode_blocks_.list())) { b.self_attn_.layer_idx_ = idx; }
       norm_ = reg<nn::RMSNorm>("norm", cfg.rms_norm_eps);
       embedding_ = reg<nn::Embedding>("embed_tokens", cfg.vocab_size, cfg.hidden_size);
     }

     std::vector<Tensor> forward(const std::vector<Tensor>& inputs,
                                 const std::vector<AnyValue>& args) override {
       auto& blocks = decode_blocks_.list();

       // Embed the input token ids
       auto x = embedding_(inputs[0]);
       auto llm_embedding_sin = inputs[1];
       auto llm_embedding_cos = inputs[2];
       auto& kv_cache = args[0];

       for (auto& block : blocks) {
         x = block(x, llm_embedding_sin, llm_embedding_cos, kv_cache)[0];
       }

       x = norm_(x);
       return {x};
     }
   };

   class Qwen3ForCausalLM : public ARGeneration, public nn::Module {
    public:
     explicit Qwen3ForCausalLM(const Qwen3Config& cfg) : cfg(cfg) {
       kv_cache_ = nn::StaticCache(cfg.max_cache_length, cfg.num_hidden_layers,
                                   cfg.num_attention_heads,  // q_heads
                                   cfg.num_key_value_heads,  // kv_heads
                                   cfg.head_dim,             // kv_dim
                                   kFloat32,                 // k_dtype
                                   kFloat32,                 // v_dtype
                                   kCPU,                     // device_type
                                   false                     // use_fa2
       );
       eos_token_id_ = cfg.end_of_text_token_id;
       max_length_ = cfg.max_cache_length;
       tie_word_embeddings_ = cfg.tie_word_embeddings;

       llm = reg<Qwen3Text>("model", cfg);
       if (cfg.tie_word_embeddings) {
         // NOTE: lm_head_out.weight is the quantized copy of the tied
         // model.embed_tokens.weight (see the rename in quant_config.json).
         lm_head_ = reg<nn::Linear>("lm_head_out", cfg.hidden_size, cfg.vocab_size, false,
                                    cfg.linear_impl_type);
       }

       // Init inv freq
       auto inv = makeRoPEInvFreq(cfg.head_dim, cfg.rope_theta);
       registerBuffer("inv_freq", inv);
     }

     ARGenerationOutputPast forward(const ARGenerationOutputPast& input,
                                    const ARGenerationArgs& args) override {
       auto sequence = input.at("sequence");

       // Generate position_ids for the current sequence
       auto batch_size = sequence.shape()[0];
       auto seq_len = sequence.shape()[1];

       Tensor position_ids = Tensor::nil();

       if (input.count("position_ids")) {
         // Use existing position_ids for decode phase
         position_ids = input.at("position_ids");

         // For decode phase, increment the last position
         if (seq_len == 1) {
           auto last_pos = *position_ids.offsettedPtr<int64_t>({0, position_ids.shape()[1] - 1});
           position_ids = Tensor::empty({batch_size, 1}, kInt64, kCPU).alloc();
           *position_ids.offsettedPtr<int64_t>({0, 0}) = last_pos + 1;
         }
       } else {
         // Generate position_ids for prefill phase
         position_ids = Tensor::empty({batch_size, seq_len}, kInt64, kCPU).alloc();
         auto position_ids_ptr = position_ids.ptr<int64_t>();
         for (int b = 0; b < batch_size; ++b) {
           for (int s = 0; s < seq_len; ++s) { position_ids_ptr[b * seq_len + s] = s; }
         }
       }

       // Generate RoPE embeddings using the inv_freq buffer
       auto [llm_embedding_sin, llm_embedding_cos] =
           makeRotaryPosEmbedding(position_ids, getBuffer("inv_freq"), 1.0f);

       sequence = llm(sequence, llm_embedding_sin, llm_embedding_cos, AnyValue(&kv_cache_))[0];

       // clip x to one seq length
       {
         auto S = sequence.shape()[1];
         sequence = sequence[{kAll, {S - 1}, kAll}];
       }

       if (tie_word_embeddings_) { sequence = lm_head_(sequence); }

       return {
           {"sequence", sequence},
           {"position_ids", position_ids},
       };
     }

    private:
     const Qwen3Config& cfg;
     Qwen3Text llm;
     nn::Linear lm_head_;
     bool tie_word_embeddings_;
     nn::StaticCache kv_cache_;
   };

   }  // namespace mllm::models::qwen3

tokenization_qwen3.hpp
======================

.. code-block:: cpp

   // Copyright (c) MLLM Team.
   // Licensed under the MIT License.
   #pragma once

   #include <string>
   #include <unordered_map>

   #include "mllm/preprocessor/tokenizers/BPE.hpp"
   #include "mllm/models/ARGeneration.hpp"
   #include "mllm/preprocessor/tokenizers/Unicode.hpp"
   #include "mllm/preprocessor/tokenizers/AutoTokenizer.hpp"

   namespace mllm::models::qwen3 {

   // we need to handle this:
   //
   // (?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}|
   //  ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+
   inline bool qwen3TokenizerMatchPattern(const std::wstring& str, size_t& pos,
                                          std::wstring& matched) {
     if (pos >= str.size()) return false;

     // 1. Match contractions: "'s|'t|'re|'ve|'m|'ll|'d"
     static const std::wstring contractions[] = {L"'s", L"'t", L"'re", L"'ve",
                                                 L"'m", L"'ll", L"'d"};
     for (const auto& contraction : contractions) {
       if (pos + contraction.size() <= str.size() &&
           str.compare(pos, contraction.size(), contraction) == 0) {
         matched = contraction;
         pos += contraction.size();
         return true;
       }
     }

     // 2. Match [^\r\n\p{L}\p{N}]?\p{L}+ (optional non-letter/digit prefix followed by letters)
     {
       size_t original_pos = pos;
       bool has_prefix = false;
       matched.clear();

       // Check optional non-letter/digit prefix (excluding \r\n)
       if (!preprocessor::isLetter(str[pos]) && !preprocessor::isDigit(str[pos]) &&
           str[pos] != L'\r' && str[pos] != L'\n') {
         matched += str[pos];
         ++pos;
         has_prefix = true;
       }

       // Require at least one letter
       if (pos < str.size() && preprocessor::isLetter(str[pos])) {
         do {
           matched += str[pos];
           ++pos;
         } while (pos < str.size() && preprocessor::isLetter(str[pos]));
         return true;
       } else {
         // Rollback if no letters after prefix
         if (has_prefix) {
           pos = original_pos;
           matched.clear();
         }
       }
     }

     // 3. Match \p{N} (digits)
     if (preprocessor::isDigit(str[pos])) {
       matched = str.substr(pos, 1);
       ++pos;
       return true;
     }

     // 4. Match  ?[^\s\p{L}\p{N}]+[\r\n]* (punctuation/symbols with optional space prefix)
     {
       size_t original_pos = pos;
       matched.clear();
       size_t start = pos;

       // Optional space
       if (str[pos] == L' ') { ++pos; }

       // Require at least one non-letter/digit/whitespace
       if (pos < str.size() && !std::iswspace(str[pos]) && !preprocessor::isLetter(str[pos]) &&
           !preprocessor::isDigit(str[pos])) {
         do {
           ++pos;
         } while (pos < str.size() && !std::iswspace(str[pos]) &&
                  !preprocessor::isLetter(str[pos]) && !preprocessor::isDigit(str[pos]));

         // Capture from start (after optional space) to current pos
         matched = str.substr(start, pos - start);

         // Capture trailing newlines
         while (pos < str.size() && (str[pos] == L'\r' || str[pos] == L'\n')) {
           matched += str[pos];
           ++pos;
         }
         return true;
       } else {
         // Rollback if no symbols found
         pos = original_pos;
       }
     }

     // 5. Match \s*[\r\n]+ (newlines with leading whitespace)
     {
       size_t start = pos;
       while (pos < str.size() && std::iswspace(str[pos])) ++pos;
       if (pos < str.size() && (str[pos] == L'\r' || str[pos] == L'\n')) {
         while (pos < str.size() && (str[pos] == L'\r' || str[pos] == L'\n')) ++pos;
         matched = str.substr(start, pos - start);
         return true;
       } else {
         pos = start;
       }
     }

     // 6. Match \s+(?!\S) (whitespace not followed by non-space)
     if (std::iswspace(str[pos])) {
       size_t start = pos;
       while (pos < str.size() && std::iswspace(str[pos])) ++pos;

       // Check if at end or followed by whitespace
       if (pos >= str.size() || std::iswspace(str[pos])) {
         matched = str.substr(start, pos - start);
         return true;
       } else {
         pos = start;
       }
     }

     // 7. Match remaining whitespace
     if (std::iswspace(str[pos])) {
       size_t start = pos;
       while (pos < str.size() && std::iswspace(str[pos])) ++pos;
       matched = str.substr(start, pos - start);
       return true;
     }

     return false;
   }

   inline bool qwen3Regex(const std::string& str, std::vector<std::wstring>& splitted) {
     auto w_string = preprocessor::utf8string2WideString(str);
     size_t pos = 0;
     while (pos < w_string.size()) {
       std::wstring matched;
       if (qwen3TokenizerMatchPattern(w_string, pos, matched)) {
         splitted.push_back(matched);
       } else {
         ++pos;
       }
     }
     return true;
   }

   struct Qwen3Message {
     std::string prompt;
     // The empty "<think>\n\n</think>" block pre-fills the assistant turn so the
     // model skips thinking-mode output.
     static inline std::string message_template =
         "<|im_start|>user\n{{{prompt}}}<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n";
   };

   class Qwen3Tokenizer final : public mllm::preprocessor::AutoTokenizer {
    public:
     explicit Qwen3Tokenizer(const std::string& file_path) {
       preprocessor::initLocal();
       preprocessor::makeBytes2UnicodeMap(bytes_2_unicode_dict_);
       for (auto& kv : bytes_2_unicode_dict_) {
         bytes_2_unicode_dict_inverse_.insert({kv.second, kv.first});
       }
       bpe_.initFromSentencePieceJson(file_path);
       special_tokens_trie_.add(L"<|endoftext|>");
       special_tokens_trie_.add(L"<|im_start|>");
       special_tokens_trie_.add(L"<|im_end|>");
       special_tokens_trie_.add(L"<|object_ref_start|>");
       special_tokens_trie_.add(L"<|object_ref_end|>");
       special_tokens_trie_.add(L"<|box_start|>");
       special_tokens_trie_.add(L"<|box_end|>");
       special_tokens_trie_.add(L"<|quad_start|>");
       special_tokens_trie_.add(L"<|quad_end|>");
       special_tokens_trie_.add(L"<|vision_start|>");
       special_tokens_trie_.add(L"<|vision_end|>");
       special_tokens_trie_.add(L"<|vision_pad|>");
       special_tokens_trie_.add(L"<|image_pad|>");
       special_tokens_trie_.add(L"<|video_pad|>");
       special_tokens_trie_.add(L"<think>");
       special_tokens_trie_.add(L"</think>");
     }

     std::vector<std::wstring> _tokenize(const std::string& str) override {
       std::vector<std::wstring> ret;
       std::vector<std::wstring> splitted;
       ::mllm::models::qwen3::qwen3Regex(str, splitted);
       for (const auto& s : splitted) {
         auto utf_8_str = preprocessor::wideString2Utf8String(s);
         std::wstring mapped_str;
         for (unsigned char c : utf_8_str) { mapped_str.push_back(bytes_2_unicode_dict_[c]); }
         auto bpe_ts = bpe_._bpe(mapped_str);
         for (const auto& bpe_t : bpe_ts) { ret.push_back(bpe_t); }
       }
       return ret;
     }

     std::vector<std::wstring> tokenize(const std::string& str) override {
       auto tokens = special_tokens_trie_.split(preprocessor::utf8string2WideString(str));
       std::vector<std::wstring> all_tokens;
       for (const auto& token : tokens) {
         if (special_tokens_trie_.isSpecialToken(token)) {
           all_tokens.emplace_back(token);
           continue;
         }
         auto tmp_tokens = _tokenize(preprocessor::wideString2Utf8String(token));
         all_tokens.insert(all_tokens.end(), tmp_tokens.begin(), tmp_tokens.end());
       }
       return all_tokens;
     }

     std::wstring _detokenize(int64_t pos_idx) override {
       return bpe_._lookup_inverse_vocab(pos_idx);
     }

     std::wstring detokenize(int64_t pos_idx) override {
       auto str = _detokenize(pos_idx);
       std::string utf_8_str;
       for (wchar_t c : str) {
         utf_8_str.push_back((unsigned char)(bytes_2_unicode_dict_inverse_[c]));
       }
       return {mllm::preprocessor::utf8string2WideString(utf_8_str)};
     }

     Tensor convert2Ids(const std::vector<std::wstring>& strs) override {
       std::vector<int64_t> ids;
       ids.reserve(strs.size());
       for (const auto& str : strs) { ids.emplace_back(bpe_._lookup_vocab(str)); }
       Tensor ret = Tensor::empty({/*batch*/ 1, /*seq*/ (int32_t)ids.size()}, kInt64, kCPU)
                        .setMemType(kExtraInput)
                        .setName("qwen3-tokenizer-i0")
                        .alloc();
       auto ptr = ret.ptr<int64_t>();
       for (size_t i = 0; i < ids.size(); ++i) { ptr[i] = ids[i]; }
       return ret;
     }
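
     // convertMessage() applies the chat template from Qwen3Message: it replaces
     // the 12-character "{{{prompt}}}" placeholder with the user prompt, tokenizes
     // the result, and packs the token ids into a [1, seq] int64 tensor under the
     // key "sequence", matching the "sequence" input consumed by
     // Qwen3ForCausalLM::forward in modeling_qwen3.hpp.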
     ARGenerationOutputPast convertMessage(const Qwen3Message& message) {
       // process prompt
       auto applied_string = Qwen3Message::message_template;
       size_t pos = applied_string.find("{{{prompt}}}");
       applied_string.replace(pos, 12, message.prompt);

       // process sequence
       auto sequence_str = tokenize(applied_string);
       std::vector<int64_t> ids;
       ids.reserve(sequence_str.size());
       for (const auto& str : sequence_str) { ids.emplace_back(bpe_._lookup_vocab(str)); }

       // Get sequence Tensor
       Tensor sequence = Tensor::empty({/*batch*/ 1, /*seq*/ (int32_t)ids.size()}, kInt64, kCPU)
                             .setMemType(kNormal)
                             .setName("qwen3-tokenizer-i0")
                             .alloc();
       auto ptr = sequence.ptr<int64_t>();
       for (size_t i = 0; i < ids.size(); ++i) { ptr[i] = ids[i]; }

       return {
           {"sequence", sequence},
       };
     }

    private:
     // For text
     preprocessor::BPE bpe_;
     std::unordered_map<unsigned char, wchar_t> bytes_2_unicode_dict_;
     std::unordered_map<wchar_t, unsigned char> bytes_2_unicode_dict_inverse_;
   };

   }  // namespace mllm::models::qwen3