MLLM
====

*Mobile x Multimodal*

Fast and lightweight LLM inference engine for mobile and edge devices.

Arm CPU | x86 CPU | Qualcomm NPU (QNN)

MLLM is a lightweight, on-device inference engine optimized for multi-modal models. It supports diverse hardware platforms, including Arm CPUs, x86 architectures, and Qualcomm NPUs. Featuring a Torch-like API, MLLM enables developers to rapidly deploy AI algorithms directly on edge devices, making it well suited for AI PCs, smart assistants, drones, satellites, and embodied intelligence applications.

Latest News
-----------

Key Features
------------

1. **Pythonic eager execution** - Rapid model development
2. **Unified hardware support** - Arm CPU, OpenCL GPU, QNN NPU
3. **Advanced optimizations** - Quantization, pruning, speculative execution
4. **NPU-ready IR** - Seamless integration with NPU frameworks
5. **Deployment toolkit** - SDK + CLI inference tool

Tested Devices
--------------

+---------------------+----------------+------------------------+----------+------------------------+
| Device              | OS             | CPU                    | GPU      | NPU                    |
+=====================+================+========================+==========+========================+
| PC-X86-w/oAVX512    | Ubuntu 22.04   | |build-pending|        | -        | -                      |
+---------------------+----------------+------------------------+----------+------------------------+
| Xiaomi14-8Elite     | Android 15     | |build-passing|        | -        | |build-pending|        |
+---------------------+----------------+------------------------+----------+------------------------+
| OnePlus13-8Elite    | Android 15     | |build-passing|        | -        | |build-pending|        |
+---------------------+----------------+------------------------+----------+------------------------+
| MacMini-M4          | macOS 15.5     | |build-passing|        | -        | -                      |
+---------------------+----------------+------------------------+----------+------------------------+

.. |build-pending| image:: https://img.shields.io/badge/build-pending-gray
   :alt: build-pending

.. |build-passing| image:: https://img.shields.io/badge/build-passing-green
   :alt: build-passing

Quick Starts
------------

Serving LLMs with mllm-cli
~~~~~~~~~~~~~~~~~~~~~~~~~~

We have developed a C SDK wrapper around the MLLM C++ SDK to enable seamless integration with Golang. On top of this wrapper, we have built the mllm-cli command-line tool in Golang, which will be released soon.

Inference with VLM using C++ API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following example demonstrates how to perform inference on a multimodal vision-language model (VLM), specifically Qwen2-VL, using the MLLM C++ API. The process includes loading the model configuration, initializing the tokenizer, loading pretrained weights, processing image-text inputs, and performing streaming text generation.

.. code-block:: c++

   auto qwen2vl_cfg = Qwen2VLConfig(config_path);
   auto qwen2vl_tokenizer = Qwen2VLTokenizer(tokenizer_path);
   auto qwen2vl = Qwen2VLForCausalLM(qwen2vl_cfg);
   qwen2vl.load(mllm::load(model_path));

   auto inputs = qwen2vl_tokenizer.convertMessage({.prompt = prompt_text, .img_file_path = image_path});
   for (auto& step : qwen2vl.chat(inputs)) {
     std::wcout << qwen2vl_tokenizer.detokenize(step.cur_token_id) << std::flush;
   }
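If you prefer to collect the whole response instead of streaming it, the same ``chat()`` loop can accumulate the decoded pieces. The following is a small sketch of that variation, assuming ``detokenize()`` returns a ``std::wstring`` (as the use of ``std::wcout`` above suggests); it introduces no API beyond what is shown above.

.. code-block:: c++

   // Variation on the streaming loop above: accumulate the decoded pieces
   // into a single wide string and print it once generation finishes.
   // Assumes detokenize() returns std::wstring, as the std::wcout usage
   // above suggests; `response` is just an illustrative variable name.
   std::wstring response;
   for (auto& step : qwen2vl.chat(inputs)) {
     response += qwen2vl_tokenizer.detokenize(step.cur_token_id);
   }
   std::wcout << response << std::endl;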
More examples can be found in `examples <./examples/>`_.

Custom Models
~~~~~~~~~~~~~

MLLM offers a highly Pythonic API to simplify model implementation. For instance, consider the following concise ``VisionMlp`` implementation:

.. code-block:: c++

   class VisionMlp final : public nn::Module {
     int32_t dim_;
     int32_t hidden_dim_;
     nn::QuickGELU act_;
     nn::Linear fc_1_;
     nn::Linear fc_2_;

    public:
     VisionMlp() = default;

     inline VisionMlp(const std::string& name, const Qwen2VLConfig& cfg) : nn::Module(name) {
       dim_ = cfg.visual_embed_dim;
       hidden_dim_ = cfg.visual_embed_dim * cfg.visual_mlp_ratio;
       fc_1_ = reg<nn::Linear>("fc1", dim_, hidden_dim_, true, cfg.linear_impl_type);
       fc_2_ = reg<nn::Linear>("fc2", hidden_dim_, dim_, true, cfg.linear_impl_type);
       act_ = reg<nn::QuickGELU>("act");
     }

     std::vector<Tensor> forward(const std::vector<Tensor>& inputs, const std::vector<Tensor>& args) override {
       return {fc_2_(act_(fc_1_(inputs[0])))};
     }
   };

To use this ``VisionMlp``, instantiate and execute it as follows:

.. code-block:: c++

   auto mlp = VisionMlp(the_mlp_name, your_cfg);
   print(mlp);

   auto out = mlp(Tensor::random({1, 1024, 1024}));
   print(out);
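To run such a module with pretrained parameters rather than random ones, the weight-loading pattern from the VLM example above can in principle be reused. The following is a hedged sketch, not an official example: it assumes ``nn::Module`` exposes the same ``load()`` entry point used for ``Qwen2VLForCausalLM`` earlier, and that the module name and the registered layer names (``fc1``, ``fc2``) match the parameter names stored in the weight file. The module name and ``weights_path`` are placeholders.

.. code-block:: c++

   // Sketch only: assumes nn::Module provides the same load() entry point
   // used for full models above, and that "visual.blocks.0.mlp" matches the
   // naming scheme of the weights referenced by weights_path (both are
   // placeholders for illustration).
   auto mlp = VisionMlp("visual.blocks.0.mlp", your_cfg);
   mlp.load(mllm::load(weights_path));

   auto out = mlp(Tensor::random({1, 1024, 1024}));
   print(out);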
Model Tracing
~~~~~~~~~~~~~

MLLM enables **computational graph extraction** through its ``trace`` API, converting dynamic model execution into an optimized static representation. This is essential for model optimization, serialization, and deployment. For example:

.. code-block:: c++

   auto ir = mllm::ir::trace(mlp, Tensor::random({1, 1024, 1024}));
   print(ir);

Installation
------------

Arm Android
~~~~~~~~~~~

.. code-block:: shell

   pip install -r requirements.txt
   python task.py tasks/build_android.yaml

If you need to compile the QNN backend, please install the QNN SDK first. For instructions on setting up the QNN environment, please refer to the `QNN README `_. Once the environment is configured, you can compile MLLM with the following commands:

.. code-block:: shell

   pip install -r requirements.txt
   python task.py tasks/build_android_qnn.yaml

X86 PC
~~~~~~

.. code-block:: shell

   pip install -r requirements.txt
   python task.py tasks/build_x86.yaml

OSX (Apple Silicon)
~~~~~~~~~~~~~~~~~~~

.. code-block:: shell

   pip install -r requirements-mini.txt
   python task.py tasks/build_osx_apple_silicon.yaml

If you want to use Apple's Accelerate library, use the following commands instead:

.. code-block:: shell

   pip install -r requirements-mini.txt
   python task.py tasks/build_osx_apple_silicon_accelerate.yaml

Use Docker
~~~~~~~~~~

The MLLM team provides Dockerfiles to help you get started quickly, and we recommend using the Docker images. The ``./docker/`` folder contains images for arm (cross-compile to Arm, host: x86) and qnn (cross-compile to Arm, host: x86). Both the ARM and QNN images also support compiling the x86 backend.

.. code-block:: shell

   git clone https://github.com/UbiquitousLearning/mllm.git
   cd mllm/docker
   docker build -t mllm_arm -f Dockerfile.arm .
   docker run -it --cap-add=SYS_ADMIN --network=host --cap-add=SYS_PTRACE --shm-size=4G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --name mllm_arm_dev mllm_arm bash

Important notes:

1. Dockerfile.arm includes NDK downloads. By using this image, you agree to the NDK's additional terms.
2. The QNN SDK contains proprietary licensing terms. We do not bundle it in Dockerfile.qnn - please configure the QNN SDK manually.

Details on how to use the Dockerfile can be found in `Easy Setup with Docker and DevContainer for MLLM `_.

Building the C++ SDK
~~~~~~~~~~~~~~~~~~~~

You can build the SDK using the following commands:

.. code-block:: shell

   pip install -r requirements.txt
   python task.py tasks/build_sdk_<platform>.yaml

   # Example for macOS on Apple Silicon:
   python task.py tasks/build_sdk_osx_apple_silicon.yaml

By default, the SDK installs to the root directory of the ``mllm`` project. To customize the installation path, modify the ``-DCMAKE_INSTALL_PREFIX`` option in the task YAML file. Once installed, integrate the library into your CMake project using ``find_package(mllm)``. Below is a minimal working example:

.. code-block:: cmake

   cmake_minimum_required(VERSION 3.21)
   project(fancy_algorithm VERSION 1.0.0 LANGUAGES CXX C ASM)

   # Set C++20 standard and enable compile commands export
   set(CMAKE_CXX_STANDARD 20)
   set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

   # Find mllm library
   find_package(mllm REQUIRED)

   add_executable(fancy_algorithm main.cpp)

   # Link against Mllm runtime and CPU backend targets
   target_link_libraries(fancy_algorithm PRIVATE mllm::MllmRT mllm::MllmCPUBackend)
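To sanity-check the integration, a minimal ``main.cpp`` along the following lines can be used. This is a sketch: the umbrella header path is an assumption (check the headers installed under your ``CMAKE_INSTALL_PREFIX``), while ``Tensor::random`` and ``print`` are the same calls used in the Custom Models example above.

.. code-block:: c++

   // Minimal smoke test for the installed SDK. The include below is an
   // assumption about the installed header layout; adjust it to match the
   // headers under your CMAKE_INSTALL_PREFIX. Tensor::random and print are
   // the same calls used in the Custom Models example.
   #include <mllm/mllm.hpp>

   using namespace mllm;

   int main() {
     auto t = Tensor::random({1, 1024, 1024});
     print(t);
     return 0;
   }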
Building the Documentation
~~~~~~~~~~~~~~~~~~~~~~~~~~

You can build the documentation using the following commands:

.. code-block:: shell

   pip install -r docs/requirements.txt
   python task.py tasks/build_doc.yaml

If you need to generate Doxygen documentation, please ensure that Doxygen is installed on your system. Then set the ``enable_doxygen`` option to ``true`` in the ``tasks/build_doc.yaml`` configuration file. Running ``python task.py tasks/build_doc.yaml`` afterwards will generate the C++ API documentation.

Model Convert
-------------

MLLM provides a set of model converters for converting models from other popular formats to the MLLM format. Before you start, please make sure you have installed **pymllm**:

.. code-block:: shell

   bash ./scripts/install_pymllm.sh

**Future:** Once PyPI approves the creation of the mllm organization, we will publish the package there, after which it can be installed with:

.. code-block:: shell

   pip install pymllm

After installing pymllm, you can convert a model with the following command:

.. code-block:: shell

   mllm-convertor --input_path <input_path> --output_path <output_path> --cfg_path <cfg_path> --pipeline <pipeline>

For more usage instructions, please refer to ``mllm-convertor --help``.

Tools
-----

Join us & Contribute
--------------------

Acknowledgements
----------------

MLLM reuses many low-level kernel implementations from `ggml `_ on Arm CPUs. It also utilizes `stb `_ and `wenet `_ for pre-processing images and audio. MLLM has also benefited from the following projects: `llama.cpp `_ and `MNN `_.

License
-------

Overall Project License
~~~~~~~~~~~~~~~~~~~~~~~

This project is licensed under the terms of the MIT License. Please see the `LICENSE `_ file in the root directory for the full text of the MIT License.

Apache 2.0 Licensed Components
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Certain components (`wenet `_) of this project are licensed under the Apache License 2.0. These components are clearly identified in their respective subdirectories along with a copy of the Apache License 2.0. For the full text of the Apache License 2.0, please refer to the `LICENSE-APACHE `_ file located in the relevant subdirectories.

Citation
--------

.. code-block:: bibtex

   @article{xu2025fast,
     title={Fast On-device LLM Inference with NPUs},
     author={Xu, Daliang and Zhang, Hao and Yang, Liming and Liu, Ruiqi and Huang, Gang and Xu, Mengwei and Liu, Xuanzhe},
     booktitle={International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
     year={2025}
   }

   @misc{yi2023mllm,
     title = {mllm: fast and lightweight multimodal LLM inference engine for mobile and edge devices},
     author = {Rongjie Yi and Xiang Li and Zhenyan Lu and Hao Zhang and Daliang Xu and Liming Yang and Weikai Xie and Chenghua Wang and Xuanzhe Liu and Mengwei Xu},
     year = {2023},
     publisher = {mllm Team},
     url = {https://github.com/UbiquitousLearning/mllm}
   }

Documents
---------

.. toctree::
   :maxdepth: 2

   quick_start/index

.. toctree::
   :maxdepth: 2

   service/index

.. toctree::
   :maxdepth: 2

   arch/index

.. toctree::
   :maxdepth: 2

   compile/index

.. toctree::
   :maxdepth: 2

   quantization/index

.. toctree::
   :maxdepth: 2

   cache/index

.. toctree::
   :maxdepth: 2

   cpu_backend/index

.. toctree::
   :maxdepth: 2

   qnn_backend/index

.. toctree::
   :maxdepth: 2

   api/index

.. toctree::
   :maxdepth: 2

   contribute/index

.. toctree::
   :maxdepth: 2

   talks/index

.. toctree::
   :maxdepth: 2

   algorithms/index

.. toctree::
   :maxdepth: 2

   qa/index

.. toctree::
   :maxdepth: 2
   :caption: Pymllm API

   autoapi/pymllm/index

.. toctree::
   :maxdepth: 2
   :caption: C++ API

   CppAPI/library_root