Roadmap & Help wanted!
======================

August - October 2025
---------------------

P0
~~~

Benchmarks
^^^^^^^^^^^^

Benchmark MLLM, llama.cpp, and MNN.

- W4A32 & PPL
- Qwen3
- Qwen2.5VL

Model Supports
^^^^^^^^^^^^^^^^

Port the models supported by v1 to v2.

- Qwen3 Series
- Qwen2 Series
- Llama3 Series
- TinyLlama

Performance Optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^

Achieve high performance in eager mode by using:

1. Manual memory planning
2. Fused kernels
3. Inplace operators, etc.

- Inplace kernels for all backends

  - MulbyConst
  - AddFrom
  - Activation Functions

    - Sigmoid
    - GeLU
    - QuickGeLU
    - ✅ SiLU
    - ReLU, ReLU2

  - LayerNorm
  - RMSNorm
  - Softmax

- Fused Kernels

  - Softmax + TopK
  - Matmul + RoPE
  - Softmax + Causal Mask

- Well-optimized models (modeling_xxx_fast version)

  - Using fused kernels
  - Using inplace operators
  - Manually freeing tensors before their lifetimes end

- !!! Kernel Selector Table (Tune)

  - Tile sizes for GEMV and GEMM kernels
  - Thread counts

- Quantized KVCache
- MllmBlas, as used in Qwen2.5-VL, is slow; use ggml's matmul (llamafile) in the future.

Arm Kernel support
^^^^^^^^^^^^^^^^^^

- ✅ MLLM-BLAS fp32 GEMM Kernels (transpose_a=False, transpose_b=True) [@chenghua]
- Element-wise kernels have slight performance issues
- ✅ Arm I8-Gemm and I8-Gemv Kernels (work together with bitspack) [@chenghua]
- Arm U1-7 Group Quantized Embedding Kernels (work together with bitspack)
- More KleidiAI Kernels (SME support)
- Optimize the MLLM-BLAS-SGEMV and MLLM-BLAS-SGEMM kernels for the shapes that occur in LLM scenarios
- Full correctness coverage for the current Arm operators
- MXFP4 Linear Kernels
- ✅ Paged Attention Kernels (Attentions as one of outputs)

X86 Backend support
^^^^^^^^^^^^^^^^^^^^

- Highway kernels for debugging purposes

QNN Backend support
^^^^^^^^^^^^^^^^^^^^

- Migration from mllm v1 to mllm v2
- QNN Kernel Benchmarks

CANN Backend support
^^^^^^^^^^^^^^^^^^^^

- CANN Kernels

Quantization
^^^^^^^^^^^^^^

- Model Converter & Quantizer
- Shared-weight embedding (for the tied-embedding scenario)

Applications & Productions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Multi-turn Chat
- ModelScope integration for mllm-cli

P1
~~~

pymllm API
^^^^^^^^^^^

- C++ Tensor and Python Tensor lifetimes conflict in some test cases.

Tests
^^^^^^

- PPL Tests

Long term 2025
---------------------

P1
~~~

FFI ABI
^^^^^^^^^^^

- One C API for all languages (using tvm-ffi, thanks @tianqi)

ARM PMU Tools Workflow
^^^^^^^^^^^^^^^^^^^^^^^^

- A kernel benchmark workflow that uses the PMU on the Arm architecture.
- Software pipelining and multi-issue optimization will benefit from it.