Roadmap & Help wanted!

August - October 2025

P0

Benchmarks

Benchmark MLLM against llama.cpp and MNN.

  • W4A32 (4-bit weights, fp32 activations) & PPL

    • Qwen3

    • Qwen2.5VL
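PPL here is the standard perplexity metric: the exponential of the mean per-token negative log-likelihood. A minimal, engine-independent sketch of the computation (the function name is illustrative, not part of any benchmark harness):

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning probability 0.25 to every token has perplexity 4.
nlls = [-math.log(0.25)] * 8
print(perplexity(nlls))  # ≈ 4.0
```

Lower PPL on the same tokenized text indicates the quantized model tracks the reference distribution more closely, which is why it pairs naturally with the W4A32 comparison above.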

Model Supports

Port models supported by v1 to v2.

  • Qwen3 Series

  • Qwen2 Series

  • Llama3 Series

  • TinyLlama

Performance Optimization

Achieve high performance in eager mode through (1) manual memory planning, (2) fused kernels, (3) in-place operators, etc.
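The in-place idea can be illustrated with SiLU (a toy sketch; these names are not the mllm API): the out-of-place variant allocates a fresh output buffer, while the in-place variant overwrites its input, saving one allocation and one pass over memory.

```python
import math

def silu(xs):
    """Out-of-place SiLU (x * sigmoid(x)): allocates a new output list."""
    return [x / (1.0 + math.exp(-x)) for x in xs]

def silu_(xs):
    """In-place SiLU: overwrites the input buffer, no extra allocation."""
    for i, x in enumerate(xs):
        xs[i] = x / (1.0 + math.exp(-x))
    return xs

buf = [0.0, 1.0, -1.0]
out = silu_(buf)
assert out is buf  # same storage reused, no copy made
```

In eager mode every op normally materializes its result, so rewriting activation/elementwise ops as in-place variants removes most of those transient buffers.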

  • Inplace kernels for all backends

    • MulbyConst

    • AddFrom

    • Activation Functions

      • Sigmoid

      • GeLU

      • QuickGeLU

      • ✅ SiLU

      • ReLU, ReLU2

    • LayerNorm

    • RMSNorm

    • Softmax

  • Fused Kernels

    • Softmax + TopK

    • Matmul + RoPE

    • Softmax + Causal Mask

  • Well-optimized models (modeling_xxx_fast variants)

    • Use fused kernels

    • Use in-place operators

    • Manually free tensors before their lifetimes end

  • !!! Kernel selector table (tuning)

    • GEMV and GEMM kernel tile sizes

    • Thread counts

  • Quantized KVCache

  • MllmBlas used in Qwen2.5-VL is slow; use ggml's matmul (llamafile) in the future.
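One of the fusions listed above, Softmax + causal mask, can be sketched as a single pass that only touches the causal prefix, instead of first writing a -inf mask tensor and then running a generic softmax (a toy per-row illustration, not the mllm kernel):

```python
import math

def causal_softmax_row(scores, row):
    """Softmax over scores[:row+1]; positions after `row` get probability 0.

    Fusing the causal mask into the softmax avoids materializing a
    -inf-filled mask tensor and a second full pass over the row.
    """
    n = row + 1                          # only the causal prefix is live
    m = max(scores[:n])                  # max-subtraction for stability
    exps = [math.exp(s - m) for s in scores[:n]]
    z = sum(exps)
    return [e / z for e in exps] + [0.0] * (len(scores) - n)

probs = causal_softmax_row([0.1, 0.2, 0.3, 0.4], row=1)
assert probs[2] == probs[3] == 0.0       # future positions masked out
assert abs(sum(probs) - 1.0) < 1e-12
```

The same shape of argument applies to the other fusions listed (Softmax + TopK, Matmul + RoPE): one fused pass replaces two kernel launches and one intermediate tensor.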

Arm Kernel support

  • ✅ MLLM-BLAS fp32 GEMM Kernels (transpose_a=False, transpose_b=True) [@chenghua]

  • Element-wise kernels have slight performance issues

  • ✅ Arm I8-GEMM and I8-GEMV kernels (works together with bitspack) [@chenghua]

  • Arm U1-7 group-quantized embedding kernels (works together with bitspack)

  • More KleidiAI Kernels (SME Supports)

  • Optimize the MLLM-BLAS-SGEMV and MLLM-BLAS-SGEMM kernels for shapes common in LLM scenarios.

  • Full correctness-test coverage of the current Arm operators

  • MXFP4 Linear Kernels

  • ✅ Paged Attention Kernels (attention weights as one of the outputs)
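The I8-GEMM/GEMV and group-quantized embedding items above imply a (de)quantization scheme; a common choice, and presumably what group-quantized kernels consume, is symmetric per-group int8 with one fp scale per group (a sketch under that assumption, not the bitspack layout):

```python
def quantize_i8_groups(weights, group_size=32):
    """Symmetric per-group int8: q = round(w / scale),
    scale = max(|w|) / 127 per group. Returns (int8 values, fp scales)."""
    qs, scales = [], []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        scale = max(abs(w) for w in group) / 127 or 1.0  # avoid scale == 0
        scales.append(scale)
        qs.extend(max(-127, min(127, round(w / scale))) for w in group)
    return qs, scales

def dequantize_i8_groups(qs, scales, group_size=32):
    """Recover fp values; error per element is at most half a scale step."""
    return [q * scales[i // group_size] for i, q in enumerate(qs)]

w = [0.5, -1.0, 0.25, 0.75]
q, s = quantize_i8_groups(w, group_size=4)
w2 = dequantize_i8_groups(q, s, group_size=4)
assert all(abs(a - b) <= s[0] / 2 + 1e-9 for a, b in zip(w, w2))
```

A quantized GEMV then accumulates int8 products per group and applies the group scale once per accumulator, which is what makes the per-group layout kernel-friendly.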

X86 Backend support

  • Highway kernels for debugging purposes

QNN Backend support

  • Migration from mllm v1 to mllm v2

  • QNN Kernel Benchmarks

CANN Backend support

  • CANN Kernels

Quantization

  • Model Converter & Quantizer

  • Shared-weight embedding (for the tied-embedding scenario).
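Tied embeddings reuse the input embedding matrix E as the output projection (logits = h · Eᵀ), so the converter only needs to store the weight once. A toy sketch of the idea (not the mllm converter API):

```python
def embed(E, token_id):
    """Input side: look up a row of the shared embedding matrix E."""
    return E[token_id]

def lm_head(E, hidden):
    """Output side: logits[i] = <hidden, E[i]> reuses the same matrix E."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in E]

E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # vocab=3, dim=2, stored once
h = embed(E, 2)                            # [1.0, 1.0]
logits = lm_head(E, h)
assert logits == [1.0, 1.0, 2.0]
```

For a real model this halves the storage for what is often one of the largest tensors (vocab × hidden), which is why the converter has to recognize the tied case instead of exporting two copies.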

Applications & Productions

  • Multi-turn Chat

  • mllm-cli's ModelScope integration

P1

pymllm API

  • Resolve C++ Tensor / Python Tensor lifetime conflicts in some test cases.

Tests

  • PPL Tests

Long term 2025

P1

FFI ABI

  • One C API for all languages (using tvm-ffi; thanks @tianqi)

ARM PMU Tools Workflow

  • A kernel benchmark workflow that uses the PMU on the Arm architecture.

  • Software pipelining & multi-issue scheduling will benefit from this.