Roadmap & Help wanted!
======================

August - October 2025
---------------------

P0
~~~

Benchmarks
^^^^^^^^^^^^

Benchmark MLLM, llama.cpp, mnn.

- W4A32 & PPL

  - Qwen3
  - Qwen2.5VL

Model Supports
^^^^^^^^^^^^^^^^

Transform models supported by v1 to v2.

- Qwen3 Series
- Qwen2 Series
- Llama3 Series
- TinyLlama

Performance Optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^

Using 1. Manually memory planning 2. Fused kernels 3. Inplace Operators etc. To archive high performance in eager mode.

- Inplace kernels for all backends

  - MulbyConst
  - AddFrom
  - Activation Functions

    - Sigmoid
    - GeLU
    - QuickGeLU
    - ✅ SiLU
    - ReLU, ReLU2
  - LayerNorm
  - RMSNorm
  - Softmax

- Fused Kernels

  - Softmax + TopK
  - Matmul + RoPE
  - Softmax + Causal Mask

- Well optimized models (modeling_xxx_fast version)

  - Using Fused Kernels
  - Using inplace operators
  - Manually free tensors before its lifetime ends

- !!! Kernel Selector Table (Tune)

  - GEMV and GEMM kernels tile size
  - Thread numbers

- Quantized KVCache
- MllmBlas used in Qwen2.5-VL is slow, use ggml's matmul(llama file) in the feature. 

Arm Kernel support
^^^^^^^^^^^^^^^^^^

- ✅ MLLM-BLAS fp32 GEMM Kernels (transpose_a=False, transpose_b=True) [@chenghua]
- Element-wise Kernels has slightly performance issues
- ✅ Arm I8-Gemm and I8-Gemv Kernels. (Co-works with bitspack) [@chenghua]
- Arm U1-7 Group Quantized Embedding Kernels. (Co-works with bitspack)
- More KleidiAI Kernels (SME Supports)
- Optimizing MLLM-BLAS-SGEMV and MLLM-BLAS-SGEMM Kernels, for Shapes in LLM Scenarios.
- Full coverage of the correctness of current Arm operators
- MXFP4 Linear Kernels
- ✅ Paged Attention Kernels (Attentions as one of outputs)

X86 Backend support
^^^^^^^^^^^^^^^^^^^^

- Highway kernels for dbg purpose

QNN Backend support
^^^^^^^^^^^^^^^^^^^^

- Migration from mllmv1 to mllm v2
- QNN Kernel Benchmarks

CANN Backend support
^^^^^^^^^^^^^^^^^^^^

- CANN Kernels

Quantization
^^^^^^^^^^^^^^

- Model Convertor & Quantizer
- Shared weight Embedding(For tie-embedding scenario).

Applications & Productions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Multi-turn Chat
- mllm-cli's modelscope integration

P1
~~~

pymllm API
^^^^^^^^^^^

- C++ Tensor and Python Tensor lifetime conflict in some test cases.


Tests
^^^^^^

- PPL Tests

Long term 2025
---------------------

P1
~~~

FFI ABI
^^^^^^^^^^^

- One C_api for all languages(Using tvm-ffi, thanks @tianqi)

ARM PMU Tools Workflow
^^^^^^^^^^^^^^^^^^^^^^^^

- A Kernel Benchmark workflow that using PMU in ARM Arch.
- Software Pipeline & multi-issue will be benefited.