MLLM
2.0.0 documentation
pymllm.executor

Executor module: model loading, forward pass, and sampling.

Submodules

  • pymllm.executor.cuda_graph_runner
  • pymllm.executor.model_runner
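The load → forward → sample responsibility described above can be illustrated with a minimal, framework-agnostic sketch. Every name below (`ToyModelRunner`, `greedy_sample`, `generate`) is hypothetical and exists only for illustration — this is not pymllm's actual API; see `pymllm.executor.model_runner` for the real implementation.

```python
# Conceptual sketch of an executor: load weights, run a forward pass,
# sample the next token. All names are hypothetical, NOT pymllm's API.
import math
import random

class ToyModelRunner:
    """Hypothetical stand-in for a model runner."""

    def __init__(self, vocab_size: int, seed: int = 0):
        # "Model loading": initialise a tiny random bigram table as weights.
        rng = random.Random(seed)
        self.vocab_size = vocab_size
        self.weights = [
            [rng.uniform(-1.0, 1.0) for _ in range(vocab_size)]
            for _ in range(vocab_size)
        ]

    def forward(self, token_id: int) -> list[float]:
        # "Forward pass": look up the logits row for the current token
        # and normalise it with a numerically stable softmax.
        logits = self.weights[token_id]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

def greedy_sample(probs: list[float]) -> int:
    # "Sampling": greedy argmax over the probability distribution.
    return max(range(len(probs)), key=probs.__getitem__)

def generate(runner: ToyModelRunner, start: int, steps: int) -> list[int]:
    # Autoregressive decode loop driving the three stages together.
    tokens = [start]
    for _ in range(steps):
        probs = runner.forward(tokens[-1])
        tokens.append(greedy_sample(probs))
    return tokens
```

In a real executor the forward pass would run on an accelerator, and `pymllm.executor.cuda_graph_runner` would additionally capture the decode step into a CUDA graph so repeated steps replay without per-kernel launch overhead.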
Copyright © 2024-2025, MLLM Contributors