pymllm.executor.model_runner
============================

.. py:module:: pymllm.executor.model_runner

.. autoapi-nested-parse::

   ModelRunner runs the forward passes of the models in pymllm's
   single-GPU inference architecture.

   Handles:

   * Model loading (HuggingFace checkpoint via ``transformers``)
   * KV-cache memory pool initialisation
   * Attention backend setup (FlashInfer)
   * Forward pass dispatch (extend / decode / idle)
   * Token sampling from logits

   Typical lifecycle::

       runner = ModelRunner(server_config, model_config)
       runner.initialize()
       # --- inside the inference loop ---
       forward_batch = runner.prepare_forward_batch_decode(...)
       logits_output = runner.forward(forward_batch)
       next_token_ids = runner.sample(logits_output, forward_batch)

   Typical data flow
   -----------------

   ::

       SchedulerProcess builds a batch dict
           ↓
       ModelRunnerProcess calls ModelRunner.forward(forward_batch)
           ↓
       attn_backend.init_forward_metadata(forward_batch)
           ↓
       model.forward(input_ids, positions, forward_batch)
           ↓
       ModelRunner.sample(logits_output, forward_batch)
           ↓
       next_token_ids returned to scheduler

Attributes
----------

.. autoapisummary::

   pymllm.executor.model_runner.logger

Classes
-------

.. autoapisummary::

   pymllm.executor.model_runner.LogitsProcessorOutput
   pymllm.executor.model_runner.ModelRunner

Functions
---------

.. autoapisummary::

   pymllm.executor.model_runner.get_available_gpu_memory
   pymllm.executor.model_runner.get_total_gpu_memory

Module Contents
---------------

.. py:data:: logger

.. py:function:: get_available_gpu_memory(device = 'cuda', gpu_id = 0)

   Return available GPU memory in GB.

.. py:function:: get_total_gpu_memory(device = 'cuda', gpu_id = 0)

   Return total GPU memory in GB.

.. py:class:: LogitsProcessorOutput

   Container for output logits produced by the model's forward pass.

   .. attribute:: next_token_logits

      Raw logits for the last token of each sequence in the batch,
      shape ``[batch_size, vocab_size]``.

   .. attribute:: hidden_states

      Optional hidden states from the model (e.g. for speculative decoding
      or auxiliary loss computation).

   .. py:attribute:: next_token_logits
      :type: torch.Tensor

   .. py:attribute:: hidden_states
      :type: Optional[torch.Tensor]
      :value: None

.. py:class:: ModelRunner(server_config = None, model_config = None, gpu_id = 0)

   Runs the forward passes of the models.

   This is the core execution component that owns the model, memory pools,
   and attention backend. It is used by
   :class:`~pymllm.orchestrator.model_runner_process.ModelRunnerProcess` to
   execute batches dispatched by the scheduler.

   :param server_config: Server runtime configuration. Falls back to the
       global singleton when ``None``.
   :param model_config: Model configuration (wraps a HuggingFace
       ``PretrainedConfig``). Falls back to the global singleton when ``None``.
   :param gpu_id: GPU device index to use.

   .. py:attribute:: server_config

   .. py:attribute:: model_config

   .. py:attribute:: gpu_id
      :value: 0

   .. py:attribute:: device
      :type: str
      :value: 'cuda'

   .. py:attribute:: dtype
      :type: torch.dtype

   .. py:attribute:: model
      :type: Optional[torch.nn.Module]
      :value: None

   .. py:attribute:: req_to_token_pool
      :type: Optional[pymllm.mem_cache.memory_pool.ReqToTokenPool]
      :value: None

   .. py:attribute:: token_to_kv_pool
      :type: Optional[pymllm.mem_cache.memory_pool.KVPool]
      :value: None

   .. py:attribute:: token_to_kv_pool_allocator
      :type: Optional[pymllm.mem_cache.memory_pool.TokenToKVPoolAllocator]
      :value: None

   .. py:attribute:: gdn_pool
      :type: Optional[pymllm.mem_cache.memory_pool.GDNPool]
      :value: None

   .. py:attribute:: attn_backend
      :type: Optional[pymllm.layers.attention.attention_backend.AttentionBackend]
      :value: None

   .. py:attribute:: graph_runner
      :type: Optional[pymllm.executor.cuda_graph_runner.CudaGraphRunner]
      :value: None

   .. py:attribute:: max_total_num_tokens
      :type: int
      :value: 0

   .. py:attribute:: max_running_requests
      :type: int
      :value: 0

   .. py:attribute:: num_hidden_layers
      :type: int
      :value: 0

   .. py:attribute:: num_attention_heads
      :type: int
      :value: 0

   .. py:attribute:: num_kv_heads
      :type: int
      :value: 0

   .. py:attribute:: head_dim
      :type: int
      :value: 0

   .. py:attribute:: hidden_size
      :type: int
      :value: 0

   .. py:attribute:: vocab_size
      :type: int
      :value: 0

   .. py:attribute:: context_len
      :type: int
      :value: 0

   .. py:attribute:: kv_cache_dtype
      :type: torch.dtype

   .. py:attribute:: forward_pass_id
      :type: int
      :value: 0

   .. py:method:: initialize()

      Full initialisation: set device, load model, init memory + backend.

      Call this once before any forward pass.

   .. py:method:: load_model()

      Load the model from a HuggingFace checkpoint.

      First checks the pymllm model registry for a custom implementation
      that uses ``RadixAttention``. If found, instantiates it with the
      HuggingFace config and loads weights via ``load_weights()``.
      Otherwise falls back to ``AutoModelForCausalLM.from_pretrained``.

   .. py:method:: init_memory_pool()

      Initialise KV-cache memory pools and request-to-token mapping.

      1. Profiles available GPU memory to determine the maximum number of
         KV-cache token slots (``max_total_num_tokens``).
      2. Derives ``max_running_requests`` from config or a heuristic.
      3. Creates :class:`~pymllm.mem_cache.memory_pool.ReqToTokenPool`,
         :class:`~pymllm.mem_cache.memory_pool.KVPool`, and
         :class:`~pymllm.mem_cache.memory_pool.TokenToKVPoolAllocator`.

   .. py:method:: init_attention_backend()

      Initialise the attention backend.

      Creates a :class:`FlashInferAttnBackend` for standard models, or a
      :class:`HybridAttnBackend` (FlashInfer + GDN) for hybrid models.

   .. py:method:: init_cuda_graphs()

      Capture CUDA graphs for decode-step acceleration.

      Skipped when:

      * The device is not CUDA.
      * ``server_config.disable_cuda_graph`` is ``True``.
      * The model is not a generation model.

   .. py:method:: prepare_forward_batch_extend(input_ids, req_pool_indices, seq_lens, extend_seq_lens, extend_prefix_lens, out_cache_loc, return_logprob = False, top_logprobs_nums = None)

      Build a :class:`ForwardBatch` for an extend (prefill) pass.
      :param input_ids: Token IDs for all new tokens, shape
          ``[total_new_tokens]``.
      :param req_pool_indices: Index of each request in ``ReqToTokenPool``,
          shape ``[batch_size]``.
      :param seq_lens: Total (prefix + new) length of each sequence, shape
          ``[batch_size]``.
      :param extend_seq_lens: Number of new tokens per sequence, shape
          ``[batch_size]``.
      :param extend_prefix_lens: Cached prefix length per sequence, shape
          ``[batch_size]``.
      :param out_cache_loc: KV-pool slot indices for each new token, shape
          ``[total_new_tokens]``.
      :param return_logprob: Whether to return per-token log-probabilities.
      :param top_logprobs_nums: Number of top log-probs per sequence.

   .. py:method:: prepare_forward_batch_decode(input_ids, req_pool_indices, seq_lens, out_cache_loc, return_logprob = False, top_logprobs_nums = None, mrope_position_deltas = None)

      Build a :class:`ForwardBatch` for a decode step.

      :param input_ids: Token IDs (one per sequence), shape ``[batch_size]``.
      :param req_pool_indices: Index of each request in ``ReqToTokenPool``,
          shape ``[batch_size]``.
      :param seq_lens: Total sequence length of each request, shape
          ``[batch_size]``.
      :param out_cache_loc: KV-pool slot for each sequence's new token,
          shape ``[batch_size]``.
      :param return_logprob: Whether to return per-token log-probabilities.
      :param top_logprobs_nums: Number of top log-probs per sequence.
      :param mrope_position_deltas: Per-request M-RoPE position deltas,
          shape ``[batch_size]`` (int64). Used by multimodal models
          (e.g. Qwen3-VL) to offset decode-step positions by the spatial
          extent of prefill images.

   .. py:method:: forward(forward_batch)

      Run a forward pass through the model.

      Dispatches to the appropriate method based on the batch's
      :attr:`~pymllm.engine.forward_batch.ForwardMode`. For decode batches,
      automatically uses CUDA-graph replay when a captured graph is
      available.

      :param forward_batch: The prepared batch (from
          ``prepare_forward_batch_*``).
      :returns: Contains ``next_token_logits`` of shape
          ``[batch_size, vocab_size]``.
      :rtype: LogitsProcessorOutput

   .. py:method:: forward_decode(forward_batch)

      Run a decode forward pass (one new token per sequence).

      Calls ``attn_backend.init_forward_metadata`` followed by
      ``model.forward``.

   .. py:method:: forward_extend(forward_batch)

      Run an extend (prefill) forward pass.

      Calls ``attn_backend.init_forward_metadata`` followed by
      ``model.forward``.

   .. py:method:: sample(logits_output, forward_batch, temperatures = None, top_ps = None, top_ks = None, penalty_params = None)

      Sample next-token IDs from logits.

      Supports per-request temperature, top-p, top-k, and penalties
      (repetition, frequency, presence).

      :param logits_output: The logits from :meth:`forward`.
      :param forward_batch: The current forward batch.
      :param temperatures: Per-request temperature, shape ``[batch_size]``.
      :param top_ps: Per-request top-p, shape ``[batch_size]``.
      :param top_ks: Per-request top-k, shape ``[batch_size]``.
      :param penalty_params: Optional dict with keys
          ``repetition_penalties``, ``frequency_penalties``,
          ``presence_penalties`` (tensors of shape ``[batch_size]``), and
          ``token_histories`` (list of list of int).
      :returns: Next-token IDs, shape ``[batch_size]``, dtype ``int32``.
      :rtype: torch.Tensor

   .. py:method:: shutdown()

      Release model and memory resources.

   .. py:property:: is_generation
      :type: bool

      True if the model is a generation (causal-LM) model.

   .. py:property:: sliding_window_size
      :type: Optional[int]

      Sliding-window attention span, or ``None`` for full context.
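For orientation, the temperature / top-k / top-p pipeline that ``sample``
is documented to support can be sketched in pure Python. This is a hedged,
unbatched sketch of the standard filtering order, not pymllm's actual
implementation: the real method operates on batched ``torch`` tensors and
additionally applies the repetition/frequency/presence penalties.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, seed=None):
    """Pick one next-token id from a single row of raw logits.

    Sketch only: temperature scaling, then softmax, then top-k truncation,
    then a nucleus (top-p) cut, then sampling from the renormalised mass.
    """
    rng = random.Random(seed)
    # temperature == 0 is treated as greedy argmax decoding.
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token ids by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k > 0:
        order = order[:top_k]
    # Nucleus cut: keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise over the kept ids and draw one.
    z = sum(probs[i] for i in kept)
    r = rng.random() * z
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With ``temperature=0`` this degenerates to greedy decoding, and
``top_k=1`` or a very small ``top_p`` pin the choice to the single most
probable token, which is a convenient way to check the filtering logic.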
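The memory-profiling step in ``init_memory_pool`` ultimately divides the
usable free memory by the per-token KV-cache footprint. A minimal sketch of
that arithmetic is below; the ``mem_fraction`` knob and the exact accounting
are assumptions for illustration, not pymllm's actual heuristic, and the
helper name is hypothetical.

```python
def estimate_max_kv_tokens(free_gb, num_hidden_layers, num_kv_heads,
                           head_dim, dtype_bytes=2, mem_fraction=0.9):
    """Upper bound on KV-cache token slots fitting in ``free_gb`` GiB.

    Each cached token stores one K and one V vector per layer, so its
    footprint is 2 * layers * kv_heads * head_dim * dtype_bytes bytes
    (dtype_bytes=2 assumes an fp16/bf16 cache).
    """
    bytes_per_token = 2 * num_hidden_layers * num_kv_heads * head_dim * dtype_bytes
    usable = int(free_gb * (1 << 30) * mem_fraction)
    return usable // bytes_per_token
```

In practice the ``free_gb`` input would come from something like
``get_available_gpu_memory``, and the result would bound
``max_total_num_tokens`` for the ``KVPool``.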