pymllm.executor.model_runner
============================

.. py:module:: pymllm.executor.model_runner

.. autoapi-nested-parse::

   ModelRunner runs the forward passes of the models in pymllm's
   single-GPU inference architecture.

   Handles:

   * Model loading (HuggingFace checkpoint via ``transformers``)
   * KV-cache memory pool initialisation
   * Attention backend setup (FlashInfer)
   * Forward pass dispatch (extend / decode / idle)
   * Token sampling from logits

   Typical lifecycle::

       runner = ModelRunner(server_config, model_config)
       runner.initialize()
       # --- inside the inference loop ---
       forward_batch = runner.prepare_forward_batch_decode(...)
       logits_output = runner.forward(forward_batch)
       next_token_ids = runner.sample(logits_output, forward_batch)

   Typical data flow
   -----------------

   ::

       SchedulerProcess builds a batch dict
           ↓
       ModelRunnerProcess calls ModelRunner.forward(forward_batch)
           ↓
       attn_backend.init_forward_metadata(forward_batch)
           ↓
       model.forward(input_ids, positions, forward_batch)
           ↓
       ModelRunner.sample(logits_output, forward_batch)
           ↓
       next_token_ids returned to scheduler

Attributes
----------

.. autoapisummary::

   pymllm.executor.model_runner.logger

Classes
-------

.. autoapisummary::

   pymllm.executor.model_runner.LogitsProcessorOutput
   pymllm.executor.model_runner.ModelRunner

Functions
---------

.. autoapisummary::

   pymllm.executor.model_runner.get_available_gpu_memory
   pymllm.executor.model_runner.get_total_gpu_memory

Module Contents
---------------

.. py:data:: logger

.. py:function:: get_available_gpu_memory(device = 'cuda', gpu_id = 0)

   Return available GPU memory in GB.

.. py:function:: get_total_gpu_memory(device = 'cuda', gpu_id = 0)

   Return total GPU memory in GB.

.. py:class:: LogitsProcessorOutput

   Container for output logits produced by the model's forward pass.

   .. attribute:: next_token_logits

      Raw logits for the last token of each sequence in the batch,
      shape ``[batch_size, vocab_size]``.

   .. attribute:: hidden_states

      Optional hidden states from the model (e.g. for speculative decoding
      or auxiliary loss computation).

   .. py:attribute:: next_token_logits
      :type: torch.Tensor

   .. py:attribute:: hidden_states
      :type: Optional[torch.Tensor]
      :value: None

.. py:class:: ModelRunner(server_config = None, model_config = None, gpu_id = 0)

   Runs the forward passes of the models.

   This is the core execution component that owns the model, memory pools,
   and attention backend. It is used by
   :class:`~pymllm.orchestrator.model_runner_process.ModelRunnerProcess` to
   execute batches dispatched by the scheduler.

   :param server_config: Server runtime configuration. Falls back to the
       global singleton when ``None``.
   :param model_config: Model configuration (wraps a HuggingFace
       ``PretrainedConfig``). Falls back to the global singleton when ``None``.
   :param gpu_id: GPU device index to use.

   .. py:attribute:: server_config

   .. py:attribute:: model_config

   .. py:attribute:: gpu_id
      :value: 0

   .. py:attribute:: device
      :type: str
      :value: 'cuda'

   .. py:attribute:: dtype
      :type: torch.dtype

   .. py:attribute:: model
      :type: Optional[torch.nn.Module]
      :value: None

   .. py:attribute:: req_to_token_pool
      :type: Optional[pymllm.mem_cache.memory_pool.ReqToTokenPool]
      :value: None

   .. py:attribute:: token_to_kv_pool
      :type: Optional[pymllm.mem_cache.memory_pool.KVPool]
      :value: None

   .. py:attribute:: token_to_kv_pool_allocator
      :type: Optional[pymllm.mem_cache.memory_pool.TokenToKVPoolAllocator]
      :value: None

   .. py:attribute:: gdn_pool
      :type: Optional[pymllm.mem_cache.memory_pool.GDNPool]
      :value: None

   .. py:attribute:: attn_backend
      :type: Optional[pymllm.layers.attention.attention_backend.AttentionBackend]
      :value: None

   .. py:attribute:: graph_runner
      :type: Optional[pymllm.executor.cuda_graph_runner.CudaGraphRunner]
      :value: None

   .. py:attribute:: max_total_num_tokens
      :type: int
      :value: 0

   .. py:attribute:: max_running_requests
      :type: int
      :value: 0

   .. py:attribute:: num_hidden_layers
      :type: int
      :value: 0

   .. py:attribute:: num_attention_heads
      :type: int
      :value: 0

   .. py:attribute:: num_kv_heads
      :type: int
      :value: 0

   .. py:attribute:: head_dim
      :type: int
      :value: 0

   .. py:attribute:: hidden_size
      :type: int
      :value: 0

   .. py:attribute:: vocab_size
      :type: int
      :value: 0

   .. py:attribute:: context_len
      :type: int
      :value: 0

   .. py:attribute:: kv_cache_dtype
      :type: torch.dtype

   .. py:attribute:: forward_pass_id
      :type: int
      :value: 0

   .. py:method:: initialize()

      Full initialisation: set device, load model, init memory + backend.

      Call this once before any forward pass.

   .. py:method:: load_model()

      Load the model from a HuggingFace checkpoint.

      First checks the pymllm model registry for a custom implementation
      that uses ``RadixAttention``. If found, instantiates it with the
      HuggingFace config and loads weights via ``load_weights()``.
      Otherwise falls back to ``AutoModelForCausalLM.from_pretrained``.

   .. py:method:: init_memory_pool()

      Initialise KV-cache memory pools and request-to-token mapping.

      1. Profiles available GPU memory to determine the maximum number of
         KV-cache token slots (``max_total_num_tokens``).
      2. Derives ``max_running_requests`` from config or a heuristic.
      3. Creates :class:`~pymllm.mem_cache.memory_pool.ReqToTokenPool`,
         :class:`~pymllm.mem_cache.memory_pool.KVPool`, and
         :class:`~pymllm.mem_cache.memory_pool.TokenToKVPoolAllocator`.

   .. py:method:: init_attention_backend()

      Initialise the attention backend.

      Creates a :class:`FlashInferAttnBackend` for standard models, or a
      :class:`HybridAttnBackend` (FlashInfer + GDN) for hybrid models.

   .. py:method:: init_cuda_graphs()

      Capture CUDA graphs for decode-step acceleration.

      Skipped when:

      * The device is not CUDA.
      * ``server_config.disable_cuda_graph`` is ``True``.
      * The model is not a generation model.

   .. py:method:: prepare_forward_batch_extend(input_ids, req_pool_indices, seq_lens, extend_seq_lens, extend_prefix_lens, out_cache_loc, return_logprob = False, top_logprobs_nums = None)

      Build a :class:`ForwardBatch` for an extend (prefill) pass.
      :param input_ids: Token IDs for all new tokens, shape
          ``[total_new_tokens]``.
      :param req_pool_indices: Index of each request in ``ReqToTokenPool``,
          shape ``[batch_size]``.
      :param seq_lens: Total (prefix + new) length of each sequence, shape
          ``[batch_size]``.
      :param extend_seq_lens: Number of new tokens per sequence, shape
          ``[batch_size]``.
      :param extend_prefix_lens: Cached prefix length per sequence, shape
          ``[batch_size]``.
      :param out_cache_loc: KV-pool slot indices for each new token, shape
          ``[total_new_tokens]``.
      :param return_logprob: Whether to return per-token log-probabilities.
      :param top_logprobs_nums: Number of top log-probs per sequence.

   .. py:method:: prepare_forward_batch_decode(input_ids, req_pool_indices, seq_lens, out_cache_loc, return_logprob = False, top_logprobs_nums = None, mrope_position_deltas = None)

      Build a :class:`ForwardBatch` for a decode step.

      :param input_ids: Token IDs (one per sequence), shape ``[batch_size]``.
      :param req_pool_indices: Index of each request in ``ReqToTokenPool``,
          shape ``[batch_size]``.
      :param seq_lens: Total sequence length of each request, shape
          ``[batch_size]``.
      :param out_cache_loc: KV-pool slot for each sequence's new token,
          shape ``[batch_size]``.
      :param return_logprob: Whether to return per-token log-probabilities.
      :param top_logprobs_nums: Number of top log-probs per sequence.
      :param mrope_position_deltas: Per-request M-RoPE position deltas,
          shape ``[batch_size]`` (int64). Used by multimodal models
          (e.g. Qwen3-VL) to offset decode-step positions by the spatial
          extent of prefill images.

   .. py:method:: forward(forward_batch)

      Run a forward pass through the model.

      Dispatches to the appropriate method based on the batch's
      :attr:`~pymllm.engine.forward_batch.ForwardMode`. For decode batches,
      automatically uses CUDA-graph replay when a captured graph is
      available.

      :param forward_batch: The prepared batch (from
          ``prepare_forward_batch_*``).
      :returns: Contains ``next_token_logits`` of shape
          ``[batch_size, vocab_size]``.
      :rtype: LogitsProcessorOutput

   .. py:method:: forward_decode(forward_batch)

      Run a decode forward pass (one new token per sequence).

      Calls ``attn_backend.init_forward_metadata`` followed by
      ``model.forward``.

   .. py:method:: forward_extend(forward_batch)

      Run an extend (prefill) forward pass.

      Calls ``attn_backend.init_forward_metadata`` followed by
      ``model.forward``.

   .. py:method:: sample(logits_output, forward_batch, temperatures = None, top_ps = None, top_ks = None, penalty_params = None)

      Sample next-token IDs from logits.

      Supports per-request temperature, top-p, top-k, and penalties
      (repetition, frequency, presence).

      :param logits_output: The logits from :meth:`forward`.
      :param forward_batch: The current forward batch.
      :param temperatures: Per-request temperature, shape ``[batch_size]``.
      :param top_ps: Per-request top-p, shape ``[batch_size]``.
      :param top_ks: Per-request top-k, shape ``[batch_size]``.
      :param penalty_params: Optional dict with keys
          ``repetition_penalties``, ``frequency_penalties``,
          ``presence_penalties`` (tensors of shape ``[batch_size]``), and
          ``token_histories`` (list of list of int).
      :returns: Next-token IDs, shape ``[batch_size]``, dtype ``int32``.
      :rtype: torch.Tensor

   .. py:method:: shutdown()

      Release model and memory resources.

   .. py:property:: is_generation
      :type: bool

      True if the model is a generation (causal-LM) model.

   .. py:property:: sliding_window_size
      :type: Optional[int]

      Sliding-window attention span, or ``None`` for full context.
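For orientation, the temperature / top-k / top-p pipeline that ``sample``
is documented to support can be sketched in pure Python. This is a hedged,
unbatched sketch of the standard filtering order, not pymllm's actual
implementation: the real method operates on batched ``torch`` tensors and
additionally applies the repetition/frequency/presence penalties.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, seed=None):
    """Pick one next-token id from a single row of raw logits.

    Sketch only: temperature scaling, then softmax, then top-k truncation,
    then a nucleus (top-p) cut, then sampling from the renormalised mass.
    """
    rng = random.Random(seed)
    # temperature == 0 is treated as greedy argmax decoding.
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token ids by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k > 0:
        order = order[:top_k]
    # Nucleus cut: keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise over the kept ids and draw one.
    z = sum(probs[i] for i in kept)
    r = rng.random() * z
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With ``temperature=0`` this degenerates to greedy decoding, and
``top_k=1`` or a very small ``top_p`` pin the choice to the single most
probable token, which is a convenient way to check the filtering logic.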
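The memory-profiling step in ``init_memory_pool`` ultimately divides the
usable free memory by the per-token KV-cache footprint. A minimal sketch of
that arithmetic is below; the ``mem_fraction`` knob and the exact accounting
are assumptions for illustration, not pymllm's actual heuristic, and the
helper name is hypothetical.

```python
def estimate_max_kv_tokens(free_gb, num_hidden_layers, num_kv_heads,
                           head_dim, dtype_bytes=2, mem_fraction=0.9):
    """Upper bound on KV-cache token slots fitting in ``free_gb`` GiB.

    Each cached token stores one K and one V vector per layer, so its
    footprint is 2 * layers * kv_heads * head_dim * dtype_bytes bytes
    (dtype_bytes=2 assumes an fp16/bf16 cache).
    """
    bytes_per_token = 2 * num_hidden_layers * num_kv_heads * head_dim * dtype_bytes
    usable = int(free_gb * (1 << 30) * mem_fraction)
    return usable // bytes_per_token
```

In practice the ``free_gb`` input would come from something like
``get_available_gpu_memory``, and the result would bound
``max_total_num_tokens`` for the ``KVPool``.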