pymllm.engine.forward_batch
===========================

.. py:module:: pymllm.engine.forward_batch

.. autoapi-nested-parse::

   ForwardMode and ForwardBatch for pymllm.

   Simplified forward-batch abstraction: no speculative decoding, no
   encoder-decoder support, and no distributed-attention complexity
   (DP/TP head splitting is handled at the layer level by the model code,
   not here).

   Typical data flow
   -----------------

   ::

      ModelRunner builds a ForwardBatch
          ↓
      attn_backend.init_forward_metadata(forward_batch)
          ↓
      model.forward(input_ids, positions, forward_batch)
          ↓
      RadixAttention.forward(q, k, v, forward_batch)
          ↓
      forward_batch.attn_backend.forward(q, k, v, layer, forward_batch)

Classes
-------

.. autoapisummary::

   pymllm.engine.forward_batch.ForwardMode
   pymllm.engine.forward_batch.ForwardBatch

Module Contents
---------------

.. py:class:: ForwardMode

   Bases: :py:obj:`enum.IntEnum`

   Describes what kind of forward pass is being performed.

   Covers standard prefill / decode inference without speculative decoding.

   .. py:attribute:: EXTEND

   .. py:attribute:: DECODE

   .. py:attribute:: MIXED

   .. py:attribute:: IDLE

   .. py:method:: is_extend()

      True for EXTEND or MIXED (i.e. any prefill-style pass).

   .. py:method:: is_prefill()

      Alias for ``is_extend()``.

   .. py:method:: is_decode()

   .. py:method:: is_mixed()

   .. py:method:: is_idle()

   .. py:method:: is_decode_or_idle()

.. py:class:: ForwardBatch

   All tensors required by a single forward pass through the model.

   :param forward_mode: The kind of pass being performed
       (EXTEND / DECODE / MIXED / IDLE).
   :param batch_size: Number of sequences in the batch.
   :param input_ids: Token ids for every position in the batch, shape
       ``[num_tokens]``. For decode, ``num_tokens == batch_size``; for
       extend, ``num_tokens == extend_num_tokens``.
   :param req_pool_indices: Index of each sequence in ``ReqToTokenPool``,
       shape ``[batch_size]`` (int32 or int64, on the target device).
   :param seq_lens: Total (prefix + new) length of each sequence, shape
       ``[batch_size]`` (int32).
   :param out_cache_loc: KV-pool slot that each *output* token is written
       to, shape ``[num_tokens]`` (int64).
   :param seq_lens_sum: Python ``int`` equal to ``seq_lens.sum()``. Cached
       to avoid repeated device-to-host syncs.
   :param seq_lens_cpu: CPU copy of ``seq_lens`` (optional; used by some
       attention backends for plan computation without a device sync).
   :param positions: Token position for each input token, shape
       ``[num_tokens]`` (int32 or int64).
   :param extend_num_tokens: Total number of new (non-prefix) tokens
       across the batch. Only set during EXTEND / MIXED passes.
   :param extend_seq_lens: Number of *new* tokens for each sequence, shape
       ``[batch_size]`` (int32). Only set during EXTEND / MIXED.
   :param extend_prefix_lens: Length of the already-cached prefix for each
       sequence, shape ``[batch_size]`` (int32). Only set during
       EXTEND / MIXED.
   :param extend_start_loc: Cumulative start offset of each sequence in
       the flattened extend token stream, shape ``[batch_size]`` (int32).
   :param extend_prefix_lens_cpu: CPU list mirror of
       ``extend_prefix_lens``.
   :param extend_seq_lens_cpu: CPU list mirror of ``extend_seq_lens``.
   :param return_logprob: Whether to compute per-token log-probabilities.
   :param top_logprobs_nums: Number of top log-probs to return per
       sequence (None or list of ints).
   :param req_to_token_pool: Reference to the ``ReqToTokenPool`` (set by
       the model runner).
   :param token_to_kv_pool: Reference to the ``KVPool`` (set by the model
       runner).
   :param attn_backend: The attention backend to use (set by the model
       runner before calling ``model.forward``).

   .. py:attribute:: forward_mode
      :type: ForwardMode

   .. py:attribute:: batch_size
      :type: int

   .. py:attribute:: input_ids
      :type: torch.Tensor

   .. py:attribute:: req_pool_indices
      :type: torch.Tensor

   .. py:attribute:: seq_lens
      :type: torch.Tensor

   .. py:attribute:: out_cache_loc
      :type: torch.Tensor

   .. py:attribute:: seq_lens_sum
      :type: int

   .. py:attribute:: seq_lens_cpu
      :type: Optional[torch.Tensor]
      :value: None

   .. py:attribute:: positions
      :type: Optional[torch.Tensor]
      :value: None

   .. py:attribute:: extend_num_tokens
      :type: Optional[int]
      :value: None

   .. py:attribute:: extend_seq_lens
      :type: Optional[torch.Tensor]
      :value: None

   .. py:attribute:: extend_prefix_lens
      :type: Optional[torch.Tensor]
      :value: None

   .. py:attribute:: extend_start_loc
      :type: Optional[torch.Tensor]
      :value: None

   .. py:attribute:: extend_prefix_lens_cpu
      :type: Optional[List[int]]
      :value: None

   .. py:attribute:: extend_seq_lens_cpu
      :type: Optional[List[int]]
      :value: None

   .. py:attribute:: return_logprob
      :type: bool
      :value: False

   .. py:attribute:: top_logprobs_nums
      :type: Optional[List[int]]
      :value: None

   .. py:attribute:: req_to_token_pool
      :type: Optional[pymllm.mem_cache.memory_pool.ReqToTokenPool]
      :value: None

   .. py:attribute:: token_to_kv_pool
      :type: Optional[pymllm.mem_cache.memory_pool.KVPool]
      :value: None

   .. py:attribute:: attn_backend
      :type: Optional[pymllm.layers.attention.attention_backend.AttentionBackend]
      :value: None

   .. py:attribute:: mrope_position_deltas
      :type: Optional[torch.Tensor]
      :value: None

   .. py:attribute:: pixel_values
      :type: Optional[torch.Tensor]
      :value: None

   .. py:attribute:: image_grid_thw
      :type: Optional[torch.Tensor]
      :value: None
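The ``ForwardMode`` predicates documented above can be sketched with a
minimal stand-in class. This is an illustrative re-implementation, not the
actual pymllm code: the member values are assumptions, and only the
documented semantics (``is_extend`` covering both EXTEND and MIXED,
``is_prefill`` aliasing it) are taken from this page.

.. code-block:: python

   import enum


   class ForwardMode(enum.IntEnum):
       # Member values are illustrative; pymllm may assign different ones.
       EXTEND = 1
       DECODE = 2
       MIXED = 3
       IDLE = 4

       def is_extend(self) -> bool:
           # True for EXTEND or MIXED, i.e. any prefill-style pass.
           return self in (ForwardMode.EXTEND, ForwardMode.MIXED)

       def is_prefill(self) -> bool:
           # Alias for is_extend().
           return self.is_extend()

       def is_decode(self) -> bool:
           return self is ForwardMode.DECODE

       def is_mixed(self) -> bool:
           return self is ForwardMode.MIXED

       def is_idle(self) -> bool:
           return self is ForwardMode.IDLE

       def is_decode_or_idle(self) -> bool:
           return self in (ForwardMode.DECODE, ForwardMode.IDLE)


   # A decode pass is neither prefill-style nor mixed; attention backends
   # can branch on these predicates instead of comparing enum members.
   mode = ForwardMode.DECODE
   assert mode.is_decode() and mode.is_decode_or_idle()
   assert not mode.is_extend()

Keeping the EXTEND-vs-MIXED distinction behind ``is_extend()`` lets callers
that only care about "prefill-style or not" stay unchanged if more modes
are added later.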