pymllm.engine.forward_batch
===========================

.. py:module:: pymllm.engine.forward_batch

.. autoapi-nested-parse::

   ForwardMode and ForwardBatch for pymllm.

   Simplified forward-batch abstraction: no speculative decoding, no
   encoder-decoder support, and no distributed-attention complexity
   (DP/TP head splitting is handled at the layer level by the model code,
   not here).

   Typical data flow
   -----------------

   ::

      ModelRunner builds a ForwardBatch
          ↓
      attn_backend.init_forward_metadata(forward_batch)
          ↓
      model.forward(input_ids, positions, forward_batch)
          ↓
      RadixAttention.forward(q, k, v, forward_batch)
          ↓
      forward_batch.attn_backend.forward(q, k, v, layer, forward_batch)

Classes
-------

.. autoapisummary::

   pymllm.engine.forward_batch.ForwardMode
   pymllm.engine.forward_batch.ForwardBatch

Module Contents
---------------

.. py:class:: ForwardMode

   Bases: :py:obj:`enum.IntEnum`

   Describes what kind of forward pass is being performed.

   Covers standard prefill / decode inference without speculative decoding.

   .. py:attribute:: EXTEND

   .. py:attribute:: DECODE

   .. py:attribute:: MIXED

   .. py:attribute:: IDLE

   .. py:method:: is_extend()

      True for EXTEND or MIXED (i.e. any prefill-style pass).

   .. py:method:: is_prefill()

      Alias for ``is_extend()``.

   .. py:method:: is_decode()

   .. py:method:: is_mixed()

   .. py:method:: is_idle()

   .. py:method:: is_decode_or_idle()

.. py:class:: ForwardBatch

   All tensors required by a single forward pass through the model.

   :param forward_mode: The kind of pass being performed
       (EXTEND / DECODE / MIXED / IDLE).
   :param batch_size: Number of sequences in the batch.
   :param input_ids: Token ids for every position in the batch, shape
       ``[num_tokens]``. For decode, ``num_tokens == batch_size``; for
       extend, ``num_tokens == extend_num_tokens``.
   :param req_pool_indices: Index of each sequence in ``ReqToTokenPool``,
       shape ``[batch_size]`` (int32 or int64, on the target device).
   :param seq_lens: Total (prefix + new) length of each sequence, shape
       ``[batch_size]`` (int32).
   :param out_cache_loc: KV-pool slot that each *output* token is written
       to, shape ``[num_tokens]`` (int64).
   :param seq_lens_sum: Python ``int`` equal to ``seq_lens.sum()``. Cached
       to avoid repeated device-to-host syncs.
   :param seq_lens_cpu: CPU copy of ``seq_lens`` (optional; used by some
       attention backends for plan computation without a device sync).
   :param positions: Token position for each input token, shape
       ``[num_tokens]`` (int32 or int64).
   :param extend_num_tokens: Total number of new (non-prefix) tokens
       across the batch. Only set during EXTEND / MIXED passes.
   :param extend_seq_lens: Number of *new* tokens for each sequence, shape
       ``[batch_size]`` (int32). Only set during EXTEND / MIXED.
   :param extend_prefix_lens: Length of the already-cached prefix for each
       sequence, shape ``[batch_size]`` (int32). Only set during
       EXTEND / MIXED.
   :param extend_start_loc: Cumulative start offset of each sequence in
       the flattened extend token stream, shape ``[batch_size]`` (int32).
   :param extend_prefix_lens_cpu: CPU list mirror of
       ``extend_prefix_lens``.
   :param extend_seq_lens_cpu: CPU list mirror of ``extend_seq_lens``.
   :param return_logprob: Whether to compute per-token log-probabilities.
   :param top_logprobs_nums: Number of top log-probs to return per
       sequence (None or list of ints).
   :param req_to_token_pool: Reference to the ``ReqToTokenPool`` (set by
       the model runner).
   :param token_to_kv_pool: Reference to the ``KVPool`` (set by the model
       runner).
   :param attn_backend: The attention backend to use (set by the model
       runner before calling ``model.forward``).

   .. py:attribute:: forward_mode
      :type: ForwardMode

   .. py:attribute:: batch_size
      :type: int

   .. py:attribute:: input_ids
      :type: torch.Tensor

   .. py:attribute:: req_pool_indices
      :type: torch.Tensor

   .. py:attribute:: seq_lens
      :type: torch.Tensor

   .. py:attribute:: out_cache_loc
      :type: torch.Tensor

   .. py:attribute:: seq_lens_sum
      :type: int

   .. py:attribute:: seq_lens_cpu
      :type: Optional[torch.Tensor]
      :value: None

   .. py:attribute:: positions
      :type: Optional[torch.Tensor]
      :value: None

   .. py:attribute:: extend_num_tokens
      :type: Optional[int]
      :value: None

   .. py:attribute:: extend_seq_lens
      :type: Optional[torch.Tensor]
      :value: None

   .. py:attribute:: extend_prefix_lens
      :type: Optional[torch.Tensor]
      :value: None

   .. py:attribute:: extend_start_loc
      :type: Optional[torch.Tensor]
      :value: None

   .. py:attribute:: extend_prefix_lens_cpu
      :type: Optional[List[int]]
      :value: None

   .. py:attribute:: extend_seq_lens_cpu
      :type: Optional[List[int]]
      :value: None

   .. py:attribute:: return_logprob
      :type: bool
      :value: False

   .. py:attribute:: top_logprobs_nums
      :type: Optional[List[int]]
      :value: None

   .. py:attribute:: req_to_token_pool
      :type: Optional[pymllm.mem_cache.memory_pool.ReqToTokenPool]
      :value: None

   .. py:attribute:: token_to_kv_pool
      :type: Optional[pymllm.mem_cache.memory_pool.KVPool]
      :value: None

   .. py:attribute:: attn_backend
      :type: Optional[pymllm.layers.attention.attention_backend.AttentionBackend]
      :value: None

   .. py:attribute:: mrope_position_deltas
      :type: Optional[torch.Tensor]
      :value: None

   .. py:attribute:: pixel_values
      :type: Optional[torch.Tensor]
      :value: None

   .. py:attribute:: image_grid_thw
      :type: Optional[torch.Tensor]
      :value: None
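The ``ForwardMode`` predicates documented above can be sketched with a
minimal stand-in class. This is an illustrative re-implementation, not the
actual pymllm code: the member values are assumptions, and only the
documented semantics (``is_extend`` covering both EXTEND and MIXED,
``is_prefill`` aliasing it) are taken from this page.

.. code-block:: python

   import enum


   class ForwardMode(enum.IntEnum):
       # Member values are illustrative; pymllm may assign different ones.
       EXTEND = 1
       DECODE = 2
       MIXED = 3
       IDLE = 4

       def is_extend(self) -> bool:
           # True for EXTEND or MIXED, i.e. any prefill-style pass.
           return self in (ForwardMode.EXTEND, ForwardMode.MIXED)

       def is_prefill(self) -> bool:
           # Alias for is_extend().
           return self.is_extend()

       def is_decode(self) -> bool:
           return self is ForwardMode.DECODE

       def is_mixed(self) -> bool:
           return self is ForwardMode.MIXED

       def is_idle(self) -> bool:
           return self is ForwardMode.IDLE

       def is_decode_or_idle(self) -> bool:
           return self in (ForwardMode.DECODE, ForwardMode.IDLE)


   # A decode pass is neither prefill-style nor mixed; attention backends
   # can branch on these predicates instead of comparing enum members.
   mode = ForwardMode.DECODE
   assert mode.is_decode() and mode.is_decode_or_idle()
   assert not mode.is_extend()

Keeping the EXTEND-vs-MIXED distinction behind ``is_extend()`` lets callers
that only care about "prefill-style or not" stay unchanged if more modes
are added later.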