pymllm.models.qwen3_vl
======================

.. py:module:: pymllm.models.qwen3_vl

.. autoapi-nested-parse::

   Inference-only Qwen3-VL model for pymllm.

   Adapted from sglang's Qwen3-VL implementation for pymllm's single-GPU
   inference architecture. Uses pymllm layers (RadixAttention, RMSNorm, MLP)
   and conforms to the pymllm forward interface::

       model.forward(input_ids, positions, forward_batch)

   Designed for a single accelerator card; no tensor or pipeline parallelism.

Attributes
----------

.. autoapisummary::

   pymllm.models.qwen3_vl.logger

Classes
-------

.. autoapisummary::

   pymllm.models.qwen3_vl.Qwen3VisionMLP
   pymllm.models.qwen3_vl.Qwen3VLVisionPatchEmbed
   pymllm.models.qwen3_vl.Qwen3VisionAttention
   pymllm.models.qwen3_vl.Qwen3VisionBlock
   pymllm.models.qwen3_vl.Qwen3VLVisionPatchMerger
   pymllm.models.qwen3_vl.Qwen3VLVisionModel
   pymllm.models.qwen3_vl.Qwen3VLAttention
   pymllm.models.qwen3_vl.Qwen3VLDecoderLayer
   pymllm.models.qwen3_vl.Qwen3VLTextModel
   pymllm.models.qwen3_vl.Qwen3VLForConditionalGeneration

Functions
---------

.. autoapisummary::

   pymllm.models.qwen3_vl.get_rope_index

Module Contents
---------------

.. py:data:: logger

.. py:class:: Qwen3VisionMLP(in_features, hidden_features, hidden_act = 'silu', bias = True)

   Bases: :py:obj:`torch.nn.Module`

   MLP block for the vision encoder.

   .. py:attribute:: linear_fc1

   .. py:attribute:: linear_fc2

   .. py:method:: forward(x)

.. py:class:: Qwen3VLVisionPatchEmbed(patch_size = 16, temporal_patch_size = 2, in_channels = 3, embed_dim = 1152)

   Bases: :py:obj:`torch.nn.Module`

   3D convolution patch embedding for video/image patchification.

   .. py:attribute:: patch_size
      :value: 16

   .. py:attribute:: temporal_patch_size
      :value: 2

   .. py:attribute:: in_channels
      :value: 3

   .. py:attribute:: embed_dim
      :value: 1152

   .. py:attribute:: proj

   .. py:method:: forward(hidden_states)

.. py:class:: Qwen3VisionAttention(embed_dim, num_heads)

   Bases: :py:obj:`torch.nn.Module`

   Multi-head self-attention for the vision encoder (no KV cache).

   .. py:attribute:: embed_dim

   .. py:attribute:: num_heads

   .. py:attribute:: head_dim

   .. py:attribute:: qkv_proj

   .. py:attribute:: out_proj

   .. py:method:: forward(x, cu_seqlens, rotary_pos_emb_cos, rotary_pos_emb_sin)

      Forward pass with variable-length sequences via ``cu_seqlens``.

      :param x: ``[total_tokens, embed_dim]``
      :param cu_seqlens: ``[num_seqs + 1]`` cumulative sequence lengths
      :param rotary_pos_emb_cos: ``[total_tokens, rotary_dim]``
      :param rotary_pos_emb_sin: ``[total_tokens, rotary_dim]``

.. py:class:: Qwen3VisionBlock(dim, num_heads, intermediate_dim, hidden_act = 'silu', norm_eps = 1e-06)

   Bases: :py:obj:`torch.nn.Module`

   Single vision transformer block.

   .. py:attribute:: norm1

   .. py:attribute:: norm2

   .. py:attribute:: attn

   .. py:attribute:: mlp

   .. py:method:: forward(x, cu_seqlens, rotary_pos_emb_cos, rotary_pos_emb_sin)

.. py:class:: Qwen3VLVisionPatchMerger(dim, context_dim, spatial_merge_size = 2, use_postshuffle_norm = False, norm_eps = 1e-06)

   Bases: :py:obj:`torch.nn.Module`

   Merges spatial patches to reduce sequence length.

   Groups ``spatial_merge_size ** 2`` consecutive patch tokens and projects
   them to the language model hidden dimension.

   .. py:attribute:: hidden_size

   .. py:attribute:: use_postshuffle_norm
      :value: False

   .. py:attribute:: norm

   .. py:attribute:: linear_fc1

   .. py:attribute:: act_fn

   .. py:attribute:: linear_fc2

   .. py:method:: forward(x)

.. py:class:: Qwen3VLVisionModel(depth = 27, hidden_size = 1152, hidden_act = 'gelu_pytorch_tanh', intermediate_size = 4304, num_heads = 16, in_channels = 3, patch_size = 16, spatial_merge_size = 2, temporal_patch_size = 2, out_hidden_size = 3584, num_position_embeddings = 2304, deepstack_visual_indexes = None, norm_eps = 1e-06)

   Bases: :py:obj:`torch.nn.Module`

   Complete vision encoder for Qwen3-VL.

   Produces patch embeddings from raw pixel values, applies a stack of vision
   transformer blocks with 3D rotary embeddings, then merges spatial patches.
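The ``spatial_merge_size ** 2`` grouping performed by ``Qwen3VLVisionPatchMerger`` can be sketched in pure Python. This is an illustrative toy version only: it treats tokens as plain feature lists, groups by 2x2 spatial neighborhood from a row-major grid (the real model pre-arranges tokens so that each group is consecutive), and omits the norm and linear projection.

```python
# Toy sketch of spatial patch merging (illustrative only, not the real code).
# The real merger operates on torch tensors and ends with a linear projection
# to the language-model hidden size; here we only show the token grouping.

def merge_spatial_patches(tokens, grid_h, grid_w, merge=2):
    """Group each merge x merge neighborhood of patch tokens and
    concatenate their features into one merged token."""
    assert grid_h % merge == 0 and grid_w % merge == 0
    merged = []
    for bh in range(0, grid_h, merge):        # top-left row of each block
        for bw in range(0, grid_w, merge):    # top-left col of each block
            feats = []
            for dh in range(merge):
                for dw in range(merge):
                    feats.extend(tokens[(bh + dh) * grid_w + (bw + dw)])
            merged.append(feats)              # merge**2 tokens -> 1 wider token
    return merged

# A 4x4 grid of 1-dim tokens becomes a 2x2 grid of 4-dim merged tokens,
# cutting the sequence length by merge**2 = 4.
tokens = [[float(i)] for i in range(16)]
out = merge_spatial_patches(tokens, 4, 4)
print(len(out), len(out[0]))  # 4 4
```

This is why the merged grid dimensions used by the language model are ``H // spatial_merge_size`` and ``W // spatial_merge_size``.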
   Supports "deep stack", where intermediate layer outputs are captured and
   concatenated to the final output.

   .. py:attribute:: hidden_size
      :value: 1152

   .. py:attribute:: num_heads
      :value: 16

   .. py:attribute:: num_position_embeddings
      :value: 2304

   .. py:attribute:: num_grid_per_side
      :value: 0

   .. py:attribute:: patch_size
      :value: 16

   .. py:attribute:: spatial_merge_size
      :value: 2

   .. py:attribute:: temporal_patch_size
      :value: 2

   .. py:attribute:: deepstack_visual_indexes
      :value: None

   .. py:attribute:: out_hidden_size

   .. py:attribute:: patch_embed

   .. py:attribute:: pos_embed

   .. py:attribute:: blocks

   .. py:attribute:: merger

   .. py:attribute:: deepstack_merger_list

   .. py:property:: dtype
      :type: torch.dtype

   .. py:property:: device
      :type: torch.device

   .. py:method:: rot_pos_emb(grid_thw)

      Compute rotary pos-emb cos/sin for all images/videos in the batch.

   .. py:method:: fast_pos_embed_interpolate(grid_thw)

      Resample position embeddings to the target grid via bilinear
      interpolation.

   .. py:method:: forward(x, grid_thw)

      Run the vision encoder.

      :param x: Pixel values, shape ``[total_patches, patch_dim]``.
      :param grid_thw: Grid dimensions ``[num_images, 3]`` with ``(T, H, W)``.
      :returns: Vision features of shape
         ``[num_merged_tokens, out_hidden_size * (1 + num_deepstack)]``.

.. py:function:: get_rope_index(input_ids, image_grid_thw, image_token_id, vision_start_token_id, spatial_merge_size)

   Compute M-RoPE 3-D position IDs for one sequence.

   For text tokens, all three (temporal, height, width) indices are equal to
   the sequential counter. For image tokens, the indices follow the spatial
   grid ``(t, h, w)``.

   :param input_ids: Token IDs for one sequence, shape ``[T]``.
   :param image_grid_thw: Grid dimensions for every image in the sequence,
      shape ``[num_images, 3]``. ``None`` when there are no images.
   :param image_token_id: Token ID used as a placeholder for image patches.
   :param vision_start_token_id: Token ID that precedes each image block.
   :param spatial_merge_size: Number of patches merged per spatial dimension
      (e.g. 2 → 2x2 merge, so ``llm_grid_h = H // 2``).
   :returns: ``(position_ids, mrope_position_delta)``, where ``position_ids``
      has shape ``[3, T]`` and ``mrope_position_delta`` is a Python ``int``
      equal to ``max_position_used + 1 - T``.

.. py:class:: Qwen3VLAttention(hidden_size, num_heads, num_kv_heads, head_dim, layer_id, rope_theta = 5000000.0, rms_norm_eps = 1e-06, mrope_section = (24, 20, 20), mrope_interleaved = True, max_position_embeddings = 32768, quant_config=None, prefix = '')

   Bases: :py:obj:`torch.nn.Module`

   Attention layer for the Qwen3-VL text decoder.

   Uses QK-norm (per-head RMSNorm on Q and K before RoPE) and
   :class:`RadixAttention` for KV-cached inference. Applies interleaved
   M-RoPE with a precomputed cos/sin cache.

   .. py:attribute:: num_heads

   .. py:attribute:: num_kv_heads

   .. py:attribute:: head_dim

   .. py:attribute:: q_size

   .. py:attribute:: kv_size

   .. py:attribute:: scaling

   .. py:attribute:: mrope_section
      :value: [24, 20, 20]

   .. py:attribute:: mrope_interleaved
      :value: True

   .. py:attribute:: use_fused_qkv

   .. py:attribute:: o_proj

   .. py:attribute:: q_norm

   .. py:attribute:: k_norm

   .. py:attribute:: attn

   .. py:method:: forward(positions, hidden_states, forward_batch)

.. py:class:: Qwen3VLDecoderLayer(hidden_size, num_heads, num_kv_heads, head_dim, intermediate_size, layer_id, rope_theta = 5000000.0, rms_norm_eps = 1e-06, mrope_section = (24, 20, 20), mrope_interleaved = True, max_position_embeddings = 32768, quant_config=None, prefix = '')

   Bases: :py:obj:`torch.nn.Module`

   Single decoder layer for the Qwen3-VL text model.

   .. py:attribute:: self_attn

   .. py:attribute:: mlp

   .. py:attribute:: input_layernorm

   .. py:attribute:: post_attention_layernorm

   .. py:method:: forward(positions, hidden_states, forward_batch, deepstack_embeds = None)

.. py:class:: Qwen3VLTextModel(vocab_size = 151936, hidden_size = 4096, intermediate_size = 22016, num_hidden_layers = 32, num_attention_heads = 32, num_key_value_heads = 32, head_dim = 128, rope_theta = 5000000.0, rms_norm_eps = 1e-06, mrope_section = (24, 20, 20), mrope_interleaved = True, max_position_embeddings = 32768, quant_config=None)

   Bases: :py:obj:`torch.nn.Module`

   Qwen3-VL text backbone (embedding + decoder layers + final norm).

   .. py:attribute:: hidden_size
      :value: 4096

   .. py:attribute:: num_hidden_layers
      :value: 32

   .. py:attribute:: embed_tokens

   .. py:attribute:: layers

   .. py:attribute:: norm

   .. py:method:: forward(input_ids, positions, forward_batch, input_embeds = None, input_deepstack_embeds = None)

.. py:class:: Qwen3VLForConditionalGeneration(config, quant_config=None)

   Bases: :py:obj:`torch.nn.Module`

   Qwen3-VL multimodal model for conditional generation.

   Combines a vision encoder and a text decoder. During prefill, image/video
   tokens are replaced with visual features from the vision encoder. During
   decode, the model runs only the text decoder.

   Forward interface::

       logits = model.forward(input_ids, positions, forward_batch)

   .. py:attribute:: config

   .. py:attribute:: quant_config
      :value: None

   .. py:attribute:: model

   .. py:attribute:: image_token_id

   .. py:attribute:: video_token_id

   .. py:attribute:: vision_start_token_id

   .. py:attribute:: spatial_merge_size

   .. py:method:: get_input_embeddings()

   .. py:method:: forward(input_ids, positions, forward_batch)

      Run the forward pass for Qwen3-VL.

      :param input_ids: Flattened input token IDs, shape ``[num_tokens]``.
      :param positions: Position IDs, shape ``[num_tokens]`` (1-D, from the
         model runner). Overridden internally with 3-D M-RoPE positions.
      :param forward_batch: :class:`ForwardBatch` with attention metadata.
      :returns: Logits tensor of shape ``[num_tokens, vocab_size]``.

   .. py:method:: load_weights(weights)

      Load weights from a HuggingFace checkpoint.
      Handles weight-name remapping between HuggingFace Qwen3-VL checkpoints
      and this model's parameter names.
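The M-RoPE indexing scheme documented for ``get_rope_index`` above can be sketched in pure Python. This is an illustrative toy version under stated assumptions, not the real implementation: it handles a single image, ignores videos and the ``vision_start`` token, and ``IMAGE_TOKEN = 99`` is a hypothetical placeholder ID.

```python
# Toy sketch of M-RoPE 3-D position IDs (single image, no video handling).
# Text tokens: temporal == height == width == running counter.
# Image tokens: indices follow the (t, h, w) grid after spatial merging.

IMAGE_TOKEN = 99  # hypothetical placeholder token ID (assumption)

def toy_rope_index(input_ids, grid_thw, merge=2):
    t_grid, h_grid, w_grid = grid_thw
    llm_h, llm_w = h_grid // merge, w_grid // merge   # post-merge grid
    pos = [[], [], []]                                # temporal, height, width
    counter = 0
    i = 0
    while i < len(input_ids):
        if input_ids[i] == IMAGE_TOKEN:
            start = counter
            for t in range(t_grid):                   # grid-shaped indices
                for h in range(llm_h):
                    for w in range(llm_w):
                        pos[0].append(start + t)
                        pos[1].append(start + h)
                        pos[2].append(start + w)
            i += t_grid * llm_h * llm_w
            # Text after the image resumes past the largest grid extent.
            counter = start + max(t_grid, llm_h, llm_w)
        else:
            for row in pos:                           # text: all three equal
                row.append(counter)
            counter += 1
            i += 1
    # mrope_position_delta = max_position_used + 1 - T (can be negative).
    delta = max(max(row) for row in pos) + 1 - len(input_ids)
    return pos, delta

# 2 text tokens, a 1x4x4 image (2x2 = 4 merged tokens), then 1 text token.
ids = [1, 2] + [IMAGE_TOKEN] * 4 + [3]
pos, delta = toy_rope_index(ids, (1, 4, 4))
print(delta)  # -2: the 2-D image grid uses fewer positions than tokens
```

A negative delta reflects the point of M-RoPE: a 2-D image grid consumes fewer distinct position values than the number of tokens it occupies, so subsequent decode positions are shifted back by ``mrope_position_delta``.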