pymllm.models.qwen3_vl
======================

.. py:module:: pymllm.models.qwen3_vl

.. autoapi-nested-parse::

   Inference-only Qwen3-VL model for pymllm.

   Adapted from sglang's Qwen3-VL implementation for pymllm's single-GPU
   inference architecture. Uses pymllm layers (RadixAttention, RMSNorm, MLP)
   and conforms to the pymllm forward interface::

       model.forward(input_ids, positions, forward_batch)

   Designed for a single accelerator card; no tensor or pipeline parallelism.

Attributes
----------

.. autoapisummary::

   pymllm.models.qwen3_vl.logger

Classes
-------

.. autoapisummary::

   pymllm.models.qwen3_vl.Qwen3VisionMLP
   pymllm.models.qwen3_vl.Qwen3VLVisionPatchEmbed
   pymllm.models.qwen3_vl.Qwen3VisionAttention
   pymllm.models.qwen3_vl.Qwen3VisionBlock
   pymllm.models.qwen3_vl.Qwen3VLVisionPatchMerger
   pymllm.models.qwen3_vl.Qwen3VLVisionModel
   pymllm.models.qwen3_vl.Qwen3VLAttention
   pymllm.models.qwen3_vl.Qwen3VLDecoderLayer
   pymllm.models.qwen3_vl.Qwen3VLTextModel
   pymllm.models.qwen3_vl.Qwen3VLForConditionalGeneration

Functions
---------

.. autoapisummary::

   pymllm.models.qwen3_vl.get_rope_index

Module Contents
---------------

.. py:data:: logger

.. py:class:: Qwen3VisionMLP(in_features, hidden_features, hidden_act = 'silu', bias = True)

   Bases: :py:obj:`torch.nn.Module`

   MLP block for the vision encoder.

   .. py:attribute:: linear_fc1

   .. py:attribute:: linear_fc2

   .. py:method:: forward(x)

.. py:class:: Qwen3VLVisionPatchEmbed(patch_size = 16, temporal_patch_size = 2, in_channels = 3, embed_dim = 1152)

   Bases: :py:obj:`torch.nn.Module`

   3D convolution patch embedding for video/image patchification.

   .. py:attribute:: patch_size
      :value: 16

   .. py:attribute:: temporal_patch_size
      :value: 2

   .. py:attribute:: in_channels
      :value: 3

   .. py:attribute:: embed_dim
      :value: 1152

   .. py:attribute:: proj

   .. py:method:: forward(hidden_states)

.. py:class:: Qwen3VisionAttention(embed_dim, num_heads)

   Bases: :py:obj:`torch.nn.Module`

   Multi-head self-attention for the vision encoder (no KV cache).

   .. py:attribute:: embed_dim

   .. py:attribute:: num_heads

   .. py:attribute:: head_dim

   .. py:attribute:: qkv_proj

   .. py:attribute:: out_proj

   .. py:method:: forward(x, cu_seqlens, rotary_pos_emb_cos, rotary_pos_emb_sin)

      Forward pass with variable-length sequences via ``cu_seqlens``.

      :param x: ``[total_tokens, embed_dim]``
      :param cu_seqlens: ``[num_seqs + 1]`` cumulative sequence lengths
      :param rotary_pos_emb_cos: ``[total_tokens, rotary_dim]``
      :param rotary_pos_emb_sin: ``[total_tokens, rotary_dim]``

.. py:class:: Qwen3VisionBlock(dim, num_heads, intermediate_dim, hidden_act = 'silu', norm_eps = 1e-06)

   Bases: :py:obj:`torch.nn.Module`

   Single vision transformer block.

   .. py:attribute:: norm1

   .. py:attribute:: norm2

   .. py:attribute:: attn

   .. py:attribute:: mlp

   .. py:method:: forward(x, cu_seqlens, rotary_pos_emb_cos, rotary_pos_emb_sin)

.. py:class:: Qwen3VLVisionPatchMerger(dim, context_dim, spatial_merge_size = 2, use_postshuffle_norm = False, norm_eps = 1e-06)

   Bases: :py:obj:`torch.nn.Module`

   Merges spatial patches to reduce sequence length.

   Groups ``spatial_merge_size ** 2`` consecutive patch tokens and projects
   them to the language model hidden dimension.

   .. py:attribute:: hidden_size

   .. py:attribute:: use_postshuffle_norm
      :value: False

   .. py:attribute:: norm

   .. py:attribute:: linear_fc1

   .. py:attribute:: act_fn

   .. py:attribute:: linear_fc2

   .. py:method:: forward(x)

.. py:class:: Qwen3VLVisionModel(depth = 27, hidden_size = 1152, hidden_act = 'gelu_pytorch_tanh', intermediate_size = 4304, num_heads = 16, in_channels = 3, patch_size = 16, spatial_merge_size = 2, temporal_patch_size = 2, out_hidden_size = 3584, num_position_embeddings = 2304, deepstack_visual_indexes = None, norm_eps = 1e-06)

   Bases: :py:obj:`torch.nn.Module`

   Complete vision encoder for Qwen3-VL.

   Produces patch embeddings from raw pixel values, applies a stack of vision
   transformer blocks with 3D rotary embeddings, then merges spatial patches.
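The ``spatial_merge_size ** 2`` grouping performed by ``Qwen3VLVisionPatchMerger`` can be sketched in pure Python. This is an illustrative toy version only: it treats tokens as plain feature lists, groups by 2x2 spatial neighborhood from a row-major grid (the real model pre-arranges tokens so that each group is consecutive), and omits the norm and linear projection.

```python
# Toy sketch of spatial patch merging (illustrative only, not the real code).
# The real merger operates on torch tensors and ends with a linear projection
# to the language-model hidden size; here we only show the token grouping.

def merge_spatial_patches(tokens, grid_h, grid_w, merge=2):
    """Group each merge x merge neighborhood of patch tokens and
    concatenate their features into one merged token."""
    assert grid_h % merge == 0 and grid_w % merge == 0
    merged = []
    for bh in range(0, grid_h, merge):        # top-left row of each block
        for bw in range(0, grid_w, merge):    # top-left col of each block
            feats = []
            for dh in range(merge):
                for dw in range(merge):
                    feats.extend(tokens[(bh + dh) * grid_w + (bw + dw)])
            merged.append(feats)              # merge**2 tokens -> 1 wider token
    return merged

# A 4x4 grid of 1-dim tokens becomes a 2x2 grid of 4-dim merged tokens,
# cutting the sequence length by merge**2 = 4.
tokens = [[float(i)] for i in range(16)]
out = merge_spatial_patches(tokens, 4, 4)
print(len(out), len(out[0]))  # 4 4
```

This is why the merged grid dimensions used by the language model are ``H // spatial_merge_size`` and ``W // spatial_merge_size``.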
   Supports "deep stack", where intermediate layer outputs are captured and
   concatenated to the final output.

   .. py:attribute:: hidden_size
      :value: 1152

   .. py:attribute:: num_heads
      :value: 16

   .. py:attribute:: num_position_embeddings
      :value: 2304

   .. py:attribute:: num_grid_per_side
      :value: 0

   .. py:attribute:: patch_size
      :value: 16

   .. py:attribute:: spatial_merge_size
      :value: 2

   .. py:attribute:: temporal_patch_size
      :value: 2

   .. py:attribute:: deepstack_visual_indexes
      :value: None

   .. py:attribute:: out_hidden_size

   .. py:attribute:: patch_embed

   .. py:attribute:: pos_embed

   .. py:attribute:: blocks

   .. py:attribute:: merger

   .. py:attribute:: deepstack_merger_list

   .. py:property:: dtype
      :type: torch.dtype

   .. py:property:: device
      :type: torch.device

   .. py:method:: rot_pos_emb(grid_thw)

      Compute rotary pos-emb cos/sin for all images/videos in the batch.

   .. py:method:: fast_pos_embed_interpolate(grid_thw)

      Resample position embeddings to the target grid via bilinear
      interpolation.

   .. py:method:: forward(x, grid_thw)

      Run the vision encoder.

      :param x: Pixel values, shape ``[total_patches, patch_dim]``.
      :param grid_thw: Grid dimensions ``[num_images, 3]`` with ``(T, H, W)``.
      :returns: Vision features of shape
         ``[num_merged_tokens, out_hidden_size * (1 + num_deepstack)]``.

.. py:function:: get_rope_index(input_ids, image_grid_thw, image_token_id, vision_start_token_id, spatial_merge_size)

   Compute M-RoPE 3-D position IDs for one sequence.

   For text tokens, all three (temporal, height, width) indices are equal to
   the sequential counter. For image tokens, the indices follow the spatial
   grid ``(t, h, w)``.

   :param input_ids: Token IDs for one sequence, shape ``[T]``.
   :param image_grid_thw: Grid dimensions for every image in the sequence,
      shape ``[num_images, 3]``. ``None`` when there are no images.
   :param image_token_id: Token ID used as a placeholder for image patches.
   :param vision_start_token_id: Token ID that precedes each image block.
   :param spatial_merge_size: Number of patches merged per spatial dimension
      (e.g. 2 → 2x2 merge, so ``llm_grid_h = H // 2``).
   :returns: ``(position_ids, mrope_position_delta)``, where ``position_ids``
      has shape ``[3, T]`` and ``mrope_position_delta`` is a Python ``int``
      equal to ``max_position_used + 1 - T``.

.. py:class:: Qwen3VLAttention(hidden_size, num_heads, num_kv_heads, head_dim, layer_id, rope_theta = 5000000.0, rms_norm_eps = 1e-06, mrope_section = (24, 20, 20), mrope_interleaved = True, max_position_embeddings = 32768, quant_config=None, prefix = '')

   Bases: :py:obj:`torch.nn.Module`

   Attention layer for the Qwen3-VL text decoder.

   Uses QK-norm (per-head RMSNorm on Q and K before RoPE) and
   :class:`RadixAttention` for KV-cached inference. Applies interleaved
   M-RoPE with a precomputed cos/sin cache.

   .. py:attribute:: num_heads

   .. py:attribute:: num_kv_heads

   .. py:attribute:: head_dim

   .. py:attribute:: q_size

   .. py:attribute:: kv_size

   .. py:attribute:: scaling

   .. py:attribute:: mrope_section
      :value: [24, 20, 20]

   .. py:attribute:: mrope_interleaved
      :value: True

   .. py:attribute:: use_fused_qkv

   .. py:attribute:: o_proj

   .. py:attribute:: q_norm

   .. py:attribute:: k_norm

   .. py:attribute:: attn

   .. py:method:: forward(positions, hidden_states, forward_batch)

.. py:class:: Qwen3VLDecoderLayer(hidden_size, num_heads, num_kv_heads, head_dim, intermediate_size, layer_id, rope_theta = 5000000.0, rms_norm_eps = 1e-06, mrope_section = (24, 20, 20), mrope_interleaved = True, max_position_embeddings = 32768, quant_config=None, prefix = '')

   Bases: :py:obj:`torch.nn.Module`

   Single decoder layer for the Qwen3-VL text model.

   .. py:attribute:: self_attn

   .. py:attribute:: mlp

   .. py:attribute:: input_layernorm

   .. py:attribute:: post_attention_layernorm

   .. py:method:: forward(positions, hidden_states, forward_batch, deepstack_embeds = None)

.. py:class:: Qwen3VLTextModel(vocab_size = 151936, hidden_size = 4096, intermediate_size = 22016, num_hidden_layers = 32, num_attention_heads = 32, num_key_value_heads = 32, head_dim = 128, rope_theta = 5000000.0, rms_norm_eps = 1e-06, mrope_section = (24, 20, 20), mrope_interleaved = True, max_position_embeddings = 32768, quant_config=None)

   Bases: :py:obj:`torch.nn.Module`

   Qwen3-VL text backbone (embedding + decoder layers + final norm).

   .. py:attribute:: hidden_size
      :value: 4096

   .. py:attribute:: num_hidden_layers
      :value: 32

   .. py:attribute:: embed_tokens

   .. py:attribute:: layers

   .. py:attribute:: norm

   .. py:method:: forward(input_ids, positions, forward_batch, input_embeds = None, input_deepstack_embeds = None)

.. py:class:: Qwen3VLForConditionalGeneration(config, quant_config=None)

   Bases: :py:obj:`torch.nn.Module`

   Qwen3-VL multimodal model for conditional generation.

   Combines a vision encoder and a text decoder. During prefill, image/video
   tokens are replaced with visual features from the vision encoder. During
   decode, the model runs only the text decoder.

   Forward interface::

       logits = model.forward(input_ids, positions, forward_batch)

   .. py:attribute:: config

   .. py:attribute:: quant_config
      :value: None

   .. py:attribute:: model

   .. py:attribute:: image_token_id

   .. py:attribute:: video_token_id

   .. py:attribute:: vision_start_token_id

   .. py:attribute:: spatial_merge_size

   .. py:method:: get_input_embeddings()

   .. py:method:: forward(input_ids, positions, forward_batch)

      Run the forward pass for Qwen3-VL.

      :param input_ids: Flattened input token IDs, shape ``[num_tokens]``.
      :param positions: Position IDs, shape ``[num_tokens]`` (1-D, from the
         model runner). Overridden internally with 3-D M-RoPE positions.
      :param forward_batch: :class:`ForwardBatch` with attention metadata.
      :returns: Logits tensor of shape ``[num_tokens, vocab_size]``.

   .. py:method:: load_weights(weights)

      Load weights from a HuggingFace checkpoint.
      Handles weight-name remapping between HuggingFace Qwen3-VL checkpoints
      and this model's parameter names.
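The M-RoPE indexing scheme documented for ``get_rope_index`` above can be sketched in pure Python. This is an illustrative toy version under stated assumptions, not the real implementation: it handles a single image, ignores videos and the ``vision_start`` token, and ``IMAGE_TOKEN = 99`` is a hypothetical placeholder ID.

```python
# Toy sketch of M-RoPE 3-D position IDs (single image, no video handling).
# Text tokens: temporal == height == width == running counter.
# Image tokens: indices follow the (t, h, w) grid after spatial merging.

IMAGE_TOKEN = 99  # hypothetical placeholder token ID (assumption)

def toy_rope_index(input_ids, grid_thw, merge=2):
    t_grid, h_grid, w_grid = grid_thw
    llm_h, llm_w = h_grid // merge, w_grid // merge   # post-merge grid
    pos = [[], [], []]                                # temporal, height, width
    counter = 0
    i = 0
    while i < len(input_ids):
        if input_ids[i] == IMAGE_TOKEN:
            start = counter
            for t in range(t_grid):                   # grid-shaped indices
                for h in range(llm_h):
                    for w in range(llm_w):
                        pos[0].append(start + t)
                        pos[1].append(start + h)
                        pos[2].append(start + w)
            i += t_grid * llm_h * llm_w
            # Text after the image resumes past the largest grid extent.
            counter = start + max(t_grid, llm_h, llm_w)
        else:
            for row in pos:                           # text: all three equal
                row.append(counter)
            counter += 1
            i += 1
    # mrope_position_delta = max_position_used + 1 - T (can be negative).
    delta = max(max(row) for row in pos) + 1 - len(input_ids)
    return pos, delta

# 2 text tokens, a 1x4x4 image (2x2 = 4 merged tokens), then 1 text token.
ids = [1, 2] + [IMAGE_TOKEN] * 4 + [3]
pos, delta = toy_rope_index(ids, (1, 4, 4))
print(delta)  # -2: the 2-D image grid uses fewer positions than tokens
```

A negative delta reflects the point of M-RoPE: a 2-D image grid consumes fewer distinct position values than the number of tokens it occupies, so subsequent decode positions are shifted back by ``mrope_position_delta``.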