pymllm.models.qwen3_vl

Inference-only Qwen3-VL model for pymllm.

Adapted from sglang’s Qwen3-VL implementation for pymllm’s single-GPU inference architecture. Uses pymllm layers (RadixAttention, RMSNorm, MLP) and conforms to the pymllm forward interface:

model.forward(input_ids, positions, forward_batch)

Designed for a single accelerator card — no tensor / pipeline parallelism.

Attributes

logger

Classes

Qwen3VisionMLP

MLP block for the vision encoder.

Qwen3VLVisionPatchEmbed

3D convolution patch embedding for video/image patchification.

Qwen3VisionAttention

Multi-head self-attention for the vision encoder (no KV cache).

Qwen3VisionBlock

Single vision transformer block.

Qwen3VLVisionPatchMerger

Merges spatial patches to reduce sequence length.

Qwen3VLVisionModel

Complete vision encoder for Qwen3-VL.

Qwen3VLAttention

Attention layer for the Qwen3-VL text decoder.

Qwen3VLDecoderLayer

Single decoder layer for the Qwen3-VL text model.

Qwen3VLTextModel

Qwen3-VL text backbone (embedding + decoder layers + final norm).

Qwen3VLForConditionalGeneration

Qwen3-VL multimodal model for conditional generation.

Functions

get_rope_index(input_ids, image_grid_thw, ...)

Compute M-RoPE 3-D position IDs for one sequence.

Module Contents

pymllm.models.qwen3_vl.logger
class pymllm.models.qwen3_vl.Qwen3VisionMLP(in_features, hidden_features, hidden_act='silu', bias=True)

Bases: torch.nn.Module

MLP block for the vision encoder.

Parameters:
  • in_features (int)

  • hidden_features (int)

  • hidden_act (str)

  • bias (bool)

linear_fc1
linear_fc2
forward(x)
Parameters:

x (torch.Tensor)

Return type:

torch.Tensor

class pymllm.models.qwen3_vl.Qwen3VLVisionPatchEmbed(patch_size=16, temporal_patch_size=2, in_channels=3, embed_dim=1152)

Bases: torch.nn.Module

3D convolution patch embedding for video/image patchification.

Parameters:
  • patch_size (int)

  • temporal_patch_size (int)

  • in_channels (int)

  • embed_dim (int)

patch_size = 16
temporal_patch_size = 2
in_channels = 3
embed_dim = 1152
proj
forward(hidden_states)
Parameters:

hidden_states (torch.Tensor)

Return type:

torch.Tensor
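
With the defaults above, each flattened patch carries `in_channels * temporal_patch_size * patch_size ** 2` values, which fixes the expected input width of `forward`. A quick sanity check of the sizes (pure Python; the 224x224 example assumes the usual Qwen-VL convention that a still image is replicated to `temporal_patch_size` frames):

```python
# Default Qwen3VLVisionPatchEmbed geometry (values from the signature above).
in_channels = 3
temporal_patch_size = 2
patch_size = 16

# Each flattened patch seen by forward() has this many values:
patch_dim = in_channels * temporal_patch_size * patch_size ** 2
print(patch_dim)  # 1536

# A 224x224 image padded to temporal_patch_size frames yields a
# (T=1, H=14, W=14) patch grid, i.e. 196 patches of width patch_dim.
grid_t, grid_h, grid_w = 1, 224 // patch_size, 224 // patch_size
print(grid_t * grid_h * grid_w)  # 196
```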

class pymllm.models.qwen3_vl.Qwen3VisionAttention(embed_dim, num_heads)

Bases: torch.nn.Module

Multi-head self-attention for the vision encoder (no KV cache).

Parameters:
  • embed_dim (int)

  • num_heads (int)

embed_dim
num_heads
head_dim
qkv_proj
out_proj
forward(x, cu_seqlens, rotary_pos_emb_cos, rotary_pos_emb_sin)

Forward pass with variable-length sequences via cu_seqlens.

Parameters:
  • x (torch.Tensor) – [total_tokens, embed_dim]

  • cu_seqlens (torch.Tensor) – [num_seqs + 1] cumulative sequence lengths

  • rotary_pos_emb_cos (torch.Tensor) – [total_tokens, rotary_dim]

  • rotary_pos_emb_sin (torch.Tensor) – [total_tokens, rotary_dim]

Return type:

torch.Tensor
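
The `cu_seqlens` layout follows the standard varlen-attention convention: entry `i` is the cumulative token count before sequence `i`, so sequence `i` occupies rows `cu_seqlens[i]:cu_seqlens[i+1]` of the packed `x`. A minimal sketch of building it from hypothetical per-image patch counts (pure Python; the real code operates on tensors):

```python
from itertools import accumulate

# Hypothetical per-image patch counts in one packed batch.
seq_lens = [196, 64, 300]

# cu_seqlens has num_seqs + 1 entries.
cu_seqlens = [0] + list(accumulate(seq_lens))
print(cu_seqlens)  # [0, 196, 260, 560]

# Rows of the packed tensor belonging to sequence i:
i = 1
start, end = cu_seqlens[i], cu_seqlens[i + 1]
print(end - start)  # 64
```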

class pymllm.models.qwen3_vl.Qwen3VisionBlock(dim, num_heads, intermediate_dim, hidden_act='silu', norm_eps=1e-06)

Bases: torch.nn.Module

Single vision transformer block.

Parameters:
  • dim (int)

  • num_heads (int)

  • intermediate_dim (int)

  • hidden_act (str)

  • norm_eps (float)

norm1
norm2
attn
mlp
forward(x, cu_seqlens, rotary_pos_emb_cos, rotary_pos_emb_sin)
Parameters:
  • x (torch.Tensor)

  • cu_seqlens (torch.Tensor)

  • rotary_pos_emb_cos (torch.Tensor)

  • rotary_pos_emb_sin (torch.Tensor)

Return type:

torch.Tensor

class pymllm.models.qwen3_vl.Qwen3VLVisionPatchMerger(dim, context_dim, spatial_merge_size=2, use_postshuffle_norm=False, norm_eps=1e-06)

Bases: torch.nn.Module

Merges spatial patches to reduce sequence length.

Groups spatial_merge_size ** 2 consecutive patch tokens and projects them to the language model hidden dimension.

Parameters:
  • dim (int)

  • context_dim (int)

  • spatial_merge_size (int)

  • use_postshuffle_norm (bool)

  • norm_eps (float)

hidden_size
use_postshuffle_norm = False
norm
linear_fc1
act_fn
linear_fc2
forward(x)
Parameters:

x (torch.Tensor)

Return type:

torch.Tensor
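
The merge itself is a reshape: with `spatial_merge_size = 2`, every 4 consecutive patch tokens are concatenated along the feature dimension before projection, so the sequence shrinks 4x while the feature width grows 4x. A toy illustration (Python lists standing in for tensors; the projection back down to the language-model width is omitted):

```python
spatial_merge_size = 2
group = spatial_merge_size ** 2  # 4 tokens merged into one

# 8 patch tokens, each with a 3-dim feature (toy sizes).
tokens = [[i, i, i] for i in range(8)]

# Concatenate each run of `group` consecutive tokens along the feature axis.
merged = [
    sum(tokens[j : j + group], [])  # flatten the group into one vector
    for j in range(0, len(tokens), group)
]
print(len(merged), len(merged[0]))  # 2 12
```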

class pymllm.models.qwen3_vl.Qwen3VLVisionModel(depth=27, hidden_size=1152, hidden_act='gelu_pytorch_tanh', intermediate_size=4304, num_heads=16, in_channels=3, patch_size=16, spatial_merge_size=2, temporal_patch_size=2, out_hidden_size=3584, num_position_embeddings=2304, deepstack_visual_indexes=None, norm_eps=1e-06)

Bases: torch.nn.Module

Complete vision encoder for Qwen3-VL.

Produces patch embeddings from raw pixel values, applies a stack of vision transformer blocks with 3D rotary embeddings, then merges spatial patches. Supports “deep stack” where intermediate layer outputs are captured and concatenated to the final output.

Parameters:
  • depth (int)

  • hidden_size (int)

  • hidden_act (str)

  • intermediate_size (int)

  • num_heads (int)

  • in_channels (int)

  • patch_size (int)

  • spatial_merge_size (int)

  • temporal_patch_size (int)

  • out_hidden_size (int)

  • num_position_embeddings (int)

  • deepstack_visual_indexes (Optional[List[int]])

  • norm_eps (float)

hidden_size = 1152
num_heads = 16
num_position_embeddings = 2304
num_grid_per_side = 0
patch_size = 16
spatial_merge_size = 2
temporal_patch_size = 2
deepstack_visual_indexes = None
out_hidden_size
patch_embed
pos_embed
blocks
merger
deepstack_merger_list
property dtype: torch.dtype
Return type:

torch.dtype

property device: torch.device
Return type:

torch.device

rot_pos_emb(grid_thw)

Compute rotary pos-emb cos/sin for all images/videos in the batch.

Parameters:

grid_thw (List[List[int]])

Return type:

Tuple[torch.Tensor, torch.Tensor]

fast_pos_embed_interpolate(grid_thw)

Interpolate position embeddings via bilinear interpolation.

Parameters:

grid_thw (torch.Tensor)

Return type:

torch.Tensor
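
With `num_position_embeddings = 2304` the learned table is a 48x48 grid (48 ** 2 = 2304), and `fast_pos_embed_interpolate` resamples it to each image's patch grid. A scalar-valued sketch of the bilinear lookup, assuming the standard four-corner weighting (the real code interpolates embedding vectors, not scalars):

```python
def bilerp(grid, y, x):
    """Bilinear sample of a 2-D grid at fractional coordinates (y, x)."""
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(grid) - 1)
    x1 = min(x0 + 1, len(grid[0]) - 1)
    fy, fx = y - y0, x - x0
    return (grid[y0][x0] * (1 - fy) * (1 - fx)
            + grid[y0][x1] * (1 - fy) * fx
            + grid[y1][x0] * fy * (1 - fx)
            + grid[y1][x1] * fy * fx)

# Sampling halfway between the four corners averages them.
corners = [[0.0, 1.0], [2.0, 3.0]]
print(bilerp(corners, 0.5, 0.5))  # 1.5
```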

forward(x, grid_thw)

Run the vision encoder.

Parameters:
  • x (torch.Tensor) – Pixel values, shape [total_patches, patch_dim].

  • grid_thw (torch.Tensor) – Grid dimensions [num_images, 3] with (T, H, W).

Returns:

Vision features of shape [num_merged_tokens, out_hidden_size * (1 + num_deepstack)].

Return type:

torch.Tensor
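
The output length follows directly from `grid_thw` and `spatial_merge_size`: each image contributes `T * H * W` input patches and `T * (H // m) * (W // m)` merged tokens. A quick count under the documented default `spatial_merge_size = 2`, with made-up grids:

```python
m = 2  # spatial_merge_size

# grid_thw rows: (T, H, W) per image, as passed to forward().
grid_thw = [(1, 28, 28), (2, 16, 16)]

total_patches = sum(t * h * w for t, h, w in grid_thw)
merged_tokens = sum(t * (h // m) * (w // m) for t, h, w in grid_thw)
print(total_patches, merged_tokens)  # 1296 324
```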

pymllm.models.qwen3_vl.get_rope_index(input_ids, image_grid_thw, image_token_id, vision_start_token_id, spatial_merge_size)

Compute M-RoPE 3-D position IDs for one sequence.

For text tokens, all three (temporal, height, width) indices equal the sequential position counter. For image tokens, the indices follow the spatial grid (t, h, w).

Parameters:
  • input_ids (torch.Tensor) – Token IDs for one sequence, shape [T].

  • image_grid_thw (Optional[torch.Tensor]) – Grid dimensions for every image in the sequence, shape [num_images, 3]. None when there are no images.

  • image_token_id (int) – Token ID used as placeholder for image patches.

  • vision_start_token_id (int) – Token ID that precedes each image block.

  • spatial_merge_size (int) – Number of patches merged per spatial dimension (e.g. 2 → 2x2 merge, so llm_grid_h = H // 2).

Returns:

(position_ids, mrope_position_delta) where position_ids has shape [3, T] and mrope_position_delta is a Python int equal to max_position_used + 1 - T.

Return type:

Tuple[torch.Tensor, int]
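
A pure-Python sketch of the rule above, for a sequence whose image placeholder run has already been expanded to `T * (H // m) * (W // m)` tokens. It assumes the usual M-RoPE convention that text resumes past the image's largest grid extent; the real implementation works on tensors and handles more cases:

```python
def toy_rope_index(input_ids, image_grid_thw, image_token_id, merge):
    """Simplified single-sequence M-RoPE position computation."""
    pos = ([], [], [])  # temporal, height, width position lists
    start = 0           # next sequential position counter
    img = 0
    i = 0
    while i < len(input_ids):
        if input_ids[i] == image_token_id:
            t, h, w = image_grid_thw[img]
            lh, lw = h // merge, w // merge  # grid seen by the LLM
            for tt in range(t):
                for hh in range(lh):
                    for ww in range(lw):
                        pos[0].append(start + tt)
                        pos[1].append(start + hh)
                        pos[2].append(start + ww)
            i += t * lh * lw
            start += max(t, lh, lw)  # text resumes past the largest extent
            img += 1
        else:
            for d in range(3):       # text: all three indices are equal
                pos[d].append(start)
            start += 1
            i += 1
    max_pos = max(max(p) for p in pos)
    delta = max_pos + 1 - len(input_ids)
    return pos, delta

IMG = 99  # hypothetical placeholder token id
ids = [10, 11, IMG, IMG, IMG, IMG, 12]  # one 4x4 image, merged 2x2
(pt, ph, pw), delta = toy_rope_index(ids, [(1, 4, 4)], IMG, 2)
print(pt)     # [0, 1, 2, 2, 2, 2, 4]
print(delta)  # -2
```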

class pymllm.models.qwen3_vl.Qwen3VLAttention(hidden_size, num_heads, num_kv_heads, head_dim, layer_id, rope_theta=5000000.0, rms_norm_eps=1e-06, mrope_section=(24, 20, 20), mrope_interleaved=True, max_position_embeddings=32768, quant_config=None, prefix='')

Bases: torch.nn.Module

Attention layer for the Qwen3-VL text decoder.

Uses QK-norm (per-head RMSNorm on Q and K before RoPE) and RadixAttention for KV-cached inference. Applies interleaved M-RoPE with a precomputed cos/sin cache.

Parameters:
  • hidden_size (int)

  • num_heads (int)

  • num_kv_heads (int)

  • head_dim (int)

  • layer_id (int)

  • rope_theta (float)

  • rms_norm_eps (float)

  • mrope_section (Tuple[int, int, int])

  • mrope_interleaved (bool)

  • max_position_embeddings (int)

  • prefix (str)

num_heads
num_kv_heads
head_dim
q_size
kv_size
scaling
mrope_section = [24, 20, 20]
mrope_interleaved = True
use_fused_qkv
o_proj
q_norm
k_norm
attn
forward(positions, hidden_states, forward_batch)
Parameters:
  • positions (torch.Tensor)

  • hidden_states (torch.Tensor)

  • forward_batch (pymllm.engine.forward_batch.ForwardBatch)

Return type:

torch.Tensor

class pymllm.models.qwen3_vl.Qwen3VLDecoderLayer(hidden_size, num_heads, num_kv_heads, head_dim, intermediate_size, layer_id, rope_theta=5000000.0, rms_norm_eps=1e-06, mrope_section=(24, 20, 20), mrope_interleaved=True, max_position_embeddings=32768, quant_config=None, prefix='')

Bases: torch.nn.Module

Single decoder layer for the Qwen3-VL text model.

Parameters:
  • hidden_size (int)

  • num_heads (int)

  • num_kv_heads (int)

  • head_dim (int)

  • intermediate_size (int)

  • layer_id (int)

  • rope_theta (float)

  • rms_norm_eps (float)

  • mrope_section (Tuple[int, int, int])

  • mrope_interleaved (bool)

  • max_position_embeddings (int)

  • prefix (str)

self_attn
mlp
input_layernorm
post_attention_layernorm
forward(positions, hidden_states, forward_batch, deepstack_embeds=None)
Parameters:
  • positions (torch.Tensor)

  • hidden_states (torch.Tensor)

  • forward_batch (pymllm.engine.forward_batch.ForwardBatch)

  • deepstack_embeds (Optional[torch.Tensor])

Return type:

torch.Tensor

class pymllm.models.qwen3_vl.Qwen3VLTextModel(vocab_size=151936, hidden_size=4096, intermediate_size=22016, num_hidden_layers=32, num_attention_heads=32, num_key_value_heads=32, head_dim=128, rope_theta=5000000.0, rms_norm_eps=1e-06, mrope_section=(24, 20, 20), mrope_interleaved=True, max_position_embeddings=32768, quant_config=None)

Bases: torch.nn.Module

Qwen3-VL text backbone (embedding + decoder layers + final norm).

Parameters:
  • vocab_size (int)

  • hidden_size (int)

  • intermediate_size (int)

  • num_hidden_layers (int)

  • num_attention_heads (int)

  • num_key_value_heads (int)

  • head_dim (int)

  • rope_theta (float)

  • rms_norm_eps (float)

  • mrope_section (Tuple[int, int, int])

  • mrope_interleaved (bool)

  • max_position_embeddings (int)

hidden_size = 4096
num_hidden_layers = 32
embed_tokens
layers
norm
forward(input_ids, positions, forward_batch, input_embeds=None, input_deepstack_embeds=None)
Parameters:
  • input_ids (torch.Tensor)

  • positions (torch.Tensor)

  • forward_batch (pymllm.engine.forward_batch.ForwardBatch)

  • input_embeds (Optional[torch.Tensor])

  • input_deepstack_embeds (Optional[torch.Tensor])

Return type:

torch.Tensor

class pymllm.models.qwen3_vl.Qwen3VLForConditionalGeneration(config, quant_config=None)

Bases: torch.nn.Module

Qwen3-VL multimodal model for conditional generation.

Combines a vision encoder and text decoder. During prefill, image/video tokens are replaced with visual features from the vision encoder. During decode, the model runs only the text decoder.

Forward interface:

logits = model.forward(input_ids, positions, forward_batch)
config
quant_config = None
model
image_token_id
video_token_id
vision_start_token_id
spatial_merge_size
get_input_embeddings()
Return type:

torch.nn.Module

forward(input_ids, positions, forward_batch)

Run forward pass for Qwen3-VL.

Parameters:
  • input_ids (torch.Tensor) – Flattened input token IDs, shape [num_tokens].

  • positions (torch.Tensor) – Position IDs, shape [num_tokens] (1-D, from model runner). Overridden internally with 3-D M-RoPE positions.

  • forward_batch (pymllm.engine.forward_batch.ForwardBatch) – ForwardBatch with attention metadata.

Returns:

Logits tensor of shape [num_tokens, vocab_size].

Return type:

torch.Tensor
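
The prefill-time substitution described above amounts to a scatter: wherever `input_ids` holds the image placeholder, the text embedding row is overwritten by the next vision feature, in order. A toy sketch with lists in place of tensors (`IMG` is a hypothetical token id; the real code works on embedding matrices and also handles video tokens):

```python
IMG = 99  # hypothetical image placeholder token id

input_ids = [5, IMG, IMG, 7]
text_embeds = [[0.5], [0.0], [0.0], [0.7]]  # placeholder rows are dummies
vision_feats = [[1.0], [2.0]]               # one row per merged image token

feats = iter(vision_feats)
inputs_embeds = [
    next(feats) if tok == IMG else emb
    for tok, emb in zip(input_ids, text_embeds)
]
print(inputs_embeds)  # [[0.5], [1.0], [2.0], [0.7]]
```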

load_weights(weights)

Load weights from a HuggingFace checkpoint.

Handles weight name remapping between HuggingFace Qwen3-VL checkpoints and this model’s parameter names.

Parameters:

weights (Iterable[Tuple[str, torch.Tensor]])

Return type:

None