pymllm.models.qwen3_vl¶
Inference-only Qwen3-VL model for pymllm.
Adapted from sglang’s Qwen3-VL implementation for pymllm’s single-GPU inference architecture. Uses pymllm layers (RadixAttention, RMSNorm, MLP) and conforms to the pymllm forward interface:
model.forward(input_ids, positions, forward_batch)
Designed for a single accelerator card — no tensor / pipeline parallelism.
Attributes¶
Classes¶
Qwen3VisionMLP | MLP block for the vision encoder.
Qwen3VLVisionPatchEmbed | 3D convolution patch embedding for video/image patchification.
Qwen3VisionAttention | Multi-head self-attention for the vision encoder (no KV cache).
Qwen3VisionBlock | Single vision transformer block.
Qwen3VLVisionPatchMerger | Merges spatial patches to reduce sequence length.
Qwen3VLVisionModel | Complete vision encoder for Qwen3-VL.
Qwen3VLAttention | Attention layer for the Qwen3-VL text decoder.
Qwen3VLDecoderLayer | Single decoder layer for the Qwen3-VL text model.
Qwen3VLTextModel | Qwen3-VL text backbone (embedding + decoder layers + final norm).
Qwen3VLForConditionalGeneration | Qwen3-VL multimodal model for conditional generation.
Functions¶
get_rope_index | Compute M-RoPE 3-D position IDs for one sequence.
Module Contents¶
- pymllm.models.qwen3_vl.logger¶
- class pymllm.models.qwen3_vl.Qwen3VisionMLP(in_features, hidden_features, hidden_act='silu', bias=True)¶
Bases: torch.nn.Module
MLP block for the vision encoder.
- Parameters:
in_features (int)
hidden_features (int)
hidden_act (str)
bias (bool)
- linear_fc1¶
- linear_fc2¶
- forward(x)¶
- Parameters:
x (torch.Tensor)
- Return type:
torch.Tensor
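Based on the attributes listed above, the block can be sketched as a plain fc1 → SiLU → fc2 stack; the exact layer types (pymllm's fused linear layers) and any gating are assumptions here.

```python
import torch
import torch.nn as nn

class VisionMLPSketch(nn.Module):
    """Minimal sketch of the vision-encoder MLP (linear_fc1 -> act -> linear_fc2)."""

    def __init__(self, in_features: int, hidden_features: int, bias: bool = True):
        super().__init__()
        self.linear_fc1 = nn.Linear(in_features, hidden_features, bias=bias)
        self.act = nn.SiLU()  # hidden_act='silu'
        self.linear_fc2 = nn.Linear(hidden_features, in_features, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear_fc2(self.act(self.linear_fc1(x)))

mlp = VisionMLPSketch(8, 32)
out = mlp(torch.randn(5, 8))  # shape preserved: [5, 8]
```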
- class pymllm.models.qwen3_vl.Qwen3VLVisionPatchEmbed(patch_size=16, temporal_patch_size=2, in_channels=3, embed_dim=1152)¶
Bases: torch.nn.Module
3D convolution patch embedding for video/image patchification.
- Parameters:
patch_size (int)
temporal_patch_size (int)
in_channels (int)
embed_dim (int)
- patch_size = 16¶
- temporal_patch_size = 2¶
- in_channels = 3¶
- embed_dim = 1152¶
- proj¶
- forward(hidden_states)¶
- Parameters:
hidden_states (torch.Tensor)
- Return type:
torch.Tensor
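With the default parameters, the patchification is a `Conv3d` whose kernel and stride both equal the patch volume, applied to patches flattened to `[total_patches, patch_dim]`. The reshaping below is a sketch of that convention, not the exact implementation.

```python
import torch
import torch.nn as nn

# Each (temporal_patch_size, patch_size, patch_size) voxel of pixels
# becomes one embed_dim token. patch_dim = in_channels * temporal_patch_size
# * patch_size ** 2.
patch_size, temporal_patch_size, in_channels, embed_dim = 16, 2, 3, 1152
proj = nn.Conv3d(
    in_channels, embed_dim,
    kernel_size=(temporal_patch_size, patch_size, patch_size),
    stride=(temporal_patch_size, patch_size, patch_size),
)

patch_dim = in_channels * temporal_patch_size * patch_size ** 2
x = torch.randn(10, patch_dim)  # 10 flattened patches
x = x.view(-1, in_channels, temporal_patch_size, patch_size, patch_size)
tokens = proj(x).view(-1, embed_dim)  # -> [10, 1152]
```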
- class pymllm.models.qwen3_vl.Qwen3VisionAttention(embed_dim, num_heads)¶
Bases: torch.nn.Module
Multi-head self-attention for the vision encoder (no KV cache).
- Parameters:
embed_dim (int)
num_heads (int)
- embed_dim¶
- num_heads¶
- head_dim¶
- qkv_proj¶
- out_proj¶
- forward(x, cu_seqlens, rotary_pos_emb_cos, rotary_pos_emb_sin)¶
Forward pass with variable-length sequences via cu_seqlens.
- Parameters:
x (torch.Tensor) – [total_tokens, embed_dim]
cu_seqlens (torch.Tensor) – [num_seqs + 1] cumulative sequence lengths
rotary_pos_emb_cos (torch.Tensor) – [total_tokens, rotary_dim]
rotary_pos_emb_sin (torch.Tensor) – [total_tokens, rotary_dim]
- Return type:
torch.Tensor
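The `cu_seqlens` convention can be illustrated with a naive per-sequence loop; the real layer presumably fuses this into a variable-length attention kernel and applies the rotary embeddings to Q/K first.

```python
import torch
import torch.nn.functional as F

def varlen_attention(q, k, v, cu_seqlens):
    """Attend within each sequence of a packed [total_tokens, heads, head_dim]
    batch; cu_seqlens = [0, len0, len0+len1, ...] marks segment boundaries."""
    out = torch.empty_like(q)
    for i in range(len(cu_seqlens) - 1):
        s, e = cu_seqlens[i], cu_seqlens[i + 1]
        # [tokens, heads, dim] -> [1, heads, tokens, dim] for SDPA
        qs, ks, vs = (t[s:e].transpose(0, 1).unsqueeze(0) for t in (q, k, v))
        out[s:e] = F.scaled_dot_product_attention(qs, ks, vs)[0].transpose(0, 1)
    return out

q = k = v = torch.randn(6, 4, 8)  # 6 packed tokens, 4 heads, head_dim 8
o = varlen_attention(q, k, v, torch.tensor([0, 2, 6]))  # two sequences: 2 + 4
```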
- class pymllm.models.qwen3_vl.Qwen3VisionBlock(dim, num_heads, intermediate_dim, hidden_act='silu', norm_eps=1e-06)¶
Bases: torch.nn.Module
Single vision transformer block.
- Parameters:
dim (int)
num_heads (int)
intermediate_dim (int)
hidden_act (str)
norm_eps (float)
- norm1¶
- norm2¶
- attn¶
- mlp¶
- forward(x, cu_seqlens, rotary_pos_emb_cos, rotary_pos_emb_sin)¶
- Parameters:
x (torch.Tensor)
cu_seqlens (torch.Tensor)
rotary_pos_emb_cos (torch.Tensor)
rotary_pos_emb_sin (torch.Tensor)
- Return type:
torch.Tensor
- class pymllm.models.qwen3_vl.Qwen3VLVisionPatchMerger(dim, context_dim, spatial_merge_size=2, use_postshuffle_norm=False, norm_eps=1e-06)¶
Bases: torch.nn.Module
Merges spatial patches to reduce sequence length.
Groups spatial_merge_size ** 2 consecutive patch tokens and projects them to the language model hidden dimension.
- Parameters:
dim (int)
context_dim (int)
spatial_merge_size (int)
use_postshuffle_norm (bool)
norm_eps (float)
- use_postshuffle_norm = False¶
- norm¶
- linear_fc1¶
- act_fn¶
- linear_fc2¶
- forward(x)¶
- Parameters:
x (torch.Tensor)
- Return type:
torch.Tensor
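The merge amounts to flattening each group of `spatial_merge_size ** 2` tokens channel-wise before projecting. The sketch below uses small illustrative sizes (the real model uses context_dim=1152 and an output dimension of 3584) and assumes the norm is applied before grouping when `use_postshuffle_norm=False`, based on the flag name.

```python
import torch
import torch.nn as nn

context_dim, out_dim, merge = 16, 32, 2
norm = nn.LayerNorm(context_dim)  # norm flavor is an assumption
linear_fc1 = nn.Linear(context_dim * merge ** 2, context_dim * merge ** 2)
act_fn = nn.GELU()
linear_fc2 = nn.Linear(context_dim * merge ** 2, out_dim)

x = torch.randn(16, context_dim)  # 16 patch tokens
x = norm(x).view(-1, context_dim * merge ** 2)  # group 2x2 tokens -> [4, 64]
merged = linear_fc2(act_fn(linear_fc1(x)))      # -> [4, 32]
```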
- class pymllm.models.qwen3_vl.Qwen3VLVisionModel(depth=27, hidden_size=1152, hidden_act='gelu_pytorch_tanh', intermediate_size=4304, num_heads=16, in_channels=3, patch_size=16, spatial_merge_size=2, temporal_patch_size=2, out_hidden_size=3584, num_position_embeddings=2304, deepstack_visual_indexes=None, norm_eps=1e-06)¶
Bases: torch.nn.Module
Complete vision encoder for Qwen3-VL.
Produces patch embeddings from raw pixel values, applies a stack of vision transformer blocks with 3D rotary embeddings, then merges spatial patches. Supports “deep stack” where intermediate layer outputs are captured and concatenated to the final output.
- Parameters:
depth (int)
hidden_size (int)
hidden_act (str)
intermediate_size (int)
num_heads (int)
in_channels (int)
patch_size (int)
spatial_merge_size (int)
temporal_patch_size (int)
out_hidden_size (int)
num_position_embeddings (int)
deepstack_visual_indexes (Optional[List[int]])
norm_eps (float)
- num_heads = 16¶
- num_position_embeddings = 2304¶
- num_grid_per_side = 0¶
- patch_size = 16¶
- spatial_merge_size = 2¶
- temporal_patch_size = 2¶
- deepstack_visual_indexes = None¶
- patch_embed¶
- pos_embed¶
- blocks¶
- merger¶
- deepstack_merger_list¶
- property dtype: torch.dtype¶
- Return type:
torch.dtype
- property device: torch.device¶
- Return type:
torch.device
- rot_pos_emb(grid_thw)¶
Compute rotary pos-emb cos/sin for all images/videos in the batch.
- Parameters:
grid_thw (List[List[int]])
- Return type:
Tuple[torch.Tensor, torch.Tensor]
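A two-axis (h, w) simplification of the rotary tables can be sketched as below: inverse frequencies are evaluated at each patch's grid coordinates and the axis halves concatenated. The base frequency (10000) and per-axis rotary dimension are illustrative assumptions, and the real method also handles temporal frames and merged-patch ordering.

```python
import torch

rotary_dim = 8  # per-axis rotary dim (illustrative)
inv_freq = 1.0 / (10000 ** (torch.arange(0, rotary_dim, 2) / rotary_dim))

h_idx = torch.tensor([0, 0, 1, 1])  # 2x2 patch grid, row indices
w_idx = torch.tensor([0, 1, 0, 1])  # column indices
freqs = torch.cat([h_idx[:, None] * inv_freq, w_idx[:, None] * inv_freq], dim=-1)
cos, sin = freqs.cos(), freqs.sin()  # one (cos, sin) row per patch token
```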
- fast_pos_embed_interpolate(grid_thw)¶
Interpolate position embeddings via bilinear interpolation.
- Parameters:
grid_thw (torch.Tensor)
- Return type:
torch.Tensor
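The interpolation step can be sketched with `F.interpolate`: a learned square grid of position embeddings is resized to the image's patch grid. The grid side and dimensions below are illustrative, and the actual method may interpolate per frame and handle `num_grid_per_side` differently.

```python
import torch
import torch.nn.functional as F

side, dim = 48, 1152
pos_embed = torch.randn(side * side, dim)  # learned [side*side, dim] table

h, w = 12, 20  # target patch grid for one image
grid = pos_embed.view(1, side, side, dim).permute(0, 3, 1, 2)  # [1, dim, side, side]
resized = F.interpolate(grid, size=(h, w), mode="bilinear", align_corners=False)
out = resized.permute(0, 2, 3, 1).reshape(h * w, dim)  # one embedding per patch
```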
- forward(x, grid_thw)¶
Run the vision encoder.
- Parameters:
x (torch.Tensor) – Pixel values, shape [total_patches, patch_dim].
grid_thw (torch.Tensor) – Grid dimensions [num_images, 3] with (T, H, W).
- Returns:
Vision features of shape [num_merged_tokens, out_hidden_size * (1 + num_deepstack)].
- Return type:
torch.Tensor
- pymllm.models.qwen3_vl.get_rope_index(input_ids, image_grid_thw, image_token_id, vision_start_token_id, spatial_merge_size)¶
Compute M-RoPE 3-D position IDs for one sequence.
For text tokens all three (temporal, height, width) indices are equal to the sequential counter. For image tokens the indices follow the spatial grid (t, h, w).
- Parameters:
input_ids (torch.Tensor) – Token IDs for one sequence, shape [T].
image_grid_thw (Optional[torch.Tensor]) – Grid dimensions for every image in the sequence, shape [num_images, 3]. None when there are no images.
image_token_id (int) – Token ID used as placeholder for image patches.
vision_start_token_id (int) – Token ID that precedes each image block.
spatial_merge_size (int) – Number of patches merged per spatial dimension (e.g. 2 → 2x2 merge, so llm_grid_h = H // 2).
- Returns:
(position_ids, mrope_position_delta) where position_ids has shape [3, T] and mrope_position_delta is a Python int equal to max_position_used + 1 - T.
- Return type:
Tuple[torch.Tensor, int]
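The text-only degenerate case follows directly from the description above and can be sketched with a hypothetical helper; image handling (per-grid t/h/w indices offset from the preceding text position) is omitted.

```python
import torch

def mrope_positions_text_only(seq_len: int):
    """With no images, all three (t, h, w) axes equal the sequential counter,
    so mrope_position_delta = max_position_used + 1 - T = 0."""
    position_ids = torch.arange(seq_len).unsqueeze(0).expand(3, -1)  # [3, T]
    delta = int(position_ids.max().item()) + 1 - seq_len
    return position_ids, delta

pos, delta = mrope_positions_text_only(5)  # pos rows are all [0, 1, 2, 3, 4]
```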
- class pymllm.models.qwen3_vl.Qwen3VLAttention(hidden_size, num_heads, num_kv_heads, head_dim, layer_id, rope_theta=5000000.0, rms_norm_eps=1e-06, mrope_section=(24, 20, 20), mrope_interleaved=True, max_position_embeddings=32768, quant_config=None, prefix='')¶
Bases: torch.nn.Module
Attention layer for the Qwen3-VL text decoder.
Uses QK-norm (per-head RMSNorm on Q and K before RoPE) and RadixAttention for KV-cached inference. Applies interleaved M-RoPE with a precomputed cos/sin cache.
- Parameters:
hidden_size (int)
num_heads (int)
num_kv_heads (int)
head_dim (int)
layer_id (int)
rope_theta (float)
rms_norm_eps (float)
mrope_section (Tuple[int, int, int])
mrope_interleaved (bool)
max_position_embeddings (int)
prefix (str)
- num_heads¶
- num_kv_heads¶
- head_dim¶
- q_size¶
- kv_size¶
- scaling¶
- mrope_section = [24, 20, 20]¶
- mrope_interleaved = True¶
- use_fused_qkv¶
- o_proj¶
- q_norm¶
- k_norm¶
- attn¶
- forward(positions, hidden_states, forward_batch)¶
- Parameters:
positions (torch.Tensor)
hidden_states (torch.Tensor)
forward_batch (pymllm.engine.forward_batch.ForwardBatch)
- Return type:
torch.Tensor
- class pymllm.models.qwen3_vl.Qwen3VLDecoderLayer(hidden_size, num_heads, num_kv_heads, head_dim, intermediate_size, layer_id, rope_theta=5000000.0, rms_norm_eps=1e-06, mrope_section=(24, 20, 20), mrope_interleaved=True, max_position_embeddings=32768, quant_config=None, prefix='')¶
Bases: torch.nn.Module
Single decoder layer for the Qwen3-VL text model.
- Parameters:
hidden_size (int)
num_heads (int)
num_kv_heads (int)
head_dim (int)
intermediate_size (int)
layer_id (int)
rope_theta (float)
rms_norm_eps (float)
mrope_section (Tuple[int, int, int])
mrope_interleaved (bool)
max_position_embeddings (int)
prefix (str)
- self_attn¶
- mlp¶
- input_layernorm¶
- post_attention_layernorm¶
- forward(positions, hidden_states, forward_batch, deepstack_embeds=None)¶
- Parameters:
positions (torch.Tensor)
hidden_states (torch.Tensor)
forward_batch (pymllm.engine.forward_batch.ForwardBatch)
deepstack_embeds (Optional[torch.Tensor])
- Return type:
torch.Tensor
- class pymllm.models.qwen3_vl.Qwen3VLTextModel(vocab_size=151936, hidden_size=4096, intermediate_size=22016, num_hidden_layers=32, num_attention_heads=32, num_key_value_heads=32, head_dim=128, rope_theta=5000000.0, rms_norm_eps=1e-06, mrope_section=(24, 20, 20), mrope_interleaved=True, max_position_embeddings=32768, quant_config=None)¶
Bases: torch.nn.Module
Qwen3-VL text backbone (embedding + decoder layers + final norm).
- Parameters:
vocab_size (int)
hidden_size (int)
intermediate_size (int)
num_hidden_layers (int)
num_attention_heads (int)
num_key_value_heads (int)
head_dim (int)
rope_theta (float)
rms_norm_eps (float)
mrope_section (Tuple[int, int, int])
mrope_interleaved (bool)
max_position_embeddings (int)
- embed_tokens¶
- layers¶
- norm¶
- forward(input_ids, positions, forward_batch, input_embeds=None, input_deepstack_embeds=None)¶
- Parameters:
input_ids (torch.Tensor)
positions (torch.Tensor)
forward_batch (pymllm.engine.forward_batch.ForwardBatch)
input_embeds (Optional[torch.Tensor])
input_deepstack_embeds (Optional[torch.Tensor])
- Return type:
torch.Tensor
- class pymllm.models.qwen3_vl.Qwen3VLForConditionalGeneration(config, quant_config=None)¶
Bases: torch.nn.Module
Qwen3-VL multimodal model for conditional generation.
Combines a vision encoder and text decoder. During prefill, image/video tokens are replaced with visual features from the vision encoder. During decode, the model runs only the text decoder.
Forward interface:
logits = model.forward(input_ids, positions, forward_batch)
- config¶
- quant_config = None¶
- model¶
- image_token_id¶
- video_token_id¶
- vision_start_token_id¶
- spatial_merge_size¶
- get_input_embeddings()¶
- Return type:
torch.nn.Module
- forward(input_ids, positions, forward_batch)¶
Run forward pass for Qwen3-VL.
- Parameters:
input_ids (torch.Tensor) – Flattened input token IDs, shape [num_tokens].
positions (torch.Tensor) – Position IDs, shape [num_tokens] (1-D, from the model runner). Overridden internally with 3-D M-RoPE positions.
forward_batch (pymllm.engine.forward_batch.ForwardBatch) – ForwardBatch with attention metadata.
- Returns:
Logits tensor of shape [num_tokens, vocab_size].
- Return type:
torch.Tensor
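The prefill-time replacement of placeholder tokens with visual features amounts to a masked scatter into the text embeddings; the token ID and sizes below are illustrative.

```python
import torch

IMAGE_TOKEN_ID, hidden = 151655, 16  # illustrative placeholder ID and width
input_ids = torch.tensor([1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 2])
text_embeds = torch.zeros(4, hidden)     # embeddings from embed_tokens
vision_feats = torch.ones(2, hidden)     # one feature row per image token

mask = input_ids == IMAGE_TOKEN_ID
text_embeds[mask] = vision_feats         # decoder then consumes text_embeds
```

During decode no image tokens appear, so this step is skipped and only the text decoder runs, as described above.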
- load_weights(weights)¶
Load weights from a HuggingFace checkpoint.
Handles weight name remapping between HuggingFace Qwen3-VL checkpoints and this model’s parameter names.
- Parameters:
weights (Iterable[Tuple[str, torch.Tensor]])
- Return type:
None
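The remapping can be pictured as string substitution over checkpoint parameter names before assignment. Both substitution pairs below are hypothetical examples of the kind of rewrite involved, not the actual mapping table used by pymllm.

```python
def remap(name: str) -> str:
    """Rewrite a HuggingFace checkpoint parameter name (hypothetical rules)."""
    for old, new in [
        ("model.language_model.", "model."),  # hypothetical prefix rewrite
        ("attn.qkv.", "attn.qkv_proj."),      # hypothetical fused-QKV rename
    ]:
        name = name.replace(old, new)
    return name

print(remap("model.language_model.embed_tokens.weight"))
```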