pymllm.models.qwen3_vl

Inference-only Qwen3-VL model for pymllm.

Adapted from sglang’s Qwen3-VL implementation for pymllm’s single-GPU inference architecture. Uses pymllm layers (RadixAttention, RMSNorm, MLP) and conforms to the pymllm forward interface:

model.forward(input_ids, positions, forward_batch)

Designed for a single accelerator card — no tensor / pipeline parallelism.

Attributes

logger

Classes

Qwen3VisionMLP

MLP block for the vision encoder.

Qwen3VLVisionPatchEmbed

3D convolution patch embedding for video/image patchification.

Qwen3VisionAttention

Multi-head self-attention for the vision encoder (no KV cache).

Qwen3VisionBlock

Single vision transformer block.

Qwen3VLVisionPatchMerger

Merges spatial patches to reduce sequence length.

Qwen3VLVisionModel

Complete vision encoder for Qwen3-VL.

Qwen3VLAttention

Attention layer for the Qwen3-VL text decoder.

Qwen3VLDecoderLayer

Single decoder layer for the Qwen3-VL text model.

Qwen3VLTextModel

Qwen3-VL text backbone (embedding + decoder layers + final norm).

Qwen3VLForConditionalGeneration

Qwen3-VL multimodal model for conditional generation.

Functions

get_rope_index(input_ids, image_grid_thw, ...)

Compute M-RoPE 3-D position IDs for one sequence.

Module Contents

pymllm.models.qwen3_vl.logger
class pymllm.models.qwen3_vl.Qwen3VisionMLP(in_features, hidden_features, hidden_act='silu', bias=True)

Bases: torch.nn.Module

MLP block for the vision encoder.

Parameters:
  • in_features (int)

  • hidden_features (int)

  • hidden_act (str)

  • bias (bool)

linear_fc1
linear_fc2
forward(x)
Parameters:

x (torch.Tensor)

Return type:

torch.Tensor

class pymllm.models.qwen3_vl.Qwen3VLVisionPatchEmbed(patch_size=16, temporal_patch_size=2, in_channels=3, embed_dim=1152)

Bases: torch.nn.Module

3D convolution patch embedding for video/image patchification.

Parameters:
  • patch_size (int)

  • temporal_patch_size (int)

  • in_channels (int)

  • embed_dim (int)

patch_size = 16
temporal_patch_size = 2
in_channels = 3
embed_dim = 1152
proj
forward(hidden_states)
Parameters:

hidden_states (torch.Tensor)

Return type:

torch.Tensor
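
With the defaults above, each flattened patch carries `in_channels * temporal_patch_size * patch_size ** 2` values, which fixes the expected input width of `forward`. A quick sanity check of the sizes (pure Python; the 224x224 example assumes the usual Qwen-VL convention that a still image is replicated to `temporal_patch_size` frames):

```python
# Default Qwen3VLVisionPatchEmbed geometry (values from the signature above).
in_channels = 3
temporal_patch_size = 2
patch_size = 16

# Each flattened patch seen by forward() has this many values:
patch_dim = in_channels * temporal_patch_size * patch_size ** 2
print(patch_dim)  # 1536

# A 224x224 image padded to temporal_patch_size frames yields a
# (T=1, H=14, W=14) patch grid, i.e. 196 patches of width patch_dim.
grid_t, grid_h, grid_w = 1, 224 // patch_size, 224 // patch_size
print(grid_t * grid_h * grid_w)  # 196
```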

class pymllm.models.qwen3_vl.Qwen3VisionAttention(embed_dim, num_heads)

Bases: torch.nn.Module

Multi-head self-attention for the vision encoder (no KV cache).

Parameters:
  • embed_dim (int)

  • num_heads (int)

embed_dim
num_heads
head_dim
qkv_proj
out_proj
forward(x, cu_seqlens, rotary_pos_emb_cos, rotary_pos_emb_sin)

Forward pass with variable-length sequences via cu_seqlens.

Parameters:
  • x (torch.Tensor) – [total_tokens, embed_dim]

  • cu_seqlens (torch.Tensor) – [num_seqs + 1] cumulative sequence lengths

  • rotary_pos_emb_cos (torch.Tensor) – [total_tokens, rotary_dim]

  • rotary_pos_emb_sin (torch.Tensor) – [total_tokens, rotary_dim]

Return type:

torch.Tensor
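
The `cu_seqlens` layout follows the standard varlen-attention convention: entry `i` is the cumulative token count before sequence `i`, so sequence `i` occupies rows `cu_seqlens[i]:cu_seqlens[i+1]` of the packed `x`. A minimal sketch of building it from hypothetical per-image patch counts (pure Python; the real code operates on tensors):

```python
from itertools import accumulate

# Hypothetical per-image patch counts in one packed batch.
seq_lens = [196, 64, 300]

# cu_seqlens has num_seqs + 1 entries.
cu_seqlens = [0] + list(accumulate(seq_lens))
print(cu_seqlens)  # [0, 196, 260, 560]

# Rows of the packed tensor belonging to sequence i:
i = 1
start, end = cu_seqlens[i], cu_seqlens[i + 1]
print(end - start)  # 64
```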

class pymllm.models.qwen3_vl.Qwen3VisionBlock(dim, num_heads, intermediate_dim, hidden_act='silu', norm_eps=1e-06)

Bases: torch.nn.Module

Single vision transformer block.

Parameters:
  • dim (int)

  • num_heads (int)

  • intermediate_dim (int)

  • hidden_act (str)

  • norm_eps (float)

norm1
norm2
attn
mlp
forward(x, cu_seqlens, rotary_pos_emb_cos, rotary_pos_emb_sin)
Parameters:
  • x (torch.Tensor)

  • cu_seqlens (torch.Tensor)

  • rotary_pos_emb_cos (torch.Tensor)

  • rotary_pos_emb_sin (torch.Tensor)

Return type:

torch.Tensor

class pymllm.models.qwen3_vl.Qwen3VLVisionPatchMerger(dim, context_dim, spatial_merge_size=2, use_postshuffle_norm=False, norm_eps=1e-06)

Bases: torch.nn.Module

Merges spatial patches to reduce sequence length.

Groups spatial_merge_size ** 2 consecutive patch tokens and projects them to the language model hidden dimension.

Parameters:
  • dim (int)

  • context_dim (int)

  • spatial_merge_size (int)

  • use_postshuffle_norm (bool)

  • norm_eps (float)

hidden_size
use_postshuffle_norm = False
norm
linear_fc1
act_fn
linear_fc2
forward(x)
Parameters:

x (torch.Tensor)

Return type:

torch.Tensor
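
The merge itself is a reshape: with `spatial_merge_size = 2`, every 4 consecutive patch tokens are concatenated along the feature dimension before projection, so the sequence shrinks 4x while the feature width grows 4x. A toy illustration (Python lists standing in for tensors; the projection back down to the language-model width is omitted):

```python
spatial_merge_size = 2
group = spatial_merge_size ** 2  # 4 tokens merged into one

# 8 patch tokens, each with a 3-dim feature (toy sizes).
tokens = [[i, i, i] for i in range(8)]

# Concatenate each run of `group` consecutive tokens along the feature axis.
merged = [
    sum(tokens[j : j + group], [])  # flatten the group into one vector
    for j in range(0, len(tokens), group)
]
print(len(merged), len(merged[0]))  # 2 12
```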

class pymllm.models.qwen3_vl.Qwen3VLVisionModel(depth=27, hidden_size=1152, hidden_act='gelu_pytorch_tanh', intermediate_size=4304, num_heads=16, in_channels=3, patch_size=16, spatial_merge_size=2, temporal_patch_size=2, out_hidden_size=3584, num_position_embeddings=2304, deepstack_visual_indexes=None, norm_eps=1e-06)

Bases: torch.nn.Module

Complete vision encoder for Qwen3-VL.

Produces patch embeddings from raw pixel values, applies a stack of vision transformer blocks with 3D rotary embeddings, then merges spatial patches. Supports “deep stack” where intermediate layer outputs are captured and concatenated to the final output.

Parameters:
  • depth (int)

  • hidden_size (int)

  • hidden_act (str)

  • intermediate_size (int)

  • num_heads (int)

  • in_channels (int)

  • patch_size (int)

  • spatial_merge_size (int)

  • temporal_patch_size (int)

  • out_hidden_size (int)

  • num_position_embeddings (int)

  • deepstack_visual_indexes (Optional[List[int]])

  • norm_eps (float)

hidden_size = 1152
num_heads = 16
num_position_embeddings = 2304
num_grid_per_side = 0
patch_size = 16
spatial_merge_size = 2
temporal_patch_size = 2
deepstack_visual_indexes = None
out_hidden_size
patch_embed
pos_embed
blocks
merger
deepstack_merger_list
property dtype: torch.dtype
Return type:

torch.dtype

property device: torch.device
Return type:

torch.device

rot_pos_emb(grid_thw)

Compute rotary pos-emb cos/sin for all images/videos in the batch.

Parameters:

grid_thw (List[List[int]])

Return type:

Tuple[torch.Tensor, torch.Tensor]

fast_pos_embed_interpolate(grid_thw)

Interpolate position embeddings via bilinear interpolation.

Parameters:

grid_thw (torch.Tensor)

Return type:

torch.Tensor
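
With `num_position_embeddings = 2304` the learned table is a 48x48 grid (48 ** 2 = 2304), and `fast_pos_embed_interpolate` resamples it to each image's patch grid. A scalar-valued sketch of the bilinear lookup, assuming the standard four-corner weighting (the real code interpolates embedding vectors, not scalars):

```python
def bilerp(grid, y, x):
    """Bilinear sample of a 2-D grid at fractional coordinates (y, x)."""
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(grid) - 1)
    x1 = min(x0 + 1, len(grid[0]) - 1)
    fy, fx = y - y0, x - x0
    return (grid[y0][x0] * (1 - fy) * (1 - fx)
            + grid[y0][x1] * (1 - fy) * fx
            + grid[y1][x0] * fy * (1 - fx)
            + grid[y1][x1] * fy * fx)

# Sampling halfway between the four corners averages them.
corners = [[0.0, 1.0], [2.0, 3.0]]
print(bilerp(corners, 0.5, 0.5))  # 1.5
```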

forward(x, grid_thw)

Run the vision encoder.

Parameters:
  • x (torch.Tensor) – Pixel values, shape [total_patches, patch_dim].

  • grid_thw (torch.Tensor) – Grid dimensions [num_images, 3] with (T, H, W).

Returns:

Vision features of shape [num_merged_tokens, out_hidden_size * (1 + num_deepstack)].

Return type:

torch.Tensor
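
The output length follows directly from `grid_thw` and `spatial_merge_size`: each image contributes `T * H * W` input patches and `T * (H // m) * (W // m)` merged tokens. A quick count under the documented default `spatial_merge_size = 2`, with made-up grids:

```python
m = 2  # spatial_merge_size

# grid_thw rows: (T, H, W) per image, as passed to forward().
grid_thw = [(1, 28, 28), (2, 16, 16)]

total_patches = sum(t * h * w for t, h, w in grid_thw)
merged_tokens = sum(t * (h // m) * (w // m) for t, h, w in grid_thw)
print(total_patches, merged_tokens)  # 1296 324
```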

pymllm.models.qwen3_vl.get_rope_index(input_ids, image_grid_thw, image_token_id, vision_start_token_id, spatial_merge_size)

Compute M-RoPE 3-D position IDs for one sequence.

For text tokens, all three (temporal, height, width) indices equal the sequential position counter. For image tokens, the indices follow the spatial grid (t, h, w).

Parameters:
  • input_ids (torch.Tensor) – Token IDs for one sequence, shape [T].

  • image_grid_thw (Optional[torch.Tensor]) – Grid dimensions for every image in the sequence, shape [num_images, 3]. None when there are no images.

  • image_token_id (int) – Token ID used as placeholder for image patches.

  • vision_start_token_id (int) – Token ID that precedes each image block.

  • spatial_merge_size (int) – Number of patches merged per spatial dimension (e.g. 2 → 2x2 merge, so llm_grid_h = H // 2).

Returns:

(position_ids, mrope_position_delta) where position_ids has shape [3, T] and mrope_position_delta is a Python int equal to max_position_used + 1 - T.

Return type:

Tuple[torch.Tensor, int]
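
A pure-Python sketch of the rule above, for a sequence whose image placeholder run has already been expanded to `T * (H // m) * (W // m)` tokens. It assumes the usual M-RoPE convention that text resumes past the image's largest grid extent; the real implementation works on tensors and handles more cases:

```python
def toy_rope_index(input_ids, image_grid_thw, image_token_id, merge):
    """Simplified single-sequence M-RoPE position computation."""
    pos = ([], [], [])  # temporal, height, width position lists
    start = 0           # next sequential position counter
    img = 0
    i = 0
    while i < len(input_ids):
        if input_ids[i] == image_token_id:
            t, h, w = image_grid_thw[img]
            lh, lw = h // merge, w // merge  # grid seen by the LLM
            for tt in range(t):
                for hh in range(lh):
                    for ww in range(lw):
                        pos[0].append(start + tt)
                        pos[1].append(start + hh)
                        pos[2].append(start + ww)
            i += t * lh * lw
            start += max(t, lh, lw)  # text resumes past the largest extent
            img += 1
        else:
            for d in range(3):       # text: all three indices are equal
                pos[d].append(start)
            start += 1
            i += 1
    max_pos = max(max(p) for p in pos)
    delta = max_pos + 1 - len(input_ids)
    return pos, delta

IMG = 99  # hypothetical placeholder token id
ids = [10, 11, IMG, IMG, IMG, IMG, 12]  # one 4x4 image, merged 2x2
(pt, ph, pw), delta = toy_rope_index(ids, [(1, 4, 4)], IMG, 2)
print(pt)     # [0, 1, 2, 2, 2, 2, 4]
print(delta)  # -2
```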

class pymllm.models.qwen3_vl.Qwen3VLAttention(hidden_size, num_heads, num_kv_heads, head_dim, layer_id, rope_theta=5000000.0, rms_norm_eps=1e-06, mrope_section=(24, 20, 20), mrope_interleaved=True, max_position_embeddings=32768, quant_config=None, prefix='')

Bases: torch.nn.Module

Attention layer for the Qwen3-VL text decoder.

Uses QK-norm (per-head RMSNorm on Q and K before RoPE) and RadixAttention for KV-cached inference. Applies interleaved M-RoPE with a precomputed cos/sin cache.

Parameters:
  • hidden_size (int)

  • num_heads (int)

  • num_kv_heads (int)

  • head_dim (int)

  • layer_id (int)

  • rope_theta (float)

  • rms_norm_eps (float)

  • mrope_section (Tuple[int, int, int])

  • mrope_interleaved (bool)

  • max_position_embeddings (int)

  • prefix (str)

num_heads
num_kv_heads
head_dim
q_size
kv_size
scaling
mrope_section = [24, 20, 20]
mrope_interleaved = True
use_fused_qkv
o_proj
q_norm
k_norm
attn
forward(positions, hidden_states, forward_batch)
Parameters:
  • positions (torch.Tensor)

  • hidden_states (torch.Tensor)

  • forward_batch (pymllm.engine.forward_batch.ForwardBatch)

Return type:

torch.Tensor

class pymllm.models.qwen3_vl.Qwen3VLDecoderLayer(hidden_size, num_heads, num_kv_heads, head_dim, intermediate_size, layer_id, rope_theta=5000000.0, rms_norm_eps=1e-06, mrope_section=(24, 20, 20), mrope_interleaved=True, max_position_embeddings=32768, quant_config=None, prefix='')

Bases: torch.nn.Module

Single decoder layer for the Qwen3-VL text model.

Parameters:
  • hidden_size (int)

  • num_heads (int)

  • num_kv_heads (int)

  • head_dim (int)

  • intermediate_size (int)

  • layer_id (int)

  • rope_theta (float)

  • rms_norm_eps (float)

  • mrope_section (Tuple[int, int, int])

  • mrope_interleaved (bool)

  • max_position_embeddings (int)

  • prefix (str)

self_attn
mlp
input_layernorm
post_attention_layernorm
forward(positions, hidden_states, forward_batch, deepstack_embeds=None)
Parameters:
  • positions (torch.Tensor)

  • hidden_states (torch.Tensor)

  • forward_batch (pymllm.engine.forward_batch.ForwardBatch)

  • deepstack_embeds (Optional[torch.Tensor])

Return type:

torch.Tensor

class pymllm.models.qwen3_vl.Qwen3VLTextModel(vocab_size=151936, hidden_size=4096, intermediate_size=22016, num_hidden_layers=32, num_attention_heads=32, num_key_value_heads=32, head_dim=128, rope_theta=5000000.0, rms_norm_eps=1e-06, mrope_section=(24, 20, 20), mrope_interleaved=True, max_position_embeddings=32768, quant_config=None)

Bases: torch.nn.Module

Qwen3-VL text backbone (embedding + decoder layers + final norm).

Parameters:
  • vocab_size (int)

  • hidden_size (int)

  • intermediate_size (int)

  • num_hidden_layers (int)

  • num_attention_heads (int)

  • num_key_value_heads (int)

  • head_dim (int)

  • rope_theta (float)

  • rms_norm_eps (float)

  • mrope_section (Tuple[int, int, int])

  • mrope_interleaved (bool)

  • max_position_embeddings (int)

hidden_size = 4096
num_hidden_layers = 32
embed_tokens
layers
norm
forward(input_ids, positions, forward_batch, input_embeds=None, input_deepstack_embeds=None)
Parameters:
  • input_ids (torch.Tensor)

  • positions (torch.Tensor)

  • forward_batch (pymllm.engine.forward_batch.ForwardBatch)

  • input_embeds (Optional[torch.Tensor])

  • input_deepstack_embeds (Optional[torch.Tensor])

Return type:

torch.Tensor

class pymllm.models.qwen3_vl.Qwen3VLForConditionalGeneration(config, quant_config=None)

Bases: torch.nn.Module

Qwen3-VL multimodal model for conditional generation.

Combines a vision encoder and text decoder. During prefill, image/video tokens are replaced with visual features from the vision encoder. During decode, the model runs only the text decoder.

Forward interface:

logits = model.forward(input_ids, positions, forward_batch)
config
quant_config = None
model
image_token_id
video_token_id
vision_start_token_id
spatial_merge_size
get_input_embeddings()
Return type:

torch.nn.Module

forward(input_ids, positions, forward_batch)

Run forward pass for Qwen3-VL.

Parameters:
  • input_ids (torch.Tensor) – Flattened input token IDs, shape [num_tokens].

  • positions (torch.Tensor) – Position IDs, shape [num_tokens] (1-D, from model runner). Overridden internally with 3-D M-RoPE positions.

  • forward_batch (pymllm.engine.forward_batch.ForwardBatch) – ForwardBatch with attention metadata.

Returns:

Logits tensor of shape [num_tokens, vocab_size].

Return type:

torch.Tensor
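
The prefill-time substitution described above amounts to a scatter: wherever `input_ids` holds the image placeholder, the text embedding row is overwritten by the next vision feature, in order. A toy sketch with lists in place of tensors (`IMG` is a hypothetical token id; the real code works on embedding matrices and also handles video tokens):

```python
IMG = 99  # hypothetical image placeholder token id

input_ids = [5, IMG, IMG, 7]
text_embeds = [[0.5], [0.0], [0.0], [0.7]]  # placeholder rows are dummies
vision_feats = [[1.0], [2.0]]               # one row per merged image token

feats = iter(vision_feats)
inputs_embeds = [
    next(feats) if tok == IMG else emb
    for tok, emb in zip(input_ids, text_embeds)
]
print(inputs_embeds)  # [[0.5], [1.0], [2.0], [0.7]]
```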

load_weights(weights)

Load weights from a HuggingFace checkpoint.

Handles weight name remapping between HuggingFace Qwen3-VL checkpoints and this model’s parameter names.

Parameters:

weights (Iterable[Tuple[str, torch.Tensor]])

Return type:

None