pymllm.models.qwen3_5¶
Inference-only Qwen3.5 model for pymllm.
Implements the hybrid attention architecture:

- Full attention layers (standard transformer with RoPE + output gate)
- GDN linear attention layers (Gated Delta Network, O(n) complexity)
Layers alternate: linear, attention, linear, attention, … based on full_attention_interval in the config.
Supports:

- Dense (non-MoE) variant
- Vision-Language (multimodal) via inheritance from Qwen3VL
Adapted from sglang’s qwen3_5.py.
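The alternation can be sketched as follows (the function and type names here are illustrative assumptions, not the actual pymllm API; assumes the interval is counted 1-indexed):

```python
# Hypothetical sketch of how layer types could be derived from
# full_attention_interval; names are illustrative, not pymllm's API.
def build_layer_types(num_layers: int, full_attention_interval: int) -> list:
    """Every full_attention_interval-th layer uses full attention;
    all other layers use GDN linear attention."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0
        else "linear_attention"
        for i in range(num_layers)
    ]

# With an interval of 2 the layers alternate: linear, attention, linear, ...
print(build_layer_types(4, 2))
# → ['linear_attention', 'full_attention', 'linear_attention', 'full_attention']
```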
Attributes¶
Classes¶
- Qwen3_5FullAttention: Standard multi-head attention with RoPE, QK-norm, and optional output gate.
- Qwen3_5AttentionDecoderLayer: Decoder layer with full attention + MLP.
- Qwen3_5LinearDecoderLayer: Decoder layer with GDN linear attention + MLP.
- Qwen3_5ForCausalLM: Qwen3.5 causal language model with hybrid attention.
- Qwen3_5ForConditionalGeneration: Qwen3.5 multimodal model (text + vision).
Module Contents¶
- pymllm.models.qwen3_5.logger¶
- class pymllm.models.qwen3_5.Qwen3_5FullAttention(config, layer_id, quant_config=None, prefix='')¶
Bases: torch.nn.Module

Standard multi-head attention with RoPE, QK-norm, and optional output gate.
- Parameters:
layer_id (int)
prefix (str)
- num_heads¶
- num_kv_heads¶
- head_dim¶
- q_size¶
- kv_size¶
- scaling¶
- layer_id¶
- attn_output_gate¶
- q_proj¶
- k_proj¶
- v_proj¶
- o_proj¶
- q_norm¶
- k_norm¶
- partial_rotary_factor¶
- rope_theta¶
- rotary_dim¶
- attn¶
- forward(positions, hidden_states, forward_batch)¶
- Parameters:
positions (torch.Tensor)
hidden_states (torch.Tensor)
forward_batch (Any)
- Return type:
torch.Tensor
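A rough, framework-free sketch of two of the mechanisms named in the attribute list above, the partial rotary dimension and the sigmoid output gate (all names are illustrative assumptions; the real implementation operates on torch tensors):

```python
import math

# Hypothetical sketch with illustrative names; plain Python stands in
# for torch tensors.

def compute_rotary_dim(head_dim: int, partial_rotary_factor: float) -> int:
    # With partial RoPE, only the first rotary_dim channels of each head
    # receive rotary position embeddings; the rest pass through unrotated.
    return int(head_dim * partial_rotary_factor)

def apply_output_gate(attn_out, gate):
    # When attn_output_gate is enabled, the attention output is modulated
    # element-wise by sigmoid(gate) before the output projection.
    def sigmoid(g):
        return 1.0 / (1.0 + math.exp(-g))
    return [a * sigmoid(g) for a, g in zip(attn_out, gate)]

print(compute_rotary_dim(128, 0.25))  # → 32
```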
- class pymllm.models.qwen3_5.Qwen3_5AttentionDecoderLayer(config, layer_id, quant_config=None, prefix='')¶
Bases: torch.nn.Module

Decoder layer with full attention + MLP.
- Parameters:
layer_id (int)
prefix (str)
- self_attn¶
- mlp¶
- input_layernorm¶
- post_attention_layernorm¶
- forward(positions, hidden_states, residual, forward_batch)¶
- Parameters:
positions (torch.Tensor)
hidden_states (torch.Tensor)
residual (Optional[torch.Tensor])
forward_batch (Any)
- class pymllm.models.qwen3_5.Qwen3_5LinearDecoderLayer(config, layer_id, gdn_layer_idx=0, quant_config=None, prefix='')¶
Bases: torch.nn.Module

Decoder layer with GDN linear attention + MLP.
- Parameters:
layer_id (int)
gdn_layer_idx (int)
prefix (str)
- linear_attn¶
- mlp¶
- input_layernorm¶
- post_attention_layernorm¶
- forward(positions, hidden_states, residual, forward_batch)¶
- Parameters:
positions (torch.Tensor)
hidden_states (torch.Tensor)
residual (Optional[torch.Tensor])
forward_batch (Any)
- class pymllm.models.qwen3_5.Qwen3_5ForCausalLM(config, quant_config=None)¶
Bases: torch.nn.Module

Qwen3.5 causal language model with hybrid attention.
Alternates between full attention and GDN linear attention layers. Dense (non-MoE) variant.
- config¶
- quant_config = None¶
- vocab_size¶
- embed_tokens¶
- layer_types¶
- layers¶
- full_attn_layer_ids¶
- num_gdn_layers = 0¶
- norm¶
- forward(input_ids, positions, forward_batch, input_embeds=None)¶
- Parameters:
input_ids (torch.Tensor)
positions (torch.Tensor)
forward_batch (Any)
input_embeds (Optional[torch.Tensor])
- Return type:
torch.Tensor
- load_weights(weights)¶
Load HuggingFace checkpoint weights with name remapping.
- Parameters:
weights (Iterable[Tuple[str, torch.Tensor]])
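A minimal sketch of the kind of name remapping this performs (the mapping entries below are assumptions modelled on the sglang loader convention, not the actual table inside load_weights):

```python
# Hypothetical stacked-params mapping in the sglang loader style; the
# real entries live inside load_weights and may differ.
stacked_params_mapping = [
    # (target param name, checkpoint param name, shard id)
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]

def remap_checkpoint_name(name: str):
    """Return (module param name, shard id or None) for a checkpoint key."""
    for target, source, shard_id in stacked_params_mapping:
        if source in name:
            return name.replace(source, target), shard_id
    return name, None

print(remap_checkpoint_name("model.layers.0.mlp.gate_proj.weight"))
# → ('model.layers.0.mlp.gate_up_proj.weight', 0)
```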
- class pymllm.models.qwen3_5.Qwen3_5ForConditionalGeneration(config, quant_config=None)¶
Bases: torch.nn.Module

Qwen3.5 multimodal model (text + vision).
Inherits vision encoder from Qwen3VL and uses Qwen3.5’s hybrid language model.
- config¶
- quant_config = None¶
- model¶
- num_gdn_layers = 0¶
- full_attn_layer_ids¶
- lm_head¶
- image_token_id¶
- video_token_id¶
- forward(input_ids, positions, forward_batch, input_embeds=None, pixel_values=None, image_grid_thw=None)¶
- Parameters:
input_ids (torch.Tensor)
positions (torch.Tensor)
forward_batch (Any)
input_embeds (Optional[torch.Tensor])
pixel_values (Optional[torch.Tensor])
image_grid_thw (Optional[torch.Tensor])
- Return type:
torch.Tensor
- load_weights(weights)¶
Load weights, dispatching visual vs language params.
- Parameters:
weights (Iterable[Tuple[str, torch.Tensor]])
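The dispatch described above can be sketched as a simple prefix check (the prefix strings are assumptions based on common Qwen-VL checkpoint layouts, not the actual loader logic):

```python
# Hypothetical sketch: route checkpoint keys to the vision encoder or the
# language model by prefix (prefix strings are illustrative).
def dispatch_weight(name: str) -> str:
    if name.startswith("model.visual.") or name.startswith("visual."):
        return "visual"
    return "language"

print(dispatch_weight("visual.blocks.0.attn.qkv.weight"))         # → visual
print(dispatch_weight("model.layers.3.self_attn.q_proj.weight"))  # → language
```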