pymllm.models.qwen3_5

Inference-only Qwen3.5 model for pymllm.

Implements the hybrid attention architecture:

  • Full attention layers (standard transformer with RoPE + output gate)

  • GDN linear attention layers (Gated Delta Network, O(n) complexity)

Layers alternate (linear, attention, linear, attention, …) according to full_attention_interval in the config.

Supports:

  • Dense (non-MoE) variant

  • Vision-Language (multimodal) via inheritance from Qwen3VL

Adapted from sglang’s qwen3_5.py.
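The interval-based alternation described above can be sketched with a small helper. The helper name and the layer-type strings below are illustrative assumptions, not this module's actual API:

```python
def build_layer_types(num_layers, full_attention_interval):
    """Assign a type to each decoder layer: every full_attention_interval-th
    layer (1-based) uses full attention, the rest use GDN linear attention.
    Hypothetical helper for illustration only."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0
        else "linear_attention"
        for i in range(num_layers)
    ]
```

With an interval of 2 this yields the linear, attention, linear, attention, … pattern described above.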

Attributes

logger

Classes

Qwen3_5FullAttention

Standard multi-head attention with RoPE, QK-norm, and optional output gate.

Qwen3_5AttentionDecoderLayer

Decoder layer with full attention + MLP.

Qwen3_5LinearDecoderLayer

Decoder layer with GDN linear attention + MLP.

Qwen3_5ForCausalLM

Qwen3.5 causal language model with hybrid attention.

Qwen3_5ForConditionalGeneration

Qwen3.5 multimodal model (text + vision).

Module Contents

pymllm.models.qwen3_5.logger
class pymllm.models.qwen3_5.Qwen3_5FullAttention(config, layer_id, quant_config=None, prefix='')

Bases: torch.nn.Module

Standard multi-head attention with RoPE, QK-norm, and optional output gate.

Parameters:
  • layer_id (int)

  • prefix (str)

hidden_size
num_heads
num_kv_heads
head_dim
q_size
kv_size
scaling
layer_id
attn_output_gate
q_proj
k_proj
v_proj
o_proj
q_norm
k_norm
partial_rotary_factor
rope_theta
rotary_dim
attn
forward(positions, hidden_states, forward_batch)
Parameters:
  • positions (torch.Tensor)

  • hidden_states (torch.Tensor)

  • forward_batch (Any)

Return type:

torch.Tensor
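The optional output gate can be pictured with a minimal, framework-free sketch: in gated-attention designs, the attention output is modulated elementwise by a sigmoid of gate logits before the final output projection. The function below is an assumption for illustration, not the layer's real code:

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_attn_output(attn_out, gate_logits):
    """Elementwise sigmoid gate on the attention output, applied before
    o_proj in this sketch. Illustrative only; placement and shapes in the
    actual layer may differ."""
    return [o * _sigmoid(g) for o, g in zip(attn_out, gate_logits)]
```

A zero logit halves the corresponding output channel; a large positive logit passes it through nearly unchanged.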

class pymllm.models.qwen3_5.Qwen3_5AttentionDecoderLayer(config, layer_id, quant_config=None, prefix='')

Bases: torch.nn.Module

Decoder layer with full attention + MLP.

Parameters:
  • layer_id (int)

  • prefix (str)

self_attn
mlp
input_layernorm
post_attention_layernorm
forward(positions, hidden_states, residual, forward_batch)
Parameters:
  • positions (torch.Tensor)

  • hidden_states (torch.Tensor)

  • residual (Optional[torch.Tensor])

  • forward_batch (Any)

class pymllm.models.qwen3_5.Qwen3_5LinearDecoderLayer(config, layer_id, gdn_layer_idx=0, quant_config=None, prefix='')

Bases: torch.nn.Module

Decoder layer with GDN linear attention + MLP.

Parameters:
  • layer_id (int)

  • gdn_layer_idx (int)

  • prefix (str)

linear_attn
mlp
input_layernorm
post_attention_layernorm
forward(positions, hidden_states, residual, forward_batch)
Parameters:
  • positions (torch.Tensor)

  • hidden_states (torch.Tensor)

  • residual (Optional[torch.Tensor])

  • forward_batch (Any)

class pymllm.models.qwen3_5.Qwen3_5ForCausalLM(config, quant_config=None)

Bases: torch.nn.Module

Qwen3.5 causal language model with hybrid attention.

Alternates between full attention and GDN linear attention layers. Dense (non-MoE) variant.

config
quant_config = None
hidden_size
vocab_size
embed_tokens
layer_types
layers
full_attn_layer_ids
num_gdn_layers = 0
norm
forward(input_ids, positions, forward_batch, input_embeds=None)
Parameters:
  • input_ids (torch.Tensor)

  • positions (torch.Tensor)

  • forward_batch (Any)

  • input_embeds (Optional[torch.Tensor])

Return type:

torch.Tensor

load_weights(weights)

Load HuggingFace checkpoint weights with name remapping.

Parameters:

weights (Iterable[Tuple[str, torch.Tensor]])
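Name remapping here typically means rewriting HuggingFace checkpoint keys so they line up with this class's attribute names. The "model." prefix and the helper below are assumptions for illustration; the real mapping lives inside load_weights:

```python
def remap_hf_name(hf_name):
    """Drop the HF 'model.' prefix so e.g.
    'model.layers.0.mlp.gate_proj.weight' matches the module attribute
    'layers.0.mlp.gate_proj.weight'. Hypothetical sketch of the idea."""
    prefix = "model."
    return hf_name[len(prefix):] if hf_name.startswith(prefix) else hf_name
```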

class pymllm.models.qwen3_5.Qwen3_5ForConditionalGeneration(config, quant_config=None)

Bases: torch.nn.Module

Qwen3.5 multimodal model (text + vision).

Inherits vision encoder from Qwen3VL and uses Qwen3.5’s hybrid language model.

config
quant_config = None
model
num_gdn_layers = 0
full_attn_layer_ids
lm_head
image_token_id
video_token_id
forward(input_ids, positions, forward_batch, input_embeds=None, pixel_values=None, image_grid_thw=None)
Parameters:
  • input_ids (torch.Tensor)

  • positions (torch.Tensor)

  • forward_batch (Any)

  • input_embeds (Optional[torch.Tensor])

  • pixel_values (Optional[torch.Tensor])

  • image_grid_thw (Optional[torch.Tensor])

Return type:

torch.Tensor

load_weights(weights)

Load weights, dispatching visual vs language params.

Parameters:

weights (Iterable[Tuple[str, torch.Tensor]])
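The visual-vs-language dispatch can be pictured as routing by name prefix. The "visual." prefix below is an assumption based on Qwen-VL-style checkpoints, for illustration only:

```python
def dispatch_param(name):
    """Route a checkpoint tensor to the vision encoder or the hybrid
    language model by its name prefix. Illustrative sketch, not the
    method's actual logic."""
    return "visual" if name.startswith("visual.") else "language"
```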