pymllm.models.qwen3_5¶
Inference-only Qwen3.5 model for pymllm.
Implements the hybrid attention architecture:

- Full attention layers (standard transformer with RoPE + output gate)
- GDN linear attention layers (Gated Delta Network, O(n) complexity)
Layers alternate: linear, attention, linear, attention, … based on full_attention_interval in the config.
Supports:

- Dense (non-MoE) variant
- Vision-Language (multimodal) via inheritance from Qwen3VL
Adapted from sglang’s qwen3_5.py.
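The alternation can be sketched as follows (the function and type names here are illustrative assumptions, not the actual pymllm API; assumes the interval is counted 1-indexed):

```python
# Hypothetical sketch of how layer types could be derived from
# full_attention_interval; names are illustrative, not pymllm's API.
def build_layer_types(num_layers: int, full_attention_interval: int) -> list:
    """Every full_attention_interval-th layer uses full attention;
    all other layers use GDN linear attention."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0
        else "linear_attention"
        for i in range(num_layers)
    ]

# With an interval of 2 the layers alternate: linear, attention, linear, ...
print(build_layer_types(4, 2))
# → ['linear_attention', 'full_attention', 'linear_attention', 'full_attention']
```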
Attributes¶
Classes¶
- Qwen3_5FullAttention: Standard multi-head attention with RoPE, QK-norm, and optional output gate.
- Qwen3_5AttentionDecoderLayer: Decoder layer with full attention + MLP.
- Qwen3_5LinearDecoderLayer: Decoder layer with GDN linear attention + MLP.
- Qwen3_5ForCausalLM: Qwen3.5 causal language model with hybrid attention.
- Qwen3_5ForConditionalGeneration: Qwen3.5 multimodal model (text + vision).
Module Contents¶
- pymllm.models.qwen3_5.logger¶
- class pymllm.models.qwen3_5.Qwen3_5FullAttention(config, layer_id, quant_config=None, prefix='')¶
Bases: torch.nn.Module

Standard multi-head attention with RoPE, QK-norm, and optional output gate.
- Parameters:
layer_id (int)
prefix (str)
- num_heads¶
- num_kv_heads¶
- head_dim¶
- q_size¶
- kv_size¶
- scaling¶
- layer_id¶
- attn_output_gate¶
- q_proj¶
- k_proj¶
- v_proj¶
- o_proj¶
- q_norm¶
- k_norm¶
- partial_rotary_factor¶
- rope_theta¶
- rotary_dim¶
- attn¶
- forward(positions, hidden_states, forward_batch)¶
- Parameters:
positions (torch.Tensor)
hidden_states (torch.Tensor)
forward_batch (Any)
- Return type:
torch.Tensor
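A rough, framework-free sketch of two of the mechanisms named in the attribute list above, the partial rotary dimension and the sigmoid output gate (all names are illustrative assumptions; the real implementation operates on torch tensors):

```python
import math

# Hypothetical sketch with illustrative names; plain Python stands in
# for torch tensors.

def compute_rotary_dim(head_dim: int, partial_rotary_factor: float) -> int:
    # With partial RoPE, only the first rotary_dim channels of each head
    # receive rotary position embeddings; the rest pass through unrotated.
    return int(head_dim * partial_rotary_factor)

def apply_output_gate(attn_out, gate):
    # When attn_output_gate is enabled, the attention output is modulated
    # element-wise by sigmoid(gate) before the output projection.
    def sigmoid(g):
        return 1.0 / (1.0 + math.exp(-g))
    return [a * sigmoid(g) for a, g in zip(attn_out, gate)]

print(compute_rotary_dim(128, 0.25))  # → 32
```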
- class pymllm.models.qwen3_5.Qwen3_5AttentionDecoderLayer(config, layer_id, quant_config=None, prefix='')¶
Bases: torch.nn.Module

Decoder layer with full attention + MLP.
- Parameters:
layer_id (int)
prefix (str)
- self_attn¶
- mlp¶
- input_layernorm¶
- post_attention_layernorm¶
- forward(positions, hidden_states, residual, forward_batch)¶
- Parameters:
positions (torch.Tensor)
hidden_states (torch.Tensor)
residual (Optional[torch.Tensor])
forward_batch (Any)
- class pymllm.models.qwen3_5.Qwen3_5LinearDecoderLayer(config, layer_id, gdn_layer_idx=0, quant_config=None, prefix='')¶
Bases: torch.nn.Module

Decoder layer with GDN linear attention + MLP.
- Parameters:
layer_id (int)
gdn_layer_idx (int)
prefix (str)
- linear_attn¶
- mlp¶
- input_layernorm¶
- post_attention_layernorm¶
- forward(positions, hidden_states, residual, forward_batch)¶
- Parameters:
positions (torch.Tensor)
hidden_states (torch.Tensor)
residual (Optional[torch.Tensor])
forward_batch (Any)
- class pymllm.models.qwen3_5.Qwen3_5ForCausalLM(config, quant_config=None)¶
Bases: torch.nn.Module

Qwen3.5 causal language model with hybrid attention.
Alternates between full attention and GDN linear attention layers. Dense (non-MoE) variant.
- config¶
- quant_config = None¶
- vocab_size¶
- embed_tokens¶
- layer_types¶
- layers¶
- full_attn_layer_ids¶
- num_gdn_layers = 0¶
- norm¶
- forward(input_ids, positions, forward_batch, input_embeds=None)¶
- Parameters:
input_ids (torch.Tensor)
positions (torch.Tensor)
forward_batch (Any)
input_embeds (Optional[torch.Tensor])
- Return type:
torch.Tensor
- load_weights(weights)¶
Load HuggingFace checkpoint weights with name remapping.
- Parameters:
weights (Iterable[Tuple[str, torch.Tensor]])
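A minimal sketch of the kind of name remapping this performs (the mapping entries below are assumptions modelled on the sglang loader convention, not the actual table inside load_weights):

```python
# Hypothetical stacked-params mapping in the sglang loader style; the
# real entries live inside load_weights and may differ.
stacked_params_mapping = [
    # (target param name, checkpoint param name, shard id)
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]

def remap_checkpoint_name(name: str):
    """Return (module param name, shard id or None) for a checkpoint key."""
    for target, source, shard_id in stacked_params_mapping:
        if source in name:
            return name.replace(source, target), shard_id
    return name, None

print(remap_checkpoint_name("model.layers.0.mlp.gate_proj.weight"))
# → ('model.layers.0.mlp.gate_up_proj.weight', 0)
```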
- class pymllm.models.qwen3_5.Qwen3_5ForConditionalGeneration(config, quant_config=None)¶
Bases: torch.nn.Module

Qwen3.5 multimodal model (text + vision).
Inherits vision encoder from Qwen3VL and uses Qwen3.5’s hybrid language model.
- config¶
- quant_config = None¶
- model¶
- num_gdn_layers = 0¶
- full_attn_layer_ids¶
- lm_head¶
- image_token_id¶
- video_token_id¶
- forward(input_ids, positions, forward_batch, input_embeds=None, pixel_values=None, image_grid_thw=None)¶
- Parameters:
input_ids (torch.Tensor)
positions (torch.Tensor)
forward_batch (Any)
input_embeds (Optional[torch.Tensor])
pixel_values (Optional[torch.Tensor])
image_grid_thw (Optional[torch.Tensor])
- Return type:
torch.Tensor
- load_weights(weights)¶
Load weights, dispatching visual vs language params.
- Parameters:
weights (Iterable[Tuple[str, torch.Tensor]])
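The dispatch described above can be sketched as a simple prefix check (the prefix strings are assumptions based on common Qwen-VL checkpoint layouts, not the actual loader logic):

```python
# Hypothetical sketch: route checkpoint keys to the vision encoder or the
# language model by prefix (prefix strings are illustrative).
def dispatch_weight(name: str) -> str:
    if name.startswith("model.visual.") or name.startswith("visual."):
        return "visual"
    return "language"

print(dispatch_weight("visual.blocks.0.attn.qkv.weight"))         # → visual
print(dispatch_weight("model.layers.3.self_attn.q_proj.weight"))  # → language
```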