pymllm.models.qwen3

Inference-only Qwen3 text model for pymllm.

Implements Qwen3ForCausalLM with:

  • QK-norm attention + 1D RoPE

  • RadixAttention KV-cache backend

  • Optional quantized Linear methods via quant_config

Adapted from pymllm’s Qwen3-VL text backbone and SGLang’s qwen3.py.

Attributes

logger

Classes

Qwen3Attention

Qwen3 attention with QK norm + 1D RoPE.

Qwen3DecoderLayer

Single Qwen3 decoder layer.

Qwen3Model

Qwen3 text backbone (embedding + decoder + final norm).

Qwen3ForCausalLM

Inference-only Qwen3ForCausalLM.

Module Contents

pymllm.models.qwen3.logger
class pymllm.models.qwen3.Qwen3Attention(hidden_size, num_heads, num_kv_heads, head_dim, layer_id, rope_theta=1000000.0, rms_norm_eps=1e-06, max_position_embeddings=32768, attention_bias=False, quant_config=None, prefix='')

Bases: torch.nn.Module

Qwen3 attention with QK norm + 1D RoPE.

Parameters:
  • hidden_size (int)

  • num_heads (int)

  • num_kv_heads (int)

  • head_dim (int)

  • layer_id (int)

  • rope_theta (float)

  • rms_norm_eps (float)

  • max_position_embeddings (int)

  • attention_bias (bool)

  • prefix (str)

num_heads
num_kv_heads
head_dim
q_size
kv_size
scaling
rope_theta = 1000000.0
use_fused_qkv = True
o_proj
q_norm
k_norm
attn
forward(positions, hidden_states, forward_batch)
Parameters:
  • positions (torch.Tensor)

  • hidden_states (torch.Tensor)

Return type:

torch.Tensor
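
The q_norm/k_norm modules and the rope_theta default above describe the QK-norm + 1D RoPE preprocessing applied to query/key heads before attention. A minimal plain-PyTorch reference of that step (a sketch of the pattern, not pymllm's kernel-backed implementation; the learned norm scales and the RadixAttention call itself are omitted):

  import torch

  def qk_norm_rope_reference(q, k, positions, head_dim, rope_theta=1000000.0, eps=1e-6):
      """QK-norm + 1D RoPE on per-head tensors of shape
      (num_tokens, num_heads, head_dim). Illustrative only: the real
      q_norm/k_norm carry a learned scale, and attention runs through
      RadixAttention kernels."""

      def rms_norm(x):
          # Per-head RMSNorm over head_dim (learned scale omitted).
          return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

      q, k = rms_norm(q), rms_norm(k)

      # 1D rotary embedding; NeoX-style half-split rotation, the layout
      # Qwen-family models use.
      inv_freq = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
      angles = positions.to(torch.float32)[:, None] * inv_freq[None, :]  # (T, head_dim // 2)
      cos = angles.cos()[:, None, :]  # broadcast over heads
      sin = angles.sin()[:, None, :]

      def rotate(x):
          x1, x2 = x[..., : head_dim // 2], x[..., head_dim // 2 :]
          return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

      return rotate(q), rotate(k)

  # Toy call: 5 tokens, 8 query heads, head_dim 64.
  q = torch.randn(5, 8, 64)
  k = torch.randn(5, 2, 64)  # fewer KV heads (GQA) broadcast the same way
  q_rot, k_rot = qk_norm_rope_reference(q, k, torch.arange(5), head_dim=64)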

class pymllm.models.qwen3.Qwen3DecoderLayer(hidden_size, num_heads, num_kv_heads, head_dim, intermediate_size, hidden_act, attention_bias, layer_id, rope_theta=1000000.0, rms_norm_eps=1e-06, max_position_embeddings=32768, quant_config=None, prefix='')

Bases: torch.nn.Module

Single Qwen3 decoder layer.

Parameters:
  • hidden_size (int)

  • num_heads (int)

  • num_kv_heads (int)

  • head_dim (int)

  • intermediate_size (int)

  • hidden_act (str)

  • attention_bias (bool)

  • layer_id (int)

  • rope_theta (float)

  • rms_norm_eps (float)

  • max_position_embeddings (int)

  • prefix (str)

self_attn
mlp
input_layernorm
post_attention_layernorm
forward(positions, hidden_states, forward_batch, residual=None)
Parameters:
  • positions (torch.Tensor)

  • hidden_states (torch.Tensor)

  • residual (torch.Tensor | None)

Return type:

tuple[torch.Tensor, torch.Tensor]
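
forward returning tuple[torch.Tensor, torch.Tensor] reflects the (hidden_states, residual) threading used by SGLang-derived decoder layers, where each residual add happens just before the next RMSNorm so the two can be fused. A runnable stand-in illustrating that contract, with attention and MLP replaced by placeholders (an assumption about the convention, not pymllm's exact code):

  import torch
  from torch import nn

  class ResidualDecoderSketch(nn.Module):
      """Stand-in showing the (hidden_states, residual) contract: each
      block adds the carried residual *before* its norm, so the add and
      the RMSNorm can be fused in the real layer. self_attn/mlp are
      placeholders for Qwen3Attention and the gated MLP."""

      def __init__(self, hidden_size, rms_norm_eps=1e-6):
          super().__init__()
          self.input_layernorm = nn.RMSNorm(hidden_size, eps=rms_norm_eps)
          self.post_attention_layernorm = nn.RMSNorm(hidden_size, eps=rms_norm_eps)
          self.self_attn = nn.Identity()
          self.mlp = nn.Identity()

      def forward(self, hidden_states, residual=None):
          if residual is None:                 # first layer: input becomes the residual
              residual = hidden_states
              hidden_states = self.input_layernorm(hidden_states)
          else:                                # later layers: add, then norm (fusable)
              residual = hidden_states + residual
              hidden_states = self.input_layernorm(residual)
          hidden_states = self.self_attn(hidden_states)

          residual = hidden_states + residual  # same add-then-norm before the MLP
          hidden_states = self.post_attention_layernorm(residual)
          hidden_states = self.mlp(hidden_states)
          return hidden_states, residual

  # Threading two layers: the residual is only folded in by the next norm.
  layer0, layer1 = ResidualDecoderSketch(16), ResidualDecoderSketch(16)
  h, r = layer0(torch.randn(4, 16))
  h, r = layer1(h, r)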

class pymllm.models.qwen3.Qwen3Model(config, quant_config=None)

Bases: torch.nn.Module

Qwen3 text backbone (embedding + decoder + final norm).

hidden_size
num_hidden_layers
embed_tokens
layers
norm
forward(input_ids, positions, forward_batch, input_embeds=None)
Parameters:
  • input_ids (torch.Tensor)

  • positions (torch.Tensor)

  • input_embeds (torch.Tensor | None)

Return type:

torch.Tensor
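
The forward signature (input_ids or precomputed input_embeds) implies the standard backbone wiring: embed, thread (hidden_states, residual) through the decoder stack, then apply the final norm. A sketch reusing ResidualDecoderSketch from the previous example; positions and forward_batch are dropped here because the stand-in layers ignore them:

  import torch
  from torch import nn

  class BackboneSketch(nn.Module):
      """Wiring implied by Qwen3Model: embedding (skipped when
      input_embeds is given), the decoder stack threading
      (hidden_states, residual), then the final norm. Requires
      ResidualDecoderSketch from the previous sketch."""

      def __init__(self, vocab_size, hidden_size, num_hidden_layers):
          super().__init__()
          self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
          self.layers = nn.ModuleList(
              ResidualDecoderSketch(hidden_size) for _ in range(num_hidden_layers)
          )
          self.norm = nn.RMSNorm(hidden_size, eps=1e-6)

      def forward(self, input_ids, input_embeds=None):
          hidden_states = (
              self.embed_tokens(input_ids) if input_embeds is None else input_embeds
          )
          residual = None
          for layer in self.layers:
              hidden_states, residual = layer(hidden_states, residual)
          # The carried residual is folded in by the final norm's fused add.
          return self.norm(hidden_states + residual)

  backbone = BackboneSketch(vocab_size=100, hidden_size=16, num_hidden_layers=2)
  out = backbone(torch.randint(0, 100, (4,)))  # shape (4, 16)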

class pymllm.models.qwen3.Qwen3ForCausalLM(config, quant_config=None)

Bases: torch.nn.Module

Inference-only Qwen3ForCausalLM.

config
quant_config = None
model
get_input_embeddings()
Return type:

torch.nn.Module

forward(input_ids, positions, forward_batch)
Parameters:
  • input_ids (torch.Tensor)

  • positions (torch.Tensor)

load_weights(weights)
Parameters:
  • weights (Iterable[Tuple[str, torch.Tensor]])

Return type:

None
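
load_weights consumes raw (name, tensor) pairs from a checkpoint. In SGLang-derived models this is where separate q_proj/k_proj/v_proj checkpoint weights get packed into one fused QKV parameter, with q_size/kv_size (see Qwen3Attention above) giving the row offsets. A toy, self-contained version of that remapping, assuming pymllm follows the same convention (gate_up_proj and quantized-parameter handling omitted):

  from typing import Iterable, Tuple

  import torch
  from torch import nn

  def load_fused_qkv_sketch(
      qkv: nn.Linear,
      weights: Iterable[Tuple[str, torch.Tensor]],
      q_size: int,
      kv_size: int,
  ) -> None:
      """Copy separate q/k/v checkpoint shards into one fused QKV weight.
      Row layout: [0:q_size] = q, [q_size:q_size+kv_size] = k, then v."""
      offsets = {"q_proj": 0, "k_proj": q_size, "v_proj": q_size + kv_size}
      sizes = {"q_proj": q_size, "k_proj": kv_size, "v_proj": kv_size}
      with torch.no_grad():
          for name, tensor in weights:
              shard = name.split(".")[-2]  # e.g. "...self_attn.q_proj.weight"
              start = offsets[shard]
              qkv.weight[start : start + sizes[shard]].copy_(tensor)

  # Toy checkpoint: hidden=32, 4 query heads / 1 KV head of head_dim 8 (GQA).
  q_size, kv_size, hidden = 32, 8, 32
  qkv = nn.Linear(hidden, q_size + 2 * kv_size, bias=False)
  ckpt = [
      ("model.layers.0.self_attn.q_proj.weight", torch.randn(q_size, hidden)),
      ("model.layers.0.self_attn.k_proj.weight", torch.randn(kv_size, hidden)),
      ("model.layers.0.self_attn.v_proj.weight", torch.randn(kv_size, hidden)),
  ]
  load_fused_qkv_sketch(qkv, ckpt, q_size, kv_size)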