pymllm.layers.gated_delta_net
=============================

.. py:module:: pymllm.layers.gated_delta_net

.. autoapi-nested-parse::

   Gated Delta Network (GDN) linear attention for Qwen3.5.

   This implements the linear attention mechanism used in Qwen3.5's hybrid
   architecture, in which GDN layers alternate with standard full-attention
   layers.

   Core formulation (decode, per-head)::

       g_t      = -exp(A_log) * softplus(a_t + dt_bias)
       beta_t   = sigmoid(b_t)
       state_t  = exp(g_t) * state_{t-1} + beta_t * (k_t outer v_t)
       output_t = q_t @ state_t

   State is externalized into a
   :class:`~pymllm.mem_cache.memory_pool.GDNPool` and computation is
   delegated to the attention backend via
   :class:`~pymllm.layers.attention.radix_linear_attention.RadixLinearAttention`.

Attributes
----------

.. autoapisummary::

   pymllm.layers.gated_delta_net.logger

Classes
-------

.. autoapisummary::

   pymllm.layers.gated_delta_net.GDNConv1d
   pymllm.layers.gated_delta_net.GatedDeltaNet

Module Contents
---------------

.. py:data:: logger

.. py:class:: GDNConv1d(channels, kernel_size)

   Bases: :py:obj:`torch.nn.Module`

   Causal 1D convolution weight holder for GDN sequence mixing.

   The actual convolution computation is performed by the GDN backend using
   pooled conv states. This module only holds the learnable weight.

   .. py:attribute:: channels

   .. py:attribute:: kernel_size

   .. py:attribute:: weight

.. py:class:: GatedDeltaNet(hidden_size, num_k_heads = 16, num_v_heads = 32, head_k_dim = 128, head_v_dim = 128, conv_kernel_size = 4, layer_id = 0, gdn_layer_idx = 0, rms_norm_eps = 1e-06, quant_config=None, prefix = '')

   Bases: :py:obj:`pymllm.layers.base.MllmBaseLayer`

   Gated Delta Network linear attention layer for Qwen3.5.

   State is externalized into a GDNPool and computation is delegated to the
   attention backend via RadixLinearAttention.

   :param hidden_size: Model hidden dimension.
   :type hidden_size: int
   :param num_k_heads: Number of key heads.
   :type num_k_heads: int
   :param num_v_heads: Number of value heads.
   :type num_v_heads: int
   :param head_k_dim: Per-head key dimension.
   :type head_k_dim: int
   :param head_v_dim: Per-head value dimension.
   :type head_v_dim: int
   :param conv_kernel_size: Causal conv1d kernel width.
   :type conv_kernel_size: int
   :param layer_id: Global layer index.
   :type layer_id: int
   :param gdn_layer_idx: Sequential index among GDN layers (0-based).
   :type gdn_layer_idx: int
   :param rms_norm_eps: Epsilon for gated RMS normalization.
   :type rms_norm_eps: float

   .. py:attribute:: hidden_size

   .. py:attribute:: num_k_heads
      :value: 16

   .. py:attribute:: num_v_heads
      :value: 32

   .. py:attribute:: head_k_dim
      :value: 128

   .. py:attribute:: head_v_dim
      :value: 128

   .. py:attribute:: key_dim
      :value: 2048

   .. py:attribute:: value_dim
      :value: 4096

   .. py:attribute:: conv_kernel_size
      :value: 4

   .. py:attribute:: layer_id
      :value: 0

   .. py:attribute:: gdn_layer_idx
      :value: 0

   .. py:attribute:: in_proj_qkv

   .. py:attribute:: in_proj_z

   .. py:attribute:: in_proj_a

   .. py:attribute:: in_proj_b

   .. py:attribute:: conv1d

   .. py:attribute:: A_log

   .. py:attribute:: dt_bias

   .. py:attribute:: norm

   .. py:attribute:: out_proj

   .. py:attribute:: attn

   .. py:method:: forward(hidden_states, forward_batch = None)
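The per-head decode recurrence from the module docstring can be sketched in plain NumPy. This is a minimal illustration of the math only, not the library's implementation (which delegates to the attention backend and pooled states); the function name and signature here are hypothetical.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gdn_decode_step(state, q, k, v, a, b, A_log, dt_bias):
    """One GDN decode step for a single head (illustrative sketch).

    state: (d_k, d_v) running state matrix state_{t-1}
    q, k:  (d_k,) query / key vectors for step t
    v:     (d_v,) value vector for step t
    a, b:  scalar gate inputs; A_log, dt_bias: learned scalars
    """
    g = -np.exp(A_log) * softplus(a + dt_bias)   # decay gate g_t <= 0
    beta = sigmoid(b)                            # write strength beta_t in (0, 1)
    # state_t = exp(g_t) * state_{t-1} + beta_t * (k_t outer v_t)
    state = np.exp(g) * state + beta * np.outer(k, v)
    # output_t = q_t @ state_t
    return state, q @ state
```

Because ``g_t`` is always non-positive, ``exp(g_t)`` lies in (0, 1], so the state decays geometrically while each step writes a rank-1 update scaled by ``beta_t``.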
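The ``GDNConv1d`` docstring notes that the module only holds the weight while the backend applies the causal convolution against pooled conv states. A hedged sketch of what such a decode-time update could look like, assuming a per-channel (depthwise) filter and a rolling window of the last ``kernel_size`` inputs; all names here are illustrative, not the backend's API:

```python
import numpy as np

def causal_conv1d_step(conv_state, x_t, weight):
    """Hypothetical decode-time causal depthwise conv update.

    conv_state: (channels, kernel_size) rolling window of recent inputs
    x_t:        (channels,) newest input vector
    weight:     (channels, kernel_size) depthwise filter taps
    Returns the updated window and this step's conv output.
    """
    # Shift the window left by one position and append the newest input,
    # so the window always holds the last kernel_size inputs (causal).
    conv_state = np.concatenate([conv_state[:, 1:], x_t[:, None]], axis=1)
    # Depthwise convolution: per-channel dot product over the window.
    y_t = (conv_state * weight).sum(axis=1)
    return conv_state, y_t
```

Keeping the window in an external pool (rather than inside the module) is what lets many sequences share one set of conv weights while each sequence carries only its own small ``(channels, kernel_size)`` state.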