pymllm.layers.mlp
=================

.. py:module:: pymllm.layers.mlp


Attributes
----------

.. autoapisummary::

   pymllm.layers.mlp.logger
   pymllm.layers.mlp.MLPActivation


Classes
-------

.. autoapisummary::

   pymllm.layers.mlp.MLP
   pymllm.layers.mlp.ParallelMLP


Module Contents
---------------

.. py:data:: logger

.. py:data:: MLPActivation

.. py:class:: MLP(hidden_size, intermediate_size, activation = 'silu', use_fused_gate_up_proj = True, use_bias_gate_up = False, use_bias_down = False, enable_pdl = None, quant_config=None, prefix = '')

   Bases: :py:obj:`pymllm.layers.base.MllmBaseLayer`

   Feed-forward MLP block with FlashInfer fused gated activations.

   Non-parallel version (TP=1). Uses :class:`Linear` for all projections.

   Supported activations: ``silu``, ``gelu``, ``gelu_tanh``.

   .. py:attribute:: hidden_size

   .. py:attribute:: intermediate_size

   .. py:attribute:: activation
      :value: 'silu'

   .. py:attribute:: use_fused_gate_up_proj
      :value: True

   .. py:attribute:: enable_pdl
      :value: None

   .. py:attribute:: down_proj

   .. py:method:: forward(x)

.. py:class:: ParallelMLP(hidden_size, intermediate_size, activation = 'silu', use_bias_gate_up = False, use_bias_down = False, enable_pdl = None, quant_config=None, prefix = '')

   Bases: :py:obj:`pymllm.layers.base.MllmBaseLayer`

   Tensor-parallel MLP with column-sharded intermediate dimension.

   Projection layout (Megatron-style):

   - ``gate_proj``: :class:`ColumnParallelLinear`
     ``(hidden_size → intermediate_size, gather_output=False)``
   - ``up_proj``: :class:`ColumnParallelLinear`
     ``(hidden_size → intermediate_size, gather_output=False)``
   - ``down_proj``: :class:`RowParallelLinear`
     ``(intermediate_size → hidden_size, reduce_output=True)``

   Gate and up projections are kept separate so that each TP rank holds a
   correctly paired ``[gate_shard, up_shard]`` for the gated activation.
   Cost: **1 all-reduce** (inside ``down_proj``).

   Input shape : ``(*, hidden_size)`` — full / replicated.
   Output shape: ``(*, hidden_size)`` — full / replicated.
   :param hidden_size: Model hidden dimension.
   :param intermediate_size: Intermediate (expanded) dimension **before** TP sharding.
   :param activation: Gated activation type.
   :param use_bias_gate_up: Add bias to the gate/up projections.
   :param use_bias_down: Add bias to the down projection.
   :param enable_pdl: FlashInfer PDL flag.

   .. py:attribute:: hidden_size

   .. py:attribute:: intermediate_size

   .. py:attribute:: activation
      :value: 'silu'

   .. py:attribute:: enable_pdl
      :value: None

   .. py:attribute:: gate_proj

   .. py:attribute:: up_proj

   .. py:attribute:: down_proj

   .. py:method:: forward(x)
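The gated feed-forward computation these classes implement, and the sharding that lets ``ParallelMLP`` finish with a single all-reduce inside ``down_proj``, can be sketched in plain NumPy. This is an illustrative sketch only, not the pymllm implementation: the function and weight names below are placeholders, and the per-rank loop stands in for actual tensor-parallel execution.

```python
import numpy as np

def silu(x):
    # SiLU (swish): x * sigmoid(x)
    return x * (1.0 / (1.0 + np.exp(-x)))

def mlp_forward(x, w_gate, w_up, w_down):
    """Gated MLP: down_proj( silu(gate_proj(x)) * up_proj(x) )."""
    gate = x @ w_gate                   # (*, hidden) -> (*, intermediate)
    up = x @ w_up                       # (*, hidden) -> (*, intermediate)
    return (silu(gate) * up) @ w_down   # (*, intermediate) -> (*, hidden)

rng = np.random.default_rng(0)
hidden, inter = 8, 16
x = rng.standard_normal((2, hidden))
w_gate = rng.standard_normal((hidden, inter))
w_up = rng.standard_normal((hidden, inter))
w_down = rng.standard_normal((inter, hidden))

# Reference (TP=1) forward pass.
y = mlp_forward(x, w_gate, w_up, w_down)

# ParallelMLP layout: gate/up are column-sharded, down is row-sharded, all
# along the intermediate dimension. Because SiLU and the gating product are
# elementwise, each rank's [gate_shard, up_shard] pair is self-contained;
# summing the per-rank partial down_proj outputs (the all-reduce) recovers
# the full result.
tp = 2
shard = inter // tp
partials = [
    mlp_forward(x,
                w_gate[:, r * shard:(r + 1) * shard],   # column shard
                w_up[:, r * shard:(r + 1) * shard],     # column shard
                w_down[r * shard:(r + 1) * shard, :])   # row shard
    for r in range(tp)
]
y_tp = sum(partials)  # stands in for the single all-reduce
print(y.shape)                  # (2, 8)
print(np.allclose(y, y_tp))     # True
```

The elementwise gating is what forces gate and up shards to be paired per rank: a mismatched pairing would multiply activations from different columns of the intermediate dimension.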