pymllm.layers.mlp
=================

.. py:module:: pymllm.layers.mlp


Attributes
----------

.. autoapisummary::

   pymllm.layers.mlp.logger
   pymllm.layers.mlp.MLPActivation


Classes
-------

.. autoapisummary::

   pymllm.layers.mlp.MLP
   pymllm.layers.mlp.ParallelMLP


Module Contents
---------------

.. py:data:: logger

.. py:data:: MLPActivation

.. py:class:: MLP(hidden_size, intermediate_size, activation = 'silu', use_fused_gate_up_proj = True, use_bias_gate_up = False, use_bias_down = False, enable_pdl = None, quant_config=None, prefix = '')

   Bases: :py:obj:`pymllm.layers.base.MllmBaseLayer`

   Feed-forward MLP block with FlashInfer fused gated activations.

   Non-parallel version (TP=1). Uses :class:`Linear` for all projections.

   Supported activations: ``silu``, ``gelu``, ``gelu_tanh``.

   .. py:attribute:: hidden_size

   .. py:attribute:: intermediate_size

   .. py:attribute:: activation
      :value: 'silu'

   .. py:attribute:: use_fused_gate_up_proj
      :value: True

   .. py:attribute:: enable_pdl
      :value: None

   .. py:attribute:: down_proj

   .. py:method:: forward(x)

.. py:class:: ParallelMLP(hidden_size, intermediate_size, activation = 'silu', use_bias_gate_up = False, use_bias_down = False, enable_pdl = None, quant_config=None, prefix = '')

   Bases: :py:obj:`pymllm.layers.base.MllmBaseLayer`

   Tensor-parallel MLP with column-sharded intermediate dimension.

   Projection layout (Megatron-style):

   - ``gate_proj``: :class:`ColumnParallelLinear`
     ``(hidden_size → intermediate_size, gather_output=False)``
   - ``up_proj``: :class:`ColumnParallelLinear`
     ``(hidden_size → intermediate_size, gather_output=False)``
   - ``down_proj``: :class:`RowParallelLinear`
     ``(intermediate_size → hidden_size, reduce_output=True)``

   Gate and up projections are kept separate so that each TP rank holds a
   correctly paired ``[gate_shard, up_shard]`` for the gated activation.
   Cost: **1 all-reduce** (inside ``down_proj``).

   Input shape : ``(*, hidden_size)`` — full / replicated.
   Output shape: ``(*, hidden_size)`` — full / replicated.
   :param hidden_size: Model hidden dimension.
   :param intermediate_size: Intermediate (expanded) dimension **before** TP sharding.
   :param activation: Gated activation type.
   :param use_bias_gate_up: Add bias to the gate/up projections.
   :param use_bias_down: Add bias to the down projection.
   :param enable_pdl: FlashInfer PDL flag.

   .. py:attribute:: hidden_size

   .. py:attribute:: intermediate_size

   .. py:attribute:: activation
      :value: 'silu'

   .. py:attribute:: enable_pdl
      :value: None

   .. py:attribute:: gate_proj

   .. py:attribute:: up_proj

   .. py:attribute:: down_proj

   .. py:method:: forward(x)
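The gated feed-forward computation these classes implement, and the sharding that lets ``ParallelMLP`` finish with a single all-reduce inside ``down_proj``, can be sketched in plain NumPy. This is an illustrative sketch only, not the pymllm implementation: the function and weight names below are placeholders, and the per-rank loop stands in for actual tensor-parallel execution.

```python
import numpy as np

def silu(x):
    # SiLU (swish): x * sigmoid(x)
    return x * (1.0 / (1.0 + np.exp(-x)))

def mlp_forward(x, w_gate, w_up, w_down):
    """Gated MLP: down_proj( silu(gate_proj(x)) * up_proj(x) )."""
    gate = x @ w_gate                   # (*, hidden) -> (*, intermediate)
    up = x @ w_up                       # (*, hidden) -> (*, intermediate)
    return (silu(gate) * up) @ w_down   # (*, intermediate) -> (*, hidden)

rng = np.random.default_rng(0)
hidden, inter = 8, 16
x = rng.standard_normal((2, hidden))
w_gate = rng.standard_normal((hidden, inter))
w_up = rng.standard_normal((hidden, inter))
w_down = rng.standard_normal((inter, hidden))

# Reference (TP=1) forward pass.
y = mlp_forward(x, w_gate, w_up, w_down)

# ParallelMLP layout: gate/up are column-sharded, down is row-sharded, all
# along the intermediate dimension. Because SiLU and the gating product are
# elementwise, each rank's [gate_shard, up_shard] pair is self-contained;
# summing the per-rank partial down_proj outputs (the all-reduce) recovers
# the full result.
tp = 2
shard = inter // tp
partials = [
    mlp_forward(x,
                w_gate[:, r * shard:(r + 1) * shard],   # column shard
                w_up[:, r * shard:(r + 1) * shard],     # column shard
                w_down[r * shard:(r + 1) * shard, :])   # row shard
    for r in range(tp)
]
y_tp = sum(partials)  # stands in for the single all-reduce
print(y.shape)                  # (2, 8)
print(np.allclose(y, y_tp))     # True
```

The elementwise gating is what forces gate and up shards to be paired per rank: a mismatched pairing would multiply activations from different columns of the intermediate dimension.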