pymllm.layers.mlp

Attributes

logger

MLPActivation

Classes

MLP

Feed-forward MLP block with FlashInfer fused gated activations.

ParallelMLP

Tensor-parallel MLP with column-sharded intermediate dimension.

Module Contents

pymllm.layers.mlp.logger
pymllm.layers.mlp.MLPActivation
class pymllm.layers.mlp.MLP(hidden_size, intermediate_size, activation='silu', use_fused_gate_up_proj=True, use_bias_gate_up=False, use_bias_down=False, enable_pdl=None, quant_config=None, prefix='')

Bases: pymllm.layers.base.MllmBaseLayer

Feed-forward MLP block with FlashInfer fused gated activations.

Non-parallel version (TP=1). Uses Linear for all projections.

Supported activations: silu, gelu, gelu_tanh.

Parameters:
  • hidden_size (int)

  • intermediate_size (int)

  • activation (MLPActivation)

  • use_fused_gate_up_proj (bool)

  • use_bias_gate_up (bool)

  • use_bias_down (bool)

  • enable_pdl (Optional[bool])

  • prefix (str)

hidden_size
intermediate_size
activation = 'silu'
use_fused_gate_up_proj = True
enable_pdl = None
down_proj
forward(x)
Parameters:
  • x (torch.Tensor)

Return type:
  torch.Tensor
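As a concrete reference for what the gated forward pass computes (the standard `down_proj(silu(gate_proj(x)) * up_proj(x))` formulation; this is a pure-Python sketch, not the layer's actual fused FlashInfer implementation, and the helper names are illustrative):

```python
import math

def silu(v):
    # SiLU / swish: v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    # W is a list of rows; returns W @ x
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp_forward(x, W_gate, W_up, W_down):
    # Gated MLP: down_proj(silu(gate_proj(x)) * up_proj(x))
    gate = [silu(v) for v in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return matvec(W_down, [g * u for g, u in zip(gate, up)])
```

The fused `gate_up_proj` path concatenates the gate and up weights into one matmul; mathematically the result is the same as the unfused form above.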

class pymllm.layers.mlp.ParallelMLP(hidden_size, intermediate_size, activation='silu', use_bias_gate_up=False, use_bias_down=False, enable_pdl=None, quant_config=None, prefix='')

Bases: pymllm.layers.base.MllmBaseLayer

Tensor-parallel MLP with column-sharded intermediate dimension.

Projection layout (Megatron-style):

  • gate_proj: ColumnParallelLinear (hidden_size → intermediate_size, gather_output=False)

  • up_proj: ColumnParallelLinear (hidden_size → intermediate_size, gather_output=False)

  • down_proj: RowParallelLinear (intermediate_size → hidden_size, reduce_output=True)

Gate and up projections are kept separate so that each TP rank holds a correctly paired [gate_shard, up_shard] for the gated activation.

Cost: 1 all-reduce (inside down_proj).

Input shape: (*, hidden_size) — full / replicated. Output shape: (*, hidden_size) — full / replicated.
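The sharding arithmetic above can be checked with a small pure-Python sketch (toy weights, TP=2, simulated on one process; names are illustrative): column-sharding gate/up gives each rank a correctly paired slice of the intermediate activation, and row-sharding down gives each rank a partial output whose element-wise sum — the single all-reduce — equals the unsharded result.

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def full_mlp(x, Wg, Wu, Wd):
    # Unsharded reference: down(silu(gate(x)) * up(x))
    inter = [silu(g) * u for g, u in zip(matvec(Wg, x), matvec(Wu, x))]
    return matvec(Wd, inter)

def tp_mlp(x, Wg, Wu, Wd, tp=2):
    n = len(Wg) // tp  # intermediate rows per rank
    partials = []
    for r in range(tp):
        # ColumnParallelLinear: rank r holds output rows [r*n, (r+1)*n)
        g = matvec(Wg[r * n:(r + 1) * n], x)
        u = matvec(Wu[r * n:(r + 1) * n], x)
        inter = [silu(gi) * ui for gi, ui in zip(g, u)]
        # RowParallelLinear: rank r holds input columns [r*n, (r+1)*n)
        Wd_shard = [row[r * n:(r + 1) * n] for row in Wd]
        partials.append(matvec(Wd_shard, inter))
    # all-reduce: element-wise sum of per-rank partial outputs
    return [sum(vals) for vals in zip(*partials)]
```

Because each rank's gate and up shards cover the same intermediate rows, the gated product is computed entirely rank-locally; only the final partial outputs need communication.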

Parameters:
  • hidden_size (int) – Model hidden dimension.

  • intermediate_size (int) – Intermediate (expanded) dimension before TP sharding.

  • activation (MLPActivation) – Gated activation type.

  • use_bias_gate_up (bool) – Add bias to the gate/up projections.

  • use_bias_down (bool) – Add bias to the down projection.

  • enable_pdl (Optional[bool]) – FlashInfer PDL flag.

  • prefix (str)

hidden_size
intermediate_size
activation = 'silu'
enable_pdl = None
gate_proj
up_proj
down_proj
forward(x)
Parameters:
  • x (torch.Tensor)

Return type:
  torch.Tensor