pymllm.layers.mlp

Attributes

logger

MLPActivation

Classes

MLP

Feed-forward MLP block with FlashInfer fused gated activations.

ParallelMLP

Tensor-parallel MLP with column-sharded intermediate dimension.

Module Contents

pymllm.layers.mlp.logger
pymllm.layers.mlp.MLPActivation
class pymllm.layers.mlp.MLP(hidden_size, intermediate_size, activation='silu', use_fused_gate_up_proj=True, use_bias_gate_up=False, use_bias_down=False, enable_pdl=None, quant_config=None, prefix='')

Bases: pymllm.layers.base.MllmBaseLayer

Feed-forward MLP block with FlashInfer fused gated activations.

Non-parallel version (TP=1). Uses Linear for all projections.

Supported activations: silu, gelu, gelu_tanh.

Parameters:
  • hidden_size (int)

  • intermediate_size (int)

  • activation (MLPActivation)

  • use_fused_gate_up_proj (bool)

  • use_bias_gate_up (bool)

  • use_bias_down (bool)

  • enable_pdl (Optional[bool])

  • prefix (str)

hidden_size
intermediate_size
activation = 'silu'
use_fused_gate_up_proj = True
enable_pdl = None
down_proj
forward(x)
Parameters:
  • x (torch.Tensor)

Return type:
  torch.Tensor
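As a concrete reference for what the gated forward pass computes (the standard `down_proj(silu(gate_proj(x)) * up_proj(x))` formulation; this is a pure-Python sketch, not the layer's actual fused FlashInfer implementation, and the helper names are illustrative):

```python
import math

def silu(v):
    # SiLU / swish: v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    # W is a list of rows; returns W @ x
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp_forward(x, W_gate, W_up, W_down):
    # Gated MLP: down_proj(silu(gate_proj(x)) * up_proj(x))
    gate = [silu(v) for v in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return matvec(W_down, [g * u for g, u in zip(gate, up)])
```

The fused `gate_up_proj` path concatenates the gate and up weights into one matmul; mathematically the result is the same as the unfused form above.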

class pymllm.layers.mlp.ParallelMLP(hidden_size, intermediate_size, activation='silu', use_bias_gate_up=False, use_bias_down=False, enable_pdl=None, quant_config=None, prefix='')

Bases: pymllm.layers.base.MllmBaseLayer

Tensor-parallel MLP with column-sharded intermediate dimension.

Projection layout (Megatron-style):

  • gate_proj: ColumnParallelLinear (hidden_size → intermediate_size, gather_output=False)

  • up_proj: ColumnParallelLinear (hidden_size → intermediate_size, gather_output=False)

  • down_proj: RowParallelLinear (intermediate_size → hidden_size, reduce_output=True)

Gate and up projections are kept separate so that each TP rank holds a correctly paired [gate_shard, up_shard] for the gated activation.

Cost: 1 all-reduce (inside down_proj).

Input shape: (*, hidden_size) — full / replicated. Output shape: (*, hidden_size) — full / replicated.
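The sharding arithmetic above can be checked with a small pure-Python sketch (toy weights, TP=2, simulated on one process; names are illustrative): column-sharding gate/up gives each rank a correctly paired slice of the intermediate activation, and row-sharding down gives each rank a partial output whose element-wise sum — the single all-reduce — equals the unsharded result.

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def full_mlp(x, Wg, Wu, Wd):
    # Unsharded reference: down(silu(gate(x)) * up(x))
    inter = [silu(g) * u for g, u in zip(matvec(Wg, x), matvec(Wu, x))]
    return matvec(Wd, inter)

def tp_mlp(x, Wg, Wu, Wd, tp=2):
    n = len(Wg) // tp  # intermediate rows per rank
    partials = []
    for r in range(tp):
        # ColumnParallelLinear: rank r holds output rows [r*n, (r+1)*n)
        g = matvec(Wg[r * n:(r + 1) * n], x)
        u = matvec(Wu[r * n:(r + 1) * n], x)
        inter = [silu(gi) * ui for gi, ui in zip(g, u)]
        # RowParallelLinear: rank r holds input columns [r*n, (r+1)*n)
        Wd_shard = [row[r * n:(r + 1) * n] for row in Wd]
        partials.append(matvec(Wd_shard, inter))
    # all-reduce: element-wise sum of per-rank partial outputs
    return [sum(vals) for vals in zip(*partials)]
```

Because each rank's gate and up shards cover the same intermediate rows, the gated product is computed entirely rank-locally; only the final partial outputs need communication.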

Parameters:
  • hidden_size (int) – Model hidden dimension.

  • intermediate_size (int) – Intermediate (expanded) dimension before TP sharding.

  • activation (MLPActivation) – Gated activation type.

  • use_bias_gate_up (bool) – Add bias to the gate/up projections.

  • use_bias_down (bool) – Add bias to the down projection.

  • enable_pdl (Optional[bool]) – FlashInfer PDL flag.

  • prefix (str)

hidden_size
intermediate_size
activation = 'silu'
enable_pdl = None
gate_proj
up_proj
down_proj
forward(x)
Parameters:
  • x (torch.Tensor)

Return type:
  torch.Tensor