pymllm.layers.mlp¶
Attributes¶
Classes¶
- MLP: Feed-forward MLP block with FlashInfer fused gated activations.
- ParallelMLP: Tensor-parallel MLP with column-sharded intermediate dimension.
Module Contents¶
- pymllm.layers.mlp.logger¶
- pymllm.layers.mlp.MLPActivation¶
- class pymllm.layers.mlp.MLP(hidden_size, intermediate_size, activation='silu', use_fused_gate_up_proj=True, use_bias_gate_up=False, use_bias_down=False, enable_pdl=None, quant_config=None, prefix='')¶
Bases: pymllm.layers.base.MllmBaseLayer
Feed-forward MLP block with FlashInfer fused gated activations.
Non-parallel version (TP=1). Uses Linear for all projections.
Supported activations: silu, gelu, gelu_tanh.
- Parameters:
hidden_size (int)
intermediate_size (int)
activation (MLPActivation)
use_fused_gate_up_proj (bool)
use_bias_gate_up (bool)
use_bias_down (bool)
enable_pdl (Optional[bool])
prefix (str)
- intermediate_size¶
- activation = 'silu'¶
- use_fused_gate_up_proj = True¶
- enable_pdl = None¶
- down_proj¶
- forward(x)¶
- Parameters:
x (torch.Tensor)
- Return type:
torch.Tensor
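The forward pass of a gated MLP with a fused gate/up projection (the use_fused_gate_up_proj=True path) can be illustrated without pymllm or FlashInfer. The sketch below mirrors the math in plain numpy; the function and weight names are illustrative, not the layer's actual attributes, and the real layer runs this on torch tensors with fused kernels.

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def mlp_forward(x, w_gate_up, w_down):
    # Fused gate/up projection: one matmul producing 2*intermediate
    # columns, then split in half along the last dimension.
    gate_up = x @ w_gate_up                    # (*, 2 * intermediate)
    gate, up = np.split(gate_up, 2, axis=-1)   # each (*, intermediate)
    # Gated activation followed by the down projection.
    return (silu(gate) * up) @ w_down          # (*, hidden)

rng = np.random.default_rng(0)
hidden, intermediate = 8, 16
x = rng.standard_normal((2, hidden))
w_gate_up = rng.standard_normal((hidden, 2 * intermediate))
w_down = rng.standard_normal((intermediate, hidden))
y = mlp_forward(x, w_gate_up, w_down)
print(y.shape)  # (2, 8)
```

Fusing gate and up into one projection halves the number of GEMM launches for the expansion step; the split afterwards recovers the two operands of the gated activation.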
- class pymllm.layers.mlp.ParallelMLP(hidden_size, intermediate_size, activation='silu', use_bias_gate_up=False, use_bias_down=False, enable_pdl=None, quant_config=None, prefix='')¶
Bases: pymllm.layers.base.MllmBaseLayer
Tensor-parallel MLP with column-sharded intermediate dimension.
Projection layout (Megatron-style):
- gate_proj: ColumnParallelLinear (hidden_size → intermediate_size, gather_output=False)
- up_proj: ColumnParallelLinear (hidden_size → intermediate_size, gather_output=False)
- down_proj: RowParallelLinear (intermediate_size → hidden_size, reduce_output=True)
Gate and up projections are kept separate so that each TP rank holds a correctly paired [gate_shard, up_shard] for the gated activation.
Cost: 1 all-reduce (inside down_proj).
Input shape: (*, hidden_size), full/replicated. Output shape: (*, hidden_size), full/replicated.
- Parameters:
hidden_size (int) – Model hidden dimension.
intermediate_size (int) – Intermediate (expanded) dimension before TP sharding.
activation (MLPActivation) – Gated activation type.
use_bias_gate_up (bool) – Add bias to the gate/up projections.
use_bias_down (bool) – Add bias to the down projection.
enable_pdl (Optional[bool]) – FlashInfer PDL flag.
prefix (str)
- intermediate_size¶
- activation = 'silu'¶
- enable_pdl = None¶
- gate_proj¶
- up_proj¶
- down_proj¶
- forward(x)¶
- Parameters:
x (torch.Tensor)
- Return type:
torch.Tensor
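The Megatron-style sharding above can be checked numerically on a single device: column-shard gate/up along the intermediate dimension, row-shard down along the same dimension, and sum the per-rank partial outputs (the one all-reduce inside down_proj). This is a sketch of the math only; rank loop, slicing, and the summation stand in for the distributed ColumnParallelLinear/RowParallelLinear machinery.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
hidden, intermediate, tp = 8, 16, 2
x = rng.standard_normal((3, hidden))
w_gate = rng.standard_normal((hidden, intermediate))
w_up = rng.standard_normal((hidden, intermediate))
w_down = rng.standard_normal((intermediate, hidden))

# Reference: non-parallel forward.
ref = (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Simulated TP ranks: gate/up are column-sharded, down is row-sharded
# over the SAME slice of the intermediate dimension, so each rank's
# [gate_shard, up_shard] pair lines up for the gated activation.
partials = []
for r in range(tp):
    cols = slice(r * intermediate // tp, (r + 1) * intermediate // tp)
    g = x @ w_gate[:, cols]   # gate shard on rank r (no gather)
    u = x @ w_up[:, cols]     # up shard, matching columns
    partials.append((silu(g) * u) @ w_down[cols, :])
out = sum(partials)           # stand-in for the all-reduce in down_proj

print(np.allclose(out, ref))  # True
```

Because the gated activation is elementwise over the intermediate dimension, column-sharding splits it exactly, and summing the row-sharded down projections reproduces the full result: one all-reduce total, with no communication between gate/up and the activation.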