pymllm.quantization.kernels.int8_activation_triton

Per-token INT8 activation quantization using Triton.

Ported from sglang int8_kernel.py (per_token_quant_int8). Original: sglang/srt/layers/quantization/int8_kernel.py:28-89

Functions

per_token_quant_int8(x[, scale_dtype])

Per-token dynamic INT8 quantization.

Module Contents

pymllm.quantization.kernels.int8_activation_triton.per_token_quant_int8(x, scale_dtype=torch.float32)

Per-token dynamic INT8 quantization.

Parameters:
  • x (torch.Tensor) – Input tensor, any shape with last dim = hidden_dim. Must be contiguous.

  • scale_dtype (torch.dtype) – Dtype for scale output (default float32).

Returns:

A tuple (x_q, scales): x_q is the INT8-quantized tensor with the same shape as x; scales holds the per-token scales, with shape x.shape[:-1] + (1,).

Return type:

Tuple[torch.Tensor, torch.Tensor]
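The per-token scheme above can be sketched in pure PyTorch. This is a reference implementation for illustration, not the Triton kernel itself; the clamp epsilon and rounding mode are assumptions, and `per_token_quant_int8_ref` is a hypothetical helper name:

```python
import torch

def per_token_quant_int8_ref(x: torch.Tensor, scale_dtype=torch.float32):
    # Per-token dynamic INT8 quantization (pure-PyTorch sketch of what
    # the Triton kernel computes; not the actual kernel code).
    # One scale per token: absmax over the last (hidden) dimension.
    absmax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-10)
    scales = (absmax / 127.0).to(scale_dtype)
    # Symmetric quantization: round, then clamp to the INT8 range.
    x_q = torch.clamp(torch.round(x / scales), -128, 127).to(torch.int8)
    return x_q, scales
```

Dequantizing with `x_q.float() * scales` recovers `x` to within half a quantization step per element.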