pymllm.quantization.methods.compressed_tensors

Attributes

MARLIN_SUPPORTED_GROUP_SIZES
GPTQ_MARLIN_MIN_THREAD_N
GPTQ_MARLIN_MIN_THREAD_K
GPTQ_MARLIN_TILE
SCALAR_TYPE_UINT4
SCALAR_TYPE_UINT4B8

Classes

CompressedTensorsWNA16Scheme

CompressedTensorsW8A8Int8Scheme

CompressedTensorsLinearMethod

Base class for quantization methods applied to linear layers.

CompressedTensorsConfig

Base class for quantization configurations.

Functions

verify_marlin_supported
verify_marlin_supports_shape
marlin_make_workspace
marlin_make_empty_g_idx
get_scale_perms
marlin_permute_scales
replace_parameter

Module Contents

pymllm.quantization.methods.compressed_tensors.MARLIN_SUPPORTED_GROUP_SIZES
pymllm.quantization.methods.compressed_tensors.GPTQ_MARLIN_MIN_THREAD_N = 64
pymllm.quantization.methods.compressed_tensors.GPTQ_MARLIN_MIN_THREAD_K = 128
pymllm.quantization.methods.compressed_tensors.GPTQ_MARLIN_TILE = 16
pymllm.quantization.methods.compressed_tensors.SCALAR_TYPE_UINT4
pymllm.quantization.methods.compressed_tensors.SCALAR_TYPE_UINT4B8
pymllm.quantization.methods.compressed_tensors.verify_marlin_supported(group_size)
Parameters:

group_size (int)

Return type:

None

pymllm.quantization.methods.compressed_tensors.verify_marlin_supports_shape(output_size_per_partition, input_size_per_partition, input_size, group_size)
Parameters:
  • output_size_per_partition (int)

  • input_size_per_partition (int)

  • input_size (int)

  • group_size (int)

Return type:

None
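
A hedged usage sketch of both verification helpers (shape and group-size values are illustrative; the assumption here is that the helpers raise on unsupported configurations rather than returning a flag):

    from pymllm.quantization.methods.compressed_tensors import (
        verify_marlin_supported,
        verify_marlin_supports_shape,
    )

    verify_marlin_supported(group_size=128)
    verify_marlin_supports_shape(
        output_size_per_partition=4096,  # output features on this TP rank
        input_size_per_partition=4096,   # input features on this TP rank
        input_size=4096,                 # full, un-sharded input dimension
        group_size=128,
    )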

pymllm.quantization.methods.compressed_tensors.marlin_make_workspace(device)
Parameters:

device (torch.device)

Return type:

torch.Tensor

pymllm.quantization.methods.compressed_tensors.marlin_make_empty_g_idx(device)
Parameters:

device (torch.device)

Return type:

torch.Tensor
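
A minimal sketch of preparing the Marlin scratch buffers, assuming both helpers allocate directly on the given device:

    import torch
    from pymllm.quantization.methods.compressed_tensors import (
        marlin_make_workspace,
        marlin_make_empty_g_idx,
    )

    device = torch.device("cuda:0")
    workspace = marlin_make_workspace(device)  # scratch space for the Marlin kernel
    g_idx = marlin_make_empty_g_idx(device)    # placeholder when activation reordering is unused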

pymllm.quantization.methods.compressed_tensors.get_scale_perms()
pymllm.quantization.methods.compressed_tensors.marlin_permute_scales(s, size_k, size_n, group_size)
Parameters:
  • s (torch.Tensor)

  • size_k (int)

  • size_n (int)

  • group_size (int)

Return type:

torch.Tensor
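
A hedged usage sketch, assuming s holds per-group scales of shape (size_k // group_size, size_n):

    import torch
    from pymllm.quantization.methods.compressed_tensors import marlin_permute_scales

    size_k, size_n, group_size = 4096, 4096, 128
    s = torch.rand(size_k // group_size, size_n, dtype=torch.float16)
    s_marlin = marlin_permute_scales(s, size_k, size_n, group_size)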

pymllm.quantization.methods.compressed_tensors.replace_parameter(layer, name, new_data)
Parameters:
  • layer (torch.nn.Module)

  • name (str)

  • new_data (torch.Tensor)

Return type:

None
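
A short sketch of the typical use: swapping a loaded parameter for a kernel-repacked tensor inside process_weights_after_loading (repack_for_kernel is a hypothetical stand-in for a kernel-specific transform):

    import torch
    from pymllm.quantization.methods.compressed_tensors import replace_parameter

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        repacked = repack_for_kernel(layer.qweight.data)  # hypothetical repacking step
        # Re-register under the same name so later lookups see the new tensor.
        replace_parameter(layer, "qweight", repacked)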

class pymllm.quantization.methods.compressed_tensors.CompressedTensorsWNA16Scheme(*, weight_bits, group_size, symmetric, actorder)
Parameters:
  • weight_bits (int)

  • group_size (int)

  • symmetric (bool)

  • actorder (Optional[str])

weight_bits
group_size
symmetric
actorder
pack_factor
quant_type
create_weights(layer, input_size_per_partition, output_partition_sizes, input_size, output_size, params_dtype, **extra_weight_attrs)
Parameters:
  • layer (torch.nn.Module)

  • input_size_per_partition (int)

  • output_partition_sizes (List[int])

  • input_size (int)

  • output_size (int)

  • params_dtype (torch.dtype)

  • extra_weight_attrs (Any)

Return type:

None

process_weights_after_loading(layer)
Parameters:

layer (torch.nn.Module)

Return type:

None

apply(layer, x, bias=None)
Parameters:
  • layer (torch.nn.Module)

  • x (torch.Tensor)

  • bias (Optional[torch.Tensor])

Return type:

torch.Tensor
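
A sketch of the scheme's three-step lifecycle (sizes and dtype are illustrative; real callers typically also pass extra weight attributes such as a weight_loader):

    import torch
    from pymllm.quantization.methods.compressed_tensors import CompressedTensorsWNA16Scheme

    scheme = CompressedTensorsWNA16Scheme(
        weight_bits=4, group_size=128, symmetric=True, actorder=None
    )
    layer = torch.nn.Module()

    # 1. Register quantized parameters on the bare module.
    scheme.create_weights(
        layer,
        input_size_per_partition=4096,
        output_partition_sizes=[4096],
        input_size=4096,
        output_size=4096,
        params_dtype=torch.float16,
    )
    # 2. Checkpoint tensors are loaded into the layer here, then repacked.
    scheme.process_weights_after_loading(layer)
    # 3. Forward pass through the quantized weights.
    y = scheme.apply(layer, torch.randn(1, 4096, dtype=torch.float16))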

class pymllm.quantization.methods.compressed_tensors.CompressedTensorsW8A8Int8Scheme(*, weight_bits)
Parameters:

weight_bits (int)

weight_bits
create_weights(layer, input_size_per_partition, output_partition_sizes, input_size, output_size, params_dtype, **extra_weight_attrs)
Parameters:
  • layer (torch.nn.Module)

  • input_size_per_partition (int)

  • output_partition_sizes (List[int])

  • input_size (int)

  • output_size (int)

  • params_dtype (torch.dtype)

  • extra_weight_attrs (Any)

Return type:

None

process_weights_after_loading(layer)
Parameters:

layer (torch.nn.Module)

Return type:

None

apply(layer, x, bias=None)
Parameters:
  • layer (torch.nn.Module)

  • x (torch.Tensor)

  • bias (Optional[torch.Tensor])

Return type:

torch.Tensor
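
For reference, symmetric per-channel int8 weight quantization (one common strategy for W8A8 schemes) can be written as below; this is a standalone illustration, not this module's internal code:

    import torch

    def quantize_w8_per_channel(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # One scale per output channel: scale = max(|w|) / 127.
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        q = torch.round(w / scale).clamp(-128, 127).to(torch.int8)
        return q, scale.squeeze(1)

    w = torch.randn(4096, 4096)
    q, scale = quantize_w8_per_channel(w)
    w_hat = q.float() * scale.unsqueeze(1)  # dequantized approximation of w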

class pymllm.quantization.methods.compressed_tensors.CompressedTensorsLinearMethod(quant_config, signature)

Bases: pymllm.layers.quantize_base.LinearMethodBase

Base class for quantization methods applied to linear layers.

Narrows the QuantizeMethodBase interface with concrete signatures tailored to linear (matmul) operations.

Subclasses must implement create_weights() and apply().

Parameters:
  • quant_config

  • signature

quant_config
scheme
create_weights(*args, **kwargs)

Create quantized weight tensors on layer.

Parameters:
  • layer – The linear module that will own the parameters.

  • input_size_per_partition – Number of input features on this TP rank.

  • output_partition_sizes – Output sizes of each logical weight on this TP rank. For a standard linear layer this is [out_features_per_partition]. For a merged QKV layer it might be [q_size, k_size, v_size].

  • input_size – Full (un-sharded) input dimension.

  • output_size – Full (un-sharded) output dimension.

  • params_dtype – Data type for full-precision parameters (e.g. torch.float16).

  • **extra_weight_attrs – Additional metadata to attach to created parameters (e.g. weight_loader, packed_dim, packed_factor).

  • args (Any)

  • kwargs (Any)

Example (AWQ W4A16):

    # Register packed 4-bit weights, scales, and zero-points
    qweight = Parameter(torch.empty(…, dtype=torch.int32))
    layer.register_parameter("qweight", qweight)

    scales = Parameter(torch.empty(…, dtype=params_dtype))
    layer.register_parameter("scales", scales)

    qzeros = Parameter(torch.empty(…, dtype=torch.int32))
    layer.register_parameter("qzeros", qzeros)

Return type:

None

process_weights_after_loading(layer)

Post-process parameters after checkpoint loading.

Called once by ModelRunner after all checkpoint tensors have been loaded into the layer’s parameters. Use this for:

  • Repacking: converting checkpoint layout to kernel-native layout (e.g. AutoAWQ int4 → Marlin packed format).

  • Transposing: rearranging dimensions for optimized GEMM kernels.

  • Calibration: computing per-tensor or per-channel scales from the loaded FP weights (e.g. dynamic FP8 quantization).

  • Cleanup: replacing custom parameter wrappers with plain torch.nn.Parameter to avoid overhead during inference.

The default implementation is a no-op.

Parameters:

layer (torch.nn.Module)

Return type:

None
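
A minimal sketch of the "transposing" case from the list above, using the module-level replace_parameter() helper (illustrative only):

    import torch
    from pymllm.quantization.methods.compressed_tensors import replace_parameter

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # Some GEMM kernels expect (in_features, out_features) layout.
        w_t = layer.weight.data.t().contiguous()
        replace_parameter(layer, "weight", w_t)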

apply(layer, x, bias=None)

Compute the quantized linear forward.

Parameters:
  • layer (torch.nn.Module) – The module that owns quantized parameters (set by create_weights()).

  • x (torch.Tensor) – Input activation tensor, shape (*, input_size_per_partition).

  • bias (Optional[torch.Tensor]) – Optional bias vector.

Returns:

torch.Tensor – Output tensor, shape (*, sum(output_partition_sizes)).

Example (AWQ W4A16):

    qweight = layer.qweight  # packed int32
    scales = layer.scales    # fp16 per-group scales
    qzeros = layer.qzeros    # packed int32 zero-points
    # → invoke dequant + matmul kernel

Return type:

torch.Tensor

class pymllm.quantization.methods.compressed_tensors.CompressedTensorsConfig(*, quant_format, ignore, weight_bits, group_size, weight_strategy, weight_type, weight_dynamic, symmetric, actorder, input_bits, input_strategy, input_type, input_dynamic, input_symmetric)

Bases: pymllm.quantization.quant_config.QuantizationConfig

Base class for quantization configurations.

A QuantizationConfig is instantiated once per model load. It reads quantization metadata from the checkpoint (bit-width, group size, etc.) and provides QuantizeMethodBase instances to each layer.

Subclass contract

  • get_name() — return the method name (e.g. "awq").

  • from_config() — class method that parses a dict from the checkpoint’s quantize_config.json.

  • get_quant_method() — return the appropriate LinearMethodBase (or None to skip quantization for a layer).

Optional overrides

  • get_supported_act_dtypes() – restrict the activation dtypes the method accepts.

  • get_min_capability() – declare a minimum CUDA compute capability.

  • get_config_filenames() – change which config files are searched for in the checkpoint directory.
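
A minimal illustration of the subclass contract (a toy config, not this module's implementation):

    from typing import Any, Dict

    from pymllm.quantization.quant_config import QuantizationConfig

    class ToyConfig(QuantizationConfig):
        def __init__(self, weight_bits: int) -> None:
            self.weight_bits = weight_bits

        def get_name(self) -> str:
            return "toy"

        @classmethod
        def from_config(cls, config: Dict[str, Any]) -> "ToyConfig":
            return cls(weight_bits=config["bits"])

        def get_quant_method(self, layer, prefix: str = ""):
            return None  # skip: fall back to UnquantizedLinearMethod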

quant_format
ignore
weight_bits
group_size
weight_strategy
weight_type
weight_dynamic
symmetric
actorder
input_bits
input_strategy
input_type
input_dynamic
input_symmetric
get_name()

Return the canonical name of this quantization method.

Examples: "awq", "gptq", "fp8", "w8a8".

Return type:

str

get_supported_act_dtypes()

Activation dtypes supported by this method.

Override to restrict (e.g. FP8 only supports float16). Default: no restriction.

Return type:

List[torch.dtype]

classmethod get_min_capability()

Minimum CUDA compute capability (e.g. 75 for Turing).

Default: 0 (no restriction).

Return type:

int

static get_config_filenames()

File names to look for in the checkpoint directory.

Default: ["quantize_config.json"].

Return type:

List[str]

classmethod from_config(config)

Create an instance from a checkpoint’s quantization config dict.

Parameters:

config (Dict[str, Any]) – Parsed JSON from the checkpoint’s quantize_config.json or the quantization_config section of config.json.

Example config dict (AWQ):

    {
        "quant_method": "awq",
        "bits": 4,
        "group_size": 128,
        "zero_point": true
    }

Return type:

CompressedTensorsConfig

get_quant_method(layer, prefix='')

Return the quantization method for layer, or None to skip.

Parameters:
  • layer (torch.nn.Module) – The nn.Module being constructed (e.g. ColumnParallelLinear).

  • prefix (str) – The layer’s full dotted name in the model (e.g. "model.layers.0.self_attn.q_proj"). Can be used to selectively skip quantization for certain layers.

Returns:

The method instance. None means this layer should fall back to the default UnquantizedLinearMethod.

Return type:

QuantizeMethodBase or None
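
A hedged sketch of prefix-based skipping driven by this config's ignore list (the matching rule and the _make_linear_method factory are illustrative, not this module's code):

    def get_quant_method(self, layer, prefix: str = ""):
        # Layers whose dotted name matches an entry in `ignore` stay unquantized.
        if any(prefix.startswith(pattern) for pattern in self.ignore):
            return None  # caller falls back to UnquantizedLinearMethod
        return self._make_linear_method(layer)  # hypothetical factory helper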

Parameters:
  • quant_format (str)

  • ignore (List[str])

  • weight_bits (int)

  • group_size (Optional[int])

  • weight_strategy (Optional[str])

  • weight_type (Optional[str])

  • weight_dynamic (bool)

  • symmetric (bool)

  • actorder (Optional[str])

  • input_bits (Optional[int])

  • input_strategy (Optional[str])

  • input_type (Optional[str])

  • input_dynamic (bool)

  • input_symmetric (bool)
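
A hedged construction example using the parameters above (the values describe a plausible W4A16 grouped setup and are illustrative only):

    config = CompressedTensorsConfig(
        quant_format="pack-quantized",  # format string is an assumption
        ignore=["lm_head"],
        weight_bits=4,
        group_size=128,
        weight_strategy="group",
        weight_type="int",
        weight_dynamic=False,
        symmetric=True,
        actorder=None,
        input_bits=None,
        input_strategy=None,
        input_type=None,
        input_dynamic=False,
        input_symmetric=False,
    )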