pymllm.quantization.methods.compressed_tensors

Attributes

MARLIN_SUPPORTED_GROUP_SIZES
GPTQ_MARLIN_MIN_THREAD_N
GPTQ_MARLIN_MIN_THREAD_K
GPTQ_MARLIN_TILE
SCALAR_TYPE_UINT4
SCALAR_TYPE_UINT4B8

Classes

CompressedTensorsWNA16Scheme

CompressedTensorsW8A8Int8Scheme

CompressedTensorsLinearMethod

Base class for quantization methods applied to linear layers.

CompressedTensorsConfig

Base class for quantization configurations.

Functions

verify_marlin_supported
verify_marlin_supports_shape
marlin_make_workspace
marlin_make_empty_g_idx
get_scale_perms
marlin_permute_scales
replace_parameter

Module Contents

pymllm.quantization.methods.compressed_tensors.MARLIN_SUPPORTED_GROUP_SIZES
pymllm.quantization.methods.compressed_tensors.GPTQ_MARLIN_MIN_THREAD_N = 64
pymllm.quantization.methods.compressed_tensors.GPTQ_MARLIN_MIN_THREAD_K = 128
pymllm.quantization.methods.compressed_tensors.GPTQ_MARLIN_TILE = 16
pymllm.quantization.methods.compressed_tensors.SCALAR_TYPE_UINT4
pymllm.quantization.methods.compressed_tensors.SCALAR_TYPE_UINT4B8
pymllm.quantization.methods.compressed_tensors.verify_marlin_supported(group_size)
Parameters:

group_size (int)

Return type:

None

pymllm.quantization.methods.compressed_tensors.verify_marlin_supports_shape(output_size_per_partition, input_size_per_partition, input_size, group_size)
Parameters:
  • output_size_per_partition (int)

  • input_size_per_partition (int)

  • input_size (int)

  • group_size (int)

Return type:

None
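
A hedged usage sketch of both verification helpers (shape and group-size values are illustrative; the assumption here is that the helpers raise on unsupported configurations rather than returning a flag):

    from pymllm.quantization.methods.compressed_tensors import (
        verify_marlin_supported,
        verify_marlin_supports_shape,
    )

    verify_marlin_supported(group_size=128)
    verify_marlin_supports_shape(
        output_size_per_partition=4096,  # output features on this TP rank
        input_size_per_partition=4096,   # input features on this TP rank
        input_size=4096,                 # full, un-sharded input dimension
        group_size=128,
    )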

pymllm.quantization.methods.compressed_tensors.marlin_make_workspace(device)
Parameters:

device (torch.device)

Return type:

torch.Tensor

pymllm.quantization.methods.compressed_tensors.marlin_make_empty_g_idx(device)
Parameters:

device (torch.device)

Return type:

torch.Tensor
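
A minimal sketch of preparing the Marlin scratch buffers, assuming both helpers allocate directly on the given device:

    import torch
    from pymllm.quantization.methods.compressed_tensors import (
        marlin_make_workspace,
        marlin_make_empty_g_idx,
    )

    device = torch.device("cuda:0")
    workspace = marlin_make_workspace(device)  # scratch space for the Marlin kernel
    g_idx = marlin_make_empty_g_idx(device)    # placeholder when activation reordering is unused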

pymllm.quantization.methods.compressed_tensors.get_scale_perms()
pymllm.quantization.methods.compressed_tensors.marlin_permute_scales(s, size_k, size_n, group_size)
Parameters:
  • s (torch.Tensor)

  • size_k (int)

  • size_n (int)

  • group_size (int)

Return type:

torch.Tensor
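
A hedged usage sketch, assuming s holds per-group scales of shape (size_k // group_size, size_n):

    import torch
    from pymllm.quantization.methods.compressed_tensors import marlin_permute_scales

    size_k, size_n, group_size = 4096, 4096, 128
    s = torch.rand(size_k // group_size, size_n, dtype=torch.float16)
    s_marlin = marlin_permute_scales(s, size_k, size_n, group_size)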

pymllm.quantization.methods.compressed_tensors.replace_parameter(layer, name, new_data)
Parameters:
  • layer (torch.nn.Module)

  • name (str)

  • new_data (torch.Tensor)

Return type:

None
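
A short sketch of the typical use: swapping a loaded parameter for a kernel-repacked tensor inside process_weights_after_loading (repack_for_kernel is a hypothetical stand-in for a kernel-specific transform):

    import torch
    from pymllm.quantization.methods.compressed_tensors import replace_parameter

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        repacked = repack_for_kernel(layer.qweight.data)  # hypothetical repacking step
        # Re-register under the same name so later lookups see the new tensor.
        replace_parameter(layer, "qweight", repacked)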

class pymllm.quantization.methods.compressed_tensors.CompressedTensorsWNA16Scheme(*, weight_bits, group_size, symmetric, actorder)
Parameters:
  • weight_bits (int)

  • group_size (int)

  • symmetric (bool)

  • actorder (Optional[str])

weight_bits
group_size
symmetric
actorder
pack_factor
quant_type
create_weights(layer, input_size_per_partition, output_partition_sizes, input_size, output_size, params_dtype, **extra_weight_attrs)
Parameters:
  • layer (torch.nn.Module)

  • input_size_per_partition (int)

  • output_partition_sizes (List[int])

  • input_size (int)

  • output_size (int)

  • params_dtype (torch.dtype)

  • extra_weight_attrs (Any)

Return type:

None

process_weights_after_loading(layer)
Parameters:

layer (torch.nn.Module)

Return type:

None

apply(layer, x, bias=None)
Parameters:
  • layer (torch.nn.Module)

  • x (torch.Tensor)

  • bias (Optional[torch.Tensor])

Return type:

torch.Tensor
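
A sketch of the scheme's three-step lifecycle (sizes and dtype are illustrative; real callers typically also pass extra weight attributes such as a weight_loader):

    import torch
    from pymllm.quantization.methods.compressed_tensors import CompressedTensorsWNA16Scheme

    scheme = CompressedTensorsWNA16Scheme(
        weight_bits=4, group_size=128, symmetric=True, actorder=None
    )
    layer = torch.nn.Module()

    # 1. Register quantized parameters on the bare module.
    scheme.create_weights(
        layer,
        input_size_per_partition=4096,
        output_partition_sizes=[4096],
        input_size=4096,
        output_size=4096,
        params_dtype=torch.float16,
    )
    # 2. Checkpoint tensors are loaded into the layer here, then repacked.
    scheme.process_weights_after_loading(layer)
    # 3. Forward pass through the quantized weights.
    y = scheme.apply(layer, torch.randn(1, 4096, dtype=torch.float16))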

class pymllm.quantization.methods.compressed_tensors.CompressedTensorsW8A8Int8Scheme(*, weight_bits)
Parameters:

weight_bits (int)

weight_bits
create_weights(layer, input_size_per_partition, output_partition_sizes, input_size, output_size, params_dtype, **extra_weight_attrs)
Parameters:
  • layer (torch.nn.Module)

  • input_size_per_partition (int)

  • output_partition_sizes (List[int])

  • input_size (int)

  • output_size (int)

  • params_dtype (torch.dtype)

  • extra_weight_attrs (Any)

Return type:

None

process_weights_after_loading(layer)
Parameters:

layer (torch.nn.Module)

Return type:

None

apply(layer, x, bias=None)
Parameters:
  • layer (torch.nn.Module)

  • x (torch.Tensor)

  • bias (Optional[torch.Tensor])

Return type:

torch.Tensor
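
For reference, symmetric per-channel int8 weight quantization (one common strategy for W8A8 schemes) can be written as below; this is a standalone illustration, not this module's internal code:

    import torch

    def quantize_w8_per_channel(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # One scale per output channel: scale = max(|w|) / 127.
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        q = torch.round(w / scale).clamp(-128, 127).to(torch.int8)
        return q, scale.squeeze(1)

    w = torch.randn(4096, 4096)
    q, scale = quantize_w8_per_channel(w)
    w_hat = q.float() * scale.unsqueeze(1)  # dequantized approximation of w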

class pymllm.quantization.methods.compressed_tensors.CompressedTensorsLinearMethod(quant_config, signature)

Bases: pymllm.layers.quantize_base.LinearMethodBase

Base class for quantization methods applied to linear layers.

Narrows the QuantizeMethodBase interface with concrete signatures tailored to linear (matmul) operations.

Subclasses must implement create_weights() and apply().

Parameters:
  • quant_config

  • signature

quant_config
scheme
create_weights(*args, **kwargs)

Create quantized weight tensors on layer.

Parameters:
  • layer – The linear module that will own the parameters.

  • input_size_per_partition – Number of input features on this TP rank.

  • output_partition_sizes – Output sizes of each logical weight on this TP rank. For a standard linear layer this is [out_features_per_partition]. For a merged QKV layer it might be [q_size, k_size, v_size].

  • input_size – Full (un-sharded) input dimension.

  • output_size – Full (un-sharded) output dimension.

  • params_dtype – Data type for full-precision parameters (e.g. torch.float16).

  • **extra_weight_attrs – Additional metadata to attach to created parameters (e.g. weight_loader, packed_dim, packed_factor).

  • args (Any)

  • kwargs (Any)

Example (AWQ W4A16):

    # Register packed 4-bit weights, scales, and zero-points
    qweight = Parameter(torch.empty(…, dtype=torch.int32))
    layer.register_parameter("qweight", qweight)

    scales = Parameter(torch.empty(…, dtype=params_dtype))
    layer.register_parameter("scales", scales)

    qzeros = Parameter(torch.empty(…, dtype=torch.int32))
    layer.register_parameter("qzeros", qzeros)

Return type:

None

process_weights_after_loading(layer)

Post-process parameters after checkpoint loading.

Called once by ModelRunner after all checkpoint tensors have been loaded into the layer’s parameters. Use this for:

  • Repacking: converting checkpoint layout to kernel-native layout (e.g. AutoAWQ int4 → Marlin packed format).

  • Transposing: rearranging dimensions for optimized GEMM kernels.

  • Calibration: computing per-tensor or per-channel scales from the loaded FP weights (e.g. dynamic FP8 quantization).

  • Cleanup: replacing custom parameter wrappers with plain torch.nn.Parameter to avoid overhead during inference.

The default implementation is a no-op.

Parameters:

layer (torch.nn.Module)

Return type:

None
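
A minimal sketch of the "transposing" case from the list above, using the module-level replace_parameter() helper (illustrative only):

    import torch
    from pymllm.quantization.methods.compressed_tensors import replace_parameter

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # Some GEMM kernels expect (in_features, out_features) layout.
        w_t = layer.weight.data.t().contiguous()
        replace_parameter(layer, "weight", w_t)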

apply(layer, x, bias=None)

Compute the quantized linear forward.

Parameters:
  • layer (torch.nn.Module) – The module that owns quantized parameters (set by create_weights()).

  • x (torch.Tensor) – Input activation tensor, shape (*, input_size_per_partition).

  • bias (Optional[torch.Tensor]) – Optional bias vector.

Returns:

torch.Tensor – Output tensor, shape (*, sum(output_partition_sizes)).

Example (AWQ W4A16):

    qweight = layer.qweight  # packed int32
    scales = layer.scales    # fp16 per-group scales
    qzeros = layer.qzeros    # packed int32 zero-points
    # → invoke dequant + matmul kernel

Return type:

torch.Tensor

class pymllm.quantization.methods.compressed_tensors.CompressedTensorsConfig(*, quant_format, ignore, weight_bits, group_size, weight_strategy, weight_type, weight_dynamic, symmetric, actorder, input_bits, input_strategy, input_type, input_dynamic, input_symmetric)

Bases: pymllm.quantization.quant_config.QuantizationConfig

Base class for quantization configurations.

A QuantizationConfig is instantiated once per model load. It reads quantization metadata from the checkpoint (bit-width, group size, etc.) and provides QuantizeMethodBase instances to each layer.

Subclass contract

  • get_name() — return the method name (e.g. "awq").

  • from_config() — class method that parses a dict from the checkpoint’s quantize_config.json.

  • get_quant_method() — return the appropriate LinearMethodBase (or None to skip quantization for a layer).

Optional overrides

  • get_supported_act_dtypes() – restrict the activation dtypes the method accepts.

  • get_min_capability() – declare a minimum CUDA compute capability.

  • get_config_filenames() – change which config files are searched for in the checkpoint directory.
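
A minimal illustration of the subclass contract (a toy config, not this module's implementation):

    from typing import Any, Dict

    from pymllm.quantization.quant_config import QuantizationConfig

    class ToyConfig(QuantizationConfig):
        def __init__(self, weight_bits: int) -> None:
            self.weight_bits = weight_bits

        def get_name(self) -> str:
            return "toy"

        @classmethod
        def from_config(cls, config: Dict[str, Any]) -> "ToyConfig":
            return cls(weight_bits=config["bits"])

        def get_quant_method(self, layer, prefix: str = ""):
            return None  # skip: fall back to UnquantizedLinearMethod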

quant_format
ignore
weight_bits
group_size
weight_strategy
weight_type
weight_dynamic
symmetric
actorder
input_bits
input_strategy
input_type
input_dynamic
input_symmetric
get_name()

Return the canonical name of this quantization method.

Examples: "awq", "gptq", "fp8", "w8a8".

Return type:

str

get_supported_act_dtypes()

Activation dtypes supported by this method.

Override to restrict (e.g. FP8 only supports float16). Default: no restriction.

Return type:

List[torch.dtype]

classmethod get_min_capability()

Minimum CUDA compute capability (e.g. 75 for Turing).

Default: 0 (no restriction).

Return type:

int

static get_config_filenames()

File names to look for in the checkpoint directory.

Default: ["quantize_config.json"].

Return type:

List[str]

classmethod from_config(config)

Create an instance from a checkpoint’s quantization config dict.

Parameters:

config (Dict[str, Any]) – Parsed JSON from the checkpoint’s quantize_config.json or the quantization_config section of config.json.

Example config dict (AWQ):

    {
        "quant_method": "awq",
        "bits": 4,
        "group_size": 128,
        "zero_point": true
    }

Return type:

CompressedTensorsConfig

get_quant_method(layer, prefix='')

Return the quantization method for layer, or None to skip.

Parameters:
  • layer (torch.nn.Module) – The nn.Module being constructed (e.g. ColumnParallelLinear).

  • prefix (str) – The layer’s full dotted name in the model (e.g. "model.layers.0.self_attn.q_proj"). Can be used to selectively skip quantization for certain layers.

Returns:

The method instance. None means this layer should fall back to the default UnquantizedLinearMethod.

Return type:

QuantizeMethodBase or None
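
A hedged sketch of prefix-based skipping driven by this config's ignore list (the matching rule and the _make_linear_method factory are illustrative, not this module's code):

    def get_quant_method(self, layer, prefix: str = ""):
        # Layers whose dotted name matches an entry in `ignore` stay unquantized.
        if any(prefix.startswith(pattern) for pattern in self.ignore):
            return None  # caller falls back to UnquantizedLinearMethod
        return self._make_linear_method(layer)  # hypothetical factory helper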

Parameters:
  • quant_format (str)

  • ignore (List[str])

  • weight_bits (int)

  • group_size (Optional[int])

  • weight_strategy (Optional[str])

  • weight_type (Optional[str])

  • weight_dynamic (bool)

  • symmetric (bool)

  • actorder (Optional[str])

  • input_bits (Optional[int])

  • input_strategy (Optional[str])

  • input_type (Optional[str])

  • input_dynamic (bool)

  • input_symmetric (bool)
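
A hedged construction example using the parameters above (the values describe a plausible W4A16 grouped setup and are illustrative only):

    config = CompressedTensorsConfig(
        quant_format="pack-quantized",  # format string is an assumption
        ignore=["lm_head"],
        weight_bits=4,
        group_size=128,
        weight_strategy="group",
        weight_type="int",
        weight_dynamic=False,
        symmetric=True,
        actorder=None,
        input_bits=None,
        input_strategy=None,
        input_type=None,
        input_dynamic=False,
        input_symmetric=False,
    )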