pymllm.quantization.methods.compressed_tensors¶
Attributes¶
MARLIN_SUPPORTED_GROUP_SIZES
GPTQ_MARLIN_MIN_THREAD_N
GPTQ_MARLIN_MIN_THREAD_K
GPTQ_MARLIN_TILE
SCALAR_TYPE_UINT4
SCALAR_TYPE_UINT4B8
Classes¶
CompressedTensorsWNA16Scheme
CompressedTensorsW8A8Int8Scheme
CompressedTensorsLinearMethod – Base class for quantization methods applied to linear layers.
CompressedTensorsConfig – Base class for quantization configurations.
Functions¶
verify_marlin_supported
verify_marlin_supports_shape
marlin_make_workspace
marlin_make_empty_g_idx
get_scale_perms
marlin_permute_scales
replace_parameter
Module Contents¶
- pymllm.quantization.methods.compressed_tensors.MARLIN_SUPPORTED_GROUP_SIZES¶
- pymllm.quantization.methods.compressed_tensors.GPTQ_MARLIN_MIN_THREAD_N = 64¶
- pymllm.quantization.methods.compressed_tensors.GPTQ_MARLIN_MIN_THREAD_K = 128¶
- pymllm.quantization.methods.compressed_tensors.GPTQ_MARLIN_TILE = 16¶
- pymllm.quantization.methods.compressed_tensors.SCALAR_TYPE_UINT4¶
- pymllm.quantization.methods.compressed_tensors.SCALAR_TYPE_UINT4B8¶
- pymllm.quantization.methods.compressed_tensors.verify_marlin_supported(group_size)¶
- Parameters:
group_size (int)
- Return type:
None
- pymllm.quantization.methods.compressed_tensors.verify_marlin_supports_shape(output_size_per_partition, input_size_per_partition, input_size, group_size)¶
- Parameters:
output_size_per_partition (int)
input_size_per_partition (int)
input_size (int)
group_size (int)
- Return type:
None
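A minimal usage sketch of the two checks above. The layer shapes are hypothetical, and both functions are assumed to raise on an unsupported configuration (they return None on success)::

    from pymllm.quantization.methods.compressed_tensors import (
        verify_marlin_supported,
        verify_marlin_supports_shape,
    )

    # Hypothetical 4096x4096 linear layer sharded over 2 tensor-parallel ranks.
    verify_marlin_supported(group_size=128)
    verify_marlin_supports_shape(
        output_size_per_partition=2048,  # out_features on this TP rank
        input_size_per_partition=4096,   # in_features on this TP rank
        input_size=4096,                 # full (un-sharded) input dimension
        group_size=128,
    )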
- pymllm.quantization.methods.compressed_tensors.marlin_make_workspace(device)¶
- Parameters:
device (torch.device)
- Return type:
torch.Tensor
- pymllm.quantization.methods.compressed_tensors.marlin_make_empty_g_idx(device)¶
- Parameters:
device (torch.device)
- Return type:
torch.Tensor
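A sketch of allocating the two Marlin kernel helpers; the comments describe their presumed roles::

    import torch
    from pymllm.quantization.methods.compressed_tensors import (
        marlin_make_workspace,
        marlin_make_empty_g_idx,
    )

    device = torch.device("cuda:0")
    workspace = marlin_make_workspace(device)  # scratch buffer used by the Marlin GEMM
    g_idx = marlin_make_empty_g_idx(device)    # placeholder indices when actorder is unused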
- pymllm.quantization.methods.compressed_tensors.get_scale_perms()¶
- pymllm.quantization.methods.compressed_tensors.marlin_permute_scales(s, size_k, size_n, group_size)¶
- Parameters:
s (torch.Tensor)
size_k (int)
size_n (int)
group_size (int)
- Return type:
torch.Tensor
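An illustrative call, assuming the usual per-group scale layout of shape (size_k // group_size, size_n)::

    import torch
    from pymllm.quantization.methods.compressed_tensors import marlin_permute_scales

    size_k, size_n, group_size = 4096, 4096, 128  # hypothetical GEMM dimensions
    s = torch.rand(size_k // group_size, size_n,
                   dtype=torch.float16)           # assumed per-group scale layout
    s_marlin = marlin_permute_scales(s, size_k, size_n, group_size)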
- pymllm.quantization.methods.compressed_tensors.replace_parameter(layer, name, new_data)¶
- Parameters:
layer (torch.nn.Module)
name (str)
new_data (torch.Tensor)
- Return type:
None
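A sketch of the typical use, swapping a loaded parameter for a repacked tensor; the transpose stands in for a real repacking step::

    import torch
    from pymllm.quantization.methods.compressed_tensors import replace_parameter

    layer = torch.nn.Linear(128, 128)
    repacked = layer.weight.data.t().contiguous()  # placeholder for a real repack
    replace_parameter(layer, "weight", repacked)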
- class pymllm.quantization.methods.compressed_tensors.CompressedTensorsWNA16Scheme(*, weight_bits, group_size, symmetric, actorder)¶
- Parameters:
weight_bits (int)
group_size (int)
symmetric (bool)
actorder (Optional[str])
- weight_bits¶
- group_size¶
- symmetric¶
- actorder¶
- pack_factor¶
- quant_type¶
- create_weights(layer, input_size_per_partition, output_partition_sizes, input_size, output_size, params_dtype, **extra_weight_attrs)¶
- Parameters:
layer (torch.nn.Module)
input_size_per_partition (int)
output_partition_sizes (List[int])
input_size (int)
output_size (int)
params_dtype (torch.dtype)
extra_weight_attrs (Any)
- Return type:
None
- process_weights_after_loading(layer)¶
- Parameters:
layer (torch.nn.Module)
- Return type:
None
- apply(layer, x, bias=None)¶
- Parameters:
layer (torch.nn.Module)
x (torch.Tensor)
bias (Optional[torch.Tensor])
- Return type:
torch.Tensor
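A sketch of the scheme lifecycle (create, load, post-process, apply). Shapes are hypothetical and any extra weight attributes (e.g. a weight_loader) are omitted::

    import torch
    from pymllm.quantization.methods.compressed_tensors import CompressedTensorsWNA16Scheme

    scheme = CompressedTensorsWNA16Scheme(
        weight_bits=4, group_size=128, symmetric=True, actorder=None)

    layer = torch.nn.Module()  # container that will own the quantized parameters
    scheme.create_weights(
        layer,
        input_size_per_partition=4096,
        output_partition_sizes=[4096],
        input_size=4096,
        output_size=4096,
        params_dtype=torch.float16,
    )
    # ... checkpoint tensors are loaded into the layer's parameters here ...
    scheme.process_weights_after_loading(layer)  # e.g. repack into Marlin layout
    x = torch.rand(8, 4096, dtype=torch.float16, device="cuda")
    out = scheme.apply(layer, x)                 # shape (8, 4096)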
- class pymllm.quantization.methods.compressed_tensors.CompressedTensorsW8A8Int8Scheme(*, weight_bits)¶
- Parameters:
weight_bits (int)
- weight_bits¶
- create_weights(layer, input_size_per_partition, output_partition_sizes, input_size, output_size, params_dtype, **extra_weight_attrs)¶
- Parameters:
layer (torch.nn.Module)
input_size_per_partition (int)
output_partition_sizes (List[int])
input_size (int)
output_size (int)
params_dtype (torch.dtype)
extra_weight_attrs (Any)
- Return type:
None
- process_weights_after_loading(layer)¶
- Parameters:
layer (torch.nn.Module)
- Return type:
None
- apply(layer, x, bias=None)¶
- Parameters:
layer (torch.nn.Module)
x (torch.Tensor)
bias (Optional[torch.Tensor])
- Return type:
torch.Tensor
- class pymllm.quantization.methods.compressed_tensors.CompressedTensorsLinearMethod(quant_config, signature)¶
Bases: pymllm.layers.quantize_base.LinearMethodBase
Base class for quantization methods applied to linear layers.
Narrows the QuantizeMethodBase interface with concrete signatures tailored to linear (matmul) operations. Subclasses must implement create_weights() and apply().
- Parameters:
quant_config (CompressedTensorsConfig)
signature (str)
- quant_config¶
- scheme¶
- create_weights(*args, **kwargs)¶
Create quantized weight tensors on layer.
- Parameters:
layer – The linear module that will own the parameters.
input_size_per_partition – Number of input features on this TP rank.
output_partition_sizes – Output sizes of each logical weight on this TP rank. For a standard linear layer this is [out_features_per_partition]. For a merged QKV layer it might be [q_size, k_size, v_size].
input_size – Full (un-sharded) input dimension.
output_size – Full (un-sharded) output dimension.
params_dtype – Data type for full-precision parameters (e.g. torch.float16).
**extra_weight_attrs – Additional metadata to attach to created parameters (e.g. weight_loader, packed_dim, packed_factor).
Example (AWQ W4A16)::
    # Register packed 4-bit weights, scales, and zero-points
    qweight = Parameter(torch.empty(…, dtype=torch.int32))
    layer.register_parameter("qweight", qweight)
    scales = Parameter(torch.empty(…, dtype=params_dtype))
    layer.register_parameter("scales", scales)
    qzeros = Parameter(torch.empty(…, dtype=torch.int32))
    layer.register_parameter("qzeros", qzeros)
args (Any)
kwargs (Any)
- Return type:
None
- process_weights_after_loading(layer)¶
Post-process parameters after checkpoint loading.
Called once by ModelRunner after all checkpoint tensors have been loaded into the layer’s parameters. Use this for:
Repacking: converting checkpoint layout to kernel-native layout (e.g. AutoAWQ int4 → Marlin packed format).
Transposing: rearranging dimensions for optimised GEMM kernels.
Calibration: computing per-tensor or per-channel scales from the loaded FP weights (e.g. dynamic FP8 quantisation).
Cleanup: replacing custom parameter wrappers with plain torch.nn.Parameter to avoid overhead during inference.
The default implementation is a no-op.
- Parameters:
layer (torch.nn.Module)
- Return type:
None
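As an illustration of the repacking and cleanup cases above, a hypothetical override might look like this; the transpose is only a stand-in for a real kernel-layout conversion::

    import torch
    from pymllm.quantization.methods.compressed_tensors import (
        CompressedTensorsLinearMethod,
        replace_parameter,
    )

    class MyLinearMethod(CompressedTensorsLinearMethod):  # hypothetical subclass
        def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
            # Repacking: checkpoint layout -> kernel-native layout (illustrative).
            replace_parameter(layer, "qweight", layer.qweight.data.t().contiguous())
            # Cleanup: swap custom parameter wrappers for plain Parameters.
            layer.scales = torch.nn.Parameter(layer.scales.data, requires_grad=False)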
- apply(layer, x, bias=None)¶
Compute the quantized linear forward.
- Parameters:
layer (torch.nn.Module) – The module that owns quantized parameters (set by create_weights()).
x (torch.Tensor) – Input activation tensor, shape (*, input_size_per_partition).
bias (Optional[torch.Tensor]) – Optional bias vector.
- Returns:
torch.Tensor – Output tensor, shape (*, sum(output_partition_sizes)).
Example (AWQ W4A16)::
    qweight = layer.qweight  # packed int32
    scales = layer.scales    # fp16 per-group scales
    qzeros = layer.qzeros    # packed int32 zero-points
    # → invoke dequant + matmul kernel
- Return type:
torch.Tensor
- class pymllm.quantization.methods.compressed_tensors.CompressedTensorsConfig(*, quant_format, ignore, weight_bits, group_size, weight_strategy, weight_type, weight_dynamic, symmetric, actorder, input_bits, input_strategy, input_type, input_dynamic, input_symmetric)¶
Bases: pymllm.quantization.quant_config.QuantizationConfig
Base class for quantization configurations.
A QuantizationConfig is instantiated once per model load. It reads quantization metadata from the checkpoint (bit-width, group size, etc.) and provides QuantizeMethodBase instances to each layer.
Subclass contract¶
get_name() — return the method name (e.g. "awq").
from_config() — class method that parses a dict from the checkpoint’s quantize_config.json.
get_quant_method() — return the appropriate LinearMethodBase (or None to skip quantization for a layer).
Optional overrides¶
get_supported_act_dtypes() — restrict activation dtypes.
get_min_capability() — minimum GPU compute capability.
get_config_filenames() — files to probe in the checkpoint dir.
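A minimal sketch of the subclass contract; the class and its fields are hypothetical::

    from typing import Any, Dict
    import torch
    from pymllm.quantization.quant_config import QuantizationConfig

    class MyQuantConfig(QuantizationConfig):  # hypothetical method
        def __init__(self, weight_bits: int, group_size: int) -> None:
            self.weight_bits = weight_bits
            self.group_size = group_size

        def get_name(self) -> str:
            return "my_quant"

        @classmethod
        def from_config(cls, config: Dict[str, Any]) -> "MyQuantConfig":
            return cls(weight_bits=config["bits"], group_size=config["group_size"])

        def get_quant_method(self, layer: torch.nn.Module, prefix: str = ""):
            return None  # every layer falls back to UnquantizedLinearMethod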
- quant_format¶
- ignore¶
- weight_bits¶
- group_size¶
- weight_strategy¶
- weight_type¶
- weight_dynamic¶
- symmetric¶
- actorder¶
- input_bits¶
- input_strategy¶
- input_type¶
- input_dynamic¶
- input_symmetric¶
- get_name()¶
Return the canonical name of this quantization method.
Examples: "awq", "gptq", "fp8", "w8a8".
- Return type:
str
- get_supported_act_dtypes()¶
Activation dtypes supported by this method.
Override to restrict (e.g. FP8 only supports float16). Default: no restriction.
- Return type:
List[torch.dtype]
- classmethod get_min_capability()¶
Minimum CUDA compute capability (e.g. 75 for Turing).
Default: 0 (no restriction).
- Return type:
int
- static get_config_filenames()¶
File names to look for in the checkpoint directory.
Default: ["quantize_config.json"].
- Return type:
List[str]
- classmethod from_config(config)¶
Create an instance from a checkpoint’s quantization config dict.
- Parameters:
config (Dict[str, Any]) – Parsed JSON from the checkpoint’s quantize_config.json or the quantization_config section of config.json.
Example config dict (AWQ)::
    {
        "quant_method": "awq",
        "bits": 4,
        "group_size": 128,
        "zero_point": true
    }
- Return type:
CompressedTensorsConfig
- get_quant_method(layer, prefix='')¶
Return the quantization method for layer, or None to skip.
- Parameters:
layer (torch.nn.Module) – The nn.Module being constructed (e.g. ColumnParallelLinear).
prefix (str) – The layer’s full dotted name in the model (e.g. "model.layers.0.self_attn.q_proj"). Can be used to selectively skip quantization for certain layers.
- Returns:
The method instance. None means this layer should fall back to the default UnquantizedLinearMethod.
- Return type:
QuantizeMethodBase or None
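A sketch of how an implementation might use prefix together with the ignore list; the prefix-matching rule and the use of prefix as the method signature are assumptions::

    import torch
    from pymllm.quantization.methods.compressed_tensors import (
        CompressedTensorsLinearMethod,
    )

    def get_quant_method(self, layer: torch.nn.Module, prefix: str = ""):
        # Hypothetical matching rule: skip layers named in the ignore list.
        if any(prefix.startswith(pat) for pat in self.ignore):
            return None  # layer falls back to UnquantizedLinearMethod
        # Passing the layer prefix as the signature is an assumption.
        return CompressedTensorsLinearMethod(self, signature=prefix)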
- Parameters:
quant_format (str)
ignore (List[str])
weight_bits (int)
group_size (Optional[int])
weight_strategy (Optional[str])
weight_type (Optional[str])
weight_dynamic (bool)
symmetric (bool)
actorder (Optional[str])
input_bits (Optional[int])
input_strategy (Optional[str])
input_type (Optional[str])
input_dynamic (bool)
input_symmetric (bool)