pymllm.quantization.quant_config

Quantization configuration base class and registry.

This module provides the bridge between a model checkpoint’s quantization metadata (e.g. quantize_config.json) and the runtime LinearMethodBase instances used by each linear layer.

Architecture overview:

quantize_config.json   ──parse──►  QuantizationConfig subclass
                                      │
                                      │  get_quant_method(layer, prefix)
                                      ▼
                                 LinearMethodBase instance
                                  (AWQLinearMethod, FP8LinearMethod, ...)

How to add a new quantization method

  1. Create a QuantizationConfig subclass (e.g. AWQConfig).

  2. Implement get_name(), from_config(), get_quant_method().

  3. Register it:

    from pymllm.quantization.quant_config import register_quantization
    
    @register_quantization("awq")
    class AWQConfig(QuantizationConfig):
        ...
    
  4. When the server starts with --quantization.method awq, the loader will call get_quantization_config("awq") to obtain the config class, then from_config(hf_quant_config) to instantiate it, and finally config.get_quant_method(layer, prefix) for each linear layer.
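The end-to-end flow above can be sketched with toy stand-ins. Everything here (`ToyAWQConfig`, `ToyLinearMethod`, the inline registry) is illustrative; only the method names and call order come from this module's documented contract, not pymllm's actual internals.

```python
# Hedged sketch of the loader flow: registry lookup -> from_config ->
# per-layer get_quant_method. Toy classes, not the real pymllm ones.
from typing import Any, Dict, Optional, Type

_REGISTRY: Dict[str, Type] = {}  # method name -> config class


def register_quantization(name: str):
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


def get_quantization_config(method: str):
    return _REGISTRY[method]  # raises KeyError if not registered


class ToyLinearMethod:  # stands in for a LinearMethodBase subclass
    def __init__(self, bits: int):
        self.bits = bits


@register_quantization("awq")
class ToyAWQConfig:
    def __init__(self, bits: int):
        self.bits = bits

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "ToyAWQConfig":
        return cls(bits=config["bits"])

    def get_quant_method(self, layer: Any, prefix: str = "") -> Optional[ToyLinearMethod]:
        return ToyLinearMethod(self.bits)


# What the loader does for --quantization.method awq:
config_cls = get_quantization_config("awq")
config = config_cls.from_config({"quant_method": "awq", "bits": 4})
method = config.get_quant_method(layer=None, prefix="model.layers.0.mlp.gate_proj")
```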

Classes

QuantizationConfig

Base class for quantization configurations.

Functions

register_quantization(name)

Class decorator that registers a QuantizationConfig subclass.

get_quantization_config(method)

Look up a registered QuantizationConfig by name.

list_quantization_methods()

Return sorted list of registered quantization method names.

Module Contents

pymllm.quantization.quant_config.register_quantization(name)

Class decorator that registers a QuantizationConfig subclass.

Usage:

@register_quantization("awq")
class AWQConfig(QuantizationConfig):
    ...
Parameters:

name (str)

Return type:

Callable[[Type[QuantizationConfig]], Type[QuantizationConfig]]

pymllm.quantization.quant_config.get_quantization_config(method)

Look up a registered QuantizationConfig by name.

Raises KeyError if the method is not registered.

Parameters:

method (str)

Return type:

Type[QuantizationConfig]

pymllm.quantization.quant_config.list_quantization_methods()

Return sorted list of registered quantization method names.

Return type:

List[str]

class pymllm.quantization.quant_config.QuantizationConfig

Bases: abc.ABC

Base class for quantization configurations.

A QuantizationConfig is instantiated once per model load. It reads quantization metadata from the checkpoint (bit-width, group size, etc.) and provides QuantizeMethodBase instances to each layer.

Subclass contract

  • get_name() — return the method name (e.g. "awq").

  • from_config() — class method that parses a dict from the checkpoint’s quantize_config.json.

  • get_quant_method() — return the appropriate LinearMethodBase (or None to skip quantization for a layer).

Optional overrides: get_supported_act_dtypes(), get_min_capability(), get_config_filenames().

abstractmethod get_name()

Return the canonical name of this quantization method.

Examples: "awq", "gptq", "fp8", "w8a8".

Return type:

str

abstractmethod classmethod from_config(config)

Create an instance from a checkpoint’s quantization config dict.

Parameters:
  • config (Dict[str, Any]) – Parsed JSON from the checkpoint’s quantize_config.json or the quantization_config section of config.json.

Example config dict (AWQ):

    {
        "quant_method": "awq",
        "bits": 4,
        "group_size": 128,
        "zero_point": true
    }

Return type:

QuantizationConfig
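A hedged sketch of parsing such a dict. The field names follow the AWQ example above; the class itself is illustrative, not pymllm's real AWQConfig:

```python
# Illustrative from_config that validates and parses an AWQ-style dict.
from typing import Any, Dict


class ExampleAWQConfig:
    def __init__(self, bits: int, group_size: int, zero_point: bool):
        self.bits = bits
        self.group_size = group_size
        self.zero_point = zero_point

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "ExampleAWQConfig":
        # Validate early so a bad checkpoint fails at load time,
        # not deep inside layer construction.
        if config.get("quant_method") != "awq":
            raise ValueError(f"not an AWQ config: {config.get('quant_method')!r}")
        return cls(
            bits=int(config["bits"]),
            group_size=int(config["group_size"]),
            zero_point=bool(config["zero_point"]),
        )


cfg = ExampleAWQConfig.from_config(
    {"quant_method": "awq", "bits": 4, "group_size": 128, "zero_point": True}
)
```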

abstractmethod get_quant_method(layer, prefix='')

Return the quantization method for layer, or None to skip.

Parameters:
  • layer (torch.nn.Module) – The nn.Module being constructed (e.g. ColumnParallelLinear).

  • prefix (str) – The layer’s full dotted name in the model (e.g. "model.layers.0.self_attn.q_proj"). Can be used to selectively skip quantization for certain layers.

Returns:

The method instance. None means this layer should fall back to the default UnquantizedLinearMethod.

Return type:

QuantizeMethodBase or None
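The prefix parameter enables the selective skipping mentioned above. The sketch below is illustrative only (the class, the skip rule, and FakeLinearMethod are assumptions); it shows returning None to fall back to the unquantized path:

```python
# Hedged sketch: skip quantization for lm_head via the prefix argument.
from typing import Any, Optional


class FakeLinearMethod:  # stand-in for a real LinearMethodBase subclass
    pass


class SelectiveConfig:
    def get_quant_method(self, layer: Any, prefix: str = "") -> Optional[FakeLinearMethod]:
        # None means "fall back to the default UnquantizedLinearMethod",
        # per the contract documented above.
        if prefix == "lm_head" or prefix.endswith(".lm_head"):
            return None
        return FakeLinearMethod()


cfg = SelectiveConfig()
skipped = cfg.get_quant_method(None, "lm_head")
quantized = cfg.get_quant_method(None, "model.layers.0.self_attn.q_proj")
```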

get_supported_act_dtypes()

Activation dtypes supported by this method.

Override to restrict (e.g. FP8 only supports float16). Default: no restriction.

Return type:

List[torch.dtype]

classmethod get_min_capability()

Minimum CUDA compute capability (e.g. 75 for Turing).

Default: 0 (no restriction).

Return type:

int

static get_config_filenames()

File names to look for in the checkpoint directory.

Default: ["quantize_config.json"].

Return type:

List[str]