pymllm.quantization.quant_config

Quantization configuration base class and registry.

This module provides the bridge between a model checkpoint’s quantization metadata (e.g. quantize_config.json) and the runtime LinearMethodBase instances used by each linear layer.

Architecture overview:

quantize_config.json   ──parse──►  QuantizationConfig subclass
                                      │
                                      │  get_quant_method(layer, prefix)
                                      ▼
                                 LinearMethodBase instance
                                  (AWQLinearMethod, FP8LinearMethod, ...)

How to add a new quantization method

  1. Create a QuantizationConfig subclass (e.g. AWQConfig).

  2. Implement get_name(), from_config(), get_quant_method().

  3. Register it:

    from pymllm.quantization.quant_config import register_quantization
    
    @register_quantization("awq")
    class AWQConfig(QuantizationConfig):
        ...
    
  4. When the server starts with --quantization.method awq, the loader will call get_quantization_config("awq") to obtain the config class, then from_config(hf_quant_config) to instantiate it, and finally config.get_quant_method(layer, prefix) for each linear layer.
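The end-to-end flow above can be sketched with toy stand-ins. Everything here (`ToyAWQConfig`, `ToyLinearMethod`, the inline registry) is illustrative; only the method names and call order come from this module's documented contract, not pymllm's actual internals.

```python
# Hedged sketch of the loader flow: registry lookup -> from_config ->
# per-layer get_quant_method. Toy classes, not the real pymllm ones.
from typing import Any, Dict, Optional, Type

_REGISTRY: Dict[str, Type] = {}  # method name -> config class


def register_quantization(name: str):
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


def get_quantization_config(method: str):
    return _REGISTRY[method]  # raises KeyError if not registered


class ToyLinearMethod:  # stands in for a LinearMethodBase subclass
    def __init__(self, bits: int):
        self.bits = bits


@register_quantization("awq")
class ToyAWQConfig:
    def __init__(self, bits: int):
        self.bits = bits

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "ToyAWQConfig":
        return cls(bits=config["bits"])

    def get_quant_method(self, layer: Any, prefix: str = "") -> Optional[ToyLinearMethod]:
        return ToyLinearMethod(self.bits)


# What the loader does for --quantization.method awq:
config_cls = get_quantization_config("awq")
config = config_cls.from_config({"quant_method": "awq", "bits": 4})
method = config.get_quant_method(layer=None, prefix="model.layers.0.mlp.gate_proj")
```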

Classes

QuantizationConfig

Base class for quantization configurations.

Functions

register_quantization(name)

Class decorator that registers a QuantizationConfig subclass.

get_quantization_config(method)

Look up a registered QuantizationConfig by name.

list_quantization_methods()

Return sorted list of registered quantization method names.

Module Contents

pymllm.quantization.quant_config.register_quantization(name)

Class decorator that registers a QuantizationConfig subclass.

Usage:

@register_quantization("awq")
class AWQConfig(QuantizationConfig):
    ...
Parameters:

name (str)

Return type:

Callable[[Type[QuantizationConfig]], Type[QuantizationConfig]]

pymllm.quantization.quant_config.get_quantization_config(method)

Look up a registered QuantizationConfig by name.

Raises KeyError if the method is not registered.

Parameters:

method (str)

Return type:

Type[QuantizationConfig]

pymllm.quantization.quant_config.list_quantization_methods()

Return sorted list of registered quantization method names.

Return type:

List[str]

class pymllm.quantization.quant_config.QuantizationConfig

Bases: abc.ABC

Base class for quantization configurations.

A QuantizationConfig is instantiated once per model load. It reads quantization metadata from the checkpoint (bit-width, group size, etc.) and provides QuantizeMethodBase instances to each layer.

Subclass contract

  • get_name() — return the method name (e.g. "awq").

  • from_config() — class method that parses a dict from the checkpoint’s quantize_config.json.

  • get_quant_method() — return the appropriate LinearMethodBase (or None to skip quantization for a layer).

Optional overrides: get_supported_act_dtypes(), get_min_capability(), get_config_filenames().

abstractmethod get_name()

Return the canonical name of this quantization method.

Examples: "awq", "gptq", "fp8", "w8a8".

Return type:

str

abstractmethod classmethod from_config(config)

Create an instance from a checkpoint’s quantization config dict.

Parameters:
  • config (Dict[str, Any]) – Parsed JSON from the checkpoint’s quantize_config.json or the quantization_config section of config.json.

Example config dict (AWQ):

    {
        "quant_method": "awq",
        "bits": 4,
        "group_size": 128,
        "zero_point": true
    }

Return type:

QuantizationConfig
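A hedged sketch of parsing such a dict. The field names follow the AWQ example above; the class itself is illustrative, not pymllm's real AWQConfig:

```python
# Illustrative from_config that validates and parses an AWQ-style dict.
from typing import Any, Dict


class ExampleAWQConfig:
    def __init__(self, bits: int, group_size: int, zero_point: bool):
        self.bits = bits
        self.group_size = group_size
        self.zero_point = zero_point

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "ExampleAWQConfig":
        # Validate early so a bad checkpoint fails at load time,
        # not deep inside layer construction.
        if config.get("quant_method") != "awq":
            raise ValueError(f"not an AWQ config: {config.get('quant_method')!r}")
        return cls(
            bits=int(config["bits"]),
            group_size=int(config["group_size"]),
            zero_point=bool(config["zero_point"]),
        )


cfg = ExampleAWQConfig.from_config(
    {"quant_method": "awq", "bits": 4, "group_size": 128, "zero_point": True}
)
```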

abstractmethod get_quant_method(layer, prefix='')

Return the quantization method for layer, or None to skip.

Parameters:
  • layer (torch.nn.Module) – The nn.Module being constructed (e.g. ColumnParallelLinear).

  • prefix (str) – The layer’s full dotted name in the model (e.g. "model.layers.0.self_attn.q_proj"). Can be used to selectively skip quantization for certain layers.

Returns:

The method instance. None means this layer should fall back to the default UnquantizedLinearMethod.

Return type:

QuantizeMethodBase or None
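The prefix parameter enables the selective skipping mentioned above. The sketch below is illustrative only (the class, the skip rule, and FakeLinearMethod are assumptions); it shows returning None to fall back to the unquantized path:

```python
# Hedged sketch: skip quantization for lm_head via the prefix argument.
from typing import Any, Optional


class FakeLinearMethod:  # stand-in for a real LinearMethodBase subclass
    pass


class SelectiveConfig:
    def get_quant_method(self, layer: Any, prefix: str = "") -> Optional[FakeLinearMethod]:
        # None means "fall back to the default UnquantizedLinearMethod",
        # per the contract documented above.
        if prefix == "lm_head" or prefix.endswith(".lm_head"):
            return None
        return FakeLinearMethod()


cfg = SelectiveConfig()
skipped = cfg.get_quant_method(None, "lm_head")
quantized = cfg.get_quant_method(None, "model.layers.0.self_attn.q_proj")
```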

get_supported_act_dtypes()

Activation dtypes supported by this method.

Override to restrict (e.g. FP8 only supports float16). Default: no restriction.

Return type:

List[torch.dtype]

classmethod get_min_capability()

Minimum CUDA compute capability (e.g. 75 for Turing).

Default: 0 (no restriction).

Return type:

int

static get_config_filenames()

File names to look for in the checkpoint directory.

Default: ["quantize_config.json"].

Return type:

List[str]