pymllm.backends.qualcomm.transformers.core.qlinear¶
Classes¶
| Class | Description |
|---|---|
| QLinear | |
| QLinearW8A16_PerChannelSym | |
| DoubleQuantizer | Handles LPBQ double normalization logic to work like FakeQuantize |
| QLinearLPBQ | |
Module Contents¶
- class pymllm.backends.qualcomm.transformers.core.qlinear.QLinear(in_features, out_features, bias=True)¶
Bases: torch.nn.Module
- in_features¶
- out_features¶
- weight¶
- act_quant = None¶
- weight_quant = None¶
- deploy_mode = False¶
- freeze_weight()¶
PTQ core: observe the current weights, then compute and fix the scale and zero-point (see the sketch after this class).
- abstractmethod forward(x)¶
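The freeze step is the core of post-training quantization (PTQ). As a rough illustration only, a per-tensor symmetric freeze might look like the sketch below; the function name `freeze_weight_sketch`, the 8-bit default, and the exact formula are assumptions, not pymllm's implementation.

```python
import torch

def freeze_weight_sketch(weight: torch.Tensor, n_bits: int = 8):
    """Hypothetical helper: derive a fixed symmetric scale/zero-point
    from the observed weight tensor (per-tensor, for illustration)."""
    qmax = 2 ** (n_bits - 1) - 1           # e.g. 127 for int8
    scale = weight.abs().max() / qmax      # symmetric range, so zero-point is 0
    zero_point = torch.zeros((), dtype=torch.int32)
    return scale, zero_point
```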
- class pymllm.backends.qualcomm.transformers.core.qlinear.QLinearW8A16_PerChannelSym(in_features, out_features, bias=True)¶
Bases: QLinear
- weight_quant¶
- forward(x)¶
- convert_to_deploy()¶
- convert_to_conv2d_deploy_hwio()¶
Convert to deploy format with HWIO layout [1, 1, In, Out]. This format is commonly used by convolution-based inference engines.
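For orientation, here is a minimal sketch of what per-channel symmetric weight fake-quantization and the HWIO reshape could look like. The helpers `per_channel_sym_qdq` and `to_hwio`, and the rounding/clamping details, are illustrative assumptions based only on the class and method names above.

```python
import torch

def per_channel_sym_qdq(w: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Hypothetical quantize-dequantize: one symmetric scale per output channel."""
    qmax = 2 ** (n_bits - 1) - 1
    # One scale per output row; clamp guards against all-zero rows.
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=torch.finfo(w.dtype).eps)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                       # dequantized weights, same shape as w

def to_hwio(w: torch.Tensor) -> torch.Tensor:
    """Reshape a [Out, In] linear weight to the [1, 1, In, Out] HWIO layout."""
    out_f, in_f = w.shape
    return w.t().contiguous().view(1, 1, in_f, out_f)
```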
- class pymllm.backends.qualcomm.transformers.core.qlinear.DoubleQuantizer(block_size=64)¶
Bases: torch.nn.Module
Handles the LPBQ double-normalization logic so that it behaves like FakeQuantize.
- block_size = 64¶
- w_recon_cached = None¶
- freeze(w)¶
- quantize_dequantize(w, save_buffers=False)¶
- forward(w)¶
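The name and docstring suggest a two-level scheme in the spirit of LPBQ: weights are quantized block-wise, and the per-block scales are themselves normalized against a coarser per-channel scale before the quantize-dequantize round trip. A sketch under that assumption follows; `lpbq_qdq_sketch` and the chosen bit widths are guesses for illustration, not pymllm's actual math.

```python
import torch

def lpbq_qdq_sketch(w: torch.Tensor, block_size: int = 64,
                    w_bits: int = 4, scale_bits: int = 8) -> torch.Tensor:
    """Hypothetical two-level (block-wise) quantize-dequantize round trip."""
    out_f, in_f = w.shape
    assert in_f % block_size == 0
    blocks = w.view(out_f, in_f // block_size, block_size)
    qmax = 2 ** (w_bits - 1) - 1
    # Level 1: one raw scale per (channel, block).
    block_scale = blocks.abs().amax(dim=-1, keepdim=True) / qmax
    # Level 2: normalize the block scales against a per-channel scale, so only
    # small integer scale ratios need to be stored.
    smax = 2 ** scale_bits - 1
    chan_scale = (block_scale.amax(dim=1, keepdim=True) / smax).clamp(
        min=torch.finfo(w.dtype).eps)
    q_scale = torch.clamp(torch.round(block_scale / chan_scale), 1, smax)
    recon_scale = q_scale * chan_scale
    q = torch.clamp(torch.round(blocks / recon_scale), -qmax - 1, qmax)
    return (q * recon_scale).view(out_f, in_f)
```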
- class pymllm.backends.qualcomm.transformers.core.qlinear.QLinearLPBQ(in_features, out_features, bias=True, block_size=64)¶
Bases: QLinear
- weight_quant¶
- forward(x)¶
- convert_to_deploy()¶
- convert_to_conv2d_deploy_hwio()¶
Convert to deploy format with HWIO layout [1, 1, In, Out]. This format is commonly used by convolution-based inference engines.
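Putting the pieces together, a hypothetical calibrate-then-deploy flow with QLinearLPBQ might read as follows. The layer dimensions are made up, and details such as when forward may legally be called are assumptions inferred from the API listed above.

```python
import torch
from pymllm.backends.qualcomm.transformers.core.qlinear import QLinearLPBQ

layer = QLinearLPBQ(in_features=4096, out_features=4096, bias=False, block_size=64)
layer.freeze_weight()                     # PTQ: observe weights, fix scale/zero-point
y = layer(torch.randn(1, 4096))           # fake-quantized forward pass
layer.convert_to_conv2d_deploy_hwio()     # weight becomes [1, 1, 4096, 4096] (HWIO)
```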