pymllm.backends.qualcomm.transformers.core.qlinear

Classes

QLinear

QLinearW8A16_PerChannelSym

DoubleQuantizer

Implements the LPBQ double-quantization logic so it behaves like FakeQuantize

QLinearLPBQ

Module Contents

class pymllm.backends.qualcomm.transformers.core.qlinear.QLinear(in_features, out_features, bias=True)

Bases: torch.nn.Module

in_features
out_features
weight
act_quant = None
weight_quant = None
deploy_mode = False
freeze_weight()

PTQ core: observes the current weights, then computes and fixes the quantization scale and zero-point (see the usage sketch after this class entry).

abstractmethod forward(x)
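
A minimal sketch of the intended PTQ flow, using the concrete QLinearW8A16_PerChannelSym subclass below (the base class leaves forward abstract). The calibration loop, batch shapes, and feature sizes are illustrative assumptions:

    import torch

    # Hypothetical calibrate-then-freeze flow for a concrete QLinear subclass.
    layer = QLinearW8A16_PerChannelSym(in_features=256, out_features=512)

    # Observation pass: run representative data through the layer so the
    # attached quantizers can collect statistics.
    with torch.no_grad():
        for _ in range(8):
            layer(torch.randn(4, 256))

    # PTQ core step: observe the current weights and fix scale / zero-point.
    layer.freeze_weight()
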
class pymllm.backends.qualcomm.transformers.core.qlinear.QLinearW8A16_PerChannelSym(in_features, out_features, bias=True)

Bases: QLinear

weight_quant
forward(x)
convert_to_deploy()
convert_to_conv2d_deploy_hwio()

Convert to deploy format with HWIO layout [1, 1, In, Out]. This format is commonly used by convolution-based inference engines.
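For reference, a minimal sketch of the per-channel symmetric int8 quantize-dequantize that the class name suggests (W8A16: 8-bit weights, 16-bit activations). The exact rounding and observer behaviour inside weight_quant are assumptions:

    import torch

    def w8_per_channel_sym_fake_quant(w: torch.Tensor) -> torch.Tensor:
        # Symmetric per-output-channel int8: scale from each row's absolute
        # maximum, zero-point fixed at 0.
        max_abs = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
        scale = max_abs / 127.0
        q = torch.clamp(torch.round(w / scale), -127, 127)
        return q * scale
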

class pymllm.backends.qualcomm.transformers.core.qlinear.DoubleQuantizer(block_size=64)

Bases: torch.nn.Module

Implements the LPBQ double-quantization logic so it behaves like FakeQuantize

block_size = 64
w_recon_cached = None
freeze(w)
quantize_dequantize(w, save_buffers=False)
forward(w)
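
A rough sketch of LPBQ-style double quantization: per-block weight scales are themselves quantized against a per-channel scale, so only low-precision integer block scales need to be stored. Bit-widths and names here are illustrative assumptions, not the module's actual internals:

    import torch

    def lpbq_fake_quant(w: torch.Tensor, block_size: int = 64) -> torch.Tensor:
        # Assumes in_features is divisible by block_size.
        out_f, in_f = w.shape
        blocks = w.reshape(out_f, in_f // block_size, block_size)

        # Level 1: symmetric per-block scales for (assumed) 4-bit weights.
        block_scale = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0

        # Level 2: quantize the block scales against a per-channel scale
        # (assumed 8-bit), so only integer block scales are stored.
        chan_scale = block_scale.amax(dim=1, keepdim=True) / 255.0
        q_scale = torch.clamp(torch.round(block_scale / chan_scale), 1, 255)
        recon_scale = q_scale * chan_scale

        # Quantize-dequantize the weights with the reconstructed block scales.
        q = torch.clamp(torch.round(blocks / recon_scale), -7, 7)
        return (q * recon_scale).reshape(out_f, in_f)
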
class pymllm.backends.qualcomm.transformers.core.qlinear.QLinearLPBQ(in_features, out_features, bias=True, block_size=64)

Bases: QLinear

weight_quant
forward(x)
convert_to_deploy()
convert_to_conv2d_deploy_hwio()

Convert to deploy format with HWIO layout [1, 1, In, Out]. This format is commonly used by convolution-based inference engines.
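
The HWIO conversion that both convert_to_conv2d_deploy_hwio helpers describe amounts to transposing the usual torch.nn.Linear weight of shape [Out, In] and adding two unit spatial dimensions. A minimal sketch of that reshape, matching the layout the docstring names:

    import torch

    def linear_weight_to_hwio(w: torch.Tensor) -> torch.Tensor:
        # [Out, In] -> HWIO [1, 1, In, Out], the layout consumed by
        # 1x1-convolution-based inference engines.
        out_f, in_f = w.shape
        return w.t().contiguous().reshape(1, 1, in_f, out_f)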