pymllm.backends.qualcomm.transformers.core.observer

Classes

PerBlockParamObserver

PerBlockParamFakeQuantize

ConcatObserver
    Fetch the maximum data range of all tensors to be concatenated.

Module Contents

class pymllm.backends.qualcomm.transformers.core.observer.PerBlockParamObserver(dtype, block_size, quant_min=None, quant_max=None, eps=torch.finfo(torch.float32).eps, **kwargs)

Bases: torchao.quantization.pt2e._affine_quantization.AffineQuantizedMinMaxObserver

Parameters:
  • dtype (torch.dtype)
  • block_size (torch.Size)

dtype
block_size
bitwidth_of_scale = 4
num_steps = 16
calibrated = False
forward(input)

Parameters:
  • input (torch.Tensor)

calculate_qparams()

Return type:
  Tuple[torch.Tensor, torch.Tensor]
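
Example

A minimal calibration sketch, assuming the observer follows the usual torchao observer protocol: calling the module records per-block min/max statistics and calculate_qparams() returns a (scale, zero_point) pair. The tensor shape and block_size below are illustrative, not prescriptive.

    import torch
    from pymllm.backends.qualcomm.transformers.core.observer import PerBlockParamObserver

    weight = torch.randn(128, 256)

    # Illustrative block size: each row is split into 64-element blocks
    # that receive their own scale and zero-point (assumed semantics).
    observer = PerBlockParamObserver(dtype=torch.int8, block_size=torch.Size([1, 64]))

    observer(weight)  # forward() calibrates min/max statistics on the tensor
    scale, zero_point = observer.calculate_qparams()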

class pymllm.backends.qualcomm.transformers.core.observer.PerBlockParamFakeQuantize(dtype=torch.int8, block_size=None, quant_min=None, quant_max=None, eps=torch.finfo(torch.float32).eps, **kwargs)

Bases: torchao.quantization.pt2e.FakeQuantize

Parameters:
  • dtype (torch.dtype)
  • block_size (torch.Size)
  • quant_min (int)
  • quant_max (int)
  • eps (float)

activation_post_process
dtype
block_size = None
quant_min
quant_max
eps
forward(x)

Parameters:
  • x (torch.Tensor)

Return type:
  torch.Tensor

calculate_qparams()

Return type:
  Tuple[torch.Tensor, torch.Tensor]

convert(model, observer_node)
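
Example

A minimal QAT-style sketch, assuming the module behaves like its torchao.quantization.pt2e.FakeQuantize base: calling it observes the input through activation_post_process and returns a fake-quantized tensor. block_size is an illustrative choice.

    import torch
    from pymllm.backends.qualcomm.transformers.core.observer import PerBlockParamFakeQuantize

    fq = PerBlockParamFakeQuantize(dtype=torch.int8, block_size=torch.Size([1, 64]))

    w = torch.randn(128, 256)
    w_fq = fq(w)  # observe the tensor, then return its fake-quantized version
    scale, zero_point = fq.calculate_qparams()

convert(model, observer_node) appears to mirror the hook that the PT2E convert step invokes to replace an observer node with explicit quantize/dequantize ops; verify against the source before relying on that.
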
class pymllm.backends.qualcomm.transformers.core.observer.ConcatObserver(dtype=torch.uint8, qscheme=torch.per_tensor_affine, reduce_range=False, quant_min=None, quant_max=None, factory_kwargs=None, eps=torch.finfo(torch.float32).eps, is_dynamic=False, **kwargs)

Bases: torchao.quantization.pt2e.UniformQuantizationObserverBase

Fetch the maximum data range of all tensors to be concatenated.

input_observers = []

add_observer(observer)

forward(x_orig)

calculate_qparams()
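
Example

A minimal sketch of the intended sharing pattern. It assumes that add_observer() registers peer observers in input_observers so that calculate_qparams() covers the widest range seen by any registered observer; the exact wiring is an assumption, so check the source before relying on it.

    import torch
    from pymllm.backends.qualcomm.transformers.core.observer import ConcatObserver

    # One observer per input of a torch.cat (hypothetical setup).
    obs_a = ConcatObserver(dtype=torch.uint8)
    obs_b = ConcatObserver(dtype=torch.uint8)

    # Assumed semantics: cross-registering lets each observer account for
    # the ranges collected by the other, so both cat inputs share qparams.
    obs_a.add_observer(obs_b)
    obs_b.add_observer(obs_a)

    obs_a(torch.randn(4, 8))        # narrow-range input
    obs_b(torch.randn(4, 8) * 5.0)  # wider-range input dominates the shared range
    scale, zero_point = obs_a.calculate_qparams()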