pymllm.backends.qualcomm.transformers.core.observer¶
Classes¶
| Class | Description |
|---|---|
| `PerBlockParamObserver` | |
| `PerBlockParamFakeQuantize` | |
| `ConcatObserver` | Fetch maximum data range of all tensors to be concatenated |
Module Contents¶
- class pymllm.backends.qualcomm.transformers.core.observer.PerBlockParamObserver(dtype, block_size, quant_min=None, quant_max=None, eps=torch.finfo(torch.float32).eps, **kwargs)¶
Bases: `torchao.quantization.pt2e._affine_quantization.AffineQuantizedMinMaxObserver`

- Parameters:
dtype (torch.dtype)
block_size (torch.Size)
- dtype¶
- block_size¶
- bitwidth_of_scale = 4¶
- num_steps = 16¶
- calibrated = False¶
- forward(input)¶
- Parameters:
input (torch.Tensor)
- calculate_qparams()¶
- Return type:
Tuple[torch.Tensor, torch.Tensor]
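To make the observer's role concrete, here is a dependency-free sketch of what a per-block min/max observer computes: the tensor is split into fixed-size blocks and each block gets its own scale from its min/max range. The function name and the symmetric-range choice are illustrative assumptions, not the actual `torchao`-based implementation.

```python
# Illustrative sketch of per-block quantization parameters, mirroring the
# idea behind PerBlockParamObserver (not its exact code): each block of
# `block_size` values yields its own (scale, zero_point) pair.

def per_block_qparams(values, block_size, quant_max=7):
    """Compute a (scale, zero_point) pair per block of `values`."""
    scales, zero_points = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # Symmetric range around zero, as is typical for weight observers,
        # so the zero point is always 0.
        max_abs = max(abs(v) for v in block)
        scale = max_abs / quant_max if max_abs > 0 else 1.0
        scales.append(scale)
        zero_points.append(0)
    return scales, zero_points

scales, zps = per_block_qparams([0.1, -0.4, 2.0, 0.05, -0.02, 0.01], block_size=3)
```

A block dominated by one large weight (here `2.0`) gets a coarse scale, while the second block's small values get a much finer one; this is the motivation for per-block rather than per-tensor parameters.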
- class pymllm.backends.qualcomm.transformers.core.observer.PerBlockParamFakeQuantize(dtype=torch.int8, block_size=None, quant_min=None, quant_max=None, eps=torch.finfo(torch.float32).eps, **kwargs)¶
Bases: `torchao.quantization.pt2e.FakeQuantize`

- Parameters:
dtype (torch.dtype)
block_size (torch.Size)
quant_min (int)
quant_max (int)
eps (float)
- activation_post_process¶
- dtype¶
- block_size = None¶
- quant_min¶
- quant_max¶
- eps¶
- forward(x)¶
- Parameters:
x (torch.Tensor)
- Return type:
torch.Tensor
- calculate_qparams()¶
- Return type:
Tuple[torch.Tensor, torch.Tensor]
- convert(model, observer_node)¶
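Fake quantization keeps the forward pass in floating point but injects the rounding and clamping error of the target integer format. The sketch below shows that quantize-dequantize round trip for a single block; the function name and the `quant_min`/`quant_max` defaults are illustrative, not the class's actual internals.

```python
# Illustrative fake-quantize pass: quantize, clamp, then immediately
# dequantize, so the output stays float but carries quantization error.
# PerBlockParamFakeQuantize applies this with one scale per block; this
# standalone sketch handles a single block with a given scale.

def fake_quantize_block(block, scale, quant_min=-8, quant_max=7):
    out = []
    for v in block:
        q = round(v / scale)                   # quantize to an integer step
        q = max(quant_min, min(quant_max, q))  # clamp to the int range
        out.append(q * scale)                  # dequantize back to float
    return out

recovered = fake_quantize_block([0.30, -0.29, 0.04], scale=0.1)
```

Values snap to the nearest multiple of `scale` (so `0.04` collapses to `0.0`), which is exactly the error the quantized model will see at inference time.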
- class pymllm.backends.qualcomm.transformers.core.observer.ConcatObserver(dtype=torch.uint8, qscheme=torch.per_tensor_affine, reduce_range=False, quant_min=None, quant_max=None, factory_kwargs=None, eps=torch.finfo(torch.float32).eps, is_dynamic=False, **kwargs)¶
Bases: `torchao.quantization.pt2e.UniformQuantizationObserverBase`

Fetch maximum data range of all tensors to be concatenated
- input_observers = []¶
- add_observer(observer)¶
- forward(x_orig)¶
- calculate_qparams()¶
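The point of a concat observer is that all inputs to a concatenation must share one quantization grid, so a single (min, max) is tracked across every input. The class and method names below are illustrative; this is a minimal torch-free sketch of that idea, not the `ConcatObserver` implementation.

```python
# Illustrative shared-range observer for concatenation: every tensor that
# feeds a concat updates one common (min, max), so they all map onto the
# same affine quantization grid.

class SharedRangeObserver:
    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))
        return values  # observers are pass-through in the forward pass

    def qparams(self, quant_min=0, quant_max=255):
        # Affine scale/zero-point over the union of all observed ranges.
        scale = (self.max_val - self.min_val) / (quant_max - quant_min)
        zero_point = round(quant_min - self.min_val / scale)
        return scale, zero_point

obs = SharedRangeObserver()
obs.observe([0.0, 1.0])    # first concat input
obs.observe([-2.0, 0.5])   # second concat input
scale, zp = obs.qparams()
```

Because the second input widens the range to `[-2.0, 1.0]`, both inputs end up quantized with the same scale and zero point, and the concat itself needs no requantization.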