runner¶
Classes¶
- Qwen2Quantizer
Functions¶
- recompute_scale_zp — Callback function: forcefully refresh the scale and zero_point of all FakeQuantize modules after calibration.
- validate_concat_observer_fn — Callback function: validate that all input_observers in a ConcatObserver have consistent scale and zero_point.
Module Contents¶
- runner.recompute_scale_zp(module)¶
Callback function: forcefully refresh the scale and zero_point of all FakeQuantize modules after calibration.
- Problem solved:
When using ConcatObserver, min/max values may be updated during the forward pass, but at the end of the pass the scale/zero_point stored in each FakeQuantize's internal buffers are still computed from the old min/max. This function forces a calculate_qparams call to sync the latest parameters into the buffers.
- Usage:
model.apply(recompute_scale_zp)
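The callback's effect can be sketched without PyTorch. The stub classes below are hypothetical stand-ins for `torch.ao.quantization.FakeQuantize` and its min/max observer (the real callback would type-check against those classes), and the unsigned 8-bit affine qparams formula is one common convention, not necessarily the one `runner` uses:

```python
class MinMaxObserverStub:
    """Stand-in observer tracking a running min/max over an unsigned 8-bit range."""
    def __init__(self, min_val=0.0, max_val=1.0):
        self.min_val, self.max_val = min_val, max_val

    def update(self, min_val, max_val):
        # Widen the observed range, as a forward pass during calibration would.
        self.min_val = min(self.min_val, min_val)
        self.max_val = max(self.max_val, max_val)

    def calculate_qparams(self):
        # Affine uint8 quantization: scale maps [min, max] onto [0, 255].
        scale = (self.max_val - self.min_val) / 255.0
        zero_point = round(-self.min_val / scale) if scale else 0
        return scale, zero_point


class FakeQuantizeStub:
    """Stand-in FakeQuantize: holds an observer plus cached scale/zero_point buffers."""
    def __init__(self, observer):
        self.activation_post_process = observer
        self.scale, self.zero_point = observer.calculate_qparams()


def recompute_scale_zp(module):
    """Callback for model.apply(): re-derive qparams from the observer's
    current min/max and write them back into the module's cached buffers."""
    if isinstance(module, FakeQuantizeStub):
        obs = module.activation_post_process
        module.scale, module.zero_point = obs.calculate_qparams()
```

After calibration widens an observer's range, the cached scale is stale until the callback runs; that is exactly the buffer-sync problem described above.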
- runner.validate_concat_observer_fn(module, results, name='')¶
Callback function: Validate that all input_observers in ConcatObserver have consistent scale and zero_point.
- Usage:
results = []
for name, m in model.named_modules():
    validate_concat_observer_fn(m, results, name)
- Parameters:
results (list)
name (str)
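The validation can be sketched in framework-free form. `ConcatObserverStub` and `ObserverStub` below are hypothetical stand-ins for the real ConcatObserver and its per-input observers; the only assumption carried over from the docs is that a ConcatObserver exposes an `input_observers` collection whose members can produce (scale, zero_point) pairs:

```python
class ObserverStub:
    """Stand-in per-input observer with fixed quantization parameters."""
    def __init__(self, scale, zero_point):
        self._qparams = (scale, zero_point)

    def calculate_qparams(self):
        return self._qparams


class ConcatObserverStub:
    """Stand-in ConcatObserver: one observer per input of a concat op."""
    def __init__(self, input_observers):
        self.input_observers = input_observers


def validate_concat_observer_fn(module, results, name=""):
    """Append a (name, qparams) report to `results` when the module is a
    concat observer whose inputs disagree on scale or zero_point."""
    if not isinstance(module, ConcatObserverStub):
        return
    qparams = [obs.calculate_qparams() for obs in module.input_observers]
    if len(set(qparams)) > 1:  # more than one distinct (scale, zp) pair
        results.append((name, qparams))
```

Collecting into a shared `results` list, as in the usage loop above, lets one pass over `named_modules()` report every inconsistent concat at once.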
- runner.freeze_qwen2_rmsnorm_weight(m)¶
- runner.freeze_qwen2_linear_weight(m)¶
- runner.freeze_qwen2_embed_tokens_weight(m)¶
- runner.disable_qdq_observer(m)¶
- runner.enable_qdq_observer(m)¶
- runner.enable_fake_quant(m)¶
- runner.disable_fake_quant(m)¶
- runner.convert_weight(m)¶
- class runner.Qwen2Quantizer(model_path, mllm_qualcomm_max_length=2048)¶
- Parameters:
model_path (str)
mllm_qualcomm_max_length (int)
- tokenizer¶
- model¶
- mllm_qualcomm_max_length = 2048¶
- freeze_activation()¶
- enable_activation_update()¶
- enable_fake_quant()¶
- disable_fake_quant()¶
- compile()¶
- infer(prompt)¶
- Parameters:
prompt (str)
- calibrate(num_samples=64, max_seq_length=512)¶
Perform calibration using a Wikipedia dataset (PTQ).
- Parameters:
num_samples (int) – Number of samples for calibration
max_seq_length (int) – Maximum length of each sample (not exceeding mllm_qualcomm_max_length)
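The PTQ calibration loop follows a standard shape: truncate each sample to the length cap and run it through the model so that observers record activation ranges, stopping after `num_samples`. The sketch below shows only that shape; `model_forward` and `samples` are hypothetical stand-ins for the actual model call and the tokenized Wikipedia samples:

```python
def calibrate(model_forward, samples, num_samples=64, max_seq_length=512):
    """Run up to num_samples forward passes on length-capped samples.

    model_forward: any callable taking one tokenized sample (stand-in).
    samples: iterable of token sequences (stand-in for the dataset).
    Returns the number of samples actually used.
    """
    seen = 0
    for sample in samples:
        if seen >= num_samples:
            break
        # Forward pass only: no loss or backprop; observers update min/max.
        model_forward(sample[:max_seq_length])
        seen += 1
    return seen
```

After such a loop, the buffer-sync issue described under recompute_scale_zp applies: cached scale/zero_point values still reflect pre-calibration ranges until they are recomputed.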
- convert()¶
- recompute_scale_zp()¶
- validate_concat_observer()¶