runner¶
Classes¶
- Qwen2Quantizer
Functions¶
- recompute_scale_zp — Callback function: forcefully refresh the scale and zero_point of all FakeQuantize modules after calibration.
- validate_concat_observer_fn — Callback function: validate that all input_observers in a ConcatObserver have consistent scale and zero_point.
Module Contents¶
- runner.recompute_scale_zp(module)¶
Callback function: forcefully refresh the scale and zero_point of all FakeQuantize modules after calibration.
- Problem solved:
When using ConcatObserver, min/max values may be updated during the forward pass, but at the end of the pass the scale/zero_point stored in each FakeQuantize's internal buffers are still computed from the old min/max. This function forces a calculate_qparams call to sync the latest parameters into the buffers.
- Usage:
model.apply(recompute_scale_zp)
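The callback's effect can be sketched without PyTorch. The stub classes below are hypothetical stand-ins for `torch.ao.quantization.FakeQuantize` and its min/max observer (the real callback would type-check against those classes), and the unsigned 8-bit affine qparams formula is one common convention, not necessarily the one `runner` uses:

```python
class MinMaxObserverStub:
    """Stand-in observer tracking a running min/max over an unsigned 8-bit range."""
    def __init__(self, min_val=0.0, max_val=1.0):
        self.min_val, self.max_val = min_val, max_val

    def update(self, min_val, max_val):
        # Widen the observed range, as a forward pass during calibration would.
        self.min_val = min(self.min_val, min_val)
        self.max_val = max(self.max_val, max_val)

    def calculate_qparams(self):
        # Affine uint8 quantization: scale maps [min, max] onto [0, 255].
        scale = (self.max_val - self.min_val) / 255.0
        zero_point = round(-self.min_val / scale) if scale else 0
        return scale, zero_point


class FakeQuantizeStub:
    """Stand-in FakeQuantize: holds an observer plus cached scale/zero_point buffers."""
    def __init__(self, observer):
        self.activation_post_process = observer
        self.scale, self.zero_point = observer.calculate_qparams()


def recompute_scale_zp(module):
    """Callback for model.apply(): re-derive qparams from the observer's
    current min/max and write them back into the module's cached buffers."""
    if isinstance(module, FakeQuantizeStub):
        obs = module.activation_post_process
        module.scale, module.zero_point = obs.calculate_qparams()
```

After calibration widens an observer's range, the cached scale is stale until the callback runs; that is exactly the buffer-sync problem described above.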
- runner.validate_concat_observer_fn(module, results, name='')¶
Callback function: Validate that all input_observers in ConcatObserver have consistent scale and zero_point.
- Usage:
results = []
for name, m in model.named_modules():
    validate_concat_observer_fn(m, results, name)
- Parameters:
results (list)
name (str)
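The validation can be sketched in framework-free form. `ConcatObserverStub` and `ObserverStub` below are hypothetical stand-ins for the real ConcatObserver and its per-input observers; the only assumption carried over from the docs is that a ConcatObserver exposes an `input_observers` collection whose members can produce (scale, zero_point) pairs:

```python
class ObserverStub:
    """Stand-in per-input observer with fixed quantization parameters."""
    def __init__(self, scale, zero_point):
        self._qparams = (scale, zero_point)

    def calculate_qparams(self):
        return self._qparams


class ConcatObserverStub:
    """Stand-in ConcatObserver: one observer per input of a concat op."""
    def __init__(self, input_observers):
        self.input_observers = input_observers


def validate_concat_observer_fn(module, results, name=""):
    """Append a (name, qparams) report to `results` when the module is a
    concat observer whose inputs disagree on scale or zero_point."""
    if not isinstance(module, ConcatObserverStub):
        return
    qparams = [obs.calculate_qparams() for obs in module.input_observers]
    if len(set(qparams)) > 1:  # more than one distinct (scale, zp) pair
        results.append((name, qparams))
```

Collecting into a shared `results` list, as in the usage loop above, lets one pass over `named_modules()` report every inconsistent concat at once.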
- runner.freeze_qwen2_rmsnorm_weight(m)¶
- runner.freeze_qwen2_linear_weight(m)¶
- runner.freeze_qwen2_embed_tokens_weight(m)¶
- runner.disable_qdq_observer(m)¶
- runner.enable_qdq_observer(m)¶
- runner.enable_fake_quant(m)¶
- runner.disable_fake_quant(m)¶
- runner.convert_weight(m)¶
- class runner.Qwen2Quantizer(model_path, mllm_qualcomm_max_length=2048)¶
- Parameters:
model_path (str)
mllm_qualcomm_max_length (int)
- tokenizer¶
- model¶
- mllm_qualcomm_max_length = 2048¶
- freeze_activation()¶
- enable_activation_update()¶
- enable_fake_quant()¶
- disable_fake_quant()¶
- compile()¶
- infer(prompt)¶
- Parameters:
prompt (str)
- calibrate(num_samples=64, max_seq_length=512)¶
Perform calibration using a Wikipedia dataset (PTQ).
- Parameters:
num_samples (int) – Number of samples for calibration
max_seq_length (int) – Maximum length of each sample (not exceeding mllm_qualcomm_max_length)
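The PTQ calibration loop follows a standard shape: truncate each sample to the length cap and run it through the model so that observers record activation ranges, stopping after `num_samples`. The sketch below shows only that shape; `model_forward` and `samples` are hypothetical stand-ins for the actual model call and the tokenized Wikipedia samples:

```python
def calibrate(model_forward, samples, num_samples=64, max_seq_length=512):
    """Run up to num_samples forward passes on length-capped samples.

    model_forward: any callable taking one tokenized sample (stand-in).
    samples: iterable of token sequences (stand-in for the dataset).
    Returns the number of samples actually used.
    """
    seen = 0
    for sample in samples:
        if seen >= num_samples:
            break
        # Forward pass only: no loss or backprop; observers update min/max.
        model_forward(sample[:max_seq_length])
        seen += 1
    return seen
```

After such a loop, the buffer-sync issue described under recompute_scale_zp applies: cached scale/zero_point values still reflect pre-calibration ranges until they are recomputed.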
- convert()¶
- recompute_scale_zp()¶
- validate_concat_observer()¶