pymllm.bench_one_batch

SGLang-style one-batch benchmark for pymllm.

This module intentionally bypasses the HTTP server, tokenizer workers, scheduler, and detokenizer. It drives pymllm.executor.ModelRunner directly to measure one static prefill followed by token-by-token decode.

Attributes

logger

Classes

`BenchSetting`
`BenchArgs`
`DecodeState`
`PymllmBenchRunner`

Functions

`add_bench_args`(parser)
`make_parser`()
`parse_args`([argv])
`generate_settings`(args)
`make_synthetic_input_ids`(*, batch_size, input_len, ...)
`summarize_latencies`(*, setting, prefill_latency, ...)
`make_profile_trace_path`(*, output_dir, prefix, ...[, step])
`run_single_setting`(*, bench_runner, args, setting, ...)
`run_benchmark`(cfg, args)
`main`([argv])

Module Contents

pymllm.bench_one_batch.logger

class pymllm.bench_one_batch.BenchSetting

batch_size: int

input_len: int

output_len: int

class pymllm.bench_one_batch.BenchArgs

run_name: str = 'default'

batch_size: list[int] = [1]

input_len: list[int] = [256, 512, 1024]

output_len: list[int] = [128]

result_filename: pathlib.Path

log_decode_step: int = 0

seed: int = 42

profile: bool = False

profile_record_shapes: bool = False

profile_activities: list[str] = ['CPU', 'GPU']

profile_stage: str = 'all'

profile_filename_prefix: str = 'pymllm_profile'

profile_start_step: int | None = None

profile_steps: int = 1

skip_warmup: bool = False

class pymllm.bench_one_batch.DecodeState

req_pool_indices: torch.Tensor

seq_lens: torch.Tensor

mrope_position_deltas: torch.Tensor | None = None

pymllm.bench_one_batch.add_bench_args(parser)

Parameters: parser (argparse.ArgumentParser)
Return type: argparse.ArgumentParser
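
As a rough illustration of how the list-valued BenchArgs fields could be exposed on a command line, the sketch below registers a few of them on a parser. Only the field names and defaults come from BenchArgs above; the flag spellings (`--batch-size`, `--input-len`, and so on) and the helper name `add_bench_args_sketch` are assumptions made for this example, not the module's confirmed CLI.

```python
import argparse


def add_bench_args_sketch(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Hypothetical stand-in for add_bench_args; the flag names are assumed."""
    parser.add_argument("--run-name", type=str, default="default")
    # nargs="+" collects one or more integers into a list, matching list[int] fields.
    parser.add_argument("--batch-size", type=int, nargs="+", default=[1])
    parser.add_argument("--input-len", type=int, nargs="+", default=[256, 512, 1024])
    parser.add_argument("--output-len", type=int, nargs="+", default=[128])
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--skip-warmup", action="store_true")
    return parser


parser = add_bench_args_sketch(argparse.ArgumentParser())
ns = parser.parse_args(["--batch-size", "1", "8", "--input-len", "512"])
```

Returning the parser, as the documented return type suggests, lets a caller chain further registration steps on the same object.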

pymllm.bench_one_batch.make_parser()

Return type: argparse.ArgumentParser

pymllm.bench_one_batch.parse_args(argv=None)

Parameters: argv (Optional[Sequence[str]])
Return type: tuple[pymllm.configs.global_config.GlobalConfig, BenchArgs]

pymllm.bench_one_batch.generate_settings(args)

Parameters: args (BenchArgs)
Return type: list[BenchSetting]
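
Because BenchArgs holds lists for batch_size, input_len, and output_len while BenchSetting holds a single value for each, one plausible reading is that every combination becomes its own setting. The sketch below shows that expansion with `itertools.product`. The Cartesian-product behavior, the iteration order, and the names `BenchSettingSketch` and `generate_settings_sketch` are assumptions, not confirmed by this page.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchSettingSketch:
    """Local stand-in mirroring the three fields of BenchSetting."""
    batch_size: int
    input_len: int
    output_len: int


def generate_settings_sketch(batch_sizes, input_lens, output_lens):
    """Assumed behavior: one setting per (batch_size, input_len, output_len) combination."""
    return [
        BenchSettingSketch(bs, il, ol)
        for bs, il, ol in itertools.product(batch_sizes, input_lens, output_lens)
    ]


# With the BenchArgs defaults this yields one setting per input length.
settings = generate_settings_sketch([1], [256, 512, 1024], [128])
```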

pymllm.bench_one_batch.make_synthetic_input_ids(*, batch_size, input_len, vocab_size, seed, device)

Parameters:

batch_size (int)
input_len (int)
vocab_size (int)
seed (int)
device (str | torch.device)

Return type:

torch.Tensor

pymllm.bench_one_batch.summarize_latencies(*, setting, prefill_latency, decode_latencies, run_name, device, dtype, cuda_graph, extra=None)

Parameters:

setting (BenchSetting)
prefill_latency (float)
decode_latencies (Sequence[float])
run_name (str)
device (str)
dtype (str)
cuda_graph (bool)
extra (Optional[dict[str, Any]])

Return type:

dict[str, Any]
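
To make the relationship between the latency inputs and a summary dictionary concrete, the sketch below derives throughput figures from one prefill latency and a list of per-step decode latencies. The chosen metrics, the dictionary keys, and the function name `summarize_latencies_sketch` are assumptions for illustration; the actual keys returned by summarize_latencies are not documented here.

```python
import statistics
from typing import Any, Sequence


def summarize_latencies_sketch(
    *,
    batch_size: int,
    input_len: int,
    output_len: int,
    prefill_latency: float,
    decode_latencies: Sequence[float],
) -> dict[str, Any]:
    """Hypothetical summary; latencies are in seconds, throughputs in tokens per second."""
    result: dict[str, Any] = {
        "batch_size": batch_size,
        "input_len": input_len,
        "output_len": output_len,
        "prefill_latency": prefill_latency,
        # Prefill processes every prompt token of every sequence in one pass.
        "prefill_throughput": batch_size * input_len / prefill_latency,
    }
    if decode_latencies:
        median = statistics.median(decode_latencies)
        result["median_decode_latency"] = median
        # Each decode step emits one token per sequence in the batch.
        result["median_decode_throughput"] = batch_size / median
    total_latency = prefill_latency + sum(decode_latencies)
    result["total_latency"] = total_latency
    result["overall_throughput"] = batch_size * (input_len + output_len) / total_latency
    return result


summary = summarize_latencies_sketch(
    batch_size=2,
    input_len=100,
    output_len=4,
    prefill_latency=0.5,
    decode_latencies=[0.01, 0.02, 0.03],
)
```

Using the median rather than the mean for decode steps keeps a single slow step from dominating the reported figure.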

pymllm.bench_one_batch.make_profile_trace_path(*, output_dir, prefix, run_name, setting, stage, step=None)

Parameters:

output_dir (pathlib.Path)
prefix (str)
run_name (str)
setting (BenchSetting)
stage (str)
step (Optional[int])

Return type:

pathlib.Path
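
The parameters suggest a trace file name assembled from the prefix, run name, setting, stage, and an optional step. The sketch below shows one way to build such a path with `pathlib`. The naming scheme, the `.trace.json` suffix, and the function name `make_profile_trace_path_sketch` are assumptions; the layout the real function produces may differ.

```python
from pathlib import Path
from typing import Optional


def make_profile_trace_path_sketch(
    *,
    output_dir: Path,
    prefix: str,
    run_name: str,
    batch_size: int,
    input_len: int,
    output_len: int,
    stage: str,
    step: Optional[int] = None,
) -> Path:
    """Hypothetical naming scheme: join the identifying parts with underscores."""
    parts = [prefix, run_name, f"bs{batch_size}", f"in{input_len}", f"out{output_len}", stage]
    if step is not None:
        # The step component is only present for per-step traces.
        parts.append(f"step{step}")
    return output_dir / ("_".join(parts) + ".trace.json")


trace_path = make_profile_trace_path_sketch(
    output_dir=Path("traces"),
    prefix="pymllm_profile",
    run_name="default",
    batch_size=1,
    input_len=256,
    output_len=128,
    stage="decode",
    step=3,
)
```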

class pymllm.bench_one_batch.PymllmBenchRunner(runner)

Parameters: runner (pymllm.executor.model_runner.ModelRunner)

runner

device

classmethod create(cfg)

Parameters: cfg (pymllm.configs.global_config.GlobalConfig)
Return type: PymllmBenchRunner

clear()

Return type: None

extend(input_ids)

Parameters: input_ids (torch.Tensor)
Return type: tuple[torch.Tensor, DecodeState]

decode(input_ids, state)

Parameters:

input_ids (torch.Tensor)
state (DecodeState)

Return type:

tuple[torch.Tensor, DecodeState]

shutdown()

Return type: None
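
The method signatures above fit the module's stated flow of one static prefill followed by token-by-token decode. The sketch below times that flow against a stub object that only mimics the method names `clear`, `extend`, `decode`, and `shutdown`. `StubBenchRunner`, `time_one_batch`, the plain-list stand-ins for tensors and DecodeState, and the choice of `output_len - 1` decode steps are all assumptions for illustration, not the module's implementation.

```python
import time


class StubBenchRunner:
    """Stand-in with the same method names as PymllmBenchRunner; does no real work."""

    def clear(self):
        pass

    def extend(self, input_ids):
        # Returns (next_token_ids, state); the state here is just each sequence length.
        return [0 for _ in input_ids], [len(row) for row in input_ids]

    def decode(self, input_ids, state):
        # One new token per sequence, so every sequence length grows by one.
        return [0 for _ in input_ids], [n + 1 for n in state]

    def shutdown(self):
        pass


def time_one_batch(runner, input_ids, output_len):
    runner.clear()
    start = time.perf_counter()
    next_ids, state = runner.extend(input_ids)
    prefill_latency = time.perf_counter() - start

    decode_latencies = []
    # Assumption: prefill yields the first output token, leaving output_len - 1 decode steps.
    for _ in range(output_len - 1):
        start = time.perf_counter()
        next_ids, state = runner.decode(next_ids, state)
        decode_latencies.append(time.perf_counter() - start)
    return prefill_latency, decode_latencies, state


runner = StubBenchRunner()
prefill_latency, decode_latencies, final_state = time_one_batch(
    runner, [[1, 2, 3], [4, 5, 6]], output_len=4
)
runner.shutdown()
```

On a real GPU, kernel launches are asynchronous, so a timing loop like this would also need a device synchronization before each clock read to measure completed work.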

pymllm.bench_one_batch.run_single_setting(*, bench_runner, args, setting, seed, record_result)

Parameters:

bench_runner (PymllmBenchRunner)
args (BenchArgs)
setting (BenchSetting)
seed (int)
record_result (bool)

Return type:

Optional[dict[str, Any]]

pymllm.bench_one_batch.run_benchmark(cfg, args)

Parameters:

cfg (pymllm.configs.global_config.GlobalConfig)
args (BenchArgs)

Return type:

list[dict[str, Any]]

pymllm.bench_one_batch.main(argv=None)

Parameters: argv (Optional[Sequence[str]])
Return type: None