pymllm.bench_one_batch

SGLang-style one-batch benchmark for pymllm.

This module intentionally bypasses the HTTP server, tokenizer workers, scheduler, and detokenizer. It drives pymllm.executor.ModelRunner directly to measure one static prefill followed by token-by-token decode.

Attributes

logger

Classes

BenchSetting

BenchArgs

DecodeState

PymllmBenchRunner

Functions

add_bench_args(parser)

make_parser()

parse_args([argv])

generate_settings(args)

make_synthetic_input_ids(*, batch_size, input_len, ...)

summarize_latencies(*, setting, prefill_latency, ...)

make_profile_trace_path(*, output_dir, prefix, ...[, step])

run_single_setting(*, bench_runner, args, setting, ...)

run_benchmark(cfg, args)

main([argv])

Module Contents

pymllm.bench_one_batch.logger
class pymllm.bench_one_batch.BenchSetting
batch_size: int
input_len: int
output_len: int
class pymllm.bench_one_batch.BenchArgs
run_name: str = 'default'
batch_size: list[int] = [1]
input_len: list[int] = [256, 512, 1024]
output_len: list[int] = [128]
result_filename: pathlib.Path
log_decode_step: int = 0
seed: int = 42
profile: bool = False
profile_record_shapes: bool = False
profile_activities: list[str] = ['CPU', 'GPU']
profile_stage: str = 'all'
profile_filename_prefix: str = 'pymllm_profile'
profile_start_step: int | None = None
profile_steps: int = 1
skip_warmup: bool = False
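
BenchArgs appears to be a plain dataclass, so it can also be built programmatically instead of via the CLI; result_filename is the only field shown without a default. A minimal sketch (keyword-argument construction is an assumption based on the field list above):

    import pathlib
    from pymllm.bench_one_batch import BenchArgs

    # Assumed dataclass-style construction; fields map one-to-one
    # onto the list above.
    args = BenchArgs(
        run_name="smoke",
        batch_size=[1, 8],
        input_len=[512],
        output_len=[64],
        result_filename=pathlib.Path("results.jsonl"),
    )
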
class pymllm.bench_one_batch.DecodeState
req_pool_indices: torch.Tensor
seq_lens: torch.Tensor
mrope_position_deltas: torch.Tensor | None = None
pymllm.bench_one_batch.add_bench_args(parser)
Parameters:

parser (argparse.ArgumentParser)

Return type:

argparse.ArgumentParser
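
add_bench_args registers the benchmark flags on an existing parser and returns it, so the benchmark can share a parser with other tooling. A short usage sketch (the concrete flag names are not listed here and presumably mirror the BenchArgs field names):

    import argparse
    from pymllm.bench_one_batch import add_bench_args

    parser = argparse.ArgumentParser(description="wrapper around the one-batch bench")
    parser = add_bench_args(parser)  # same parser object, now with the bench flags
    ns = parser.parse_args([])       # defaults only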

pymllm.bench_one_batch.make_parser()
Return type:

argparse.ArgumentParser

pymllm.bench_one_batch.parse_args(argv=None)
Parameters:

argv (Optional[Sequence[str]])

Return type:

tuple[pymllm.configs.global_config.GlobalConfig, BenchArgs]
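
parse_args splits the command line into the model/runtime configuration and the benchmark parameters. For example (the --model-path flag is illustrative; see make_parser() for the real flag set):

    from pymllm.bench_one_batch import parse_args

    cfg, bench_args = parse_args(["--model-path", "/path/to/model"])
    # cfg drives ModelRunner construction; bench_args drives the benchmark loop.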

pymllm.bench_one_batch.generate_settings(args)
Parameters:

args (BenchArgs)

Return type:

list[BenchSetting]
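
Given the list-valued defaults above (batch_size=[1], input_len=[256, 512, 1024], output_len=[128]), this presumably expands the three axes into one BenchSetting per combination. A sketch of the equivalent Cartesian-product logic, not the module's actual implementation:

    import itertools
    from pymllm.bench_one_batch import BenchArgs, BenchSetting

    def generate_settings_sketch(args: BenchArgs) -> list[BenchSetting]:
        # Assumed behavior: one setting per (batch_size, input_len, output_len) triple.
        return [
            BenchSetting(batch_size=bs, input_len=il, output_len=ol)
            for bs, il, ol in itertools.product(
                args.batch_size, args.input_len, args.output_len
            )
        ]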

pymllm.bench_one_batch.make_synthetic_input_ids(*, batch_size, input_len, vocab_size, seed, device)
Parameters:
  • batch_size (int)

  • input_len (int)

  • vocab_size (int)

  • seed (int)

  • device (str | torch.device)

Return type:

torch.Tensor
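
A plausible equivalent, assuming the function draws uniformly random token ids from a generator seeded with seed so that repeated runs see identical inputs:

    import torch

    def make_synthetic_input_ids_sketch(
        *, batch_size: int, input_len: int, vocab_size: int,
        seed: int, device: str | torch.device,
    ) -> torch.Tensor:
        # Seeded CPU generator keeps the draw deterministic across runs.
        gen = torch.Generator(device="cpu").manual_seed(seed)
        ids = torch.randint(0, vocab_size, (batch_size, input_len), generator=gen)
        return ids.to(device)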

pymllm.bench_one_batch.summarize_latencies(*, setting, prefill_latency, decode_latencies, run_name, device, dtype, cuda_graph, extra=None)
Parameters:
  • setting (BenchSetting)

  • prefill_latency (float)

  • decode_latencies (Sequence[float])

  • run_name (str)

  • device (str)

  • dtype (str)

  • cuda_graph (bool)

  • extra (Optional[dict[str, Any]])

Return type:

dict[str, Any]
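
The keys of the returned dict are not documented here, but the usual derived metrics follow directly from the inputs; an illustrative sketch (every key name below is an assumption, not the module's actual schema):

    import statistics

    def summarize_sketch(setting, prefill_latency, decode_latencies):
        decode_latencies = list(decode_latencies)
        return {
            "batch_size": setting.batch_size,
            "input_len": setting.input_len,
            "output_len": setting.output_len,
            "prefill_latency_s": prefill_latency,
            # tokens processed during prefill divided by prefill time
            "prefill_throughput_tok_s": setting.batch_size * setting.input_len / prefill_latency,
            "median_decode_latency_s": (
                statistics.median(decode_latencies) if decode_latencies else None
            ),
            "total_latency_s": prefill_latency + sum(decode_latencies),
        }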

pymllm.bench_one_batch.make_profile_trace_path(*, output_dir, prefix, run_name, setting, stage, step=None)
Parameters:
  • output_dir (pathlib.Path)

  • prefix (str)

  • run_name (str)

  • setting (BenchSetting)

  • stage (str)

  • step (Optional[int])

Return type:

pathlib.Path
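
The returned path presumably encodes enough of (prefix, run_name, setting, stage, step) to keep traces from different settings and steps distinct. An illustrative naming scheme, not necessarily the module's actual one:

    import pathlib

    def trace_path_sketch(*, output_dir: pathlib.Path, prefix: str, run_name: str,
                          setting, stage: str, step: int | None = None) -> pathlib.Path:
        # Hypothetical layout, e.g. pymllm_profile_default_bs1_in512_decode_step3.json
        name = f"{prefix}_{run_name}_bs{setting.batch_size}_in{setting.input_len}_{stage}"
        if step is not None:
            name += f"_step{step}"
        return output_dir / f"{name}.json"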

class pymllm.bench_one_batch.PymllmBenchRunner(runner)
Parameters:

runner (pymllm.executor.model_runner.ModelRunner)

runner
device
classmethod create(cfg)
Parameters:

cfg (pymllm.configs.global_config.GlobalConfig)

Return type:

PymllmBenchRunner

clear()
Return type:

None

extend(input_ids)
Parameters:

input_ids (torch.Tensor)

Return type:

tuple[torch.Tensor, DecodeState]

decode(input_ids, state)
Parameters:
  • input_ids (torch.Tensor)

  • state (DecodeState)

Return type:

tuple[torch.Tensor, DecodeState]

shutdown()
Return type:

None
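
Putting the runner methods together, a minimal programmatic prefill-then-decode loop. This sketch assumes extend() consumes the whole prompt and returns the first sampled token ids, and that decode() is fed those ids plus the threaded DecodeState; vocab_size is a placeholder that should come from the model config:

    from pymllm.bench_one_batch import PymllmBenchRunner, make_synthetic_input_ids

    def run_once_sketch(cfg, batch_size=1, input_len=256, output_len=32):
        runner = PymllmBenchRunner.create(cfg)
        try:
            input_ids = make_synthetic_input_ids(
                batch_size=batch_size, input_len=input_len,
                vocab_size=32_000,  # assumed placeholder
                seed=42, device=runner.device,
            )
            runner.clear()                              # reset request/KV state
            next_ids, state = runner.extend(input_ids)  # static prefill, first token
            for _ in range(output_len - 1):             # token-by-token decode
                next_ids, state = runner.decode(next_ids, state)
        finally:
            runner.shutdown()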

pymllm.bench_one_batch.run_single_setting(*, bench_runner, args, setting, seed, record_result)
Parameters:
  • bench_runner (PymllmBenchRunner)

  • args (BenchArgs)

  • setting (BenchSetting)

  • seed (int)

  • record_result (bool)

Return type:

Optional[dict[str, Any]]

pymllm.bench_one_batch.run_benchmark(cfg, args)
Parameters:
  • cfg (pymllm.configs.global_config.GlobalConfig)

  • args (BenchArgs)

Return type:

list[dict[str, Any]]
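
Combined with parse_args, this is presumably what main() does under the hood; calling it directly returns the result records for further processing:

    from pymllm.bench_one_batch import parse_args, run_benchmark

    cfg, bench_args = parse_args()            # reads sys.argv
    results = run_benchmark(cfg, bench_args)  # one dict per recorded setting
    for row in results:
        print(row)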

pymllm.bench_one_batch.main(argv=None)
Parameters:

argv (Optional[Sequence[str]])

Return type:

None
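
main() is the CLI entry point; passing an argv list is equivalent to running the module from the shell. The flag names below are illustrative guesses derived from the BenchArgs field names, not confirmed:

    from pymllm.bench_one_batch import main

    # Roughly: python -m pymllm.bench_one_batch --batch-size 1 8 --input-len 512 ...
    main([
        "--batch-size", "1", "8",
        "--input-len", "512",
        "--output-len", "128",
        "--result-filename", "bench.jsonl",
    ])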