pymllm.bench_one_batch

SGLang-style one-batch benchmark for pymllm.

This module intentionally bypasses the HTTP server, tokenizer workers, scheduler, and detokenizer. It drives pymllm.executor.ModelRunner directly to measure one static prefill followed by token-by-token decode.
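
The measurement pattern described above can be sketched without any pymllm dependency. This is a minimal illustration, not the module's actual implementation: time_prefill_and_decode, the prefill and decode_step callables, and the decode_steps parameter are hypothetical stand-ins for calls into ModelRunner. On a GPU, a real harness would typically also synchronize the device (for example with torch.cuda.synchronize()) before each clock read; that step is omitted here.

```python
import time
from typing import Callable, List, Tuple


def time_prefill_and_decode(
    prefill: Callable[[], int],
    decode_step: Callable[[int], int],
    decode_steps: int,
) -> Tuple[float, List[float], int]:
    """Time one prefill call, then time each decode step separately.

    All names here are hypothetical; the callables stand in for model calls.
    """
    start = time.perf_counter()
    token = prefill()  # one static prefill over the whole prompt
    prefill_latency = time.perf_counter() - start

    decode_latencies: List[float] = []
    for _ in range(decode_steps):
        start = time.perf_counter()
        token = decode_step(token)  # one token per step
        decode_latencies.append(time.perf_counter() - start)
    return prefill_latency, decode_latencies, token


# Toy stand-ins: "prefill" yields token 1, each decode step adds 1.
prefill_latency, decode_latencies, last_token = time_prefill_and_decode(
    prefill=lambda: 1,
    decode_step=lambda tok: tok + 1,
    decode_steps=3,
)
```

Keeping one latency per decode step, rather than a single total, makes it possible to report per-step statistics afterwards.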

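Attributes

logger

DEFAULT_CORRECTNESS_PROMPTS

Classes

BenchSetting

BenchArgs

DecodeState

ExtendResult

MultimodalBenchInput

MultimodalProcessorBundle

PymllmBenchRunner(runner)
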
Functions

add_bench_args(parser)

make_parser()

parse_args([argv])

generate_settings(args)

make_synthetic_input_ids(*, batch_size, input_len, ...)

summarize_latencies(*, setting, prefill_latency, ...)

make_vit_prefill_metrics(*, vit_prefill_ms, ...)

make_multimodal_prefill_metrics(*, prefill_latency, ...)

make_multimodal_bench_input_from_processor_output(...)

make_profile_trace_path(*, output_dir, prefix, ...[, step])

run_single_setting(*, bench_runner, args, setting, ...)

run_benchmark(cfg, args)

correctness_test(bench_runner, cfg, args)

Single-stage smoke correctness check.

run_correctness(cfg, args)

main([argv])

Module Contents

pymllm.bench_one_batch.logger
class pymllm.bench_one_batch.BenchSetting
batch_size: int
input_len: int
output_len: int
class pymllm.bench_one_batch.BenchArgs
run_name: str = 'default'
batch_size: list[int] = [1]
input_len: list[int] = [256, 512, 1024]
output_len: list[int] = [128]
result_filename: pathlib.Path
log_decode_step: int = 0
seed: int = 42
profile: bool = False
profile_record_shapes: bool = False
profile_activities: list[str] = ['CPU', 'GPU']
profile_stage: str = 'all'
profile_filename_prefix: str = 'pymllm_profile'
profile_start_step: int | None = None
profile_steps: int = 1
skip_warmup: bool = False
image_path: pathlib.Path | None = None
prompt: str = 'Describe this image.'
input_len_was_provided: bool = False
correctness_test: bool = False
class pymllm.bench_one_batch.DecodeState
req_pool_indices: torch.Tensor
seq_lens: torch.Tensor
mrope_position_deltas: torch.Tensor | None = None
class pymllm.bench_one_batch.ExtendResult
next_token_ids: torch.Tensor
state: DecodeState
vit_prefill_ms: float | None = None
vit_prefill_tokens: int | None = None
vit_prefill_tps: float | None = None
__iter__()
Return type:

Iterator[Any]
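
The presence of __iter__ suggests that an ExtendResult can be unpacked like a tuple. The sketch below shows that pattern under an assumption this reference does not confirm: that iteration yields next_token_ids followed by state. The class name ExtendResultSketch and the plain Python values standing in for tensors are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class ExtendResultSketch:
    """Hypothetical stand-in for ExtendResult, using plain Python values."""

    next_token_ids: Any
    state: Any
    vit_prefill_ms: Optional[float] = None

    def __iter__(self) -> Iterator[Any]:
        # Assumption: only the two core fields are yielded, so callers
        # can write `next_token_ids, state = result`.
        yield self.next_token_ids
        yield self.state


result = ExtendResultSketch(
    next_token_ids=[7, 9],
    state={"seq_lens": [4, 4]},
    vit_prefill_ms=1.5,
)
next_token_ids, state = result  # tuple-style unpacking via __iter__
```

With this design the optional vision timing fields stay reachable as attributes without changing the shape of the unpacked pair.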

class pymllm.bench_one_batch.MultimodalBenchInput
input_ids: torch.Tensor
pixel_values: torch.Tensor
image_grid_thw: torch.Tensor
vit_prefill_tokens: int
class pymllm.bench_one_batch.MultimodalProcessorBundle
processor_output: Any
pad_token_id: int
pymllm.bench_one_batch.add_bench_args(parser)
Parameters:

parser (argparse.ArgumentParser)

Return type:

argparse.ArgumentParser

pymllm.bench_one_batch.make_parser()
Return type:

argparse.ArgumentParser

pymllm.bench_one_batch.parse_args(argv=None)
Parameters:

argv (Optional[Sequence[str]])

Return type:

tuple[pymllm.configs.global_config.GlobalConfig, BenchArgs]
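
BenchArgs carries both list-valued fields and an input_len_was_provided flag. One common way to get both from argparse is sketched below. The flag spellings (--batch-size, --input-len, --output-len), the helper names, and the None-sentinel approach are all assumptions made for illustration; the real parser may differ.

```python
import argparse
from typing import List, Optional, Sequence, Tuple


def make_parser_sketch() -> argparse.ArgumentParser:
    """Hypothetical parser; flag names are assumed, not taken from pymllm."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, nargs="+", default=[1])
    # default=None acts as a sentinel so we can tell whether the flag was given.
    parser.add_argument("--input-len", type=int, nargs="+", default=None)
    parser.add_argument("--output-len", type=int, nargs="+", default=[128])
    return parser


def parse_sketch(
    argv: Optional[Sequence[str]] = None,
) -> Tuple[argparse.Namespace, bool]:
    ns = make_parser_sketch().parse_args(argv)
    input_len_was_provided = ns.input_len is not None
    if not input_len_was_provided:
        # Fall back to the default documented for BenchArgs.input_len.
        ns.input_len = [256, 512, 1024]
    return ns, input_len_was_provided


default_ns, default_flag = parse_sketch([])
custom_ns, custom_flag = parse_sketch(["--input-len", "64", "128"])
```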

pymllm.bench_one_batch.generate_settings(args)
Parameters:

args (BenchArgs)

Return type:

list[BenchSetting]
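
Since BenchArgs holds lists of batch sizes, input lengths, and output lengths while BenchSetting holds one of each, a plausible reading is that generate_settings expands the lists into every combination. The sketch below assumes a Cartesian product; the actual ordering and any filtering are not documented here, and SettingSketch and generate_settings_sketch are hypothetical names.

```python
import itertools
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SettingSketch:
    """Hypothetical stand-in for BenchSetting."""

    batch_size: int
    input_len: int
    output_len: int


def generate_settings_sketch(
    batch_sizes: Sequence[int],
    input_lens: Sequence[int],
    output_lens: Sequence[int],
) -> List[SettingSketch]:
    # Assumption: every (batch_size, input_len, output_len) combination is run.
    return [
        SettingSketch(b, i, o)
        for b, i, o in itertools.product(batch_sizes, input_lens, output_lens)
    ]


# Using the defaults documented for BenchArgs.
settings = generate_settings_sketch([1], [256, 512, 1024], [128])
```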

pymllm.bench_one_batch.make_synthetic_input_ids(*, batch_size, input_len, vocab_size, seed, device)
Parameters:
  • batch_size (int)

  • input_len (int)

  • vocab_size (int)

  • seed (int)

  • device (str | torch.device)

Return type:

torch.Tensor

pymllm.bench_one_batch.summarize_latencies(*, setting, prefill_latency, decode_latencies, run_name, device, dtype, cuda_graph, extra=None)
Parameters:
  • setting (BenchSetting)

  • prefill_latency (float)

  • decode_latencies (Sequence[float])

  • run_name (str)

  • device (str)

  • dtype (str)

  • cuda_graph (bool)

  • extra (Optional[dict[str, Any]])

Return type:

dict[str, Any]
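
The kind of summary such a function might produce can be sketched as follows. The formulas are the usual throughput definitions (tokens processed divided by elapsed seconds), and the dictionary keys are hypothetical; the keys pymllm actually emits are not listed in this reference.

```python
import statistics
from typing import Any, Dict, Sequence


def summarize_sketch(
    *,
    batch_size: int,
    input_len: int,
    prefill_latency: float,
    decode_latencies: Sequence[float],
) -> Dict[str, Any]:
    """Hypothetical summary; latencies are assumed to be in seconds."""
    result: Dict[str, Any] = {
        "prefill_latency": prefill_latency,
        # Prompt tokens processed per second during prefill.
        "prefill_throughput": batch_size * input_len / prefill_latency,
    }
    if decode_latencies:
        total_decode = sum(decode_latencies)
        result["median_decode_latency"] = statistics.median(decode_latencies)
        # Each decode step emits one token per sequence in the batch.
        result["decode_throughput"] = (
            batch_size * len(decode_latencies) / total_decode
        )
    return result


summary = summarize_sketch(
    batch_size=2,
    input_len=256,
    prefill_latency=0.5,
    decode_latencies=[0.01, 0.02, 0.01],
)
```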

pymllm.bench_one_batch.make_vit_prefill_metrics(*, vit_prefill_ms, vit_prefill_tokens)
Parameters:
  • vit_prefill_ms (float)

  • vit_prefill_tokens (int)

Return type:

dict[str, Any]
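
ExtendResult exposes vit_prefill_ms, vit_prefill_tokens, and vit_prefill_tps, which suggests the throughput is derived from the other two values. A minimal sketch of that conversion, assuming tokens per second is simply tokens divided by elapsed seconds; the function name and the zero-guard behaviour are hypothetical.

```python
def vit_prefill_tps_sketch(vit_prefill_ms: float, vit_prefill_tokens: int) -> float:
    """Convert a millisecond timing and a token count into tokens per second."""
    if vit_prefill_ms <= 0:
        # Assumption: report 0.0 rather than dividing by zero.
        return 0.0
    return vit_prefill_tokens * 1000.0 / vit_prefill_ms


tps = vit_prefill_tps_sketch(vit_prefill_ms=20.0, vit_prefill_tokens=1024)
```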

pymllm.bench_one_batch.make_multimodal_prefill_metrics(*, prefill_latency, batch_size, input_len)
Parameters:
  • prefill_latency (float)

  • batch_size (int)

  • input_len (int)

Return type:

dict[str, Any]

pymllm.bench_one_batch.make_multimodal_bench_input_from_processor_output(processor_output, *, batch_size, image_token_id, device, target_input_len=None, pad_token_id=0)
Parameters:
  • processor_output (Any)

  • batch_size (int)

  • image_token_id (int)

  • device (str | torch.device)

  • target_input_len (Optional[int])

  • pad_token_id (int)

Return type:

MultimodalBenchInput

pymllm.bench_one_batch.make_profile_trace_path(*, output_dir, prefix, run_name, setting, stage, step=None)
Parameters:
  • output_dir (pathlib.Path)

  • prefix (str)

  • run_name (str)

  • setting (BenchSetting)

  • stage (str)

  • step (Optional[int])

Return type:

pathlib.Path
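
How the arguments might combine into a file path can be sketched as below. The file-name layout, the .trace.json suffix, and the name trace_path_sketch are invented for illustration; the real function's naming scheme is not documented in this reference.

```python
from pathlib import Path
from typing import Optional


def trace_path_sketch(
    *,
    output_dir: Path,
    prefix: str,
    run_name: str,
    batch_size: int,
    input_len: int,
    output_len: int,
    stage: str,
    step: Optional[int] = None,
) -> Path:
    """Build a trace path from its parts (hypothetical naming scheme)."""
    parts = [
        prefix,
        run_name,
        f"bs{batch_size}",
        f"in{input_len}",
        f"out{output_len}",
        stage,
    ]
    if step is not None:
        # Only per-step traces carry a step suffix.
        parts.append(f"step{step}")
    return output_dir / ("_".join(parts) + ".trace.json")


path = trace_path_sketch(
    output_dir=Path("traces"),
    prefix="pymllm_profile",
    run_name="default",
    batch_size=1,
    input_len=256,
    output_len=128,
    stage="decode",
    step=3,
)
```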

class pymllm.bench_one_batch.PymllmBenchRunner(runner)
Parameters:

runner (pymllm.executor.model_runner.ModelRunner)

runner
device
classmethod create(cfg)
Parameters:

cfg (pymllm.configs.global_config.GlobalConfig)

Return type:

PymllmBenchRunner

clear()
Return type:

None

extend(input_ids, *, pixel_values=None, image_grid_thw=None, benchmark_vision_timing=False)
Parameters:
  • input_ids (torch.Tensor)

  • pixel_values (Optional[torch.Tensor])

  • image_grid_thw (Optional[torch.Tensor])

  • benchmark_vision_timing (bool)

Return type:

ExtendResult

decode(input_ids, state)
Parameters:
  • input_ids

  • state

Return type:

tuple[torch.Tensor, DecodeState]

shutdown()
Return type:

None

pymllm.bench_one_batch.run_single_setting(*, bench_runner, args, setting, seed, record_result, multimodal_processor_bundle=None, allow_profile=True)
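Parameters:
  • bench_runner

  • args

  • setting

  • seed

  • record_result

  • multimodal_processor_bundle

  • allow_profile
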
Return type:

Optional[dict[str, Any]]

pymllm.bench_one_batch.run_benchmark(cfg, args)
Parameters:
  • cfg

  • args

Return type:

list[dict[str, Any]]

pymllm.bench_one_batch.DEFAULT_CORRECTNESS_PROMPTS = ('The capital of France is', 'The capital of the United Kingdom is', 'Today is a sunny day and I like')
pymllm.bench_one_batch.correctness_test(bench_runner, cfg, args)

Single-stage smoke correctness check.

Encode a real prompt, run one full prefill at batch_size=1, greedy-decode output_len tokens, and print the decoded text. Unlike SGLang’s --correct mode, which exercises a two-stage cut_len prefill to test prefix-KV reuse, this check runs each prompt as a single full prefill. Because decoding is greedy, the per-prompt output should be identical to that of SGLang’s batched path. The two-stage cut_len variant can be layered on later: prepare_forward_batch_extend already accepts extend_prefix_lens > 0, and req_to_token_pool.write can pre-populate prefix KV indices.
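
Greedy decoding, as used by this check, picks the highest-scoring token at every step, so the output is deterministic for a given model and prompt. The sketch below shows the idea with a toy scoring function in place of a real model; greedy_decode_sketch, toy_logits, and the five-token vocabulary are hypothetical and unrelated to pymllm's internals.

```python
from typing import Callable, List, Sequence


def greedy_decode_sketch(
    next_logits: Callable[[List[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    output_len: int,
) -> List[int]:
    """Append the argmax token output_len times; return only the new tokens."""
    tokens = list(prompt_ids)
    for _ in range(output_len):
        logits = next_logits(tokens)
        # Greedy choice: index of the largest logit.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
    return tokens[len(prompt_ids):]


def toy_logits(tokens: List[int]) -> List[float]:
    # Toy "model" over a 5-token vocabulary that always prefers (last + 1) % 5.
    target = (tokens[-1] + 1) % 5
    return [1.0 if i == target else 0.0 for i in range(5)]


generated = greedy_decode_sketch(toy_logits, prompt_ids=[3], output_len=4)
```

Because no sampling is involved, running the same prompt twice yields the same tokens, which is what makes a printed completion usable as a smoke check.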

Parameters:
  • bench_runner

  • cfg

  • args

Return type:

None

pymllm.bench_one_batch.run_correctness(cfg, args)
Parameters:
  • cfg

  • args

Return type:

None

pymllm.bench_one_batch.main(argv=None)
Parameters:

argv (Optional[Sequence[str]])

Return type:

None