pymllm.bench_one_batch
SGLang-style one-batch benchmark for pymllm.
This module intentionally bypasses the HTTP server, tokenizer workers,
scheduler, and detokenizer. It drives pymllm.executor.ModelRunner
directly to measure one static prefill followed by token-by-token decode.
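The measurement pattern described above (one timed prefill, then one timed call per decoded token) can be sketched in plain Python. This is only an illustration: `time_prefill_then_decode` and the two toy callables are hypothetical stand-ins, not part of pymllm, and the real module drives ModelRunner rather than Python lambdas.

```python
import time
from typing import Callable, List, Tuple


def time_prefill_then_decode(
    prefill: Callable[[], int],
    decode: Callable[[int], int],
    output_len: int,
) -> Tuple[float, List[float], List[int]]:
    """Time one prefill call, then output_len - 1 single-token decode calls."""
    start = time.perf_counter()
    token = prefill()  # the prefill yields the first output token
    prefill_latency = time.perf_counter() - start

    tokens = [token]
    decode_latencies: List[float] = []
    for _ in range(output_len - 1):
        # On a GPU, kernels launch asynchronously, so a real measurement
        # would need a device synchronization (for example
        # torch.cuda.synchronize()) before each clock read.
        start = time.perf_counter()
        token = decode(token)
        decode_latencies.append(time.perf_counter() - start)
        tokens.append(token)
    return prefill_latency, decode_latencies, tokens


# Toy stand-ins: "prefill" returns token 0, "decode" increments the token.
prefill_s, decode_s, tokens = time_prefill_then_decode(
    lambda: 0, lambda t: t + 1, output_len=4
)
```

Keeping the prefill latency separate from the per-step decode latencies lets the two phases be summarized independently.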
Module Contents
- pymllm.bench_one_batch.logger
- class pymllm.bench_one_batch.BenchArgs
- run_name: str = 'default'
- batch_size: list[int] = [1]
- input_len: list[int] = [256, 512, 1024]
- output_len: list[int] = [128]
- result_filename: pathlib.Path
- log_decode_step: int = 0
- seed: int = 42
- profile: bool = False
- profile_record_shapes: bool = False
- profile_activities: list[str] = ['CPU', 'GPU']
- profile_stage: str = 'all'
- profile_filename_prefix: str = 'pymllm_profile'
- profile_start_step: int | None = None
- profile_steps: int = 1
- skip_warmup: bool = False
- image_path: pathlib.Path | None = None
- prompt: str = 'Describe this image.'
- input_len_was_provided: bool = False
- correctness_test: bool = False
- class pymllm.bench_one_batch.DecodeState
- req_pool_indices: torch.Tensor
- seq_lens: torch.Tensor
- mrope_position_deltas: torch.Tensor | None = None
- class pymllm.bench_one_batch.ExtendResult
- next_token_ids: torch.Tensor
- state: DecodeState
- vit_prefill_ms: float | None = None
- vit_prefill_tokens: int | None = None
- vit_prefill_tps: float | None = None
- __iter__()
- Return type:
Iterator[Any]
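One plausible reason for a result class to define `__iter__` is to allow tuple-style unpacking at call sites. The sketch below illustrates that pattern with a hypothetical `ExtendResultSketch` that uses plain Python types instead of tensors. Which fields the real `ExtendResult.__iter__` yields is not stated on this page, so the two-field choice here is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class ExtendResultSketch:
    """Hypothetical stand-in for ExtendResult, using plain Python types."""

    next_token_ids: list
    state: dict
    vit_prefill_ms: Optional[float] = None

    def __iter__(self) -> Iterator[Any]:
        # Assumption: only the two core fields are yielded, so callers can
        # unpack the result like a (next_token_ids, state) tuple.
        yield self.next_token_ids
        yield self.state


result = ExtendResultSketch(
    next_token_ids=[7], state={"seq_lens": [5]}, vit_prefill_ms=1.5
)
next_token_ids, state = result  # unpacking goes through __iter__
```

With this design the optional timing fields stay reachable as attributes without changing the shape of the unpacked value.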
- class pymllm.bench_one_batch.MultimodalBenchInput
- input_ids: torch.Tensor
- pixel_values: torch.Tensor
- image_grid_thw: torch.Tensor
- vit_prefill_tokens: int
- pymllm.bench_one_batch.add_bench_args(parser)
- Parameters:
parser (argparse.ArgumentParser)
- Return type:
argparse.ArgumentParser
- pymllm.bench_one_batch.make_parser()
- Return type:
argparse.ArgumentParser
- pymllm.bench_one_batch.parse_args(argv=None)
- Parameters:
argv (Optional[Sequence[str]])
- Return type:
- pymllm.bench_one_batch.generate_settings(args)
- Parameters:
args (BenchArgs)
- Return type:
list[BenchSetting]
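Because `BenchArgs` stores `batch_size`, `input_len`, and `output_len` as lists, a natural reading is that each benchmark setting is one combination of the three. The sketch below shows that expansion with `itertools.product`. The `BenchSettingSketch` fields and the iteration order are assumptions, since `BenchSetting` is not documented on this page.

```python
import itertools
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class BenchSettingSketch:
    """Hypothetical stand-in for BenchSetting."""

    batch_size: int
    input_len: int
    output_len: int


def generate_settings_sketch(
    batch_sizes: Sequence[int],
    input_lens: Sequence[int],
    output_lens: Sequence[int],
) -> List[BenchSettingSketch]:
    # Cartesian product: every batch size with every input and output length.
    return [
        BenchSettingSketch(bs, il, ol)
        for bs, il, ol in itertools.product(batch_sizes, input_lens, output_lens)
    ]


# The BenchArgs defaults listed above expand to three settings.
settings = generate_settings_sketch([1], [256, 512, 1024], [128])
```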
- pymllm.bench_one_batch.make_synthetic_input_ids(*, batch_size, input_len, vocab_size, seed, device)
- Parameters:
batch_size (int)
input_len (int)
vocab_size (int)
seed (int)
device (str | torch.device)
- Return type:
torch.Tensor
- pymllm.bench_one_batch.summarize_latencies(*, setting, prefill_latency, decode_latencies, run_name, device, dtype, cuda_graph, extra=None)
- Parameters:
setting (BenchSetting)
prefill_latency (float)
decode_latencies (Sequence[float])
run_name (str)
device (str)
dtype (str)
cuda_graph (bool)
extra (Optional[dict[str, Any]])
- Return type:
dict[str, Any]
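A summary of this kind usually turns raw latencies into throughput figures. The sketch below shows one common set of formulas for a one-batch benchmark. The function name, its flattened parameters, and the dictionary keys are hypothetical, and the metrics that `summarize_latencies` actually reports may differ.

```python
import statistics
from typing import Any, Dict, Sequence


def summarize_latencies_sketch(
    *,
    batch_size: int,
    input_len: int,
    output_len: int,
    prefill_latency: float,
    decode_latencies: Sequence[float],
) -> Dict[str, Any]:
    """Derive throughput figures (tokens per second) from latencies in seconds."""
    total_latency = prefill_latency + sum(decode_latencies)
    summary: Dict[str, Any] = {
        "prefill_latency": prefill_latency,
        # Prefill processes batch_size * input_len tokens in one pass.
        "prefill_throughput": batch_size * input_len / prefill_latency,
        "total_latency": total_latency,
        "overall_throughput": batch_size * (input_len + output_len) / total_latency,
    }
    if decode_latencies:
        median = statistics.median(decode_latencies)
        summary["median_decode_latency"] = median
        # Each decode step emits one token per sequence in the batch.
        summary["median_decode_throughput"] = batch_size / median
    return summary


summary = summarize_latencies_sketch(
    batch_size=2,
    input_len=100,
    output_len=3,
    prefill_latency=0.5,
    decode_latencies=[0.25, 0.25],
)
```

Using the median rather than the mean for decode steps keeps a single slow step from dominating the reported per-token latency.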
- pymllm.bench_one_batch.make_vit_prefill_metrics(*, vit_prefill_ms, vit_prefill_tokens)
- Parameters:
vit_prefill_ms (float)
vit_prefill_tokens (int)
- Return type:
dict[str, Any]
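The field names `vit_prefill_ms`, `vit_prefill_tokens`, and `vit_prefill_tps` suggest a tokens-per-second figure derived from a millisecond timing. A minimal sketch of that arithmetic follows. The function name and the handling of a non-positive timing are assumptions, not the documented behavior.

```python
from typing import Any, Dict


def vit_prefill_metrics_sketch(
    *, vit_prefill_ms: float, vit_prefill_tokens: int
) -> Dict[str, Any]:
    # Convert milliseconds to seconds before dividing.
    # Assumption: report None instead of dividing by a non-positive timing.
    tps = (
        vit_prefill_tokens / (vit_prefill_ms / 1000.0)
        if vit_prefill_ms > 0
        else None
    )
    return {
        "vit_prefill_ms": vit_prefill_ms,
        "vit_prefill_tokens": vit_prefill_tokens,
        "vit_prefill_tps": tps,
    }


metrics = vit_prefill_metrics_sketch(vit_prefill_ms=250.0, vit_prefill_tokens=1000)
```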
- pymllm.bench_one_batch.make_multimodal_prefill_metrics(*, prefill_latency, batch_size, input_len)
- Parameters:
prefill_latency (float)
batch_size (int)
input_len (int)
- Return type:
dict[str, Any]
- pymllm.bench_one_batch.make_multimodal_bench_input_from_processor_output(processor_output, *, batch_size, image_token_id, device, target_input_len=None, pad_token_id=0)
- Parameters:
processor_output (Any)
batch_size (int)
image_token_id (int)
device (str | torch.device)
target_input_len (Optional[int])
pad_token_id (int)
- Return type:
- pymllm.bench_one_batch.make_profile_trace_path(*, output_dir, prefix, run_name, setting, stage, step=None)
- Parameters:
output_dir (pathlib.Path)
prefix (str)
run_name (str)
setting (BenchSetting)
stage (str)
step (Optional[int])
- Return type:
pathlib.Path
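A trace-path helper with these parameters most likely encodes the run, the setting, and the stage in the file name so that traces from different runs do not overwrite each other. The sketch below shows one way to do that with `pathlib`. The naming scheme, the `.trace.json` suffix, and the flattened setting parameters are all hypothetical.

```python
from pathlib import Path
from typing import Optional


def make_profile_trace_path_sketch(
    *,
    output_dir: Path,
    prefix: str,
    run_name: str,
    batch_size: int,
    input_len: int,
    output_len: int,
    stage: str,
    step: Optional[int] = None,
) -> Path:
    # Hypothetical naming scheme: join every identifying part with "_".
    parts = [prefix, run_name, f"bs{batch_size}", f"in{input_len}", f"out{output_len}", stage]
    if step is not None:
        parts.append(f"step{step}")  # only present for per-step traces
    return output_dir / ("_".join(parts) + ".trace.json")


path = make_profile_trace_path_sketch(
    output_dir=Path("traces"),
    prefix="pymllm_profile",
    run_name="default",
    batch_size=1,
    input_len=256,
    output_len=128,
    stage="decode",
    step=3,
)
```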
- class pymllm.bench_one_batch.PymllmBenchRunner(runner)
- Parameters:
- runner
- device
- classmethod create(cfg)
- Parameters:
- Return type:
- clear()
- Return type:
None
- extend(input_ids, *, pixel_values=None, image_grid_thw=None, benchmark_vision_timing=False)
- Parameters:
input_ids (torch.Tensor)
pixel_values (Optional[torch.Tensor])
image_grid_thw (Optional[torch.Tensor])
benchmark_vision_timing (bool)
- Return type:
- decode(input_ids, state)
- Parameters:
input_ids (torch.Tensor)
state (DecodeState)
- Return type:
tuple[torch.Tensor, DecodeState]
- shutdown()
- Return type:
None
- pymllm.bench_one_batch.run_single_setting(*, bench_runner, args, setting, seed, record_result, multimodal_processor_bundle=None, allow_profile=True)
- Parameters:
bench_runner (PymllmBenchRunner)
args (BenchArgs)
setting (BenchSetting)
seed (int)
record_result (bool)
multimodal_processor_bundle (Optional[MultimodalProcessorBundle])
allow_profile (bool)
- Return type:
Optional[dict[str, Any]]
- pymllm.bench_one_batch.run_benchmark(cfg, args)
- Parameters:
args (BenchArgs)
- Return type:
list[dict[str, Any]]
- pymllm.bench_one_batch.DEFAULT_CORRECTNESS_PROMPTS = ('The capital of France is', 'The capital of the United Kingdom is', 'Today is a sunny day and I like')
- pymllm.bench_one_batch.correctness_test(bench_runner, cfg, args)
Single-stage smoke correctness check.
Encode a real prompt, run one full prefill at batch_size=1, greedy-decode
output_len tokens, and print the decoded text. Unlike SGLang’s --correct (which exercises a cut_len two-stage prefill to test prefix-KV reuse), this runs each prompt as a single full prefill. Greedy decoding makes the per-prompt output identical to SGLang’s batched path. The cut_len two-stage variant can be layered on later: prepare_forward_batch_extend already accepts extend_prefix_lens > 0 and req_to_token_pool.write can pre-populate prefix KV indices.
- Parameters:
bench_runner (PymllmBenchRunner)
args (BenchArgs)
- Return type:
None
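Greedy decoding, as used by the check above, picks the highest-scoring token at every step, which makes the output deterministic for a given model and prompt. The toy sketch below illustrates the loop in pure Python. `greedy_decode` and `toy_logits` are hypothetical, and unlike the real runner this sketch recomputes from the full sequence at each step instead of reusing a KV cache.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_logits: Callable[[Sequence[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    output_len: int,
) -> List[int]:
    """Append the argmax token output_len times and return only the new tokens."""
    ids = list(prompt_ids)
    for _ in range(output_len):
        logits = next_logits(ids)
        # Argmax over the vocabulary; ties resolve to the lowest token id.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
    return ids[len(prompt_ids):]


def toy_logits(ids: Sequence[int]) -> List[float]:
    # Toy "model" over a 5-token vocabulary: favors (last token + 1) mod 5.
    target = (ids[-1] + 1) % 5
    return [1.0 if i == target else 0.0 for i in range(5)]


generated = greedy_decode(toy_logits, [0, 1], output_len=4)
```

Because no sampling is involved, running the same prompt twice yields the same tokens, which is what makes a printed completion usable as a smoke check.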
- pymllm.bench_one_batch.main(argv=None)
- Parameters:
argv (Optional[Sequence[str]])
- Return type:
None