pymllm.executor.model_runner¶
ModelRunner runs the forward passes of the models.
The core of pymllm’s single-GPU inference architecture. Handles:
- Model loading (HuggingFace checkpoint via transformers)
- KV-cache memory pool initialisation
- Attention backend setup (FlashInfer)
- Forward pass dispatch (extend / decode / idle)
- Token sampling from logits
Typical lifecycle:
runner = ModelRunner(server_config, model_config)
runner.initialize()
# --- inside the inference loop ---
forward_batch = runner.prepare_forward_batch_decode(...)
logits_output = runner.forward(forward_batch)
next_token_ids = runner.sample(logits_output, forward_batch)
Typical data flow¶
- SchedulerProcess builds a batch dict
↓
- ModelRunnerProcess calls ModelRunner.forward(forward_batch)
↓
- attn_backend.init_forward_metadata(forward_batch)
↓
- model.forward(input_ids, positions, forward_batch)
↓
- ModelRunner.sample(logits_output, forward_batch)
↓
next_token_ids returned to scheduler
Attributes¶
logger
Classes¶
LogitsProcessorOutput — Container for output logits produced by the model's forward pass.
ModelRunner — Runs the forward passes of the models.
Functions¶
get_available_gpu_memory — Return available GPU memory in GB.
get_total_gpu_memory — Return total GPU memory in GB.
Module Contents¶
- pymllm.executor.model_runner.logger¶
- pymllm.executor.model_runner.get_available_gpu_memory(device='cuda', gpu_id=0)¶
Return available GPU memory in GB.
- Parameters:
device (str)
gpu_id (int)
- Return type:
float
- pymllm.executor.model_runner.get_total_gpu_memory(device='cuda', gpu_id=0)¶
Return total GPU memory in GB.
- Parameters:
device (str)
gpu_id (int)
- Return type:
float
- class pymllm.executor.model_runner.LogitsProcessorOutput¶
Container for output logits produced by the model’s forward pass.
- next_token_logits¶
Raw logits for the last token of each sequence in the batch, shape [batch_size, vocab_size].
- hidden_states¶
Optional hidden states from the model (e.g. for speculative decoding or auxiliary loss computation).
- next_token_logits: torch.Tensor¶
- hidden_states: torch.Tensor | None = None¶
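For intuition, the container behaves like a small dataclass with one required and one optional field. The sketch below mirrors the two documented attributes, with plain Python lists standing in for torch tensors so it runs on its own (the class name with the `Sketch` suffix is hypothetical):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogitsProcessorOutputSketch:
    # Raw logits for the last token of each sequence: [batch_size][vocab_size]
    next_token_logits: List[List[float]]
    # Optional hidden states (e.g. for speculative decoding); absent by default
    hidden_states: Optional[List[List[float]]] = None


out = LogitsProcessorOutputSketch(next_token_logits=[[0.1, 2.3, -1.0]])
print(len(out.next_token_logits[0]))  # vocab_size of this toy batch → 3
print(out.hidden_states)              # → None
```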
- class pymllm.executor.model_runner.ModelRunner(server_config=None, model_config=None, gpu_id=0)¶
Runs the forward passes of the models.
This is the core execution component that owns the model, memory pools, and attention backend. It is used by ModelRunnerProcess to execute batches dispatched by the scheduler.
- Parameters:
server_config (Optional[pymllm.configs.server_config.ServerConfig]) – Server runtime configuration. Falls back to the global singleton when None.
model_config (Optional[pymllm.configs.model_config.ModelConfig]) – Model configuration (wraps a HuggingFace PretrainedConfig). Falls back to the global singleton when None.
gpu_id (int) – GPU device index to use.
- server_config¶
- model_config¶
- gpu_id = 0¶
- device: str = 'cuda'¶
- dtype: torch.dtype¶
- model: torch.nn.Module | None = None¶
- req_to_token_pool: pymllm.mem_cache.memory_pool.ReqToTokenPool | None = None¶
- token_to_kv_pool: pymllm.mem_cache.memory_pool.KVPool | None = None¶
- token_to_kv_pool_allocator: pymllm.mem_cache.memory_pool.TokenToKVPoolAllocator | None = None¶
- gdn_pool: pymllm.mem_cache.memory_pool.GDNPool | None = None¶
- attn_backend: pymllm.layers.attention.attention_backend.AttentionBackend | None = None¶
- graph_runner: pymllm.executor.cuda_graph_runner.CudaGraphRunner | None = None¶
- max_total_num_tokens: int = 0¶
- max_running_requests: int = 0¶
- num_attention_heads: int = 0¶
- num_kv_heads: int = 0¶
- head_dim: int = 0¶
- vocab_size: int = 0¶
- context_len: int = 0¶
- kv_cache_dtype: torch.dtype¶
- forward_pass_id: int = 0¶
- initialize()¶
Full initialisation: set device, load model, init memory + backend.
Call this once before any forward pass.
- Return type:
None
- load_model()¶
Load the model from a HuggingFace checkpoint.
First checks the pymllm model registry for a custom implementation that uses RadixAttention. If found, instantiates it with the HuggingFace config and loads weights via load_weights(). Otherwise falls back to AutoModelForCausalLM.from_pretrained.
- Return type:
None
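The registry-first, fallback-second structure can be sketched as follows; `MODEL_REGISTRY`, `load_model_sketch`, and the string stand-ins are hypothetical illustrations, not pymllm's actual registry API:

```python
# Hypothetical registry: maps architecture name -> builder for a custom
# RadixAttention implementation. Strings stand in for real model objects.
MODEL_REGISTRY = {
    "qwen3": lambda cfg: f"CustomQwen3({cfg})",
}


def load_model_sketch(arch: str, hf_config: str) -> str:
    builder = MODEL_REGISTRY.get(arch)
    if builder is not None:
        # Custom implementation found: instantiate with the HF config
        # (real code would then call load_weights() on the result).
        return builder(hf_config)
    # No registry hit: fall back to the generic HF path
    # (stand-in for AutoModelForCausalLM.from_pretrained).
    return f"AutoModel({hf_config})"


print(load_model_sketch("qwen3", "cfg"))  # → CustomQwen3(cfg)
print(load_model_sketch("llama", "cfg"))  # → AutoModel(cfg)
```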
- init_memory_pool()¶
Initialise KV-cache memory pools and request-to-token mapping.
- Profiles available GPU memory to determine the maximum number of KV-cache token slots (max_total_num_tokens).
- Derives max_running_requests from config or heuristic.
- Creates ReqToTokenPool, KVPool, and TokenToKVPoolAllocator.
- Return type:
None
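The profiling step amounts to dividing usable memory by the per-token KV footprint. A back-of-envelope sketch of that arithmetic, with a hypothetical `mem_fraction` knob and fp16 K/V entries assumed (pymllm's exact formula may differ):

```python
def max_kv_tokens(free_bytes: int, num_layers: int, num_kv_heads: int,
                  head_dim: int, dtype_bytes: int = 2,
                  mem_fraction: float = 0.9) -> int:
    """Illustrative estimate of KV-cache token capacity.

    Each cached token stores one K and one V vector per layer:
        bytes/token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    """
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return int(free_bytes * mem_fraction) // bytes_per_token


# e.g. 20 GB free, 32 layers, 8 KV heads of dim 128, fp16 (2 bytes)
print(max_kv_tokens(20 * 1024**3, 32, 8, 128, 2))  # → 147456
```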
- init_attention_backend()¶
Initialise the attention backend.
Creates a FlashInferAttnBackend for standard models, or a HybridAttnBackend (FlashInfer + GDN) for hybrid models.
- Return type:
None
- init_cuda_graphs()¶
Capture CUDA graphs for decode-step acceleration.
Skipped when:
- The device is not CUDA.
- server_config.disable_cuda_graph is True.
- The model is not a generation model.
- Return type:
None
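The three skip conditions combine into a simple predicate. The helper below is hypothetical (not part of the ModelRunner API), written just to make the capture decision explicit:

```python
def should_capture_cuda_graphs(device: str, disable_cuda_graph: bool,
                               is_generation: bool) -> bool:
    """Mirror the three documented skip conditions for CUDA-graph capture."""
    if device != "cuda":          # only CUDA devices support graph capture
        return False
    if disable_cuda_graph:        # explicit opt-out via server config
        return False
    if not is_generation:         # only generation (causal-LM) models qualify
        return False
    return True


print(should_capture_cuda_graphs("cuda", False, True))  # → True
print(should_capture_cuda_graphs("cpu", False, True))   # → False
```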
- prepare_forward_batch_extend(input_ids, req_pool_indices, seq_lens, extend_seq_lens, extend_prefix_lens, out_cache_loc, return_logprob=False, top_logprobs_nums=None)¶
Build a ForwardBatch for an extend (prefill) pass.
- Parameters:
input_ids (torch.Tensor) – Token IDs for all new tokens, shape [total_new_tokens].
req_pool_indices (torch.Tensor) – Index of each request in ReqToTokenPool, shape [batch_size].
seq_lens (torch.Tensor) – Total (prefix + new) length of each sequence, shape [batch_size].
extend_seq_lens (torch.Tensor) – Number of new tokens per sequence, shape [batch_size].
extend_prefix_lens (torch.Tensor) – Cached prefix length per sequence, shape [batch_size].
out_cache_loc (torch.Tensor) – KV-pool slot indices for each new token, shape [total_new_tokens].
return_logprob (bool) – Whether to return per-token log-probabilities.
top_logprobs_nums (Optional[List[int]]) – Number of top log-probs per sequence.
- Return type:
pymllm.engine.forward_batch.ForwardBatch
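The shape relationships among these arguments can be checked with plain Python lists in place of tensors. A toy batch of two requests, one fresh and one with a cached prefix:

```python
# Illustrative shape invariants for an extend (prefill) batch.
extend_prefix_lens = [0, 4]   # cached tokens per request
extend_seq_lens    = [5, 3]   # new tokens per request

# Total length of each sequence is prefix + new:
seq_lens = [p + n for p, n in zip(extend_prefix_lens, extend_seq_lens)]
assert seq_lens == [5, 7]

# input_ids and out_cache_loc are flat over all *new* tokens:
total_new_tokens = sum(extend_seq_lens)
input_ids     = list(range(total_new_tokens))  # shape [total_new_tokens]
out_cache_loc = list(range(total_new_tokens))  # one KV slot per new token
assert len(input_ids) == len(out_cache_loc) == 8
print(total_new_tokens)  # → 8
```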
- prepare_forward_batch_decode(input_ids, req_pool_indices, seq_lens, out_cache_loc, return_logprob=False, top_logprobs_nums=None, mrope_position_deltas=None)¶
Build a ForwardBatch for a decode step.
- Parameters:
input_ids (torch.Tensor) – Token IDs (one per sequence), shape [batch_size].
req_pool_indices (torch.Tensor) – Index of each request in ReqToTokenPool, shape [batch_size].
seq_lens (torch.Tensor) – Total sequence length of each request, shape [batch_size].
out_cache_loc (torch.Tensor) – KV-pool slot for each sequence’s new token, shape [batch_size].
return_logprob (bool) – Whether to return per-token log-probabilities.
top_logprobs_nums (Optional[List[int]]) – Number of top log-probs per sequence.
mrope_position_deltas (Optional[torch.Tensor]) – Per-request M-RoPE position deltas, shape [batch_size] (int64). Used by multimodal models (e.g. Qwen3-VL) to offset decode-step positions by the spatial extent of prefill images.
- Return type:
pymllm.engine.forward_batch.ForwardBatch
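In a decode step every argument is one entry per sequence. As a toy illustration of how a per-request delta could shift the decode-step position (one plausible reading; pymllm's actual position arithmetic is not spelled out here, and the delta values are made up):

```python
# One new token per sequence; its base position is seq_len - 1.
seq_lens = [12, 7, 30]               # total length per request
mrope_position_deltas = [4, 0, -2]   # hypothetical per-request M-RoPE offsets

positions = [s - 1 + d for s, d in zip(seq_lens, mrope_position_deltas)]
print(positions)  # → [15, 6, 27]
```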
- forward(forward_batch)¶
Run a forward pass through the model.
Dispatches to the appropriate method based on the batch’s ForwardMode. For decode batches, automatically uses CUDA-graph replay when a captured graph is available.
- Parameters:
forward_batch (pymllm.engine.forward_batch.ForwardBatch) – The prepared batch (from prepare_forward_batch_*).
- Returns:
Contains next_token_logits of shape [batch_size, vocab_size].
- Return type:
LogitsProcessorOutput
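The dispatch logic can be sketched with a stand-in enum (pymllm's real ForwardMode is defined elsewhere; the `dispatch` helper and its string results are hypothetical):

```python
from enum import Enum, auto


class ForwardMode(Enum):  # stand-in for pymllm's ForwardMode
    EXTEND = auto()
    DECODE = auto()
    IDLE = auto()


def dispatch(mode: ForwardMode, graph_available: bool) -> str:
    """Pick the execution path for a batch, as strings for illustration."""
    if mode is ForwardMode.DECODE:
        # Decode steps replay a captured CUDA graph when one is available.
        return "graph_replay" if graph_available else "forward_decode"
    if mode is ForwardMode.EXTEND:
        return "forward_extend"
    return "idle"


print(dispatch(ForwardMode.DECODE, True))   # → graph_replay
print(dispatch(ForwardMode.EXTEND, False))  # → forward_extend
```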
- forward_decode(forward_batch)¶
Run a decode forward pass (one new token per sequence).
Calls attn_backend.init_forward_metadata followed by model.forward.
- Parameters:
forward_batch (pymllm.engine.forward_batch.ForwardBatch)
- Return type:
LogitsProcessorOutput
- forward_extend(forward_batch)¶
Run an extend (prefill) forward pass.
Calls attn_backend.init_forward_metadata followed by model.forward.
- Parameters:
forward_batch (pymllm.engine.forward_batch.ForwardBatch)
- Return type:
LogitsProcessorOutput
- sample(logits_output, forward_batch, temperatures=None, top_ps=None, top_ks=None, penalty_params=None)¶
Sample next-token IDs from logits.
Supports per-request temperature, top-p, top-k, and penalties (repetition, frequency, presence).
- Parameters:
logits_output (LogitsProcessorOutput) – The logits from forward().
forward_batch (pymllm.engine.forward_batch.ForwardBatch) – The current forward batch.
temperatures (Optional[torch.Tensor]) – Per-request temperature, shape [batch_size].
top_ps (Optional[torch.Tensor]) – Per-request top-p, shape [batch_size].
top_ks (Optional[torch.Tensor]) – Per-request top-k, shape [batch_size].
penalty_params (Optional[Dict[str, Any]]) – Optional dict with keys repetition_penalties, frequency_penalties, presence_penalties (tensors of shape [batch_size]), and token_histories (list of list of int).
- Returns:
Next-token IDs, shape [batch_size], dtype int32.
- Return type:
torch.Tensor
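As a rough illustration of the temperature and top-k stages only (pymllm's actual sampling operates on GPU tensors and also applies top-p and penalties), in plain Python. For determinism the sketch returns the argmax of the filtered distribution rather than drawing a random sample:

```python
import math


def sample_sketch(logits, temperature=1.0, top_k=0):
    """Temperature scaling + top-k filtering over a list of logits.

    Returns the index of the most likely surviving token; real sampling
    would draw from the filtered distribution instead.
    """
    scaled = [l / max(temperature, 1e-6) for l in logits]
    if top_k > 0:
        # Keep only the top_k highest logits; mask the rest to -inf.
        kth = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= kth else float("-inf") for s in scaled]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs))


print(sample_sketch([1.0, 3.0, 2.0], temperature=0.7, top_k=2))  # → 1
```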
- shutdown()¶
Release model and memory resources.
- Return type:
None
- property is_generation: bool¶
True if the model is a generation (causal-LM) model.
- Return type:
bool
- property sliding_window_size: int | None¶
Sliding-window attention span, or None for full context.
- Return type:
Optional[int]
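For intuition, the window bounds which key positions a query may attend to. The helper below is hypothetical (not part of the ModelRunner API) and assumes the common convention that a window of size w covers the last w positions, query included:

```python
from typing import Optional, Tuple


def attended_key_range(pos: int, sliding_window_size: Optional[int]) -> Tuple[int, int]:
    """Inclusive range of key positions visible to the query at `pos`.

    None means full context (causal attention over all prior positions).
    """
    if sliding_window_size is None:
        return (0, pos)
    return (max(0, pos - sliding_window_size + 1), pos)


print(attended_key_range(10, None))  # → (0, 10)
print(attended_key_range(10, 4))     # → (7, 10)
```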