pymllm.executor.model_runner

ModelRunner runs the forward passes of the models.

This module is the core of pymllm’s single-GPU inference architecture. It handles:

  • Model loading (HuggingFace checkpoint via transformers)

  • KV-cache memory pool initialisation

  • Attention backend setup (FlashInfer)

  • Forward pass dispatch (extend / decode / idle)

  • Token sampling from logits

Typical lifecycle:

runner = ModelRunner(server_config, model_config)
runner.initialize()

# --- inside the inference loop ---
forward_batch = runner.prepare_forward_batch_decode(...)
logits_output = runner.forward(forward_batch)
next_token_ids = runner.sample(logits_output, forward_batch)

Typical data flow:

  1. SchedulerProcess builds a batch dict.

  2. ModelRunnerProcess calls ModelRunner.forward(forward_batch).

  3. attn_backend.init_forward_metadata(forward_batch)

  4. model.forward(input_ids, positions, forward_batch)

  5. ModelRunner.sample(logits_output, forward_batch)

  6. next_token_ids returned to the scheduler.
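The flow above can be sketched end-to-end with stand-in stubs. The stub classes and their return values here are illustrative, not pymllm's real API; only the call order mirrors this module:

```python
from dataclasses import dataclass, field

@dataclass
class FakeForwardBatch:
    input_ids: list
    trace: list = field(default_factory=list)

class FakeModelRunner:
    def forward(self, batch):
        batch.trace.append("init_forward_metadata")  # attention-backend metadata
        batch.trace.append("model.forward")          # the actual forward pass
        return [len(batch.input_ids)]                # stand-in "logits"

    def sample(self, logits_output, batch):
        batch.trace.append("sample")
        return list(range(logits_output[0]))         # stand-in token IDs

# One iteration, mirroring steps 1-6 of the data-flow list:
runner = FakeModelRunner()
batch = FakeForwardBatch(input_ids=[11, 22, 33])     # scheduler-built batch
logits_output = runner.forward(batch)
next_token_ids = runner.sample(logits_output, batch)
print(batch.trace)        # order of internal calls
print(next_token_ids)     # returned to the scheduler
```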

Attributes

logger

Module-level logger.

Classes

LogitsProcessorOutput

Container for output logits produced by the model's forward pass.

ModelRunner

Runs the forward passes of the models.

Functions

get_available_gpu_memory([device, gpu_id])

Return available GPU memory in GB.

get_total_gpu_memory([device, gpu_id])

Return total GPU memory in GB.

Module Contents

pymllm.executor.model_runner.logger

Module-level logger.

pymllm.executor.model_runner.get_available_gpu_memory(device='cuda', gpu_id=0)

Return available GPU memory in GB.

Parameters:
  • device (str)

  • gpu_id (int)

Return type:

float

pymllm.executor.model_runner.get_total_gpu_memory(device='cuda', gpu_id=0)

Return total GPU memory in GB.

Parameters:
  • device (str)

  • gpu_id (int)

Return type:

float
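Both helpers are thin wrappers over device memory queries. A plausible sketch using torch.cuda.mem_get_info follows; the real implementations may differ, e.g. in how non-CUDA devices are handled:

```python
import torch

def available_gpu_memory_gb(device: str = "cuda", gpu_id: int = 0) -> float:
    """Sketch of get_available_gpu_memory: free bytes on the device, in GB."""
    if device == "cuda" and torch.cuda.is_available():
        free_bytes, _total_bytes = torch.cuda.mem_get_info(gpu_id)
        return free_bytes / (1 << 30)
    return 0.0  # assumed fallback for non-CUDA devices

def total_gpu_memory_gb(device: str = "cuda", gpu_id: int = 0) -> float:
    """Sketch of get_total_gpu_memory: total bytes on the device, in GB."""
    if device == "cuda" and torch.cuda.is_available():
        _free_bytes, total_bytes = torch.cuda.mem_get_info(gpu_id)
        return total_bytes / (1 << 30)
    return 0.0
```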

class pymllm.executor.model_runner.LogitsProcessorOutput

Container for output logits produced by the model’s forward pass.

next_token_logits: torch.Tensor

Raw logits for the last token of each sequence in the batch, shape [batch_size, vocab_size].

hidden_states: torch.Tensor | None = None

Optional hidden states from the model (e.g. for speculative decoding or auxiliary loss computation).

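A minimal usage sketch of the container, with a mirror dataclass and greedy selection from next_token_logits (the tiny logits tensor is made up for illustration):

```python
from dataclasses import dataclass
from typing import Optional

import torch

# Mirror of the container, with the fields documented above.
@dataclass
class LogitsProcessorOutput:
    next_token_logits: torch.Tensor                   # [batch_size, vocab_size]
    hidden_states: Optional[torch.Tensor] = None      # e.g. speculative decoding

# Greedy selection from the logits, e.g. when sampling with temperature 0:
out = LogitsProcessorOutput(
    next_token_logits=torch.tensor([[0.1, 2.0, -1.0],
                                    [3.0, 0.0, 0.5]])  # batch=2, vocab=3
)
greedy_ids = out.next_token_logits.argmax(dim=-1)
print(greedy_ids.tolist())  # [1, 0]
```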
class pymllm.executor.model_runner.ModelRunner(server_config=None, model_config=None, gpu_id=0)

Runs the forward passes of the models.

This is the core execution component that owns the model, memory pools, and attention backend. It is used by ModelRunnerProcess to execute batches dispatched by the scheduler.

Parameters:
  • server_config

  • model_config

  • gpu_id (int) – GPU device index (default 0).
device: str = 'cuda'
dtype: torch.dtype
model: torch.nn.Module | None = None
req_to_token_pool: pymllm.mem_cache.memory_pool.ReqToTokenPool | None = None
token_to_kv_pool: pymllm.mem_cache.memory_pool.KVPool | None = None
token_to_kv_pool_allocator: pymllm.mem_cache.memory_pool.TokenToKVPoolAllocator | None = None
gdn_pool: pymllm.mem_cache.memory_pool.GDNPool | None = None
attn_backend: pymllm.layers.attention.attention_backend.AttentionBackend | None = None
graph_runner: pymllm.executor.cuda_graph_runner.CudaGraphRunner | None = None
max_total_num_tokens: int = 0
max_running_requests: int = 0
num_hidden_layers: int = 0
num_attention_heads: int = 0
num_kv_heads: int = 0
head_dim: int = 0
hidden_size: int = 0
vocab_size: int = 0
context_len: int = 0
kv_cache_dtype: torch.dtype
forward_pass_id: int = 0
initialize()

Full initialisation: set device, load model, init memory + backend.

Call this once before any forward pass.

Return type:

None

load_model()

Load the model from a HuggingFace checkpoint.

First checks the pymllm model registry for a custom implementation that uses RadixAttention. If found, instantiates it with the HuggingFace config and loads weights via load_weights(). Otherwise falls back to AutoModelForCausalLM.from_pretrained.

Return type:

None
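The registry-first loading order can be sketched as follows; MODEL_REGISTRY, the stub model class, and the weight names are illustrative, not pymllm's actual registry API:

```python
# Hedged sketch of load_model()'s resolution order: check a registry of
# pymllm-native (RadixAttention) implementations first, otherwise fall back
# to transformers' AutoModelForCausalLM.from_pretrained.

class CustomRadixModel:
    """Stand-in for a pymllm-native model exposing load_weights()."""
    def load_weights(self, weights):
        self.weights = dict(weights)

MODEL_REGISTRY = {"LlamaForCausalLM": CustomRadixModel}  # illustrative

def load_model_sketch(architecture: str):
    custom_cls = MODEL_REGISTRY.get(architecture)
    if custom_cls is not None:
        # Registry hit: instantiate the custom implementation and feed it
        # checkpoint tensors through its load_weights() hook.
        model = custom_cls()
        model.load_weights({"lm_head.weight": None})  # placeholder tensors
        return model, "registry"
    # Registry miss: would call AutoModelForCausalLM.from_pretrained here
    # (omitted to keep the sketch self-contained).
    return None, "auto_fallback"

print(load_model_sketch("LlamaForCausalLM")[1])  # registry
print(load_model_sketch("SomeUnknownArch")[1])   # auto_fallback
```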

init_memory_pool()

Initialise KV-cache memory pools and request-to-token mapping.

  1. Profiles available GPU memory to determine the maximum number of KV-cache token slots (max_total_num_tokens).

  2. Derives max_running_requests from config or heuristic.

  3. Creates ReqToTokenPool, KVPool, and TokenToKVPoolAllocator.

Return type:

None
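Step 1's sizing boils down to dividing usable memory by the per-token KV footprint. A back-of-the-envelope sketch (the helper, the reserve fraction, and the model shape are assumptions for illustration, not pymllm's exact formula):

```python
def max_kv_token_slots(free_gb: float,
                       num_hidden_layers: int,
                       num_kv_heads: int,
                       head_dim: int,
                       kv_dtype_bytes: int = 2,      # fp16 / bf16
                       reserve_fraction: float = 0.1) -> int:
    """How many KV-cache token slots fit in the remaining GPU memory."""
    # Each token stores one K and one V vector per layer.
    bytes_per_token = 2 * num_hidden_layers * num_kv_heads * head_dim * kv_dtype_bytes
    usable_bytes = free_gb * (1 << 30) * (1 - reserve_fraction)
    return int(usable_bytes // bytes_per_token)

# e.g. a Llama-3-8B-like shape (32 layers, 8 KV heads, head_dim 128)
# with 20 GB free: 131072 bytes per token.
slots = max_kv_token_slots(20.0, num_hidden_layers=32, num_kv_heads=8, head_dim=128)
print(slots)  # 147456 slots for this shape
```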

init_attention_backend()

Initialise the attention backend.

Creates a FlashInferAttnBackend for standard models, or a HybridAttnBackend (FlashInfer + GDN) for hybrid models.

Return type:

None

init_cuda_graphs()

Capture CUDA graphs for decode-step acceleration.

Skipped when:
  • The device is not CUDA.

  • server_config.disable_cuda_graph is True.

  • The model is not a generation model.

Return type:

None

prepare_forward_batch_extend(input_ids, req_pool_indices, seq_lens, extend_seq_lens, extend_prefix_lens, out_cache_loc, return_logprob=False, top_logprobs_nums=None)

Build a ForwardBatch for an extend (prefill) pass.

Parameters:
  • input_ids (torch.Tensor) – Token IDs for all new tokens, shape [total_new_tokens].

  • req_pool_indices (torch.Tensor) – Index of each request in ReqToTokenPool, shape [batch_size].

  • seq_lens (torch.Tensor) – Total (prefix + new) length of each sequence, shape [batch_size].

  • extend_seq_lens (torch.Tensor) – Number of new tokens per sequence, shape [batch_size].

  • extend_prefix_lens (torch.Tensor) – Cached prefix length per sequence, shape [batch_size].

  • out_cache_loc (torch.Tensor) – KV-pool slot indices for each new token, shape [total_new_tokens].

  • return_logprob (bool) – Whether to return per-token log-probabilities.

  • top_logprobs_nums (Optional[List[int]]) – Number of top log-probs per sequence.

Return type:

pymllm.engine.forward_batch.ForwardBatch
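A sketch of the tensor shapes involved, for a two-request batch where request 0 extends a 4-token cached prefix by 3 new tokens and request 1 prefills 2 tokens from scratch (token IDs and slot indices are made up):

```python
import torch

extend_prefix_lens = torch.tensor([4, 0])                     # cached prefix per request
extend_seq_lens    = torch.tensor([3, 2])                     # new tokens per request
seq_lens           = extend_prefix_lens + extend_seq_lens     # total lengths: [7, 2]
input_ids          = torch.tensor([101, 102, 103, 201, 202])  # [total_new_tokens] = [5]
req_pool_indices   = torch.tensor([0, 1])                     # slots in ReqToTokenPool
out_cache_loc      = torch.arange(5)                          # one KV slot per new token

# Shape invariants the batch builder relies on:
assert input_ids.numel() == int(extend_seq_lens.sum())
assert out_cache_loc.numel() == input_ids.numel()
print(seq_lens.tolist())  # [7, 2]
```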

prepare_forward_batch_decode(input_ids, req_pool_indices, seq_lens, out_cache_loc, return_logprob=False, top_logprobs_nums=None, mrope_position_deltas=None)

Build a ForwardBatch for a decode step.

Parameters:
  • input_ids (torch.Tensor) – Token IDs (one per sequence), shape [batch_size].

  • req_pool_indices (torch.Tensor) – Index of each request in ReqToTokenPool, shape [batch_size].

  • seq_lens (torch.Tensor) – Total sequence length of each request, shape [batch_size].

  • out_cache_loc (torch.Tensor) – KV-pool slot for each sequence’s new token, shape [batch_size].

  • return_logprob (bool) – Whether to return per-token log-probabilities.

  • top_logprobs_nums (Optional[List[int]]) – Number of top log-probs per sequence.

  • mrope_position_deltas (Optional[torch.Tensor]) – Per-request M-RoPE position deltas, shape [batch_size] (int64). Used by multimodal models (e.g. Qwen3-VL) to offset decode-step positions by the spatial extent of prefill images.

Return type:

pymllm.engine.forward_batch.ForwardBatch
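A shape sketch for a two-request decode batch. The position arithmetic at the end is an assumption about how mrope_position_deltas could be applied (decode position = seq_len - 1 plus the per-request delta); consult the M-RoPE implementation for the exact rule:

```python
import torch

seq_lens         = torch.tensor([7, 3])                     # total length per request
input_ids        = torch.tensor([900, 901])                 # one token per sequence
req_pool_indices = torch.tensor([0, 1])                     # slots in ReqToTokenPool
out_cache_loc    = torch.tensor([5, 6])                     # one new KV slot per sequence
mrope_deltas     = torch.tensor([-2, 0], dtype=torch.int64) # illustrative image offsets

# Assumed decode-position rule: offset the usual seq_len - 1 by the delta.
positions = seq_lens - 1 + mrope_deltas
print(positions.tolist())  # [4, 2]
```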

forward(forward_batch)

Run a forward pass through the model.

Dispatches to the appropriate method based on the batch’s ForwardMode. For decode batches, automatically uses CUDA-graph replay when a captured graph is available.

Parameters:

forward_batch (pymllm.engine.forward_batch.ForwardBatch) – The prepared batch (from prepare_forward_batch_*).

Returns:

Contains next_token_logits of shape [batch_size, vocab_size].

Return type:

LogitsProcessorOutput
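The dispatch described above can be sketched as a small function; the ForwardMode enum values and the graph-replay condition are paraphrased from this page, not copied from pymllm:

```python
from enum import Enum, auto

class ForwardMode(Enum):
    EXTEND = auto()
    DECODE = auto()
    IDLE = auto()

def dispatch(mode: ForwardMode, graph_captured: bool) -> str:
    """Return which forward path a batch in `mode` would take."""
    if mode is ForwardMode.DECODE:
        # Captured CUDA graphs are replayed only on decode steps.
        return "graph_replay" if graph_captured else "forward_decode"
    if mode is ForwardMode.EXTEND:
        return "forward_extend"
    return "idle"

print(dispatch(ForwardMode.DECODE, graph_captured=True))   # graph_replay
print(dispatch(ForwardMode.EXTEND, graph_captured=True))   # forward_extend
```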

forward_decode(forward_batch)

Run a decode forward pass (one new token per sequence).

Calls attn_backend.init_forward_metadata followed by model.forward.

Parameters:

forward_batch (pymllm.engine.forward_batch.ForwardBatch)

Return type:

LogitsProcessorOutput

forward_extend(forward_batch)

Run an extend (prefill) forward pass.

Calls attn_backend.init_forward_metadata followed by model.forward.

Parameters:

forward_batch (pymllm.engine.forward_batch.ForwardBatch)

Return type:

LogitsProcessorOutput

sample(logits_output, forward_batch, temperatures=None, top_ps=None, top_ks=None, penalty_params=None)

Sample next-token IDs from logits.

Supports per-request temperature, top-p, top-k, and penalties (repetition, frequency, presence).

Parameters:
  • logits_output (LogitsProcessorOutput) – The logits from forward().

  • forward_batch (pymllm.engine.forward_batch.ForwardBatch) – The current forward batch.

  • temperatures (Optional[torch.Tensor]) – Per-request temperature, shape [batch_size].

  • top_ps (Optional[torch.Tensor]) – Per-request top-p, shape [batch_size].

  • top_ks (Optional[torch.Tensor]) – Per-request top-k, shape [batch_size].

  • penalty_params (Optional[Dict[str, Any]]) – Optional dict with keys repetition_penalties, frequency_penalties, presence_penalties (tensors of shape [batch_size]), and token_histories (list of list of int).

Returns:

Next-token IDs, shape [batch_size], dtype int32.

Return type:

torch.Tensor
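A reference-style sketch of temperature / top-k / top-p sampling over a sorted probability tensor. pymllm's actual sampler and its penalty handling may differ; this only illustrates the per-request knobs listed above:

```python
import torch

def sample_sketch(logits, temperatures=None, top_ks=None, top_ps=None):
    logits = logits.float()
    if temperatures is not None:
        # Per-request temperature scaling (clamped to avoid divide-by-zero).
        logits = logits / temperatures.clamp_min(1e-5).unsqueeze(-1)
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    if top_ks is not None:
        # Drop everything past each request's top-k rank.
        ranks = torch.arange(probs.size(-1)).expand_as(sorted_probs)
        sorted_probs = sorted_probs.masked_fill(ranks >= top_ks.unsqueeze(-1), 0.0)
    if top_ps is not None:
        # Keep the smallest prefix whose cumulative mass reaches top-p.
        cumulative = sorted_probs.cumsum(dim=-1)
        sorted_probs = sorted_probs.masked_fill(
            (cumulative - sorted_probs) > top_ps.unsqueeze(-1), 0.0)
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice).squeeze(-1).to(torch.int32)

logits = torch.tensor([[0.0, 5.0, 1.0], [4.0, 0.0, 0.0]])
next_ids = sample_sketch(logits,
                         temperatures=torch.tensor([1.0, 1.0]),
                         top_ks=torch.tensor([1, 1]))
print(next_ids.tolist())  # top_k=1 degenerates to argmax: [1, 0]
```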

shutdown()

Release model and memory resources.

Return type:

None

property is_generation: bool

True if the model is a generation (causal-LM) model.

Return type:

bool

property sliding_window_size: int | None

Sliding-window attention span, or None for full context.

Return type:

Optional[int]