pymllm.executor.model_runner

ModelRunner runs the forward passes of the models.

This module is the core of pymllm’s single-GPU inference architecture. It handles:

  • Model loading (HuggingFace checkpoint via transformers)

  • KV-cache memory pool initialisation

  • Attention backend setup (FlashInfer)

  • Forward pass dispatch (extend / decode / idle)

  • Token sampling from logits

Typical lifecycle:

runner = ModelRunner(server_config, model_config)
runner.initialize()

# --- inside the inference loop ---
forward_batch = runner.prepare_forward_batch_decode(...)
logits_output = runner.forward(forward_batch)
next_token_ids = runner.sample(logits_output, forward_batch)

Typical data flow:

  1. SchedulerProcess builds a batch dict.

  2. ModelRunnerProcess calls ModelRunner.forward(forward_batch).

  3. attn_backend.init_forward_metadata(forward_batch)

  4. model.forward(input_ids, positions, forward_batch)

  5. ModelRunner.sample(logits_output, forward_batch)

  6. next_token_ids returned to the scheduler.
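The flow above can be sketched end-to-end with stand-in stubs. The stub classes and their return values here are illustrative, not pymllm's real API; only the call order mirrors this module:

```python
from dataclasses import dataclass, field

@dataclass
class FakeForwardBatch:
    input_ids: list
    trace: list = field(default_factory=list)

class FakeModelRunner:
    def forward(self, batch):
        batch.trace.append("init_forward_metadata")  # attention-backend metadata
        batch.trace.append("model.forward")          # the actual forward pass
        return [len(batch.input_ids)]                # stand-in "logits"

    def sample(self, logits_output, batch):
        batch.trace.append("sample")
        return list(range(logits_output[0]))         # stand-in token IDs

# One iteration, mirroring steps 1-6 of the data-flow list:
runner = FakeModelRunner()
batch = FakeForwardBatch(input_ids=[11, 22, 33])     # scheduler-built batch
logits_output = runner.forward(batch)
next_token_ids = runner.sample(logits_output, batch)
print(batch.trace)        # order of internal calls
print(next_token_ids)     # returned to the scheduler
```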

Attributes

logger

Module-level logger.

Classes

LogitsProcessorOutput

Container for output logits produced by the model's forward pass.

ModelRunner

Runs the forward passes of the models.

Functions

get_available_gpu_memory([device, gpu_id])

Return available GPU memory in GB.

get_total_gpu_memory([device, gpu_id])

Return total GPU memory in GB.

Module Contents

pymllm.executor.model_runner.logger

Module-level logger.

pymllm.executor.model_runner.get_available_gpu_memory(device='cuda', gpu_id=0)

Return available GPU memory in GB.

Parameters:
  • device (str)

  • gpu_id (int)

Return type:

float

pymllm.executor.model_runner.get_total_gpu_memory(device='cuda', gpu_id=0)

Return total GPU memory in GB.

Parameters:
  • device (str)

  • gpu_id (int)

Return type:

float
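Both helpers are thin wrappers over device memory queries. A plausible sketch using torch.cuda.mem_get_info follows; the real implementations may differ, e.g. in how non-CUDA devices are handled:

```python
import torch

def available_gpu_memory_gb(device: str = "cuda", gpu_id: int = 0) -> float:
    """Sketch of get_available_gpu_memory: free bytes on the device, in GB."""
    if device == "cuda" and torch.cuda.is_available():
        free_bytes, _total_bytes = torch.cuda.mem_get_info(gpu_id)
        return free_bytes / (1 << 30)
    return 0.0  # assumed fallback for non-CUDA devices

def total_gpu_memory_gb(device: str = "cuda", gpu_id: int = 0) -> float:
    """Sketch of get_total_gpu_memory: total bytes on the device, in GB."""
    if device == "cuda" and torch.cuda.is_available():
        _free_bytes, total_bytes = torch.cuda.mem_get_info(gpu_id)
        return total_bytes / (1 << 30)
    return 0.0
```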

class pymllm.executor.model_runner.LogitsProcessorOutput

Container for output logits produced by the model’s forward pass.

next_token_logits: torch.Tensor

Raw logits for the last token of each sequence in the batch, shape [batch_size, vocab_size].

hidden_states: torch.Tensor | None = None

Optional hidden states from the model (e.g. for speculative decoding or auxiliary loss computation).

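A minimal usage sketch of the container, with a mirror dataclass and greedy selection from next_token_logits (the tiny logits tensor is made up for illustration):

```python
from dataclasses import dataclass
from typing import Optional

import torch

# Mirror of the container, with the fields documented above.
@dataclass
class LogitsProcessorOutput:
    next_token_logits: torch.Tensor                   # [batch_size, vocab_size]
    hidden_states: Optional[torch.Tensor] = None      # e.g. speculative decoding

# Greedy selection from the logits, e.g. when sampling with temperature 0:
out = LogitsProcessorOutput(
    next_token_logits=torch.tensor([[0.1, 2.0, -1.0],
                                    [3.0, 0.0, 0.5]])  # batch=2, vocab=3
)
greedy_ids = out.next_token_logits.argmax(dim=-1)
print(greedy_ids.tolist())  # [1, 0]
```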
class pymllm.executor.model_runner.ModelRunner(server_config=None, model_config=None, gpu_id=0)

Runs the forward passes of the models.

This is the core execution component that owns the model, memory pools, and attention backend. It is used by ModelRunnerProcess to execute batches dispatched by the scheduler.

Parameters:
  • server_config

  • model_config

  • gpu_id (int) – GPU device index (default 0).
device: str = 'cuda'
dtype: torch.dtype
model: torch.nn.Module | None = None
req_to_token_pool: pymllm.mem_cache.memory_pool.ReqToTokenPool | None = None
token_to_kv_pool: pymllm.mem_cache.memory_pool.KVPool | None = None
token_to_kv_pool_allocator: pymllm.mem_cache.memory_pool.TokenToKVPoolAllocator | None = None
gdn_pool: pymllm.mem_cache.memory_pool.GDNPool | None = None
attn_backend: pymllm.layers.attention.attention_backend.AttentionBackend | None = None
graph_runner: pymllm.executor.cuda_graph_runner.CudaGraphRunner | None = None
max_total_num_tokens: int = 0
max_running_requests: int = 0
num_hidden_layers: int = 0
num_attention_heads: int = 0
num_kv_heads: int = 0
head_dim: int = 0
hidden_size: int = 0
vocab_size: int = 0
context_len: int = 0
kv_cache_dtype: torch.dtype
forward_pass_id: int = 0
initialize()

Full initialisation: set device, load model, init memory + backend.

Call this once before any forward pass.

Return type:

None

load_model()

Load the model from a HuggingFace checkpoint.

First checks the pymllm model registry for a custom implementation that uses RadixAttention. If found, instantiates it with the HuggingFace config and loads weights via load_weights(). Otherwise falls back to AutoModelForCausalLM.from_pretrained.

Return type:

None
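The registry-first loading order can be sketched as follows; MODEL_REGISTRY, the stub model class, and the weight names are illustrative, not pymllm's actual registry API:

```python
# Hedged sketch of load_model()'s resolution order: check a registry of
# pymllm-native (RadixAttention) implementations first, otherwise fall back
# to transformers' AutoModelForCausalLM.from_pretrained.

class CustomRadixModel:
    """Stand-in for a pymllm-native model exposing load_weights()."""
    def load_weights(self, weights):
        self.weights = dict(weights)

MODEL_REGISTRY = {"LlamaForCausalLM": CustomRadixModel}  # illustrative

def load_model_sketch(architecture: str):
    custom_cls = MODEL_REGISTRY.get(architecture)
    if custom_cls is not None:
        # Registry hit: instantiate the custom implementation and feed it
        # checkpoint tensors through its load_weights() hook.
        model = custom_cls()
        model.load_weights({"lm_head.weight": None})  # placeholder tensors
        return model, "registry"
    # Registry miss: would call AutoModelForCausalLM.from_pretrained here
    # (omitted to keep the sketch self-contained).
    return None, "auto_fallback"

print(load_model_sketch("LlamaForCausalLM")[1])  # registry
print(load_model_sketch("SomeUnknownArch")[1])   # auto_fallback
```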

init_memory_pool()

Initialise KV-cache memory pools and request-to-token mapping.

  1. Profiles available GPU memory to determine the maximum number of KV-cache token slots (max_total_num_tokens).

  2. Derives max_running_requests from config or heuristic.

  3. Creates ReqToTokenPool, KVPool, and TokenToKVPoolAllocator.

Return type:

None
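Step 1's sizing boils down to dividing usable memory by the per-token KV footprint. A back-of-the-envelope sketch (the helper, the reserve fraction, and the model shape are assumptions for illustration, not pymllm's exact formula):

```python
def max_kv_token_slots(free_gb: float,
                       num_hidden_layers: int,
                       num_kv_heads: int,
                       head_dim: int,
                       kv_dtype_bytes: int = 2,      # fp16 / bf16
                       reserve_fraction: float = 0.1) -> int:
    """How many KV-cache token slots fit in the remaining GPU memory."""
    # Each token stores one K and one V vector per layer.
    bytes_per_token = 2 * num_hidden_layers * num_kv_heads * head_dim * kv_dtype_bytes
    usable_bytes = free_gb * (1 << 30) * (1 - reserve_fraction)
    return int(usable_bytes // bytes_per_token)

# e.g. a Llama-3-8B-like shape (32 layers, 8 KV heads, head_dim 128)
# with 20 GB free: 131072 bytes per token.
slots = max_kv_token_slots(20.0, num_hidden_layers=32, num_kv_heads=8, head_dim=128)
print(slots)  # 147456 slots for this shape
```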

init_attention_backend()

Initialise the attention backend.

Creates a FlashInferAttnBackend for standard models, or a HybridAttnBackend (FlashInfer + GDN) for hybrid models.

Return type:

None

init_cuda_graphs()

Capture CUDA graphs for decode-step acceleration.

Skipped when:
  • The device is not CUDA.

  • server_config.disable_cuda_graph is True.

  • The model is not a generation model.

Return type:

None

prepare_forward_batch_extend(input_ids, req_pool_indices, seq_lens, extend_seq_lens, extend_prefix_lens, out_cache_loc, return_logprob=False, top_logprobs_nums=None)

Build a ForwardBatch for an extend (prefill) pass.

Parameters:
  • input_ids (torch.Tensor) – Token IDs for all new tokens, shape [total_new_tokens].

  • req_pool_indices (torch.Tensor) – Index of each request in ReqToTokenPool, shape [batch_size].

  • seq_lens (torch.Tensor) – Total (prefix + new) length of each sequence, shape [batch_size].

  • extend_seq_lens (torch.Tensor) – Number of new tokens per sequence, shape [batch_size].

  • extend_prefix_lens (torch.Tensor) – Cached prefix length per sequence, shape [batch_size].

  • out_cache_loc (torch.Tensor) – KV-pool slot indices for each new token, shape [total_new_tokens].

  • return_logprob (bool) – Whether to return per-token log-probabilities.

  • top_logprobs_nums (Optional[List[int]]) – Number of top log-probs per sequence.

Return type:

pymllm.engine.forward_batch.ForwardBatch
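A sketch of the tensor shapes involved, for a two-request batch where request 0 extends a 4-token cached prefix by 3 new tokens and request 1 prefills 2 tokens from scratch (token IDs and slot indices are made up):

```python
import torch

extend_prefix_lens = torch.tensor([4, 0])                     # cached prefix per request
extend_seq_lens    = torch.tensor([3, 2])                     # new tokens per request
seq_lens           = extend_prefix_lens + extend_seq_lens     # total lengths: [7, 2]
input_ids          = torch.tensor([101, 102, 103, 201, 202])  # [total_new_tokens] = [5]
req_pool_indices   = torch.tensor([0, 1])                     # slots in ReqToTokenPool
out_cache_loc      = torch.arange(5)                          # one KV slot per new token

# Shape invariants the batch builder relies on:
assert input_ids.numel() == int(extend_seq_lens.sum())
assert out_cache_loc.numel() == input_ids.numel()
print(seq_lens.tolist())  # [7, 2]
```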

prepare_forward_batch_decode(input_ids, req_pool_indices, seq_lens, out_cache_loc, return_logprob=False, top_logprobs_nums=None, mrope_position_deltas=None)

Build a ForwardBatch for a decode step.

Parameters:
  • input_ids (torch.Tensor) – Token IDs (one per sequence), shape [batch_size].

  • req_pool_indices (torch.Tensor) – Index of each request in ReqToTokenPool, shape [batch_size].

  • seq_lens (torch.Tensor) – Total sequence length of each request, shape [batch_size].

  • out_cache_loc (torch.Tensor) – KV-pool slot for each sequence’s new token, shape [batch_size].

  • return_logprob (bool) – Whether to return per-token log-probabilities.

  • top_logprobs_nums (Optional[List[int]]) – Number of top log-probs per sequence.

  • mrope_position_deltas (Optional[torch.Tensor]) – Per-request M-RoPE position deltas, shape [batch_size] (int64). Used by multimodal models (e.g. Qwen3-VL) to offset decode-step positions by the spatial extent of prefill images.

Return type:

pymllm.engine.forward_batch.ForwardBatch
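A shape sketch for a two-request decode batch. The position arithmetic at the end is an assumption about how mrope_position_deltas could be applied (decode position = seq_len - 1 plus the per-request delta); consult the M-RoPE implementation for the exact rule:

```python
import torch

seq_lens         = torch.tensor([7, 3])                     # total length per request
input_ids        = torch.tensor([900, 901])                 # one token per sequence
req_pool_indices = torch.tensor([0, 1])                     # slots in ReqToTokenPool
out_cache_loc    = torch.tensor([5, 6])                     # one new KV slot per sequence
mrope_deltas     = torch.tensor([-2, 0], dtype=torch.int64) # illustrative image offsets

# Assumed decode-position rule: offset the usual seq_len - 1 by the delta.
positions = seq_lens - 1 + mrope_deltas
print(positions.tolist())  # [4, 2]
```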

forward(forward_batch)

Run a forward pass through the model.

Dispatches to the appropriate method based on the batch’s ForwardMode. For decode batches, automatically uses CUDA-graph replay when a captured graph is available.

Parameters:

forward_batch (pymllm.engine.forward_batch.ForwardBatch) – The prepared batch (from prepare_forward_batch_*).

Returns:

Contains next_token_logits of shape [batch_size, vocab_size].

Return type:

LogitsProcessorOutput
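The dispatch described above can be sketched as a small function; the ForwardMode enum values and the graph-replay condition are paraphrased from this page, not copied from pymllm:

```python
from enum import Enum, auto

class ForwardMode(Enum):
    EXTEND = auto()
    DECODE = auto()
    IDLE = auto()

def dispatch(mode: ForwardMode, graph_captured: bool) -> str:
    """Return which forward path a batch in `mode` would take."""
    if mode is ForwardMode.DECODE:
        # Captured CUDA graphs are replayed only on decode steps.
        return "graph_replay" if graph_captured else "forward_decode"
    if mode is ForwardMode.EXTEND:
        return "forward_extend"
    return "idle"

print(dispatch(ForwardMode.DECODE, graph_captured=True))   # graph_replay
print(dispatch(ForwardMode.EXTEND, graph_captured=True))   # forward_extend
```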

forward_decode(forward_batch)

Run a decode forward pass (one new token per sequence).

Calls attn_backend.init_forward_metadata followed by model.forward.

Parameters:

forward_batch (pymllm.engine.forward_batch.ForwardBatch)

Return type:

LogitsProcessorOutput

forward_extend(forward_batch)

Run an extend (prefill) forward pass.

Calls attn_backend.init_forward_metadata followed by model.forward.

Parameters:

forward_batch (pymllm.engine.forward_batch.ForwardBatch)

Return type:

LogitsProcessorOutput

sample(logits_output, forward_batch, temperatures=None, top_ps=None, top_ks=None, penalty_params=None)

Sample next-token IDs from logits.

Supports per-request temperature, top-p, top-k, and penalties (repetition, frequency, presence).

Parameters:
  • logits_output (LogitsProcessorOutput) – The logits from forward().

  • forward_batch (pymllm.engine.forward_batch.ForwardBatch) – The current forward batch.

  • temperatures (Optional[torch.Tensor]) – Per-request temperature, shape [batch_size].

  • top_ps (Optional[torch.Tensor]) – Per-request top-p, shape [batch_size].

  • top_ks (Optional[torch.Tensor]) – Per-request top-k, shape [batch_size].

  • penalty_params (Optional[Dict[str, Any]]) – Optional dict with keys repetition_penalties, frequency_penalties, presence_penalties (tensors of shape [batch_size]), and token_histories (list of list of int).

Returns:

Next-token IDs, shape [batch_size], dtype int32.

Return type:

torch.Tensor
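A reference-style sketch of temperature / top-k / top-p sampling over a sorted probability tensor. pymllm's actual sampler and its penalty handling may differ; this only illustrates the per-request knobs listed above:

```python
import torch

def sample_sketch(logits, temperatures=None, top_ks=None, top_ps=None):
    logits = logits.float()
    if temperatures is not None:
        # Per-request temperature scaling (clamped to avoid divide-by-zero).
        logits = logits / temperatures.clamp_min(1e-5).unsqueeze(-1)
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    if top_ks is not None:
        # Drop everything past each request's top-k rank.
        ranks = torch.arange(probs.size(-1)).expand_as(sorted_probs)
        sorted_probs = sorted_probs.masked_fill(ranks >= top_ks.unsqueeze(-1), 0.0)
    if top_ps is not None:
        # Keep the smallest prefix whose cumulative mass reaches top-p.
        cumulative = sorted_probs.cumsum(dim=-1)
        sorted_probs = sorted_probs.masked_fill(
            (cumulative - sorted_probs) > top_ps.unsqueeze(-1), 0.0)
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice).squeeze(-1).to(torch.int32)

logits = torch.tensor([[0.0, 5.0, 1.0], [4.0, 0.0, 0.0]])
next_ids = sample_sketch(logits,
                         temperatures=torch.tensor([1.0, 1.0]),
                         top_ks=torch.tensor([1, 1]))
print(next_ids.tolist())  # top_k=1 degenerates to argmax: [1, 0]
```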

shutdown()

Release model and memory resources.

Return type:

None

property is_generation: bool

True if the model is a generation (causal-LM) model.

Return type:

bool

property sliding_window_size: int | None

Sliding-window attention span, or None for full context.

Return type:

Optional[int]