pymllm.orchestrator.model_runner_process

ModelRunnerProcess – GPU-owning component that executes model forward passes.

Instantiated in-process by SchedulerProcess. The scheduler calls _forward_batch() directly; no inter-process communication is involved.

This component owns the GPU: it holds a ModelRunner with model weights, KV-cache memory pools, and the attention backend. It also owns the RadixCache for prefix-aware KV reuse.

RadixCache lifecycle

match_prefix: called during _allocate_extend, before KV allocation.
inc_lock_ref: locks the matched radix-tree nodes to prevent eviction.
insert (prefill): inserts the prompt KV indices after prefill.
insert (completion): re-inserts the full sequence when a request finishes.
dec_lock_ref: unlocks the radix-tree nodes when a request is freed.
evict: called when KV allocation fails, to free stale cache entries.
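
The lifecycle above can be sketched with a toy prefix cache. This is an illustration only, not the pymllm implementation: ToyPrefixCache, Node, and demo are hypothetical names, the method signatures are assumptions that merely reuse the names from the list, and the structure is a plain per-token trie rather than a compressed radix tree. Eviction here removes any unlocked leaf; a real cache would more likely pick victims by recency.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    """One cached token (plain trie node; hypothetical, not pymllm's node type)."""

    def __init__(self, token: int = -1, kv_index: int = -1,
                 parent: Optional["Node"] = None) -> None:
        self.token = token
        self.kv_index = kv_index
        self.parent = parent
        self.lock_ref = 0
        self.children: Dict[int, "Node"] = {}


class ToyPrefixCache:
    """Hypothetical stand-in for RadixCache; signatures are assumptions."""

    def __init__(self) -> None:
        self.root = Node()

    def match_prefix(self, tokens: List[int]) -> Tuple[List[int], Node]:
        """Return the KV indices of the longest cached prefix and its last node."""
        node, kv = self.root, []
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                break
            kv.append(child.kv_index)
            node = child
        return kv, node

    def insert(self, tokens: List[int], kv_indices: List[int]) -> Node:
        """Add any missing suffix of `tokens`; existing nodes are left untouched."""
        node = self.root
        for t, idx in zip(tokens, kv_indices):
            child = node.children.get(t)
            if child is None:
                child = Node(token=t, kv_index=idx, parent=node)
                node.children[t] = child
            node = child
        return node

    def inc_lock_ref(self, node: Node) -> None:
        """Lock `node` and all its ancestors so evict() skips them."""
        while node is not self.root:
            node.lock_ref += 1
            node = node.parent

    def dec_lock_ref(self, node: Node) -> None:
        """Undo a previous inc_lock_ref on the same node."""
        while node is not self.root:
            node.lock_ref -= 1
            node = node.parent

    def evict(self, num_tokens: int) -> List[int]:
        """Remove up to `num_tokens` unlocked leaves and return their KV indices."""
        freed: List[int] = []
        while len(freed) < num_tokens:
            leaf = self._unlocked_leaf(self.root)
            if leaf is None:
                break
            del leaf.parent.children[leaf.token]
            freed.append(leaf.kv_index)
        return freed

    def _unlocked_leaf(self, node: Node) -> Optional[Node]:
        for child in node.children.values():
            found = self._unlocked_leaf(child)
            if found is not None:
                return found
        if node is not self.root and not node.children and node.lock_ref == 0:
            return node
        return None


def demo() -> Tuple[List[int], List[int], List[int]]:
    cache = ToyPrefixCache()
    cache.insert([1, 2, 3], [10, 11, 12])         # left behind by an earlier request

    prompt = [1, 2, 3, 4]
    reused, node = cache.match_prefix(prompt)     # match_prefix
    cache.inc_lock_ref(node)                      # inc_lock_ref
    blocked = cache.evict(3)                      # every node is locked: frees nothing
    cache.insert(prompt, reused + [13])           # insert (prefill)
    cache.insert(prompt + [5], reused + [13, 14])  # insert (completion)
    cache.dec_lock_ref(node)                      # dec_lock_ref
    freed = cache.evict(2)                        # evict: unlocked leaves can now go
    return reused, blocked, freed
```

In this sketch the lock is what keeps the reused prefix alive while the request runs: the first evict call returns nothing, and only after dec_lock_ref does eviction reclaim entries, starting from the leaves.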

Attributes

logger

Classes

ModelRunnerProcess

GPU-owning component created in-process by SchedulerProcess.

Module Contents

pymllm.orchestrator.model_runner_process.logger

class pymllm.orchestrator.model_runner_process.ModelRunnerProcess(gpu_id=0, server_config=None, model_config=None)

GPU-owning component created in-process by SchedulerProcess.

Parameters:

gpu_id (int)
server_config (Optional[Any])
model_config (Optional[Any])

init_model()

Create and initialise the ModelRunner and RadixCache.

Must run inside the subprocess (after spawn) because it performs CUDA initialisation.

Return type: None

shutdown()

Return type: None
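
The call order implied by this page (construct, init_model(), direct _forward_batch() calls, shutdown()) might look roughly like the sketch below. FakeRunnerProcess and scheduler_main are hypothetical stand-ins so the example runs without pymllm or a GPU. The constructor arguments match the signature documented above, but the _forward_batch argument and return value are assumptions, since this page does not document them.

```python
from typing import Any, List, Optional


class FakeRunnerProcess:
    """Stand-in that mirrors the documented method names of ModelRunnerProcess."""

    def __init__(self, gpu_id: int = 0, server_config: Optional[Any] = None,
                 model_config: Optional[Any] = None) -> None:
        # The constructor only stores configuration; no device work happens here.
        self.gpu_id = gpu_id
        self.server_config = server_config
        self.model_config = model_config
        self.ready = False

    def init_model(self) -> None:
        # The real method creates the ModelRunner and RadixCache (CUDA init).
        self.ready = True

    def _forward_batch(self, batch: List[List[int]]) -> List[int]:
        # Assumed signature: takes token lists, returns one value per request.
        if not self.ready:
            raise RuntimeError("init_model() must be called first")
        return [len(tokens) for tokens in batch]

    def shutdown(self) -> None:
        self.ready = False


def scheduler_main(gpu_id: int = 0) -> List[int]:
    """Hypothetical body of the scheduler subprocess, i.e. code run after spawn."""
    runner = FakeRunnerProcess(gpu_id=gpu_id)
    runner.init_model()               # heavy initialisation happens in the child
    try:
        # Plain method call: no queue or socket between scheduler and runner.
        return runner._forward_batch([[1, 2, 3], [4, 5]])
    finally:
        runner.shutdown()
```

Keeping the constructor cheap and deferring device setup to init_model() is what allows the object to be created by the scheduler only once it is already running in its own process.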