pymllm.orchestrator.model_runner_process
ModelRunnerProcess – GPU-owning component that executes model forward passes.
Instantiated in-process by SchedulerProcess.
The scheduler calls _forward_batch() directly;
no inter-process communication is involved.
This component owns the GPU: it holds a ModelRunner with model
weights, KV-cache memory pools, and the attention backend. It also owns
the RadixCache for prefix-aware KV reuse.
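The ownership relationship described above can be sketched as follows. This is a minimal illustration, not pymllm's code: ToySchedulerProcess and ToyModelRunnerProcess are hypothetical stand-ins, and the body of _forward_batch() is a placeholder, since the real signature and return value are not documented here.

```python
class ToyModelRunnerProcess:
    """Hypothetical stand-in for the GPU-owning component."""

    def __init__(self, gpu_id=0):
        self.gpu_id = gpu_id

    def _forward_batch(self, batch):
        # Placeholder for a model forward pass: return one value per request.
        # Here the "output" is simply the number of tokens in each request.
        return [len(tokens) for tokens in batch]


class ToySchedulerProcess:
    """Hypothetical stand-in for the scheduler that owns the runner."""

    def __init__(self):
        # The runner is an ordinary attribute created inside the scheduler,
        # so no queue, pipe, or socket sits between the two components.
        self.runner = ToyModelRunnerProcess(gpu_id=0)

    def step(self, batch):
        # Direct method call in the same process and address space.
        return self.runner._forward_batch(batch)


scheduler = ToySchedulerProcess()
outputs = scheduler.step([[1, 2, 3], [4]])
```

Because the call is a plain method invocation, the batch is passed by reference and nothing is serialised between the scheduler and the runner.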
RadixCache lifecycle
match_prefix: called during _allocate_extend, before KV allocation.
inc_lock_ref: locks the matched radix-tree nodes to prevent eviction.
insert (prefill): inserts the prompt's KV indices after prefill.
insert (completion): re-inserts the full sequence when a request finishes.
dec_lock_ref: unlocks the radix-tree nodes when a request is freed.
evict: called when KV allocation fails, to free stale cache entries.
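The lifecycle above can be illustrated with a toy prefix cache. This sketch is an assumption-laden simplification: it stores one token per node instead of compressed radix edges, and the method signatures are hypothetical, chosen only to mirror the step names listed above. It does not reproduce the real RadixCache API.

```python
class Node:
    def __init__(self, parent=None, token=None, kv_index=None):
        self.parent = parent
        self.token = token
        self.kv_index = kv_index  # slot in a (hypothetical) KV pool
        self.children = {}
        self.lock_ref = 0         # > 0 means "in use, do not evict"


class ToyPrefixCache:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return the cached KV indices for the longest cached prefix."""
        node, indices = self.root, []
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                break
            node = child
            indices.append(node.kv_index)
        return indices, node

    def insert(self, tokens, kv_indices):
        """Add missing nodes; nodes that already exist keep their index."""
        node = self.root
        for t, idx in zip(tokens, kv_indices):
            if t not in node.children:
                node.children[t] = Node(node, t, idx)
            node = node.children[t]
        return node

    def inc_lock_ref(self, node):
        # Lock the node and all of its ancestors (the whole matched prefix).
        while node is not self.root:
            node.lock_ref += 1
            node = node.parent

    def dec_lock_ref(self, node):
        while node is not self.root:
            node.lock_ref -= 1
            node = node.parent

    def evict(self, num_tokens):
        """Free up to num_tokens KV indices, removing unlocked leaves only."""
        freed = []
        while len(freed) < num_tokens:
            leaf = self._find_unlocked_leaf(self.root)
            if leaf is None:
                break
            del leaf.parent.children[leaf.token]
            freed.append(leaf.kv_index)
        return freed

    def _find_unlocked_leaf(self, node):
        for child in node.children.values():
            if child.children:
                found = self._find_unlocked_leaf(child)
                if found is not None:
                    return found
            elif child.lock_ref == 0:
                return child
        return None


cache = ToyPrefixCache()
# A finished request re-inserts its full sequence (tokens 1..4, slots 10..13).
cache.insert([1, 2, 3, 4], [10, 11, 12, 13])

# A new request shares the prefix [1, 2, 3] and locks it before allocating.
hit, last_node = cache.match_prefix([1, 2, 3, 9])
cache.inc_lock_ref(last_node)

# Allocation pressure: only the unlocked tail (token 4) can be evicted.
evicted_while_locked = cache.evict(10)

# Once the request is freed, the rest of the prefix becomes evictable.
cache.dec_lock_ref(last_node)
evicted_after_unlock = cache.evict(10)
```

In this walkthrough the lock taken after match_prefix is what keeps slots 10 to 12 alive while the second request is running; evict can only reclaim them after dec_lock_ref.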
Attributes
logger
Classes
ModelRunnerProcess | GPU-owning component created in-process by SchedulerProcess.
Module Contents
- pymllm.orchestrator.model_runner_process.logger
- class pymllm.orchestrator.model_runner_process.ModelRunnerProcess(gpu_id=0, server_config=None, model_config=None)
GPU-owning component created in-process by SchedulerProcess.
- Parameters:
gpu_id (int)
server_config (Optional[Any])
model_config (Optional[Any])
- init_model()
Create and initialise the ModelRunner and RadixCache.
Must be called inside the subprocess (after spawn), because it performs CUDA initialisation.
- Return type:
None
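The split between a cheap constructor and a separate init_model() call can be sketched as a two-phase initialisation pattern. The class below is a hypothetical stand-in: the model_runner placeholder, the init_pid field, and the guard checks are assumptions added for illustration, not behaviour documented for ModelRunnerProcess.

```python
import os


class ToyRunnerProcess:
    """Hypothetical two-phase component: configure first, initialise later."""

    def __init__(self, gpu_id=0, server_config=None, model_config=None):
        # Phase 1: store configuration only. Nothing here touches the device,
        # so the object can be created before the worker process is running.
        self.gpu_id = gpu_id
        self.server_config = server_config
        self.model_config = model_config
        self.model_runner = None
        self.init_pid = None

    def init_model(self):
        # Phase 2: create process-local resources. A plain object() stands in
        # for the model weights, KV pools, and cache that the real class builds.
        self.model_runner = object()
        self.init_pid = os.getpid()

    def _forward_batch(self, batch):
        if self.model_runner is None:
            raise RuntimeError("init_model() must be called first")
        if self.init_pid != os.getpid():
            # Device state belongs to the process that created it.
            raise RuntimeError("initialised in a different process")
        return len(batch)  # placeholder result


runner = ToyRunnerProcess(gpu_id=0)
try:
    runner._forward_batch([[1, 2]])
    failed_before_init = False
except RuntimeError:
    failed_before_init = True

runner.init_model()
result_after_init = runner._forward_batch([[1, 2], [3]])
```

Keeping device setup out of the constructor means the object holds only plain configuration until the process that will actually run forward passes calls init_model().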
- shutdown()
- Return type:
None