pymllm.orchestrator.model_runner_process

ModelRunnerProcess – GPU-owning component that executes model forward passes.

Instantiated in-process by SchedulerProcess. The scheduler calls _forward_batch() directly; no inter-process communication is involved.

This component owns the GPU: it holds a ModelRunner with model weights, KV-cache memory pools, and the attention backend. It also owns the RadixCache for prefix-aware KV reuse.

RadixCache lifecycle

  1. match_prefix — called during _allocate_extend before KV allocation.

  2. inc_lock_ref — locks matched radix-tree nodes to prevent eviction.

  3. insert (prefill) — inserts prompt KV indices after prefill.

  4. insert (completion) — re-inserts the full sequence when a request finishes.

  5. dec_lock_ref — unlocks radix-tree nodes when a request is freed.

  6. evict — called when KV allocation fails, in order to free stale cache entries.
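
The lifecycle above can be sketched with a toy cache. This is a minimal illustration, not the pymllm implementation: ToyPrefixCache and Node are hypothetical names, a plain one-token-per-node trie stands in for the radix tree, and the method signatures are assumptions chosen only to mirror the step names in the list.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    """One cached token and the KV slot that holds it (hypothetical layout)."""

    def __init__(self, parent: Optional["Node"] = None,
                 token: Optional[int] = None,
                 kv_index: Optional[int] = None) -> None:
        self.parent = parent
        self.token = token
        self.kv_index = kv_index
        self.lock_ref = 0  # > 0 means a running request depends on this node
        self.children: Dict[int, "Node"] = {}


class ToyPrefixCache:
    """Hypothetical stand-in for RadixCache; assumed API, for illustration only."""

    def __init__(self) -> None:
        self.root = Node()

    def match_prefix(self, tokens: List[int]) -> Tuple[List[int], Node]:
        # Walk the trie as far as the tokens match; return reusable KV slots.
        node, indices = self.root, []
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                break
            indices.append(child.kv_index)
            node = child
        return indices, node

    def inc_lock_ref(self, node: Node) -> None:
        # Lock the matched node and all its ancestors against eviction.
        while node is not self.root:
            node.lock_ref += 1
            node = node.parent

    def dec_lock_ref(self, node: Node) -> None:
        while node is not self.root:
            node.lock_ref -= 1
            node = node.parent

    def insert(self, tokens: List[int], kv_indices: List[int]) -> Node:
        # Existing nodes are kept as-is; only the unseen suffix is added.
        node = self.root
        for t, idx in zip(tokens, kv_indices):
            if t not in node.children:
                node.children[t] = Node(parent=node, token=t, kv_index=idx)
            node = node.children[t]
        return node

    def evict(self, num_slots: int) -> List[int]:
        # Free unlocked leaves until enough KV slots have been reclaimed.
        freed: List[int] = []
        while len(freed) < num_slots:
            leaf = self._find_unlocked_leaf(self.root)
            if leaf is None:
                break
            del leaf.parent.children[leaf.token]
            freed.append(leaf.kv_index)
        return freed

    def _find_unlocked_leaf(self, node: Node) -> Optional[Node]:
        for child in node.children.values():
            if child.children:
                found = self._find_unlocked_leaf(child)
                if found is not None:
                    return found
            elif child.lock_ref == 0:
                return child
        return None


cache = ToyPrefixCache()
cache.insert([1, 2, 3], [10, 11, 12])         # left behind by an earlier request

prompt = [1, 2, 3, 4]
reused, last = cache.match_prefix(prompt)     # step 1: reused == [10, 11, 12]
cache.inc_lock_ref(last)                      # step 2
blocked = cache.evict(1)                      # locked nodes survive: []
cache.insert(prompt, reused + [13])           # step 3: prefill
cache.insert(prompt + [5], reused + [13, 14]) # step 4: completion
cache.dec_lock_ref(last)                      # step 5
freed = cache.evict(2)                        # step 6: [14, 13]
```

In this sketch the lock count is propagated to every ancestor, so an unlocked leaf is always safe to drop; evicting leaf-first keeps shared prefixes alive longest.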

Attributes

Classes

ModelRunnerProcess

GPU-owning component created in-process by SchedulerProcess.

Module Contents

pymllm.orchestrator.model_runner_process.logger
class pymllm.orchestrator.model_runner_process.ModelRunnerProcess(gpu_id=0, server_config=None, model_config=None)

GPU-owning component created in-process by SchedulerProcess.

Parameters:
  • gpu_id (int)

  • server_config (Optional[Any])

  • model_config (Optional[Any])

init_model()

Create and initialise the ModelRunner and RadixCache.

Must be called inside the subprocess (after spawn) because it performs CUDA initialisation.

Return type:

None
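
The constraint on init_model() suggests a two-phase pattern: a cheap constructor, then heavy initialisation that runs only once the child process exists. The sketch below illustrates that pattern under assumptions. DeferredRunner, child_main, and launch are hypothetical names, and recording the process ID stands in for the real CUDA and weight-loading work.

```python
import multiprocessing as mp
import os
from typing import Optional, Tuple


class DeferredRunner:
    """Hypothetical component: cheap constructor, heavy init_model()."""

    def __init__(self, gpu_id: int = 0) -> None:
        # Only plain configuration is stored here; no device work happens.
        self.gpu_id = gpu_id
        self.init_pid: Optional[int] = None  # set once init_model() has run

    def init_model(self) -> None:
        # A real component would create the CUDA context and load weights
        # here. The sketch only records which process did the initialisation.
        self.init_pid = os.getpid()


def child_main(gpu_id: int, result_queue) -> None:
    # Runs inside the spawned child, so device init happens in that process.
    runner = DeferredRunner(gpu_id)
    runner.init_model()
    result_queue.put(runner.init_pid)


def launch(gpu_id: int = 0) -> Tuple[int, int]:
    # Use the "spawn" start method so the child starts a fresh interpreter.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=child_main, args=(gpu_id, queue))
    proc.start()
    child_init_pid = queue.get(timeout=60)
    proc.join()
    return os.getpid(), child_init_pid
```

When trying this out, call launch() from under an `if __name__ == "__main__":` guard in a script file, since the spawn start method re-imports the main module in the child. The two returned process IDs should differ, showing that initialisation ran in the child rather than the parent.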

shutdown()
Return type:

None