pymllm.mem_cache.radix_cache ============================ .. py:module:: pymllm.mem_cache.radix_cache .. autoapi-nested-parse:: Radix-tree KV cache with SWA and multimodal support. Supports: - Multi-batch serving on a single GPU - Sliding Window Attention (SWA) via tombstone mechanism - Multimodal namespace isolation via ``extra_key`` - SHA256 position-aware hashing - Page-aligned operations (page_size >= 1) - LRU leaf eviction Attributes ---------- .. autoapisummary:: pymllm.mem_cache.radix_cache.logger Classes ------- .. autoapisummary:: pymllm.mem_cache.radix_cache.TreeNode pymllm.mem_cache.radix_cache.RadixCache Module Contents --------------- .. py:data:: logger .. py:class:: TreeNode A single node in the radix tree. ``value`` holds a 1-D ``int64`` tensor of KV-pool indices (one per token in ``key``). When the node has been evicted, ``value`` is ``None``. .. py:attribute:: __slots__ :value: ('children', 'parent', 'key', 'value', 'lock_ref', 'swa_lock_ref', 'swa_tombstone',... .. py:attribute:: children :type: Dict[Any, TreeNode] .. py:attribute:: parent :type: Optional[TreeNode] :value: None .. py:attribute:: key :type: Optional[pymllm.mem_cache.base_prefix_cache.RadixKey] :value: None .. py:attribute:: value :type: Optional[torch.Tensor] :value: None .. py:attribute:: lock_ref :type: int :value: 0 .. py:attribute:: swa_lock_ref :type: int :value: 0 .. py:attribute:: swa_tombstone :type: bool :value: False .. py:attribute:: swa_boundary_id :type: Optional[int] :value: None .. py:attribute:: last_access_time :type: float .. py:attribute:: hit_count :type: int :value: 0 .. py:attribute:: hash_values :type: Optional[List[str]] :value: None .. py:attribute:: id :type: int :value: 1 .. py:property:: evicted :type: bool .. py:method:: __lt__(other) .. py:class:: RadixCache(page_size = 1, sliding_window_size = None, token_to_kv_pool_allocator = None, on_node_evict = None) Bases: :py:obj:`pymllm.mem_cache.base_prefix_cache.BasePrefixCache` Radix tree for KV-cache prefix sharing. :param page_size: Number of tokens per KV-pool page. Keys and values are aligned to this granularity. :param sliding_window_size: If set, enables SWA mode. The cache tracks which nodes have had their SWA KV freed (tombstoned) and constrains prefix matching so that the sliding-window invariant is maintained. :param token_to_kv_pool_allocator: Optional pool allocator with ``free(indices)`` (and ``free_swa`` for SWA mode). When *None*, index tensors are simply discarded. :param on_node_evict: Optional callback invoked with the node id when a node is evicted. .. py:attribute:: page_size :value: 1 .. py:attribute:: sliding_window_size :value: None .. py:attribute:: pool :value: None .. py:attribute:: on_node_evict :value: None .. py:property:: supports_swa :type: bool .. py:method:: evictable_size() .. py:method:: swa_evictable_size() .. py:method:: protected_size() .. py:method:: swa_protected_size() .. py:method:: total_size() Total number of cached tokens (including tombstoned). .. py:method:: reset() Clear all cached state and re-initialise the root node. .. py:method:: match_prefix(key) Find the longest cached prefix of *key*. For SWA mode the match is further constrained: the path from the returned ``last_node`` to root must have at least ``sliding_window_size`` non-tombstone tokens (or be entirely tombstone-free back to root). Accessing a prefix refreshes LRU timestamps along the matched path. .. py:method:: insert(key, value = None, *, prev_prefix_len = 0, swa_evicted_seqlen = 0, **kwargs) Insert *key*/*value* into the tree. Returns how many leading tokens were already present (the prefix length). The caller is responsible for freeing duplicate KV indices in the range ``[cache_protected_len, prefix_len)``. :param prev_prefix_len: (SWA mode) tokens before this offset are already protected and should not have their values overwritten. :param swa_evicted_seqlen: (SWA mode) the sequence length up to which SWA KV has been previously evicted. Used to decide whether a tombstoned node can be un-tombstoned with the incoming value. .. py:method:: evict(num_tokens, swa_num_tokens = 0) Evict up to *num_tokens* (full) and *swa_num_tokens* (SWA) tokens. Full eviction removes leaf nodes entirely; SWA eviction tombstones internal nodes (freeing SWA KV but retaining full-attn KV). .. py:method:: inc_lock_ref(node) Lock nodes from *node* up to root (prevents eviction). Returns ``swa_boundary_id`` that must be passed back to :meth:`dec_lock_ref`. In non-SWA mode, returns ``None``. .. py:method:: dec_lock_ref(node, swa_boundary_id = None, **kwargs) Unlock nodes from *node* up to root. .. py:method:: compute_node_hash(node) Compute position-aware SHA-256 hashes for *node* (one per page). Lazily computed and cached on ``node.hash_values``. .. py:method:: pretty_print() Print the tree structure to stdout.