pymllm.mem_cache.memory_pool
============================

.. py:module:: pymllm.mem_cache.memory_pool

.. autoapi-nested-parse::

   Lightweight KV-cache memory pools.

   Three-layer architecture::

       ReqToTokenPool          maps (req_slot, position) → kv_index
       TokenToKVPoolAllocator  manages a free-list of integer indices
       KVPool                  holds the actual GPU K/V tensors

   All indices are **int32** tensors on the target device. Slot 0 in the
   KV buffers is reserved as a padding / dummy-output slot and is never
   allocated.

Attributes
----------

.. autoapisummary::

   pymllm.mem_cache.memory_pool.logger

Classes
-------

.. autoapisummary::

   pymllm.mem_cache.memory_pool.KVPool
   pymllm.mem_cache.memory_pool.TokenToKVPoolAllocator
   pymllm.mem_cache.memory_pool.ReqToTokenPool
   pymllm.mem_cache.memory_pool.GDNPool

Functions
---------

.. autoapisummary::

   pymllm.mem_cache.memory_pool.make_full_attention_net_mem_pool
   pymllm.mem_cache.memory_pool.make_req_to_token_pool

Module Contents
---------------

.. py:data:: logger

.. py:class:: KVPool(size, layer_num, k_head_num, k_head_dim, device = 'cuda', dtype = torch.float16, v_head_num = None, v_head_dim = None, pin_memory = True)

   GPU (or CPU) storage for per-layer key and value caches.

   Layout per layer::

       JIT:      k_buffer[layer][slot, k_head_num * k_head_dim]
                 v_buffer[layer][slot, v_head_num * v_head_dim]
       PyTorch:  k_buffer[layer][slot, k_head_num, k_head_dim]
                 v_buffer[layer][slot, v_head_num, v_head_dim]

   K and V may have **independent** head counts and head dimensions, which
   covers standard MHA, GQA / MQA, and architectures like MLA where the
   value projection uses a different dimensionality.

   ``size`` usable slots are numbered ``[1, size]``. Slot 0 is a dummy
   padding slot that absorbs writes from padded tokens.

   :param size: Number of usable token slots (total buffer length = ``size + 1``).
   :type size: int
   :param layer_num: Number of transformer layers (one K buffer + one V buffer per layer).
   :type layer_num: int
   :param k_head_num: Number of key heads.
   :type k_head_num: int
   :param k_head_dim: Dimension of each key head.
   :type k_head_dim: int
   :param device: Target device (``"cuda"``, ``"cpu"``, …).
   :type device: str | torch.device
   :param dtype: Storage data type.
   :type dtype: torch.dtype
   :param v_head_num: Number of value heads. Defaults to *k_head_num*.
   :type v_head_num: int, optional
   :param v_head_dim: Dimension of each value head. Defaults to *k_head_dim*.
   :type v_head_dim: int, optional
   :param pin_memory: Whether to use pinned memory. Defaults to True.
   :type pin_memory: bool, optional

   .. py:attribute:: size

   .. py:attribute:: layer_num

   .. py:attribute:: k_head_num

   .. py:attribute:: k_head_dim

   .. py:attribute:: v_head_num

   .. py:attribute:: v_head_dim

   .. py:attribute:: device

   .. py:attribute:: dtype

   .. py:attribute:: k_buffer
      :type: List[torch.Tensor]

   .. py:attribute:: v_buffer
      :type: List[torch.Tensor]

   .. py:method:: get_key_buffer(layer_id)

   .. py:method:: get_value_buffer(layer_id)

   .. py:method:: get_kv_buffer(layer_id)

   .. py:method:: set_kv_buffer(layer_id, indices, k, v)

      Write K/V vectors into the cache at the given *indices*.

      ``k`` / ``v`` can be any shape as long as the trailing dimensions
      multiply to ``head_num * head_dim`` (the row dimension). All leading
      dimensions are treated as the batch axis and must match ``indices``
      after flattening.

      Typical shapes::

          k: [num_tokens, head_num, head_dim]        indices: [num_tokens]
          k: [batch, seq_len, head_num, head_dim]    indices: [batch, seq_len]
          k: [num_tokens, head_num * head_dim]       indices: [num_tokens]

.. py:class:: TokenToKVPoolAllocator(size, device = 'cuda', page_size = 1, need_sort = True)

   Manages allocation / deallocation of integer indices into a :class:`KVPool`.

   Each ``alloc(n)`` returns *n* free indices; each ``free(indices)`` returns
   them to the pool.

   Uses a **dual-buffer** strategy (``free_slots`` + ``release_slots``) so
   that ``free()`` never concatenates onto the large main free-list.
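The dual-buffer idea can be sketched in plain Python, with lists standing in for the ``int32`` tensors. ``DualFreeList`` below is an illustrative stand-in under the semantics described here (slot 0 reserved, lazy sorted merge), not a class from this module:

```python
class DualFreeList:
    """Sketch of the dual-buffer free-list: free() appends to a small
    release buffer; the large sorted free list is only touched when
    alloc() cannot be satisfied from it (lazy merge)."""

    def __init__(self, size):
        # Slot 0 is reserved for padding, so usable slots are [1, size].
        self.free_slots = list(range(1, size + 1))
        self.release_slots = []

    def merge_and_sort_free(self):
        # Lazy merge: fold released slots back in, lowest index first
        # (the need_sort=True behaviour).
        self.free_slots += self.release_slots
        self.release_slots = []
        self.free_slots.sort()

    def alloc(self, n):
        if n > len(self.free_slots):
            self.merge_and_sort_free()   # merge only on demand
        if n > len(self.free_slots):
            return None                  # pool exhausted
        out = self.free_slots[:n]
        self.free_slots = self.free_slots[n:]
        return out

    def free(self, indices):
        # Cheap append to the small buffer; no concat onto the big list.
        self.release_slots += list(indices)
```

The real allocator applies the same logic to device tensors, where avoiding a ``torch.cat`` onto the large free-list on every ``free()`` is the point of the second buffer.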
   Freed indices accumulate in the smaller ``release_slots`` and are merged
   lazily (with an optional sort) only when ``alloc()`` cannot be satisfied
   from ``free_slots`` alone.

   A **batch-free** API (``free_group_begin`` / ``free_group_end``) further
   amortises cost when many ``free()`` calls happen in a tight loop
   (e.g. during scheduling or eviction).

   Typical usage::

       allocator = TokenToKVPoolAllocator(size=4096, device="cuda")

       # --- basic alloc / free ---
       indices = allocator.alloc(128)    # 128 free slot indices (int32)
       allocator.free(indices[:64])      # return 64 slots

       # --- batch free (amortised) ---
       allocator.free_group_begin()
       for req in finished_requests:
           allocator.free(req.kv_indices)   # O(1) list append each
       allocator.free_group_end()           # single torch.cat + release

   :param size: Total number of allocatable slots (must match ``KVPool.size``).
   :type size: int
   :param device: Device for the free-list tensor.
   :type device: str | torch.device
   :param page_size: When > 1 the allocator works in page-aligned mode:
       ``alloc`` returns multiples of ``page_size`` contiguous within each
       page, and ``free`` deduplicates by page.
   :type page_size: int
   :param need_sort: When ``True`` (default), ``merge_and_sort_free`` sorts
       after merging so that lower-index slots are allocated first (better
       memory locality).
   :type need_sort: bool

   .. py:attribute:: size

   .. py:attribute:: page_size
      :value: 1

   .. py:attribute:: device

   .. py:attribute:: need_sort
      :value: True

   .. py:method:: clear()

      Reset the allocator so that all slots ``[1, size]`` are free.
      The first slot is reserved for padding.

   .. py:method:: available_size()

      Number of tokens that can still be allocated.

   .. py:method:: merge_and_sort_free()

      Merge ``release_slots`` into ``free_slots`` (and sort if ``need_sort``).

   .. py:method:: free_group_begin()

      Start collecting ``free()`` calls; actual release is deferred to
      ``free_group_end``.

   .. py:method:: free_group_end()

      Flush all ``free()`` calls collected since ``free_group_begin``.

   .. py:method:: alloc(need_size)

      Allocate *need_size* token indices.

      Returns a 1-D ``int32`` tensor on success, or ``None`` if the pool
      is exhausted.

   .. py:method:: free(indices)

      Return *indices* to the free pool.

.. py:class:: ReqToTokenPool(max_reqs, max_context_len, device = 'cuda')

   Maps each live request to its per-position KV-pool indices.

   Internally a 2-D tensor ``req_to_token[slot, position]`` stores the
   KV-pool index for every token position of every active request. Slots
   are recycled via a simple free-list.

   This class is a **pure mapping table** -- it does **not** track
   per-request sequence lengths. The caller (typically the ``Req`` /
   IO-struct object) must store ``req_pool_idx`` and ``seq_len`` and use
   them to slice into ``req_to_token`` when reading back KV indices.

   Typical usage::

       pool = ReqToTokenPool(max_reqs=256, max_context_len=4096)

       # --- on new request arrival ---
       [slot] = pool.alloc(1)                    # slot = req_pool_idx
       kv_indices = kv_allocator.alloc(seq_len)  # from TokenToKVPoolAllocator
       pool.write((slot, slice(0, seq_len)), kv_indices)

       # --- read back (caller tracks seq_len) ---
       kv_indices = pool.req_to_token[slot, :seq_len]

       # --- on request completion ---
       kv_allocator.free(pool.req_to_token[slot, :seq_len])
       pool.free(slot)

   :param max_reqs: Maximum number of concurrent requests (number of rows).
   :type max_reqs: int
   :param max_context_len: Maximum sequence length any single request can
       reach (number of cols).
   :type max_context_len: int
   :param device: Target device for the mapping tensor.
   :type device: str | torch.device

   .. py:attribute:: size

   .. py:attribute:: max_context_len

   .. py:attribute:: device

   .. py:attribute:: req_to_token

   .. py:method:: available_size()

   .. py:method:: alloc(n = 1)

      Allocate *n* request slots. Returns a list of slot indices.

   .. py:method:: free(slot)

      Return a single request slot to the pool.

   .. py:method:: write(index, values)

      Write KV indices into the mapping table. ``index`` is typically
      ``(req_pool_idx, slice(start, end))``.

   .. py:method:: clear()

.. py:function:: make_full_attention_net_mem_pool(size, layer_num, k_head_num, k_head_dim, v_head_num, v_head_dim, device = 'cuda', dtype = torch.float16, page_size = 1, need_sort = True, pin_memory = True)

   Create a :class:`KVPool` and its :class:`TokenToKVPoolAllocator` for a
   full-attention (non-SWA) model.

   :param size: Number of usable token slots in the KV cache.
   :type size: int
   :param layer_num: Number of transformer layers.
   :type layer_num: int
   :param k_head_num / k_head_dim: Key head count and dimension.
   :type k_head_num / k_head_dim: int
   :param v_head_num / v_head_dim: Value head count and dimension.
   :type v_head_num / v_head_dim: int
   :param device: Target device.
   :type device: str | torch.device
   :param dtype: Storage data type for the KV buffers.
   :type dtype: torch.dtype
   :param page_size: Allocator page size (1 = per-token, >1 = page-aligned).
   :type page_size: int
   :param need_sort: Whether the allocator sorts on merge for memory locality.
   :type need_sort: bool
   :param pin_memory: Whether to use pinned memory for the KV buffers.
   :type pin_memory: bool
   :rtype: (KVPool, TokenToKVPoolAllocator)

.. py:class:: GDNPool(max_reqs, num_gdn_layers, num_v_heads, head_k_dim, head_v_dim, conv_dim, conv_kernel_size, device = 'cuda', dtype = torch.bfloat16, max_track_slots = 0)

   Pre-allocated memory pool for GDN recurrent and conv states.

   Indexed by ``req_pool_idx`` (same index space as :class:`ReqToTokenPool`).
   Slot 0 is reserved as a padding / dummy slot and is never allocated.

   Layout::

       recurrent_state[gdn_layer_idx, slot, num_v_heads, head_k_dim, head_v_dim]
           float32 (FlashInfer requirement)
       conv_state[gdn_layer_idx, slot, conv_dim, kernel_size - 1]
           model dtype (bfloat16 / float16)

   :param max_reqs: Maximum number of concurrent requests (matches
       ``ReqToTokenPool.size``).
   :type max_reqs: int
   :param num_gdn_layers: Number of GDN (linear attention) layers in the model.
   :type num_gdn_layers: int
   :param num_v_heads: Number of value heads per GDN layer.
   :type num_v_heads: int
   :param head_k_dim: Per-head key dimension.
   :type head_k_dim: int
   :param head_v_dim: Per-head value dimension.
   :type head_v_dim: int
   :param conv_dim: Total convolution input dimension (``key_dim * 2 + value_dim``).
   :type conv_dim: int
   :param conv_kernel_size: Causal conv1d kernel width (state stores
       ``kernel_size - 1`` columns).
   :type conv_kernel_size: int
   :param device: Target device.
   :type device: str | torch.device
   :param dtype: Storage dtype for ``conv_state`` (``recurrent_state`` is
       always float32).
   :type dtype: torch.dtype

   .. py:attribute:: max_reqs

   .. py:attribute:: num_gdn_layers

   .. py:attribute:: num_v_heads

   .. py:attribute:: head_k_dim

   .. py:attribute:: head_v_dim

   .. py:attribute:: conv_dim

   .. py:attribute:: conv_kernel_size

   .. py:attribute:: device

   .. py:attribute:: dtype

   .. py:attribute:: max_track_slots
      :value: 0

   .. py:attribute:: recurrent_state

   .. py:attribute:: conv_state

   .. py:method:: get_layer_state(gdn_layer_idx)

      Return ``(recurrent_state, conv_state)`` for a specific GDN layer.

      Both are views into the pool tensors with shape:

      - recurrent: ``[pool_size, num_v_heads, head_v_dim, head_k_dim]``
      - conv: ``[pool_size, conv_dim, kernel_size - 1]``

   .. py:method:: reset_states(req_pool_indices)

      Zero-init GDN states for the given request pool indices.
      Called when new requests are allocated to ensure a clean state.

   .. py:method:: alloc_track_slot()

      Allocate a single track slot index. Returns ``None`` if exhausted.

   .. py:method:: free_track_slot(slot)

      Return a track slot to the free list.

   .. py:method:: copy_states(src_index, dst_index)

      Copy recurrent and conv states from *src_index* to *dst_index*.
      Works for any pool indices (working or track slots).

   .. py:method:: mem_bytes()

      Total memory consumption in bytes.

.. py:function:: make_req_to_token_pool(max_reqs, max_context_len, device = 'cuda')

   Create a :class:`ReqToTokenPool` with the given capacity and device.
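Putting the layers together: the lookup path described at the top of the module (request slot → position → KV index → buffer row) can be illustrated without torch. The helpers below (``write_tokens``, ``read_key``) are hypothetical stand-ins for illustration only, not functions from this module:

```python
# Plain-Python stand-in for the three-layer lookup:
#   req_to_token[(slot, pos)] -> kv_index;  k_buffer[layer][kv_index] -> row.
req_to_token = {}                            # ReqToTokenPool analogue
k_buffer = [[None] * 9 for _ in range(2)]    # 2 layers, 8 usable slots + pad slot 0

def write_tokens(slot, kv_indices, keys, layer=0):
    """Record the position -> kv_index mapping and store one key row per token."""
    for pos, (idx, key) in enumerate(zip(kv_indices, keys)):
        req_to_token[(slot, pos)] = idx      # ReqToTokenPool.write(...)
        k_buffer[layer][idx] = key           # KVPool.set_kv_buffer(...)

def read_key(slot, pos, layer=0):
    """Resolve position -> kv_index -> stored key row."""
    return k_buffer[layer][req_to_token[(slot, pos)]]

# A request in slot 0 holding three tokens at KV indices 1..3 (index 0 is padding).
write_tokens(slot=0, kv_indices=[1, 2, 3], keys=["k0", "k1", "k2"])
```

In the real module the dict is a 2-D int32 tensor, the KV indices come from ``TokenToKVPoolAllocator.alloc``, and the caller (not the pool) is responsible for remembering ``slot`` and the sequence length.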