pymllm.mem_cache.memory_pool¶
Lightweight KV-cache memory pools
Three-layer architecture:
- ReqToTokenPool maps (req_slot, position) → kv_index
- TokenToKVPoolAllocator manages a free-list of integer indices
- KVPool holds the actual GPU K/V tensors
All indices are int32 tensors on the target device. Slot 0 in the KV buffers is reserved as a padding / dummy-output slot and is never allocated.
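The three-layer chain above can be sketched with plain Python lists standing in for the int32 device tensors (all sizes here are made-up illustrative values):

```python
# Illustrative sketch of the three-layer index chain.
max_reqs, max_context_len, pool_size = 4, 8, 16

# ReqToTokenPool: one row per request slot, one column per token position.
req_to_token = [[0] * max_context_len for _ in range(max_reqs)]

# TokenToKVPoolAllocator: free-list over [1, pool_size]; slot 0 is padding.
free_slots = list(range(1, pool_size + 1))

# A request in slot 2 receives KV indices for its first three tokens.
req_slot = 2
kv_indices = [free_slots.pop(0) for _ in range(3)]
for position, kv_index in enumerate(kv_indices):
    req_to_token[req_slot][position] = kv_index
# Attention kernels later read row kv_index of each layer's K/V buffers.
```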
Attributes¶
Classes¶
KVPool – GPU (or CPU) storage for per-layer key and value caches.
TokenToKVPoolAllocator – Manages allocation / deallocation of integer indices into a KVPool.
ReqToTokenPool – Maps each live request to its per-position KV-pool indices.
GDNPool – Pre-allocated memory pool for GDN recurrent and conv states.
Functions¶
make_full_attention_net_mem_pool – Create a KVPool and its TokenToKVPoolAllocator for a full-attention (non-SWA) model.
make_req_to_token_pool
Module Contents¶
- pymllm.mem_cache.memory_pool.logger¶
- class pymllm.mem_cache.memory_pool.KVPool(size, layer_num, k_head_num, k_head_dim, device='cuda', dtype=torch.float16, v_head_num=None, v_head_dim=None, pin_memory=True)¶
GPU (or CPU) storage for per-layer key and value caches.
Layout per layer:
- JIT:
k_buffer[layer][slot, k_head_num * k_head_dim]
v_buffer[layer][slot, v_head_num * v_head_dim]
- PyTorch:
k_buffer[layer][slot, k_head_num, k_head_dim]
v_buffer[layer][slot, v_head_num, v_head_dim]
K and V may have independent head counts and head dimensions, which covers standard MHA, GQA / MQA, and architectures like MLA where value projection uses a different dimensionality.
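A small sketch of the "PyTorch" layout with asymmetric K/V dimensions, using numpy as a stand-in for torch (all sizes are made-up illustrative values, not taken from any real model):

```python
import numpy as np

# MLA-style asymmetry: key rows are wider than value rows.
size, layer_num = 8, 2
k_head_num, k_head_dim = 16, 192   # e.g. key carries extra rope dims
v_head_num, v_head_dim = 16, 128   # value projection uses a smaller dim

# size + 1 rows per buffer: slot 0 is the reserved padding slot.
k_buffer = [np.zeros((size + 1, k_head_num, k_head_dim), dtype=np.float16)
            for _ in range(layer_num)]
v_buffer = [np.zeros((size + 1, v_head_num, v_head_dim), dtype=np.float16)
            for _ in range(layer_num)]
```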
Usable slots are numbered [1, size]; slot 0 is a dummy padding slot that absorbs writes from padded tokens.
- Parameters:
size (int) – Number of usable token slots (total buffer length = size + 1).
layer_num (int) – Number of transformer layers (one K buffer + one V buffer per layer).
k_head_num (int) – Number of key heads.
k_head_dim (int) – Dimension of each key head.
device (str | torch.device) – Target device ("cuda", "cpu", …).
dtype (torch.dtype) – Storage data type.
v_head_num (int, optional) – Number of value heads. Defaults to k_head_num.
v_head_dim (int, optional) – Dimension of each value head. Defaults to k_head_dim.
pin_memory (bool, optional) – Whether to use pinned memory. Defaults to True.
- size¶
- layer_num¶
- k_head_num¶
- k_head_dim¶
- v_head_num¶
- v_head_dim¶
- device¶
- dtype¶
- k_buffer: List[torch.Tensor]¶
- v_buffer: List[torch.Tensor]¶
- get_key_buffer(layer_id)¶
- Parameters:
layer_id (int)
- Return type:
torch.Tensor
- get_value_buffer(layer_id)¶
- Parameters:
layer_id (int)
- Return type:
torch.Tensor
- get_kv_buffer(layer_id)¶
- Parameters:
layer_id (int)
- Return type:
Tuple[torch.Tensor, torch.Tensor]
- set_kv_buffer(layer_id, indices, k, v)¶
Write K/V vectors into the cache at the given indices.
k / v can be any shape as long as the trailing dimensions multiply to head_num * head_dim (the row dimension). All leading dimensions are treated as the batch axis and must match indices after flattening. Typical shapes:
k: [num_tokens, head_num, head_dim]        indices: [num_tokens]
k: [batch, seq_len, head_num, head_dim]    indices: [batch, seq_len]
k: [num_tokens, head_num * head_dim]       indices: [num_tokens]
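The flattening rule can be sketched as follows, with numpy standing in for torch (scatter_rows is an illustrative helper, not part of the actual API):

```python
import numpy as np

# Leading dims of k collapse into a batch axis that must match the
# flattened indices; trailing dims collapse to head_num * head_dim.
def scatter_rows(buffer, indices, k):
    row_dim = buffer.shape[1]                    # head_num * head_dim
    rows = k.reshape(-1, row_dim)                # merge leading dims
    flat_idx = np.asarray(indices).reshape(-1)   # same length as rows
    buffer[flat_idx] = rows

buffer = np.zeros((10, 8))                       # 8 = head_num * head_dim
k = np.ones((2, 3, 2, 4))                        # [batch, seq, head, dim]
indices = np.arange(1, 7).reshape(2, 3)          # [batch, seq]
scatter_rows(buffer, indices, k)                 # writes rows 1..6
```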
- Parameters:
layer_id (int)
indices (torch.Tensor)
k (torch.Tensor)
v (torch.Tensor)
- Return type:
None
- class pymllm.mem_cache.memory_pool.TokenToKVPoolAllocator(size, device='cuda', page_size=1, need_sort=True)¶
Manages allocation / deallocation of integer indices into a KVPool.
Each alloc(n) returns n free indices; each free(indices) returns them to the pool.
Uses a dual-buffer strategy (free_slots + release_slots) so that free() never concatenates onto the large main free-list. Freed indices accumulate in the smaller release_slots and are merged lazily (with an optional sort) only when alloc() cannot be satisfied from free_slots alone.
A batch-free API (free_group_begin / free_group_end) further amortises cost when many free() calls happen in a tight loop (e.g. during scheduling or eviction).
Typical usage:
allocator = TokenToKVPoolAllocator(size=4096, device="cuda")

# --- basic alloc / free ---
indices = allocator.alloc(128)   # 128 free slot indices (int32)
allocator.free(indices[:64])     # return 64 slots

# --- batch free (amortised) ---
allocator.free_group_begin()
for req in finished_requests:
    allocator.free(req.kv_indices)   # O(1) list append each
allocator.free_group_end()           # single torch.cat + release
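The dual free-list strategy can be sketched in pure Python (illustrative only; the real allocator works on int32 torch tensors, and DualFreeList is a hypothetical name):

```python
class DualFreeList:
    def __init__(self, size, need_sort=True):
        self.need_sort = need_sort
        self.free_slots = list(range(1, size + 1))  # slot 0 reserved
        self.release_slots = []                     # freed, not yet merged

    def alloc(self, n):
        if n > len(self.free_slots):
            self._merge_and_sort()                  # lazy merge on demand
        if n > len(self.free_slots):
            return None                             # pool exhausted
        out, self.free_slots = self.free_slots[:n], self.free_slots[n:]
        return out

    def free(self, indices):
        self.release_slots.extend(indices)          # cheap append, no concat

    def _merge_and_sort(self):
        self.free_slots += self.release_slots
        self.release_slots = []
        if self.need_sort:
            self.free_slots.sort()                  # lower indices first

allocator = DualFreeList(4)
first = allocator.alloc(3)      # -> [1, 2, 3]
allocator.free(first[:2])       # deferred into release_slots
merged = allocator.alloc(3)     # triggers merge + sort -> [1, 2, 4]
```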
- Parameters:
size (int) – Total number of allocatable slots (must match KVPool.size).
device (str | torch.device) – Device for the free-list tensor.
page_size (int) – When > 1 the allocator works in page-aligned mode: alloc returns multiples of page_size contiguous within each page, and free deduplicates by page.
need_sort (bool) – When True (default), merge_and_sort_free sorts after merging so that lower-index slots are allocated first (better memory locality).
- size¶
- page_size = 1¶
- device¶
- need_sort = True¶
- clear()¶
Reset the allocator so that all slots [1, size] are free. The first slot is reserved for padding.
- Return type:
None
- available_size()¶
Number of tokens that can still be allocated.
- Return type:
int
- merge_and_sort_free()¶
Merge release_slots into free_slots (and sort if need_sort).
- Return type:
None
- free_group_begin()¶
Start collecting free() calls; actual release is deferred to free_group_end.
- Return type:
None
- free_group_end()¶
Flush all free() calls collected since free_group_begin.
- Return type:
None
- alloc(need_size)¶
Allocate need_size token indices.
Returns a 1-D int32 tensor on success, or None if the pool is exhausted.
- Parameters:
need_size (int)
- Return type:
Optional[torch.Tensor]
- free(indices)¶
Return indices to the free pool.
- Parameters:
indices (torch.Tensor)
- Return type:
None
- class pymllm.mem_cache.memory_pool.ReqToTokenPool(max_reqs, max_context_len, device='cuda')¶
Maps each live request to its per-position KV-pool indices.
Internally a 2-D tensor req_to_token[slot, position] stores the KV-pool index for every token position of every active request. Slots are recycled via a simple free-list.
This class is a pure mapping table – it does not track per-request sequence lengths. The caller (typically the Req / IO-struct object) must store req_pool_idx and seq_len and use them to slice into req_to_token when reading back KV indices.
Typical usage:
pool = ReqToTokenPool(max_reqs=256, max_context_len=4096)

# --- on new request arrival ---
[slot] = pool.alloc(1)                     # slot = req_pool_idx
kv_indices = kv_allocator.alloc(seq_len)   # from TokenToKVPoolAllocator
pool.write((slot, slice(0, seq_len)), kv_indices)

# --- read back (caller tracks seq_len) ---
kv_indices = pool.req_to_token[slot, :seq_len]

# --- on request completion ---
kv_allocator.free(pool.req_to_token[slot, :seq_len])
pool.free(slot)
- Parameters:
max_reqs (int) – Maximum number of concurrent requests (number of rows).
max_context_len (int) – Maximum sequence length any single request can reach (number of cols).
device (str | torch.device) – Target device for the mapping tensor.
- size¶
- max_context_len¶
- device¶
- req_to_token¶
- available_size()¶
- Return type:
int
- alloc(n=1)¶
Allocate n request slots. Returns a list of n slot indices, or None if fewer than n slots are free.
- Parameters:
n (int)
- Return type:
Optional[List[int]]
- free(slot)¶
Return a single request slot to the pool.
- Parameters:
slot (int)
- Return type:
None
- write(index, values)¶
Write KV indices into the mapping table.
index is typically (req_pool_idx, slice(start, end)).
- Parameters:
index (Tuple)
values (torch.Tensor)
- Return type:
None
- clear()¶
- Return type:
None
- pymllm.mem_cache.memory_pool.make_full_attention_net_mem_pool(size, layer_num, k_head_num, k_head_dim, v_head_num, v_head_dim, device='cuda', dtype=torch.float16, page_size=1, need_sort=True, pin_memory=True)¶
Create a KVPool and its TokenToKVPoolAllocator for a full-attention (non-SWA) model.
- Parameters:
size (int) – Number of usable token slots in the KV cache.
layer_num (int) – Number of transformer layers.
k_head_num (int) – Number of key heads.
k_head_dim (int) – Dimension of each key head.
v_head_num (int) – Number of value heads.
v_head_dim (int) – Dimension of each value head.
device (str | torch.device) – Target device.
dtype (torch.dtype) – Storage data type for the KV buffers.
page_size (int) – Allocator page size (1 = per-token, >1 = page-aligned).
need_sort (bool) – Whether the allocator sorts on merge for memory locality.
pin_memory (bool) – Whether to use pinned memory for the KV buffers.
- Return type:
- class pymllm.mem_cache.memory_pool.GDNPool(max_reqs, num_gdn_layers, num_v_heads, head_k_dim, head_v_dim, conv_dim, conv_kernel_size, device='cuda', dtype=torch.bfloat16, max_track_slots=0)¶
Pre-allocated memory pool for GDN recurrent and conv states.
Indexed by req_pool_idx (same index space as ReqToTokenPool). Slot 0 is reserved as a padding / dummy slot and is never allocated.
Layout:
recurrent_state[gdn_layer_idx, slot, num_v_heads, head_k_dim, head_v_dim]   float32 (FlashInfer requirement)
conv_state[gdn_layer_idx, slot, conv_dim, kernel_size - 1]                  model dtype (bfloat16 / float16)
- Parameters:
max_reqs (int) – Maximum number of concurrent requests (matches ReqToTokenPool.size).
num_gdn_layers (int) – Number of GDN (linear attention) layers in the model.
num_v_heads (int) – Number of value heads per GDN layer.
head_k_dim (int) – Per-head key dimension.
head_v_dim (int) – Per-head value dimension.
conv_dim (int) – Total convolution input dimension (key_dim * 2 + value_dim).
conv_kernel_size (int) – Causal conv1d kernel width (state stores kernel_size - 1 columns).
device (str | torch.device) – Target device.
dtype (torch.dtype) – Storage dtype for conv_state (recurrent_state is always float32).
max_track_slots (int)
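A back-of-envelope sizing of the two state tensors, using the layout and dtypes above (all numbers are hypothetical; pool_size assumes slot 0 adds one extra padding row, and key_dim / value_dim are assumed to be the per-layer totals num_v_heads * head_dim):

```python
# Hypothetical model configuration, not taken from any real checkpoint.
num_gdn_layers, max_reqs = 12, 256
num_v_heads, head_k_dim, head_v_dim = 16, 128, 128
conv_kernel_size = 4
pool_size = max_reqs + 1                  # assumed extra row for slot 0

key_dim = num_v_heads * head_k_dim        # assumed per-layer key width
value_dim = num_v_heads * head_v_dim
conv_dim = key_dim * 2 + value_dim        # formula from the parameter docs

recurrent_bytes = (num_gdn_layers * pool_size * num_v_heads
                   * head_k_dim * head_v_dim * 4)          # float32
conv_bytes = (num_gdn_layers * pool_size * conv_dim
              * (conv_kernel_size - 1) * 2)                # bfloat16
```

The recurrent state dominates: it scales with head_k_dim * head_v_dim per head, while the conv state only keeps kernel_size - 1 columns per channel.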
- max_reqs¶
- num_gdn_layers¶
- num_v_heads¶
- head_k_dim¶
- head_v_dim¶
- conv_dim¶
- conv_kernel_size¶
- device¶
- dtype¶
- max_track_slots = 0¶
- recurrent_state¶
- conv_state¶
- get_layer_state(gdn_layer_idx)¶
Return (recurrent_state, conv_state) for a specific GDN layer.
Both are views into the pool tensors with shape:
- recurrent: [pool_size, num_v_heads, head_v_dim, head_k_dim]
- conv: [pool_size, conv_dim, kernel_size - 1]
- Parameters:
gdn_layer_idx (int)
- Return type:
Tuple[torch.Tensor, torch.Tensor]
- reset_states(req_pool_indices)¶
Zero-init GDN states for the given request pool indices.
Called when new requests are allocated to ensure clean state.
- Parameters:
req_pool_indices (torch.Tensor)
- Return type:
None
- alloc_track_slot()¶
Allocate a single track slot index. Returns None if exhausted.
- Return type:
Optional[int]
- free_track_slot(slot)¶
Return a track slot to the free list.
- Parameters:
slot (int)
- Return type:
None
- copy_states(src_index, dst_index)¶
Copy recurrent and conv states from src_index to dst_index.
Works for any pool indices (working or track slots).
- Parameters:
src_index (int)
dst_index (int)
- Return type:
None
- mem_bytes()¶
Total memory consumption in bytes.
- Return type:
int
- pymllm.mem_cache.memory_pool.make_req_to_token_pool(max_reqs, max_context_len, device='cuda')¶
- Parameters:
max_reqs (int)
max_context_len (int)
device (Union[str, torch.device])
- Return type: