pymllm.mem_cache.memory_pool¶
Lightweight KV-cache memory pools
Three-layer architecture:
- ReqToTokenPool maps (req_slot, position) → kv_index
- TokenToKVPoolAllocator manages a free-list of integer indices
- KVPool holds the actual GPU K/V tensors
All indices are int32 tensors on the target device. Slot 0 in the KV buffers is reserved as a padding / dummy-output slot and is never allocated.
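The three-layer chain above can be sketched with plain Python lists standing in for the int32 device tensors (all sizes here are made-up illustrative values):

```python
# Illustrative sketch of the three-layer index chain.
max_reqs, max_context_len, pool_size = 4, 8, 16

# ReqToTokenPool: one row per request slot, one column per token position.
req_to_token = [[0] * max_context_len for _ in range(max_reqs)]

# TokenToKVPoolAllocator: free-list over [1, pool_size]; slot 0 is padding.
free_slots = list(range(1, pool_size + 1))

# A request in slot 2 receives KV indices for its first three tokens.
req_slot = 2
kv_indices = [free_slots.pop(0) for _ in range(3)]
for position, kv_index in enumerate(kv_indices):
    req_to_token[req_slot][position] = kv_index
# Attention kernels later read row kv_index of each layer's K/V buffers.
```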
Attributes¶
Classes¶
KVPool – GPU (or CPU) storage for per-layer key and value caches.
TokenToKVPoolAllocator – Manages allocation / deallocation of integer indices into a KVPool.
ReqToTokenPool – Maps each live request to its per-position KV-pool indices.
GDNPool – Pre-allocated memory pool for GDN recurrent and conv states.
Functions¶
make_full_attention_net_mem_pool – Create a KVPool and its TokenToKVPoolAllocator for a full-attention (non-SWA) model.
make_req_to_token_pool
Module Contents¶
- pymllm.mem_cache.memory_pool.logger¶
- class pymllm.mem_cache.memory_pool.KVPool(size, layer_num, k_head_num, k_head_dim, device='cuda', dtype=torch.float16, v_head_num=None, v_head_dim=None, pin_memory=True)¶
GPU (or CPU) storage for per-layer key and value caches.
Layout per layer:
- JIT:
k_buffer[layer][slot, k_head_num * k_head_dim]
v_buffer[layer][slot, v_head_num * v_head_dim]
- PyTorch:
k_buffer[layer][slot, k_head_num, k_head_dim]
v_buffer[layer][slot, v_head_num, v_head_dim]
K and V may have independent head counts and head dimensions, which covers standard MHA, GQA / MQA, and architectures like MLA where value projection uses a different dimensionality.
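A small sketch of the "PyTorch" layout with asymmetric K/V dimensions, using numpy as a stand-in for torch (all sizes are made-up illustrative values, not taken from any real model):

```python
import numpy as np

# MLA-style asymmetry: key rows are wider than value rows.
size, layer_num = 8, 2
k_head_num, k_head_dim = 16, 192   # e.g. key carries extra rope dims
v_head_num, v_head_dim = 16, 128   # value projection uses a smaller dim

# size + 1 rows per buffer: slot 0 is the reserved padding slot.
k_buffer = [np.zeros((size + 1, k_head_num, k_head_dim), dtype=np.float16)
            for _ in range(layer_num)]
v_buffer = [np.zeros((size + 1, v_head_num, v_head_dim), dtype=np.float16)
            for _ in range(layer_num)]
```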
Usable slots are numbered [1, size]; slot 0 is a dummy padding slot that absorbs writes from padded tokens.
- Parameters:
size (int) – Number of usable token slots (total buffer length = size + 1).
layer_num (int) – Number of transformer layers (one K buffer + one V buffer per layer).
k_head_num (int) – Number of key heads.
k_head_dim (int) – Dimension of each key head.
device (str | torch.device) – Target device ("cuda", "cpu", …).
dtype (torch.dtype) – Storage data type.
v_head_num (int, optional) – Number of value heads. Defaults to k_head_num.
v_head_dim (int, optional) – Dimension of each value head. Defaults to k_head_dim.
pin_memory (bool, optional) – Whether to use pinned memory. Defaults to True.
- size¶
- layer_num¶
- k_head_num¶
- k_head_dim¶
- v_head_num¶
- v_head_dim¶
- device¶
- dtype¶
- k_buffer: List[torch.Tensor]¶
- v_buffer: List[torch.Tensor]¶
- get_key_buffer(layer_id)¶
- Parameters:
layer_id (int)
- Return type:
torch.Tensor
- get_value_buffer(layer_id)¶
- Parameters:
layer_id (int)
- Return type:
torch.Tensor
- get_kv_buffer(layer_id)¶
- Parameters:
layer_id (int)
- Return type:
Tuple[torch.Tensor, torch.Tensor]
- set_kv_buffer(layer_id, indices, k, v)¶
Write K/V vectors into the cache at the given indices.
k / v can be any shape as long as the trailing dimensions multiply to head_num * head_dim (the row dimension). All leading dimensions are treated as the batch axis and must match indices after flattening. Typical shapes:
k: [num_tokens, head_num, head_dim]        indices: [num_tokens]
k: [batch, seq_len, head_num, head_dim]    indices: [batch, seq_len]
k: [num_tokens, head_num * head_dim]       indices: [num_tokens]
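The flattening rule can be sketched as follows, with numpy standing in for torch (scatter_rows is an illustrative helper, not part of the actual API):

```python
import numpy as np

# Leading dims of k collapse into a batch axis that must match the
# flattened indices; trailing dims collapse to head_num * head_dim.
def scatter_rows(buffer, indices, k):
    row_dim = buffer.shape[1]                    # head_num * head_dim
    rows = k.reshape(-1, row_dim)                # merge leading dims
    flat_idx = np.asarray(indices).reshape(-1)   # same length as rows
    buffer[flat_idx] = rows

buffer = np.zeros((10, 8))                       # 8 = head_num * head_dim
k = np.ones((2, 3, 2, 4))                        # [batch, seq, head, dim]
indices = np.arange(1, 7).reshape(2, 3)          # [batch, seq]
scatter_rows(buffer, indices, k)                 # writes rows 1..6
```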
- Parameters:
layer_id (int)
indices (torch.Tensor)
k (torch.Tensor)
v (torch.Tensor)
- Return type:
None
- class pymllm.mem_cache.memory_pool.TokenToKVPoolAllocator(size, device='cuda', page_size=1, need_sort=True)¶
Manages allocation / deallocation of integer indices into a KVPool.
Each alloc(n) returns n free indices; each free(indices) returns them to the pool.
Uses a dual-buffer strategy (free_slots + release_slots) so that free() never concatenates onto the large main free-list. Freed indices accumulate in the smaller release_slots and are merged lazily (with an optional sort) only when alloc() cannot be satisfied from free_slots alone.
A batch-free API (free_group_begin / free_group_end) further amortises cost when many free() calls happen in a tight loop (e.g. during scheduling or eviction).
Typical usage:
allocator = TokenToKVPoolAllocator(size=4096, device="cuda")

# --- basic alloc / free ---
indices = allocator.alloc(128)   # 128 free slot indices (int32)
allocator.free(indices[:64])     # return 64 slots

# --- batch free (amortised) ---
allocator.free_group_begin()
for req in finished_requests:
    allocator.free(req.kv_indices)   # O(1) list append each
allocator.free_group_end()           # single torch.cat + release
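The dual free-list strategy can be sketched in pure Python (illustrative only; the real allocator works on int32 torch tensors, and DualFreeList is a hypothetical name):

```python
class DualFreeList:
    def __init__(self, size, need_sort=True):
        self.need_sort = need_sort
        self.free_slots = list(range(1, size + 1))  # slot 0 reserved
        self.release_slots = []                     # freed, not yet merged

    def alloc(self, n):
        if n > len(self.free_slots):
            self._merge_and_sort()                  # lazy merge on demand
        if n > len(self.free_slots):
            return None                             # pool exhausted
        out, self.free_slots = self.free_slots[:n], self.free_slots[n:]
        return out

    def free(self, indices):
        self.release_slots.extend(indices)          # cheap append, no concat

    def _merge_and_sort(self):
        self.free_slots += self.release_slots
        self.release_slots = []
        if self.need_sort:
            self.free_slots.sort()                  # lower indices first

allocator = DualFreeList(4)
first = allocator.alloc(3)      # -> [1, 2, 3]
allocator.free(first[:2])       # deferred into release_slots
merged = allocator.alloc(3)     # triggers merge + sort -> [1, 2, 4]
```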
- Parameters:
size (int) – Total number of allocatable slots (must match KVPool.size).
device (str | torch.device) – Device for the free-list tensor.
page_size (int) – When > 1 the allocator works in page-aligned mode: alloc returns multiples of page_size contiguous within each page, and free deduplicates by page.
need_sort (bool) – When True (default), merge_and_sort_free sorts after merging so that lower-index slots are allocated first (better memory locality).
- size¶
- page_size = 1¶
- device¶
- need_sort = True¶
- clear()¶
Reset the allocator so that all slots [1, size] are free. The first slot is reserved for padding.
- Return type:
None
- available_size()¶
Number of tokens that can still be allocated.
- Return type:
int
- merge_and_sort_free()¶
Merge release_slots into free_slots (and sort if need_sort).
- Return type:
None
- free_group_begin()¶
Start collecting free() calls; actual release is deferred to free_group_end.
- Return type:
None
- free_group_end()¶
Flush all free() calls collected since free_group_begin.
- Return type:
None
- alloc(need_size)¶
Allocate need_size token indices.
Returns a 1-D int32 tensor on success, or None if the pool is exhausted.
- Parameters:
need_size (int)
- Return type:
Optional[torch.Tensor]
- free(indices)¶
Return indices to the free pool.
- Parameters:
indices (torch.Tensor)
- Return type:
None
- class pymllm.mem_cache.memory_pool.ReqToTokenPool(max_reqs, max_context_len, device='cuda')¶
Maps each live request to its per-position KV-pool indices.
Internally a 2-D tensor req_to_token[slot, position] stores the KV-pool index for every token position of every active request. Slots are recycled via a simple free-list.
This class is a pure mapping table – it does not track per-request sequence lengths. The caller (typically the Req / IO-struct object) must store req_pool_idx and seq_len and use them to slice into req_to_token when reading back KV indices.
Typical usage:
pool = ReqToTokenPool(max_reqs=256, max_context_len=4096)

# --- on new request arrival ---
[slot] = pool.alloc(1)                     # slot = req_pool_idx
kv_indices = kv_allocator.alloc(seq_len)   # from TokenToKVPoolAllocator
pool.write((slot, slice(0, seq_len)), kv_indices)

# --- read back (caller tracks seq_len) ---
kv_indices = pool.req_to_token[slot, :seq_len]

# --- on request completion ---
kv_allocator.free(pool.req_to_token[slot, :seq_len])
pool.free(slot)
- Parameters:
max_reqs (int) – Maximum number of concurrent requests (number of rows).
max_context_len (int) – Maximum sequence length any single request can reach (number of cols).
device (str | torch.device) – Target device for the mapping tensor.
- size¶
- max_context_len¶
- device¶
- req_to_token¶
- available_size()¶
- Return type:
int
- alloc(n=1)¶
Allocate n request slots. Returns a list of n slot indices, or None if fewer than n slots are free.
- Parameters:
n (int)
- Return type:
Optional[List[int]]
- free(slot)¶
Return a single request slot to the pool.
- Parameters:
slot (int)
- Return type:
None
- write(index, values)¶
Write KV indices into the mapping table.
index is typically (req_pool_idx, slice(start, end)).
- Parameters:
index (Tuple)
values (torch.Tensor)
- Return type:
None
- clear()¶
- Return type:
None
- pymllm.mem_cache.memory_pool.make_full_attention_net_mem_pool(size, layer_num, k_head_num, k_head_dim, v_head_num, v_head_dim, device='cuda', dtype=torch.float16, page_size=1, need_sort=True, pin_memory=True)¶
Create a KVPool and its TokenToKVPoolAllocator for a full-attention (non-SWA) model.
- Parameters:
size (int) – Number of usable token slots in the KV cache.
layer_num (int) – Number of transformer layers.
k_head_num (int) – Number of key heads.
k_head_dim (int) – Dimension of each key head.
v_head_num (int) – Number of value heads.
v_head_dim (int) – Dimension of each value head.
device (str | torch.device) – Target device.
dtype (torch.dtype) – Storage data type for the KV buffers.
page_size (int) – Allocator page size (1 = per-token, >1 = page-aligned).
need_sort (bool) – Whether the allocator sorts on merge for memory locality.
pin_memory (bool) – Whether to use pinned memory for the KV buffers.
- Return type:
- class pymllm.mem_cache.memory_pool.GDNPool(max_reqs, num_gdn_layers, num_v_heads, head_k_dim, head_v_dim, conv_dim, conv_kernel_size, device='cuda', dtype=torch.bfloat16, max_track_slots=0)¶
Pre-allocated memory pool for GDN recurrent and conv states.
Indexed by req_pool_idx (same index space as ReqToTokenPool). Slot 0 is reserved as a padding / dummy slot and is never allocated.
Layout:
recurrent_state[gdn_layer_idx, slot, num_v_heads, head_k_dim, head_v_dim]   float32 (FlashInfer requirement)
conv_state[gdn_layer_idx, slot, conv_dim, kernel_size - 1]                  model dtype (bfloat16 / float16)
- Parameters:
max_reqs (int) – Maximum number of concurrent requests (matches ReqToTokenPool.size).
num_gdn_layers (int) – Number of GDN (linear attention) layers in the model.
num_v_heads (int) – Number of value heads per GDN layer.
head_k_dim (int) – Per-head key dimension.
head_v_dim (int) – Per-head value dimension.
conv_dim (int) – Total convolution input dimension (key_dim * 2 + value_dim).
conv_kernel_size (int) – Causal conv1d kernel width (state stores kernel_size - 1 columns).
device (str | torch.device) – Target device.
dtype (torch.dtype) – Storage dtype for conv_state (recurrent_state is always float32).
max_track_slots (int)
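A back-of-envelope sizing of the two state tensors, using the layout and dtypes above (all numbers are hypothetical; pool_size assumes slot 0 adds one extra padding row, and key_dim / value_dim are assumed to be the per-layer totals num_v_heads * head_dim):

```python
# Hypothetical model configuration, not taken from any real checkpoint.
num_gdn_layers, max_reqs = 12, 256
num_v_heads, head_k_dim, head_v_dim = 16, 128, 128
conv_kernel_size = 4
pool_size = max_reqs + 1                  # assumed extra row for slot 0

key_dim = num_v_heads * head_k_dim        # assumed per-layer key width
value_dim = num_v_heads * head_v_dim
conv_dim = key_dim * 2 + value_dim        # formula from the parameter docs

recurrent_bytes = (num_gdn_layers * pool_size * num_v_heads
                   * head_k_dim * head_v_dim * 4)          # float32
conv_bytes = (num_gdn_layers * pool_size * conv_dim
              * (conv_kernel_size - 1) * 2)                # bfloat16
```

The recurrent state dominates: it scales with head_k_dim * head_v_dim per head, while the conv state only keeps kernel_size - 1 columns per channel.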
- max_reqs¶
- num_gdn_layers¶
- num_v_heads¶
- head_k_dim¶
- head_v_dim¶
- conv_dim¶
- conv_kernel_size¶
- device¶
- dtype¶
- max_track_slots = 0¶
- recurrent_state¶
- conv_state¶
- get_layer_state(gdn_layer_idx)¶
Return (recurrent_state, conv_state) for a specific GDN layer.
Both are views into the pool tensors with shape:
- recurrent: [pool_size, num_v_heads, head_v_dim, head_k_dim]
- conv: [pool_size, conv_dim, kernel_size - 1]
- Parameters:
gdn_layer_idx (int)
- Return type:
Tuple[torch.Tensor, torch.Tensor]
- reset_states(req_pool_indices)¶
Zero-init GDN states for the given request pool indices.
Called when new requests are allocated to ensure clean state.
- Parameters:
req_pool_indices (torch.Tensor)
- Return type:
None
- alloc_track_slot()¶
Allocate a single track slot index. Returns None if exhausted.
- Return type:
Optional[int]
- free_track_slot(slot)¶
Return a track slot to the free list.
- Parameters:
slot (int)
- Return type:
None
- copy_states(src_index, dst_index)¶
Copy recurrent and conv states from src_index to dst_index.
Works for any pool indices (working or track slots).
- Parameters:
src_index (int)
dst_index (int)
- Return type:
None
- mem_bytes()¶
Total memory consumption in bytes.
- Return type:
int
- pymllm.mem_cache.memory_pool.make_req_to_token_pool(max_reqs, max_context_len, device='cuda')¶
- Parameters:
max_reqs (int)
max_context_len (int)
device (Union[str, torch.device])
- Return type: