pymllm.mem_cache.memory_pool

Lightweight KV-cache memory pools

Three-layer architecture:

ReqToTokenPool          maps  (req_slot, position) → kv_index
TokenToKVPoolAllocator  manages a free-list of integer indices
KVPool                  holds the actual GPU K/V tensors

All indices are int32 tensors on the target device. Slot 0 in the KV buffers is reserved as a padding / dummy-output slot and is never allocated.
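
To make the indexing scheme concrete, here is a stdlib-only sketch of the three layers, with plain Python lists standing in for int32 device tensors (all class and variable names here are illustrative, not the real API):

```python
# Illustrative sketch of the three-layer indexing scheme.
# Slot 0 of the KV buffer is reserved for padding and never handed out.

class TinyKVAllocator:
    """Free-list over integer KV-slot indices 1..size (slot 0 reserved)."""
    def __init__(self, size):
        self.free_slots = list(range(1, size + 1))

    def alloc(self, n):
        if n > len(self.free_slots):
            return None                      # pool exhausted
        out, self.free_slots = self.free_slots[:n], self.free_slots[n:]
        return out

    def free(self, indices):
        self.free_slots.extend(indices)

class TinyReqToToken:
    """(req_slot, position) -> kv_index mapping table."""
    def __init__(self, max_reqs, max_context_len):
        self.table = [[0] * max_context_len for _ in range(max_reqs)]
        self.free_rows = list(range(max_reqs))

    def alloc(self):
        return self.free_rows.pop(0)

    def write(self, row, start, kv_indices):
        self.table[row][start:start + len(kv_indices)] = kv_indices

allocator = TinyKVAllocator(size=8)
reqs = TinyReqToToken(max_reqs=2, max_context_len=4)

row = reqs.alloc()            # request slot (req_pool_idx)
kv = allocator.alloc(3)       # KV slots for a 3-token prompt
reqs.write(row, 0, kv)
# reqs.table[row][:3] now holds the request's KV indices; slot 0 stays free.
```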

Attributes

logger
Classes

KVPool

GPU (or CPU) storage for per-layer key and value caches.

TokenToKVPoolAllocator

Manages allocation / deallocation of integer indices into a KVPool.

ReqToTokenPool

Maps each live request to its per-position KV-pool indices.

GDNPool

Pre-allocated memory pool for GDN recurrent and conv states.

Functions

make_full_attention_net_mem_pool(size, layer_num, ...)

Create a KVPool and its TokenToKVPoolAllocator for a full-attention (non-SWA) model.

make_req_to_token_pool(max_reqs, max_context_len[, device])

Create a ReqToTokenPool.

Module Contents

pymllm.mem_cache.memory_pool.logger
class pymllm.mem_cache.memory_pool.KVPool(size, layer_num, k_head_num, k_head_dim, device='cuda', dtype=torch.float16, v_head_num=None, v_head_dim=None, pin_memory=True)

GPU (or CPU) storage for per-layer key and value caches.

Layout per layer:

JIT:

k_buffer[layer][slot, k_head_num * k_head_dim] v_buffer[layer][slot, v_head_num * v_head_dim]

PyTorch:

k_buffer[layer][slot, k_head_num, k_head_dim] v_buffer[layer][slot, v_head_num, v_head_dim]

K and V may have independent head counts and head dimensions, which covers standard MHA, GQA / MQA, and architectures like MLA where value projection uses a different dimensionality.

The size usable slots are numbered [1, size]; slot 0 is a dummy padding slot that absorbs writes from padded tokens.
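
Buffer shapes and footprint follow directly from the constructor arguments. A back-of-envelope calculation with illustrative GQA numbers (2 bytes per element for float16):

```python
# KV-cache sizing for a hypothetical GQA config (illustrative numbers only).
size, layer_num = 4096, 32          # usable slots, transformer layers
k_head_num, k_head_dim = 8, 128     # GQA: fewer KV heads than query heads
v_head_num, v_head_dim = 8, 128     # defaults to the K values
dtype_bytes = 2                     # torch.float16

slots = size + 1                    # +1 for the reserved padding slot 0
k_shape = (slots, k_head_num, k_head_dim)   # per-layer K buffer (PyTorch layout)
v_shape = (slots, v_head_num, v_head_dim)   # per-layer V buffer

per_layer = (k_head_num * k_head_dim + v_head_num * v_head_dim) * slots * dtype_bytes
total_bytes = per_layer * layer_num
print(total_bytes / 2**20)          # ≈ 512 MiB for this config
```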

Parameters:
  • size (int) – Number of usable token slots (total buffer length = size + 1).

  • layer_num (int) – Number of transformer layers (one K buffer + one V buffer per layer).

  • k_head_num (int) – Number of key heads.

  • k_head_dim (int) – Dimension of each key head.

  • device (str | torch.device) – Target device ("cuda", "cpu", …).

  • dtype (torch.dtype) – Storage data type.

  • v_head_num (int, optional) – Number of value heads. Defaults to k_head_num.

  • v_head_dim (int, optional) – Dimension of each value head. Defaults to k_head_dim.

  • pin_memory (bool, optional) – Whether to use pinned memory. Defaults to True.

size
layer_num
k_head_num
k_head_dim
v_head_num
v_head_dim
device
dtype
k_buffer: List[torch.Tensor]
v_buffer: List[torch.Tensor]
get_key_buffer(layer_id)
Parameters:

layer_id (int)

Return type:

torch.Tensor

get_value_buffer(layer_id)
Parameters:

layer_id (int)

Return type:

torch.Tensor

get_kv_buffer(layer_id)
Parameters:

layer_id (int)

Return type:

Tuple[torch.Tensor, torch.Tensor]

set_kv_buffer(layer_id, indices, k, v)

Write K/V vectors into the cache at the given indices.

k / v can be any shape as long as the trailing dimensions multiply to head_num * head_dim (the row dimension). All leading dimensions are treated as the batch axis and must match indices after flattening. Typical shapes:

k: [num_tokens, head_num, head_dim]          indices: [num_tokens]
k: [batch, seq_len, head_num, head_dim]      indices: [batch, seq_len]
k: [num_tokens, head_num * head_dim]         indices: [num_tokens]
Parameters:
  • layer_id (int)

  • indices (torch.Tensor)

  • k (torch.Tensor)

  • v (torch.Tensor)

Return type:

None
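
The shape contract can be restated as: consume trailing dimensions of k until they multiply to head_num * head_dim, then require the remaining leading dimensions to flatten to the same count as indices. A stdlib-only sketch of that check (shapes only; not the real implementation):

```python
from math import prod

def check_kv_shapes(k_shape, indices_shape, head_num, head_dim):
    """Validate the set_kv_buffer shape contract (illustrative sketch)."""
    row = head_num * head_dim
    # Consume trailing dims until they multiply to the row dimension.
    acc, split = 1, len(k_shape)
    while split > 0 and acc < row:
        split -= 1
        acc *= k_shape[split]
    if acc != row:
        raise ValueError("trailing dims do not multiply to head_num * head_dim")
    batch = prod(k_shape[:split])          # leading dims form the batch axis
    if batch != prod(indices_shape):
        raise ValueError("flattened indices do not match the batch axis")
    return batch

# The three typical shapes from the docstring (head_num=8, head_dim=128):
assert check_kv_shapes((7, 8, 128), (7,), 8, 128) == 7
assert check_kv_shapes((2, 5, 8, 128), (2, 5), 8, 128) == 10
assert check_kv_shapes((7, 8 * 128), (7,), 8, 128) == 7
```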

class pymllm.mem_cache.memory_pool.TokenToKVPoolAllocator(size, device='cuda', page_size=1, need_sort=True)

Manages allocation / deallocation of integer indices into a KVPool.

Each alloc(n) returns n free indices; each free(indices) returns them to the pool.

Uses a dual-buffer strategy (free_slots + release_slots) so that free() never concatenates onto the large main free-list. Freed indices accumulate in the smaller release_slots and are merged lazily (with an optional sort) only when alloc() cannot be satisfied from free_slots alone.

A batch-free API (free_group_begin / free_group_end) further amortises cost when many free() calls happen in a tight loop (e.g. during scheduling or eviction).

Typical usage:

allocator = TokenToKVPoolAllocator(size=4096, device="cuda")

# --- basic alloc / free ---
indices = allocator.alloc(128)      # 128 free slot indices (int32)
allocator.free(indices[:64])        # return 64 slots

# --- batch free (amortised) ---
allocator.free_group_begin()
for req in finished_requests:
    allocator.free(req.kv_indices)  # O(1) list append each
allocator.free_group_end()          # single torch.cat + release
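
The dual-buffer idea can be sketched in pure Python, with lists standing in for the int32 free-list tensors (illustrative only, not the real class):

```python
class DualBufferFreeList:
    """free() appends to a small release buffer; the large free list is
    only touched (merged, optionally sorted) when alloc() runs short."""

    def __init__(self, size, need_sort=True):
        self.free_slots = list(range(1, size + 1))   # slot 0 reserved
        self.release_slots = []
        self.need_sort = need_sort

    def free(self, indices):
        self.release_slots.extend(indices)           # cheap append

    def merge_and_sort_free(self):
        self.free_slots += self.release_slots
        self.release_slots = []
        if self.need_sort:
            self.free_slots.sort()                   # low indices first

    def alloc(self, n):
        if n > len(self.free_slots):
            self.merge_and_sort_free()               # lazy merge
        if n > len(self.free_slots):
            return None                              # truly exhausted
        out, self.free_slots = self.free_slots[:n], self.free_slots[n:]
        return out

fl = DualBufferFreeList(size=4)
a = fl.alloc(3)          # [1, 2, 3]
fl.free(a)               # lands in release_slots; free_slots untouched
b = fl.alloc(4)          # forces the merge; serves [1, 2, 3, 4]
```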
Parameters:
  • size (int) – Total number of allocatable slots (must match KVPool.size).

  • device (str | torch.device) – Device for the free-list tensor.

  • page_size (int) – When > 1 the allocator works in page-aligned mode: alloc returns indices in multiples of page_size, contiguous within each page, and free deduplicates indices by page.

  • need_sort (bool) – When True (default), merge_and_sort_free sorts after merging so that lower-index slots are allocated first (better memory locality).
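
With page_size > 1, the accounting rounds requests up to whole pages and identifies frees by page id. A minimal sketch of that arithmetic (hypothetical helpers, not part of the API):

```python
def pages_needed(num_tokens, page_size):
    """Tokens are stored in whole pages: round the request up."""
    return -(-num_tokens // page_size)   # ceiling division

def page_of(index, page_size):
    """Deduplicating frees by page means mapping slot indices to page ids."""
    return index // page_size

assert pages_needed(10, 4) == 3                      # 10 tokens need 3 pages of 4
assert {page_of(i, 4) for i in (8, 9, 11)} == {2}    # same page -> one page free
```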

size
page_size = 1
device
need_sort = True
clear()

Reset the allocator so that all slots [1, size] are free. The first slot is reserved for padding.

Return type:

None

available_size()

Number of tokens that can still be allocated.

Return type:

int

merge_and_sort_free()

Merge release_slots into free_slots (and sort if need_sort).

Return type:

None

free_group_begin()

Start collecting free() calls; actual release is deferred to free_group_end.

Return type:

None

free_group_end()

Flush all free() calls collected since free_group_begin.

Return type:

None

alloc(need_size)

Allocate need_size token indices.

Returns a 1-D int32 tensor on success, or None if the pool is exhausted.

Parameters:

need_size (int)

Return type:

Optional[torch.Tensor]

free(indices)

Return indices to the free pool.

Parameters:

indices (torch.Tensor)

Return type:

None

class pymllm.mem_cache.memory_pool.ReqToTokenPool(max_reqs, max_context_len, device='cuda')

Maps each live request to its per-position KV-pool indices.

Internally a 2-D tensor req_to_token[slot, position] stores the KV-pool index for every token position of every active request. Slots are recycled via a simple free-list.

This class is a pure mapping table – it does not track per-request sequence lengths. The caller (typically the Req / IO-struct object) must store req_pool_idx and seq_len and use them to slice into req_to_token when reading back KV indices.

Typical usage:

pool = ReqToTokenPool(max_reqs=256, max_context_len=4096)

# --- on new request arrival ---
[slot] = pool.alloc(1)                       # slot = req_pool_idx
kv_indices = kv_allocator.alloc(seq_len)      # from TokenToKVPoolAllocator
pool.write((slot, slice(0, seq_len)), kv_indices)

# --- read back (caller tracks seq_len) ---
kv_indices = pool.req_to_token[slot, :seq_len]

# --- on request completion ---
kv_allocator.free(pool.req_to_token[slot, :seq_len])
pool.free(slot)
Parameters:
  • max_reqs (int) – Maximum number of concurrent requests (number of rows).

  • max_context_len (int) – Maximum sequence length any single request can reach (number of cols).

  • device (str | torch.device) – Target device for the mapping tensor.

size
max_context_len
device
req_to_token
available_size()
Return type:

int

alloc(n=1)

Allocate n request slots. Returns a list of slot indices, or None if fewer than n slots are free.

Parameters:

n (int)

Return type:

Optional[List[int]]

free(slot)

Return a single request slot to the pool.

Parameters:

slot (int)

Return type:

None

write(index, values)

Write KV indices into the mapping table.

index is typically (req_pool_idx, slice(start, end)).

Parameters:
  • index (Tuple)

  • values (torch.Tensor)

Return type:

None

clear()
Return type:

None

pymllm.mem_cache.memory_pool.make_full_attention_net_mem_pool(size, layer_num, k_head_num, k_head_dim, v_head_num, v_head_dim, device='cuda', dtype=torch.float16, page_size=1, need_sort=True, pin_memory=True)

Create a KVPool and its TokenToKVPoolAllocator for a full-attention (non-SWA) model.

Parameters:
  • size (int) – Number of usable token slots in the KV cache.

  • layer_num (int) – Number of transformer layers.

  • k_head_num (int) – Number of key heads.

  • k_head_dim (int) – Dimension of each key head.

  • v_head_num (int) – Number of value heads.

  • v_head_dim (int) – Dimension of each value head.

  • device (str | torch.device) – Target device.

  • dtype (torch.dtype) – Storage data type for the KV buffers.

  • page_size (int) – Allocator page size (1 = per-token, >1 = page-aligned).

  • need_sort (bool) – Whether the allocator sorts on merge for memory locality.

  • pin_memory (bool) – Whether to use pinned memory for the KV buffers.

Return type:

(KVPool, TokenToKVPoolAllocator)

class pymllm.mem_cache.memory_pool.GDNPool(max_reqs, num_gdn_layers, num_v_heads, head_k_dim, head_v_dim, conv_dim, conv_kernel_size, device='cuda', dtype=torch.bfloat16, max_track_slots=0)

Pre-allocated memory pool for GDN recurrent and conv states.

Indexed by req_pool_idx (same index space as ReqToTokenPool). Slot 0 is reserved as a padding / dummy slot and is never allocated.

Layout:

recurrent_state[gdn_layer_idx, slot, num_v_heads, head_k_dim, head_v_dim]
    float32 (FlashInfer requirement)
conv_state[gdn_layer_idx, slot, conv_dim, kernel_size - 1]
    model dtype (bfloat16 / float16)
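
The pool's footprint follows from this layout: recurrent states in float32 (4 bytes), conv states in the model dtype (2 bytes for bfloat16). A back-of-envelope calculation with illustrative, made-up dimensions:

```python
# Hypothetical GDN config (numbers are illustrative, not from any model).
max_reqs, num_gdn_layers = 256, 12       # request slots (slot 0 reserved), GDN layers
num_v_heads, head_k_dim, head_v_dim = 16, 128, 128
conv_dim, conv_kernel_size = 3072, 4     # conv state stores kernel_size - 1 columns

recurrent_elems = num_gdn_layers * max_reqs * num_v_heads * head_k_dim * head_v_dim
conv_elems = num_gdn_layers * max_reqs * conv_dim * (conv_kernel_size - 1)

total = recurrent_elems * 4 + conv_elems * 2   # float32 + bfloat16 bytes
print(total / 2**30)                           # ≈ 3 GiB for this config
```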
Parameters:
  • max_reqs (int) – Maximum number of concurrent requests (matches ReqToTokenPool.size).

  • num_gdn_layers (int) – Number of GDN (linear attention) layers in the model.

  • num_v_heads (int) – Number of value heads per GDN layer.

  • head_k_dim (int) – Per-head key dimension.

  • head_v_dim (int) – Per-head value dimension.

  • conv_dim (int) – Total convolution input dimension (key_dim * 2 + value_dim).

  • conv_kernel_size (int) – Causal conv1d kernel width (state stores kernel_size - 1 columns).

  • device (str | torch.device) – Target device.

  • dtype (torch.dtype) – Storage dtype for conv_state (recurrent_state is always float32).

  • max_track_slots (int)

max_reqs
num_gdn_layers
num_v_heads
head_k_dim
head_v_dim
conv_dim
conv_kernel_size
device
dtype
max_track_slots = 0
recurrent_state
conv_state
get_layer_state(gdn_layer_idx)

Return (recurrent_state, conv_state) for a specific GDN layer.

Both are views into the pool tensors, with shapes:

recurrent: [pool_size, num_v_heads, head_v_dim, head_k_dim]
conv:      [pool_size, conv_dim, kernel_size - 1]

Parameters:

gdn_layer_idx (int)

Return type:

Tuple[torch.Tensor, torch.Tensor]

reset_states(req_pool_indices)

Zero-init GDN states for the given request pool indices.

Called when new requests are allocated to ensure clean state.

Parameters:

req_pool_indices (torch.Tensor)

Return type:

None

alloc_track_slot()

Allocate a single track slot index. Returns None if exhausted.

Return type:

Optional[int]

free_track_slot(slot)

Return a track slot to the free list.

Parameters:

slot (int)

Return type:

None

copy_states(src_index, dst_index)

Copy recurrent and conv states from src_index to dst_index.

Works for any pool indices (working or track slots).

Parameters:
  • src_index (int)

  • dst_index (int)

Return type:

None

mem_bytes()

Total memory consumption in bytes.

Return type:

int

pymllm.mem_cache.memory_pool.make_req_to_token_pool(max_reqs, max_context_len, device='cuda')

Create a ReqToTokenPool.
Parameters:
  • max_reqs (int)

  • max_context_len (int)

  • device (Union[str, torch.device])

Return type:

ReqToTokenPool