pymllm.mem_cache.memory_pool
============================

.. py:module:: pymllm.mem_cache.memory_pool

.. autoapi-nested-parse::

   Lightweight KV-cache memory pools.

   Three-layer architecture::

       ReqToTokenPool          maps (req_slot, position) → kv_index
       TokenToKVPoolAllocator  manages a free-list of integer indices
       KVPool                  holds the actual GPU K/V tensors

   All indices are **int32** tensors on the target device. Slot 0 in the
   KV buffers is reserved as a padding / dummy-output slot and is never
   allocated.

Attributes
----------

.. autoapisummary::

   pymllm.mem_cache.memory_pool.logger

Classes
-------

.. autoapisummary::

   pymllm.mem_cache.memory_pool.KVPool
   pymllm.mem_cache.memory_pool.TokenToKVPoolAllocator
   pymllm.mem_cache.memory_pool.ReqToTokenPool
   pymllm.mem_cache.memory_pool.GDNPool

Functions
---------

.. autoapisummary::

   pymllm.mem_cache.memory_pool.make_full_attention_net_mem_pool
   pymllm.mem_cache.memory_pool.make_req_to_token_pool

Module Contents
---------------

.. py:data:: logger

.. py:class:: KVPool(size, layer_num, k_head_num, k_head_dim, device = 'cuda', dtype = torch.float16, v_head_num = None, v_head_dim = None, pin_memory = True)

   GPU (or CPU) storage for per-layer key and value caches.

   Layout per layer::

       JIT:      k_buffer[layer][slot, k_head_num * k_head_dim]
                 v_buffer[layer][slot, v_head_num * v_head_dim]
       PyTorch:  k_buffer[layer][slot, k_head_num, k_head_dim]
                 v_buffer[layer][slot, v_head_num, v_head_dim]

   K and V may have **independent** head counts and head dimensions, which
   covers standard MHA, GQA / MQA, and architectures like MLA where the
   value projection uses a different dimensionality.

   ``size`` usable slots are numbered ``[1, size]``. Slot 0 is a dummy
   padding slot that absorbs writes from padded tokens.

   :param size: Number of usable token slots (total buffer length = ``size + 1``).
   :type size: int
   :param layer_num: Number of transformer layers (one K buffer + one V buffer per layer).
   :type layer_num: int
   :param k_head_num: Number of key heads.
   :type k_head_num: int
   :param k_head_dim: Dimension of each key head.
   :type k_head_dim: int
   :param device: Target device (``"cuda"``, ``"cpu"``, …).
   :type device: str | torch.device
   :param dtype: Storage data type.
   :type dtype: torch.dtype
   :param v_head_num: Number of value heads. Defaults to *k_head_num*.
   :type v_head_num: int, optional
   :param v_head_dim: Dimension of each value head. Defaults to *k_head_dim*.
   :type v_head_dim: int, optional
   :param pin_memory: Whether to use pinned memory. Defaults to True.
   :type pin_memory: bool, optional

   .. py:attribute:: size

   .. py:attribute:: layer_num

   .. py:attribute:: k_head_num

   .. py:attribute:: k_head_dim

   .. py:attribute:: v_head_num

   .. py:attribute:: v_head_dim

   .. py:attribute:: device

   .. py:attribute:: dtype

   .. py:attribute:: k_buffer
      :type: List[torch.Tensor]

   .. py:attribute:: v_buffer
      :type: List[torch.Tensor]

   .. py:method:: get_key_buffer(layer_id)

   .. py:method:: get_value_buffer(layer_id)

   .. py:method:: get_kv_buffer(layer_id)

   .. py:method:: set_kv_buffer(layer_id, indices, k, v)

      Write K/V vectors into the cache at the given *indices*.

      ``k`` / ``v`` can be any shape as long as the trailing dimensions
      multiply to ``head_num * head_dim`` (the row dimension). All leading
      dimensions are treated as the batch axis and must match ``indices``
      after flattening.

      Typical shapes::

          k: [num_tokens, head_num, head_dim]        indices: [num_tokens]
          k: [batch, seq_len, head_num, head_dim]    indices: [batch, seq_len]
          k: [num_tokens, head_num * head_dim]       indices: [num_tokens]

.. py:class:: TokenToKVPoolAllocator(size, device = 'cuda', page_size = 1, need_sort = True)

   Manages allocation / deallocation of integer indices into a :class:`KVPool`.

   Each ``alloc(n)`` returns *n* free indices; each ``free(indices)`` returns
   them to the pool.

   Uses a **dual-buffer** strategy (``free_slots`` + ``release_slots``) so
   that ``free()`` never concatenates onto the large main free-list.
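The dual-buffer idea can be sketched in plain Python, with lists standing in for the ``int32`` tensors. ``DualFreeList`` below is an illustrative stand-in under the semantics described here (slot 0 reserved, lazy sorted merge), not a class from this module:

```python
class DualFreeList:
    """Sketch of the dual-buffer free-list: free() appends to a small
    release buffer; the large sorted free list is only touched when
    alloc() cannot be satisfied from it (lazy merge)."""

    def __init__(self, size):
        # Slot 0 is reserved for padding, so usable slots are [1, size].
        self.free_slots = list(range(1, size + 1))
        self.release_slots = []

    def merge_and_sort_free(self):
        # Lazy merge: fold released slots back in, lowest index first
        # (the need_sort=True behaviour).
        self.free_slots += self.release_slots
        self.release_slots = []
        self.free_slots.sort()

    def alloc(self, n):
        if n > len(self.free_slots):
            self.merge_and_sort_free()   # merge only on demand
        if n > len(self.free_slots):
            return None                  # pool exhausted
        out = self.free_slots[:n]
        self.free_slots = self.free_slots[n:]
        return out

    def free(self, indices):
        # Cheap append to the small buffer; no concat onto the big list.
        self.release_slots += list(indices)
```

The real allocator applies the same logic to device tensors, where avoiding a ``torch.cat`` onto the large free-list on every ``free()`` is the point of the second buffer.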
   Freed indices accumulate in the smaller ``release_slots`` and are merged
   lazily (with an optional sort) only when ``alloc()`` cannot be satisfied
   from ``free_slots`` alone.

   A **batch-free** API (``free_group_begin`` / ``free_group_end``) further
   amortises cost when many ``free()`` calls happen in a tight loop
   (e.g. during scheduling or eviction).

   Typical usage::

       allocator = TokenToKVPoolAllocator(size=4096, device="cuda")

       # --- basic alloc / free ---
       indices = allocator.alloc(128)    # 128 free slot indices (int32)
       allocator.free(indices[:64])      # return 64 slots

       # --- batch free (amortised) ---
       allocator.free_group_begin()
       for req in finished_requests:
           allocator.free(req.kv_indices)   # O(1) list append each
       allocator.free_group_end()           # single torch.cat + release

   :param size: Total number of allocatable slots (must match ``KVPool.size``).
   :type size: int
   :param device: Device for the free-list tensor.
   :type device: str | torch.device
   :param page_size: When > 1 the allocator works in page-aligned mode:
       ``alloc`` returns multiples of ``page_size`` contiguous within each
       page, and ``free`` deduplicates by page.
   :type page_size: int
   :param need_sort: When ``True`` (default), ``merge_and_sort_free`` sorts
       after merging so that lower-index slots are allocated first (better
       memory locality).
   :type need_sort: bool

   .. py:attribute:: size

   .. py:attribute:: page_size
      :value: 1

   .. py:attribute:: device

   .. py:attribute:: need_sort
      :value: True

   .. py:method:: clear()

      Reset the allocator so that all slots ``[1, size]`` are free.
      The first slot is reserved for padding.

   .. py:method:: available_size()

      Number of tokens that can still be allocated.

   .. py:method:: merge_and_sort_free()

      Merge ``release_slots`` into ``free_slots`` (and sort if ``need_sort``).

   .. py:method:: free_group_begin()

      Start collecting ``free()`` calls; actual release is deferred to
      ``free_group_end``.

   .. py:method:: free_group_end()

      Flush all ``free()`` calls collected since ``free_group_begin``.

   .. py:method:: alloc(need_size)

      Allocate *need_size* token indices.

      Returns a 1-D ``int32`` tensor on success, or ``None`` if the pool
      is exhausted.

   .. py:method:: free(indices)

      Return *indices* to the free pool.

.. py:class:: ReqToTokenPool(max_reqs, max_context_len, device = 'cuda')

   Maps each live request to its per-position KV-pool indices.

   Internally a 2-D tensor ``req_to_token[slot, position]`` stores the
   KV-pool index for every token position of every active request. Slots
   are recycled via a simple free-list.

   This class is a **pure mapping table** -- it does **not** track
   per-request sequence lengths. The caller (typically the ``Req`` /
   IO-struct object) must store ``req_pool_idx`` and ``seq_len`` and use
   them to slice into ``req_to_token`` when reading back KV indices.

   Typical usage::

       pool = ReqToTokenPool(max_reqs=256, max_context_len=4096)

       # --- on new request arrival ---
       [slot] = pool.alloc(1)                    # slot = req_pool_idx
       kv_indices = kv_allocator.alloc(seq_len)  # from TokenToKVPoolAllocator
       pool.write((slot, slice(0, seq_len)), kv_indices)

       # --- read back (caller tracks seq_len) ---
       kv_indices = pool.req_to_token[slot, :seq_len]

       # --- on request completion ---
       kv_allocator.free(pool.req_to_token[slot, :seq_len])
       pool.free(slot)

   :param max_reqs: Maximum number of concurrent requests (number of rows).
   :type max_reqs: int
   :param max_context_len: Maximum sequence length any single request can
       reach (number of cols).
   :type max_context_len: int
   :param device: Target device for the mapping tensor.
   :type device: str | torch.device

   .. py:attribute:: size

   .. py:attribute:: max_context_len

   .. py:attribute:: device

   .. py:attribute:: req_to_token

   .. py:method:: available_size()

   .. py:method:: alloc(n = 1)

      Allocate *n* request slots. Returns a list of slot indices.

   .. py:method:: free(slot)

      Return a single request slot to the pool.

   .. py:method:: write(index, values)

      Write KV indices into the mapping table. ``index`` is typically
      ``(req_pool_idx, slice(start, end))``.

   .. py:method:: clear()

.. py:function:: make_full_attention_net_mem_pool(size, layer_num, k_head_num, k_head_dim, v_head_num, v_head_dim, device = 'cuda', dtype = torch.float16, page_size = 1, need_sort = True, pin_memory = True)

   Create a :class:`KVPool` and its :class:`TokenToKVPoolAllocator` for a
   full-attention (non-SWA) model.

   :param size: Number of usable token slots in the KV cache.
   :type size: int
   :param layer_num: Number of transformer layers.
   :type layer_num: int
   :param k_head_num / k_head_dim: Key head count and dimension.
   :type k_head_num / k_head_dim: int
   :param v_head_num / v_head_dim: Value head count and dimension.
   :type v_head_num / v_head_dim: int
   :param device: Target device.
   :type device: str | torch.device
   :param dtype: Storage data type for the KV buffers.
   :type dtype: torch.dtype
   :param page_size: Allocator page size (1 = per-token, >1 = page-aligned).
   :type page_size: int
   :param need_sort: Whether the allocator sorts on merge for memory locality.
   :type need_sort: bool
   :param pin_memory: Whether to use pinned memory for the KV buffers.
   :type pin_memory: bool
   :rtype: (KVPool, TokenToKVPoolAllocator)

.. py:class:: GDNPool(max_reqs, num_gdn_layers, num_v_heads, head_k_dim, head_v_dim, conv_dim, conv_kernel_size, device = 'cuda', dtype = torch.bfloat16, max_track_slots = 0)

   Pre-allocated memory pool for GDN recurrent and conv states.

   Indexed by ``req_pool_idx`` (same index space as :class:`ReqToTokenPool`).
   Slot 0 is reserved as a padding / dummy slot and is never allocated.

   Layout::

       recurrent_state[gdn_layer_idx, slot, num_v_heads, head_k_dim, head_v_dim]
           float32 (FlashInfer requirement)
       conv_state[gdn_layer_idx, slot, conv_dim, kernel_size - 1]
           model dtype (bfloat16 / float16)

   :param max_reqs: Maximum number of concurrent requests (matches
       ``ReqToTokenPool.size``).
   :type max_reqs: int
   :param num_gdn_layers: Number of GDN (linear attention) layers in the model.
   :type num_gdn_layers: int
   :param num_v_heads: Number of value heads per GDN layer.
   :type num_v_heads: int
   :param head_k_dim: Per-head key dimension.
   :type head_k_dim: int
   :param head_v_dim: Per-head value dimension.
   :type head_v_dim: int
   :param conv_dim: Total convolution input dimension (``key_dim * 2 + value_dim``).
   :type conv_dim: int
   :param conv_kernel_size: Causal conv1d kernel width (state stores
       ``kernel_size - 1`` columns).
   :type conv_kernel_size: int
   :param device: Target device.
   :type device: str | torch.device
   :param dtype: Storage dtype for ``conv_state`` (``recurrent_state`` is
       always float32).
   :type dtype: torch.dtype

   .. py:attribute:: max_reqs

   .. py:attribute:: num_gdn_layers

   .. py:attribute:: num_v_heads

   .. py:attribute:: head_k_dim

   .. py:attribute:: head_v_dim

   .. py:attribute:: conv_dim

   .. py:attribute:: conv_kernel_size

   .. py:attribute:: device

   .. py:attribute:: dtype

   .. py:attribute:: max_track_slots
      :value: 0

   .. py:attribute:: recurrent_state

   .. py:attribute:: conv_state

   .. py:method:: get_layer_state(gdn_layer_idx)

      Return ``(recurrent_state, conv_state)`` for a specific GDN layer.

      Both are views into the pool tensors with shape:

      - recurrent: ``[pool_size, num_v_heads, head_v_dim, head_k_dim]``
      - conv: ``[pool_size, conv_dim, kernel_size - 1]``

   .. py:method:: reset_states(req_pool_indices)

      Zero-init GDN states for the given request pool indices.
      Called when new requests are allocated to ensure a clean state.

   .. py:method:: alloc_track_slot()

      Allocate a single track slot index. Returns ``None`` if exhausted.

   .. py:method:: free_track_slot(slot)

      Return a track slot to the free list.

   .. py:method:: copy_states(src_index, dst_index)

      Copy recurrent and conv states from *src_index* to *dst_index*.
      Works for any pool indices (working or track slots).

   .. py:method:: mem_bytes()

      Total memory consumption in bytes.

.. py:function:: make_req_to_token_pool(max_reqs, max_context_len, device = 'cuda')

   Create a :class:`ReqToTokenPool` with the given capacity and device.
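Putting the layers together: the lookup path described at the top of the module (request slot → position → KV index → buffer row) can be illustrated without torch. The helpers below (``write_tokens``, ``read_key``) are hypothetical stand-ins for illustration only, not functions from this module:

```python
# Plain-Python stand-in for the three-layer lookup:
#   req_to_token[(slot, pos)] -> kv_index;  k_buffer[layer][kv_index] -> row.
req_to_token = {}                            # ReqToTokenPool analogue
k_buffer = [[None] * 9 for _ in range(2)]    # 2 layers, 8 usable slots + pad slot 0

def write_tokens(slot, kv_indices, keys, layer=0):
    """Record the position -> kv_index mapping and store one key row per token."""
    for pos, (idx, key) in enumerate(zip(kv_indices, keys)):
        req_to_token[(slot, pos)] = idx      # ReqToTokenPool.write(...)
        k_buffer[layer][idx] = key           # KVPool.set_kv_buffer(...)

def read_key(slot, pos, layer=0):
    """Resolve position -> kv_index -> stored key row."""
    return k_buffer[layer][req_to_token[(slot, pos)]]

# A request in slot 0 holding three tokens at KV indices 1..3 (index 0 is padding).
write_tokens(slot=0, kv_indices=[1, 2, 3], keys=["k0", "k1", "k2"])
```

In the real module the dict is a 2-D int32 tensor, the KV indices come from ``TokenToKVPoolAllocator.alloc``, and the caller (not the pool) is responsible for remembering ``slot`` and the sequence length.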