=============
MLLM LM Cache
=============

The MLLM LM Cache module provides an efficient key-value (KV) caching mechanism for optimizing the inference performance of large language models and multimodal models. The module supports both static and dynamic caching strategies, reducing redundant computation and improving inference speed.

Overview
========

In Transformer-based models, the attention mechanism maintains key-value caches so that representations of previously processed tokens do not have to be recomputed. MLLM provides several cache implementations to meet different performance and memory requirements:

- **StaticCache**: Pre-allocates a fixed-size cache; suitable when the maximum sequence length is known in advance
- **DynamicCache**: Allocates cache storage on demand; suitable for variable-length sequences
- **SubStaticCache**: A sub-view of a static cache that supports cache slicing operations

API Reference
=============

StaticCache
-----------

Pre-allocates a fixed-size cache; recommended for performance-critical inference.

.. code-block:: cpp

   #include "mllm/nn/lmcache/StaticCache.hpp"

   // Create static cache
   auto cache = mllm::nn::StaticCache(
       max_cache_length,  // Maximum cache length
       layer_nums,        // Number of layers
       q_heads,           // Number of query heads
       kv_heads,          // Number of key-value heads
       kv_dims,           // Key-value dimension per head
       k_dtype,           // Key data type
       v_dtype,           // Value data type
       device_type,       // Device type (kCPU, kOpenCL, etc.)
       use_fa2            // Whether to use FlashAttention2
   );

   // Update cache
   auto [k_cached, v_cached] = cache.updateKVCache(layer_idx, k_tensor, v_tensor);

   // Get current sequence length
   int32_t seq_len = cache.getCurrentSeqCnt(layer_idx);

Constructor Parameters
~~~~~~~~~~~~~~~~~~~~~~

+-------------------+----------------+--------------------------------+
| Parameter         | Type           | Description                    |
+===================+================+================================+
| max_cache_length  | int32_t        | Maximum cache sequence length  |
+-------------------+----------------+--------------------------------+
| layer_nums        | int32_t        | Number of model layers         |
+-------------------+----------------+--------------------------------+
| q_heads           | int32_t        | Number of query attention      |
|                   |                | heads                          |
+-------------------+----------------+--------------------------------+
| kv_heads          | int32_t        | Number of key-value attention  |
|                   |                | heads                          |
+-------------------+----------------+--------------------------------+
| kv_dims           | int32_t        | Key-value dimension per head   |
+-------------------+----------------+--------------------------------+
| k_dtype           | DataTypes      | Key tensor data type           |
+-------------------+----------------+--------------------------------+
| v_dtype           | DataTypes      | Value tensor data type         |
+-------------------+----------------+--------------------------------+
| device_type       | DeviceTypes    | Device type (default kCPU)     |
+-------------------+----------------+--------------------------------+
| use_fa2           | bool           | Whether to use FlashAttention2 |
|                   |                | (default true)                 |
+-------------------+----------------+--------------------------------+
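Since ``StaticCache`` reserves its full buffer at construction time, it can be useful to estimate the footprint implied by these parameters before choosing ``max_cache_length``. The sketch below is only a back-of-the-envelope calculation; it assumes one key entry and one value entry of ``kv_heads * kv_dims`` elements are stored per token per layer, ignores any padding or alignment the actual allocator may add, and uses illustrative variable names rather than MLLM API.

.. code-block:: cpp

   #include <cstdint>
   #include <cstdio>

   int main() {
     // Illustrative configuration (not taken from a real model config).
     const int64_t max_cache_length = 2048;  // maximum sequence length
     const int64_t layer_nums       = 24;    // number of layers
     const int64_t kv_heads         = 8;     // key-value heads (GQA)
     const int64_t kv_dims          = 128;   // per-head dimension
     const int64_t bytes_per_elem   = 2;     // e.g. FP16 keys and values

     // Keys and values are cached separately, hence the factor of 2.
     const int64_t bytes = 2 * layer_nums * max_cache_length
                             * kv_heads * kv_dims * bytes_per_elem;

     std::printf("Approximate KV cache size: %.1f MiB\n",
                 bytes / (1024.0 * 1024.0));
     return 0;
   }

With these example values the pre-allocated cache comes to roughly 192 MiB, which is worth keeping in mind when targeting memory-constrained devices.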
DynamicCache
------------

Allocates cache storage on demand; suitable for training or variable-length inference scenarios.

.. code-block:: cpp

   #include "mllm/nn/lmcache/DynamicCache.hpp"

   // Create dynamic cache
   auto cache = mllm::nn::DynamicCache(
       layer_nums,  // Number of layers
       q_heads,     // Number of query heads
       kv_heads,    // Number of key-value heads
       kv_dims,     // Key-value dimension per head
       use_fa2      // Whether to use FlashAttention2
   );

   // Update cache
   auto [k_cached, v_cached] = cache.updateKVCache(layer_idx, k_tensor, v_tensor);

   // Get current sequence length
   int32_t seq_len = cache.getCurrentSeqCnt();

SubStaticCache
--------------

A sub-view of a static cache that allows slicing operations on the cache.

.. code-block:: cpp

   // Create sub-cache from an existing static cache
   auto sub_cache = mllm::nn::SubStaticCache(
       parent_cache,  // Parent cache reference
       start_idx,     // Start index
       len            // Length
   );

   // Used in the same way as StaticCache
   auto [k_cached, v_cached] = sub_cache.updateKVCache(layer_idx, k_tensor, v_tensor);

Tensor Format
=============

Non-FlashAttention2 Mode
------------------------

Input tensor format: ``[Batch, Heads, Sequence, Dimension]``

.. code-block:: cpp

   // Example: single batch, 32 heads, sequence length 1, dimension 128
   Tensor k = Tensor::random({1, 32, 1, 128});
   Tensor v = Tensor::random({1, 32, 1, 128});

FlashAttention2 Mode
--------------------

Input tensor format: ``[Batch, Sequence, Heads, Dimension]``

.. code-block:: cpp

   // Example: single batch, sequence length 1, 32 heads, dimension 128
   Tensor k = Tensor::random({1, 1, 32, 128});
   Tensor v = Tensor::random({1, 1, 32, 128});

Usage Examples
==============

Basic Usage
-----------

.. code-block:: cpp

   #include "mllm/nn/lmcache/StaticCache.hpp"

   // Configure parameters
   const int32_t max_seq_len = 2048;
   const int32_t num_layers = 24;
   const int32_t num_q_heads = 32;
   const int32_t num_kv_heads = 8;  // Supports GQA (Grouped Query Attention)
   const int32_t head_dim = 128;

   // Create cache
   auto cache = mllm::nn::StaticCache(
       max_seq_len, num_layers, num_q_heads, num_kv_heads, head_dim,
       mllm::DataTypes::kFP16, mllm::DataTypes::kFP16,
       mllm::DeviceTypes::kCPU
   );

   // Use in inference loop
   for (int layer = 0; layer < num_layers; ++layer) {
     // Assume k, v are the key-value tensors of the current layer
     auto [k_cache, v_cache] = cache.updateKVCache(layer, k, v);

     // Use cached key-values for attention computation
     auto attention_output = attention_func(q, k_cache, v_cache);
   }

Dynamic Cache Example
---------------------

.. code-block:: cpp

   #include "mllm/nn/lmcache/DynamicCache.hpp"

   auto dynamic_cache = mllm::nn::DynamicCache(num_layers, num_q_heads, num_kv_heads, head_dim);

   // Build cache step by step
   for (int step = 0; step < max_steps; ++step) {
     for (int layer = 0; layer < num_layers; ++layer) {
       auto [k_cache, v_cache] = dynamic_cache.updateKVCache(layer, k_step, v_step);
       // Process current step
     }
   }

Performance Optimization
========================

Memory Layout Optimization
--------------------------

- **CPU**: Uses ``memcpy`` for efficient memory copying
- **GPU/NPU**: Uses the tensor's ``copy2`` method for device-optimized copy operations

GQA Support
-----------

Grouped Query Attention is supported by deriving the repeat factor from ``q_heads / kv_heads``, automatically handling the case where there are fewer key-value heads than query heads, as illustrated in the sketch below.
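The repeat factor itself is simple integer bookkeeping. The standalone sketch below only illustrates how ``q_heads / kv_heads`` maps query heads onto shared key-value heads; it is not MLLM code, and a real cache may equally well return the un-expanded KV heads and leave the grouping to the attention kernel.

.. code-block:: cpp

   #include <cassert>
   #include <cstdio>

   int main() {
     const int q_heads  = 32;
     const int kv_heads = 8;
     assert(q_heads % kv_heads == 0);  // GQA requires an integer group size

     const int repeat = q_heads / kv_heads;  // here: 4 query heads share one KV head

     // Each query head q reads the cached K/V of group q / repeat.
     for (int q = 0; q < q_heads; ++q) {
       std::printf("query head %2d -> kv head %d\n", q, q / repeat);
     }
     return 0;
   }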
Device-Specific Optimization
----------------------------

.. code-block:: cpp

   // CPU optimization path
   case kCPU: {
     // Use memcpy for block copying
     std::memcpy(cache_ptr, input_ptr, copy_size);
     break;
   }

   // GPU/NPU optimization path
   default: {
     // Use tensor operations for device-optimized copying
     input_tensor.copy2(cache_tensor);
     break;
   }

Important Notes
===============

1. **Memory Pre-allocation**: ``StaticCache`` pre-allocates all of its memory during construction, so it is suited to scenarios where the maximum sequence length is known
2. **FA2 Compatibility**: Different attention implementations require different tensor layouts; make sure to set the ``use_fa2`` parameter accordingly
3. **Device Compatibility**: Ensure the cache and the input tensors reside on the same device
4. **Data Types**: Mixed precision is supported; keys and values can use different data types

Error Handling
==============

.. code-block:: cpp

   // Check sequence length limits
   if (current_seq_len + input_seq_len > max_cache_length) {
     throw std::runtime_error("Sequence length exceeds cache capacity");
   }

   // Validate tensor shapes
   MLLM_RT_ASSERT_EQ(k.shape()[1], kv_heads);
   MLLM_RT_ASSERT_EQ(v.shape()[1], kv_heads);

Best Practices
==============

1. **Choose the Appropriate Cache Type**:

   - Use ``StaticCache`` for inference to achieve optimal performance
   - Use ``DynamicCache`` for training or variable-length scenarios

2. **Memory Management**:

   - Estimate the maximum sequence length carefully to avoid running out of cache capacity
   - Consider using ``SubStaticCache`` to slice an existing cache

3. **Performance Tuning**:

   - Choose data types that match the hardware's characteristics
   - Enable FlashAttention2 for better memory efficiency

Related Documentation
=====================

- `MLLM Architecture Documentation <../arch/index.rst>`_
- `CPU Backend Optimization <../cpu_backend/index.rst>`_
- `API Reference <../api/index.rst>`_