OpenCL Backend
==============

Overview
--------

The OpenCL backend in MLLM is designed to enable Large Language Model (LLM) inference on a wide range of devices that support the OpenCL standard, such as mobile GPUs (Adreno, Mali) and desktop GPUs. This document outlines the current preliminary design and implementation details.

.. note::

   This is an initial implementation. Significant optimizations in memory management and inference speed are planned for future updates.

Design
------

Memory Management
~~~~~~~~~~~~~~~~~

Memory management is handled by the ``OpenCLAllocator`` class.

* **Mechanism**: It implements a basic memory-pool mechanism to reduce the overhead of frequent memory allocation and deallocation.
* **Implementation**:

  * It maintains a ``memory_pool_`` (a map from buffer sizes to ``cl_mem`` objects).
  * When ``alloc`` is called, it checks the pool for an available buffer of suitable size. If one is found, it is reused; otherwise, a new ``cl_mem`` buffer is created with ``clCreateBuffer``.
  * When ``free`` is called, the buffer is not released to the OpenCL runtime immediately but is returned to the pool for future reuse.
  * Thread safety is ensured with a ``std::mutex``.

A minimal sketch of this pooling scheme is given under Example Sketches at the end of this document.

Model Implementation
~~~~~~~~~~~~~~~~~~~~

The model implementation (e.g., Llama) follows the standard MLLM module structure but is adapted for the OpenCL backend.

* **Device Type**: Tensors and Modules are initialized on, or moved to, the ``mllm::kOpenCL`` device.
* **KV Cache**: Uses ``nn::StaticCache`` configured for ``kOpenCL`` to store key-value pairs in GPU memory.
* **Data Flow**: Input tensors (such as token sequences) are moved to the OpenCL device before inference. Intermediate computations (Attention, MLP) happen on the device.

Usage
-----

To use the OpenCL backend, the application must initialize it and move the model and inputs to the appropriate device (a hypothetical continuation of this snippet is given at the end of this document):

.. code-block:: cpp

    // Initialize the backend
    mllm::initOpenCLBackend();

    // Load model and move to OpenCL device
    auto llama = mllm::models::llama::LlamaForCausalLM("", llama_cfg);
    llama.load(param);
    llama.to(mllm::kOpenCL);

    // Prepare inputs
    inputs["sequence"] = inputs["sequence"].to(mllm::kOpenCL);

Current Limitations & Future Work
---------------------------------

As a preliminary implementation, several areas have been identified for improvement:

1. **Memory Management**:

   * The current pooling strategy is basic.
   * **Optimization Needed**: More advanced allocators (e.g., sub-allocators, better fragmentation handling) are needed to reduce the memory footprint and allocation overhead.

2. **Inference Speed**:

   * The current performance is functional but not fully optimized.
   * **Optimization Needed**: Kernel tuning (work-group sizes, memory access patterns), operator fusion, and minimizing host-device synchronization are required to improve throughput and latency (see the work-group tuning sketch at the end of this document).

3. **Operator Support**:

   * Currently only a subset of the operators required for models like Llama is supported. Support for more operators and architectures will be added.
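Example Sketches
----------------

For illustration, the pooling scheme described under Memory Management can be sketched as follows. This is a minimal sketch modeled on the prose above, not the actual ``OpenCLAllocator`` source: the class name, the pool layout (a ``std::map`` from size to a list of free buffers), and the method signatures are assumptions.

.. code-block:: cpp

    #include <CL/cl.h>

    #include <cstddef>
    #include <map>
    #include <mutex>
    #include <vector>

    // Sketch of a size-bucketed cl_mem pool. All names are illustrative.
    class PoolingAllocator {
     public:
      explicit PoolingAllocator(cl_context context) : context_(context) {}

      cl_mem alloc(size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        // Reuse a pooled buffer of this size if one is available.
        auto it = memory_pool_.find(size);
        if (it != memory_pool_.end() && !it->second.empty()) {
          cl_mem buf = it->second.back();
          it->second.pop_back();
          return buf;
        }
        // Otherwise fall back to a fresh device allocation.
        cl_int err = CL_SUCCESS;
        cl_mem buf = clCreateBuffer(context_, CL_MEM_READ_WRITE, size, nullptr, &err);
        return err == CL_SUCCESS ? buf : nullptr;
      }

      void free(cl_mem buf, size_t size) {
        // Do not release to the OpenCL runtime; park the buffer for reuse.
        std::lock_guard<std::mutex> lock(mutex_);
        memory_pool_[size].push_back(buf);
      }

      ~PoolingAllocator() {
        // Release the pooled (free) buffers for real on teardown.
        for (auto& [sz, buffers] : memory_pool_) {
          (void)sz;
          for (cl_mem buf : buffers) clReleaseMemObject(buf);
        }
      }

     private:
      cl_context context_;
      std::map<size_t, std::vector<cl_mem>> memory_pool_;
      std::mutex mutex_;
    };

Exact-size matching keeps the sketch simple but wastes memory when request sizes vary widely; the sub-allocators and fragmentation handling listed under Future Work would replace it with size buckets or offset-based sub-allocation out of larger arenas.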
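To make the data-flow description concrete, the following hypothetically continues the Usage snippet with a forward pass. The call form ``llama(inputs)``, the reuse of the ``"sequence"`` key for the output, and ``mllm::kCPU`` as the host device are assumptions for illustration; they are not confirmed by this document.

.. code-block:: cpp

    // Hypothetical continuation of the Usage snippet above.
    auto outputs = llama(inputs);  // Attention/MLP run on the OpenCL device

    // Bring the result back to the host only once it is needed; keeping
    // such transfers off the hot path is what "minimizing host-device
    // synchronization" refers to under Future Work.
    outputs["sequence"] = outputs["sequence"].to(mllm::kCPU);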
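The kernel tuning mentioned under Inference Speed largely comes down to choosing a local work-group size per kernel and per device. The sketch below uses only standard OpenCL API calls; the function ``launch_tuned`` and the starting candidate of 64 are illustrative assumptions, not MLLM code.

.. code-block:: cpp

    #include <CL/cl.h>
    #include <cstddef>

    // Sketch: pick a local work-group size for a 1-D kernel launch.
    void launch_tuned(cl_command_queue queue, cl_kernel kernel,
                      cl_device_id device, size_t global_size) {
      // Ask the runtime how large a work-group this kernel supports on
      // this device (Adreno/Mali limits often differ from desktop GPUs).
      size_t max_wg = 0;
      clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                               sizeof(max_wg), &max_wg, nullptr);

      // Start from a candidate that is friendly to mobile GPU wavefronts
      // and clamp it to the reported maximum.
      size_t local_size = 64;
      if (local_size > max_wg) local_size = max_wg;

      // Round the global size up to a multiple of the local size, as
      // required when an explicit local size is passed. The kernel must
      // then bounds-check its global id against the real problem size.
      size_t rounded = ((global_size + local_size - 1) / local_size) * local_size;

      clEnqueueNDRangeKernel(queue, kernel, /*work_dim=*/1, nullptr,
                             &rounded, &local_size, 0, nullptr, nullptr);
      // Avoid calling clFinish() after every launch; batch work and
      // synchronize only when results are actually consumed.
    }

On mobile GPUs the best local size is workload-dependent, which is why querying ``CL_KERNEL_WORK_GROUP_SIZE`` per kernel and device is preferable to hard-coding a desktop-oriented value.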