Parallel API and Thread Configuration in MLLM¶

Introduction¶

MLLM provides a flexible parallel computing framework that allows CPU kernels to utilize multiple threads for improved performance. The parallel execution is controlled through the Parallel API defined in mllm/core/Parallel.hpp and configured via CMake build options.

Parallel API¶

The Parallel API provides macros for parallel execution that abstract away the underlying threading implementation. The API automatically selects the appropriate threading backend based on build configuration:

Apple Grand Central Dispatch (GCD) - Used on Apple platforms when MLLM_KERNEL_THREADS_VENDOR_APPLE_GCD is enabled
OpenMP - Used on most platforms when MLLM_KERNEL_THREADS_VENDOR_OPENMP is enabled
Sequential execution - Used when threading is disabled

Key Parallel API Macros¶

MLLM_AUTO_PARALLEL_BEGIN(__iter__, __num__) / MLLM_AUTO_PARALLEL_END() - Execute a loop with __num__ iterations in parallel
MLLM_AUTO_PARALLEL_FOR_BEGIN(__iter__, __start__, __end__, __step__) / MLLM_AUTO_PARALLEL_FOR_END() - Execute a for-loop in parallel
MLLM_SET_NUM_THREADS(num_threads) - Set the number of threads to use for parallel execution
MLLM_SERIAL_FOR_BEGIN(__iter__, __start__, __end__, __step__) / MLLM_SERIAL_FOR_END() - Execute a serial for-loop
MLLM_CONDITIONAL_PARALLEL_FOR(condition, num_threads, iter, start, end, step, ...) - Conditionally execute a loop in parallel or serial
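
As a rough illustration of how a begin/end macro pair can hide the backend choice, the sketch below defines a hypothetical SKETCH_PARALLEL_FOR_BEGIN / SKETCH_PARALLEL_FOR_END pair. The names and definitions are assumptions made for this example and are not taken from mllm/core/Parallel.hpp; the switch keys on the compiler's standard _OPENMP macro rather than on MLLM's CMake options.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins, NOT the definitions from mllm/core/Parallel.hpp.
// When the compiler has OpenMP enabled, the loop is split across threads;
// otherwise the same source compiles to a plain sequential loop.
#if defined(_OPENMP)
#define SKETCH_PARALLEL_FOR_BEGIN(it, start, end, step) \
  _Pragma("omp parallel for") for (int it = (start); it < (end); it += (step)) {
#else
#define SKETCH_PARALLEL_FOR_BEGIN(it, start, end, step) \
  for (int it = (start); it < (end); it += (step)) {
#endif
#define SKETCH_PARALLEL_FOR_END() }

// Doubles every element. Each iteration writes a distinct index,
// so the iterations are independent and safe to run concurrently.
std::vector<int> double_all(const std::vector<int>& in) {
  assert(in.size() <= static_cast<std::size_t>(INT_MAX));  // loop index is an int
  std::vector<int> out(in.size());
  const int n = static_cast<int>(in.size());
  SKETCH_PARALLEL_FOR_BEGIN(i, 0, n, 1)
    out[i] = in[i] * 2;
  SKETCH_PARALLEL_FOR_END()
  return out;
}
```

The calling code is identical in both configurations, which is the property the real macros are meant to provide.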

MLLM_CONDITIONAL_PARALLEL_FOR¶

MLLM_CONDITIONAL_PARALLEL_FOR chooses between parallel and serial execution of a loop at run time, based on a caller-supplied condition.

Syntax¶

MLLM_CONDITIONAL_PARALLEL_FOR(condition, num_threads, iter, start, end, step, body)

Parameters¶

condition: A boolean expression that determines whether to execute in parallel or serial mode. If true, parallel execution is used; if false, serial execution is used.
num_threads: The number of threads to use in parallel mode.
iter: The loop iterator variable name.
start: The starting value of the iterator.
end: The ending value of the iterator (exclusive).
step: The increment step for each iteration.
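body: The loop body code to execute in each iteration. It is received through the macro's variadic ... parameter, so commas inside the body do not break the macro call.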

Implementation Details¶

The macro is defined as follows:

#define MLLM_CONDITIONAL_PARALLEL_FOR(condition, num_threads, iter, start, end, step, ...) \
  do { \
    if (condition) { \
      MLLM_SET_NUM_THREADS(num_threads); \
      MLLM_AUTO_PARALLEL_FOR_BEGIN(iter, start, end, step){__VA_ARGS__} MLLM_AUTO_PARALLEL_FOR_END() \
    } else { \
      MLLM_SET_NUM_THREADS(1); \
      MLLM_SERIAL_FOR_BEGIN(iter, start, end, step){__VA_ARGS__} MLLM_SERIAL_FOR_END() \
    } \
  } while (0)

In the current implementation:

MLLM_SET_NUM_THREADS is a no-op macro that doesn’t actually set thread count
MLLM_AUTO_PARALLEL_FOR_BEGIN and MLLM_SERIAL_FOR_BEGIN both expand to simple for-loops
The key difference is in the intent: one path is meant for parallel execution and the other for serial

Usage Example¶

const bool use_parallel = options_.getThreads() > 1;
const int thread_count = options_.getThreads();

MLLM_CONDITIONAL_PARALLEL_FOR(use_parallel, thread_count, i, 0, N, 1, {
    // Loop body code
    process_element(i);
});

In this example:

If use_parallel is true (when options_.getThreads() > 1), the loop will execute with the specified number of threads
If use_parallel is false, the loop will execute serially with a single thread
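
For readers who want to compile the usage pattern outside the MLLM tree, here is a minimal self-contained sketch. MLLM_CONDITIONAL_PARALLEL_FOR is reproduced as listed in this section; the five helper macros are stand-ins written to match the behavior described under Implementation Details (a no-op thread setter and plain for-loops) and are assumptions, not copies of Parallel.hpp. The relu function and its thread_count parameter are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Stand-in helper macros (assumptions): no-op thread setter, plain for-loops.
#define MLLM_SET_NUM_THREADS(num_threads) ((void)(num_threads))
#define MLLM_AUTO_PARALLEL_FOR_BEGIN(it, start, end, step) \
  for (int it = (start); it < (end); it += (step)) {
#define MLLM_AUTO_PARALLEL_FOR_END() }
#define MLLM_SERIAL_FOR_BEGIN(it, start, end, step) \
  for (int it = (start); it < (end); it += (step)) {
#define MLLM_SERIAL_FOR_END() }

// The conditional macro as listed in this section.
#define MLLM_CONDITIONAL_PARALLEL_FOR(condition, num_threads, iter, start, end, step, ...) \
  do { \
    if (condition) { \
      MLLM_SET_NUM_THREADS(num_threads); \
      MLLM_AUTO_PARALLEL_FOR_BEGIN(iter, start, end, step){__VA_ARGS__} MLLM_AUTO_PARALLEL_FOR_END() \
    } else { \
      MLLM_SET_NUM_THREADS(1); \
      MLLM_SERIAL_FOR_BEGIN(iter, start, end, step){__VA_ARGS__} MLLM_SERIAL_FOR_END() \
    } \
  } while (0)

// Clamps negative values to zero. thread_count plays the role of options_.getThreads().
std::vector<float> relu(const std::vector<float>& x, int thread_count) {
  assert(thread_count >= 1);
  std::vector<float> out(x.size());
  const int n = static_cast<int>(x.size());
  const bool use_parallel = thread_count > 1;
  // The comma in the declaration below is not inside parentheses, so the
  // preprocessor splits the body into two arguments; __VA_ARGS__ rejoins them.
  MLLM_CONDITIONAL_PARALLEL_FOR(use_parallel, thread_count, i, 0, n, 1, {
    const float v = x[i], zero = 0.0f;
    out[i] = v > zero ? v : zero;
  });
  return out;
}
```

With these stand-ins both branches produce the same result; only a threaded backend would make the two paths differ in how the iterations are scheduled.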

Capture Mechanism¶

Since MLLM_CONDITIONAL_PARALLEL_FOR expands to standard for-loops, variable capture follows the standard C++ rules:

Direct Variable Access: Variables from the enclosing scope are directly accessible within the loop body
Non-const Access: Variables can be modified within the loop body (subject to their own const-ness)
No Special Capture Syntax: Unlike lambdas or blocks, there’s no explicit capture clause - all visible variables are accessible

This is different from lambda expressions or GCD blocks where explicit capture mechanisms are required. The macro simply generates regular for-loops, so standard C++ scoping and access rules apply.
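
The access rules above can be seen in a small sketch. SKETCH_FOR_BEGIN / SKETCH_FOR_END are hypothetical stand-ins that expand to a plain for-loop, as this section describes; scale_in_place is an illustrative function, not part of MLLM.

```cpp
#include <cassert>
#include <vector>

// Stand-in macros (assumption): a plain for-loop, as described in the text.
#define SKETCH_FOR_BEGIN(it, start, end, step) \
  for (int it = (start); it < (end); it += (step)) {
#define SKETCH_FOR_END() }

// Scales data in place and returns how many elements were visited.
int scale_in_place(std::vector<float>& data, float factor) {
  int visited = 0;  // ordinary local, modified directly inside the loop body
  const int n = static_cast<int>(data.size());
  SKETCH_FOR_BEGIN(i, 0, n, 1)
    data[i] *= factor;  // reads `factor`, writes `data`: no capture clause needed
    ++visited;          // safe here only because this loop runs sequentially
  SKETCH_FOR_END()
  assert(visited == n);
  return visited;
}
```

One caveat worth keeping in mind: if the same body ran under a threaded backend, the per-index writes to data would remain independent, but the shared visited counter would become a data race and would need a reduction or an atomic.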

Usage in CPU Kernels¶

CPU kernels use the Parallel API to parallelize operations across data elements. Here’s an example of how it’s used in the gelu activation function:

if (thread_cnt > 1) {
  MLLM_SET_NUM_THREADS(thread_cnt);
  int tails = N % 4;
  int loops = N - tails;
  MLLM_AUTO_PARALLEL_FOR_BEGIN(i, 0, loops, 4) {
    // Process 4 elements at a time in parallel
    float32x4_t x = vld1q_f32(X + i);
    // ... vectorized computations ...
    vst1q_f32(Z + i, result);
  }
  MLLM_AUTO_PARALLEL_FOR_END()
  // Handle remaining elements serially
  for (int i = loops; i < N; i++) {
    // ... scalar computations ...
  }
} else {
  // Serial execution
  // ... regular loop implementation ...
}

In this example:

If thread_cnt > 1, the kernel uses parallel execution
MLLM_SET_NUM_THREADS requests the desired number of threads (in the current implementation it is a no-op, as noted under Implementation Details)
MLLM_AUTO_PARALLEL_FOR_BEGIN and MLLM_AUTO_PARALLEL_FOR_END define the parallel loop section
Vectorized operations are performed on chunks of data (4 elements at a time for float32)
Remaining elements that don’t fit in chunks are handled serially
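
The split between a blocked main loop and a serial tail can be tried on any platform with the portable sketch below. It replaces the NEON intrinsics with a scalar inner loop and uses a hypothetical op() in place of the real activation math, so it shows only the index arithmetic, not the kernel itself.

```cpp
#include <cassert>
#include <vector>

// Hypothetical elementwise operation standing in for the real activation math.
static inline float op(float x) { return x * x; }

// Applies op() to every element: blocks of 4 first, then the leftover tail.
std::vector<float> apply_blocked(const std::vector<float>& X) {
  const int N = static_cast<int>(X.size());
  std::vector<float> Z(X.size());
  const int tails = N % 4;
  const int loops = N - tails;  // largest multiple of 4 that is <= N
  assert(loops % 4 == 0 && tails < 4);
  // Main part: each iteration owns elements [i, i + 4). In a kernel, this is
  // the loop that would be wrapped in the parallel macros and use SIMD.
  for (int i = 0; i < loops; i += 4) {
    for (int k = 0; k < 4; ++k) { Z[i + k] = op(X[i + k]); }
  }
  // Tail: at most 3 elements, starting where the blocked loop stopped.
  for (int i = loops; i < N; ++i) { Z[i] = op(X[i]); }
  return Z;
}
```

For N = 10, the main loop covers indices 0 to 7 and the tail covers 8 and 9.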

Another example from cast_types.cpp shows how to use the parallel macros with conditional handling:

if (thread_count > 1) {
  MLLM_SET_NUM_THREADS(thread_count);
  MLLM_AUTO_PARALLEL_FOR_BEGIN(i, 0, len, 4)
  int remain = len - i;
  if (remain >= 4) {
    int32x4_t v32_src = vld1q_s32(src + i);
    vst1q_f32(dst + i, vcvtq_f32_s32(v32_src));
  } else {
    for (int j = i; j < len; j++) { dst[j] = (mllm_fp32_t)src[j]; }
  }
  MLLM_AUTO_PARALLEL_FOR_END();
} else {
  // Serial implementation
}
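
This second pattern differs from the gelu one in that the partial block is handled inside the last loop iteration instead of in a separate tail loop. A portable sketch of the same control flow, with scalar code in place of the NEON intrinsics and a hypothetical function name, might look like this:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Converts int32 to float in steps of 4. Only the final iteration can see
// fewer than 4 elements, and it falls back to an element-wise loop.
std::vector<float> cast_i32_to_f32(const std::vector<int32_t>& src) {
  const int len = static_cast<int>(src.size());
  std::vector<float> dst(src.size());
  for (int i = 0; i < len; i += 4) {  // the loop a kernel would hand to the parallel macros
    const int remain = len - i;
    assert(remain > 0);
    if (remain >= 4) {
      // Full block: a NEON build would load, convert, and store 4 lanes at once.
      for (int k = 0; k < 4; ++k) { dst[i + k] = static_cast<float>(src[i + k]); }
    } else {
      // Partial block: 1 to 3 trailing elements.
      for (int j = i; j < len; ++j) { dst[j] = static_cast<float>(src[j]); }
    }
  }
  return dst;
}
```

Because every iteration touches only its own block of indices, the iterations stay independent even when the last one takes the scalar path.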

CMake Thread Configuration¶

MLLM provides several CMake options to configure threading support:

Threading Options¶

MLLM_KERNEL_USE_THREADS (default: ON) - Enable or disable threading support entirely
MLLM_KERNEL_THREADS_VENDOR_OPENMP (default: ON) - Enable OpenMP threading
MLLM_KERNEL_THREADS_VENDOR_APPLE_GCD (default: OFF) - Enable Apple Grand Central Dispatch threading

Platform-Specific Configuration¶

Apple Platforms¶

On Apple platforms (macOS, iOS), MLLM supports both OpenMP and GCD threading models:

Example CMake configuration for Apple platforms¶

-DMLLM_KERNEL_USE_THREADS=ON
-DMLLM_KERNEL_THREADS_VENDOR_OPENMP=ON
-DMLLM_KERNEL_THREADS_VENDOR_APPLE_GCD=OFF

If both OpenMP and GCD are enabled, GCD takes precedence and a warning message is emitted.

Non-Apple Platforms¶

On non-Apple platforms, OpenMP is typically used:

Example CMake configuration for non-Apple platforms¶

-DMLLM_KERNEL_USE_THREADS=ON
-DMLLM_KERNEL_THREADS_VENDOR_OPENMP=ON

Best Practices¶

Conditional Parallelization: Only use parallel execution when there’s enough work to justify the overhead:

if (thread_count > 1 && len > 1024 * 4) {
  // Parallel implementation
} else {
  // Serial implementation
}

Proper Chunking: Divide work into appropriately sized chunks for better load balancing. The rounding step below assumes lanes is a power of two:

size_t chunk_size = (vec_size + thread_count - 1) / thread_count;
chunk_size = (chunk_size + lanes - 1) & ~(lanes - 1);

Handling Remainders: Always handle data that doesn’t fit evenly into vectorized chunks:

// Process main chunks in parallel
MLLM_AUTO_PARALLEL_FOR_BEGIN(i, 0, vec_size, lanes) {
  // Vectorized operations
}
MLLM_AUTO_PARALLEL_FOR_END()

// Handle remainder elements serially
if (vec_size < size) {
  // Process remaining elements
}
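
The three practices above can be combined in one place. The following is a minimal sketch under stated assumptions: chunk_size_for and add_one are hypothetical helpers, the 1024 * 4 cutoff is simply the value from the first snippet, and the outer loop over workers runs sequentially here where a kernel would hand it to a threaded backend.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kMinParallelLen = 1024 * 4;  // cutoff taken from the snippet above

// Per-worker share of vec_size, rounded up to a multiple of lanes.
// The bit trick is only valid when lanes is a power of two.
std::size_t chunk_size_for(std::size_t vec_size, std::size_t workers, std::size_t lanes) {
  assert(workers > 0);
  assert(lanes > 0 && (lanes & (lanes - 1)) == 0);
  const std::size_t chunk = (vec_size + workers - 1) / workers;  // ceiling division
  return (chunk + lanes - 1) & ~(lanes - 1);                     // round up to lanes
}

// Adds 1.0f to every element: chunked main part, then the remainder.
void add_one(std::vector<float>& data, std::size_t thread_count, std::size_t lanes) {
  assert(lanes > 0);
  const std::size_t size = data.size();
  // Practice 1: fall back to a single worker when the input is small.
  const std::size_t workers =
      (thread_count > 1 && size > kMinParallelLen) ? thread_count : 1;
  const std::size_t vec_size = size - size % lanes;  // part that fills whole lanes
  // Practice 2: lane-aligned chunk per worker.
  const std::size_t chunk = chunk_size_for(vec_size, workers, lanes);
  for (std::size_t t = 0; t < workers; ++t) {  // each t is an independent unit of work
    const std::size_t begin = std::min(t * chunk, vec_size);
    const std::size_t end = std::min(begin + chunk, vec_size);
    for (std::size_t i = begin; i < end; ++i) { data[i] += 1.0f; }
  }
  // Practice 3: elements past vec_size are handled serially.
  for (std::size_t i = vec_size; i < size; ++i) { data[i] += 1.0f; }
}
```

As a worked case, 1000 elements split across 4 workers with 4 lanes gives a raw share of 250, which rounds up to 252; the last worker then receives the shorter range that remains.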

Conclusion¶

The Parallel API in MLLM provides a flexible and portable way to parallelize CPU kernel operations. Through CMake configuration options, developers can choose the appropriate threading backend for their platform while the API abstracts away the implementation details. CPU kernels can leverage these macros to achieve better performance on multi-core systems while maintaining code clarity and portability.