Parallel API and Thread Configuration in MLLM¶
Introduction¶
MLLM provides a flexible parallel computing framework that allows CPU kernels to utilize multiple threads for improved performance. The parallel execution is controlled through the Parallel API defined in mllm/core/Parallel.hpp and configured via CMake build options.
Parallel API¶
The Parallel API provides macros for parallel execution that abstract away the underlying threading implementation. The API automatically selects the appropriate threading backend based on build configuration:
- **Apple Grand Central Dispatch (GCD)** - Used on Apple platforms when `MLLM_KERNEL_THREADS_VENDOR_APPLE_GCD` is enabled
- **OpenMP** - Used on most platforms when `MLLM_KERNEL_THREADS_VENDOR_OPENMP` is enabled
- **Sequential execution** - Used when threading is disabled
Key Parallel API Macros¶
- `MLLM_AUTO_PARALLEL_BEGIN(__iter__, __num__)` / `MLLM_AUTO_PARALLEL_END()` - Execute a loop with `__num__` iterations in parallel
- `MLLM_AUTO_PARALLEL_FOR_BEGIN(__iter__, __start__, __end__, __step__)` / `MLLM_AUTO_PARALLEL_FOR_END()` - Execute a for-loop in parallel
- `MLLM_SET_NUM_THREADS(num_threads)` - Set the number of threads to use for parallel execution
- `MLLM_SERIAL_FOR_BEGIN(__iter__, __start__, __end__, __step__)` / `MLLM_SERIAL_FOR_END()` - Execute a serial for-loop
- `MLLM_CONDITIONAL_PARALLEL_FOR(condition, num_threads, iter, start, end, step, ...)` - Conditionally execute a loop in parallel or serial mode
MLLM_CONDITIONAL_PARALLEL_FOR¶
MLLM_CONDITIONAL_PARALLEL_FOR is a macro that selects between parallel and serial execution at runtime based on a boolean condition, letting a single call site cover both modes.
Syntax¶
MLLM_CONDITIONAL_PARALLEL_FOR(condition, num_threads, iter, start, end, step, body)
Parameters¶
- `condition`: A boolean expression that determines whether to execute in parallel or serial mode. If true, parallel execution is used; if false, serial execution is used.
- `num_threads`: The number of threads to use in parallel mode.
- `iter`: The loop iterator variable name.
- `start`: The starting value of the iterator.
- `end`: The ending value of the iterator (exclusive).
- `step`: The increment step for each iteration.
- `body`: The loop body code to execute in each iteration.
Implementation Details¶
The macro is defined as follows:
#define MLLM_CONDITIONAL_PARALLEL_FOR(condition, num_threads, iter, start, end, step, ...) \
do { \
if (condition) { \
MLLM_SET_NUM_THREADS(num_threads); \
MLLM_AUTO_PARALLEL_FOR_BEGIN(iter, start, end, step){__VA_ARGS__} MLLM_AUTO_PARALLEL_FOR_END() \
} else { \
MLLM_SET_NUM_THREADS(1); \
MLLM_SERIAL_FOR_BEGIN(iter, start, end, step){__VA_ARGS__} MLLM_SERIAL_FOR_END() \
} \
} while (0)
In the current implementation:
- MLLM_SET_NUM_THREADS is a no-op macro that doesn’t actually set thread count
- MLLM_AUTO_PARALLEL_FOR_BEGIN and MLLM_SERIAL_FOR_BEGIN both expand to simple for-loops
- The key difference is in the intent - one path is meant for parallel execution and the other for serial
Usage Example¶
const bool use_parallel = options_.getThreads() > 1;
const int thread_count = options_.getThreads();
MLLM_CONDITIONAL_PARALLEL_FOR(use_parallel, thread_count, i, 0, N, 1, {
// Loop body code
process_element(i);
});
In this example:
- If use_parallel is true (when options_.getThreads() > 1), the loop will execute with the specified number of threads
- If use_parallel is false, the loop will execute serially with a single thread
Capture Mechanism¶
Since MLLM_CONDITIONAL_PARALLEL_FOR expands to standard for-loops, variable capture follows the standard C++ rules:
- **Direct Variable Access**: Variables from the enclosing scope are directly accessible within the loop body
- **Non-const Access**: Variables can be modified within the loop body (subject to their own const-ness)
- **No Special Capture Syntax**: Unlike lambdas or blocks, there's no explicit capture clause - all visible variables are accessible
This is different from lambda expressions or GCD blocks where explicit capture mechanisms are required. The macro simply generates regular for-loops, so standard C++ scoping and access rules apply.
Usage in CPU Kernels¶
CPU kernels use the Parallel API to parallelize operations across data elements. Here’s an example of how it’s used in the gelu activation function:
if (thread_cnt > 1) {
MLLM_SET_NUM_THREADS(thread_cnt);
int tails = N % 4;
int loops = N - tails;
MLLM_AUTO_PARALLEL_FOR_BEGIN(i, 0, loops, 4) {
// Process 4 elements at a time in parallel
float32x4_t x = vld1q_f32(X + i);
// ... vectorized computations ...
vst1q_f32(Z + i, result);
}
MLLM_AUTO_PARALLEL_FOR_END()
// Handle remaining elements serially
  for (int i = loops; i < N; i++) {
// ... scalar computations ...
}
} else {
// Serial execution
// ... regular loop implementation ...
}
In this example:
- If `thread_cnt > 1`, the kernel uses parallel execution
- `MLLM_SET_NUM_THREADS` sets the desired number of threads
- `MLLM_AUTO_PARALLEL_FOR_BEGIN` and `MLLM_AUTO_PARALLEL_FOR_END` define the parallel loop section
- Vectorized operations are performed on chunks of data (4 elements at a time for float32)
- Remaining elements that don't fit in chunks are handled serially
Another example from cast_types.cpp shows how to use the parallel macros with conditional handling:
if (thread_count > 1) {
MLLM_SET_NUM_THREADS(thread_count);
  MLLM_AUTO_PARALLEL_FOR_BEGIN(i, 0, len, 4) {
    int remain = len - i;
    if (remain >= 4) {
      int32x4_t v32_src = vld1q_s32(src + i);
      vst1q_f32(dst + i, vcvtq_f32_s32(v32_src));
    } else {
      for (int j = i; j < len; j++) { dst[j] = (mllm_fp32_t)src[j]; }
    }
  }
  MLLM_AUTO_PARALLEL_FOR_END()
} else {
// Serial implementation
}
CMake Thread Configuration¶
MLLM provides several CMake options to configure threading support:
Threading Options¶
- `MLLM_KERNEL_USE_THREADS` (default: ON) - Enable or disable threading support entirely
- `MLLM_KERNEL_THREADS_VENDOR_OPENMP` (default: ON) - Enable OpenMP threading
- `MLLM_KERNEL_THREADS_VENDOR_APPLE_GCD` (default: OFF) - Enable Apple Grand Central Dispatch threading
Platform-Specific Configuration¶
Apple Platforms¶
On Apple platforms (macOS, iOS), MLLM supports both OpenMP and GCD threading models:
-DMLLM_KERNEL_USE_THREADS=ON
-DMLLM_KERNEL_THREADS_VENDOR_OPENMP=ON
-DMLLM_KERNEL_THREADS_VENDOR_APPLE_GCD=OFF
If both OpenMP and GCD are enabled, GCD takes precedence with a warning message.
Non-Apple Platforms¶
On non-Apple platforms, OpenMP is typically used:
-DMLLM_KERNEL_USE_THREADS=ON
-DMLLM_KERNEL_THREADS_VENDOR_OPENMP=ON
Best Practices¶
Conditional Parallelization: Only use parallel execution when there’s enough work to justify the overhead:
if (thread_count > 1 && len > 1024 * 4) {
  // Parallel implementation
} else {
  // Serial implementation
}
Proper Chunking: Divide work into appropriately sized chunks for better load balancing:
size_t chunk_size = (vec_size + thread_count - 1) / thread_count;
chunk_size = (chunk_size + lanes - 1) & ~(lanes - 1);
Handling Remainders: Always handle data that doesn’t fit evenly into vectorized chunks:
// Process main chunks in parallel
MLLM_AUTO_PARALLEL_FOR_BEGIN(i, 0, vec_size, lanes) {
  // Vectorized operations
}
MLLM_AUTO_PARALLEL_FOR_END()
// Handle remainder elements serially
if (vec_size < size) {
  // Process remaining elements
}
Conclusion¶
The Parallel API in MLLM provides a flexible and portable way to parallelize CPU kernel operations. Through CMake configuration options, developers can choose the appropriate threading backend for their platform while the API abstracts away the implementation details. CPU kernels can leverage these macros to achieve better performance on multi-core systems while maintaining code clarity and portability.