Efficient Parallel Prefix Sum in Metal for Apple M1

Comparison of optimal M1 GPU scan primitives to vectorized CPU performance

Matthew Kieber-Emmons
Better Programming


In my last story, Optimizing Parallel Reduction in Metal for Apple M1, we explored several strategies for maximizing throughput in parallel reduce on the M1 processor. Given the low arithmetic intensity of reduction, we could write kernels that ran at the memory bandwidth limit regardless of whether they were tuned to maximize parallelism or cost efficiency. This story explores a more complex optimization problem: parallel prefix sum.

All-prefix-sums, or inclusive “scan,” is a common data-parallel primitive that finds use in sorting, stream compaction, and multi-precision arithmetic, among many other applications. The operation is so common that it is in the C++ STL and is built into several languages. The scan operation takes a sequence of values [x0, x1, …, x(n–1)] and applies a binary operator ⊕ to yield a second sequence [x0, (x0⊕x1), …, (x0⊕x1⊕…⊕x(n–1))]. For example, a scan with the binary operator plus on the sequence [0,1,0,0,1,0,1] yields the cumulative sums [0,1,1,1,2,2,3].

Comparison of inclusive and exclusive scan. Image © Matthew Kieber-Emmons. All rights reserved.

A related and perhaps more common operation, called “prescan” or exclusive prefix sum, shifts the inclusive sequence one to the right and inserts an identity value I at index 0 (I = 0 for the plus operator): [I, x0, (x0⊕x1), …, (x0⊕x1⊕…⊕x(n–2))]. The sequential scan is trivially computed by result[i] = input[i] ⊕ result[i–1]. Thus, the total number of operations required is n–1, where n is the number of values in the sequence, and the scan runs in O(n) time.

While scan appears inherently sequential because each value depends on a prior value, there is a significant pool of literature on the parallelization of this primitive. Hillis and Steele reported an early algorithm based on the Kogge-Stone adder that, despite requiring more additions than the sequential algorithm, maps well onto the simdgroup (warp) execution model of modern GPUs because of its step-efficiency. Blelloch reported a work-efficient tree-based algorithm based on the Brent-Kung adder that was commonly used prior to the rise in popularity of SIMD execution models on GPUs.

Optimization considerations

In the prior story on optimization of parallel reduction on M1, we used ideas originally implemented in CUDA (and OpenCL) to write kernels in Metal that, when applied to a tile of values (the local optimization problem), were optimized for cost efficiency, parallelism, or somewhere in between. Combining the partial results from all the individual tiles (the global optimization problem) was no contest: atomic operations on the M1 GPU are very fast, so for 32-bit integers one should always update the result using the built-in atomic functions. For types (or operations) that cannot be updated atomically, a multi-level strategy is required. The unified memory of M1 means that the final level can be performed on the CPU. Regardless, these programs all ran at the memory bandwidth limit of M1. The kernels could only beat the performance of vectorized code on the M1 CPU cores at quite large data sizes (>16M values), and even then only barely, so there is really no need to perform reduction on the M1 GPU.
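For reference, the 32-bit path mentioned above boils down to a single relaxed atomic add per threadgroup. A minimal sketch, assuming a result buffer bound as device atomic_uint* (the output and total names are illustrative placeholders, not taken from the original kernels):

// Hedged sketch: after a threadgroup reduces its tile into total, one thread
// folds it into the global result using Metal's built-in 32-bit atomics.
// output: device atomic_uint* result buffer
if (local_id == 0) {
    atomic_fetch_add_explicit(&output[0], total, memory_order_relaxed);
}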

Reduction and scan operations applied to a tile of values within a single threadgroup can be optimized to maximize parallelism, maximize efficiency, or balance the two extremes. Maximization of parallelism may lead to faster total turnaround times on smaller datasets, while the best throughput and energy use will be achieved by maximizing efficiency. Image © Matthew Kieber-Emmons. All rights reserved.

In this article, we will take the concepts from reduction and apply them to the scan operation. When the number of elements is much greater than the number of processors, three optimization levels must be considered: i) the global algorithm, ii) the local or threadgroup algorithm, and iii) the execution unit algorithm. Considering the global algorithm first, one major difference between modern CUDA devices and Metal on the M1 is that the only global memory synchronization available beyond the kernel submission boundary is via atomic functions, which are limited to 32-bit signed and unsigned integers and certain operations. Moreover, no guarantees are made regarding the behavior of the threadgroup scheduler in Metal, so these atomics cannot be used to implement any sort of locking on M1. The behavior differs on previous device architectures, where technically undefined behavior could be used to complete tiles in sequence, presumably because of differences in the eviction rules for stalled threadgroups on those architectures.

Thus, single-pass or “chained” scans like StreamScan or Decoupled Lookback are not possible on M1, which is unfortunate because these algorithms only require ~2n global data movement (n reads and n writes). *Note: I tried very hard to get these to work, and it's just not possible. I had a brief conversation with a very helpful Apple Engineer on the developer forums if you are interested in the insights.

Consequently, we are resigned to multi-level algorithms for scan on M1, which are slower both in theory and in practice. The classic global prefix sum algorithm, “scan-then-propagate,” requires three phases. In the first, upsweep phase, the values are scanned by tile, each tile assigned to a threadgroup, and the inclusive sum of each tile is stored to an auxiliary partial sum array. Next, the auxiliary array is scanned in place. Finally, the scanned auxiliary values are uniformly added back to each tile, back down the tree, to complete the operation. When the auxiliary array is large, this may require multiple recursive levels of scan.

Device (global) data flow of scan-then-propagate global parallel prefix sum algorithm. This algorithm requires one value per threadgroup of intermediate storage, and optionally n-levels of recursion if the intermediate vector cannot be scanned in a single threadgroup. The algorithm requires ~4n global memory accesses (n read and n write in both scan and uniform add step plus scan of auxiliary array). Image © Matthew Kieber-Emmons. All rights reserved.

Given that scan-then-propagate requires a read and a write to global memory in both the scan and propagate phases, the algorithm requires ~4n global memory accesses, not including the scan of the auxiliary storage array or any recursion within that step. This strategy was used in the original Apple OpenCL Parallel Prefix Scan example. Optimizing this algorithm yields a “reduce-then-scan” strategy in which the initial tiles are simply reduced into a partial sum array rather than scanned and written to device (global) memory. Writing only the reduction value eliminates one write to device memory, resulting in an overall process that requires ~3n global memory accesses. Based on these considerations, a reduce-then-scan algorithm will be the fastest global strategy on M1.

Since, ultimately, we wish to generate scan programs that operate at the bandwidth limit of the device, we need to understand what that upper bound will be. In the last article, we benchmarked the memory bandwidth of the M1 GPU and found a memory bandwidth of 58 GB/sec. This means that with the ~3n global memory accesses in the reduce-then-scan strategy, our maximum scan rate will be ~4.8 billion 32-bit values per second (58 GB/sec ÷ 3 global memory accesses = 19.3 GB/sec; 4 bytes per value yields 4.8 G values/sec). Our actual rate will be slightly less as some time will be lost scanning the intermediate storage.

A second noteworthy optimization consideration is at the execution unit (SIMD-group) level. Metal on M1 offers a handful of built-in SIMD-group scan functions. These functions take the form T simd_prefix_exclusive_sum(T data) and are also available in variants for inclusive sums and product scans. While the implementations of these are not published, they are likely optimal, and so we will use them in this story. It is worthwhile, nonetheless, to briefly consider how to write an optimal execution unit (SIMD-group) scan in case other scan variants (min/max, for example) are needed.

Because the threads of an execution unit run in lockstep, no new work can be scheduled on stalled threads until all threads in the execution unit have completed. Thus, step efficiency is more important than work efficiency, and step-efficient scan networks like Hillis-Steele will be the fastest algorithms within a SIMD-group, so there is nothing to optimize here either. A starting point for doing this with shuffle intrinsics looks like the following:

// Hillis-Steele (Kogge-Stone) inclusive sum scan across a 32-wide SIMD-group.
// lid is the lane index within the SIMD-group ([[thread_index_in_simdgroup]]).
T other = simd_shuffle_up(value, 1);
if (lid >= 1) value += other;
other = simd_shuffle_up(value, 2);
if (lid >= 2) value += other;
other = simd_shuffle_up(value, 4);
if (lid >= 4) value += other;
other = simd_shuffle_up(value, 8);
if (lid >= 8) value += other;
other = simd_shuffle_up(value, 16);
if (lid >= 16) value += other;
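If one of the non-sum variants mentioned above is needed, the same network applies with the binary operator swapped. For example, a hedged sketch of an inclusive max scan over a 32-wide SIMD-group (not from the original gist):

// Same Kogge-Stone pattern as above; only the binary operator changes.
T other = simd_shuffle_up(value, 1);
if (lid >= 1) value = max(value, other);
other = simd_shuffle_up(value, 2);
if (lid >= 2) value = max(value, other);
other = simd_shuffle_up(value, 4);
if (lid >= 4) value = max(value, other);
other = simd_shuffle_up(value, 8);
if (lid >= 8) value = max(value, other);
other = simd_shuffle_up(value, 16);
if (lid >= 16) value = max(value, other);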

A reduce-then-scan strategy also means we get to reuse the threadgroup-local reduction code we optimized in the last article, along with the insights therein. Thus, our optimization problem focuses only on the threadgroup-local algorithm. We will assess three kernels that i) expose maximum parallelism, ii) minimize cost, or iii) split the difference. The two adjustable parameters in the kernels are the values per thread and the threads per threadgroup. We will compare how we do on the M1 GPU to vectorized code using NEON intrinsics on the M1 CPU. While we do expect our GPU scan to be faster, the performance cores of the M1 CPU are stellar, so we may be surprised.

Threadgroup Implementation

Since our previous article optimized reduction on M1 and our overall global reduction strategy will be reduce-then-scan, we will focus this section on the implementation of the scan kernels and just give a few details about how the previous reduction code was made more general purpose.

Load and store. We modify our initial per-thread reduction to enable reuse of the blocked load function for both scan and reduction. Instead of reducing during the load, we load all values from device (global) memory into a fixed-size stack array in thread memory. We also added store functions to write the local array back to global memory (only the unsafe versions are shown for simplicity):

template<ushort GRAIN_SIZE, typename T> static void
LoadBlockedLocalFromGlobal(thread T (&value)[GRAIN_SIZE],
                           const device T* input_data,
                           const ushort local_id) {
  for (ushort i = 0; i < GRAIN_SIZE; i++) {
    value[i] = input_data[local_id * GRAIN_SIZE + i];
  }
}

template<ushort GRAIN_SIZE, typename T> static void
StoreBlockedLocalToGlobal(device T *output_data,
                          thread const T (&value)[GRAIN_SIZE],
                          const ushort local_id) {
  for (ushort i = 0; i < GRAIN_SIZE; i++) {
    output_data[local_id * GRAIN_SIZE + i] = value[i];
  }
}

In Metal, dynamically indexing into stack arrays is a big performance no-no; however, in this case the compiler unrolls the loop because GRAIN_SIZE is known at compile time, so the array is effectively statically indexed and the load/store remains fast. This array is then scanned (or reduced) by each thread in a second step, which does increase our register usage, but is inconsequential to the overall occupancy of the kernel on the GPU at moderate GRAIN_SIZE.

Scan kernel. Turning our attention back to the scan operation, our overall strategy will be to i) load from global memory into registers, ii) scan the registers by thread and store the per-thread aggregate value, iii) scan the aggregate values with either maximum efficiency, maximum parallelism, or a balance between the two; iv) uniformly propagate the scanned aggregate values back to the thread values; and v) store from registers back to global memory. To simplify this story, we will focus on prefix sums as opposed to other binary operations; see the prior story for ideas on how to generalize. Our kernel function is thus as follows:

template<ushort BLOCK_SIZE, ushort GRAIN_SIZE, typename T>
kernel void
PrefixScanAddPrefixKernel(device T* output_data,
                          device const T* input_data,
                          constant uint& n,
                          device T* partial_sums,
                          uint group_id [[threadgroup_position_in_grid]],
                          ushort local_id [[thread_position_in_threadgroup]]) {
  // load values into registers
  uint base_id = group_id * BLOCK_SIZE * GRAIN_SIZE;
  T values[GRAIN_SIZE];
  LoadBlockedLocalFromGlobal(values, &input_data[base_id], local_id);
  // sequentially scan the values in registers in place
  T aggregate = ThreadPrefixExclusiveSum<GRAIN_SIZE>(values);
  // scan the aggregates
  T prefix = 0;
  threadgroup T scratch[BLOCK_SIZE];
  switch (LOCAL_ALGORITHM) {
    case 0:
      prefix = ThreadgroupBlellochPrefixExclusiveSum<BLOCK_SIZE,T>
                 (aggregate, scratch, local_id);
      break;
    case 1:
      prefix = ThreadgroupRakingPrefixExclusiveSum<BLOCK_SIZE,T>
                 (aggregate, scratch, local_id);
      break;
    case 2:
      prefix = ThreadgroupCooperativePrefixExclusiveSum<BLOCK_SIZE,T>
                 (aggregate, scratch, local_id);
      break;
  }
  // optionally store or load the inclusive sum as needed
  switch (GLOBAL_ALGORITHM) {
    case 0: // no op
      break;
    case 1: // store this tile's inclusive sum for a later pass
      if (local_id == BLOCK_SIZE - 1)
        partial_sums[group_id] = aggregate + prefix;
      threadgroup_barrier(mem_flags::mem_none);
      break;
    case 2: // add the prefix computed for this tile in a previous pass
      if (local_id == 0)
        scratch[0] = partial_sums[group_id];
      threadgroup_barrier(mem_flags::mem_threadgroup);
      prefix += scratch[0];
      break;
  }
  // uniformly add the threadgroup prefix to the per-thread scanned values
  ThreadUniformAdd<GRAIN_SIZE>(values, prefix);
  // store to global
  StoreBlockedLocalToGlobal(&output_data[base_id], values, local_id);
}

#if defined(THREADS_PER_THREADGROUP) && defined(VALUES_PER_THREAD)
template [[host_name("prefix_exclusive_scan_load_sum_uint32")]] kernel void
PrefixScanAddPrefixKernel<THREADS_PER_THREADGROUP,VALUES_PER_THREAD>(device uint*, device const uint*, constant uint&, device uint*, uint, ushort);
#endif
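The kernel above also relies on per-thread helpers, ThreadPrefixExclusiveSum and ThreadUniformAdd, operating on the register array; only the threadgroup-memory versions appear later in this story, so here is a minimal sketch of what the thread-address-space overloads in the gist presumably look like:

// scan a per-thread register array in place and return its aggregate (total)
template<ushort GRAIN_SIZE, typename T> static inline T
ThreadPrefixExclusiveSum(thread T (&values)[GRAIN_SIZE]) {
  T sum = 0;
  for (ushort i = 0; i < GRAIN_SIZE; i++) {
    T v = values[i];
    values[i] = sum;   // exclusive scan in place
    sum += v;
  }
  return sum;          // the thread's inclusive total
}
// uniformly add a prefix to every element of a per-thread register array
template<ushort GRAIN_SIZE, typename T> static inline void
ThreadUniformAdd(thread T (&values)[GRAIN_SIZE], T uni) {
  for (ushort i = 0; i < GRAIN_SIZE; i++) {
    values[i] += uni;
  }
}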

This kernel uses function constants during pipeline generation to choose which execution path will be used. Thus, in the Metal shader code, we set constant int LOCAL_ALGORITHM [[function_constant(0)]];, which allows us to tune our strategy at pipeline compilation time without an explosion of template instantiations at library compile time, and significantly reduces the size of our Metal library.
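For completeness, the selector constants referenced by the kernel might be declared at the top of the shader file roughly as follows (the index chosen for GLOBAL_ALGORITHM here is an assumption):

constant int LOCAL_ALGORITHM [[function_constant(0)]];
constant int GLOBAL_ALGORITHM [[function_constant(1)]]; // assumed index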

Maximum parallelism scan algorithm (cooperative scan)

To get the fastest turnaround time, but not necessarily the best overall throughput, we implement an algorithm that is step efficient at the expense of doing extra work. In the upsweep phase, the threads of each SIMD-group cooperatively scan their values in place, storing each SIMD-group's inclusive sum in shared memory. Next, the shared memory is scanned by SIMD-group 0. Finally, the downsweep phase uniformly adds the accumulated values back to the per-thread scanned values. This algorithm exposes the maximal amount of parallelism because all threads are used in both the initial scan and the uniform add step.

Simplified cooperative scan algorithm for a 32-thread threadgroup and 8-thread execution unit (SIMD-group). This algorithm requires (threads per threadgroup) / (execution width) shared memory. Most work is performed cooperatively in registers. Blue squares represent thread (register/local) variables and yellow squares represent threadgroup (shared) variables. Image © Matthew Kieber-Emmons. All rights reserved.

Despite this algorithm being a scan-then-add data flow, it is more step efficient than a reduce-then-scan version because reducing and scanning within a single execution unit take the same number of steps. Assuming an execution width of 32, scan-then-add in registers takes five steps for the initial scan (Hillis-Steele is log2(n) steps), a copy into shared memory, five steps to scan the shared memory, and a uniform add in a single step (~12 steps total). In a reduce-then-scan version, the single-step uniform add would be replaced with a scan, increasing the overall step count by 5. Although this algorithm is step efficient, it is not work efficient. Recall that the sequential algorithm requires n–1 additions, whereas the Hillis-Steele network requires n×(log2(n)–1)+1 additions, O(n log n) overall. Applied per 32-wide SIMD-group to 1024 values, that works out to roughly 4× more additions than the sequential algorithm, but ~85× fewer steps, and consequently a major win.

template<ushort BLOCK_SIZE, typename T> static T
ThreadgroupCooperativePrefixExclusiveSum(T value,
                                         threadgroup T* sdata,
                                         const ushort lid) {
  // first level: scan within each simdgroup
  T scan = simd_prefix_exclusive_sum(value);
  // the last lane of each simdgroup stores its inclusive sum into sdata[0...31]
  if ((lid % 32) == (32 - 1)) {
    sdata[lid / 32] = scan + value;
  }
  threadgroup_barrier(mem_flags::mem_threadgroup);
  // scan the simdgroup sums in the first simdgroup
  if (lid < 32) {
    sdata[lid] = simd_prefix_exclusive_sum(sdata[lid]);
  }
  threadgroup_barrier(mem_flags::mem_threadgroup);
  // return the sum of both scans
  return scan + sdata[lid / 32];
}

Balanced cost and parallelism prefix scan (Blelloch scan)

In 1990, Blelloch reported a work-efficient scan algorithm. The algorithm is based on a binary tree, which is reduced in an upsweep phase, after which the last value is cleared and the prefix sum is generated by distributing the partial sums in a downsweep phase. Here is a graphical representation of the algorithm:

Simplified work efficient Blelloch scan algorithm for a 16-thread threadgroup. All work is performed in shared memory and consequently this algorithm requires n-dim shared memory. Blue squares represent thread (register/local) variables and yellow squares represent threadgroup (shared) variables. Image © Matthew Kieber-Emmons. All rights reserved.

The sequential algorithm requires n–1 additions. This algorithm is nearly as work efficient, requiring only 2×(n–1) additions and n–1 swaps. It does this at the expense of 2×log2(n) steps and requires execution barriers at each step; consequently, this algorithm is not step efficient. The indexing is tricky, so to make this work we limit threadgroup sizes to powers of two. Note that while the code below uses loops and indexing for simplicity, it is much less efficient than the unrolled version given in the gist.

// Work efficient exclusive scan from Blelloch 1990
// This is the UNOPTIMIZED version adapted from GPU Gems 3 Chpt 39
template<ushort BLOCK_SIZE, typename T> static T
ThreadgroupUnoptimizedBlellochPrefixExclusiveSum(T value,
                                                 threadgroup T* sdata,
                                                 const ushort lid) {
  // load input into shared memory
  sdata[lid] = value;
  threadgroup_barrier(mem_flags::mem_threadgroup);
  // build the sum in place up the tree
  ushort offset = 1;
  for (ushort n = BLOCK_SIZE / 2; n > 0; n /= 2) {
    if (lid < n) {
      const ushort ai = offset * (2 * lid + 1) - 1;
      const ushort bi = offset * (2 * lid + 2) - 1;
      sdata[bi] += sdata[ai];
    }
    offset *= 2;
    threadgroup_barrier(mem_flags::mem_threadgroup);
  }

  // clear the last element
  if (lid == 0) { sdata[BLOCK_SIZE - 1] = 0; }
  threadgroup_barrier(mem_flags::mem_threadgroup);

  // traverse down the tree building the scan in place
  for (ushort n = 1; n < BLOCK_SIZE; n *= 2) {
    offset /= 2;
    threadgroup_barrier(mem_flags::mem_threadgroup);
    if (lid < n) {
      const ushort ai = offset * (2 * lid + 1) - 1;
      const ushort bi = offset * (2 * lid + 2) - 1;
      T temp = sdata[ai]; sdata[ai] = sdata[bi]; sdata[bi] += temp;
    }
  }

  // return result
  threadgroup_barrier(mem_flags::mem_threadgroup);
  return sdata[lid];
}

Due to the need for execution barriers at each step, this method is rarely used on modern devices, because the slight work inefficiency of SIMD-group intrinsics still vastly outpaces stalled threads. Moreover, on certain devices, avoiding shared memory bank conflicts is a major issue that is overcome with more complex indexing schemes, which add work and steps and decrease the efficiency of the algorithm. According to Apple, bank conflicts are not a major issue on M1, so this algorithm is worth benchmarking.

Maximally cost-efficient prefix scan (raking algorithm)

Although we are calling the previous local algorithm the Blelloch work efficient scan, it should be noted that Blelloch also came up with the idea of SIMD-group raking. The purpose of a raking algorithm is to more efficiently use global compute resources by maximizing the amount of sequential work done in a given threadgroup. This is typically accomplished by a single execution unit (SIMD-group) doing a large fraction of the work while the rest of the threads are idle or can exit the kernel early. This idea is related to Brent’s theorem, which overcomes fixed overhead of computations by increasing the work done by a given thread. A graphical depiction of the algorithm is as follows:

Simplified work efficient raking scan algorithm for a 32-thread threadgroup and 8-thread execution unit (SIMD-group). This algorithm requires n-dim shared memory. Blue squares represent thread (register/local) variables and yellow squares represent threadgroup (shared) variables. Image © Matthew Kieber-Emmons. All rights reserved.

Note that since only a single simdgroup does the work, this algorithm may have a slightly slower turnaround time for a given threadgroup, but it uses computational resources much more efficiently because the threads of simdgroup 0 are effectively scanning their values serially. Each raking thread is assigned x values and performs an optimal x–1 additions sequentially in O(x) time, which amortizes the cost of the intermediate simdgroup scan and the associated data movement. Due to this scaling, this algorithm will be particularly sensitive to the number of values per thread and threads per threadgroup.

// helper function to prescan shared memory in place
template<ushort LENGTH, typename T> static inline T
ThreadPrefixExclusiveSum(threadgroup T* values) {
  // do as an inclusive scan first
  for (ushort i = 1; i < LENGTH; i++) {
    values[i] += values[i - 1];
  }
  // store the inclusive prefix before we overwrite it
  T inclusive_prefix = values[LENGTH - 1];
  // convert to an exclusive scan in the reverse direction
  for (ushort i = LENGTH - 1; i > 0; i--) {
    values[i] = values[i - 1];
  }
  values[0] = 0;
  return inclusive_prefix;
}

// helper function to uniformly add a value to shared memory in place
template<ushort LENGTH, typename T> static inline void
ThreadUniformAdd(threadgroup T* values, T uni) {
  for (ushort i = 0; i < LENGTH; i++) {
    values[i] += uni;
  }
}

// Raking threadgroup scan
template<ushort BLOCK_SIZE, typename T> static T
ThreadgroupRakingPrefixExclusiveSum(T value,
                                    threadgroup T* shared,
                                    const ushort lid) {
  // load values into shared memory
  shared[lid] = value;
  threadgroup_barrier(mem_flags::mem_threadgroup);
  // rake over shared memory in the first simdgroup
  if (lid < 32) {
    T partial_sum = ThreadPrefixExclusiveSum<BLOCK_SIZE/32>
                      (&shared[lid * (BLOCK_SIZE/32)]);
    T prefix = simd_prefix_exclusive_sum(partial_sum);
    ThreadUniformAdd<BLOCK_SIZE/32>(&shared[lid * (BLOCK_SIZE/32)], prefix);
  }
  threadgroup_barrier(mem_flags::mem_threadgroup);
  return shared[lid];
}

CPU Implementation and Performance

We will first create a sequential function that can operate both in- and out-of-place to generate an optimal CPU implementation. Next, we will vectorize this function using SIMD/NEON intrinsics. Finally, we will parallelize this function by creating a multithreaded version. The reference sequential implementation of an exclusive prefix sum is trivial; however, the in-place implementation is a bit less trivial because we are overwriting the array value we need in the next iteration. Our function is as follows:

#define SWAP(x, y) do { uint temp = x; x = y; y = temp; } while (0)

void prefix_exclusive_sum(uint* result,
                          const uint* input,
                          const size_t length,
                          uint prefix) {
    if (length < 1) return;
    if (input == result) {
        // in place: result aliases input, so swap through prefix to
        // hold the value that is about to be overwritten
        SWAP(prefix, result[0]);
        for (size_t i = 1; i < length; i++) {
            SWAP(prefix, result[i]);
            result[i] += result[i - 1];
        }
        return;
    }
    result[0] = prefix;
    for (size_t i = 1; i < length; i++) {
        result[i] = result[i - 1] + input[i - 1];
    }
}

Although this is a common pattern, this code will not autovectorize in Clang, so we need to write our own vectorized version. The M1 CPU exposes vectorized SIMD instructions either via the Accelerate framework or directly via NEON intrinsics. The vector unit is 128 bits wide, so with uint32_t we can do 4-way vectorization. Although NEON does not include a shift instruction like the _mm_slli_si128 instruction found in SSE2, it does have a fast vector extract instruction, vextq_u32, that we can use to simulate the shift.

The algorithm is somewhat easier to understand with the SIMD builtin datatypes; here is the key portion:

v  = simd_make_uint4(A, B, C, D);  // v = [A,B,C,D]
v += simd_make_uint4(0, v.xyz); // v = [A,AB,BC,CD]
v += simd_make_uint4(0, 0, v.xy); // v = [A,AB,ABC,ABCD]
v = simd_make_uint4(0, v.xyz); // v = [0,A,AB,ABC]

Given a vector v composed of four values [A, B, C, D], we shift v right by one to make a new vector [0, A, B, C] and add it to the original vector to generate [A, AB, BC, CD]. Next, we shift that result right by two to generate [0, 0, A, AB] and add it back to yield the inclusive prefix sum [A, AB, ABC, ABCD]. The exclusive sum is made by shifting right by one more and inserting 0. This algorithm is analogous to a Hillis-Steele scan.

To generalize this method to arbitrary-length arrays, we need to loop over the values with a stride of 4, carrying the inclusive sum from each 4-wide tuple into the next, and finally handle the remainder sequentially. Malloc on macOS does use 16-byte memory alignment in its ABI, but relying on that is technically undefined. Nonetheless, we don’t want to assume the input and result pointers are aligned to 16 bytes as required for a simd_uint4, so we reinterpret input and result as simd_packed_uint4* so that the compiler knows how to deal with the alignment. Alternatively, we could have loaded the value v using simd_uint4 v = *(simd_packed_uint4*)&input[i].

#define USE_INTRINSICS (1)

void prefix_exclusive_sum_vectorized(uint* result,
                                     const uint* input,
                                     const size_t length,
                                     uint prefix) {
    // reinterpret the pointers as packed vector types
    const simd_packed_uint4* vinput = (const simd_packed_uint4*)input;
    simd_packed_uint4* vresult = (simd_packed_uint4*)result;

    size_t n = length / 4;
    for (size_t i = 0; i < n; i++) {
#if USE_INTRINSICS && defined(__ARM_NEON)
        simd_uint4 v = vinput[i];
        v = vaddq_u32(v, vextq_u32(vdupq_n_u32(0), v, 3));
        v = vaddq_u32(v, vextq_u32(vdupq_n_u32(0), v, 2));
        vresult[i] = vaddq_u32(vdupq_n_u32(prefix),
                               vextq_u32(vdupq_n_u32(0), v, 3));
        prefix += v.w;
#else
        simd_uint4 v = vinput[i];
        v += simd_make_uint4(0, v.xyz);
        v += simd_make_uint4(0, 0, v.xy);
        vresult[i] = simd_make_uint4(0, v.xyz) + prefix;
        prefix += v.w;
#endif
    }
    // do the remainder sequentially, carrying the running prefix
    prefix_exclusive_sum(&result[n * 4], &input[n * 4],
                         length - n * 4, prefix);
}

In this function, you probably also noted that I included a flag to choose between the Apple-provided SIMD types and operators and the NEON intrinsics so we can see which is faster. On M1, simd_uint4 and the NEON type uint32x4_t share the same size and alignment and are thus interchangeable. To parallelize this vectorized code, we will use dispatch_apply from Grand Central Dispatch to spawn threads.

void prefix_exclusive_sum_parallelized(uint* result,
                                       uint* input,
                                       size_t length,
                                       uint prefix) {
    if (length < 1) return;
    // stride is an adjustable value, rounded up to a multiple of 4
    size_t stride = (length > 1024) ? (length + 511) / 512 : 1024;
    if (stride % 4) stride += 4 - (stride % 4);
    size_t dispatches = (length + stride - 1) / stride;
    // exit early if this can be done in a single dispatch
    if (dispatches == 1) {
        prefix_exclusive_sum_vectorized(result, input, length, prefix);
        return;
    }
    // multithreaded reduce into a partial sum buffer
    uint* partial_sums = (uint*)malloc(dispatches * sizeof(uint));
    dispatch_apply(dispatches, DISPATCH_APPLY_AUTO, ^(size_t i) {
        partial_sums[i] = reduce_sum_vectorized(&input[i * stride],
                                                MIN(stride, length - i * stride));
    });
    // scan the partial sums in place, seeding with the overall prefix
    prefix_exclusive_sum_vectorized(partial_sums, partial_sums,
                                    dispatches, prefix);
    // scan each tile, seeding it with its partial sum prefix
    dispatch_apply(dispatches, DISPATCH_APPLY_AUTO, ^(size_t i) {
        prefix_exclusive_sum_vectorized(&result[i * stride],
                                        &input[i * stride],
                                        MIN(stride, length - i * stride),
                                        partial_sums[i]);
    });
    free(partial_sums);
}

In this function, we use dispatch_apply to process tiles of the data array. We have a single adjustable parameter, stride, that defines how long each tile will be; we round the stride up to the nearest multiple of 4 so that the vectorized loads stay aligned to whole 4-element tuples. After trying several global strategies, we chose reduce-then-scan because we have a very fast CPU reduce that we optimized in the last story. Within this strategy, we reduce into a partial sum array, scan that array in place, and then use each entry as the prefix for its tile. Let’s take a look at how the sequential, the two vectorized versions, and the parallelized version do.

Parallel prefix sum bandwidth on M1 CPU as a function of length of data using code described in text. Benchmark recorded on an M1 Mac Mini with 8 GB RAM and 256 GB SSD. Code compiled at -Os optimization level, and represents the average of 50 trials. Image © Matthew Kieber-Emmons. All rights reserved.

The benchmark results are plotted as throughput. Since the sequential and vectorized algorithms are doing 2n global data movements (one read and one write), our best case rate would be 29 GB/sec with 32-bit integers. In contrast, our threaded implementation uses a multi-level strategy and thus 3n data movements (so max 19 GB/sec), and so we are trading some memory bandwidth for parallelism.

The sequential out-of-place scan achieves slightly under 10 GB/sec. Vectorizing this code yields a ~2.2× improvement at ~22 GB/sec. Using dispatch with varying stride lengths improves performance for short arrays that fit in cache, but is substantially inferior for longer arrays that must spill to main memory. Surprisingly, the hand-coded NEON instructions were roughly equivalent to the use of SIMD operators provided by Apple in Accelerate and supported with Clang magic. I suppose I shouldn’t have been surprised; Clang does a great job optimizing this code. Comparison of the assembly for the inner loop is instructive:

   NEON intrinsics code        |           SIMD operators
ldr q1, [x9], #16 | ldr q1, [x9], #16
ext.16b v2, v0, v1, #12 | ext.16b v2, v0, v1, #12
add.4s v1, v2, v1 | add.4s v1, v2, v1
->ext.16b v2, v0, v1, #8 | ->movi.2d v2, #0000000000000000
| ->mov.d v2[1], v1[0]
add.4s v1, v2, v1 | add.4s v1, v2, v1
dup.4s v2, w3 | ext.16b v2, v0, v1, #12
ext.16b v3, v0, v1, #12 | dup.4s v3, w3
add.4s v2, v3, v2 | add.4s v2, v2, v3
str q2, [x10], #16 | str q2, [x10], #16
mov.s w11, v1[3] | mov.s w11, v1[3]
add w3, w11, w3 | add w3, w11, w3
subs x8, x8, #1 | subs x8, x8, #1

They are almost identical! The only difference (annotated by arrows) is that a vector extract instruction ext is replaced with two moves movi and mov, with some slight changes in register usage and instruction ordering. I couldn’t find cycles per instruction in the ARM docs; assuming one cycle per instruction, the NEON code is ever so slightly more efficient and certainly not worth the effort. From now on, I will use the built-in SIMD types and operations and take a peek at the assembly to see if anything strange is happening, and if not, feel secure in using it.

Overall Performance Comparison

So our benchmark vectorized CPU code can achieve a blistering 22 GB/sec on large arrays. How do the GPU codes compare? Slower. The raking and Blelloch algorithms achieve just a shade over 18 GB/sec at 128M elements. Recall that with the reduce-then-scan global algorithm, 19.3 GB/sec was our absolute upper limit and that we expected to come in slightly lower than this because this does not include the effort required to scan the intermediate partial sum buffers. This means that these two algorithms are operating at the memory bandwidth limit.

The cooperative algorithm, which is not work efficient, is 1-2 GB/sec slower and cannot saturate the bus. At sizes greater than 128 M elements (512 MB), performance degrades. It is not clear why this would be the case. On the CPU, we saw degraded performance past the size of the L2 cache (12 MB), but the M1 GPU should be able to access the entire 8 GB in the machine. I suspect this must be from behind-the-scenes layers of caching, but it will be something to keep an eye on in the next generation of processors.

Parallel prefix sum bandwidth as a function of problem size and threadgroup algorithm on the M1 GPU using code described in text. Benchmark recorded on an M1 Mac Mini with 8 GB RAM and 256 GB SSD. Code compiled at -Os optimization level with Metal fast math enabled, and represents the average of 50 trials. Image © Matthew Kieber-Emmons. All rights reserved.

Conclusion

As expected, raking algorithms should be the go-to technique for threadgroup prefix scans on GPUs, as they are the most work and energy efficient. However, in most circumstances, the CPU code should be preferred: while the GPU is faster than unoptimized CPU code, the optimized CPU code performs better at every problem size and is trivial to write. If future versions of Metal provided additional global synchronization primitives, this would change the game, as it would enable single-pass algorithms like Decoupled Lookback, with ~2n global memory accesses, to achieve throughputs in the range of 29 GB/sec. This does make me wonder, though, about speedups reported by others relative to CPU code: are authors using vectorized CPU code or naïve sequential code when claiming enormous speedups for prefix scans on GPUs? The scan and reduce kernels used here are provided in the gist below.

Thanks for reading!
