Member-only story

Efficient Parallel Prefix Sum in Metal for Apple M1

Comparison of optimal M1 GPU scan primitives to vectorized CPU performance

Matthew Kieber-Emmons

Published in

Better Programming

20 min readSep 9, 2021

In my last story, Optimizing Parallel Reduction in Metal for Apple M1, we explored several strategies for maximizing throughput in parallel reduce on the M1 processor. Given the low arithmetic intensity of reduction, we could optimize kernels that ran at the memory bandwidth limit regardless if the kernels were tuned to maximize parallelism or cost efficiency. This story will explore a more complex optimization problem, parallel prefix sum.

All prefix sum, or inclusive “scan,” is common data parallel primitive that finds use in sorting, stream compaction, multi-precision arithmetic, among many other uses. The operation is so common that it is in the C++ STL and is built into several languages. The scan operation takes a sequence of values [x0, x1, …, x(n–1)] and applies a binary operator ⊕ to yield a second sequence of values [x0, (x0⊕x1),…,(x0⊕x1⊕…⊕x(n–1))]. For example, a scan with the binary operator plus on the sequence [0,1,0,0,1,0,1] yields a cumulative sum in the second sequence [0,1,1,1,2,2,3].

A related and perhaps more common operation called “prescan” or exclusive prefix sum shifts the inclusive sequence one to the right and inserts an identity value I at index 0 (I = 0 with the plus operator), [0, x0, (x0⊕x1),…,(x0⊕x1⊕…⊕x(n–2))]. The sequential computation of scan can be trivially computed by result[i] = input[i] ⊕ result[i –1]. Thus, the total number of operations required is n–1, where n is the number of values in the sequence, and so the scan requires O(n) time.

While scan appears inherently sequential because each value depends on a prior value, there is a significant pool of literature on the parallelization of this primitive. Hillis and Steele reported an early algorithm based on the Kogge-Stone adder that, despite requiring more additions than the sequential algorithm, maps well onto the simdgroup (warp) execution model of modern GPUs because of its step-efficiency. Blelloch reported a work-efficient tree-based algorithm based on the Brent-Kung adder that was commonly used prior to the rise in popularity of SIMD execution models on GPUs.

Optimization considerations

Matthew, am I missing something?

To avoid trivial and unhelpful details are we willing to agree with the following points?

- A small enough kernel (eg a gridsize of 4*8192, a threadgroup size of 8192), so 32 lanes* 4 lanesets per core * 64 time…

Better Programming

Efficient Parallel Prefix Sum in Metal for Apple M1

Comparison of optimal M1 GPU scan primitives to vectorized CPU performance

Optimization considerations

Create an account to read the full story.

Published in Better Programming

Written by Matthew Kieber-Emmons

Responses (1)

More from Matthew Kieber-Emmons and Better Programming

Optimizing Parallel Reduction in Metal for Apple M1

Exploring how an optimal implementation should approach the memory bandwidth of the architecture

How To Update Your Status During Standup Like a Senior Engineer

A status update is where you can showcase how well you manage ambiguity and is an important way to build trust with your team

Why I Prefer Regular Merge Commits Over Squash Commits

I used to think squash commits were so cool, and then I had to use them all day, every day. Here’s why you should avoid squash

Memory Bandwidth Optimized Parallel Radix Sort in Metal for Apple M1 and Beyond

A sort shader in Metal for use on primitive data types

Recommended from Medium

Basic Introduction to C Language

This might be one the most hated languages by standard developers due to its tedious requirements needed to build something, let alone when…

Decoding the ‘Virtual’ Function in C++ Programming Language

Probably the least understood and often misunderstood part of learning and using C++ programming is the virtualisation of a function in a…

Fired From Meta After 1 Week: Here’s All The Dirt I Got

This is not just another story of a disgruntled ex-employee. I’m not shying away from the serious corporate espionage or the ethical…

This new IDE from Google is an absolute game changer

This new IDE from Google is seriously revolutionary.

Jeff Bezos Says the 1-Hour Rule Makes Him Smarter. New Neuroscience Says He’s Right

Jeff Bezos’s morning routine has long included the one-hour rule. New neuroscience says yours probably should too.

Understanding the Global Interpreter Lock (GIL) in Python 3.13

Introduction