Metal Journal
Detailing my journey to writing performant Metal kernels & systems
Created on May 5 · Last edited on May 7
May 6, 2024
Every small step feels like a breakthrough lol.
I just found these:
which explain how to choose where a resource is stored & whether GPU and/or CPU have access to it.
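For my own reference, here's a minimal sketch of where that choice actually shows up, using Apple's metal-cpp wrapper on the host side (I might end up using Swift instead); the buffer roles here are made-up examples, not anything from a real project:

```cpp
// Minimal metal-cpp sketch of picking a storage mode when creating buffers.
// Assumes the metal-cpp headers are on the include path; the "input"/"weights"
// roles are hypothetical, just to show where the decision lives.
#define NS_PRIVATE_IMPLEMENTATION
#define MTL_PRIVATE_IMPLEMENTATION
#include <Foundation/Foundation.hpp>
#include <Metal/Metal.hpp>

int main() {
    MTL::Device* device = MTL::CreateSystemDefaultDevice();

    // Shared: one allocation in unified system memory, visible to CPU and GPU.
    // Good for data the CPU writes and the GPU reads (no copy on Apple silicon).
    MTL::Buffer* input = device->newBuffer(1024 * sizeof(float),
                                           MTL::ResourceStorageModeShared);

    // Private: GPU-only. The CPU can't map it; you fill it via a blit from a
    // shared staging buffer. Good for data only kernels ever touch.
    MTL::Buffer* weights = device->newBuffer(1024 * sizeof(float),
                                             MTL::ResourceStorageModePrivate);

    // (Memoryless is a *texture* storage mode: render-pass attachments that
    // live only in on-chip tile memory and are never backed by RAM.)

    input->release();
    weights->release();
    device->release();
    return 0;
}
```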
I started making this chart below, which led me to a question: is there such a thing as local memory, private to only one threadgroup, in Metal? (See the snippet after the table.)
I originally named the column Location/Bandwidth, because the bandwidth is what we really care about. Actually, I'm just going to call it Access Latency and let that imply where the data lives.
| CUDA | Access Latency | Metal | Scope |
|---|---|---|---|
| Register | fastest | Register | 1 GPU thread |
| Shared memory | fast | `threadgroup` / tile memory (cf. the [Memoryless] storage mode, which is for tile-only textures) | 1 threadgroup (CUDA block), GPU only |
| Local memory | slow (it actually lives in device memory) | [Private] system memory | CUDA: spill space for 1 GPU thread; Metal [Private]: GPU only |
| Global memory | slow | [Shared] system memory | CPU & GPU |
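To partially answer my own question: yes, MSL has a `threadgroup` address space, which is exactly the per-threadgroup "local" memory I was asking about (and, as far as I can tell, it's what gets backed by on-chip tile memory on Apple GPUs). A tiny hypothetical kernel just to show the syntax, assuming a threadgroup size of 256:

```cpp
#include <metal_stdlib>
using namespace metal;

// Hypothetical reduction kernel, just to show the threadgroup address space.
// `partial` is visible to (and shared by) the 256 threads of one threadgroup
// only, like shared memory is to a CUDA block.
kernel void block_sum(device const float* in  [[buffer(0)]],
                      device float*       out [[buffer(1)]],
                      uint gid  [[thread_position_in_grid]],
                      uint tid  [[thread_position_in_threadgroup]],
                      uint tgid [[threadgroup_position_in_grid]])
{
    threadgroup float partial[256];          // per-threadgroup storage
    partial[tid] = in[gid];
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Naive tree reduction within the threadgroup.
    for (uint stride = 128; stride > 0; stride >>= 1) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
    if (tid == 0) {
        out[tgid] = partial[0];
    }
}
```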

It's bedtime because I worked a shift and went for a run. Tomorrow I'd like to write a Metal tiled matmul, and soon it might be worth taking a closer look at the ANE/NPU & Core ML.
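A rough sketch of the shape I'm going for, so future me has a starting point. Assumes square N×N row-major fp32 matrices with N a multiple of the tile size, dispatched with TILE×TILE threads per threadgroup over an (N/TILE)×(N/TILE) grid of threadgroups; completely untested:

```cpp
#include <metal_stdlib>
using namespace metal;

#define TILE 16

// Sketch of a tiled matmul: C = A * B.
kernel void matmul_tiled(device const float* A [[buffer(0)]],
                         device const float* B [[buffer(1)]],
                         device float*       C [[buffer(2)]],
                         constant uint&      N [[buffer(3)]],
                         uint2 gid [[thread_position_in_grid]],
                         uint2 tid [[thread_position_in_threadgroup]])
{
    // Tiles of A and B staged in threadgroup (tile) memory so each element
    // loaded from device memory gets reused TILE times.
    threadgroup float tileA[TILE][TILE];
    threadgroup float tileB[TILE][TILE];

    const uint row = gid.y;   // row of C this thread owns
    const uint col = gid.x;   // column of C this thread owns
    float acc = 0.0f;

    for (uint t = 0; t < N / TILE; ++t) {
        // Cooperative load: each thread brings in one element of each tile.
        tileA[tid.y][tid.x] = A[row * N + (t * TILE + tid.x)];
        tileB[tid.y][tid.x] = B[(t * TILE + tid.y) * N + col];
        threadgroup_barrier(mem_flags::mem_threadgroup);

        for (uint k = 0; k < TILE; ++k) {
            acc += tileA[tid.y][k] * tileB[k][tid.x];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
    C[row * N + col] = acc;
}
```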
May 5, 2024
I'm trying to piece together an understanding of Apple Silicon (M-Series) GPUs that's as deep as my (probably beginner level) understanding of NVIDIA GPUs.
Here are some images (from the PMPP book & online resources) that illustrate my...
NVIDIA GPU understanding:
SMs:




Regarding my note in red: my understanding is that in an NVIDIA graphics card, the global memory (the card's DRAM) lives on the card, but off of the actual processing chip.
All of these figures are simplified & explained in the PMPP book, so while I'm still missing some understanding (and don't even know what it is that I'm missing), I do understand what the different levels of memory are, and there are at least some resources out there that go deeper. For example, [Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking](https://arxiv.org/abs/1804.06826) gives us a more detailed figure than the one above:

and tells us the bandwidth between types of memory:

Upon pasting this here, I'm wondering if Shared Memory -> Banks (Bs & ws) refers to the banks described here in PMPP:


Remember, the reason we care about banks is coalesced memory accesses: chunks of memory are accessed in bursts, so data that sits close together can be loaded from global memory into shared memory faster (in a single burst rather than many separate reads).
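In Metal terms, I think of it as: neighboring threads in a SIMD group should touch neighboring addresses. A contrived, hypothetical sketch of the two patterns:

```cpp
#include <metal_stdlib>
using namespace metal;

// Contrived example: two ways of reading the same data.
kernel void copy_example(device const float* in     [[buffer(0)]],
                         device float*       out    [[buffer(1)]],
                         constant uint&      stride [[buffer(2)]],
                         uint gid [[thread_position_in_grid]])
{
    // Coalesced: thread i reads element i, so a SIMD group's 32 loads land
    // in a few contiguous bursts.
    out[gid] = in[gid];

    // Strided: thread i reads element i * stride, so those same 32 loads
    // scatter across many bursts and waste bandwidth.
    // out[gid] = in[gid * stride];
}
```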
Otherwise, what we care about is the measured bandwidth of shared memory vs. global memory: 12,080 GiB/s (12,970.8 GB/s) and 750 GiB/s (805.3 GB/s) respectively, so shared memory is roughly 16x faster.
What this means for us is that when we're trying to get the maximum performance out of a GPU... (let's refer to the Roofline Model below)

To get the most computational throughput (ops/s), we need to make sure we aren't memory-bandwidth bound. Point A1 is an example of a kernel that is memory bound.
So, considering how much faster the GPU cores can load data from shared memory than from global memory, we want to touch global memory as seldom as possible and reuse what we've already staged in shared memory.
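The way I understand the roofline: attainable throughput is capped by whichever is smaller, the compute peak or what the memory system can feed, where the arithmetic intensity $I$ is FLOPs per byte moved:

$$
P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \times BW_{\text{mem}}\right), \qquad I = \frac{\text{FLOPs performed}}{\text{bytes moved from memory}}
$$

Tiling is exactly the trick that raises $I$: each value loaded from global into shared memory gets reused many times before we go back out to DRAM.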
Now how about my...
Apple Silicon [CPU, GPU, & ANE/NPU] understanding
Here we'll look at the M3 Max (the 16-core CPU version, i.e. 12 P-cores + 4 E-cores), because it's the most fun (without being the Ultra, which is two Maxes put together).

CPU: consists of (red) 12 performance (P) cores (8 above & 4 below the large shared cache) and (light green) 4 efficiency (E) cores. What are the levels of cache/memory, and (perhaps unimportantly, since it will never be the bottleneck) what is their bandwidth?
AMX: "Advanced Matrix Attention Unit" Parts of the CPU which I suspect to be like ALUs specialized for matricies. Just got an awesome find to access it directly: https://github.com/corsix/amx. What level of memory is attached to this, probably CPU core registers?
NPU: Neural Processing Unit, a.k.a. the Neural Engine. Consists of (pink) 16 cores, which I believe are intended to be used only through Core ML. Rated at 18 TOPS (I'm assuming 18 tera, 10^12, operations per second). I'm curious what precision / data type it deals with. It touches the RAM; does it go through the CPU, or does it access RAM directly?
GPU: consists of (blue) 40 cores, each with 16 execution units (EU), which each have 8 arithmetic logic units (ALU).
From the M1 wiki (https://en.wikipedia.org/wiki/Apple_M1): since it says the M1 chip can execute 24,576 threads simultaneously, and the M1 has 128 EUs & 1024 ALUs, I'm assuming that means each EU holds 192 hardware threads, and possibly every 24 threads share an ALU.
Let's remember that a SIMD group (the equivalent of a warp, used to schedule threads) has 32 threads.
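Putting those numbers together (assuming the wiki figures are right, and that I'm reading them correctly):

$$
\frac{24{,}576\ \text{threads}}{128\ \text{EUs}} = 192\ \text{threads per EU}, \qquad
\frac{24{,}576\ \text{threads}}{1024\ \text{ALUs}} = 24\ \text{threads per ALU}, \qquad
\frac{192}{32} = 6\ \text{SIMD groups' worth of threads per EU}
$$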
(RAM) LPDDR5: (Low-Power DDR5-6400) 64-128 GB, bandwidth: 409.6 GB/s. OK, so does this bandwidth cover both reads & writes, shared across the CPU, GPU, & NPU?
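My guess at where the 409.6 GB/s comes from, assuming a 512-bit memory bus like the earlier Max chips:

$$
6400 \times 10^{6}\ \tfrac{\text{transfers}}{\text{s}} \times \frac{512\ \text{bits}}{8\ \tfrac{\text{bits}}{\text{byte}}} = 409.6\ \text{GB/s}
$$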
SLC: System Level Cache, "a kind of system-wide L3 cache". This takes up so much die area in the M3; what is its capacity & bandwidth? I think it's safe to assume it won't be directly accessible through Metal, and instead acts as a transparent cache in front of the RAM, but does it then factor into the RAM's bandwidth of 409.6 GB/s?
The image is from youtu.be/8bf3ORrE5hQ?si=Dx8wkREsidrrAgn2&t=477