Post

Replies

Boosts

Views

Activity

M1 GPU violates atomic_thread_fence across threadgroups
I have an M1 Pro with a 16-core GPU. When I run a shader with 8193 threads, atomic_thread_fence is violated across the boundary between thread 8191 (the last thread in the 7th threadgroup) and 8192 (the first thread in the 9th threadgroup). I've attached the Metal and Swift files, but I'll repost the relevant kernel here. It's a function that launches N threads to iterate through a binary tree from the leaves, where the first thread to reach the parent terminates and the second one populates it with the sum of the nodes two children. // clang-format off void sum(device const int& size, device const int* __restrict__ in, device int* __restrict__ out, device atomic_int* visited, uint i [[thread_position_in_grid]]) { // clang-format on int val = in[i]; uint cur = (size + i - 1); out[cur] = val; atomic_thread_fence(mem_flags::mem_device, memory_order_seq_cst); cur = (cur - 1) / 2; int proceed = atomic_fetch_add_explicit(&visited[cur], 1, memory_order_relaxed); while (proceed == 1) { uint left = 2 * cur + 1; uint right = 2 * cur + 2; uint val_left = out[left]; uint val_right = out[right]; uint val_cur = val_left + val_right; out[cur] = val_cur; if (cur == 0) { break; } cur = (cur - 1) / 2; atomic_thread_fence(mem_flags::mem_device, memory_order_seq_cst); proceed = atomic_fetch_add_explicit(&visited[cur], 1, memory_order_relaxed); } } What I'm observing is that thread 8192 hits the atomic_fetch_add first and terminates, while thread 8191 hits it second (observes that thread 8192 had incremented it by 1) and proceeds into the loop. Thread 8191 reads out[16383] (which it populated with 8191) and out[16384] (which thread 8192 populated with 8192 prior to the atomic_thread_fence). Instead of reading 8192 from out[16384] though, it reads 0. Maybe I'm missing something but this seems like a pretty clear violation of the atomic_thread_fence which (I thought) was supposed to guarantee that the write from thread 8192 to out[16384] would be visible to any thread observing the effects of the following atomic_fetch_add. Is atomic_fetch_add not a store operation? Modifying it to an atomic_store or atomic_exchange still results in the bug. Adding another atomic_thread_fence between the atomic_fetch_add and reading of out also doesn't change anything. I only begin to observe this on grid sizes of 8193 and upwards. That's 9 threadgroups per grid, which I assume could be related to my M1 Pro GPU having 16 cores. Running the same example on an A17 Pro GPU doesn't show any of this behavior up through a tested grid size of 4194303 (2^22-1), at which point testing larger grid sizes starts to run into other issues so I can't test anything larger. Removing the atomic_thread_fences on both the M1 and A17 cause the test to fail at much smaller grid sizes, as expected. sum.metal main.swift
2
0
405
Dec ’24
Xcode Instruments CPU Profiler not logging os_signpost Points of Interest
If I create a new project with the following code in main.swift and then Profile it in Instruments with the CPU Profiler template, nothing is logged in the Points of Interest category. I'm not sure if this is related to the recent macOS 14.2 update, but I'm running Xcode 15.1. import Foundation import OSLog let signposter = OSSignposter(subsystem: "hofstee.test", category: .pointsOfInterest) // os_signpost event #1 signposter.emitEvent("foo") // os_signpost event #2 signposter.withIntervalSignpost("bar") { print("Hello, World!") } If I change the template to the System Trace template and profile again, then the two os_signpost events show up as expected. This used to work before, and this is a completely clean Xcode project created from the macOS Command Line Tool template. I'm not sure what's going on and searching for answers hasn't been fruitful. Changing the Bundle ID doesn't have any effect either.
1
0
1.2k
Dec ’23