Posts

Post not yet marked as solved
1 Reply
535 Views
I have a complex CAS loop with branches that essentially implements a mutex, and I'm porting it from CUDA to Metal. I'm looking for the equivalent of CUDA's __threadfence() (see docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions). Unfortunately, metal::threadgroup_barrier(metal::mem_flags::mem_device) also implies execution synchronization: it needs to be "seen" by all threads, otherwise the kernel deadlocks. I need one store to buffer A to happen before another store to buffer B. The only atomic memory order option available is "relaxed". How can I get this ordering guarantee?
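To make the requested guarantee concrete, here is a minimal sketch in host-side standard C++, not MSL. It only illustrates the ordering the post asks for (a store to A made visible before a store to B, with no execution barrier), using a release fence paired with an acquire fence around relaxed atomics. The names bufferA, bufferB, producer and consumer are hypothetical stand-ins, and nothing here claims that Metal exposes the same call.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical stand-ins for the two buffers in the post (not Metal objects).
static std::uint32_t bufferA = 0;              // plain data, like a device buffer
static std::atomic<std::uint32_t> bufferB{0};  // flag, accessed with relaxed atomics only

// Writer: the store to A must become visible before the store to B.
void producer(std::uint32_t value) {
    bufferA = value;                                      // store to "buffer A"
    std::atomic_thread_fence(std::memory_order_release);  // memory fence only, no execution sync
    bufferB.store(1, std::memory_order_relaxed);          // store to "buffer B"
}

// Reader: spins until B is set, then reads A.
std::uint32_t consumer() {
    while (bufferB.load(std::memory_order_relaxed) == 0) {
        std::this_thread::yield();
    }
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with the release fence
    return bufferA;
}
```

In a real use, producer and consumer would run on different threads. Unlike a barrier, the fence does not have to be reached by any other thread, which is the property the CAS loop with divergent branches depends on.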
Posted by rianflo.
Post not yet marked as solved
2 Replies
530 Views
Metal is great, and I love the API's design. One drawback of MSL, though, is that its explicit address space qualifiers make preexisting C++ functions unusable. Was that decision made for the sake of lower compile times? In CUDA, which isn't too dissimilar, a function can inherit the address space of the pointer passed to it. Let's say I have a very large collection of useful headers as a git submodule from a repository I do not maintain myself. In MSL I cannot just use them; I have to modify each and every function declaration so that the argument list includes the address space qualifier (ASQ). Example:

    // some header-only lib
    void usefulFunction(MyType& a) { ... }
    ...
    // calling kernel code
    device MyType* x;
    MyType x0 = x[0];
    usefulFunction(x[0]);
    usefulFunction(x0);

None of the above works. In CUDA this isn't an issue.
Posted by rianflo.
Post marked as solved
3 Replies
435 Views
Please add support for uint64_t buffers. I can reinterpret_cast a uint2 to a uint64_t, and the Metal compiler can generate load instructions to fetch uint2s. The rest is just unnecessary typing work for the programmer.
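As an illustration of the boilerplate the post refers to, here is a small host-side C++ sketch of treating a pair of 32-bit words as one 64-bit value. UInt2 is a hypothetical stand-in for MSL's uint2, and the sketch assumes x holds the low 32 bits, which is what a reinterpret_cast would give on a little-endian layout. It uses explicit shifts rather than a cast so the behavior does not depend on byte order.

```cpp
#include <cstdint>

// Hypothetical stand-in for MSL's uint2 (two 32-bit lanes).
struct UInt2 {
    std::uint32_t x;  // assumed: low 32 bits
    std::uint32_t y;  // assumed: high 32 bits
};

// Combine the two lanes into a single 64-bit value.
std::uint64_t toUInt64(UInt2 v) {
    return (static_cast<std::uint64_t>(v.y) << 32) | v.x;
}

// Split a 64-bit value back into two 32-bit lanes.
UInt2 fromUInt64(std::uint64_t v) {
    return UInt2{static_cast<std::uint32_t>(v),
                 static_cast<std::uint32_t>(v >> 32)};
}
```

With native uint64_t buffers, both helpers would be unnecessary, which is the point of the request.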
Posted by rianflo.