Metal Performance Shaders


Use Metal Performance Shaders to optimize graphics and compute performance with kernels that are fine-tuned for the unique characteristics of each Metal GPU family.

Metal Performance Shaders Documentation

Posts under Metal Performance Shaders tag

54 Posts
Post not yet marked as solved
2 Replies
49 Views
Hi, I'm trying to implement a Barnes-Hut N-body simulation in Metal. The code requires construction of a tree, and the CUDA implementation uses locks to allow insertion of new nodes into the tree. I've tried using an array of atomic ints in a test case, but this doesn't seem to work:

```metal
kernel void binning_compute_function(device MyArgument *arg1            [[ buffer(0) ]],
                                     constant float *ranarr             [[ buffer(1) ]],
                                     device volatile atomic_int *flagArr [[ buffer(2) ]],
                                     device int *bins                   [[ buffer(3) ]],
                                     uint index                         [[ thread_position_in_grid ]]) {
    int expected = 0;
    int ibin = (ranarr[index] * arg1->nbins);
    for (int i = 0; i < 100000000; i++) { // Lock
        expected = 0;
        bool test = !atomic_compare_exchange_weak_explicit(&flagArr[ibin], &expected, 1,
                                                           memory_order_relaxed, memory_order_relaxed);
        if (test) {
            bins[ibin] += 1;
            atomic_store_explicit(&flagArr[ibin], 0, memory_order_relaxed);
            break;
        }
    }
}
```

Any other suggestions? The alternative is to use the CPU for this, but it seems a shame to miss out on the processing power of the GPU. Thank you, Colin
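One lock-free alternative worth noting: since the critical section in the test kernel only increments `bins[ibin]`, the same effect can be obtained with a single atomic fetch-add on the bins themselves, with no lock at all. Metal's atomic functions mirror C++'s `<atomic>`, so here is a CPU-side sketch of the pattern (function and variable names here are illustrative, not the poster's code):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Lock-free binning: every thread bumps a per-bin counter with fetch_add.
// Concurrent increments of the same bin cannot be lost, because fetch_add
// performs the read-modify-write as one atomic operation.
std::vector<int> countIntoBins(const std::vector<int>& binOf, int nbins, int nthreads) {
    std::vector<std::atomic<int>> bins(nbins);
    for (auto& b : bins) b.store(0);

    std::vector<std::thread> pool;
    // Split the input across threads; many threads may hit the same bin
    // at once, which is exactly the case the atomic makes safe.
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            for (std::size_t i = t; i < binOf.size(); i += nthreads)
                bins[binOf[i]].fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool) th.join();

    std::vector<int> out(nbins);
    for (int i = 0; i < nbins; ++i) out[i] = bins[i].load();
    return out;
}
```

In the Metal kernel this would amount to declaring `bins` as `device atomic_int *` and calling `atomic_fetch_add_explicit(&bins[ibin], 1, memory_order_relaxed)`; a real tree-insertion step needs more than a counter, but any part of the work that reduces to an increment or slot claim can usually be expressed this way.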
Posted
by mcgillca.
Last updated
.
Post not yet marked as solved
0 Replies
45 Views
I'm trying to work with an MTLBuffer that has a size greater than the maximum allowed buffer length. As such, I have split it into an array of MTLBuffers like this:

```swift
// swift declaration
var _mBuffers : [MTLBuffer]!

// swift allocation
_mBuffers = []
for _ in 0..<64 { // use 64 buffers
    if let buffer = _mDevice.makeBuffer(length: BUFFER_BYTES, options: .storageModeShared) {
        _mBuffers.append(buffer)
    }
}
// we now have an array of 64 MTLBuffers in _mBuffers
```

Then in my Metal compute kernel I have a function like this, where I want to organise data across the 64 buffers:

```metal
struct Foo {
    device uint* data_buffers[64];
};

kernel void test_buffers(constant Foo & f [[buffer(0)]],
                         device metal::atomic_uint *out_counters,
                         const uint index [[thread_position_in_grid]]) {
    // figure out which group the value belongs to
    uint group = index % 64;
    // find the slot in memory for that group
    uint position = atomic_fetch_add_explicit(&out_counters[group], 1, memory_order_relaxed);
    // add the value to the buffer for that group, at its position in the buffer
    f.data_buffers[group][position] = index;
}
```

Now I can't figure out how to call this kernel function from Swift code using the command encoders:

```swift
// swift calling code snippets... (not even close to working)

// this should be the same as the C struct in the kernel code?
struct Foo {
    var buffers : [MTLBuffer]
};

// how to pass _mBuffers with 64 elements to the computeEncoder arguments??
var arguments : Foo = {
    buffers: [MTLBuffer]!
}

// how to pass the arguments to the computeEncoder correctly?
computeEncoder.setBytes(&arguments, length: MemoryLayout<Foo>.stride, index: 0)
```

Any help would be much appreciated! I feel like this should be simpler than it is.
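One way to see why `setBytes` on a Swift struct containing `[MTLBuffer]` cannot work: the shader-side `Foo` expects 64 raw GPU addresses laid out contiguously, which is what an `MTLArgumentEncoder` writes into an argument buffer; a Swift array of `MTLBuffer` objects has a completely different memory layout. A C++ sketch of the layout the GPU side expects (assuming 64-bit device addresses; this is an illustration, not working Metal or Swift code):

```cpp
#include <cstdint>

// C++ analogue of the Metal-side struct: a tightly packed table of 64
// device pointers. An MTLArgumentEncoder fills a buffer with exactly
// this layout when you encode 64 buffers at consecutive indices; a
// Swift [MTLBuffer] is a heap-allocated array of object references
// and shares nothing with this layout.
struct Foo {
    uint64_t data_buffers[64];  // each entry is a `device uint*` on the GPU
};

static_assert(sizeof(Foo) == 64 * sizeof(uint64_t),
              "the pointer table must be tightly packed");
```

In practice this usually means creating an encoder with `makeArgumentEncoder(bufferIndex:)` on the kernel function, backing it with a small `MTLBuffer`, calling `setBuffer(_:offset:index:)` for each of the 64 buffers, binding the argument buffer at index 0, and marking the 64 buffers resident with `useResource(_:usage:)` on the compute encoder.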
Posted
by drnick.
Last updated
.
Post not yet marked as solved
0 Replies
72 Views
Hi, related to another post I made exploring better ways to increase the efficiency of Metal compute kernels for very large calculations, we gave indirect compute command buffers a shot. When we implemented the app mentioned in that post with indirect buffers, our final result memory buffers always came back filled with 0s. We then implemented a much simpler test, just adding two buffers, to try to understand where the issue could be. As a proof of concept we use the following kernel, which does an accumulation task over a cubic 3D array:

```metal
#define size (*size_pr)
kernel void metalswift_add(const device float *Buffer1 [[ buffer(0) ]],
                           const device float *Buffer2 [[ buffer(1) ]],
                           device float *OutputBuffer  [[ buffer(2) ]],
                           const device int *size_pr   [[ buffer(3) ]],
                           uint3 gid [[thread_position_in_grid]]) {
    int i = gid.x;
    int j = gid.y;
    int k = gid.z;
    if (i < size && j < size && k < size) {
        int index = i*size*size + j*size + k;
        OutputBuffer[index] += Buffer1[index] + Buffer2[index];
    }
}
```

The indirect command buffer works as expected for memory buffers of modest size, for example matrices smaller than 8 x 8 x 8 = 512 float entries. But when the buffer becomes a bit bigger (anything equal to or larger than 16 x 16 x 16 = 4096 float entries), the indirect command buffer seems to stop working and our result buffer just has 0s. When running through a standard command pipeline, we obtain the desired results. Below is the Swift implementation of the indirect command buffer (the standard pipeline is not shown for post space limitations):

```swift
...
@_cdecl("metalswift_add_indirect")
public func metalswift_add_indirect(array1: UnsafeMutablePointer<Float>,
                                    array2: UnsafeMutablePointer<Float>,
                                    ldim: Int) -> UnsafeMutablePointer<Float> {
    print("USING INDIRECT COMMAND BUFFER")
    var device : MTLDevice!
    device = MTLCreateSystemDefaultDevice()!
    let defaultLibrary = try! device.makeLibrary(source: computeKernel, options: nil)
    let kernel_function = defaultLibrary.makeFunction(name: "kernel_operation")!
    let descriptor = MTLComputePipelineDescriptor()
    descriptor.computeFunction = kernel_function
    descriptor.supportIndirectCommandBuffers = true
    let computePipelineState = try! device.makeComputePipelineState(descriptor: descriptor, options: .init(), reflection: nil)
    let Ref1 : UnsafeMutablePointer<Float> = UnsafeMutablePointer(array1)
    let Ref2 : UnsafeMutablePointer<Float> = UnsafeMutablePointer(array2)
    var size = ldim
    let SizeBuffer : UnsafeMutablePointer = UnsafeMutablePointer(&size)
    let ll = MemoryLayout<Float>.stride * ldim*ldim*ldim
    let Buffer1:MTLBuffer! = device.makeBuffer(bytes:Ref1, length: ll, options: MTLResourceOptions.storageModeShared)
    let Buffer2:MTLBuffer! = device.makeBuffer(bytes:Ref2, length: ll, options: MTLResourceOptions.storageModeShared)
    let OutputBuffer:MTLBuffer! = device.makeBuffer(length: ll, options: MTLResourceOptions.storageModeShared)
    let Size:MTLBuffer! = device.makeBuffer(bytes: SizeBuffer, length: MemoryLayout<Int>.size, options: MTLResourceOptions.storageModeShared)
    let icbDescriptor:MTLIndirectCommandBufferDescriptor = MTLIndirectCommandBufferDescriptor()
    icbDescriptor.commandTypes.insert(MTLIndirectCommandType.concurrentDispatchThreads)
    icbDescriptor.inheritBuffers = false
    icbDescriptor.inheritPipelineState = false
    icbDescriptor.maxKernelBufferBindCount = 4
    let indirectCommandBuffer = device.makeIndirectCommandBuffer(descriptor: icbDescriptor, maxCommandCount: 1)!
    let icbCommand = indirectCommandBuffer.indirectComputeCommandAt(0)
    icbCommand.setComputePipelineState(computePipelineState)
    icbCommand.setKernelBuffer(Buffer1, offset: 0, at: 0)
    icbCommand.setKernelBuffer(Buffer2, offset: 0, at: 1)
    icbCommand.setKernelBuffer(OutputBuffer, offset: 0, at: 2)
    icbCommand.setKernelBuffer(Size, offset: 0, at: 3)
    let w = computePipelineState.threadExecutionWidth
    let h = Int(computePipelineState.maxTotalThreadsPerThreadgroup / w)
    let z = 1
    icbCommand.concurrentDispatchThreads(MTLSize(width:ldim, height: ldim, depth: ldim),
                                         threadsPerThreadgroup:MTLSize(width:w, height: h, depth: z))
    let commandQueue = device.makeCommandQueue()!
    for _ in 0..<10 {
        let commandBuffer = commandQueue.makeCommandBuffer()!
        let computeCommandEncoder = commandBuffer.makeComputeCommandEncoder()!
        computeCommandEncoder.executeCommandsInBuffer(indirectCommandBuffer, range:0..<1)
        computeCommandEncoder.endEncoding()
        commandBuffer.commit()
        commandBuffer.waitUntilCompleted()
    }
    print("Last entry of buffer (it should not be 0)",
          OutputBuffer!.contents().assumingMemoryBound(to: Float.self)[ldim*ldim*ldim-1])
    return(OutputBuffer!.contents().assumingMemoryBound(to: Float.self))
}

@_cdecl("metalswift_add_standard")
public func metalswift_add_standard(array1: UnsafeMutablePointer<Float>,
                                    array2: UnsafeMutablePointer<Float>,
                                    ldim: Int) -> UnsafeMutablePointer<Float> {
    // Rest of code not shown for space constraints in post
```

We run this Swift code as part of a C Python extension, with Python code such as:

```python
import numpy as np
import myModule  # this wraps the Swift library

Side3DArray = 12
a = np.arange(Side3DArray**3, dtype=np.single).reshape(Side3DArray, Side3DArray, Side3DArray)
res = myModule.addition_command_buffer(a*2, a)
print(res.flatten())
res2 = myModule.addition_standard(a*2, a)
print(res2.flatten())
print(np.all((10*(a*2+a)) == res))
print(np.all((10*(a*2+a)) == res2))
```

When we run with Side3DArray=12, the last test returns True for both the indirect command buffer and the standard pipeline:

```
USING INDIRECT COMMAND BUFFER
Last entry of buffer (it should not be 0) 51810.0
[0.000e+00 3.000e+01 6.000e+01 ... 5.175e+04 5.178e+04 5.181e+04]
USING STANDARD COMMAND
Last entry of buffer (it should not be 0) 51810.0
[0.000e+00 3.000e+01 6.000e+01 ... 5.175e+04 5.178e+04 5.181e+04]
True
True
```

but when running with Side3DArray=16 the indirect buffer approach returns 0s in the output:

```
USING INDIRECT COMMAND BUFFER
Last entry of buffer (it should not be 0) 0.0
[0. 0. 0. ... 0. 0. 0.]
USING STANDARD COMMAND
Last entry of buffer (it should not be 0) 122850.0
[0.0000e+00 3.0000e+01 6.0000e+01 ... 1.2279e+05 1.2282e+05 1.2285e+05]
False
True
```

Tests with MTL_DEBUG_LAYER=1 and MTL_SHADER_VALIDATION=1 did not indicate any issue. We'd therefore like to know: Is there something else we are missing for correct indirect command buffer execution? Is there some sort of limitation of indirect command buffers that would prevent using more demanding compute kernels? Thanks for any advice, Sam. PS: I forgot to mention, we tried to capture an Instruments trace, but Xcode crashes when trying to open it.
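For reference, the kernel's addressing is plain row-major order, so the "last entry" check in the logs above is reading linear index n³ − 1. A CPU sketch of the same index math (C++ here since MSL is C++-based; this is just the arithmetic, not the poster's code):

```cpp
// Row-major linear index used by the kernel above: element (i, j, k)
// of an n x n x n array lives at i*n*n + j*n + k, so the last entry
// (n-1, n-1, n-1) lands at n*n*n - 1.
inline int linearIndex(int i, int j, int k, int n) {
    return i * n * n + j * n + k;
}
```

With n = 16 the last entry sits at index 4095, which is exactly the first failing size reported, so the cutoff lines up with the 4096-entry boundary rather than with any particular (i, j, k).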
Posted Last updated
.
Post not yet marked as solved
3 Replies
513 Views
I am training a model using tensorflow-metal and model training (and the whole application) freezes up. The behavior is nondeterministic. I believe the problem is with Metal (1) because of the contents of the backtraces below, and (2) because when I run the same code on a machine with non-Metal TensorFlow (using a GPU), everything works fine. I can't share my code publicly, but I would be willing to share it with an Apple engineer privately over email if that would help. It's hard to create a minimum reproduction example since my program is somewhat complex and the bug is nondeterministic. The bug does appear pretty reliably. It looks like the problem might be in some Metal Performance Shaders init code. The state of everything (backtraces, etc.) when the program freezes is attached. Backtraces
Posted
by andmis.
Last updated
.
Post not yet marked as solved
10 Replies
2.5k Views
I have a project that solves the viscoelastic equation for sound transmission in biological media: https://github.com/ProteusMRIgHIFU/BabelViscoFDTD. This code supports CUDA, OpenCL, Metal, and OpenMP backends. We have done a lot of fine-tuning for each backend to get the best performance possible on each platform. Details of the numerical simulation and the hardware used are given in the link above. Here you can see a summary of the results: First of all, the M1 Max is a knockout to both AMD and Nvidia, but only if using OpenCL. Worth noting, the OpenMP performance of the M1 Max is also more than excellent. It is simply mind-blowing that the M1 Max is neck and neck with an Nvidia RTX A6000 that costs more than the MacBook Pro used for the test. Metal results, on the other hand, are a bit inconsistent. Metal shows excellent results on the AMD W6800 Pro (the best computing time of all tested GPUs), but not so much with a Vega 56 or the M1 Max. For all Metal-capable processors, we used the first formula recommended at https://developer.apple.com/documentation/metal/calculating_threadgroup_and_grid_sizes. Further tests trying different domain sizes showed that the M1 Max with OpenCL can get even better results than the A6000, but Metal remains lagging by a lot. Is there something else for the M1 Max with Metal that I could be missing or that is worth exploring? I want to be sure our applications are future-proof; it was even surprising that OpenCL is still alive in Monterey, but we know it is supposed to be discontinued at some point.
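For context, the "first formula" from that Apple guide amounts to a small calculation: one threadgroup is `threadExecutionWidth` wide and `maxTotalThreadsPerThreadgroup / threadExecutionWidth` tall, and the grid is rounded up so every cell is covered. A CPU-side C++ sketch of just that arithmetic (the parameter names stand in for the `MTLComputePipelineState` properties of the same names; this is not our Metal code):

```cpp
// Threadgroup sizing per Apple's "Calculating Threadgroup and Grid
// Sizes" guide: fill one SIMD-width row, then stack rows until the
// per-threadgroup thread budget is exhausted.
struct GroupSize { int w, h; };

GroupSize threadgroupSize(int threadExecutionWidth, int maxTotalThreadsPerThreadgroup) {
    int w = threadExecutionWidth;                // e.g. 32 on recent GPUs
    int h = maxTotalThreadsPerThreadgroup / w;   // rows that still fit
    return {w, h};
}

// Number of threadgroups needed to cover one grid dimension
// (ceil division, so partial edge groups are still dispatched).
int threadgroupsPerGrid(int gridExtent, int groupExtent) {
    return (gridExtent + groupExtent - 1) / groupExtent;
}
```

One thing worth checking when Metal lags on a given chip is whether this default shape actually suits the kernel's memory-access pattern; the guide's formula maximizes occupancy but says nothing about locality, so a hand-tuned threadgroup shape per GPU family can behave quite differently.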
Posted Last updated
.
Post not yet marked as solved
0 Replies
127 Views
Hello, I started experimenting with MPSGraph and decided to begin with the sample project from WWDC 2020 session 10677, but I'm experiencing some strange behavior. Without making any alterations, I ran the build in the simulator and everything appears fine until I hit the train button. On the first attempt the build crashes with an error at the attempt to commit the command buffer within the runTrainingIterationBatch() function, with the message "-[MTLDebugCommandBuffer lockPurgeableObjects]:2103: failed assertion `MTLResource 0x600000640880 (label: (null)), referenced in cmd buffer 0x15000f600 (label: (null)) is in volatile or empty purgeable state at commit'". From there I checked the documentation for the graph.encode command in the header file, and it states that commitAndContinue might be called and not to rely on the buffer remaining uncommitted. I figured there must have been a change under the hood since the sample was made, so I commented out the .commit line the error originated from and attempted to run it again. No crash this time, but the network doesn't appear to be training: the loss chart does not populate with any values, nor does the operation ever appear to finish. I added a print statement in the updateProgressCubeAndLoss completion to print the iteration number and the loss at that iteration. It showed a loss value of 0.0 at every iteration, and no further iterations took place after 63. Occasionally on repeated tests, 1-3 iterations would show a loss value other than 0.0, but that's it. I haven't been able to find much elsewhere online, or other sample projects for MPSGraph. Any insight or assistance would be appreciated. Again, this is the sample project demonstrated in WWDC 2020 session 10677 and pulled directly from the developer documentation page listed below.
https://developer.apple.com/documentation/metalperformanceshadersgraph/training_a_neural_network_using_mps_graph I'm running Xcode version 13.4.1 and MacOS 12.4 on a 2020 MacBook Pro with an M1 chip. I also tried a 2018 Mac mini with a 3 GHz 6-Core Intel i5 with the same software versions and got the same result.
Posted
by ModernEd.
Last updated
.
Post not yet marked as solved
0 Replies
108 Views
In my game project, there is a functions.data file at /AppData/Library/Caches/[bundleID]/com.apple.metal/functions.data. When we reboot and launch the game, this file is reset to about 40 KB; normally its size is about 30 MB. This is done by Metal. Is there any way to avoid it?
Posted Last updated
.
Post not yet marked as solved
0 Replies
144 Views
I have a few loops that are part of my deep learning pipeline. I got everything into MPSGraph, but graph.for loop with scatter is extremely slow for my use case, and there is no sane way to vectorize it. What is the best way to update values in an existing tensor, or to help the Metal optimizer vectorize these ops? Maybe something like jax.at[idx].set? (Right now I'm using a graph.for loop with scatter/gather ops.)
Posted
by machineko.
Last updated
.
Post not yet marked as solved
0 Replies
139 Views
I am rendering around 12K vertices. I have some fragment functions that are lengthy, around 500 lines. These shaders do some light calculations and therefore access constant buffers. The problem is that these shaders take too much time, around 50-70 ms, just to render 12K vertices. When I profiled my app, Xcode reported 4 KiB of memory spilling in the fragment function, and that the fragment function spends 60% of its time waiting on memory reads from buffers. I can't optimise my fragment shaders any further, and I am surprised that the Metal API is this slow rendering so few vertices with a lengthy fragment function. What could be the problem? Is there an issue with how I bind my buffers to the fragment shader? Why is a Metal fragment function so slow compared to OpenGL?
Posted Last updated
.
Post marked as solved
3 Replies
369 Views
Hi, I'm writing a Metal backend for the Leela Chess Zero NN-based chess engine, using MPSGraph for inference. I have found that I get an error:

```
/System/Volumes/Data/SWE/macOS/BuildRoots/220e8a1b79/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MetalPerformanceShaders-124.6.1/MPSCore/Utility/MPSLibrary.mm:311: failed assertion MPSLibrary::MPSKey_Create internal error: Unable to get MPS kernel ndArrayConvolution2D.
```

whenever I run inference on the graph. Does anyone know what would cause this error?
Posted
by almaudoh.
Last updated
.
Post not yet marked as solved
1 Replies
197 Views
Hi there, I want to use Xcode to profile my UE4 game on a mobile platform. I read this article: Optimizing Performance with the Shader Profiler. The picture above shows what I want: I need this tool to show the time spent on each instruction of each shader line. However, my UE4 Xcode project does not seem to provide the build option mentioned in that article, even though I enabled all the CVar variables in Engine/Config/ConsoleVarible.ini, as follows:

```
r.Shaders.Optimize=0
r.Shaders.KeepDebugInfo=1
r.Shaders.SkipCompression=1
r.ShaderDevelopmentMode=1
r.DumpShaderDebugShortNames=1
```

What should I do now? Further, I checked out the official demo project Rendering a Scene with Deferred Lighting in C++. It has the exact "produce debugging information" build option and worked perfectly, so I wonder how I can get the profiler feature while using UE4. The same question that I posted in the Unreal forums hasn't been answered yet. All the Metal shader files are compiled by the UE4 cooking process, not the Xcode shader compiler, so I suspect this is the key problem. What's the correct way to get the Metal shader performance pie chart? Could somebody help? Anything will be appreciated.
Posted
by Evan_CJ.
Last updated
.
Post not yet marked as solved
3 Replies
1.2k Views
Hello guys. With the release of the M1 Pro and M1 Max in particular, the Mac has become a platform that could become very interesting for games in the future. However, since some features are still missing in Metal, it could be problematic for some developers to port their games to Metal. Especially with Unreal Engine 5 you can already see a tendency in this direction, since e.g. Nanite and Lumen are unfortunately not available on the Mac. As a Vulkan developer I wanted to inquire about some features that are not yet available in Metal at the moment. These features are very interesting if you want to write a GPU-driven renderer for modern game engines. Furthermore, they could be used to emulate D3D12 on the Mac via MoltenVK, which would result in more games being available on the Mac. Buffer device address: this feature allows the application to query a 64-bit device address for a buffer. It is very useful for D3D12 emulation and for compatibility with Vulkan, e.g. to implement ray tracing in MoltenVK. DrawIndirectCount: this feature allows an application to source the number of draws for indirect drawing calls from a buffer. Also very useful in many GPU-driven situations. Argument buffer resource limit: Metal has a limit of 500,000 resources per argument buffer. To be equivalent to D3D12 Resource Binding Tier 2, you would need 1 million. This is also very important, as many DirectX 12 game engines could then be ported to Metal more easily. Mesh shader / task shader: two interesting new shader stages to optimize the rendering pipeline. Are there any plans to implement these features in the future? Is there a roadmap for Metal? Is there a website where I can suggest features to the Metal developers? I hope to see at least the first three features in Metal in the future, and I think that many developers feel the same way. Best regards, Marlon
Posted
by zmxrlxn.
Last updated
.
Post marked as solved
1 Replies
227 Views
Hello developers, I'm implementing slice rendering of a 3D volume, and I have a simple question. I use a simple vertex buffer type in both Swift code and Metal code. Initially I declared uv as float2, but it didn't work: the texture coordinates come out wrong when I use float2. It only works when uv is float3:

```swift
public struct VertexIn: sizeable {
    var position = float3()
    var normal = float3()
    var uv = float3()
}
```

```metal
struct VertexIn {
    float3 position [[ attribute(0) ]];
    float3 normal [[ attribute(1) ]];
    float3 uv [[ attribute(2) ]];
};
```

The only difference is the type of uv: float2 versus float3. I have the same issue when passing uniforms to the shader. When I pass a uniform that includes float or short types it doesn't work, so I changed those types to float3 as well. So my question is: are Metal data types really that different from Swift types? Which types match and are supported by Metal?
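The usual cause of this symptom is an alignment/stride mismatch rather than an unsupported type: in simd-style layouts (which Metal and Swift's SIMD types both follow), a float3 occupies 16 bytes while a float2 occupies only 8, so swapping one for the other changes the byte offsets of everything after it unless the vertex descriptor is updated to match. A C++ sketch of the layouts involved (MSL is C++-based; the struct names are illustrative, and the sizes assume simd-style padding):

```cpp
#include <cstddef>

// simd-style vector types: float3 holds three floats but is padded to
// 16 bytes; float2 is exactly 8 bytes. This padding is the usual source
// of "weird texture coordinates" when the two sides disagree.
struct alignas(16) float3 { float x, y, z; };
struct alignas(8)  float2 { float x, y; };

// With a float2 uv, the attribute still starts at byte 32, but it only
// occupies 8 of the 16 bytes a float3 would; a vertex descriptor whose
// offsets/stride were computed for float3 will read the wrong bytes.
struct VertexF2 {
    float3 position;  // bytes 0..15
    float3 normal;    // bytes 16..31
    float2 uv;        // bytes 32..39
};
```

The practical fix is to keep one struct definition authoritative and derive the `MTLVertexDescriptor` offsets and stride from it (e.g. `MemoryLayout<VertexIn>.stride` and per-field offsets on the Swift side), rather than hard-coding values that silently go stale when a field's type changes.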
Posted
by wonkieun.
Last updated
.
Post not yet marked as solved
0 Replies
239 Views
On a physical device I always get an error for Family Controls: Error Domain=FamilyControls.FamilyControlsError Code=2 "(null)"

```swift
AuthorizationCenter.shared.requestAuthorization { result in
    switch result {
    case .success():
        print("Allowed to control app")
        break
    case .failure(let error):
        print("Error for Family Controls: \(error)")
    }
}
```

My question is: how can I authorize a parent using the Family Controls API, in order to use, for example, the Device Activity framework and the Managed Settings framework?
Posted Last updated
.