Metal Performance Shaders


Use Metal Performance Shaders to optimize graphics and compute performance with kernels fine-tuned for the unique characteristics of each Metal GPU family.

Metal Performance Shaders Documentation

Posts under Metal Performance Shaders tag

54 Posts
Post not yet marked as solved
2 Replies
49 Views
Hi - I'm trying to implement a Barnes-Hut N-body simulation in Metal. The code requires construction of a tree, and the CUDA implementation uses locks to allow insertion of new nodes into the tree. I've tried using an array of atomic ints in a test case, but this doesn't seem to work:

```metal
kernel void binning_compute_function(device MyArgument *arg1 [[ buffer(0) ]],
                                     constant float *ranarr [[ buffer(1) ]],
                                     device volatile atomic_int *flagArr [[ buffer(2) ]],
                                     device int *bins [[ buffer(3) ]],
                                     uint index [[ thread_position_in_grid ]])
{
    int expected = 0;
    int ibin = (ranarr[index] * arg1->nbins);
    for (int i = 0; i < 100000000; i++) { // Lock
        expected = 0;
        bool test = !atomic_compare_exchange_weak_explicit(&flagArr[ibin], &expected, 1,
                                                           memory_order_relaxed, memory_order_relaxed);
        if (test) {
            bins[ibin] += 1;
            atomic_store_explicit(&flagArr[ibin], 0, memory_order_relaxed);
            break;
        }
    }
}
```

Any other suggestions? The alternative is to use the CPU for this, but it seems a shame to miss out on the processing power of the GPU. Thank you, Colin
Post not yet marked as solved
0 Replies
45 Views
I'm trying to work with a buffer that has a size greater than the maximum allowed length for an MTLBuffer, so I have split it into an array of MTLBuffers like this:

```swift
// swift declaration
var _mBuffers : [MTLBuffer]!

// swift allocation
_mBuffers = []
for _ in 0..<64 { // use 64 buffers
    if let buffer = _mDevice.makeBuffer(length: BUFFER_BYTES, options: .storageModeShared) {
        _mBuffers.append(buffer)
    }
}
// we now have an array of 64 MTLBuffers in _mBuffers
```

Then in my Metal compute kernel I have a function like this, where I want to organise data across the 64 buffers:

```metal
struct Foo {
    device uint* data_buffers[64];
};

kernel void test_buffers(constant Foo &f [[ buffer(0) ]],
                         device metal::atomic_uint *out_counters,
                         const uint index [[ thread_position_in_grid ]])
{
    // figure out which group the value belongs to
    uint group = index % 64;
    // find the slot in memory for that group
    uint position = atomic_fetch_add_explicit(&out_counters[group], 1, memory_order_relaxed);
    // add the value to the buffer for that group, at its position in the buffer
    f.data_buffers[group][position] = index;
}
```

Now I can't figure out how to call this kernel function from Swift code using the command encoders:

```swift
// swift calling code snippets... (not even close to working)

// this should be the same as the C struct in the kernel code?
struct Foo {
    var buffers : [MTLBuffer]
}

// how to pass the _mBuffers with 64 elements to the computeEncoder arguments??
var arguments : Foo = {
    buffers: [MTLBuffer]!
}

// how to pass the arguments to the computeEncoder correctly?
computeEncoder.setBytes(&arguments, length: MemoryLayout<Foo>.stride, index: 0)
```

Any help would be much appreciated! I feel like this should be simpler than it is.
Post not yet marked as solved
0 Replies
72 Views
Hi, related to another post I made trying to explore better ways to increase the efficiency of Metal compute kernels for very large calculations, we gave indirect compute command buffers a shot. When we implemented the app mentioned in that post with indirect buffers, our final result buffers always came back filled with 0s. We then implemented a much simpler test that just adds two buffers, to try to understand where the issue could be. As a proof of concept we use the following kernel, which does an accumulation task over a cubic 3D array:

```metal
#define size (*size_pr)
kernel void metalswift_add(const device float *Buffer1 [[ buffer(0) ]],
                           const device float *Buffer2 [[ buffer(1) ]],
                           device float *OutputBuffer [[ buffer(2) ]],
                           const device int *size_pr [[ buffer(3) ]],
                           uint3 gid [[ thread_position_in_grid ]])
{
    int i = gid.x;
    int j = gid.y;
    int k = gid.z;
    if (i < size && j < size && k < size) {
        int index = i*size*size + j*size + k;
        OutputBuffer[index] += Buffer1[index] + Buffer2[index];
    }
}
```

The indirect command buffer works as expected for memory buffers of modest size; for example, matrices smaller than 8 x 8 x 8 = 512 float entries. But when the buffer becomes a bit bigger (anything equal to or larger than 16 x 16 x 16 = 4096 float entries), the indirect command buffer seems to stop working and our result buffer contains only 0s. When running through a standard command pipeline, we obtain the desired results. Below is the Swift implementation of the indirect command buffer (the standard pipeline is not shown for space constraints):

```swift
// ...
@_cdecl("metalswift_add_indirect")
public func metalswift_add_indirect(array1: UnsafeMutablePointer<Float>,
                                    array2: UnsafeMutablePointer<Float>,
                                    ldim: Int) -> UnsafeMutablePointer<Float> {
    print("USING INDIRECT COMMAND BUFFER")
    var device : MTLDevice!
    device = MTLCreateSystemDefaultDevice()!
    let defaultLibrary = try! device.makeLibrary(source: computeKernel, options: nil)
    let kernel_function = defaultLibrary.makeFunction(name: "kernel_operation")!
    let descriptor = MTLComputePipelineDescriptor()
    descriptor.computeFunction = kernel_function
    descriptor.supportIndirectCommandBuffers = true
    let computePipelineState = try! device.makeComputePipelineState(descriptor: descriptor, options: .init(), reflection: nil)
    let Ref1 : UnsafeMutablePointer<Float> = UnsafeMutablePointer(array1)
    let Ref2 : UnsafeMutablePointer<Float> = UnsafeMutablePointer(array2)
    var size = ldim
    let SizeBuffer : UnsafeMutablePointer = UnsafeMutablePointer(&size)
    let ll = MemoryLayout<Float>.stride * ldim*ldim*ldim
    let Buffer1:MTLBuffer! = device.makeBuffer(bytes: Ref1, length: ll, options: MTLResourceOptions.storageModeShared)
    let Buffer2:MTLBuffer! = device.makeBuffer(bytes: Ref2, length: ll, options: MTLResourceOptions.storageModeShared)
    let OutputBuffer:MTLBuffer! = device.makeBuffer(length: ll, options: MTLResourceOptions.storageModeShared)
    let Size:MTLBuffer! = device.makeBuffer(bytes: SizeBuffer, length: MemoryLayout<Int>.size, options: MTLResourceOptions.storageModeShared)
    let icbDescriptor:MTLIndirectCommandBufferDescriptor = MTLIndirectCommandBufferDescriptor()
    icbDescriptor.commandTypes.insert(MTLIndirectCommandType.concurrentDispatchThreads)
    icbDescriptor.inheritBuffers = false
    icbDescriptor.inheritPipelineState = false
    icbDescriptor.maxKernelBufferBindCount = 4
    let indirectCommandBuffer = device.makeIndirectCommandBuffer(descriptor: icbDescriptor, maxCommandCount: 1)!
    let icbCommand = indirectCommandBuffer.indirectComputeCommandAt(0)
    icbCommand.setComputePipelineState(computePipelineState)
    icbCommand.setKernelBuffer(Buffer1, offset: 0, at: 0)
    icbCommand.setKernelBuffer(Buffer2, offset: 0, at: 1)
    icbCommand.setKernelBuffer(OutputBuffer, offset: 0, at: 2)
    icbCommand.setKernelBuffer(Size, offset: 0, at: 3)
    let w = computePipelineState.threadExecutionWidth
    let h = Int(computePipelineState.maxTotalThreadsPerThreadgroup / w)
    let z = 1
    icbCommand.concurrentDispatchThreads(MTLSize(width: ldim, height: ldim, depth: ldim),
                                         threadsPerThreadgroup: MTLSize(width: w, height: h, depth: z))
    let commandQueue = device.makeCommandQueue()!
    for _ in 0..<10 {
        let commandBuffer = commandQueue.makeCommandBuffer()!
        let computeCommandEncoder = commandBuffer.makeComputeCommandEncoder()!
        computeCommandEncoder.executeCommandsInBuffer(indirectCommandBuffer, range: 0..<1)
        computeCommandEncoder.endEncoding()
        commandBuffer.commit()
        commandBuffer.waitUntilCompleted()
    }
    print("Last entry of buffer (it should not be 0)", OutputBuffer!.contents().assumingMemoryBound(to: Float.self)[ldim*ldim*ldim-1])
    return(OutputBuffer!.contents().assumingMemoryBound(to: Float.self))
}

@_cdecl("metalswift_add_standard")
public func metalswift_add_standard(array1: UnsafeMutablePointer<Float>,
                                    array2: UnsafeMutablePointer<Float>,
                                    ldim: Int) -> UnsafeMutablePointer<Float> {
    // Rest of code not shown for space constraints in post
```

We run this Swift code as part of a C Python extension, with a Python driver such as:

```python
import numpy as np
import myModule  # this wraps the Swift library

Side3DArray = 12
a = np.arange(Side3DArray**3, dtype=np.single).reshape(Side3DArray, Side3DArray, Side3DArray)
res = myModule.addition_command_buffer(a*2, a)
print(res.flatten())
res2 = myModule.addition_standard(a*2, a)
print(res2.flatten())
print(np.all((10*(a*2+a)) == res))
print(np.all((10*(a*2+a)) == res2))
```

When we run with Side3DArray=12, the last test returns True for both the indirect command buffer and the standard pipeline:

```
USING INDIRECT COMMAND BUFFER
Last entry of buffer (it should not be 0) 51810.0
[0.000e+00 3.000e+01 6.000e+01 ... 5.175e+04 5.178e+04 5.181e+04]
USING STANDARD COMMAND
Last entry of buffer (it should not be 0) 51810.0
[0.000e+00 3.000e+01 6.000e+01 ... 5.175e+04 5.178e+04 5.181e+04]
True
True
```

but when running with Side3DArray=16, the indirect buffer approach returns 0s in the output:

```
USING INDIRECT COMMAND BUFFER
Last entry of buffer (it should not be 0) 0.0
[0. 0. 0. ... 0. 0. 0.]
USING STANDARD COMMAND
Last entry of buffer (it should not be 0) 122850.0
[0.0000e+00 3.0000e+01 6.0000e+01 ... 1.2279e+05 1.2282e+05 1.2285e+05]
False
True
```

Tests with MTL_DEBUG_LAYER=1 and MTL_SHADER_VALIDATION=1 did not indicate any issue. We'd therefore like to know: Is there something else we are missing for correct indirect command buffer execution? Is there some limitation of indirect command buffers that would prevent using more demanding compute kernels? Thanks for any advice, Sam

PS: Forgot to mention, we tried to capture an Instruments trace, but Xcode crashes when trying to open the trace.
Post not yet marked as solved
0 Replies
127 Views
Hello, I started experimenting with MPSGraph and decided to start with the sample project provided with WWDC 2020 session 10677, but I'm experiencing some strange behavior. Without making any alterations, I ran the build in the simulator and everything appears fine until I hit the train button. On the first attempt the build crashes at the attempt to commit the command buffer within the runTrainingIterationBatch() function, with the message "-[MTLDebugCommandBuffer lockPurgeableObjects]:2103: failed assertion `MTLResource 0x600000640880 (label: (null)), referenced in cmd buffer 0x15000f600 (label: (null)) is in volatile or empty purgeable state at commit'". From there I checked the documentation for the graph.encode command in the header file; it states that commitAndContinue might be called, and not to rely on the buffer remaining uncommitted. I figured there must have been a change under the hood since the sample was made, so I commented out the .commit line the error originated from and ran it again. No crash this time, but the network doesn't appear to be training: the loss chart never populates with any values, and the operation never seems to finish. I added a print statement in the updateProgressCubeAndLoss completion to print the iteration number and the loss at that iteration. It showed a loss value of 0.0 at every iteration, and no further iterations took place after 63. Occasionally on repeat tests, 1-3 iterations would show a loss value other than 0.0, but that's it. I haven't been able to find much elsewhere online, or other sample projects for MPSGraph. Any insight or assistance would be appreciated. Again, this is the sample project demonstrated in WWDC 2020 session 10677 and pulled directly from the developer documentation page listed below.
https://developer.apple.com/documentation/metalperformanceshadersgraph/training_a_neural_network_using_mps_graph
I'm running Xcode 13.4.1 and macOS 12.4 on a 2020 MacBook Pro with an M1 chip. I also tried a 2018 Mac mini with a 3 GHz 6-core Intel i5 on the same software versions and got the same result.
Post not yet marked as solved
0 Replies
108 Views
In my game project, there is a functions.data file at /AppData/Library/Caches/[bundleID]/com.apple.metal/functions.data. When we reboot the device and launch the game, this file is reset to about 40KB; normally its size is about 30MB. This reset is done by Metal. Is there any way to avoid it?
Post not yet marked as solved
0 Replies
144 Views
I have a few loops that are part of my deep learning pipeline. I got everything into MPSGraph, but a graph.for loop with scatter is extremely slow for my use case, and there is no sane way to vectorize it. What is the best way to update a value in an existing tensor, or to help the Metal optimizer vectorize these ops? Maybe something like JAX's at[idx].set? (Right now I'm using a graph.for loop with scatter/gather ops.)
Post not yet marked as solved
0 Replies
139 Views
I am rendering around 12K vertices. I have some fragment functions that are around 500 lines long; these shaders do some lighting calculations and therefore access constant buffers. The problem is that these shaders take 50-70 ms just to render 12K vertices. When I profiled my app, Xcode told me there is 4 KiB of memory spilling in the fragment function, and that the fragment function spends 60% of its time waiting on memory reads from buffers. I can't optimise my fragment shaders further, and I am surprised that the Metal API is so slow rendering very few vertices with a lengthy fragment function. What could be the problem? Is there an issue with how I bind my buffers to the fragment shader? Why is the Metal fragment function so slow compared to OpenGL?
Post not yet marked as solved
1 Replies
197 Views
Hi there, I want to use Xcode to profile my UE4 game on a mobile platform. I read this article: Optimizing Performance with the Shader Profiler. The picture above is what I want: I need this tool to show the time consumed by each instruction on each shader line. However, my UE4 Xcode project doesn't seem to offer the same build option mentioned in that article, even though I enabled all the CVar variables in Engine/Config/ConsoleVarible.ini, as follows:

r.Shaders.Optimize=0
r.Shaders.KeepDebugInfo=1
r.Shaders.SkipCompression=1
r.ShaderDevelopmentMode=1
r.DumpShaderDebugShortNames=1

What should I do now? Further, I checked out the official demo project Rendering a Scene with Deferred Lighting in C++. It has the exact "Produce debugging information" build option and worked perfectly, so I wonder how I can get the profiler feature while using UE4. The same question, which I posted in the Unreal forums, hasn't been answered yet. Since all the Metal shader files are compiled by the UE4 cooking process, not the Xcode shader compiler, I suspect this is the key problem. What's the correct way to get the Metal shader performance pie chart? Could somebody help? Anything will be appreciated.
Post marked as solved
1 Replies
227 Views
Hello developers, I'm implementing slice rendering of a 3D volume, and I have a simple question. I use the same simple vertex buffer type in both the Swift code and the Metal code. At first I defined uv as float2, but that didn't work; the texture coordinates come out wrong when I use float2:

```swift
public struct VertexIn: sizeable {
    var position = float3()
    var normal = float3()
    var uv = float3()
}
```

```metal
struct VertexIn {
    float3 position [[ attribute(0) ]];
    float3 normal   [[ attribute(1) ]];
    float3 uv       [[ attribute(2) ]];
};
```

The only difference is the uv type: float2 versus float3. I have the same issue when passing uniforms to a shader: when I pass a uniform that includes float or short members it doesn't work, so I changed those types to float3 as well. So my question is: are Metal data types really that different from Swift types? Which types match and are supported by Metal?
Post not yet marked as solved
3 Replies
513 Views
I am training a model using tensorflow-metal, and model training (and the whole application) freezes up. The behavior is nondeterministic, though the bug does appear pretty reliably. I believe the problem is with Metal (1) because of the contents of the backtraces below, and (2) because when I run the same code on a machine with non-Metal TensorFlow (using a GPU), everything works fine. I can't share my code publicly, but I would be willing to share it with an Apple engineer privately over email if that would help; it's hard to create a minimal reproduction example since my program is somewhat complex and the bug is nondeterministic. It looks like the problem might be in some Metal Performance Shaders init code. The state of everything (backtraces, etc.) when the program freezes is attached. Backtraces
Post not yet marked as solved
0 Replies
239 Views
I'm trying this on a physical device and always getting an error from Family Controls: Error Domain=FamilyControls.FamilyControlsError Code=2 "(null)"

```swift
AuthorizationCenter.shared.requestAuthorization { result in
    switch result {
    case .success():
        print("Allow to controle App ")
        break
    case .failure(let error):
        print("Error for Family Controls: \(error)")
    }
}
```

My question is: how can I authorize parents using the Family Controls API, in order to use, for example, the Device Activity framework and the Managed Settings framework?
Post not yet marked as solved
4 Replies
556 Views
I have an image processing pipeline that performs some work on the CPU after the GPU processes a texture and writes its result into a shared buffer (i.e. storageMode = .shared) used by the CPU for its computation. After the CPU does its work, it similarly writes at a different offset into the same shared MTLBuffer object. The buffer is arranged like so (contiguous offsets starting at 0):

uint | uint | .... | uint | float

where the floating-point slot is written by the CPU and later used by the GPU in subsequent compute passes. I haven't been able to explain or find documentation for the following strange behavior. The compute pipeline with the above buffer (call it buffer A) is as follows (without the force unwraps):

```swift
let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let commandBuffer = commandQueue.makeCommandBuffer()!

let sharedEvent = device.makeSharedEvent()!
let sharedEventQueue = DispatchQueue(label: "my-queue")
let sharedEventListener = MTLSharedEventListener(dispatchQueue: sharedEventQueue)

// Compute pipeline
kernelA.encode(commandBuffer: commandBuffer, sourceTexture: sourceTexture, destinationBuffer: bufferA)

commandBuffer.encodeCPUExecution(for: sharedEventObject, listener: sharedEventListener) { [self] in
    var value = Float(0.0)
    bufferA.unsafelyWrite(&value, offset: Self.targetBufferOffset)
}

kernelB.setTargetBuffer(histogramBuffer, offset: Self.targetBufferOffset)
kernelB.encode(commandBuffer: commandBuffer, sourceTexture: sourceTexture, destinationTexture: destinationTexture)
```

Note that commandBuffer.encodeCPUExecution is simply a convenience function around the shared event object (encodeSignalEvent and encodeWaitEvent) that signals and waits on event.signaledValue + 1 and event.signaledValue + 2 respectively. In the example above, kernelB does not see the writes made during the CPU execution. It can, however, see the values written into the buffer by kernelA.

The strange part: if you write to that same location in the buffer before the GPU schedules this work (e.g. during encoding, or any time before, rather than in the middle of GPU execution), kernelB does see the CPU's writes. This odd behavior suggests to me that something is undefined. If the buffer were .managed I could understand it, since changes on each side must be made explicit; but with a .shared buffer this behavior seems quite unexpected, especially considering that the CPU can read the values written by the preceding kernel (viz. kernelA). What explains this strange behavior in Metal?

Note: this behavior occurs on an M1 Mac running Mac Catalyst and an iPad Pro (5th generation) running iOS 15.3.
Post marked as solved
3 Replies
369 Views
Hi, I'm writing a Metal backend for the Leela Chess Zero NN-based chess engine, using MPSGraph for inference. Whenever I run inference on the graph, I get this error:

/System/Volumes/Data/SWE/macOS/BuildRoots/220e8a1b79/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MetalPerformanceShaders-124.6.1/MPSCore/Utility/MPSLibrary.mm:311: failed assertion MPSLibrary::MPSKey_Create internal error: Unable to get MPS kernel ndArrayConvolution2D.

Does anyone know what would cause this error?
Post not yet marked as solved
1 Replies
260 Views
I want to debug a Metal shader using Xcode, but it prompts like this: [screenshot] The Apple doc below says we can use this method to solve it, but I cannot find the "Produce debugging information" setting in Xcode 12.5: https://developer.apple.com/documentation/metal/shader_authoring/developing_and_debugging_metal_shaders