Metal Performance Shaders


Optimize graphics and compute performance with kernels that are fine-tuned for the unique characteristics of each Metal GPU family using Metal Performance Shaders.

Metal Performance Shaders Documentation

Posts under Metal Performance Shaders tag

54 Posts
Post not yet marked as solved
4 Replies
556 Views
I have an image processing pipeline in which the CPU performs some work after the GPU processes a texture and writes its result into a shared buffer (i.e. storageMode = .shared) used by the CPU for its computation. After the CPU does its work, it similarly writes at a different offset into the same shared MTLBuffer object. The buffer is arranged as follows:

uint | uint | .... | uint | float
offsets (contiguous): 0 | ...

where the floating-point slot is written by the CPU and later used by the GPU in subsequent compute passes. I haven't been able to explain, or find documentation for, the following strange behavior. The compute pipeline with the above buffer (call it buffer A) is as follows (without the force unwraps):

```swift
let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let commandBuffer = commandQueue.makeCommandBuffer()!
let sharedEvent = device.makeSharedEvent()!
let sharedEventQueue = DispatchQueue(label: "my-queue")
let sharedEventListener = MTLSharedEventListener(dispatchQueue: sharedEventQueue)

// Compute pipeline
kernelA.encode(commandBuffer: commandBuffer, sourceTexture: sourceTexture, destinationBuffer: bufferA)

commandBuffer.encodeCPUExecution(for: sharedEventObject, listener: sharedEventListener) { [self] in
    var value = Float(0.0)
    bufferA.unsafelyWrite(&value, offset: Self.targetBufferOffset)
}

kernelB.setTargetBuffer(histogramBuffer, offset: Self.targetBufferOffset)
kernelB.encode(commandBuffer: commandBuffer, sourceTexture: sourceTexture, destinationTexture: destinationTexture)
```

Note that commandBuffer.encodeCPUExecution is simply a convenience function around the shared event object (encodeSignalEvent and encodeWaitEvent) that signals and waits on event.signaledValue + 1 and event.signaledValue + 2, respectively. In the example above, kernelB does not see the writes made during the CPU execution. It can, however, see the values written into the buffer by kernelA.
The strange part: if you write to that same location in the buffer before the GPU schedules this work (e.g. during encoding, rather than in the middle of GPU execution), kernelB does see the value written by the CPU. This odd behavior suggests to me that there is undefined behavior somewhere. If the buffer were .managed I could understand it, since changes on each side must then be made explicit; but with a .shared buffer this behavior seems quite unexpected, especially considering that the CPU can read the values written by the preceding kernel (viz. kernelA). What explains this strange behavior with Metal? Note: this behavior occurs on an M1 Mac running Mac Catalyst and an iPad Pro (5th generation) running iOS 15.3.
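For reference, a minimal sketch of what a convenience like encodeCPUExecution(for:listener:_:) might look like, using only documented MTLSharedEvent API (the helper itself is the poster's own; this reconstruction is an assumption based on the description above):

```swift
import Metal

extension MTLCommandBuffer {
    /// Hypothetical reconstruction: the GPU signals `base + 1`, the CPU work
    /// runs in the event listener's notification handler, then the CPU signals
    /// `base + 2`, which the GPU waits on before continuing.
    func encodeCPUExecution(for event: MTLSharedEvent,
                            listener: MTLSharedEventListener,
                            _ work: @escaping () -> Void) {
        let base = event.signaledValue
        event.notify(listener, atValue: base + 1) { event, _ in
            work()                          // runs on the listener's queue
            event.signaledValue = base + 2  // releases the GPU-side wait
        }
        self.encodeSignalEvent(event, value: base + 1)
        self.encodeWaitForEvent(event, value: base + 2)
    }
}
```

Note that the wait/signal pair only orders command-buffer stages around the CPU block; whether later kernels observe the CPU's stores also depends on the buffer's storage mode and any caching behavior, which is exactly what the question is probing.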
Posted Last updated
.
Post not yet marked as solved
0 Replies
469 Views
While the three frameworks above (viz. vImage, Core Image, and MetalPerformanceShaders) serve different overall purposes, what are the strengths and weaknesses of each of them in terms of image-processing performance? It seems that any of the three is highly performant; but where does each framework shine?
Posted Last updated
.
Post not yet marked as solved
1 Replies
261 Views
I want to debug a Metal shader using Xcode, but it prompts an error. The Apple documentation says this method should solve it, but I cannot find the "Produce debugging information" option in Xcode 12.5: https://developer.apple.com/documentation/metal/shader_authoring/developing_and_debugging_metal_shaders
Posted
by zhanghua.
Last updated
.
Post not yet marked as solved
7 Replies
1.4k Views
I am working on the implementation of a highly demanding signal processing algorithm, and I am not able to reach an acceptable execution time with vDSP's routines. I am now having a look at Metal and learning how to use it. It seems that Metal Performance Shaders as well as MPS Graph could replace almost all of my vDSP calls, but not the Fast Fourier Transform (which is the most time-consuming part of the algorithm). I was wondering if there's a way for FFT methods to be included in MPS, because they could be insanely fast if optimized for the unified memory architecture of the M1. Thanks!
Posted Last updated
.
Post marked as solved
1 Replies
231 Views
When I try to run my matrix multiplication, I receive the following warning on iOS but not on macOS: 'init(dimensions:columns:rowBytes:dataType:)' was deprecated in iOS 11.0. How may I change my code to remove the iOS warning? Here is the line generating the warning:

```swift
let mdesc = MPSMatrixDescriptor(dimensions: 2,
                                columns: 2,
                                rowBytes: rowbytes,
                                dataType: MPSDataType.float16)
```
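The deprecated initializer has a direct replacement that takes an explicit row count instead of a dimension count. A hedged sketch of the updated call, assuming the same 2×2 half-precision matrix (the descriptor class also provides a helper for computing an optimal row stride):

```swift
import MetalPerformanceShaders

// rowBytes(fromColumns:dataType:) returns a recommended, aligned row stride.
let rowbytes = MPSMatrixDescriptor.rowBytes(fromColumns: 2, dataType: .float16)

// Non-deprecated initializer: rows/columns instead of dimensions/columns.
let mdesc = MPSMatrixDescriptor(rows: 2,
                                columns: 2,
                                rowBytes: rowbytes,
                                dataType: .float16)
```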
Posted Last updated
.
Post not yet marked as solved
1 Replies
377 Views
I have a Metal compute kernel for dense matrix multiply, and I'd like to optimize it with simdgroup_float8x8 and simdgroup_half8x8. However, it seems nobody uses them in Metal. Can you give me some more demos of how to use them, beyond what's in the Metal Shading Language Specification, Version 2.4? Thanks!
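For what it's worth, a minimal hedged sketch of the SIMD-group matrix types, assuming row-major matrices and a K that is a multiple of 8 (buffer layout, tiling over rows/columns, and bounds handling are all left out; the whole SIMD group executes these operations cooperatively):

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch: accumulate one 8x8 output tile of C = A * B.
// Assumes A is 8xK and B is Kx8, both row-major; a real kernel
// would index tiles from thread/threadgroup position.
kernel void sgemm_tile(device const float *A [[buffer(0)]],
                       device const float *B [[buffer(1)]],
                       device float       *C [[buffer(2)]],
                       constant uint      &K [[buffer(3)]])
{
    simdgroup_float8x8 acc(0.0f);
    for (uint k = 0; k < K; k += 8) {
        simdgroup_float8x8 a, b;
        simdgroup_load(a, A + k, K);      // stride = elements per row of A
        simdgroup_load(b, B + k * 8, 8);  // stride = elements per row of B
        simdgroup_multiply_accumulate(acc, a, b, acc);
    }
    simdgroup_store(acc, C, 8);           // write the 8x8 result tile
}
```

The load/store stride argument is the number of elements per row of the source or destination, which is what lets a tile be addressed inside a larger matrix.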
Posted
by PYNing.
Last updated
.
Post not yet marked as solved
0 Replies
300 Views
Hello everybody. I'm trying to port graphics code written in Cg in Unity to Metal. Also, I don't want to manually implement a scene graph, so I'm going to use SceneKit. That means using SCNProgram or SCNNodeRendererDelegate, and I think SCNProgram is more comfortable. My real question is how to convert this Cg code:

Cull Front
ZTest LEqual
ZWrite On
Blend SrcAlpha OneMinusSrcAlpha

I know how to set source-alpha blending on MTLRenderPipelineDescriptor, the z-buffer on the render command encoder, and the cull face as well. But when I use SCNProgram or SCNSceneRendererNode, I can't find these options... how do I set them? Help me.
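When a material drives an SCNProgram, those Cg render states map roughly onto SCNMaterial properties rather than onto the pipeline descriptor directly. A hedged sketch of the correspondence (exact depth-compare behavior is SceneKit's default, not configurable per state here):

```swift
import SceneKit

// Sketch: approximate SCNMaterial equivalents of the Cg states above.
func configure(material: SCNMaterial, program: SCNProgram) {
    material.program = program
    material.cullMode = .front              // Cull Front
    material.writesToDepthBuffer = true     // ZWrite On
    material.readsFromDepthBuffer = true    // ZTest (SceneKit tests depth by default)
    material.blendMode = .alpha             // Blend SrcAlpha OneMinusSrcAlpha
}
```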
Posted
by wonkieun.
Last updated
.
Post not yet marked as solved
1 Replies
338 Views
After capturing several Metal frames of my iOS game (packaged as an IPA from UE4), I fail to get the shader source as before; a message box comes up. As its text says, I checked my build settings. However, there are no Metal compiler build options and no "Produce debugging information" item either. My macOS is 12.1 Monterey; my Xcode is 13.1 (13A1030d). Any help will be appreciated.
Posted
by Dlphn.
Last updated
.
Post marked as solved
1 Replies
410 Views
Hello, everyone. I'm trying to use MetalKit together with SceneKit: because SceneKit's scene graph is great, I want to implement low-level Metal shaders on top of it. I want to use SCNNodeRendererDelegate without SCNProgram, because I want low-level control, for example passing extra custom MTLBuffers or doing multi-pass rendering. So I pass the model-view-projection matrix like this. In the Metal shader:

```metal
struct NodeBuffer {
    float4x4 modelTransform;
    float4x4 modelViewProjectionTransform;
    float4x4 modelViewTransform;
    float4x4 normalTransform;
    float2x3 boundingBox;
};
```

In Swift code:

```swift
struct NodeMatrix: sizeable {
    var modelTransform = float4x4()
    var modelViewProjectionTransform = float4x4()
    var modelViewTransform = float4x4()
    var normalTransform = float4x4()
    var boundingBox = float2x3()
}

// ...
private func updateNodeMatrix(_ camNode: SCNNode) {
    guard let camera = camNode.camera else {
        return
    }

    let modelMatrix = transform
    let viewMatrix = camNode.transform
    let projectionMatrix = camera.projectionTransform

    let viewProjection = SCNMatrix4Mult(viewMatrix, projectionMatrix)
    let modelViewProjection = SCNMatrix4Mult(modelMatrix, viewProjection)
    nodeMatrix.modelViewProjectionTransform = float4x4(modelViewProjection)
}
// ...
```
```swift
public func renderNode(_ node: SCNNode,
                       renderer: SCNRenderer,
                       arguments: [String: Any]) {
    guard let renderTexturePipelineState = renderTexturePipelineState,
          let renderCommandEncoder = renderer.currentRenderCommandEncoder,
          let camNode = renderer.pointOfView,
          let texture = texture
    else { return }

    updateNodeMatrix(camNode)
    guard let nodeBuffer = renderer.device?.makeBuffer(bytes: &nodeMatrix,
                                                       length: NodeMatrix.stride,
                                                       options: [])
    else { return }

    renderCommandEncoder.setDepthStencilState(depthState)
    renderCommandEncoder.setRenderPipelineState(renderTexturePipelineState)
    renderCommandEncoder.setFragmentTexture(texture, index: 0)
    renderCommandEncoder.setVertexBuffer(vertexBuffer, offset: 0, index: 0)
    renderCommandEncoder.setVertexBuffer(nodeBuffer, offset: 0, index: 1)
    renderCommandEncoder.drawIndexedPrimitives(type: .triangle,
                                               indexCount: indexCount,
                                               indexType: .uint16,
                                               indexBuffer: indexBuffer,
                                               indexBufferOffset: 0)
}
```

But I get the wrong model-view-projection matrix in the shader. I think SceneKit modifies intermediate transforms behind the scenes. I can't figure it out; help me...
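One likely culprit, offered as a guess rather than a confirmed diagnosis: the view matrix is the inverse of the camera node's world transform, not the transform itself, and the model matrix should be the node's world transform rather than its local one. A hedged sketch of the corrected computation:

```swift
import SceneKit

// Sketch: build MVP from world transforms, inverting the camera's.
func modelViewProjection(for node: SCNNode, camNode: SCNNode) -> SCNMatrix4 {
    guard let camera = camNode.camera else { return SCNMatrix4Identity }
    let modelMatrix = node.worldTransform                 // local -> world
    let viewMatrix = SCNMatrix4Invert(camNode.worldTransform) // world -> camera
    let projectionMatrix = camera.projectionTransform
    let viewProjection = SCNMatrix4Mult(viewMatrix, projectionMatrix)
    return SCNMatrix4Mult(modelMatrix, viewProjection)
}
```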
Posted
by wonkieun.
Last updated
.
Post not yet marked as solved
0 Replies
192 Views
I am trying to measure performance in my app, and I used two different ways to measure command buffer completion time. One way is using MTLCommandBuffer's addCompletedHandler:

```swift
commandBuffer.addCompletedHandler { cb in
    let executionDuration = cb.gpuEndTime - cb.gpuStartTime
    /* ... */
}
```

The other way is to use MTLCaptureManager. And I found two interesting things. First, the completion time from addCompletedHandler was 26.82 ms, while the GPU time from the capture manager was 13.49 ms. I have been trying to understand why these two numbers are so different, but couldn't find any concrete answer. Second, the GPU time is different from the shader times shown in the timeline under Performance; here is a screenshot. According to the timeline, it took 17.57 ms, so there is an inconsistency. I ran the same test multiple times, and sometimes the process time on the timeline is less than the GPU time, or vice versa. Within this command buffer there are 56 dispatches. Is this because there are too many dispatches? I tested this on an iPhone 12 Max with iOS 15.2.1. If someone can give me a clear explanation, it would be really appreciated.
Posted
by dhseo118.
Last updated
.
Post marked as solved
1 Replies
294 Views
As the documentation says, limiter counters tell you which subsystems of the GPU are active by providing the percentage of total processor cycles during which each subsystem was active. Besides these, Instruments also provides some utilization counters, and their values differ from the limiter counters. What do utilization counters mean?
Posted
by HackHarry.
Last updated
.
Post not yet marked as solved
0 Replies
256 Views
How do I clear the OpenCL cache, which contains pre-compiled OpenCL kernels? It is saved somewhere on disk, because the cache persists even after a system restart. I suppose it uses the same cache as Metal, but I cannot locate that either. This cache is problematic because if one of the header files for the OpenCL code is modified, the OpenCL kernel is not re-compiled.
Posted Last updated
.
Post not yet marked as solved
1 Replies
368 Views
Is it possible to do any of the following:
1. Export a model created using MetalPerformanceShadersGraph to a CoreML file;
2. Failing 1, save a trained MetalPerformanceShadersGraph model in any other way for deployment;
3. Import a CoreML model and use it as part of a MetalPerformanceShadersGraph model.
Thanks!
Posted
by Alan_Z.
Last updated
.
Post not yet marked as solved
1 Replies
535 Views
I have a complex CAS loop with branches that essentially implements a mutex, and I'm porting it from CUDA to Metal. I'm looking for the equivalent of CUDA's __threadfence(); see docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions. Unfortunately, metal::threadgroup_barrier(metal::mem_flags::mem_device) also implies execution synchronization and needs to be "seen" by all threads, or you deadlock. I need one store to buffer A to happen before another store to buffer B, but the only atomic memory-order option is "relaxed". How do I accomplish this ordering guarantee?
Posted
by rianflo.
Last updated
.
Post not yet marked as solved
0 Replies
328 Views
Hello! I’m having an issue with retrieving the trained weights from MLCLSTMLayer in ML Compute when training on a GPU. I maintain references to the input-weights, hidden-weights, and biases tensors and use the following code to extract the data post-training:

```swift
extension MLCTensor {
    func dataArray<Scalar>(as _: Scalar.Type) throws -> [Scalar] where Scalar: Numeric {
        let count = self.descriptor.shape.reduce(into: 1) { (result, value) in
            result *= value
        }
        var array = [Scalar](repeating: 0, count: count)
        self.synchronizeData() // This *should* copy the latest data from the GPU to memory that’s accessible by the CPU
        _ = try array.withUnsafeMutableBytes { (pointer) in
            guard let data = self.data else {
                throw DataError.uninitialized // A custom error that I declare elsewhere
            }
            data.copyBytes(to: pointer)
        }
        return array
    }
}
```

The issue is that when I call dataArray(as:) on a weights or biases tensor for an LSTM layer that has been trained on a GPU, the values that it retrieves are the same as they were before training began. For instance, if I initialize the biases all to 0 and then train the LSTM layer on a GPU, the biases values seemingly remain 0 post-training, even though the reported loss values decrease as you would expect. This issue does not occur when training an LSTM layer on a CPU, and it also does not occur when training a fully-connected layer on a GPU. Since both types of layers work properly on a CPU but only MLCFullyConnectedLayer works properly on a GPU, it seems that the issue is a bug in ML Compute’s GPU implementation of MLCLSTMLayer specifically. For reference, I’m testing my code on an M1 Max. Am I doing something wrong, or is this an actual bug that I should report in Feedback Assistant?
Posted Last updated
.
Post marked as solved
1 Replies
311 Views
There is a write function documented in the Core Image Metal shader reference here: https://developer.apple.com/metal/MetalCIKLReference6.pdf. But I'm not sure how to use it. I assumed one would be able to use it on the destination parameter, i.e. dest.write(...), but I get the error "no member named 'write' in 'coreimage::destination'". How do I use this function?
Posted Last updated
.
Post not yet marked as solved
2 Replies
404 Views
I've created a custom box-blur kernel that produces results identical to Apple's built-in box blur (CIBoxBlur), but my custom kernel is orders of magnitude slower. So naturally I am wondering what I'm doing wrong to get such poor performance. Below is my custom kernel in the Metal Shading Language. Can you spot why it's so slow? The built-in filter performs well, so I can only assume it's something I'm doing wrong.

```metal
#include <CoreImage/CoreImage.h>
#import <simd/simd.h>

extern "C" {
namespace coreimage {
    float4 customBoxBlurFilterKernel(sampler src) {
        float2 crd = src.coord();
        int edge = 100;
        int minx = crd.x - edge;
        int maxx = crd.x + edge;
        int miny = crd.y - edge;
        int maxy = crd.y + edge;
        float4 sums = float4(0, 0, 0, 0);
        float cnt = 0;
        // compute average of surrounding rgb values
        for (int row = miny; row < maxy; row++) {
            for (int col = minx; col < maxx; col++) {
                float4 samp = src.sample(float2(col, row));
                sums[0] += samp[0];
                sums[1] += samp[1];
                sums[2] += samp[2];
                cnt += 1.;
            }
        }
        return float4(sums[0] / cnt, sums[1] / cnt, sums[2] / cnt, 1);
    }
}
}
```
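A likely explanation, offered as a hypothesis rather than a diagnosis: with edge = 100 the kernel above takes 200 × 200 = 40,000 texture samples per output pixel. A box blur is separable, so two 1-D passes (horizontal, then vertical) need only 2 × (2r + 1) samples per pixel instead of (2r + 1)², which is presumably part of what CIBoxBlur does internally. A hedged sketch of the horizontal pass, assuming the same coordinate convention as the kernel above (the vertical pass is the same with the offset on y, chained as a second CIKernel application):

```metal
#include <CoreImage/CoreImage.h>
using namespace metal;

extern "C" {
namespace coreimage {
    // Sketch: 1-D horizontal box blur; run a matching vertical pass
    // on this kernel's output to get the full 2-D blur.
    float4 boxBlurHorizontal(sampler src) {
        float2 crd = src.coord();
        const int radius = 100;            // matches "edge" above
        float4 sum = float4(0.0);
        float cnt = 0.0;
        for (int dx = -radius; dx <= radius; dx++) {
            sum += src.sample(crd + float2(dx, 0));
            cnt += 1.0;
        }
        return float4(sum.rgb / cnt, 1.0);
    }
}
}
```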
Posted Last updated
.