Metal Performance Shaders


Optimize graphics and compute performance with Metal Performance Shaders: kernels fine-tuned for the unique characteristics of each Metal GPU family.

Metal Performance Shaders Documentation

Posts under Metal Performance Shaders tag

54 Posts
Post not yet marked as solved
1 Reply
535 Views
I have a complex CAS loop with branches that essentially implements a mutex, and I'm porting it from CUDA to Metal. I'm looking for the equivalent of CUDA's __threadfence(); see docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions. Unfortunately metal::threadgroup_barrier(metal::mem_flags::mem_device) also implies execution synchronization and must be reached by all threads, or the kernel deadlocks. I need one store to buffer A to happen before another store to buffer B, but the only atomic memory order Metal offers is "relaxed". How can I get this ordering guarantee?
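For illustration only (CPU-side Python with a threading.Event, not Metal, and all names here are invented for the example): the guarantee being asked for is the classic publish pattern, where the store to a data buffer must become visible before the store to a flag buffer, so that a reader who sees the flag is guaranteed to see the data.

```python
import threading

data = [0]                  # plays the role of buffer A
ready = threading.Event()   # plays the role of buffer B / the published flag

def writer():
    data[0] = 42    # store to A...
    ready.set()     # ...must become visible before the flag store to B

def reader(out):
    ready.wait()         # once the flag is observed...
    out.append(data[0])  # ...the earlier store to A must also be visible

out = []
t_writer = threading.Thread(target=writer)
t_reader = threading.Thread(target=reader, args=(out,))
t_reader.start(); t_writer.start()
t_writer.join(); t_reader.join()
print(out)  # [42] — the reader never observes the pre-store value
```

On the GPU this ordering would come from a release-style fence between the two stores, which is exactly what the post says Metal's atomics do not expose.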
Posted
by
Post not yet marked as solved
2 Replies
674 Views
I am using a converted custom PyTorch model on device with real-time video. The model was converted successfully with both CoreMLTools v4.1 and v5.0b3 (both versions exhibit the same issues). When running the model from a Python environment via CoreMLTools, as well as from a macOS app, with the same input image and supplementary data, the output is identical, correct, and matches the output of the pure PyTorch model. However, when running it on device, the model's output is incorrect.

On an iPhone XR, using the .all or .cpuAndGPU value of computeUnits, the output is simply a white square, with no error or warning message: the output, which we normally expect to be in the range [0, 255], has the value 255 in every location. However, running with .cpuOnly on the iPhone XR produces the correct output. Furthermore, when simulating a device from a macOS machine, the output is correct regardless of the computeUnits value.

On an iPhone 12 the situation gets even more confusing. With .cpuAndGPU we get the pure-white incorrect output, with .cpuOnly we get the correct output, but with .all we get a different incorrect output: an image with wildly wrong colors but a vaguely similar form to the image we expect. In addition, with the .all setting we get the following error:

```
2021-09-01 15:07:16.595048-0500 sensoriumViewer[33717:10399075] [espresso] [Espresso::ANERuntimeEngine::__forward_segment 3] evaluate[RealTime]WithModel returned 0; code=5 err=Error Domain=com.apple.appleneuralengine Code=5 "processRequest:qos:qIndex:modelStringID:options:error:: 0xd: Program Inference overflow" UserInfo={NSLocalizedDescription=processRequest:qos:qIndex:modelStringID:options:error:: 0xd: Program Inference overflow}
2021-09-01 15:07:16.595103-0500 sensoriumViewer[33717:10399075] [espresso] [Espresso::overflow_error] /private/var/containers/Bundle/Application/16433631-57DE-488C-8772-D9560C3D8B48/sensoriumViewer.app/SensoriumMLTest16V1.mlmodelc/model.espresso.net:3
```

This makes it pretty clear there is some sort of integer or floating-point overflow. What I believe is happening: regardless of model, using the GPU causes the overflow to truncate, giving values of 255 for all pixels; on the iPhone 12, .all passes the work to the ANE (Apple Neural Engine), which wraps the overflow, giving unpredictable colors but a roughly correct shape; .all on the iPhone XR just uses the GPU because for some reason this model won't go to the XR's ANE; and .cpuOnly does not overflow and gives the correct result.

Why does the XR not use its ANE for this model? Can the ANE and GPU just not handle 32-bit floats? We are quantizing the model to 16 bits using CoreMLTools, so why are we still overflowing? The documentation for the new MLProgram format seems promising; will that solve this issue? Is there any documentation on the supported operations and numeric precision for PyTorch-converted models? And why are there no errors or warnings when this runs on the GPU? Any help or insight would be greatly appreciated, as the documentation I've seen on the ANE is not very comprehensive.
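As background on the overflow hypothesis (a generic float16 illustration, unrelated to the Core ML runtime itself): IEEE-754 half precision tops out at 65504, so any intermediate value beyond that range has no finite 16-bit representation. Python's struct module can pack half-precision floats and shows the limit directly:

```python
import struct

HALF_MAX = 65504.0  # largest finite IEEE-754 half-precision value

# the maximum round-trips exactly through a 16-bit half
packed = struct.pack('<e', HALF_MAX)
assert struct.unpack('<e', packed)[0] == 65504.0

# beyond the limit there is no finite half representation;
# struct refuses outright, while GPU/ANE hardware typically
# saturates to +inf or clamps instead
try:
    struct.pack('<e', 70000.0)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)  # True
```

This is consistent with a float32-trained model whose intermediate activations exceed the half range once the GPU or ANE evaluates it in 16-bit precision.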
Posted
by
Post marked as solved
1 Reply
486 Views
Hi, I wrote the following compute shader to blur images with a somewhat complex kernel:

```metal
// use Packhalf5 to align to 16 B
struct Packhalf5 {
    half4 a;
    half4 b;
};

kernel void cs_main(
    texture2d_array<float> t_f123 [[texture(0)]],
    texture2d_array<float, access::write> t_normal [[texture(3)]],
    sampler s_f123 [[sampler(0)]],
    uint3 gl_GlobalInvocationID [[thread_position_in_grid]],
    uint3 gl_LocalInvocationID [[thread_position_in_threadgroup]])
{
    // BLOCK_SIZE_Y = 1
    // BLOCK_SIZE_X = 128 or 64 or 32
    // MAX_PIXR = 20
    // cache texture samples in threadgroup memory gCache so that we can
    // read them quickly and avoid most texture samples in the loop
    threadgroup Packhalf5 gCache[BLOCK_SIZE_Y][BLOCK_SIZE_X + 2 * MAX_PIXR];

    if (gl_LocalInvocationID.y < _35) {
        // ...
        Packhalf5 pkh5;
        pkh5.a.xyz = half3(half4(t_f123.sample(s_f123, _2.xy, uint(round(_2.z)), level(0.0))).xyz);
        pkh5.b.xy  = half2(half4(t_f45.sample(s_f45, _2.xy, uint(round(_2.z)), level(0.0))).xy);
        gCache[gl_LocalInvocationID.x][gl_LocalInvocationID.y] = pkh5;
    }
    if (gl_LocalInvocationID.y >= uint(BLOCK_SIZE_X - _34)) {
        // ...
        Packhalf5 pkh5;
        pkh5.a.xyz = half3(half4(t_f123.sample(s_f123, _3.xy, uint(round(_3.z)), level(0.0))).xyz);
        pkh5.b.xy  = half2(half4(t_f45.sample(s_f45, _3.xy, uint(round(_3.z)), level(0.0))).xy);
        gCache[gl_LocalInvocationID.x][_36] = pkh5;
    }
    Packhalf5 pkh5;
    pkh5.a.xyz = half3(half4(t_f123.sample(s_f123, _31.xy, uint(round(_31.z)), level(0.0))).xyz);
    pkh5.b.xy  = half2(half4(t_f45.sample(s_f45, _31.xy, uint(round(_31.z)), level(0.0))).xy);
    gCache[gl_LocalInvocationID.x][_37] = pkh5;

    threadgroup_barrier(mem_flags::mem_threadgroup);

    // use gCache to blur the image
    for (int i = 0; i < kernel_size; ++i) {
        // calculate indices
        int a = f1(i);
        int b = f2(i);
        Packhalf5 pkh5;
        // this load is much slower than directly sampling the texture,
        // i.e. t_f123.sample(s_f123, _3.xy, uint(round(_3.z)), level(0.0))
        pkh5 = gCache[gl_LocalInvocationID.x][a];
        float2 _42 = float2(pkh5.b.xy);
        float3 _43 = (float3(pkh5.a.xyz) * float3(1.0, 0.5, 0.5)) + float3(0.0, -0.25, -0.25);
        // this load is also much slower than directly sampling the texture
        pkh5 = gCache[gl_LocalInvocationID.x][b];
        float2 _45 = float2(pkh5.b.xy);
        float3 _46 = (float3(pkh5.a.xyz) * float3(1.0, 0.5, 0.5)) + float3(0.0, -0.25, -0.25);
        // use _42, _43, _45, _46
    }
    // write blur result
    t_normal.write(_16, uint2(gl_GlobalInvocationID.xy), uint(gl_GlobalInvocationID.z));
}
```

I wrote this shader to optimize a blur operation similar to a Gaussian blur on a 128x128 image. Profiling it on an iPhone XR in the Xcode frame debugger, I found that the loads from threadgroup memory gCache in the loop are much slower than directly sampling the texture: the shader spends 30% of its total time loading gCache, but less than 5% of total time if I switch those loads back to direct texture samples. According to the performance counters, texture reads do go down, but strangely the main-memory bandwidth is nearly unchanged. I guess there is a data hazard because of using too much threadgroup memory?
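One thing worth double-checking in the snippet as posted (this may just be an artifact of trimming the code for the forum): gCache is declared as [BLOCK_SIZE_Y][BLOCK_SIZE_X + 2 * MAX_PIXR], but it is indexed as gCache[gl_LocalInvocationID.x][...]. With BLOCK_SIZE_Y = 1, any thread whose local x is greater than 0 would address past the declared first row. A quick arithmetic check (plain Python, with an illustrative flat_index helper):

```python
BLOCK_SIZE_Y = 1
BLOCK_SIZE_X = 128
MAX_PIXR = 20
ROW_LEN = BLOCK_SIZE_X + 2 * MAX_PIXR  # 168 Packhalf5 entries per row
TOTAL = BLOCK_SIZE_Y * ROW_LEN         # 168 entries in the whole array

def flat_index(x, col):
    # the flat element index computed for gCache[x][col]
    # given the declared row length
    return x * ROW_LEN + col

print(flat_index(0, 10))   # 10  — in bounds
print(flat_index(1, 10))   # 178 — already past the 168-entry array
assert flat_index(1, 10) >= TOTAL
```

If the dimensions were swapped relative to the indexing in the real shader, out-of-bounds threadgroup accesses could easily explain both the strange timings and the unchanged bandwidth counters.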
Posted
by
Post not yet marked as solved
0 Replies
342 Views
Is there a way to force precompilation of SKShaders? I am initializing a basic shader in my SKScene's properties, but the shader only gets compiled when it is actually attached to a sprite and rendered:

```swift
class GameScene: SKScene {
    let myShader = SKShader(fileNamed: "myEffect.fsh")

    override func update(_ currentTime: TimeInterval) {
        if touches.count > 0 {
            mySprite.shader = myShader // this is where the warning triggers for the first time
        }
    }
}
```

I know this because the scene pauses for a moment immediately before the shader is rendered for the first time, and I also get the following Metal warning (which also seems to be a known bug in recent releases - https://developer.apple.com/forums/thread/661774):

```
[Metal Compiler Warning] Warning: Compilation succeeded with:
program_source:3:19: warning: unused variable 's'
constexpr sampler s(coord::normalized,
```

I was expecting the shader to compile when it is initialized with SKShader(fileNamed:), since Apple's docs say: "Compiling a shader and the uniform data associated with it can be expensive. Because of this, you should: Initialize shader objects when your game launches, not while the game is running." https://developer.apple.com/documentation/spritekit/skshader
Posted
by
Post marked as solved
1 Reply
561 Views
I spend the majority of my time these days working on and optimizing shaders on iOS. The Xcode GPU profiler has been very helpful for the most part; however, one problem that still plagues me is that profiling is very inconsistent. When I first start up an app I generally get a pretty good boost in GPU performance, and then it slows down after 30 seconds or so. I imagine this is due to the increased workload at the beginning. This means I get lower GPU timings for shaders during the first ~30 seconds of the app, but if I let it sit longer before profiling, I get about 10% worse performance. I believe viewing the GPU clock speed would help me profile this more reliably, but I have not found any way to view this metric. Is there any way to view the current GPU clock speed on iOS devices, even if only through private methods for debugging purposes? Thanks!
Posted
by
Post not yet marked as solved
0 Replies
305 Views
I see a lot of timings like this in Xcode GPU Frame Capture. Specifically, I want to call attention to the fact that the sum of the parts of the command buffer time does not add up to the displayed total. I notice that for GPU frame time, Xcode adds up all the command buffer times; but if I add up only the shader times, I get a much lower number (1.5-2 ms lower). I am trying to understand where the GPU time is going if it isn't going to the shaders. Notice how the total time is listed as 0.211 ms, but the sum of the parts is only 0.047 ms. I just want to understand better what is going on. Thanks.
Posted
by
Post marked as solved
1 Reply
260 Views
Hi. I am implementing a neural network model with MPSGraph on a Radeon Mac. I want to accelerate it with float16, since Radeon GPUs can execute float16 kernels twice as fast as float32. Is that possible? That is, does MPSGraph support native float16 on Radeon GPUs? If so, how do I enable it: by setting all data types to float16? Thanks.
Posted
by
Post not yet marked as solved
2 Replies
411 Views
Hi. Excuse me for not providing reproduction code. My app on macOS Monterey calls MPSGraph#run repeatedly. After a few minutes, the Xcode console shows "Context leak detected, CoreAnalytics returned false" repeatedly and the system slows down. Do I need to release some resource for each call of the run method? Thanks.
Posted
by
Post marked as solved
1 Reply
282 Views
Hello, printing the dataType property of a tensor results in a non-disclosing "MPSDataType" string. Is this intended? If so, how do I get a string representation of the tensor's type?

```swift
let graph = MPSGraph()
print(graph.constant(23.0, dataType: .float32).dataType)
// prints "MPSDataType", not "f32" or something useful
```
Posted
by
Post not yet marked as solved
0 Replies
259 Views
Hello, I am trying to implement numpy.arange using MPSGraph primitives. Along the way I got to this code, which I would expect to be correct; however, the program crashes with a type-mismatch error. What am I doing wrong?

Code

```swift
let G = MPSGraph()
let length = G.constant(9, shape: [1], dataType: .int32)
let base = G.constant(1, shape: [1], dataType: .int32)
let template = G.broadcast(base, shapeTensor: length, name: nil)
let r = G.for(
    numberOfIterations: length,
    initialBodyArguments: [template],
    // just return the argument
    body: { i, args in [args[0]] },
    name: nil
)
```

Error

```
-:22:11: error: 'scf.for' op types mismatch between 0th iter operand and defined value
-:22:11: note: see current operation: %19 = "scf.for"(%16, %17, %18, %12) ( {
^bb0(%arg1: index, %arg2: tensor<*xi32>):  // no predecessors
  %21 = "tensor.from_elements"(%arg1) : (index) -> tensor<1xindex>
  %22 = "mps.cast"(%21) {resultElementType = i32} : (tensor<1xindex>) -> tensor<1xi32>
  "scf.yield"(%arg2) : (tensor<*xi32>) -> ()
}) : (index, index, index, tensor<9xi32>) -> tensor<*xi32>
/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph_Sim/MetalPerformanceShadersGraph-2.0.22/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:1052: failed assertion `Error: MLIR pass manager failed'
```

P.S. Is there a way I can return a differently-shaped tensor from each iteration? In other words:

```swift
body: { i, args in [G.concatTensors([args[0], i], dimension: 0, name: nil)] }
```
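For what it's worth, numpy.arange(n) is just the sequence 0..n-1, and a formulation that maps naturally onto graph primitives — avoiding the control-flow op entirely, if a cumulative-sum op is available in the graph framework being used — is a cumulative sum over a ones vector, shifted down by one. A plain-Python sketch of the identity (this is not MPSGraph code, just the math):

```python
from itertools import accumulate

def arange_via_cumsum(n):
    # cumsum of [1, 1, ..., 1] gives [1, 2, ..., n];
    # subtracting 1 yields [0, 1, ..., n-1], i.e. arange(n)
    ones = [1] * n
    return [s - 1 for s in accumulate(ones)]

print(arange_via_cumsum(9))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
assert arange_via_cumsum(9) == list(range(9))
```

This sidesteps the loop-carried shape problem in the P.S. entirely, since the result shape is known up front.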
Posted
by
Post not yet marked as solved
0 Replies
220 Views
In the function processLastArData(), a command buffer is committed and the output of the last MPS kernel is immediately read without calling waitUntilCompleted() on the buffer. What am I missing? https://developer.apple.com/documentation/arkit/environmental_analysis/displaying_a_point_cloud_using_scene_depth?language=objc
Posted
by
Post not yet marked as solved
7 Replies
1.4k Views
I am working on the implementation of a highly demanding signal-processing algorithm, and I am not able to reach an acceptable execution time with vDSP's routines. I am now having a look at Metal to learn how to use it. It seems like Metal Performance Shaders and MPS Graph could replace almost all of my vDSP calls, except the Fast Fourier Transform (which is the most time-consuming part of the algorithm). I was wondering if there is any chance of FFT methods being added to MPS, because they could be insanely fast if optimized for the unified architecture of the M1. Thanks!
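To make the cost argument concrete (a generic illustration, nothing Metal-specific): a radix-2 FFT does O(n log n) work versus the naive DFT's O(n²), which is why a GPU-optimized FFT would matter so much for a workload dominated by transforms. A minimal pure-Python comparison of the two on a small input:

```python
import cmath

def dft(x):
    # naive O(n^2) discrete Fourier transform
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def fft(x):
    # radix-2 Cooley-Tukey, O(n log n); length must be a power of two
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    twiddled = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + twiddled[k] for k in range(n // 2)] +
            [even[k] - twiddled[k] for k in range(n // 2)])

x = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
# both formulations agree to floating-point precision
assert all(abs(p - q) < 1e-9 for p, q in zip(dft(x), fft(x)))
```

For a transform of a million points, that is roughly 10¹² operations for the naive form versus about 2x10⁷ for the FFT, before any GPU parallelism is even considered.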
Posted
by
Post not yet marked as solved
0 Replies
263 Views
Hi, I'm trying to debug a Metal kernel that does analytics computation on the GPU. If I set a breakpoint in the Metal kernel source code, Xcode reports that it won't pause on the breakpoint because it has not been resolved, and the breakpoint is disabled. Is there any way to debug Metal kernel source code? Thanks in advance.
Posted
by
Post not yet marked as solved
0 Replies
212 Views
The MPS API allows running kernels in an MTLCommandBuffer, but is it possible to create a single MTLComputeCommandEncoder and run several kernels in it, without a separate encoder being created for each kernel under the hood? Something like:

```swift
// create command buffer
// create encoder
kernel1.encode(encoder: encoder, sourceTexture: source, destinationTexture: k1Destination)
kernel2.encode(encoder: encoder, sourceTexture: k1Destination, destinationTexture: destination)
encoder.endEncoding()
commandBuffer.commit()
```
Posted
by
Post not yet marked as solved
2 Replies
404 Views
I've created a custom box-blur kernel that produces identical results to Apple's built-in box blur (CIBoxBlur), but my custom kernel is orders of magnitude slower. So naturally I am wondering what I'm doing wrong to get such poor performance. Below is my custom kernel in the Metal Shading Language. Can you spot why it's so slow? The built-in filter performs well, so I can only assume it's something I'm doing wrong.

```metal
#include <CoreImage/CoreImage.h>
#import <simd/simd.h>

extern "C" {
namespace coreimage {

float4 customBoxBlurFilterKernel(sampler src) {
    float2 crd = src.coord();
    int edge = 100;
    int minx = crd.x - edge;
    int maxx = crd.x + edge;
    int miny = crd.y - edge;
    int maxy = crd.y + edge;
    float4 sums = float4(0, 0, 0, 0);
    float cnt = 0;
    // compute average of surrounding rgb values
    for (int row = miny; row < maxy; row++) {
        for (int col = minx; col < maxx; col++) {
            float4 samp = src.sample(float2(col, row));
            sums[0] += samp[0];
            sums[1] += samp[1];
            sums[2] += samp[2];
            cnt += 1.;
        }
    }
    return float4(sums[0] / cnt, sums[1] / cnt, sums[2] / cnt, 1);
}

}
}
```
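One likely reason for the gap (hedged, since only Apple knows CIBoxBlur's internals): a box blur is separable, so it can be computed as a horizontal pass followed by a vertical pass at O(r) samples per pixel per pass, instead of the O(r²) of the nested loop above — with edge = 100 that is 40,000 samples per pixel versus roughly 400. A pure-Python check that the two formulations agree (toy arrays, interior pixels only, ignoring edge handling):

```python
def box_blur_naive(img, r):
    # average over the full (2r+1) x (2r+1) window — O(r^2) per pixel
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(r, h - r):
        for x in range(r, w - r):
            s = sum(img[yy][xx]
                    for yy in range(y - r, y + r + 1)
                    for xx in range(x - r, x + r + 1))
            out[y][x] = s / ((2 * r + 1) ** 2)
    return out

def box_blur_separable(img, r):
    # horizontal pass then vertical pass — O(r) per pixel per pass
    h, w = len(img), len(img[0])
    tmp = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(r, w - r):
            tmp[y][x] = sum(img[y][x - r:x + r + 1]) / (2 * r + 1)
    out = [[0.0] * w for _ in range(h)]
    for y in range(r, h - r):
        for x in range(r, w - r):
            out[y][x] = sum(tmp[yy][x] for yy in range(y - r, y + r + 1)) / (2 * r + 1)
    return out

img = [[(3 * x + 7 * y) % 11 for x in range(9)] for y in range(9)]
a = box_blur_naive(img, 2)
b = box_blur_separable(img, 2)
# identical results on interior pixels, at a fraction of the work
assert all(abs(a[y][x] - b[y][x]) < 1e-9
           for y in range(2, 7) for x in range(2, 7))
```

A fast box blur can go even further with a running-sum (sliding-window) update, making the cost independent of the radius, which is plausibly what the built-in filter does.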
Posted
by
Post marked as solved
1 Reply
311 Views
There is a write function documented in the Core Image Metal shader reference here: https://developer.apple.com/metal/MetalCIKLReference6.pdf. But I'm not sure how to use it. I assumed one would be able to call it on the destination parameter, i.e. dest.write(...), but I get the error "no member named 'write' in 'coreimage::destination'". How do I use this function?
Posted
by