Metal Performance Shaders


Use Metal Performance Shaders to optimize graphics and compute performance with kernels fine-tuned for the unique characteristics of each Metal GPU family.


Posts under Metal Performance Shaders tag

54 Posts
Post not yet marked as solved
0 Replies
212 Views
The MPS API lets you run kernels in an MTLCommandBuffer, but is it possible to create a single MTLComputeCommandEncoder and run several kernels in it, rather than having a separate encoder created for each kernel under the hood? Something like:

```swift
// Create command buffer
// Create encoder
kernel1.encode(encoder: encoder, sourceTexture: source, destinationTexture: k1Destination)
kernel2.encode(encoder: encoder, sourceTexture: k1Destination, destinationTexture: destination)
encoder.endEncoding()
commandBuffer.commit()
```
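As far as the public API goes, MPS image kernels expose only command-buffer-based encode methods and manage their compute encoders internally. A minimal sketch of the supported pattern, chaining two kernels on one command buffer (the function name and textures here are placeholders, not from the post):

```swift
import Metal
import MetalPerformanceShaders

// Sketch: chain two MPS kernels on a single MTLCommandBuffer. Each
// encode(commandBuffer:...) call creates and ends its own compute
// encoder internally before the next kernel is encoded.
func blurTwice(device: MTLDevice,
               commandQueue: MTLCommandQueue,
               source: MTLTexture,
               intermediate: MTLTexture,
               destination: MTLTexture) {
    guard let commandBuffer = commandQueue.makeCommandBuffer() else { return }
    let blurA = MPSImageGaussianBlur(device: device, sigma: 2.0)
    let blurB = MPSImageGaussianBlur(device: device, sigma: 4.0)
    blurA.encode(commandBuffer: commandBuffer,
                 sourceTexture: source,
                 destinationTexture: intermediate)
    blurB.encode(commandBuffer: commandBuffer,
                 sourceTexture: intermediate,
                 destinationTexture: destination)
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
}
```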
Posted
Last updated
.
Post not yet marked as solved
0 Replies
263 Views
Hi, I'm trying to debug a Metal kernel used for analytics computation on the GPU. If I set a breakpoint in the Metal kernel source code, Xcode reports that it won't pause on this breakpoint because it has not been resolved, and the breakpoint is disabled. Is it possible to debug Metal kernel source code in any way? Thanks in advance.
Posted
by Richard_M.
Last updated
.
Post not yet marked as solved
0 Replies
260 Views
Hello, I am trying to implement numpy.arange using MPSGraph primitives. Along the way I arrived at this code, which I would expect to be correct; however, the program crashes with an error about a type mismatch. What am I doing wrong?

Code

```swift
let G = MPSGraph()
let length = G.constant(9, shape: [1], dataType: .int32)
let base = G.constant(1, shape: [1], dataType: .int32)
let template = G.broadcast(base, shapeTensor: length, name: nil)
let r = G.for(
    numberOfIterations: length,
    initialBodyArguments: [template],
    // just return the argument
    body: { i, args in [args[0]] },
    name: nil
)
```

Error

```
-:22:11: error: 'scf.for' op types mismatch between 0th iter operand and defined value
-:22:11: note: see current operation: %19 = "scf.for"(%16, %17, %18, %12) ( {
^bb0(%arg1: index, %arg2: tensor<*xi32>):  // no predecessors
  %21 = "tensor.from_elements"(%arg1) : (index) -> tensor<1xindex>
  %22 = "mps.cast"(%21) {resultElementType = i32} : (tensor<1xindex>) -> tensor<1xi32>
  "scf.yield"(%arg2) : (tensor<*xi32>) -> ()
}) : (index, index, index, tensor<9xi32>) -> tensor<*xi32>
/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph_Sim/MetalPerformanceShadersGraph-2.0.22/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:1052: failed assertion `Error: MLIR pass manager failed'
```

P.S. Is there a way I can return a differently-shaped tensor from each iteration? In other words:

```swift
body: { i, args in [G.concatTensors([args[0], i], dimension: 0, name: nil)] }
```
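If a graph-side loop isn't strictly required, one hedged workaround (the function name is my own) is to compute the arange values on the host and hand them to the graph through the Data-based constant initializer:

```swift
import Foundation
import MetalPerformanceShadersGraph

// Workaround sketch: compute [0, 1, ..., length-1] on the CPU and wrap
// it in a graph constant. Unlike a true graph-driven arange, `length`
// must be known host-side when the graph is built.
func arangeConstant(_ graph: MPSGraph, length: Int) -> MPSGraphTensor {
    var values = (0..<length).map { Int32($0) }
    let data = Data(bytes: &values,
                    count: values.count * MemoryLayout<Int32>.stride)
    return graph.constant(data,
                          shape: [NSNumber(value: length)],
                          dataType: .int32)
}
```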
Posted
by mlajtos.
Last updated
.
Post not yet marked as solved
2 Replies
412 Views
Hi. Excuse me for no reprodcution code. My app on macOS Monterey runs MPSGraph#run repeatedly. For a minutes, Xcode console shows "Context leak detected, CoreAnalytics returned false" repeatedly and the system slows down. Do I need to release some resource for each calling of run method? Thanks
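One thing worth trying (a sketch, not a confirmed fix for this particular log message): MPSGraph.run returns autoreleased Objective-C objects, so in a long-running loop it can help to drain them once per iteration with autoreleasepool:

```swift
import Foundation
import MetalPerformanceShadersGraph

// Sketch: wrap each MPSGraph.run call in its own autorelease pool so
// the returned MPSGraphTensorData objects (and their backing buffers)
// become collectable per iteration instead of piling up.
func runRepeatedly(graph: MPSGraph,
                   feeds: [MPSGraphTensor: MPSGraphTensorData],
                   targets: [MPSGraphTensor],
                   iterations: Int) {
    for _ in 0..<iterations {
        autoreleasepool {
            let results = graph.run(feeds: feeds,
                                    targetTensors: targets,
                                    targetOperations: nil)
            _ = results // consume the results inside the pool
        }
    }
}
```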
Posted
by y_ich.
Last updated
.
Post not yet marked as solved
0 Replies
220 Views
In the function processLastArData(), a command buffer is committed and the output of the last MPS kernel is immediately assigned, without waitUntilCompleted() ever being called on the buffer. What am I missing? https://developer.apple.com/documentation/arkit/environmental_analysis/displaying_a_point_cloud_using_scene_depth?language=objc
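For reference, the usual pattern when the CPU reads results produced by committed GPU work is to synchronize first; a minimal sketch (the function name and buffer are placeholders, not from the sample code in question):

```swift
import Metal

// Pattern sketch: block until the GPU has finished before the CPU reads
// the result buffer. commandBuffer.addCompletedHandler is the
// non-blocking alternative when stalling the CPU is undesirable.
func readBack(commandBuffer: MTLCommandBuffer,
              output: MTLBuffer,
              count: Int) -> [Float] {
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()   // GPU work is guaranteed done here
    let ptr = output.contents().bindMemory(to: Float.self, capacity: count)
    return Array(UnsafeBufferPointer(start: ptr, count: count))
}
```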
Posted
by jselikof.
Last updated
.
Post marked as solved
1 Reply
284 Views
Hello, printing the dataType property of some tensor results in the non-disclosing string "MPSDataType". Is this intended? If so, how do I get a string representation of the tensor's type?

```swift
let graph = MPSGraph()
print(graph.constant(23.0, dataType: .float32).dataType) // prints "MPSDataType", and not "f32" or something useful
```
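MPSDataType is an Objective-C enum that does not conform to CustomStringConvertible, so printing it falls back to the type name. A small helper can recover a readable name (only a few common cases are covered here, and the fallback string is my own convention, not API output):

```swift
import MetalPerformanceShaders

// Helper sketch: map common MPSDataType cases to readable names.
func name(of type: MPSDataType) -> String {
    switch type {
    case .float32: return "float32"
    case .float16: return "float16"
    case .int32:   return "int32"
    case .int8:    return "int8"
    case .uInt8:   return "uInt8"
    default:       return "MPSDataType(rawValue: \(type.rawValue))"
    }
}
```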
Posted
by mlajtos.
Last updated
.
Post marked as solved
1 Reply
261 Views
Hi. I am implementing a neural network model with MPSGraph on a Radeon Mac. I want to accelerate it with float16, since Radeon GPUs can execute float16 kernels twice as fast as float32. Is that possible? That is, does MPSGraph support native float16 on Radeon GPUs? If so, how do I enable it? By setting all data types to float16? Thanks.
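A sketch of how a half-precision graph is typically built: MPSGraph infers each op's result type from its operands, so creating the placeholders and constants as .float16 (or inserting explicit graph.cast(_:to:name:) ops at the boundaries) propagates half precision through the graph. Whether the driver then issues double-rate FP16 math on a given Radeon GPU is up to the Metal backend. The shapes and names below are arbitrary:

```swift
import MetalPerformanceShadersGraph

// Sketch: build a small graph entirely in .float16; the matmul result
// inherits half precision from its operands via type inference.
func halfPrecisionMatMul(_ graph: MPSGraph)
    -> (x: MPSGraphTensor, w: MPSGraphTensor, y: MPSGraphTensor) {
    let x = graph.placeholder(shape: [1, 64], dataType: .float16, name: "x")
    let w = graph.placeholder(shape: [64, 64], dataType: .float16, name: "w")
    let y = graph.matrixMultiplication(primary: x, secondary: w, name: "y")
    return (x, w, y)
}
```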
Posted
by y_ich.
Last updated
.
Post not yet marked as solved
0 Replies
306 Views
I see a lot of timings like this in Xcode GPU Frame Capture. Specifically, what I want to call attention to is that the sum of the parts of the command-buffer time does not add up to the total that is displayed. I notice that for GPU frame time, Xcode adds up all the command-buffer times. But if I only add up the shader times, I get a much lower figure (1.5-2 ms lower). I am trying to understand what is going on; where is the GPU time going if it isn't going to the shaders? Notice how the total time is listed as 0.211 ms, but the sum of the parts is only 0.047 ms. I just want to better understand what is happening. Thanks.
Posted
by jwilde.
Last updated
.
Post marked as solved
1 Reply
561 Views
I spend the majority of my time these days working on and optimizing shaders on iOS. The Xcode GPU profiler has been very helpful for the most part; however, one problem that still plagues me is that profiling is very inconsistent. When I first start up an app I generally get a pretty good boost in GPU performance, then it slows down after 30 seconds or so. I imagine this is due to the increased workload at the beginning. This means that I get lower GPU timings for shaders during the first ~30 seconds of the app, but if I let it sit for longer before I profile, I get about 10% worse performance. I believe viewing the GPU clock speed would help me profile this more accurately. However, I have not found any way to view this metric. Is there any way to view the current GPU clock speed on iOS devices, even if just through private methods for debugging purposes? Thanks!
Posted
by jwilde.
Last updated
.
Post not yet marked as solved
0 Replies
342 Views
Is there a way to force precompilation of SKShaders? I am initializing a basic shader in my SKScene properties, but the shader only gets compiled when it is actually attached and rendered to a sprite.

```swift
class GameScene: SKScene {
    let myShader = SKShader(fileNamed: "myEffect.fsh")

    override func update(_ currentTime: TimeInterval) {
        if touches.count > 0 {
            mySprite.shader = myShader // this is where the warning triggers for the first time
        }
    }
}
```

I know this because the scene pauses for a bit immediately before the shader is rendered for the first time, and I also get the Metal warning (which also seems to be a known bug in recent releases - https://developer.apple.com/forums/thread/661774):

```
[Metal Compiler Warning] Warning: Compilation succeeded with:
program_source:3:19: warning: unused variable 's'
constexpr sampler s(coord::normalized,
```

I was expecting the shader to compile when it is initialized with SKShader(fileNamed:), since Apple's docs say: "Compiling a shader and the uniform data associated with it can be expensive. Because of this, you should: Initialize shader objects when your game launches, not while the game is running." https://developer.apple.com/documentation/spritekit/skshader
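A hedged workaround for the lazy compile (the helper and its timing are my own, not from Apple's docs): attach the shader to a tiny, nearly invisible sprite when the scene loads, so the expensive first compile happens up front rather than mid-game:

```swift
import SpriteKit

// Workaround sketch: SKShader compiles lazily on first render, so force
// that first render at scene setup with a 1x1 near-invisible sprite.
// The 0.1 s delay before removal is an arbitrary value, just long
// enough for at least one rendered frame.
func warmUp(shader: SKShader, in scene: SKScene) {
    let dummy = SKSpriteNode(color: SKColor.clear,
                             size: CGSize(width: 1, height: 1))
    dummy.alpha = 0.01   // must actually be drawn, so not fully transparent
    dummy.shader = shader
    scene.addChild(dummy)
    dummy.run(SKAction.sequence([SKAction.wait(forDuration: 0.1),
                                 SKAction.removeFromParent()]))
}
```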
Posted
by calin.
Last updated
.
Post marked as solved
1 Reply
486 Views
Hi, I wrote the following compute shader to blur images with a somewhat complex kernel:

```metal
// use Packhalf5 to align to 16 B
struct Packhalf5 {
    half4 a;
    half4 b;
};

kernel void cs_main(
    texture2d_array<float> t_f123 [[texture(0)]],
    texture2d_array<float, access::write> t_normal [[texture(3)]],
    sampler s_f123 [[sampler(0)]],
    uint3 gl_GlobalInvocationID [[thread_position_in_grid]],
    uint3 gl_LocalInvocationID [[thread_position_in_threadgroup]])
{
    // BLOCK_SIZE_Y = 1
    // BLOCK_SIZE_X = 128 or 64 or 32
    // MAX_PIXR = 20
    // cache texture samples in threadgroup memory gCache so we can read
    // them quickly and avoid most texture samples in the loop
    threadgroup Packhalf5 gCache[BLOCK_SIZE_Y][BLOCK_SIZE_X + 2 * MAX_PIXR];

    if (gl_LocalInvocationID.y < _35) {
        // ...
        Packhalf5 pkh5;
        pkh5.a.xyz = half3(half4(t_f123.sample(s_f123, _2.xy, uint(round(_2.z)), level(0.0))).xyz);
        pkh5.b.xy = half2(half4(t_f45.sample(s_f45, _2.xy, uint(round(_2.z)), level(0.0))).xy);
        gCache[gl_LocalInvocationID.x][gl_LocalInvocationID.y] = pkh5;
    }
    if (gl_LocalInvocationID.y >= uint(BLOCK_SIZE_X - _34)) {
        // ...
        Packhalf5 pkh5;
        pkh5.a.xyz = half3(half4(t_f123.sample(s_f123, _3.xy, uint(round(_3.z)), level(0.0))).xyz);
        pkh5.b.xy = half2(half4(t_f45.sample(s_f45, _3.xy, uint(round(_3.z)), level(0.0))).xy);
        gCache[gl_LocalInvocationID.x][_36] = pkh5;
    }
    Packhalf5 pkh5;
    pkh5.a.xyz = half3(half4(t_f123.sample(s_f123, _31.xy, uint(round(_31.z)), level(0.0))).xyz);
    pkh5.b.xy = half2(half4(t_f45.sample(s_f45, _31.xy, uint(round(_31.z)), level(0.0))).xy);
    gCache[gl_LocalInvocationID.x][_37] = pkh5;

    threadgroup_barrier(mem_flags::mem_threadgroup);

    // use gCache to blur the image
    for (int i = 0; i < kernel_size; ++i) {
        // calculate indices
        int a = f1(i);
        int b = f2(i);
        Packhalf5 pkh5;
        // this load is extremely slow compared to sampling the texture
        // directly, i.e. t_f123.sample(s_f123, _3.xy, uint(round(_3.z)), level(0.0))
        pkh5 = gCache[gl_LocalInvocationID.x][a];
        float2 _42 = float2(pkh5.b.xy);
        float3 _43 = (float3(pkh5.a.xyz) * float3(1.0, 0.5, 0.5)) + float3(0.0, -0.25, -0.25);
        // this load is also extremely slow compared to sampling the texture directly
        pkh5 = gCache[gl_LocalInvocationID.x][b];
        float2 _45 = float2(pkh5.b.xy);
        float3 _46 = (float3(pkh5.a.xyz) * float3(1.0, 0.5, 0.5)) + float3(0.0, -0.25, -0.25);
        // use _42, _43, _45, _46
    }

    // write blur result
    t_normal.write(_16, uint2(gl_GlobalInvocationID.xy), uint(gl_GlobalInvocationID.z));
}
```

I wrote this shader to optimize some blur operations similar to a Gaussian blur on a 128x128 picture. I tested and profiled it on an iPhone XR in the Xcode frame debugger and found that the loads from the threadgroup memory gCache in the loop are much slower than sampling the texture directly (i.e. t_f123.sample(s_f123, _3.xy, uint(round(_3.z)), level(0.0))). The shader spends 30% of its total time loading from gCache, but only < 5% of the total time if I change it to sample the texture directly. As for the performance counters, texture reads actually go down, but strangely the main-memory bandwidth is nearly unchanged.
Is there perhaps a data hazard caused by using too much threadgroup memory?
Posted
by wubugui.
Last updated
.
Post not yet marked as solved
2 Replies
676 Views
I am using a converted custom PyTorch model on device for use with real-time video. The model was converted successfully using both CoreMLTools v4.1 and v5.0b3 (both versions exhibit the same issues). When running the model both from a Python environment using CoreMLTools and from a macOS app, using the same input image and supplementary data, the output is identical, correct, and matches the output of the pure PyTorch model. However, when running it on device, the model's output is incorrect. On an iPhone XR, using the .all or .cpuAndGPU value of computeUnits, the output is simply a white square with no error or warning message. That is, our output, which we normally expect to be in the range [0, 255], has the value 255 in every location. However, running with .cpuOnly on the iPhone XR produces the correct output. Furthermore, when simulating a device from a macOS machine, the output is correct regardless of the computeUnits value. On an iPhone 12 the situation gets even more confusing. With the setting .cpuAndGPU we get the pure white, incorrect output; with .cpuOnly we get the correct output; but with .all we get a different incorrect output: an image with wildly incorrect colors but a vaguely similar form to the image we expect. In addition, with the .all setting we get the following error:

```
2021-09-01 15:07:16.595048-0500 sensoriumViewer[33717:10399075] [espresso] [Espresso::ANERuntimeEngine::__forward_segment 3] evaluate[RealTime]WithModel returned 0; code=5 err=Error Domain=com.apple.appleneuralengine Code=5 "processRequest:qos:qIndex:modelStringID:options:error:: 0xd: Program Inference overflow" UserInfo={NSLocalizedDescription=processRequest:qos:qIndex:modelStringID:options:error:: 0xd: Program Inference overflow}
2021-09-01 15:07:16.595103-0500 sensoriumViewer[33717:10399075] [espresso] [Espresso::overflow_error] /private/var/containers/Bundle/Application/16433631-57DE-488C-8772-D9560C3D8B48/sensoriumViewer.app/SensoriumMLTest16V1.mlmodelc/model.espresso.net:3
```

This makes it pretty clear that there is some sort of integer or floating-point overflow. What I believe is happening is this: regardless of model, using the GPU causes the overflow to truncate, giving us values of 255 for all pixels; on the iPhone 12, using .all passes the model to the ANE (Apple Neural Engine), which wraps the overflow, giving unpredictable colors but a roughly correct shape; using .all on the iPhone XR just uses the GPU, because for some reason this model won't go to the XR's ANE; and lastly, using .cpuOnly does not overflow and gives us the correct result. Why does the XR not use its ANE for this model? Can the ANE and GPU just not handle 32-bit floats? We are quantizing the model to 16 bit using CoreMLTools; why are we still overflowing? I see the documentation for the new MLProgram format and it seems promising; will that solve this issue? Is there any documentation of the supported operations and numeric precision for PyTorch-converted models? Why are there no errors or warnings when passing this through the GPU? Any help or insight would be greatly appreciated, as the documentation I've seen surrounding the ANE is not very comprehensive.
Posted
by pmj714.
Last updated
.