Post not yet marked as solved
The MPS API allows you to run kernels in an MTLCommandBuffer, but is it possible to create a single MTLComputeCommandEncoder and run several kernels in it, without a separate encoder being created for each kernel under the hood?
Something like:
// Create Command Buffer
// Create Encoder
kernel1.encode(encoder: encoder, sourceTexture: source, destinationTexture: k1Destination)
kernel2.encode(encoder: encoder, sourceTexture: k1Destination, destinationTexture: destination)
encoder.endEncoding()
commandBuffer.commit()
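For comparison, the documented command-buffer-level pattern I am using today looks roughly like this (a sketch; MPSImageGaussianBlur and MPSImageSobel stand in for my actual kernels, and each encode call creates and ends its own encoder internally):

```swift
import Metal
import MetalPerformanceShaders

// Sketch: chain two MPS kernels at the command-buffer level.
// Each encode(commandBuffer:...) call makes its own encoder internally.
guard let device = MTLCreateSystemDefaultDevice(),
      let queue = device.makeCommandQueue(),
      let commandBuffer = queue.makeCommandBuffer() else { fatalError("no Metal device") }

// Throwaway textures just to make the sketch self-contained.
let desc = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .rgba8Unorm, width: 256, height: 256, mipmapped: false)
desc.usage = [.shaderRead, .shaderWrite]
let source = device.makeTexture(descriptor: desc)!
let intermediate = device.makeTexture(descriptor: desc)!
let destination = device.makeTexture(descriptor: desc)!

let blur = MPSImageGaussianBlur(device: device, sigma: 2.0)   // stand-in for kernel1
let sobel = MPSImageSobel(device: device)                     // stand-in for kernel2

blur.encode(commandBuffer: commandBuffer, sourceTexture: source, destinationTexture: intermediate)
sobel.encode(commandBuffer: commandBuffer, sourceTexture: intermediate, destinationTexture: destination)
commandBuffer.commit()
```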
Hi,
I'm trying to debug a Metal kernel used for analytics computation on the GPU.
If I set a breakpoint in the Metal kernel source code, I get a message that Xcode won't pause at this breakpoint because it has not been resolved, and the breakpoint is disabled.
Is it possible to debug Metal kernel source code in any way?
Thanks in advance
Hello,
I am trying to implement numpy.arange using MPSGraph primitives. Along the way I arrived at this code, which I would expect to be correct; however, the program crashes with a type mismatch.
What am I doing wrong?
Code
let G = MPSGraph()
let length = G.constant(9, shape: [1], dataType: .int32)
let base = G.constant(1, shape: [1], dataType: .int32)
let template = G.broadcast(base, shapeTensor: length, name: nil)
let r = G.for(
    numberOfIterations: length,
    initialBodyArguments: [template],
    // just return the argument
    body: { i, args in [args[0]] },
    name: nil
)
Error
-:22:11: error: 'scf.for' op types mismatch between 0th iter operand and defined value
-:22:11: note: see current operation: %19 = "scf.for"(%16, %17, %18, %12) ( {
^bb0(%arg1: index, %arg2: tensor<*xi32>): // no predecessors
%21 = "tensor.from_elements"(%arg1) : (index) -> tensor<1xindex>
%22 = "mps.cast"(%21) {resultElementType = i32} : (tensor<1xindex>) -> tensor<1xi32>
"scf.yield"(%arg2) : (tensor<*xi32>) -> ()
}) : (index, index, index, tensor<9xi32>) -> tensor<*xi32>
/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph_Sim/MetalPerformanceShadersGraph-2.0.22/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:1052: failed assertion `Error: MLIR pass manager failed'
P.S. Is there a way I can return a differently-shaped tensor from each iteration? In other words:
body: { i, args in
    [G.concatTensors([args[0], i], dimension: 0, name: nil)]
}
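(As an aside: if the end goal is just numpy.arange, I may not need the loop at all. If my SDK has MPSGraph's coordinate op, a sketch like this should produce 0..<n directly; untested, API name from memory:)

```swift
import MetalPerformanceShadersGraph

// Sketch (untested): coordinate(alongAxis:withShape:name:) fills a tensor
// with the index along the given axis, i.e. [0, 1, ..., 8] here.
let g = MPSGraph()
let arange = g.coordinate(alongAxis: 0, withShape: [9], name: nil)
```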
Hi.
Apologies for not having reproduction code.
My app on macOS Monterey calls MPSGraph#run repeatedly.
After a few minutes, the Xcode console shows "Context leak detected, CoreAnalytics returned false" repeatedly, and the system slows down.
Do I need to release some resource after each call to the run method?
Thanks
In the function processLastArData(), a command buffer is committed and the output of the last MPS kernel is immediately assigned, without calling waitUntilCompleted() on the buffer. What am I missing?
https://developer.apple.com/documentation/arkit/environmental_analysis/displaying_a_point_cloud_using_scene_depth?language=objc
I am using the Frame Capture debugging tools. They work well for vertex shader functions, but I can't debug the compute function: the debug button is grayed out and shows "Unsupported: Post vertex transform data".
Hello,
printing the dataType property of a tensor results in a non-descriptive "MPSDataType" string. Is this intended? If so, how do I get a string representation of the tensor's type, please?
let graph = MPSGraph()
print(graph.constant(23.0, dataType: .float32).dataType)
// prints "MPSDataType", and not "f32" or something useful
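As a workaround I'm considering a hypothetical helper like this (a sketch, not part of the API; it only covers the cases I actually use):

```swift
import MetalPerformanceShadersGraph

// Hypothetical helper (not part of the API): map the MPSDataType cases
// I care about to readable names, since the default description is opaque.
func name(of type: MPSDataType) -> String {
    switch type {
    case .float32: return "f32"
    case .float16: return "f16"
    case .int32:   return "i32"
    case .int64:   return "i64"
    default:       return "MPSDataType(rawValue: \(type.rawValue))"
    }
}

let graph = MPSGraph()
print(name(of: graph.constant(23.0, dataType: .float32).dataType)) // "f32"
```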
Hi.
I am implementing a neural network model with MPSGraph on a Radeon Mac.
I want to accelerate it with float16, since Radeon GPUs can execute float16 kernels twice as fast as float32.
Is it possible?
I mean, does MPSGraph support native float16 on Radeon GPU?
If so, how can I do it?
Setting all data types to float16?
Thanks.
I see a lot of timings like this in Xcode GPU Frame Capture.
Specifically, what I want to call attention to is that the parts of the command buffer time do not add up to the total that is displayed. I notice that for the GPU frame time, Xcode adds up all the command buffer times. But if I add up only the shader times, I get a much lower number (1.5-2 ms lower). I am trying to understand what is going on; where is the GPU time going if it isn't going to the shaders?
Notice how the total time is listed as 0.211 ms, but the sum of the parts is only 0.047 ms.
I just want to better understand what is going on.
Thanks.
I spend the majority of my time these days working on and optimizing shaders on iOS.
The Xcode GPU profiler has been very helpful for the most part; however, one problem that still plagues me is that profiling is very inconsistent. When I first start up an app I generally get a pretty good boost in GPU performance, which then drops off after 30 seconds or so. I imagine this is due to the increased workload at the beginning.
This means that I get lower GPU timings for shaders during the first ~30 seconds of the app, but if I let it sit longer before I profile, I get about 10% worse performance.
I believe viewing the GPU clock speed would help me better profile this. However, I have not found any way to view this metric.
Is there any way to view current GPU clock speed on iOS devices? Even if just through private methods for debugging purposes.
Thanks!
Is there a way to force precompilation of SKShader()s?
I am initializing a basic shader in my SKScene properties, but the shader only gets compiled when it is actually attached to a sprite and rendered.
class GameScene: SKScene {
    let myShader = SKShader(fileNamed: "myEffect.fsh")

    override func update(_ currentTime: TimeInterval) {
        if touches.count > 0 {
            mySprite.shader = myShader
            // this is where the warning triggers for the first time
        }
    }
}
I know this because the scene pauses briefly, immediately before the shader is rendered for the first time, and I also get the Metal warning (which also seems to be a known bug in recent releases - https://developer.apple.com/forums/thread/661774):
[Metal Compiler Warning] Warning: Compilation succeeded with:
program_source:3:19: warning: unused variable 's'
constexpr sampler s(coord::normalized,
I was expecting the shader to compile when it is initialized with SKShader(fileNamed:), since Apple's docs say:
Compiling a shader and the uniform data associated with it can be expensive. Because of this, you should:
Initialize shader objects when your game launches, not while the game is running.
https://developer.apple.com/documentation/spritekit/skshader
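The only workaround I can think of is warming the shader up myself by rendering it once on a throwaway off-screen sprite when the scene loads (a sketch, untested; the near-zero alpha is a guess to keep the node rendered but invisible):

```swift
import SpriteKit

class GameScene: SKScene {
    let myShader = SKShader(fileNamed: "myEffect.fsh")

    override func didMove(to view: SKView) {
        // Warm-up hack: attach the shader to a tiny sprite for one frame
        // so compilation happens during scene setup, not during gameplay.
        let warmup = SKSpriteNode(color: .clear, size: CGSize(width: 1, height: 1))
        warmup.shader = myShader
        warmup.alpha = 0.01   // assumption: a fully transparent node might be culled
        addChild(warmup)
        DispatchQueue.main.asyncAfter(deadline: .now() + 0.1) {
            warmup.removeFromParent()   // drop it once the shader is compiled
        }
    }
}
```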
Hi, I wrote the following compute shader to blur images with some complex kernel:
// use Packhalf5 to align to 16 B
struct Packhalf5
{
    half4 a;
    half4 b;
};

// (t_f45 / s_f45 and the index variables _2, _3, _31, ... are elided below)
kernel void cs_main(
    texture2d_array<float> t_f123 [[texture(0)]]
    , texture2d_array<float, access::write> t_normal [[texture(3)]]
    , sampler s_f123 [[sampler(0)]]
    , uint3 gl_GlobalInvocationID [[thread_position_in_grid]]
    , uint3 gl_LocalInvocationID [[thread_position_in_threadgroup]]
)
{
    // use Packhalf5 to align to 16 B
    // BLOCK_SIZE_Y = 1
    // BLOCK_SIZE_X = 128 or 64 or 32
    // MAX_PIXR = 20
    threadgroup Packhalf5 gCache[BLOCK_SIZE_Y][BLOCK_SIZE_X + 2 * MAX_PIXR];
    // cache texture samples in threadgroup memory gCache so that we can read
    // them quickly and avoid most texture samples in the loop
    if (gl_LocalInvocationID.y < _35)
    {
        // ...
        Packhalf5 pkh5;
        pkh5.a.xyz = half3(half4(t_f123.sample(s_f123, _2.xy, uint(round(_2.z)), level(0.0))).xyz);
        pkh5.b.xy = half2(half4(t_f45.sample(s_f45, _2.xy, uint(round(_2.z)), level(0.0))).xy);
        gCache[gl_LocalInvocationID.x][gl_LocalInvocationID.y] = pkh5;
    }
    if (gl_LocalInvocationID.y >= uint(BLOCK_SIZE_X - _34))
    {
        // ...
        Packhalf5 pkh5;
        pkh5.a.xyz = half3(half4(t_f123.sample(s_f123, _3.xy, uint(round(_3.z)), level(0.0))).xyz);
        pkh5.b.xy = half2(half4(t_f45.sample(s_f45, _3.xy, uint(round(_3.z)), level(0.0))).xy);
        gCache[gl_LocalInvocationID.x][_36] = pkh5;
    }
    Packhalf5 pkh5;
    pkh5.a.xyz = half3(half4(t_f123.sample(s_f123, _31.xy, uint(round(_31.z)), level(0.0))).xyz);
    pkh5.b.xy = half2(half4(t_f45.sample(s_f45, _31.xy, uint(round(_31.z)), level(0.0))).xy);
    gCache[gl_LocalInvocationID.x][_37] = pkh5;
    threadgroup_barrier(mem_flags::mem_threadgroup);
    // use gCache to blur the image
    for (int i = 0; i < kernel_size; ++i)
    {
        // calculate index
        int a = f1(i);
        int b = f2(i);
        Packhalf5 pkh5;
        // this load is much slower than sampling the texture directly, i.e.
        // t_f123.sample(s_f123, _3.xy, uint(round(_3.z)), level(0.0))
        pkh5 = gCache[gl_LocalInvocationID.x][a];
        float2 _42 = float2(pkh5.b.xy);
        float3 _43 = (float3(pkh5.a.xyz) * float3(1.0, 0.5, 0.5)) + float3(0.0, -0.25, -0.25);
        // this load is much slower than sampling the texture directly
        pkh5 = gCache[gl_LocalInvocationID.x][b];
        float2 _45 = float2(pkh5.b.xy);
        float3 _46 = (float3(pkh5.a.xyz) * float3(1.0, 0.5, 0.5)) + float3(0.0, -0.25, -0.25);
        // use _42, _43, _45, _46
    }
    // write the blur result
    t_normal.write(_16, uint2(gl_GlobalInvocationID.xy), uint(gl_GlobalInvocationID.z));
}
I wrote this shader to optimize a blur operation, similar to a Gaussian blur, on a 128x128 image. Testing and profiling it on an iPhone XR in the Xcode frame debugger, I found that:
loading from the threadgroup memory gCache in the loop is much slower than sampling the texture directly (i.e. t_f123.sample(s_f123, _3.xy, uint(round(_3.z)), level(0.0)))
(the shader spends 30% of its total time loading gCache, but less than 5% of its total time if I change it to sample the texture directly).
According to the performance counters, texture reads do go down, but strangely, main memory bandwidth is nearly unchanged.
My guess is that there is a data hazard caused by using too much threadgroup memory?
I am using a converted custom PyTorch Model on device for use with real time video.
The model was converted successfully using both CoreMLTools v4.1 and v5.0b3 (both versions exhibit the same issues). When running the model from a Python environment using CoreMLTools, as well as from a macOS app, with the same input image and supplementary data, the output is identical, correct, and matches the output of the pure PyTorch model.
However, when running it on device, the model's output is incorrect. On an iPhone XR, using the .all or .cpuAndGPU value of computeUnits, the output is simply a white square with no error or warning message. That is, our output, which we normally expect to be in the range [0, 255], has values of 255 in every location. However, running with .cpuOnly on the iPhone XR produces the correct output.
Furthermore, when simulating a device from a macOS machine, the output is correct regardless of the computeUnits value.
On an iPhone 12 this situation gets even more confusing. With the setting .cpuAndGPU, we get the pure white incorrect output, using .cpuOnly we get the correct output, but with .all we get a different incorrect output, an image of wildly incorrect colors but a vaguely similar form to the image we expect. In addition with the .all setting we get the following error.
2021-09-01 15:07:16.595048-0500 sensoriumViewer[33717:10399075] [espresso] [Espresso::ANERuntimeEngine::__forward_segment 3] evaluate[RealTime]WithModel returned 0; code=5 err=Error Domain=com.apple.appleneuralengine Code=5 "processRequest:qos:qIndex:modelStringID:options:error:: 0xd: Program Inference overflow" UserInfo={NSLocalizedDescription=processRequest:qos:qIndex:modelStringID:options:error:: 0xd: Program Inference overflow}
2021-09-01 15:07:16.595103-0500 sensoriumViewer[33717:10399075] [espresso] [Espresso::overflow_error] /private/var/containers/Bundle/Application/16433631-57DE-488C-8772-D9560C3D8B48/sensoriumViewer.app/SensoriumMLTest16V1.mlmodelc/model.espresso.net:3
Which makes it pretty clear that there is some sort of integer or floating-point overflow error. What I believe is happening is this:
Regardless of model, using the GPU causes the overflow to truncate, giving us values of 255 for all pixels. On the iPhone 12, .all passes the model to the ANE (Apple Neural Engine), which wraps on overflow, giving unpredictable colors but a roughly correct shape. Using .all on the iPhone XR just uses the GPU, because for some reason this model won't go to the XR's ANE. And lastly, .cpuOnly does not overflow and gives us the correct result.
Why does the XR not use its ANE for this model? Can the ANE and GPU just not handle 32-bit floats? We are quantizing the model to 16 bit using CoreMLTools; why are we still overflowing?
I see the documentation for the new MLProgram format and it seems promising, will that solve this issue?
Is there any documentation surrounding the supported operations and number precision for Pytorch converted models?
Why are there no errors or warnings when passing this through the GPU?
Any help or insight would be greatly appreciated as the documentation I've seen surrounding the ANE is not very comprehensive.
I need to reshape a tensor A to a shape S that is not available at compile time. I would like to compute the shape S inside the graph and then reshape the tensor A to the shape S. Is this even possible with MPSGraph?
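If I've read the headers right, this should be possible with the shapeTensor variant of reshape, feeding it a shape computed inside the graph (e.g. via shapeOf); a sketch, untested:

```swift
import MetalPerformanceShadersGraph

// Sketch (untested): reshape A to a shape S that is only known inside
// the graph, using reshape(_:shapeTensor:name:).
let graph = MPSGraph()
let a = graph.placeholder(shape: [2, 6], dataType: .float32, name: "A")
let b = graph.placeholder(shape: [3, 4], dataType: .float32, name: "B")

// S = shape of B, produced as a tensor at run time inside the graph
let s = graph.shapeOf(b, name: nil)

// reshape A (12 elements) to B's shape [3, 4]
let reshaped = graph.reshape(a, shapeTensor: s, name: nil)
```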