When I run my matrix multiplication I get the following warning on iOS, but not on macOS:
'init(dimensions:columns:rowBytes:dataType:)' was deprecated in iOS 11.0
How can I change my code to remove the iOS warning? Here is the line generating it:
let mdesc = MPSMatrixDescriptor(dimensions: 2, columns: 2, rowBytes: rowbytes, dataType: MPSDataType.float16)
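In case it helps: my current understanding (an assumption on my part, not verified against the headers) is that the deprecated initializer was superseded by the rows:-based factory method, which would make the replacement look like this:

```swift
import MetalPerformanceShaders

// Sketch, assuming the rows:-based factory method is the intended
// replacement for the deprecated dimensions:-based initializer;
// `rowbytes` is the same value I already compute.
let mdesc = MPSMatrixDescriptor.matrixDescriptor(rows: 2,
                                                 columns: 2,
                                                 rowBytes: rowbytes,
                                                 dataType: .float16)
```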
Post not yet marked as solved
Hello everybody.
I'm trying to port graphics code written in Cg in Unity to Metal.
Also, I don't want to implement a scene graph manually, so I am going to use SceneKit.
That means I should use either SCNProgram or SCNNodeRendererDelegate, and I think SCNProgram is more comfortable.
My real question is how to convert this code, written in Cg:
Cull Front
ZTest LEqual
ZWrite On
Blend SrcAlpha OneMinusSrcAlpha
I know how to do source-alpha blending in MTLRenderPipelineDescriptor, the z-buffer in the render command encoder, and the cull face as well. But when I use SCNProgram or SCNNodeRendererDelegate, I can't find these options. How do I change them? Please help.
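For reference, this is what I have pieced together so far (an unverified assumption on my part): with SCNProgram, the fixed-function state seems to live on SCNMaterial rather than on the pipeline descriptor or the encoder:

```swift
import SceneKit

// Sketch: my assumption is that these SCNMaterial properties map to the
// Cg render states above (not yet verified on-device).
let material = SCNMaterial()
material.program = program               // the SCNProgram
material.cullMode = .front               // Cull Front
material.writesToDepthBuffer = true      // ZWrite On
material.readsFromDepthBuffer = true     // ZTest (the comparison function itself does not seem to be exposed here)
material.blendMode = .alpha              // Blend SrcAlpha OneMinusSrcAlpha
```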
Post not yet marked as solved
I have a Metal compute kernel for dense matrix multiply, and I'd like to optimize it with simdgroup_float8x8 and simdgroup_half8x8.
However, I can't find anyone applying them in Metal.
Can you give me some more demos of how to use them, beyond what is in the Metal Shading Language Specification, version 2.4?
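For context, this is the kind of usage I have pieced together from the specification; the tile indexing and the one-SIMD-group-per-threadgroup dispatch are my assumptions, not something I have verified on a device:

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch: each threadgroup is assumed to be a single SIMD-group (32 threads)
// computing one 8x8 tile of C = A * B. M, N, K assumed multiples of 8.
kernel void matmul8x8(device const float *A [[buffer(0)]],
                      device const float *B [[buffer(1)]],
                      device float       *C [[buffer(2)]],
                      constant uint      &K [[buffer(3)]],
                      constant uint      &N [[buffer(4)]],
                      uint2 tile [[threadgroup_position_in_grid]])
{
    simdgroup_float8x8 acc = make_filled_simdgroup_matrix<float, 8, 8>(0.0f);
    for (uint k = 0; k < K; k += 8) {
        simdgroup_float8x8 a, b;
        simdgroup_load(a, A + tile.y * 8 * K + k, K);  // 8x8 tile of A
        simdgroup_load(b, B + k * N + tile.x * 8, N);  // 8x8 tile of B
        simdgroup_multiply_accumulate(acc, a, b, acc); // acc = a*b + acc
    }
    simdgroup_store(acc, C + tile.y * 8 * N + tile.x * 8, N);
}
```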
Thanks!
Post not yet marked as solved
I have a project that solves the viscoelastic equation for sound transmission in biological media https://github.com/ProteusMRIgHIFU/BabelViscoFDTD. This code supports CUDA, OpenCL, Metal, and OpenMP backends. We have done a lot of fine-tuning for each backend to get the best performance possible for each platform. Details of the numerical simulation and hardware used are detailed in the link above. Here you can see a summary of the results:
First of all, the M1 Max is a knockout for both AMD and Nvidia, but only when using OpenCL. Worth noting, the OpenMP performance of the M1 Max is also excellent. It is simply mind-blowing that the M1 Max is neck and neck with an Nvidia RTX A6000 that costs more than the MacBook Pro used for the test. Metal results, on the other hand, are a bit inconsistent. Metal shows excellent results on the AMD W6800 Pro (the best computing time of all tested GPUs), but not so much with a Vega 56 or the M1 Max. For all Metal-capable processors, we used the first formula recommended at https://developer.apple.com/documentation/metal/calculating_threadgroup_and_grid_sizes.
Further tests trying different domain sizes showed that the M1 Max with OpenCL can get even better results than the A6000, but Metal remains lagging by a lot.
Is there something else for the M1 Max with Metal that I could be missing or that is worth exploring? I want to be sure our applications are future-proof. It was surprising that OpenCL is still alive in Monterey, but we know it is supposed to be discontinued at some point.
Post not yet marked as solved
After capturing several Metal frames of my iOS game (an IPA I packaged from UE4), I can no longer get the shader source as before. The following message box appears.
As the image says, I checked my build settings. However, there are no Metal compiler build options and no "produce debugging information" item either.
macOS 12.1 Monterey
Xcode 13.1 (13A1030d)
Any help would be appreciated.
Hello, everyone.
I'm trying to use MetalKit together with SceneKit: SceneKit's scene graph is great, but I want to implement low-level Metal shaders.
I want to use SCNNodeRendererDelegate, without SCNProgram, because I need low-level control, for example passing extra custom MTLBuffers or doing multi-pass rendering.
So I pass the model-view-projection matrix like this:
In the Metal shader:
struct NodeBuffer {
    float4x4 modelTransform;
    float4x4 modelViewProjectionTransform;
    float4x4 modelViewTransform;
    float4x4 normalTransform;
    float2x3 boundingBox;
};
In the Swift code:
struct NodeMatrix: sizeable {
    var modelTransform = float4x4()
    var modelViewProjectionTransform = float4x4()
    var modelViewTransform = float4x4()
    var normalTransform = float4x4()
    var boundingBox = float2x3()
}
...
private func updateNodeMatrix(_ camNode: SCNNode) {
    guard let camera = camNode.camera else {
        return
    }
    let modelMatrix = transform
    let viewMatrix = camNode.transform
    let projectionMatrix = camera.projectionTransform
    let viewProjection = SCNMatrix4Mult(viewMatrix, projectionMatrix)
    let modelViewProjection = SCNMatrix4Mult(modelMatrix, viewProjection)
    nodeMatrix.modelViewProjectionTransform = float4x4(modelViewProjection)
}
...
public func renderNode(_ node: SCNNode,
                       renderer: SCNRenderer,
                       arguments: [String: Any]) {
    guard let renderTexturePipelineState = renderTexturePipelineState,
          let renderCommandEncoder = renderer.currentRenderCommandEncoder,
          let camNode = renderer.pointOfView,
          let texture = texture
    else { return }
    updateNodeMatrix(camNode)
    guard let nodeBuffer = renderer.device?.makeBuffer(bytes: &nodeMatrix,
                                                       length: NodeMatrix.stride,
                                                       options: [])
    else { return }
    renderCommandEncoder.setDepthStencilState(depthState)
    renderCommandEncoder.setRenderPipelineState(renderTexturePipelineState)
    renderCommandEncoder.setFragmentTexture(texture, index: 0)
    renderCommandEncoder.setVertexBuffer(vertexBuffer, offset: 0, index: 0)
    renderCommandEncoder.setVertexBuffer(nodeBuffer, offset: 0, index: 1)
    renderCommandEncoder.drawIndexedPrimitives(type: .triangle,
                                               indexCount: indexCount,
                                               indexType: .uint16,
                                               indexBuffer: indexBuffer,
                                               indexBufferOffset: 0)
}
But I get the wrong model-view-projection matrix in the shader.
I think SceneKit applies some hidden intermediate transform that I am not accounting for.
I can't figure it out. Please help.
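In case it helps anyone answering: one variant I am experimenting with (an unverified assumption on my part) treats the view matrix as the inverse of the camera node's world transform, rather than the node transform itself:

```swift
import SceneKit

// Sketch of the variant I am testing; `node`, `camNode`, and `camera` stand
// for the same objects as in the code above. Assumption (not confirmed):
// SceneKit's view matrix is the inverse of the camera's *world* transform.
let modelMatrix = node.worldTransform
let viewMatrix = SCNMatrix4Invert(camNode.worldTransform)
let viewProjection = SCNMatrix4Mult(viewMatrix, camera.projectionTransform)
let modelViewProjection = SCNMatrix4Mult(modelMatrix, viewProjection)
```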
Post not yet marked as solved
I am trying to measure performance in my app, and I used two different ways to measure command buffer completion time. One way is MTLCommandBuffer's addCompletedHandler:
commandBuffer.addCompletedHandler { cb in
    let executionDuration = cb.gpuEndTime - cb.gpuStartTime
    /* ... */
}
The other way is to use MTLCaptureManager. And I found two interesting things:
First, the completion time from addCompletedHandler was 26.82 ms; on the other hand, the GPU time from the capture manager was 13.49 ms. I have been trying to understand why these two numbers are so different, but couldn't find a concrete answer.
Second, the GPU time is different from the shader times shown in the timeline in Performance. Here is a screenshot.
According to the timeline, it took 17.57 ms, so there is an inconsistency. I ran the same test multiple times; sometimes the process time on the timeline is less than the GPU time, and sometimes vice versa.
Within this command buffer there are 56 dispatches. Is this because there are too many dispatches?
I tested this on an iPhone 12 Max with iOS 15.2.1.
If someone can give me a clear explanation, it would be really appreciated.
As the documentation says, limiter counters tell you which subsystems of the GPU are active by providing the percentage of the total number of processor cycles during which a given subsystem was active.
Besides that, Instruments also provides some utilization counters, and their values differ from the limiters. What do the utilization counters mean?
Post not yet marked as solved
While the above three frameworks (viz. vImage, Core Image, and MetalPerformanceShaders) serve different overall purposes, what are the strengths and weaknesses of each of the three frameworks in terms of performance with respect to image processing? It seems that any of the three is highly performant; but where does each framework shine?
Post not yet marked as solved
How can I clear the OpenCL cache, which contains precompiled OpenCL kernels? It is saved somewhere on disk, because the cache persists even after a system restart. I suppose it uses the same cache as Metal, but I cannot locate that either.
This cache is problematic because if one of the header files used by the OpenCL code is modified, the OpenCL kernel is not recompiled.
Is it possible to pass MTLTexture to Metal Core Image Kernel? How can Metal resources be shared with Core Image?
Post not yet marked as solved
Hello guys.
With the release of the M1 Pro and M1 Max in particular, the Mac has become a platform that could become very interesting for games in the future. However, since some features are still missing from Metal, it could be problematic for some developers to port their games to it. With Unreal Engine 5 especially, you can already see a tendency in this direction, since e.g. Nanite and Lumen are unfortunately not available on the Mac.
As a Vulkan developer, I wanted to inquire about some features that are not yet available in Metal. These features are very interesting if you want to write a GPU-driven renderer for a modern game engine.
Furthermore, these features could be used to emulate D3D12 on the Mac via MoltenVK, which would result in more games being available on the Mac.
Buffer device address:
This feature allows the application to query a 64-bit device address for a buffer.
It is very useful for D3D12 emulation and for compatibility with Vulkan, e.g. to implement ray tracing in MoltenVK.
DrawIndirectCount:
This feature allows an application to source the number of draws for indirect drawing calls from a buffer. It is also very useful in many GPU-driven situations.
Only 500,000 resources per argument buffer:
Metal has a limit of 500,000 resources per argument buffer. To be equivalent to D3D12 Resource Binding Tier 2, you would need 1 million. This is also very important, as many DirectX 12 game engines could then be ported to Metal more easily.
Mesh shader / Task shader:
Two interesting new shader stages to optimize the rendering pipeline.
Are there any plans to implement these features in the future?
Is there a roadmap for Metal? Is there a website where I can suggest features to the Metal developers?
I hope to see at least the first three features in Metal in the future, and I think that many developers feel the same way.
Best regards,
Marlon
Post not yet marked as solved
Hello! I’m having an issue with retrieving the trained weights from MLCLSTMLayer in ML Compute when training on a GPU. I maintain references to the input-weights, hidden-weights, and biases tensors and use the following code to extract the data post-training:
extension MLCTensor {
    func dataArray<Scalar>(as _: Scalar.Type) throws -> [Scalar] where Scalar: Numeric {
        let count = self.descriptor.shape.reduce(into: 1) { (result, value) in
            result *= value
        }
        var array = [Scalar](repeating: 0, count: count)
        self.synchronizeData() // This *should* copy the latest data from the GPU to memory that's accessible by the CPU
        _ = try array.withUnsafeMutableBytes { (pointer) in
            guard let data = self.data else {
                throw DataError.uninitialized // A custom error that I declare elsewhere
            }
            data.copyBytes(to: pointer)
        }
        return array
    }
}
The issue is that when I call dataArray(as:) on a weights or biases tensor for an LSTM layer that has been trained on a GPU, the values that it retrieves are the same as they were before training began. For instance, if I initialize the biases all to 0 and then train the LSTM layer on a GPU, the biases values seemingly remain 0 post-training, even though the reported loss values decrease as you would expect.
This issue does not occur when training an LSTM layer on a CPU, and it also does not occur when training a fully-connected layer on a GPU. Since both types of layers work properly on a CPU but only MLCFullyConnectedLayer works properly on a GPU, it seems that the issue is a bug in ML Compute’s GPU implementation of MLCLSTMLayer specifically.
For reference, I'm testing my code on an M1 Max.
Am I doing something wrong, or is this an actual bug that I should report in Feedback Assistant?
Post not yet marked as solved
Is it possible to do any of the following:
1. Export a model created using MetalPerformanceShadersGraph to a CoreML file;
2. Failing 1., save a trained MetalPerformanceShadersGraph model in any other way for deployment;
3. Import a CoreML model and use it as part of a MetalPerformanceShadersGraph model.
Thanks!
There is a write function documented in the Core Image Metal shader reference here: https://developer.apple.com/metal/MetalCIKLReference6.pdf
But I'm not sure how to use it. I assumed I would be able to call it on the destination parameter, i.e. dest.write(...), but I get the error "no member named 'write' in 'coreimage::destination'".
How do I use this function?
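For reference, this is the pattern that does compile for me; as far as I can tell (an assumption on my part), the destination argument only exposes coord(), and whatever the kernel returns is what gets written:

```metal
#include <CoreImage/CoreImage.h>

extern "C" {
namespace coreimage {
    // Sketch: a pass-through kernel where the written value is simply the
    // return value; 'dest' only provides the destination-space coordinate.
    float4 passthrough(sampler src, destination dest) {
        float2 dc = dest.coord();                 // where this pixel will land
        return src.sample(src.transform(dc));     // map into sampler space and read
    }
}
}
```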
Post not yet marked as solved
I've created a custom BoxBlur kernel that produces identical results to Apple's built-in box blur (CIBoxBlur) kernel but my custom kernel is orders of magnitude slower. So naturally I am wondering what I'm doing wrong to get such poor performance. Below is my custom kernel in the Metal shading language. Can you spot why it's so slow? The built-in filter performs well so I can only assume it's something I'm doing wrong.
#include <CoreImage/CoreImage.h>
#import <simd/simd.h>

extern "C" {
namespace coreimage {
    float4 customBoxBlurFilterKernel(sampler src) {
        float2 crd = src.coord();
        int edge = 100;
        int minx = crd.x - edge;
        int maxx = crd.x + edge;
        int miny = crd.y - edge;
        int maxy = crd.y + edge;
        float4 sums = float4(0, 0, 0, 0);
        float cnt = 0;
        // compute average of surrounding rgb values
        for (int row = miny; row < maxy; row++) {
            for (int col = minx; col < maxx; col++) {
                float4 samp = src.sample(float2(col, row));
                sums[0] += samp[0];
                sums[1] += samp[1];
                sums[2] += samp[2];
                cnt += 1.;
            }
        }
        return float4(sums[0] / cnt, sums[1] / cnt, sums[2] / cnt, 1);
    }
}
}
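For anyone comparing: my current theory (unverified) is that the cost comes from the kernel being O(r^2) per pixel (a 200 x 200 = 40,000-tap loop), whereas built-in blurs are typically separable, i.e. two O(r) passes. Here is a sketch of the horizontal pass I am planning to test; the vertical pass would be the same with the axes swapped:

```metal
#include <CoreImage/CoreImage.h>

extern "C" {
namespace coreimage {
    // Sketch: horizontal pass of a separable box blur. Assumption: running
    // this pass followed by a matching vertical pass reproduces the 2D box
    // blur at O(r) instead of O(r^2) samples per pixel.
    float4 boxBlurHorizontal(sampler src) {
        float2 crd = src.coord();
        const int edge = 100;
        float4 sum = float4(0.0);
        for (int dx = -edge; dx <= edge; dx++) {
            sum += src.sample(float2(crd.x + dx, crd.y));
        }
        float4 avg = sum / float(2 * edge + 1);
        return float4(avg.rgb, 1.0);
    }
}
}
```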
Post not yet marked as solved
The MPS API allows you to run kernels in an MTLCommandBuffer, but is it possible to create one MTLComputeCommandEncoder and run several kernels in it, without a separate encoder being created for each kernel under the hood?
Something like:
// Create Command Buffer
// Create Encoder
kernel1.encode(encoder: encoder, sourceTexture: source, destinationTexture: k1Destination)
kernel2.encode(encoder: encoder, sourceTexture: k1Destination, destinationTexture: destination)
encoder.endEncoding()
commandBuffer.commit()
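For contrast, here is what I am doing today (assuming kernel1 and kernel2 are MPSUnaryImageKernel subclasses): each kernel encodes itself into the command buffer, and as far as I understand, MPS creates an encoder per kernel internally:

```swift
import Metal
import MetalPerformanceShaders

// Current approach (for contrast): each kernel is handed the command
// buffer, and MPS manages compute encoders internally.
let commandBuffer = commandQueue.makeCommandBuffer()!
kernel1.encode(commandBuffer: commandBuffer,
               sourceTexture: source,
               destinationTexture: k1Destination)
kernel2.encode(commandBuffer: commandBuffer,
               sourceTexture: k1Destination,
               destinationTexture: destination)
commandBuffer.commit()
```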
Post not yet marked as solved
Hi,
I'm trying to debug a Metal kernel that does analytics computation on the GPU.
If I set a breakpoint in the Metal kernel source code, Xcode tells me it won't pause at this breakpoint because it has not been resolved, and the breakpoint is disabled.
Is it possible to debug Metal kernel source code in any way?
Thanks in advance
Post not yet marked as solved
I am working on the implementation of a highly demanding signal processing algorithm, and I am not able to reach an acceptable execution time with vDSP's routines.
I am now having a look at Metal and learning how to use it. It seems that Metal Performance Shaders as well as MPS Graph could replace almost all of my vDSP calls, except the fast Fourier transform (which is the most time-consuming part of the algorithm).
I was wondering whether FFT methods could be added to MPS, because they could be insanely fast if optimized for the unified memory architecture of the M1.
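For reference, this is the kind of vDSP call I would hope to replace with an MPS equivalent; the sketch below runs a 1024-point forward FFT on split-complex data (the sizes are illustrative, not my real workload):

```swift
import Accelerate

// Sketch: 1024-point forward FFT via the classic vDSP C API.
let log2n = vDSP_Length(10)                  // 2^10 = 1024 points
let setup = vDSP_create_fftsetup(log2n, FFTRadix(kFFTRadix2))!
var real = [Float](repeating: 0, count: 1024)
var imag = [Float](repeating: 0, count: 1024)
real.withUnsafeMutableBufferPointer { rp in
    imag.withUnsafeMutableBufferPointer { ip in
        var split = DSPSplitComplex(realp: rp.baseAddress!, imagp: ip.baseAddress!)
        vDSP_fft_zip(setup, &split, 1, log2n, FFTDirection(FFT_FORWARD))
    }
}
vDSP_destroy_fftsetup(setup)
```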
Thanks!
Post not yet marked as solved
In the function processLastArData(), a command buffer is committed and the output of the last MPS kernel is immediately read, without calling waitUntilCompleted() on the buffer. What am I missing?
https://developer.apple.com/documentation/arkit/environmental_analysis/displaying_a_point_cloud_using_scene_depth?language=objc
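For clarity, this is the pattern I expected to be necessary before reading the MPS output on the CPU (a sketch of my assumption, not the sample's actual code):

```swift
import Metal

// Sketch: block until the GPU has finished before reading the result on
// the CPU; `commandBuffer` stands in for the sample's buffer.
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
// ...only now read the last MPS kernel's output.
```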