Post not yet marked as solved
Is it possible to do any of the following:
1. Export a model created using MetalPerformanceShadersGraph to a CoreML file;
2. Failing 1., save a trained MetalPerformanceShadersGraph model in any other way for deployment;
3. Import a CoreML model and use it as part of a MetalPerformanceShadersGraph model?
Thanks!
Hello! I’m having an issue with retrieving the trained weights from MLCLSTMLayer in ML Compute when training on a GPU. I maintain references to the input-weights, hidden-weights, and biases tensors and use the following code to extract the data post-training:
extension MLCTensor {
    func dataArray<Scalar>(as _: Scalar.Type) throws -> [Scalar] where Scalar: Numeric {
        let count = self.descriptor.shape.reduce(into: 1) { (result, value) in
            result *= value
        }
        var array = [Scalar](repeating: 0, count: count)
        self.synchronizeData() // This *should* copy the latest data from the GPU to memory that's accessible by the CPU
        _ = try array.withUnsafeMutableBytes { (pointer) in
            guard let data = self.data else {
                throw DataError.uninitialized // A custom error that I declare elsewhere
            }
            data.copyBytes(to: pointer)
        }
        return array
    }
}
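For context, the extension is invoked like this (`biasesTensor` is a hypothetical name standing in for one of the retained MLCTensor references):

```swift
// Hypothetical usage; `biasesTensor` is one of the retained weight/bias tensors.
let biases = try biasesTensor.dataArray(as: Float.self)
```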
The issue is that when I call dataArray(as:) on a weights or biases tensor for an LSTM layer that has been trained on a GPU, the values it retrieves are the same as they were before training began. For instance, if I initialize the biases all to 0 and then train the LSTM layer on a GPU, the bias values seemingly remain 0 post-training, even though the reported loss values decrease as you would expect.
This issue does not occur when training an LSTM layer on a CPU, and it also does not occur when training a fully-connected layer on a GPU. Since both types of layers work properly on a CPU but only MLCFullyConnectedLayer works properly on a GPU, it seems that the issue is a bug in ML Compute's GPU implementation of MLCLSTMLayer specifically.
For reference, I'm testing my code on an M1 Max.
Am I doing something wrong, or is this an actual bug that I should report in Feedback Assistant?
Hello guys.
With the release of the M1 Pro and M1 Max in particular, the Mac has become a platform that could be very interesting for games in the future. However, some features are still missing from Metal, which could make it problematic for developers to port their games. You can already see this tendency with Unreal Engine 5, since Nanite and Lumen, for example, are unfortunately not available on the Mac.
As a Vulkan developer, I wanted to inquire about some features that are not yet available in Metal. These features are very interesting if you want to write a GPU-driven renderer for a modern game engine.
Furthermore, these features could be used to emulate D3D12 on the Mac via MoltenVK, which would result in more games being available on the Mac.
Buffer device address:
This feature allows the application to query a 64-bit buffer device address value for a buffer.
It is very useful for D3D12 emulation and for compatibility with Vulkan, e.g. to implement ray tracing on MoltenVK.
DrawIndirectCount:
This feature allows an application to source the number of draws for indirect drawing calls from a buffer. It is also very useful in many GPU-driven situations.
Only 500,000 resources per argument buffer:
Metal has a limit of 500,000 resources per argument buffer. To be equivalent to D3D12 Resource Binding Tier 2, you would need 1 million. This is also very important, as many DirectX 12 game engines could then be ported to Metal more easily.
Mesh shader / Task shader:
Two interesting new shader stages for optimizing the rendering pipeline.
Are there any plans to implement these features in the future?
Is there a roadmap for Metal? Is there a website where I can suggest features to the Metal developers?
I hope to see at least the first three features in Metal in the future, and I think many developers feel the same way.
Best regards,
Marlon
Is it possible to pass an MTLTexture to a Metal Core Image kernel? How can Metal resources be shared with Core Image?
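For what it's worth, one documented bridge in this direction is CIImage's Metal-texture initializer plus a Metal-backed CIContext. A minimal sketch (the textures, device, and command buffer are assumed to exist already):

```swift
import CoreImage
import Metal

// Sketch: wrap an existing MTLTexture as a CIImage, run a filter, and render
// the result into another MTLTexture on the given command buffer.
func applyCoreImage(device: MTLDevice,
                    inputTexture: MTLTexture,
                    outputTexture: MTLTexture,
                    commandBuffer: MTLCommandBuffer) {
    let context = CIContext(mtlDevice: device)
    guard let image = CIImage(mtlTexture: inputTexture, options: nil) else { return }
    let blurred = image.applyingGaussianBlur(sigma: 4)
    context.render(blurred,
                   to: outputTexture,
                   commandBuffer: commandBuffer,
                   bounds: image.extent,
                   colorSpace: CGColorSpaceCreateDeviceRGB())
}
```

Passing an MTLTexture directly into a custom CIKernel is a different question; the above only shows the documented CIImage/CIContext interop path.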
How do I clear the OpenCL cache that contains pre-compiled OpenCL kernels? It is saved somewhere on disk, because the cache persists even after a system restart. I suppose it uses the same cache as Metal, but I cannot locate that either.
This cache is problematic because if one of the header files for the OpenCL code is modified, the OpenCL kernel is not re-compiled.
While the above three frameworks (viz. vImage, Core Image, and MetalPerformanceShaders) serve different overall purposes, what are the strengths and weaknesses of each in terms of image-processing performance? Any of the three seems highly performant, but where does each framework shine?
As the documentation says, limiter counters tell you which subsystems of the GPU are active by providing the percentage of total processor cycles during which each subsystem was active.
Besides that, Instruments also provides some utilization counters, whose values differ from the limiters. What do utilization counters mean?
I am trying to measure performance in my app, and I used two different ways to measure command-buffer completion time. One way is MTLCommandBuffer's addCompletedHandler:
commandBuffer.addCompletedHandler { cb in
    let executionDuration = cb.gpuEndTime - cb.gpuStartTime
    /* ... */
}
The other way is to use MTLCaptureManager. And I found two interesting things:
First, the completion time from addCompletedHandler was 26.82 ms, while the GPU time from the capture manager was 13.49 ms. I have been trying to understand why these two numbers are so different, but couldn't find any concrete answer.
Second, the GPU time differs from the shader times shown in the timeline in the Performance view. Here is a screenshot.
According to the timeline, it took 17.57 ms, which is inconsistent. I ran the same test multiple times, and sometimes the process time on the timeline is less than the GPU time, and sometimes vice versa.
Within this command buffer, there are 56 dispatches. Is this because there are too many dispatches?
I tested this on an iPhone 12 Max with iOS 15.2.1.
If someone can give me a clear explanation, it would be really appreciated.
Hello, everyone.
I'm trying to use MetalKit together with SceneKit: SceneKit's scene graph is great, and I want to implement low-level Metal shaders.
I want to use SCNNodeRendererDelegate without SCNProgram, because I want a low-level implementation, for example passing extra custom MTLBuffers or doing multi-pass rendering.
So I pass the model-view-projection matrix like this.
In the Metal shader:
struct NodeBuffer {
    float4x4 modelTransform;
    float4x4 modelViewProjectionTransform;
    float4x4 modelViewTransform;
    float4x4 normalTransform;
    float2x3 boundingBox;
};
In Swift code:
struct NodeMatrix: sizeable {
    var modelTransform = float4x4()
    var modelViewProjectionTransform = float4x4()
    var modelViewTransform = float4x4()
    var normalTransform = float4x4()
    var boundingBox = float2x3()
}
...
private func updateNodeMatrix(_ camNode: SCNNode) {
    guard let camera = camNode.camera else {
        return
    }
    let modelMatrix = transform
    let viewMatrix = camNode.transform
    let projectionMatrix = camera.projectionTransform
    let viewProjection = SCNMatrix4Mult(viewMatrix, projectionMatrix)
    let modelViewProjection = SCNMatrix4Mult(modelMatrix, viewProjection)
    nodeMatrix.modelViewProjectionTransform = float4x4(modelViewProjection)
}
...
public func renderNode(_ node: SCNNode,
                       renderer: SCNRenderer,
                       arguments: [String: Any])
{
    guard let renderTexturePipelineState = renderTexturePipelineState,
          let renderCommandEncoder = renderer.currentRenderCommandEncoder,
          let camNode = renderer.pointOfView,
          let texture = texture
    else { return }
    updateNodeMatrix(camNode)
    guard let nodeBuffer = renderer.device?.makeBuffer(bytes: &nodeMatrix,
                                                       length: NodeMatrix.stride,
                                                       options: [])
    else { return }
    renderCommandEncoder.setDepthStencilState(depthState)
    renderCommandEncoder.setRenderPipelineState(renderTexturePipelineState)
    renderCommandEncoder.setFragmentTexture(texture, index: 0)
    renderCommandEncoder.setVertexBuffer(vertexBuffer, offset: 0, index: 0)
    renderCommandEncoder.setVertexBuffer(nodeBuffer, offset: 0, index: 1)
    renderCommandEncoder.drawIndexedPrimitives(type: .triangle,
                                               indexCount: indexCount,
                                               indexType: .uint16,
                                               indexBuffer: indexBuffer,
                                               indexBufferOffset: 0)
}
But I get the wrong model-view-projection matrix in the shader.
I think SceneKit applies some hidden intermediate transform that modifies it.
I can't figure out why. Help me...
After capturing several Metal frames of my iOS game, which I packaged as an IPA file from UE4, I fail to get the shader source as before. The following message box comes up.
As the image says, I checked my build settings. However, there is no Metal compiler build option and no "Produce debugging information" item either.
My macOS: 12.1 Monterey
My Xcode: 13.1 (13A1030d)
Any help will be appreciated.
I have a project that solves the viscoelastic equation for sound transmission in biological media: https://github.com/ProteusMRIgHIFU/BabelViscoFDTD. This code supports CUDA, OpenCL, Metal, and OpenMP backends. We have done a lot of fine-tuning for each backend to get the best performance possible on each platform. The numerical simulation and the hardware used are detailed in the link above. Here you can see a summary of the results:
First of all, the M1 Max is a knockout to both AMD and Nvidia, but only when using OpenCL. Worth noting, the OpenMP performance of the M1 Max is also more than excellent. It is simply mind-blowing that the M1 Max is neck and neck with an Nvidia RTX A6000 that costs more than the MacBook Pro used for the test. Metal results, on the other hand, are a bit inconsistent. Metal shows excellent results on the AMD W6800 Pro (the best computing time of all tested GPUs), but not so much with a Vega 56 or the M1 Max. For all Metal-capable processors, we used the first formula recommended at https://developer.apple.com/documentation/metal/calculating_threadgroup_and_grid_sizes.
Further tests trying different domain sizes showed that the M1 Max with OpenCL can get even better results than the A6000, but Metal remains lagging by a lot.
Is there something else I could be missing or worth exploring for the M1 Max with Metal? I want to be sure our applications are future-proof. It was surprising that OpenCL is still alive in Monterey, but we know it is supposed to be discontinued at some point.
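For reference, the first formula from that Apple page can be sketched in Swift as follows (the encoder, pipeline state, and 2D domain size are assumed):

```swift
import Metal

// Sketch of the threadgroup-size formula from Apple's
// "Calculating Threadgroup and Grid Sizes" page: the threadgroup width is the
// SIMD execution width, and the height fills the rest of the thread budget.
func dispatch(encoder: MTLComputeCommandEncoder,
              pipelineState: MTLComputePipelineState,
              width: Int, height: Int) {
    let w = pipelineState.threadExecutionWidth
    let h = pipelineState.maxTotalThreadsPerThreadgroup / w
    let threadsPerThreadgroup = MTLSize(width: w, height: h, depth: 1)
    let threadsPerGrid = MTLSize(width: width, height: height, depth: 1)
    // dispatchThreads lets Metal handle non-uniform edge threadgroups
    // on hardware that supports it.
    encoder.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
}
```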
I have a Metal compute kernel for dense matrix multiply, and I'd like to optimize it with simdgroup_float8x8 and simdgroup_half8x8.
However, it seems hardly anyone applies them in Metal.
Can you give me more demos of how to use them, beyond what's in the Metal Shading Language Specification version 2.4?
Thanks!
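To make the question concrete, here is a minimal sketch of the kind of kernel being asked about, based only on the simdgroup-matrix functions in the MSL specification. It assumes square N x N row-major matrices with N a multiple of 8, and that the dispatch launches exactly one simdgroup per 8x8 output tile; it is illustrative, not tuned.

```metal
#include <metal_stdlib>
using namespace metal;

// One simdgroup computes one 8x8 tile of C = A * B.
kernel void sgemm_8x8_tiles(device const float *A [[buffer(0)]],
                            device const float *B [[buffer(1)]],
                            device       float *C [[buffer(2)]],
                            constant     uint  &N [[buffer(3)]],
                            uint2 tile [[threadgroup_position_in_grid]])
{
    // Accumulator initialized to zero.
    simdgroup_float8x8 acc = make_filled_simdgroup_matrix<float, 8, 8>(0.0f);
    simdgroup_float8x8 a, b;
    for (uint k = 0; k < N; k += 8) {
        // The last argument of simdgroup_load is the row stride in elements.
        simdgroup_load(a, A + (tile.y * 8) * N + k, N);
        simdgroup_load(b, B + k * N + (tile.x * 8), N);
        simdgroup_multiply_accumulate(acc, a, b, acc);
    }
    simdgroup_store(acc, C + (tile.y * 8) * N + (tile.x * 8), N);
}
```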
Hello, everybody.
I'm trying to port graphics code written in Cg in Unity to Metal.
Also, I don't want to implement a scene graph manually, so I'm going to use SceneKit.
So I should use SCNProgram or SCNNodeRendererDelegate, and I think SCNProgram is more comfortable.
My real question is how to convert this Cg code:
Cull Front
ZTest LEqual
ZWrite On
Blend SrcAlpha OneMinusSrcAlpha
I know how to do source-alpha blending in MTLRenderPipelineDescriptor, the Z-buffer in the render command encoder, and face culling as well. But when I use SCNProgram or SCNNodeRendererDelegate, I can't find these options... How do I change them? Help me.
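Outside SceneKit, the four states above map to plain Metal roughly like this (a sketch; the device, pipeline descriptor, and encoder are assumed to exist):

```swift
import Metal

// Rough mapping of the Cg/ShaderLab render states to plain Metal.
func configure(device: MTLDevice,
               pipelineDescriptor: MTLRenderPipelineDescriptor,
               encoder: MTLRenderCommandEncoder) {
    // Blend SrcAlpha OneMinusSrcAlpha
    let attachment = pipelineDescriptor.colorAttachments[0]!
    attachment.isBlendingEnabled = true
    attachment.sourceRGBBlendFactor = .sourceAlpha
    attachment.sourceAlphaBlendFactor = .sourceAlpha
    attachment.destinationRGBBlendFactor = .oneMinusSourceAlpha
    attachment.destinationAlphaBlendFactor = .oneMinusSourceAlpha

    // ZTest LEqual / ZWrite On
    let depthDescriptor = MTLDepthStencilDescriptor()
    depthDescriptor.depthCompareFunction = .lessEqual
    depthDescriptor.isDepthWriteEnabled = true
    let depthState = device.makeDepthStencilState(descriptor: depthDescriptor)
    encoder.setDepthStencilState(depthState)

    // Cull Front
    encoder.setCullMode(.front)
}
```

Within SceneKit itself, some of these states are exposed on SCNMaterial instead (cullMode, blendMode, writesToDepthBuffer, readsFromDepthBuffer), which may be the missing piece when using SCNProgram.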
When I try to run my matrix multiplication, I receive the following warning on iOS but not on macOS:
'init(dimensions:columns:rowBytes:dataType:)' was deprecated in iOS 11.0
How may I change my code to remove the iOS warning? Here is my line generating the warning:
let mdesc = MPSMatrixDescriptor( dimensions: 2, columns: 2, rowBytes: rowbytes, dataType: MPSDataType.float16)
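If I'm reading the deprecation right, the replacement takes rows instead of dimensions; a sketch (keeping the original `rowbytes` value from the snippet above):

```swift
import MetalPerformanceShaders

// Sketch: the non-deprecated descriptor initializer uses `rows:` rather than
// `dimensions:`. `rowbytes` is assumed from the original code.
let mdesc = MPSMatrixDescriptor(rows: 2,
                                columns: 2,
                                rowBytes: rowbytes,
                                dataType: .float16)
```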
The M1 Pro/Max use unified memory for both the CPU and GPU. Does this simplify the structure of GPU-accelerated parallel computing, or is it still the same code as for an AMD GPU?
I want to debug a Metal shader using Xcode, but it prompts like this:
The Apple doc says we can use this method to solve it, but I cannot find "Produce debugging information" in Xcode 12.5:
https://developer.apple.com/documentation/metal/shader_authoring/developing_and_debugging_metal_shaders
Hi, I'm writing a Metal backend for the Leela Chess Zero NN-based chess engine, using MPSGraph for inference. Whenever I run inference on the graph, I get this error: /System/Volumes/Data/SWE/macOS/BuildRoots/220e8a1b79/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MetalPerformanceShaders-124.6.1/MPSCore/Utility/MPSLibrary.mm:311: failed assertion MPSLibrary::MPSKey_Create internal error: Unable to get MPS kernel ndArrayConvolution2D.
Does anyone know what would cause this error?
In Xcode 12, I could export GPU counters for GPU commands (draw calls). But in Xcode 13, it only exports GPU counters for encoders when I follow the steps in the picture.
I have an image processing pipeline that performs some work on the CPU after the GPU processes a texture and then writes its result into a shared buffer (i.e. storageMode = .shared) used by the CPU for its computation. After the CPU does its work, it similarly writes at a different offset into the same shared MTLBuffer object. The buffer is arranged as so:
uint | uint | ... | uint | float
offsets (contiguous): 0 | ...
where the floating-point slot is written into by the CPU and later used by the GPU in subsequent compute passes.
I haven't been able to explain or find documentation on the following strange behavior. The compute pipeline with the above buffer (call it buffer A) is as follows (without the force unwraps):
let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let commandBuffer = commandQueue.makeCommandBuffer()!
let sharedEvent = device.makeSharedEvent()!
let sharedEventQueue = DispatchQueue(label: "my-queue")
let sharedEventListener = MTLSharedEventListener(dispatchQueue: sharedEventQueue)
// Compute pipeline
kernelA.encode(commandBuffer: commandBuffer, sourceTexture: sourceTexture, destinationBuffer: bufferA)
commandBuffer.encodeCPUExecution(for: sharedEvent, listener: sharedEventListener) { [self] in
    var value = Float(0.0)
    bufferA.unsafelyWrite(&value, offset: Self.targetBufferOffset)
}
kernelB.setTargetBuffer(histogramBuffer, offset: Self.targetBufferOffset)
kernelB.encode(commandBuffer: commandBuffer, sourceTexture: sourceTexture, destinationTexture: destinationTexture)
Note that commandBuffer.encodeCPUExecution is simply a convenience function around the shared event object (encodeSignalEvent and encodeWaitForEvent) that signals and waits on event.signaledValue + 1 and event.signaledValue + 2, respectively.
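For readers, a plausible reconstruction of that convenience, based only on the description above (names and structure are my assumption, not the poster's actual code):

```swift
import Metal

extension MTLCommandBuffer {
    /// Hypothetical reconstruction: the GPU signals `base + 1`, which triggers
    /// `work` on the listener's queue; `work` then signals `base + 2`, which
    /// the GPU is waiting on before continuing with later encoders.
    func encodeCPUExecution(for event: MTLSharedEvent,
                            listener: MTLSharedEventListener,
                            _ work: @escaping () -> Void) {
        let base = event.signaledValue
        event.notify(listener, atValue: base + 1) { event, _ in
            work()
            event.signaledValue = base + 2
        }
        encodeSignalEvent(event, value: base + 1)
        encodeWaitForEvent(event, value: base + 2)
    }
}
```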
In the example above, kernel B does not see the writes made during the CPU execution. It can however see the values written into the buffer from kernelA.
The strange part: if you write to that same location in the buffer before the GPU schedules this work (e.g. during the encoding instead of in the middle of the GPU execution or whenever before), kernelB does see the value of the writes by the CPU.
This odd behavior suggests to me that something is undefined. If the buffer were .managed, I could understand it, since changes on each side must be made explicit; but with a .shared buffer this behavior seems quite unexpected, especially considering that the CPU can read the values written by the preceding kernel (viz. kernelA).
What explains this strange behavior with Metal?
Note:
This behavior occurs on an M1 Mac running Mac Catalyst and an iPad Pro (5th generation) running iOS 15.3.
I'm trying this on a physical device and always getting the error: Error for Family Controls: Error Domain=FamilyControls.FamilyControlsError Code=2 "(null)"
AuthorizationCenter.shared.requestAuthorization { result in
    switch result {
    case .success():
        print("Allowed to control apps")
    case .failure(let error):
        print("Error for Family Controls: \(error)")
    }
}
My question is: how can I authorize the parent using the Family Controls API in order to use, for example, the Device Activity framework and the Managed Settings framework?