Post not yet marked as solved
The new Xcode 14 Metal visualization and debugger features are awesome. However, one major pain point for me with indirect compute command buffers and indirect argument buffers is that the buffer viewer is not typed when viewing buffers bound to an argument buffer: it defaults to displaying everything as float. This is for compute encoding; I haven't tried it with render encoders. For non-trivial types (structs), it's painful to figure out which piece of memory is what, so when debugging especially complicated compute pipelines I convert everything back to regular MTLComputeCommandEncoders (non-bindless), which is far from ideal.
Are there any workarounds or is this a current limitation?
I hope someone here can give me some insight, because I am at my wits' end.
I have been trying to learn Metal for the past couple of months. In the process, I came across examples and articles about sorting networks and decided to try to implement them in Metal.
Now the problem is: if I run the code on my Mac, everything is fine. But if I run the same code on my iDevice (iPad Pro with LiDAR), I get all sorts of errors I do not understand, or the sorted data is corrupted and all wrong.
Typical Error
2021-02-17 12:13:11.218394-0500 METAL_ISSUE[97650:6709092] [GPUDebug] Invalid device load executing kernel function "bitonicSort" encoder: "0", dispatch: 0, at offset 384
file:///Users/staque/Development/OTHER/METAL_ISSUE/METAL_ISSUE/Shaders.metal:77:40 - bitonicSort()
MTLBufferArgument: 0x28006d200
Name = floats
Type = MTLArgumentTypeBuffer
Access = MTLArgumentAccessReadWrite
LocationIndex = 0
IsActive = 1
ArrayLength = 1
TypeInfo =
DataType = MTLDataTypePointer
ElementType = MTLDataTypeFloat
Access = MTLArgumentAccessReadWrite
Alignment = 4
DataSize = 4
Alignment = 4
DataSize = 4
DataType = MTLDataTypeFloat
buffer: "unknown"
You can pretty much drop these into Xcode's default Metal Game app template.
Shader (slightly modified to track the indexes of the floats.)
/*
[Using Code based off of this](https://github.com/tgymnich/MetalSort)
Rewritten to make it more understandable.
*/
kernel void bitonicSort(device float *floats [[ buffer(0) ]],
device int *uInts [[ buffer(1) ]],
constant int &p [[ buffer(2) ]],
constant int &q [[ buffer(3) ]],
uint gid [[ thread_position_in_grid ]])
{
int pMinusQ = p-q;
int distance = 1 << pMinusQ;
uint gidShiftedByP = gid >> p;
// True: Increasing / False: Decreasing
bool direction = (gidShiftedByP & 2) == 0;
uint gidDistance = (gid & distance);
bool isGidDistanceZero = (gidDistance == 0);
uint gidPlusDistance = (gid | distance);
bool isLowerIndexGreaterThanHigher = (floats[gid] > floats[gidPlusDistance]);
if (isGidDistanceZero && isLowerIndexGreaterThanHigher == direction) {
float temp = floats[gid];
floats[gid] = floats[gidPlusDistance];
floats[gidPlusDistance] = temp;
int temp2 = uInts[gid];
uInts[gid] = uInts[gidPlusDistance];
uInts[gidPlusDistance] = temp2;
}
}
The call.
func runSort() {
let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let sortFunction = library.makeFunction(name: "bitonicSort")!
let pipeline = try! device.makeComputePipelineState(function: sortFunction)
let setRange = 0..<1024
var floatData = [Float]()
var uintData = [UInt32]()
// Build the Float and index data backward to form the worst-case scenario for sorting.
for value in stride(from: Float(setRange.upperBound-1), to: Float(setRange.lowerBound-1), by: -1.0) {
floatData.append(value)
}
for value in stride(from: setRange.upperBound-1, to: setRange.lowerBound-1, by: -1) {
uintData.append(UInt32(value))
}
print(floatData)
print("")
print(uintData)
guard let logn = Int(exactly: log2(Double(floatData.count))) else {
fatalError("data.count is not a power of 2")
}
for p in 0..<logn {
for q in 0..<p+1 {
let floatDataBuffer = device.makeBuffer(bytes: &floatData,
length: MemoryLayout<Float>.stride * floatData.count,
options: [.storageModeShared])!
floatDataBuffer.label = "floatDataBuffer"
let uintDataBuffer = device.makeBuffer(bytes: &uintData,
length: MemoryLayout<UInt32>.stride * uintData.count,
options: [.storageModeShared])!
uintDataBuffer.label = "uintDataBuffer"
let threadgroupsPerGrid = MTLSize(width: floatData.count, height: 1, depth: 1)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
var n1 = p
var n2 = q
let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(floatDataBuffer, offset: 0, index: 0)
encoder.setBuffer(uintDataBuffer, offset: 0, index: 1)
encoder.setBytes(&n1, length: MemoryLayout<Float>.stride, index: 2)
encoder.setBytes(&n2, length: MemoryLayout<UInt32>.stride, index: 3)
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
let dataPointer = floatDataBuffer.contents().assumingMemoryBound(to: Float.self)
let dataBufferPointer = UnsafeMutableBufferPointer(start: dataPointer, count: floatData.count)
floatData = Array.init(dataBufferPointer)
let dataPointer2 = uintDataBuffer.contents().assumingMemoryBound(to: UInt32.self)
let dataBufferPointer2 = UnsafeMutableBufferPointer(start: dataPointer2, count: uintData.count)
uintData = Array.init(dataBufferPointer2)
}
}
print(floatData)
print("")
print(uintData)
}
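For comparison, here is a CPU sketch of the same sorting network (my own reconstruction, not part of the original MetalSort code); running it next to the GPU version makes it easy to diff results after each (p, q) pass:

```swift
import Foundation

// CPU reference for the bitonic sorting network, mirroring what each GPU
// thread does for a given (p, q) pass. Assumes a power-of-two element count.
func bitonicSortCPU(_ data: inout [Float]) {
    let n = data.count
    precondition(n > 0 && (n & (n - 1)) == 0, "count must be a power of 2")
    let logn = Int(log2(Double(n)))
    for p in 0..<logn {
        for q in 0...p {
            let distance = 1 << (p - q)
            for gid in 0..<n where (gid & distance) == 0 {
                // Direction flips every other block of 2^(p+1) elements.
                let ascending = ((gid >> p) & 2) == 0
                let partner = gid | distance
                if (data[gid] > data[partner]) == ascending {
                    data.swapAt(gid, partner)
                }
            }
        }
    }
}
```

Only the gids with `(gid & distance) == 0` act in a pass, so each pair is touched by exactly one thread; that is what makes the in-place swaps safe on the GPU too.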
If anyone has a clue what I should be doing I am all ears, because I need help.
Thanks in advance.
Stan
I've implemented GPU command encoding as described in the second part of the Modern Rendering with Metal WWDC19 talk.
The implementation applies frustum culling to each individual mesh instance and creates a draw_indexed_primitives command, in the same way as outlined in the talk. Each command has an instance count of 1.
My previous CPU command encoding implementation would group the visible mesh instances by mesh and pipeline state (after frustum culling) and encode the appropriate multi-instance draw call. With GPU command encoding running in parallel, I don't see a way to group meshes this way.
Is there any significant performance impact for issuing multiple draw calls for individual instances of the same mesh, as opposed to using instanced rendering?
This might very well be something that's not worth worrying about, but it would be good to have some input on this.
I'm running into an issue with threadgroup memory where data written to it seemingly gets lost when I use int8_t or int16_t element types:
#include <metal_stdlib>
using namespace metal;
kernel void kernel_function(device int16_t* R, uint index [[thread_position_in_threadgroup]]) {
threadgroup int16_t shared[1];
shared[index] = (int16_t) 42;
threadgroup_barrier(mem_flags::mem_threadgroup);
R[0] = shared[index];
}
If I execute this kernel (using the following host code: https://gist.github.com/maleadt/ffcda8fc94f03f32347c3167ccca70a8 ), I get zeros in my output buffer. If I change the element type from int16_t to int32_t (just find/replace in the kernel and host code) I get the expected results.
I'm new to Metal, so I guess I'm doing something wrong here. I'm using an M1 Pro on Monterey, with Xcode 13.4.1.
EDIT: interestingly, running under MTL_SHADER_VALIDATION=1 results in the expected output, so this does start to look like a miscompilation in the back-end.
I'm trying to get TensorFlow with Metal support running on my iMac (2017, Radeon Pro 580) following these instructions. However, simply importing TensorFlow (import tensorflow) produces the following error and crashes the Python console:
2022-05-27 11:46:12.419950: F tensorflow/c/experimental/stream_executor/stream_executor.cc:808] Non-OK-status: stream_executor::MultiPlatformManager::RegisterPlatform( std::move(cplatform)) status: INTERNAL: platform is already registered with name: "METAL"
Abort trap: 6
Versions: macOS 12.3, Python 3.8.13, tensorflow-macos 2.9.0, tensorflow-metal 0.5.0
Silly question, but does anyone know where I can find the code for the demos in this video?
https://developer.apple.com/videos/play/wwdc2022/10063/
I am trying to replicate the distributed training but running into version errors with Horovod.
I have a project that solves the viscoelastic equation for sound transmission in biological media https://github.com/ProteusMRIgHIFU/BabelViscoFDTD. This code supports CUDA, OpenCL, Metal, and OpenMP backends. We have done a lot of fine-tuning for each backend to get the best performance possible for each platform. Details of the numerical simulation and hardware used are detailed in the link above. Here you can see a summary of the results:
First of all, the M1 Max is a knockout against both AMD and Nvidia, but only when using OpenCL. Worth noting, the OpenMP performance of the M1 Max is also more than excellent. It is simply mind-blowing that the M1 Max is neck and neck with an Nvidia RTX A6000 that costs more than the MacBook Pro used for the test. Metal results, on the other hand, are a bit inconsistent. Metal shows excellent results on the AMD W6800 Pro (the best computing time of all tested GPUs), but not so much with a Vega 56 or the M1 Max. For all Metal-capable processors, we used the first formula recommended at https://developer.apple.com/documentation/metal/calculating_threadgroup_and_grid_sizes.
Further tests trying different domain sizes showed that the M1 Max with OpenCL can get even better results than the A6000, but Metal remains lagging by a lot.
Is there something else for the M1 Max with Metal that I could be missing or worth exploring? I want to be sure our applications are future-proof, given it was even surprising OpenCL is still alive in Monterey, but we know it is supposed to be discontinued at some point.
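For context, the first formula on that Apple page sizes a 2D threadgroup as threadExecutionWidth wide by maxTotalThreadsPerThreadgroup / threadExecutionWidth tall. A minimal sketch of that arithmetic (on a real MTLComputePipelineState the two inputs come from the pipeline's own properties):

```swift
// Apple's first recommended threadgroup sizing: one SIMD-group wide, and as
// many rows as fit under the pipeline's total-thread limit.
func recommendedThreadgroupSize(threadExecutionWidth w: Int,
                                maxTotalThreads: Int) -> (width: Int, height: Int) {
    (width: w, height: maxTotalThreads / w)
}
```

For example, with a threadExecutionWidth of 32 and a 1024-thread limit this yields a 32x32 threadgroup.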
Hi,
I'm developing an app that uses Metal to compute some calculations, and to improve the efficiency of the render process I started looking into indirect command buffers, but there isn't an example that explains the best way to proceed.
Can anyone provide some tips?
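Not an authoritative answer, but a minimal sketch of CPU-side setup for a render indirect command buffer may help as a starting point. All names here are placeholders, and the render pipeline must be created with supportIndirectCommandBuffers enabled:

```swift
import Metal

// Sketch: encode one draw into an indirect command buffer on the CPU, then
// replay it from a render encoder later. vertexBuffer, vertexCount and
// renderEncoder are assumed to exist elsewhere in the app.
let device = MTLCreateSystemDefaultDevice()!

let descriptor = MTLIndirectCommandBufferDescriptor()
descriptor.commandTypes = .draw           // only plain draws in this ICB
descriptor.inheritBuffers = false         // each command binds its own buffers
descriptor.maxVertexBufferBindCount = 1
descriptor.maxFragmentBufferBindCount = 0

let icb = device.makeIndirectCommandBuffer(descriptor: descriptor,
                                           maxCommandCount: 1,
                                           options: [])!

let command = icb.indirectRenderCommandAt(0)
// command.setVertexBuffer(vertexBuffer, offset: 0, at: 0)
// command.drawPrimitives(.triangle, vertexStart: 0, vertexCount: vertexCount,
//                        instanceCount: 1, baseInstance: 0)

// Later, inside a render pass:
// renderEncoder.executeCommandsInBuffer(icb, range: 0..<1)
```

The same descriptor-based pattern extends to GPU-side encoding, where a compute kernel writes the commands instead of the CPU loop above.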
I'm animating body and face movements using skeleton animation, SceneKit, and a custom Metal shader.
I need to deform some vertices in the vertex shader, therefore I need skinningJointMatrices in the vertex shader.
However if I just add the line
float4 skinningJointMatrices[183];
in the NodeBuffer :
struct NodeBuffer {
float4x4 inverseModelTransform;
float4x4 inverseModelViewTransform;
float4x4 modelTransform;
float4x4 modelViewProjectionTransform;
float4x4 modelViewTransform;
float4x4 normalTransform;
float2x3 boundingBox;
float4 skinningJointMatrices[765];
};
I get the following assertion:
[SceneKit] Assertion 'C3DSkinnerGetEffectiveCalculationMode(skinner, C3DNodeGetGeometry(context->_nodeUniforms.instanceNode)) == kC3DSkinnerCalculationModeGPUVertexFunction' failed. skinningJointMatrices should only be used when skinning is done in the vertex function
Is there a way to workaround this assert?
The code seems to work fine despite the assertion.
In my game project, there is a functions.data file at /AppData/Library/Caches/[bundleID]/com.apple.metal/functions.data.
When we reboot and launch the game, this file is reset to about 40KB; normally this file is about 30MB. This is done by Metal. Is there any way to avoid it?
When I begin a compute pass from a MTLComputePassDescriptor with sampleBufferAttachments on macOS 11.6, it cannot sample correct values from counters on an Apple M1 device, although the same approach works well for render passes.
Device: MacBook Pro (12-inch, M1, 2020)
MacOS: 11.6
NSRef<MTLComputePassDescriptor> descriptor = [MTLComputePassDescriptor computePassDescriptor];
descriptor.sampleBufferAttachments[0].sampleBuffer = counterSampleBuffer;
descriptor.sampleBufferAttachments[0].startOfEncoderSampleIndex = NSUInteger(0);
descriptor.sampleBufferAttachments[0].endOfEncoderSampleIndex = NSUInteger(1);
After the compute pass is completed, the values in sample buffer are zeros.
Another issue is MTLCounterDontSample: if we don't need to sample the counter at the end of a pass, we can set endOfEncoderSampleIndex to MTLCounterDontSample per the documentation, but that triggers a validation layer error:
failed assertion 'endOfEncoderSampleIndex (4294967295) must be < sample buffer count (2)'
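For what it's worth, here is a hedged sketch of how the sampled timestamps would be read back after waitUntilCompleted, assuming the device exposes the common timestamp counter set (the pass-attachment and commit steps are elided as comments):

```swift
import Metal

// Sketch: locate the timestamp counter set, build a 2-sample buffer, and
// resolve the samples once the command buffer has completed.
let device = MTLCreateSystemDefaultDevice()!
guard let timestampSet = device.counterSets?.first(where: {
    $0.name == MTLCommonCounterSet.timestamp.rawValue
}) else { fatalError("no timestamp counter set on this device") }

let desc = MTLCounterSampleBufferDescriptor()
desc.counterSet = timestampSet
desc.sampleCount = 2
desc.storageMode = .shared
let sampleBuffer = try device.makeCounterSampleBuffer(descriptor: desc)

// ... attach sampleBuffer to the compute pass, commit, waitUntilCompleted ...

if let data = sampleBuffer.resolveCounterRange(0..<2) {
    data.withUnsafeBytes { raw in
        let samples = raw.bindMemory(to: MTLCounterResultTimestamp.self)
        // samples[0].timestamp and samples[1].timestamp bracket the encoder.
    }
}
```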
Hello! I am starting to dig into the docs on object and mesh shaders. I see that the Metal API on the CPU side has new functions for setting object and mesh buffers in the new programmable stages. But I don't see corresponding changes to the API for MTLIndirectCommandBuffer. Will we be able to use the GPU to encode draw commands using a pipeline that leverages the new shader types?
Thanks,
When we use Metal API
drawIndexedPrimitives:MTLPrimitiveTypeTriangle
indexCount:indexCount
indexType:MTLIndexTypeUInt32
indexBuffer:indexs
indexBufferOffset:0]
to draw triangles, I have to create a MTLBuffer of uint32_t or uint16_t values. When I load the indices from a file, I save them in uint32_t* indices_, and I can verify that indices_ has the right values. I try to use
indexs = [_device newBufferWithBytes:indices_
length:SCR_WIDTH*SCR_HEIGHT*6*sizeof(uint32_t)
options:MTLResourceStorageModeShared];
or use
indexs = [_device newBufferWithLength:sizeof(uint32_t)*SCR_WIDTH*SCR_HEIGHT*6 options:MTLResourceStorageModeManaged];
memcpy(indexs.contents, indices_, sizeof(uint32_t) * SCR_WIDTH*SCR_HEIGHT*6);
#if TARGET_OS_OSX
[indexs didModifyRange:NSMakeRange(0, indexs.length)];
#endif
I cannot save the index values into the MTLBuffer. When I try to debug it, I find that the MTLBuffer appears to store a lot of float values such as 3.0e-45, and so on. I appreciate any comments.
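One thing worth checking: tiny float values like 3.0e-45 are what small integers look like when a debugger view reinterprets integer bits as float (the integer 2 viewed as a float is about 2.8e-45), so the copy may actually be fine and only the view mistyped. A minimal sketch (buffer contents are placeholders) that reads the bytes back typed as UInt32 to verify the round trip:

```swift
import Metal

// Upload uint32 indices and read them back typed as UInt32 to confirm the
// copy succeeded, instead of eyeballing a float-typed view in the debugger.
let device = MTLCreateSystemDefaultDevice()!
let indices: [UInt32] = [0, 1, 2, 2, 1, 3]
let indexBuffer = device.makeBuffer(bytes: indices,
                                    length: indices.count * MemoryLayout<UInt32>.stride,
                                    options: .storageModeShared)!
let typed = indexBuffer.contents().bindMemory(to: UInt32.self, capacity: indices.count)
let readBack = Array(UnsafeBufferPointer(start: typed, count: indices.count))
assert(readBack == indices, "index data should round-trip unchanged")
```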
Currently trying to figure out why our translated HLSL is failing to produce a .metallib. The error is:
LLVM ERROR: Error opening 'C:\Program Files\Metal Developer Tools\iOS\bin\..\lib\clang\31001.630\lib\darwin\libmetal_rt_osx.a': no such file or directory!
I'm invoking metal and metallib from the ios subfolder of the Windows tools, so my guess is that I also need to invoke metal with the --target argument to inform it that it is in fact an iOS target, but I have no idea where to find what the valid list of these is.
Apple documentation states about Tier 2 Argument Buffer hardware capability
The maximum per-app resources available at any given time are:
500,000 buffers or textures
What does it mean exactly? Does this number refer to the maximal count of attachment points (e.g. unique indices) across all bound argument buffers, the maximal count of only bound resources across the argument buffers (e.g. when using dynamic indexing and sparsely binding resources) or the number of resource objects that the application can create and manage at a given time?
Prompted by some discussions in the community I decided to run some tests and was surprised to discover that I could bind many millions buffer attachments to a single argument buffer in a Metal shader on my M1 Max laptop, way in excess of the quoted 500,000 limit. Is that just undefined behaviour that one should not rely on or does "500,000" refer to something else instead of the number of attachment points?
I hope someone from the Apple GPU team can clarify this. If this is not the correct venue for this question, please tell me where I can send my inquiry.
Looking at the new Metal 3 API diffs, I noticed that resources now expose a new gpuHandle/gpuResourceID property, and that MTLArgumentEncoder is marked as deprecated; there seems to be a family of new MTLBinding APIs that looks like a replacement for it. Does this mean that we are getting a new resource binding model? I was not able to find any details in the documentation, and Tuesday's Metal session did not mention these API changes at all. The APIs themselves seem to be in flux, as gpuHandle is already marked as deprecated even though it is still in beta :)
Will there be a WWDC session about these APIs or could you share some details here?
We simulate a Gaussian blur as a box blur with a complex shader (several passes over two axes) using Metal on the Mac.
I noticed a slight, fast pulsation of picture brightness once the radius exceeds 16 pixels in the fragment shader (we read 32+ pixels from the source texture to calculate 1 output pixel). We use float4 calculations.
This issue appears only on M1 Macs and only after sleep. The problem disappears after restarting macOS. And there is no problem on an Intel iMac with AMD graphics.
These artefacts appear in macOS Monterey 12.4 and also in macOS 13 Ventura Beta 1 on an M1 Mac mini. We don't have any problems with other, simpler shaders.
Hi - not sure this is strictly a metal issue, but I'm having problems with a memory leak.
When I create a buffer to use with the GPU, then bind its contents so that I can access the data, the created buffer seems to stay in memory even once the variables intensityPointer and intensityBuff are no longer in scope.
I tried using deallocate, but this caused an error too.
Is there a standard way of managing such memory, or accessing the buffer in a different way which will allow the memory to be released when no longer used?
Thank you,
Colin
let intensityBuff = myGPUData.device?.makeBuffer(length: MemoryLayout<Float>.stride * Int(myStars.nstars * myStars.npatch * myStars.npatch), options: .storageModeShared)
let intensityPointer = intensityBuff?.contents().bindMemory(to: Float.self, capacity: MemoryLayout<Float>.stride * Int(myStars.nstars * myStars.npatch * myStars.npatch))
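Not a definitive answer, but one common pattern: an MTLBuffer's memory is freed when the last strong reference to it goes away, and references returned through Objective-C bridging are often autoreleased, so a transient buffer can linger until the surrounding autorelease pool drains. A sketch with made-up sizes, wrapping the work in an explicit pool to make the release prompt:

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let count = 1_000_000   // placeholder element count

// The buffer reference never escapes the pool, so its memory can be
// reclaimed as soon as the pool drains. Never use the contents() pointer
// after the buffer itself has been released.
autoreleasepool {
    let buffer = device.makeBuffer(length: MemoryLayout<Float>.stride * count,
                                   options: .storageModeShared)!
    let ptr = buffer.contents().bindMemory(to: Float.self, capacity: count)
    ptr[0] = 1.0   // ... fill the buffer / read back results here ...
}
```

Calling deallocate on contents() is indeed an error: that memory is owned by the MTLBuffer, not by your code.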
macOS machines with Apple silicon can run iOS applications.
We have an iOS application that runs a fullscreen Metal game. The game can also run across all desktop platforms via Steam. In addition to Steam, we would like to make it available through the App Store on macOS. We'd like to utilise our iOS builds for this so that the Apple payment (micro-transactions) and sign-in processes can be reused.
While the app runs on macOS, it runs in a small iPad-shaped window that cannot be resized. We do not want to add iPad multitasking support (portrait orientation is not viable), but we would like the window on macOS to be expandable to full screen. Currently there is an option to make it full screen, but the Metal view (MTKView) delegate does not receive a drawableSizeWillChange event for this, meaning the new resolution of the window cannot be obtained.
Is there another method of retrieving a window-size-change event in this context? What is the recommended way of enabling window resizing on macOS but not iPad for a single iOS app?
What are the system requirements for MetalFX upscaling?
Will it only be available on systems that support Metal 3 (which I believe excludes the most current Apple TV)?