Render advanced 3D graphics and perform data-parallel computations on graphics processors using Metal.

Metal Documentation

Posts under Metal tag

313 Posts
Post not yet marked as solved
3 Replies
1.1k Views
I would like to know if the applications/games targeting the Metal 3 API will be fully compatible with the M1 Pro GPU. Thanks.
Posted Last updated
.
Post not yet marked as solved
3 Replies
412 Views
Hi All, I tried using the new Metal tools for compiling Metal shaders at build time (as explained in this session https://developer.apple.com/videos/play/wwdc2022/10102/), but I got some errors. In particular:
the command "metal shadersfilename.metal -N descriptors.mtlp-json -o archive.metallib" complains about the -N argument;
the command "metal-tt shaders.metallib descriptors.mtlp-json -o archive.metallib" doesn't seem to recognise the JSON format.
Also, the command to extract the JSON pipeline from a binary archive (metal-source) fails. I'm using Xcode 14.0 beta (14A5228q) on macOS Ventura beta. Are the new Metal tools for offline compilation already available (and I'm simply doing something wrong), or do I need to wait for the next betas? Thanks!
Posted Last updated
.
Post not yet marked as solved
0 Replies
89 Views
Are there any guarantees about the order in which completion handlers run? Do completion handlers added to the same command buffer run in the order they were added? Do completion handlers added to different command buffers on the same queue run in the order their command buffers were enqueued?
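A small experiment can at least make the observed behavior concrete. This is a sketch for probing ordering empirically, not a statement of guaranteed semantics; the buffer indices and print statements are illustrative only:

```swift
import Metal

// Sketch: commit several command buffers on one queue and observe the order
// in which their completed handlers fire. Handlers registered on a single
// buffer are invoked in registration order; cross-buffer handler order
// follows GPU completion order, which this experiment lets you observe.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

for i in 0..<3 {
    let buffer = queue.makeCommandBuffer()!
    buffer.addCompletedHandler { _ in
        print("buffer \(i) completed")
    }
    buffer.commit()
}
```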
Posted Last updated
.
Post not yet marked as solved
2 Replies
592 Views
Dear All, Mac users waited years for GPU acceleration, and the frameworks were finally released this year. After some testing, I found that NLP tasks weren't supported in the previous versions and filed a report as FB9220496. I was then informed that this issue would be fixed in the next versions, so I updated my frameworks immediately when the latest versions were released. However, I am very disappointed and frustrated because the latest frameworks cannot work at all: the IDE kernel always crashes when I try to train any model. I have attached a copy of my env in a .yaml file, so you should be able to reproduce this issue very easily. Here is the link to the .yaml file: https://www.icloud.com/iclouddrive/0Yhr444NAqu6oD8qxOfKriY-A#env%5Fmac%5Ftf Besides that, please also note that I only updated tensorflow-macos and tensorflow-metal, and this env was fine (except for NLP tasks) before I updated the frameworks. Could you please kindly look into this issue and solve it, thank you very much. BTW, the IDE I use is Spyder 5.1.5, though the IDE should not be the reason for the crashes. I appreciate your time and look forward to your prompt reply. Sincerely, Gavin
Posted
by hawkiyc.
Last updated
.
Post not yet marked as solved
0 Replies
124 Views
We use several Core ML models in our Swift application. The memory footprint of these Core ML models varies from 15 kB to 3.5 MB according to the Xcode Core ML utility tool. We observe a huge difference in loading time depending on the type of compute units selected to run the model. Here is a small sample of the code used to load the model:

let configuration = MLModelConfiguration()
// Here I use the .all compute units mode:
configuration.computeUnits = .all
let myModel = try! myCoremlModel(configuration: configuration).model

Here are the profiling results of this sample code for different model sizes as a function of the targeted compute units:

Model-3.5-MB:
computeUnits = .cpuAndGPU: 188 ms ⇒ 18 MB/s
computeUnits = .all or .cpuAndNeuralEngine on iOS 16: 4000 ms ⇒ 875 kB/s
Model-2.6-MB:
computeUnits = .cpuAndGPU: 144 ms ⇒ 18 MB/s
computeUnits = .all or .cpuAndNeuralEngine on iOS 16: 1300 ms ⇒ 2 MB/s
Model-15-kB:
computeUnits = .cpuAndGPU: 18 ms ⇒ 833 kB/s
computeUnits = .all or .cpuAndNeuralEngine on iOS 16: 700 ms ⇒ 22 kB/s

What explains the difference in loading time as a function of the computeUnits mode? Is there a way to reduce the loading time of the models when using the .all or .cpuAndNeuralEngine computeUnits mode?
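One mitigation worth trying, sketched below, is to load the model asynchronously at app launch so the one-time Neural Engine model compilation happens off the critical path; `MyCoremlModel` stands in for the generated model class from the post, and the cache-hit behavior on subsequent loads is an assumption to verify:

```swift
import CoreML

// Sketch: asynchronous model loading (generated model classes expose
// load(configuration:completionHandler:) on iOS 14+/macOS 11+).
let configuration = MLModelConfiguration()
configuration.computeUnits = .all

MyCoremlModel.load(configuration: configuration) { result in
    switch result {
    case .success(let model):
        // Cache `model` for later predictions; later loads of the same
        // model version may hit the on-device compilation cache.
        print("model ready: \(model)")
    case .failure(let error):
        print("load failed: \(error)")
    }
}
```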
Posted
by dbphr.
Last updated
.
Post not yet marked as solved
0 Replies
125 Views
I'm trying to use the SIMD group reduction/prefix functions in a series of reasonably complex compute kernels in a Mac app. I need to allocate some threadgroup memory for coordinating between SIMD groups in the same threadgroup. This array should therefore have a capacity depending on [[simdgroups_per_threadgroup]], but that's not a compile-time value, so it can't be used as an array dimension. Now, according to various WWDC session videos (e.g. WWDC2022 "Scale compute workloads across Apple GPUs"), threadExecutionWidth on the pipeline object should return the SIMD group size, with which I could then allocate an appropriate amount of memory using setThreadgroupMemoryLength:atIndex: on the compute encoder. This works consistently on some hardware (e.g. Apple M1, where threadExecutionWidth always seems to report 32), but I'm hitting configurations where threadExecutionWidth does not match the apparent SIMD group size, causing runtime errors due to out-of-bounds access. (E.g. on Intel UHD Graphics 630, threadExecutionWidth = 16 for some complex kernels, although the SIMD group size seems to be 32.) Will the SIMD group size always be the same for all kernels on a device, so should I trust threadExecutionWidth only for the most trivial of kernels? Or should I submit a trivial kernel to the GPU which returns [[threads_per_simdgroup]]? I suspect the problem might occur in kernels where Metal reports an "odd" (non-pow2) maximum threadgroup size due to exhaustion of some resource (registers?), although in the case I'm encountering, the maximum threadgroup size is reported as 896, which is an integer multiple of 32, so it's not as if it's using the greatest common denominator between max threadgroup size and SIMD group size for threadExecutionWidth.
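The "trivial probe kernel" idea from the post above can be sketched like this; the kernel and buffer names are illustrative, and it assumes [[threads_per_simdgroup]] reported by a trivial kernel matches the width used by the complex kernels on the same device:

```metal
#include <metal_stdlib>
using namespace metal;

// Probe kernel: one thread writes the device's actual SIMD-group width.
// The host can run this once and use the result to size threadgroup memory
// via setThreadgroupMemoryLength:atIndex:, instead of trusting
// threadExecutionWidth on a complex pipeline.
kernel void probe_simd_width(device uint *out [[buffer(0)]],
                             uint width [[threads_per_simdgroup]],
                             uint tid   [[thread_position_in_grid]])
{
    if (tid == 0) {
        out[0] = width;
    }
}
```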
Posted
by pmdj.
Last updated
.
Post not yet marked as solved
0 Replies
131 Views
I'm having trouble with memory leaks. I want to get a shader's reflection data from an MTLFunction generated from precompiled shader files. These files are compiled with Metal Developer Tools for Windows 2.3, using the following command options:

-x metal hoge.metal -std=ios-metal2.0 -o hoges.msl

To get the reflection data, I use the function newArgumentEncoderWithBufferIndex:reflection:. I can get the reflection data, but some memory leaks occur on each call to this function, and I don't know how to release the memory of the reflection data. This is an example:

void Func(id<MTLFunction> _pShaderFunc)
{
    MTLAutoreleasedArgument reflection = nil;
    id<MTLArgumentEncoder> argEnc = [_pShaderFunc newArgumentEncoderWithBufferIndex:0
                                                                         reflection:&reflection];
    // some process...
    // [reflection release]; // run-time error.
    [argEnc release];
}

I tested this on an iPhone 11 Pro Max with iOS 13.3.1, Xcode 13.1, and a Mac mini (2018) with macOS Monterey 12.4. I compiled the source files with the -fno-objc-arc option. Using the Leaks instrument, I confirmed that when this function isn't called there are no memory leaks. Could you give me some advice on how to resolve these leaks?
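In case it helps: the reflection object is returned autoreleased (hence the MTLAutoreleasedArgument typedef), so the usual remedy is draining an autorelease pool around the call rather than releasing it directly; in non-ARC Objective-C that means wrapping the call in @autoreleasepool { ... }. A hedged Swift equivalent, as a sketch:

```swift
import Metal

// Sketch: retrieving argument reflection inside an explicit autorelease
// pool so the autoreleased MTLArgument is reclaimed promptly.
func dumpReflection(for function: MTLFunction) {
    autoreleasepool {
        var reflection: MTLArgument?
        let encoder = function.makeArgumentEncoder(bufferIndex: 0,
                                                   reflection: &reflection)
        if let argument = reflection {
            print("argument type: \(argument.type)")
        }
        _ = encoder
    }
}
```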
Posted Last updated
.
Post not yet marked as solved
0 Replies
97 Views
According to the Metal Best Practices guide, on Mac: "If the texture is initialized once by the CPU and accessed frequently by the GPU, initialize a source texture with a Managed mode and then blit its data into a destination texture with a Private mode. This is a common case for static textures, such as diffuse maps." This would seem to be the best practice for things like material textures in a game that are loaded once and then exclusively used by the GPU. However, the guide isn't specific about how Apple Silicon should be treated. It does say: "Some macOS devices feature integrated GPUs. In these devices, the driver optimizes the underlying architecture to support a discrete memory model. macOS Metal apps should always target a discrete memory model." The iOS guide still mentions private textures, but does not go into detail and has less prescriptive language. I'm basically unsure whether Apple Silicon Macs should follow the iOS performance guide or the macOS performance guide. The guide was also last updated in 2017, implying that it may not have been updated for Apple Silicon. I'm not quite sure what the best path is for static texture resources in games. Apple Silicon has a single address space, which I assume would reduce the penalty for using a shared resource. However, private resources can be optimized for GPU use during a blit; it's been hinted that things like textures might be compressed in ways they typically couldn't be if CPU access needed to be maintained. It's very possible that the guide hasn't been updated because the guidance is unchanged, but I wanted to check in, since I'd assume iOS and macOS on Apple Silicon should be similar.
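For reference, the managed-to-private pattern the guide describes looks roughly like the following Swift sketch. This is illustrative, not authoritative: the function name is made up, and on Apple Silicon .managed is backed by unified memory, so the blit's benefit there is exactly the open question of the post.

```swift
import Metal

// Sketch: upload once into a managed staging texture, then blit into a
// private texture for GPU-only use (the macOS best-practices pattern).
func makePrivateTexture(device: MTLDevice, queue: MTLCommandQueue,
                        descriptor: MTLTextureDescriptor,
                        pixels: UnsafeRawPointer, bytesPerRow: Int) -> MTLTexture {
    descriptor.storageMode = .managed
    let staging = device.makeTexture(descriptor: descriptor)!
    staging.replace(region: MTLRegionMake2D(0, 0, descriptor.width, descriptor.height),
                    mipmapLevel: 0, withBytes: pixels, bytesPerRow: bytesPerRow)

    descriptor.storageMode = .private
    let privateTexture = device.makeTexture(descriptor: descriptor)!

    let commandBuffer = queue.makeCommandBuffer()!
    let blit = commandBuffer.makeBlitCommandEncoder()!
    blit.copy(from: staging, to: privateTexture)   // whole-texture copy
    blit.endEncoding()
    commandBuffer.commit()
    return privateTexture
}
```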
Posted Last updated
.
Post not yet marked as solved
0 Replies
82 Views
In the "Discover advances in Metal for A15 Bionic" Tech Talk, right around the 20:00 mark, the presenter (Katelyn Hinson) says: The output image is split into a set of SIMD groups, where each SIMD group is a 4-by-8 chunk, [with] each thread writing to a single output. Supposing we know the simdgroup will contain 32 threads (which they mention in the talk is true for Apple Silicon), is the only way to ensure that the threads in each simdgroup are arranged into a 4 x 8 chunk to perform a dispatch with threadgroups whose width divides the number of threads per simdgroup? I can't think of another way to control the shape of a simdgroup directly within threadgroups, since there is no explicit API to do so. For example, if we perform a dispatchThreadgroups(_:threadsPerThreadgroup:) with a threadgroup size of 8 x 8 to attempt to recreate the visuals in the presentation, wouldn't the resulting simdgroup shape be an 8 x 4 region and not a 4 x 8 region? The assumptions made in the video about where to sample the source texture and which shuffle functions to use are heavily influenced by the shape of the simdgroup. I'm trying to implement a similar reduction, and I'm currently figuring out how to shape each simdgroup when we don't know whether the simdgroup has 32 threads (I believe it's possible simdgroups have 64 threads?). What would be a reliable way to control the structure of the simdgroups? I believe that if we always ensure the width of the threadgroup divides the number of threads in the simdgroup, we should get the behavior we want, but I'm looking to confirm this logic. IIRC, simdgroups will always have a multiple of 8 threads (or maybe it was only 4?), so perhaps a width of 8 (or 4) would always suffice for the threadgroup, and you could specify a height of computePipelineState.maxTotalThreadsPerThreadgroup / 4, for example. Finally, must we only use uniform threadgroups (viz. we couldn't use dispatchThreads(_:threadsPerThreadgroup:)) for reliable results? I'm thinking that non-uniform threadgroups would again violate our assumptions about the simdgroup shape.
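To make the width idea concrete, here is a sketch of a host-side dispatch that should yield 4-wide simdgroup tiles. It rests on the assumption discussed above (threads fill simdgroups in linear, x-fastest order within a threadgroup), which is observed behavior rather than a documented guarantee:

```swift
import Metal

// Sketch: with a threadgroup width of 4 and a 32-thread simdgroup, each
// simdgroup should cover a 4-wide by 8-tall tile of the output image.
func dispatchTiled(encoder: MTLComputeCommandEncoder,
                   pipeline: MTLComputePipelineState,
                   gridWidth: Int, gridHeight: Int) {
    let tgWidth = 4   // divides the assumed simdgroup size of 32
    let tgHeight = pipeline.maxTotalThreadsPerThreadgroup / tgWidth
    let threadsPerThreadgroup = MTLSizeMake(tgWidth, tgHeight, 1)
    // Uniform dispatch only: a partial (non-uniform) threadgroup at the
    // grid edge could break the assumed simdgroup shape.
    let threadgroups = MTLSizeMake((gridWidth  + tgWidth  - 1) / tgWidth,
                                   (gridHeight + tgHeight - 1) / tgHeight, 1)
    encoder.setComputePipelineState(pipeline)
    encoder.dispatchThreadgroups(threadgroups,
                                 threadsPerThreadgroup: threadsPerThreadgroup)
}
```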
Posted Last updated
.
Post not yet marked as solved
0 Replies
90 Views
I'm hoping the answer here is that the fp16 values get written out to the parameter buffer to save space on TBDR, but then the gpu promotes them back to fp32 for interpolation, and then back to fp16 for the receiving fragment shader. This would then work around banding if the output and interpolation was done in fp16 math like on Android. There is no documentation that I've found on this, or even on the PowerVR documentation about their gpu.
Posted
by Alecazam.
Last updated
.
Post not yet marked as solved
1 Reply
105 Views
I would like to write a ReductionSum Metal shader like this: https://github.com/alibaba/MNN/blob/master/source/backend/metal/MetalReduction.metal#L32 Sometimes the reduced dimension is large while the other dimensions are small, which means only a few threads can be launched, making the kernel inefficient. Is there any way to optimize it?
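One common approach for this shape of problem is to assign a whole threadgroup to each output element rather than a single thread, combining partials with simd_sum and threadgroup memory. The following is a sketch with assumed buffer bindings, an assumed 32-wide simdgroup, and at most 32 simdgroups per threadgroup, not the MNN kernel itself:

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch: cooperative sum over a long axis. Each thread accumulates a
// strided slice, simd_sum combines within a simdgroup, and threadgroup
// memory combines the simdgroup partials into one output per row.
kernel void reduce_sum_long_axis(device const float *src     [[buffer(0)]],
                                 device float       *dst     [[buffer(1)]],
                                 constant uint      &axisLen [[buffer(2)]],
                                 uint lane      [[thread_position_in_threadgroup]],
                                 uint lanes     [[threads_per_threadgroup]],
                                 uint row       [[threadgroup_position_in_grid]],
                                 uint simdLane  [[thread_index_in_simdgroup]],
                                 uint simdGroup [[simdgroup_index_in_threadgroup]])
{
    threadgroup float partials[32];   // assumes <= 32 simdgroups per group

    float acc = 0.0f;
    for (uint i = lane; i < axisLen; i += lanes)
        acc += src[row * axisLen + i];

    acc = simd_sum(acc);
    if (simdLane == 0) partials[simdGroup] = acc;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    if (lane == 0) {
        float total = 0.0f;
        uint groups = (lanes + 31) / 32;   // assumes 32-wide simdgroups
        for (uint g = 0; g < groups; ++g) total += partials[g];
        dst[row] = total;
    }
}
```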
Posted
by PYNing.
Last updated
.
Post marked as solved
1 Reply
102 Views
I have a compute kernel that makes use of simdgroup operations such as simd_shuffle_up, simd_or, etc, and I'm looking to rewrite the kernel to support older hardware. One such computation requires that I know the index of the thread in the simdgroup (thread_index_in_simdgroup). I was hoping to derive it from the thread's position in its threadgroup (thread_position_in_threadgroup) and the thread execution width (thread_execution_width), along with other knowledge about the size of the threadgroup when I noticed there was also the threads_per_simdgroup attribute. The spec describes both respectively as thread_execution_width: The execution width of the compute unit. threads_per_simdgroup: The thread execution width of a SIMD-group. Under what conditions, if any, could these two values differ? If they do differ, is there a way to determine a thread's position in the simdgroup on hardware that doesn't support Metal 2.2?
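For older hardware, the usual fallback is deriving the lane index from the flat thread index. This sketch assumes the conditions the post is asking about actually hold on the target device (the execution width equals the simdgroup size for this kernel, and threads are packed into simdgroups in flat-index order):

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch: derive a simdgroup lane index without the Metal 2.2
// [[thread_index_in_simdgroup]] attribute.
kernel void derived_lane(device uint *out [[buffer(0)]],
                         uint flat  [[thread_index_in_threadgroup]],
                         uint width [[thread_execution_width]],
                         uint gid   [[thread_position_in_grid]])
{
    uint lane      = flat % width;   // assumed position within the simdgroup
    uint simdGroup = flat / width;   // assumed simdgroup index
    out[gid] = lane + simdGroup * width;
}
```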
Posted Last updated
.
Post not yet marked as solved
0 Replies
102 Views
When loading a file (which can contain lighting, animation, and materials), we get an MTLBuffer, i.e. a typeless memory blob. How can we get an MTKMesh with submeshes from that buffer? Thanks in advance!
Posted
by ensoreus.
Last updated
.
Post not yet marked as solved
0 Replies
117 Views
I was familiarising myself with Metal mesh shaders and ran into some issues. First, a trivial application that uses mesh shaders to generate simple rectangular geometry hangs the GPU when dispatching 2D grids of mesh shader threadgroups, and it's really weird, as it is sensitive to the grid shape. E.g.

// these work!
meshGridProperties.set_threadgroups_per_grid(uint3(512, 1, 1));
meshGridProperties.set_threadgroups_per_grid(uint3(16, 8, 1));
meshGridProperties.set_threadgroups_per_grid(uint3(32, 5, 1));

// these (and anything "bigger") hang!
meshGridProperties.set_threadgroups_per_grid(uint3(16, 9, 1));
meshGridProperties.set_threadgroups_per_grid(uint3(32, 6, 1));

The sample shader code is attached. The invocation is trivial enough:

re.drawMeshThreadgroups(
    MTLSizeMake(1, 1, 1),
    threadsPerObjectThreadgroup: MTLSizeMake(1, 1, 1),
    threadsPerMeshThreadgroup: MTLSizeMake(1, 1, 1)
)

For Apple engineers: a bug has been submitted under FB10367407. Mesh shader code: 2d_grid_mesh_shader_hangs.metal

I also have a more complex application where mesh shaders are used to generate sphere geometry: each mesh shader threadgroup generates a single slice of the sphere. Here the problem is similar: once there are more than X slices to render, some of the dispatched mesh threadgroups don't seem to do anything (see screenshot below). But the funny thing is that the geometry is produced, as it will occasionally flicker in and out of existence, and if I manually block some threadgroups from running (e.g. with something like if(threadgroup_index > 90) return; in the mesh shader), the "hidden" geometry works! It almost looks like different mesh shader threadgroups reuse the same memory allocation for storing the output mesh data, and the output of some threadgroups is overwritten. I have not submitted this as a bug, since the code is more complex and messy, but can do so if someone from the Apple team wants to have a look.
Posted
by jcookie.
Last updated
.
Post not yet marked as solved
1 Reply
183 Views
In our use case, there is a Background Mac App (running on Mac M1) that is responsible for receiving data from a companion iOS App via WebSocket connection (client-side Apple Swift API, Vapor4 server side API) and perform computations using Metal Compute APIs and our custom kernels. In order to optimize execution time of these compute kernels we are looking for a way to profile their execution time i.e. how much combined GPU execution time (compute and memory accesses) is taken by each instance? As may be obvious, our primary focus is not the waiting time spent in the kernel scheduling queues before execution begins, but this may be helpful as an extra. We are not sure whether Instruments in XCode will be helpful in above scenario (partly in iOS, partly 3rd party WebSocket API, and partly background Mac App (command line App))? Also, is Metal frame capturing method dependent on presence of Metal graphics APIs and hence will not work for Background Apps? Can we get desired info using GPU Counter Sample Buffers, or are we looking at the wrong places? Any assistance wrt above (measurement of Metal compute kernel execution times in the context of a Mac Background App) will be highly appreciated.
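One option that works without frame capture, and so fits a background/command-line app, is reading MTLCommandBuffer's gpuStartTime/gpuEndTime in a completion handler. A sketch; note the granularity is per command buffer, not per kernel, so isolating one kernel per buffer is assumed here:

```swift
import Metal

// Sketch: measure GPU execution time of an encoded compute workload via
// the command buffer's GPU timestamps (excludes queue waiting time).
func timedDispatch(queue: MTLCommandQueue,
                   encode: (MTLComputeCommandEncoder) -> Void) {
    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encode(encoder)   // caller sets pipeline state and dispatches here
    encoder.endEncoding()
    commandBuffer.addCompletedHandler { buffer in
        let seconds = buffer.gpuEndTime - buffer.gpuStartTime
        print("GPU time: \(seconds * 1000) ms")
    }
    commandBuffer.commit()
}
```

For finer-grained, per-encoder or per-pass timing, MTLCounterSampleBuffer is the API aimed at exactly this, so the GPU Counter Sample Buffers direction mentioned in the post is not a wrong place to look.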
Posted
by parth7.
Last updated
.
Post not yet marked as solved
6 Replies
967 Views
After installing the tensorflow-metal PluggableDevice according to Getting Started with tensorflow-metal PluggableDevice, I tested this DCGAN example: https://www.tensorflow.org/tutorials/generative/dcgan. Everything was working perfectly until I decided to upgrade macOS from 12.0.1 to 12.1. Before the upgrade, the final result after 50 epochs was like picture1 below; after the upgrade it is like picture2 below. I am using: TensorFlow 2.7.0, tensorflow-metal 0.3.0, Python 3.9. I hope this question will also help Apple to improve the Metal PluggableDevice. I can't wait to use it in my research.
Posted Last updated
.
Post not yet marked as solved
0 Replies
130 Views
As mentioned in the title, whenever an iMac19,2 owner running 10.15.7 runs my app, which features a scene with grain noise and rendering against a transparent background, it crashes like so:

Crashed: CVDisplayLink
0  libobjc.A.dylib          0x681d    objc_msgSend + 29
1  SceneKit                 0x250780  SCNMTLComputeCommandEncoder::dispatchOnTexture2DWithoutOptimizedThreadGroupPerGrid(id<MTLTexture>, id<MTLComputePipelineState>) + 104
2  SceneKit                 0x15abf6  C3D::getGrainNoise256(id<MTLCommandBuffer>, SCNMTLRenderContext*, C3D::RenderGraphResourceManager&) + 403
3  SceneKit                 0xc6a9e   C3D::CompositePass::compile() + 1410
4  SceneKit                 0x391a46  C3D::RenderGraph::allocateResources() + 2198
5  SceneKit                 0x14dc2d  C3DEngineContextRenderWithRenderGraph + 52
6  SceneKit                 0x22666b  -[SCNRenderer _renderSceneWithEngineContext:sceneTime:] + 532
7  SceneKit                 0x227222  -[SCNRenderer _drawSceneWithNewRenderer:] + 281
8  SceneKit                 0x227786  -[SCNRenderer _drawScene:] + 46
9  SceneKit                 0x227c8b  -[SCNRenderer _drawAtTime:] + 965
10 SceneKit                 0x214d29  -[SCNView _drawAtTime:WithContext:] + 542
11 SceneKit                 0x214653  -[SCNView SCN_displayLinkCallback:] + 306
12 SceneKit                 0x1af4f8  __69-[NSObject(SCN_DisplayLinkExtensions) SCN_setupDisplayLinkWithQueue:]_block_invoke + 49
13 SceneKit                 0x2a1468  __36-[SCNDisplayLink _callbackWithTime:]_block_invoke.13 + 52
14 libdispatch.dylib        0x2658    _dispatch_client_callout + 8
15 libdispatch.dylib        0xe6ec    _dispatch_lane_barrier_sync_invoke_and_complete + 60
16 SceneKit                 0x2a13c5  -[SCNDisplayLink _callbackWithTime:] + 307
17 SceneKit                 0x2a10c6  _cvDisplayLinkCallback + 261
18 CoreVideo                0x2e92    CVDisplayLink::performIO(CVTimeStamp*) + 230
19 CoreVideo                0x22c8    CVDisplayLink::runIOThread() + 626
20 libsystem_pthread.dylib  0x6109    _pthread_start + 148
21 libsystem_pthread.dylib  0x1b8b    thread_start + 15
Posted
by kode54.
Last updated
.