Render advanced 3D graphics and perform data-parallel computations on graphics processors with Metal.

Metal Documentation

Posts under Metal tag

313 Posts
Post not yet marked as solved
0 Replies
131 Views
I'm having trouble with memory leaks. I want to get a shader's reflection data from an MTLFunction generated from precompiled shader files. The files are compiled with Metal Developer Tools for Windows 2.3 using the following command options:
-x metal hoge.metal -std=ios-metal2.0 -o hoges.msl
To get the reflection data I use newArgumentEncoderWithBufferIndex:reflection:. I can get the reflection data, but some memory leaks occur each time this function is called, and I don't know how to release the memory of the reflection data. Here is an example:
void Func(id<MTLFunction> _pShaderFunc)
{
    MTLAutoreleasedArgument reflection = nil;
    id<MTLArgumentEncoder> argEnc = [_pShaderFunc newArgumentEncoderWithBufferIndex:0 reflection:&reflection];
    // some process...
    // [reflection release]; // run-time error.
    [argEnc release];
}
I tested this on an iPhone 11 Pro Max with iOS 13.3.1, Xcode 13.1, and a Mac mini (2018) with macOS Monterey 12.4. I compiled the source files with the -fno-objc-arc option. Using the Leaks instrument, I confirmed that there are no leaks when this function isn't called. Could you give me some advice on how to resolve these leaks?
Posted by
Post not yet marked as solved
0 Replies
97 Views
According to the Metal Best Practices guide, on Mac: "If the texture is initialized once by the CPU and accessed frequently by the GPU, initialize a source texture with a Managed mode and then blit its data into a destination texture with a Private mode. This is a common case for static textures, such as diffuse maps." This would seem to be the best practice for things like material textures in a game that are loaded once and then used exclusively by the GPU. However, the guide isn't specific about how Apple Silicon should be treated. It does say: "Some macOS devices feature integrated GPUs. In these devices, the driver optimizes the underlying architecture to support a discrete memory model. macOS Metal apps should always target a discrete memory model." The iOS guide still mentions private textures, but does not go into detail and has less prescriptive language. I'm basically unsure whether Apple Silicon Macs should follow the iOS performance guide or the macOS performance guide. The guide was also last updated in 2017, implying that it may not have been updated for Apple Silicon. I'm not quite sure what the best path is for static texture resources in games. Apple Silicon has a single address space, which I assume would reduce the penalty for a shared resource. However, private resources can be optimized for GPU use during a blit; it's been hinted that things like textures might be compressed in ways they typically couldn't be if CPU access needed to be maintained. It's very possible that the guide hasn't been updated because the guidance is unchanged, but I wanted to check in, since I'd assume iOS and macOS on Apple Silicon should be similar.
Posted by
Post not yet marked as solved
0 Replies
82 Views
In the "Discover advances in Metal for A15 Bionic" Tech Talk, right around the 20:00 mark, the presenter (Katelyn Hinson) says: "The output image is split into a set of SIMD groups, where each SIMD group is a 4-by-8 chunk, [with] each thread writing to a single output." Supposing that we know the simdgroup contains 32 threads (which, as they mention in the talk, is true for Apple Silicon), is the only way to ensure that the threads in each simdgroup are arranged in a 4 x 8 chunk to dispatch threadgroups whose width divides the number of threads per simdgroup? I can't think of another way to control the shape of a simdgroup within a threadgroup, since there is no explicit API to do so. For example, if we perform a dispatchThreadgroups(_:threadsPerThreadgroup:) with a threadgroup size of 8 x 8 to attempt to recreate the visuals in the presentation, wouldn't the resulting simdgroup shape be an 8 x 4 region and not a 4 x 8 region? The assumptions made in the video about where to sample the source texture and which shuffle functions to use are heavily influenced by the shape of the simdgroup. I'm trying to implement a similar reduction and am currently figuring out how to shape each simdgroup. And if we don't know that the simdgroup has 32 threads (I believe simdgroups can have 64 threads?), what would be a reliable way to control the structure of the simdgroups? I believe that if we always ensure the width of the threadgroup divides the number of threads in the simdgroup, we should get the behavior we want, but I'm looking to confirm this logic. IIRC, simdgroups always have a multiple of 8 threads (or maybe it was only 4?), so perhaps a width of 8 (or 4) would always suffice for the threadgroup, and you could specify a height of computePipelineState.maxTotalThreadsPerThreadgroup / 4, for example. Finally, must we only use uniform threadgroups (viz. we couldn't use dispatchThreads(_:threadsPerThreadgroup:)) for reliable results? I'm thinking that non-uniform threadgroups would again violate our assumptions about the simdgroup shape.
Posted by
Post not yet marked as solved
0 Replies
91 Views
I'm hoping the answer here is that the fp16 values get written out to the parameter buffer to save space on TBDR, but that the GPU then promotes them back to fp32 for interpolation, and back to fp16 for the receiving fragment shader. This would work around the banding that occurs when the output and interpolation are done in fp16 math, like on Android. I haven't found any documentation on this, even in the PowerVR documentation about their GPU.
Posted by
Post not yet marked as solved
1 Replies
106 Views
I would like to write a ReductionSum Metal shader like this one: https://github.com/alibaba/MNN/blob/master/source/backend/metal/MetalReduction.metal#L32 Sometimes the reduced dimension is large while the other dimensions are small, so only a few threads can be launched, which is inefficient. Is there any way to optimize it?
Posted by
Post marked as solved
1 Replies
102 Views
I have a compute kernel that makes use of simdgroup operations such as simd_shuffle_up, simd_or, etc., and I'm looking to rewrite the kernel to support older hardware. One such computation requires that I know the index of the thread in the simdgroup (thread_index_in_simdgroup). I was hoping to derive it from the thread's position in its threadgroup (thread_position_in_threadgroup) and the thread execution width (thread_execution_width), along with other knowledge about the size of the threadgroup, when I noticed there was also the threads_per_simdgroup attribute. The spec describes them, respectively, as: thread_execution_width: The execution width of the compute unit. threads_per_simdgroup: The thread execution width of a SIMD-group. Under what conditions, if any, could these two values differ? If they do differ, is there a way to determine a thread's position in the simdgroup on hardware that doesn't support Metal 2.2?
Posted by
Post not yet marked as solved
0 Replies
102 Views
When loading a file (which can contain lighting, animation, and materials), we get an MTLBuffer, i.e. a typeless memory blob. How can we get an MDLMesh with submeshes from that buffer? Thanks in advance!
Posted by
Post not yet marked as solved
3 Replies
416 Views
Hi All, I tried using the new metal tools for compiling Metal shaders at build time (as explained in this session: https://developer.apple.com/videos/play/wwdc2022/10102/), but I got some errors. In particular:
the command "metal shadersfilename.metal -N descriptors.mtlp-json -o archive.metallib" complains about the -N argument;
the command "metal-tt shaders.metallib descriptors.mtlp-json -o archive.metallib" doesn't seem to recognise the JSON format;
the command to extract the JSON pipeline script from a binary archive (metal-source) also fails.
I'm using Xcode 14.0 beta (14A5228q) on macOS Ventura beta. Are the new Metal tools for offline compilation already available (and I'm simply doing something wrong), or do I need to wait for the next betas? Thanks!
Posted by
Post not yet marked as solved
0 Replies
118 Views
I was familiarising myself with Metal mesh shaders and ran into some issues. First, a trivial application that uses mesh shaders to generate simple rectangular geometry hangs the GPU when dispatching 2D grids of mesh shader threadgroups, and, weirdly, it is sensitive to the grid shape. E.g.
// these work!
meshGridProperties.set_threadgroups_per_grid(uint3(512, 1, 1));
meshGridProperties.set_threadgroups_per_grid(uint3(16, 8, 1));
meshGridProperties.set_threadgroups_per_grid(uint3(32, 5, 1));
// these (and anything "bigger") hang!
meshGridProperties.set_threadgroups_per_grid(uint3(16, 9, 1));
meshGridProperties.set_threadgroups_per_grid(uint3(32, 6, 1));
The sample shader code is attached. The invocation is trivial enough:
re.drawMeshThreadgroups(
    MTLSizeMake(1, 1, 1),
    threadsPerObjectThreadgroup: MTLSizeMake(1, 1, 1),
    threadsPerMeshThreadgroup: MTLSizeMake(1, 1, 1)
)
For Apple engineers: a bug has been submitted as FB10367407. Mesh shader code: 2d_grid_mesh_shader_hangs.metal
I also have a more complex application where mesh shaders are used to generate sphere geometry: each mesh shader threadgroup generates a single slice of the sphere. Here the problem is similar: once there are more than X slices to render, some of the dispatched mesh threadgroups don't seem to do anything (see screenshot below). But the funny thing is that the geometry is produced, as it will occasionally flicker in and out of existence, and if I manually block some threadgroups from running (e.g. by using something like if(threadgroup_index > 90) return; in the mesh shader), the "hidden" geometry works! It almost looks like different mesh shader threadgroups reuse the same memory allocation for storing the output mesh data, and the output of some threadgroups gets overwritten. I have not submitted this as a bug, since the code is more complex and messy, but I can do so if someone from the Apple team wants to have a look.
Posted by
Post not yet marked as solved
0 Replies
130 Views
As mentioned in the title, whenever an iMac19,2 owner running 10.15.7 runs my app, which features a scene with grain noise and rendering against a transparent background, it crashes like so:
Crashed: CVDisplayLink
0  libobjc.A.dylib 0x681d objc_msgSend + 29
1  SceneKit 0x250780 SCNMTLComputeCommandEncoder::dispatchOnTexture2DWithoutOptimizedThreadGroupPerGrid(id<MTLTexture>, id<MTLComputePipelineState>) + 104
2  SceneKit 0x15abf6 C3D::getGrainNoise256(id<MTLCommandBuffer>, SCNMTLRenderContext*, C3D::RenderGraphResourceManager&) + 403
3  SceneKit 0xc6a9e C3D::CompositePass::compile() + 1410
4  SceneKit 0x391a46 C3D::RenderGraph::allocateResources() + 2198
5  SceneKit 0x14dc2d C3DEngineContextRenderWithRenderGraph + 52
6  SceneKit 0x22666b -[SCNRenderer _renderSceneWithEngineContext:sceneTime:] + 532
7  SceneKit 0x227222 -[SCNRenderer _drawSceneWithNewRenderer:] + 281
8  SceneKit 0x227786 -[SCNRenderer _drawScene:] + 46
9  SceneKit 0x227c8b -[SCNRenderer _drawAtTime:] + 965
10 SceneKit 0x214d29 -[SCNView _drawAtTime:WithContext:] + 542
11 SceneKit 0x214653 -[SCNView SCN_displayLinkCallback:] + 306
12 SceneKit 0x1af4f8 __69-[NSObject(SCN_DisplayLinkExtensions) SCN_setupDisplayLinkWithQueue:]_block_invoke + 49
13 SceneKit 0x2a1468 __36-[SCNDisplayLink _callbackWithTime:]_block_invoke.13 + 52
14 libdispatch.dylib 0x2658 _dispatch_client_callout + 8
15 libdispatch.dylib 0xe6ec _dispatch_lane_barrier_sync_invoke_and_complete + 60
16 SceneKit 0x2a13c5 -[SCNDisplayLink _callbackWithTime:] + 307
17 SceneKit 0x2a10c6 _cvDisplayLinkCallback + 261
18 CoreVideo 0x2e92 CVDisplayLink::performIO(CVTimeStamp*) + 230
19 CoreVideo 0x22c8 CVDisplayLink::runIOThread() + 626
20 libsystem_pthread.dylib 0x6109 _pthread_start + 148
21 libsystem_pthread.dylib 0x1b8b thread_start + 15
Posted by
Post not yet marked as solved
0 Replies
85 Views
I've implemented GPU command encoding as described in the second part of the Modern Rendering with Metal WWDC19 talk. The implementation applies frustum culling to each individual mesh instance and creates a draw_indexed_primitives command, in the same way as outlined in the talk. Each command has an instance count of 1. My previous CPU command encoding implementation would group the visible mesh instances by mesh and pipeline state (after frustum culling) and encode the appropriate multi-instance draw call. With GPU command encoding running in parallel, I don't see a way to group meshes this way. Is there any significant performance impact for issuing multiple draw calls for individual instances of the same mesh, as opposed to using instanced rendering? This might very well be something that's not worth worrying about, but it would be good to have some input on this.
Posted by
Post not yet marked as solved
0 Replies
102 Views
I'm running into an issue with threadgroup memory where data written to it seemingly gets lost when I use int8_t or int16_t element types:
#include <metal_stdlib>
using namespace metal;

kernel void kernel_function(device int16_t* R,
                            uint index [[thread_position_in_threadgroup]]) {
    threadgroup int16_t shared[1];
    shared[index] = (int16_t) 42;
    threadgroup_barrier(mem_flags::mem_threadgroup);
    R[0] = shared[index];
}
If I execute this kernel (using the following host code: https://gist.github.com/maleadt/ffcda8fc94f03f32347c3167ccca70a8 ), I get zeros in my output buffer. If I change the element type from int16_t to int32_t (just find/replace in the kernel and host code), I get the expected results. I'm new to Metal, so I guess I'm doing something wrong here. I'm using an M1 Pro on Monterey, with Xcode 13.4.1. EDIT: interestingly, running under MTL_SHADER_VALIDATION=1 results in the expected output, so this does start to look like a miscompilation in the back-end.
Posted by
Post marked as solved
1 Replies
190 Views
Silly question, but does anyone know where I can find the code for the demos in this video? https://developer.apple.com/videos/play/wwdc2022/10063/ I am trying to replicate the distributed training but am running into version errors with Horovod.
Posted by
Post not yet marked as solved
1 Replies
184 Views
In our use case, there is a background Mac app (running on an M1 Mac) that is responsible for receiving data from a companion iOS app via a WebSocket connection (client side: Apple's Swift API; server side: Vapor 4) and performing computations using the Metal compute APIs and our custom kernels. In order to optimize the execution time of these compute kernels, we are looking for a way to profile their execution, i.e. how much combined GPU execution time (compute and memory accesses) is taken by each instance? As may be obvious, our primary focus is not the time spent waiting in the kernel scheduling queues before execution begins, but that may be helpful as an extra. We are not sure whether Instruments in Xcode will be helpful in the above scenario (partly iOS, partly a third-party WebSocket API, and partly a background Mac command-line app). Also, does the Metal frame capture method depend on the presence of the Metal graphics APIs, and hence not work for background apps? Can we get the desired info using GPU counter sample buffers, or are we looking in the wrong places? Any assistance with the above (measuring Metal compute kernel execution times in the context of a Mac background app) would be highly appreciated.
Posted by
Post not yet marked as solved
0 Replies
120 Views
Hi, I'm developing an app that uses Metal for some computations. To improve the efficiency of the render process I started looking at indirect command buffers, but there isn't an example that explains the best way to proceed. Can anyone provide some tips?
Posted by
Post not yet marked as solved
0 Replies
121 Views
I'm animating body and face movements using skeletal animation, SceneKit, and a custom Metal shader. I need to deform some vertices in the vertex shader, and therefore I need skinningJointMatrices in the vertex shader. However, if I just add the line
float4 skinningJointMatrices[183];
to the NodeBuffer:
struct NodeBuffer {
    float4x4 inverseModelTransform;
    float4x4 inverseModelViewTransform;
    float4x4 modelTransform;
    float4x4 modelViewProjectionTransform;
    float4x4 modelViewTransform;
    float4x4 normalTransform;
    float2x3 boundingBox;
    float4 skinningJointMatrices[765];
};
I get the following assertion:
[SceneKit] Assertion 'C3DSkinnerGetEffectiveCalculationMode(skinner, C3DNodeGetGeometry(context->_nodeUniforms.instanceNode)) == kC3DSkinnerCalculationModeGPUVertexFunction' failed. skinningJointMatrices should only be used when skinning is done in the vertex function
Is there a way to work around this assert? The code seems to work fine despite the assertion.
Posted by
Post not yet marked as solved
0 Replies
109 Views
In my game project, there is a functions.data file at /AppData/Library/Caches/[bundleID]/com.apple.metal/functions.data. When we reboot and launch the game, this file is reset to about 40KB; normally its size is about 30MB. This is done by Metal. Is there any way to avoid it?
Posted by
Post not yet marked as solved
0 Replies
100 Views
When I begin a compute pass from an MTLComputePassDescriptor with sampleBufferAttachments on macOS 11.6, it does not sample correct values from the counters on an Apple M1 device, although the same approach works well for render passes.
Device: MacBook Pro (12-inch, M1, 2020)
macOS: 11.6
NSRef<MTLComputePassDescriptor> descriptor = [MTLComputePassDescriptor computePassDescriptor];
descriptor.sampleBufferAttachments[0].sampleBuffer = counterSampleBuffer;
descriptor.sampleBufferAttachments[0].startOfEncoderSampleIndex = NSUInteger(0);
descriptor.sampleBufferAttachments[0].endOfEncoderSampleIndex = NSUInteger(1);
After the compute pass completes, the values in the sample buffer are zeros. Another issue is MTLCounterDontSample: if we don't need to sample a counter at the end of a pass, we can set endOfEncoderSampleIndex to MTLCounterDontSample per the documentation, but this triggers a validation layer error:
failed assertion 'endOfEncoderSampleIndex (4294967295) must be < sample buffer count (2)'
Posted by