Code is download from apple official metal4 sample
[https://developer.apple.com/documentation/metal/drawing-a-triangle-with-metal-4?language=objc]
enable metal gpu trace in macOS schema and trace a frame in Xcode.
Xcode may show segment fault on App from some 'GTTrace' function when click trace button.
When replay a .gputrace file, Xcode may crash , throw an internal error or a XPC error.
The example code using old metal-renderer can trace without any problem and everything works fine.
Test Environment:
Xcode Version 26.2 (17C52)
macOS 26.2 (25C56)
M1 Pro 16GB A2442
Metal
RSS for tagRender advanced 3D graphics and perform data-parallel computations using graphics processors using Metal.
Selecting any option will automatically load the page
Post
Replies
Boosts
Views
Activity
Hi everyone,
I’ve been developing a custom, end-to-end 3D rendering engine called Crescent from scratch using C++20 and Metal-cpp (targeting macOS and visionOS). My primary goal is to build a zero-bottleneck, GPU-driven pipeline that maximizes the potential of Apple Silicon’s Unified Memory and TBDR architecture.
While the fundamental systems are stable, I am looking for architectural feedback from Metal framework engineers regarding specific synchronization and latency challenges.
Current Core Implementations:
GPU-Driven Instance Culling: High-performance occlusion culling using a Hierarchical Z-Buffer (HZB) approach via Compute Shaders.
Clustered Forward Shading: Support for high-count dynamic lights through view-space clustering.
Temporal Stability: Custom TAA with history rejection and Motion Blur resolve.
Asset Infrastructure: Robust GUID-based scene serialization and a JSON-driven ECS hierarchy.
The Architectural Challenge:
I am currently seeing slight synchronization overhead when generating the HZB mip-chain. On Apple Silicon, I am evaluating the cost of encoder transitions versus cache-friendly barriers.
&& m_hzbInitPipeline && m_hzbDownsamplePipeline && !m_hzbMipViews.empty();
if (canBuildHzb) {
MTL::ComputeCommandEncoder* hzbInit = commandBuffer->computeCommandEncoder();
hzbInit->setComputePipelineState(m_hzbInitPipeline);
hzbInit->setTexture(m_depthTexture, 0);
hzbInit->setTexture(m_hzbMipViews[0], 1);
if (m_pointClampSampler) {
hzbInit->setSamplerState(m_pointClampSampler, 0);
} else if (m_linearClampSampler) {
hzbInit->setSamplerState(m_linearClampSampler, 0);
}
const uint32_t hzbWidth = m_hzbMipViews[0]->width();
const uint32_t hzbHeight = m_hzbMipViews[0]->height();
const uint32_t threads = 8;
MTL::Size tgSize = MTL::Size(threads, threads, 1);
MTL::Size gridSize = MTL::Size((hzbWidth + threads - 1) / threads * threads,
(hzbHeight + threads - 1) / threads * threads,
1);
hzbInit->dispatchThreads(gridSize, tgSize);
hzbInit->endEncoding();
for (size_t mip = 1; mip < m_hzbMipViews.size(); ++mip) {
MTL::Texture* src = m_hzbMipViews[mip - 1];
MTL::Texture* dst = m_hzbMipViews[mip];
if (!src || !dst) {
continue;
}
MTL::ComputeCommandEncoder* downEncoder = commandBuffer->computeCommandEncoder();
downEncoder->setComputePipelineState(m_hzbDownsamplePipeline);
downEncoder->setTexture(src, 0);
downEncoder->setTexture(dst, 1);
const uint32_t mipWidth = dst->width();
const uint32_t mipHeight = dst->height();
MTL::Size downGrid = MTL::Size((mipWidth + threads - 1) / threads * threads,
(mipHeight + threads - 1) / threads * threads,
1);
downEncoder->dispatchThreads(downGrid, tgSize);
downEncoder->endEncoding();
}
if (m_instanceCullHzbPipeline) {
dispatchInstanceCulling(m_instanceCullHzbPipeline, true);
}
}
My Questions:
Encoder Synchronization: Would you recommend moving this loop into a single ComputeCommandEncoder using MTLBarrier between dispatches to maintain L2 cache residency, or is the overhead of separate encoders negligible for depth-downsampling on TBDR?
visionOS Bindless Latency: For stereo rendering on visionOS, what are the best practices for managing MTL4ArgumentTable updates at 90Hz+? I want to ensure that updating bindless resources for each eye doesn't introduce unnecessary CPU-to-GPU latency.
Memory Management: Are there specific hints for Memoryless textures that could be applied to intermediate HZB levels to save bandwidth during this process?
I’ve attached a screenshot of a scene rendered with the engine (PBR, SSR, and IBL).
Hi. I'm a 3D designer, using Blender for most of my work. The most recent Blender conference discussed utilizing the Open Shading Language (OSL) in their latest versions, which allows designers to write custom shaders for their workflows.
At the moment, only Nvidia Optix GPU's can utilize this language for rendering (from what I understand), but Blender developers stated they are waiting on other GPU manufacturers to implement this feature as well. I'm not sure if there are any licensing issues here, but would this be something Apple could implement in Metal to make their hardware more attractive to the 3D design community?
Any help or knowledge on this topic would be greatly appreciated.
Unable to find intelgpu_kbl_gt2r0 slice or a compatible one in binary archive 'file:///System/Library/PrivateFrameworks/IconRendering.framework/Resources/binary.metallib'
available slices: applegpu_g13g, applegpu_g13s, applegpu_g13d, applegpu_g14g, applegpu_g14s, applegpu_g14d, applegpu_g15g, applegpu_g15s, applegpu_g15d, applegpu_g16g, applegpu_g16s, applegpu_g17g, applegpu_g15g, applegpu_g15s, applegpu_g15d, applegpu_g16s
Is it related to performance of applications in macOS 26.2 on Intel Macs?
Hello Apple Developers and users I am writing this message reguarding some help on some performance codes/settings I can use for my Macbook since I recently downloaded the MacOs Tahoe 26.2 and its been very glitchy and laggy with gaming and just using my mac normally I have tried using a FPS unlocker and downloading Metal 4 the FPS unlocker hasent worked at all I am still stuck on the normal 60 FPS and need some advice/help. Thank you. Kind regards Zachary
I'm currently learning Metal. While reading the reference, I came across a strange description.
Page 78 in Version 4 Reference (2025-10-25) says:
It is legal to call the following set_indices functions to set the indices if the position in the index buffer is valid and if the position in the index buffer is a multiple of 2 (uchar2 overload) or 2 (uchar4 overload). The index I needs to be in the range [0, max_indices).
void set_indices(uint I, uchar2 v);
void set_indices(uint I, uchar4 v);
However, it seems that the uchar4 overload should be multiple of 4.
Furthermore, there is no explanation of what these methods actually do. I believe it involves setting two to four consecutive indices at once, but there is no mention of that here.
I would like to know if the above understanding is correct.