AppleOS 26 introduces LowLevelInstanceData that can reduce CPU draw calls significantly by instancing. However, I have noticed trouble with animating each individual instance.
As I wanted low-level control, I'm using a custom system and LowLevelInstanceData.replace(using:) to update the transform each frame. The update closure itself is extremely efficient (Xcode Instruments reports nearly no cost). But I noticed extremely high runloop time, reach around 20ms. Time Profiler shows that the CPU is blocked by kernel.release.t6401.
I think it is caused by synchronization between CPU and GPU, however, as I am already using a MTLCommandBuffer to coordinate it, I don't understand why I am still seeing large CPU time.