Post not yet marked as solved
I tried converting our Android ATrace scopes to use os_signpost, but this seems to add 20ms of CPU time to every frame. ATrace_isEnabled only returns true when AGI (Android GPU Inspector) takes a capture, but there don't seem to be flags that indicate when an Instruments capture is being taken.
AGI gives us nice tracks in Perfetto of CPU and GPU timings, with pseudo-coloring and text in each track that help interpret the frame, and without a 20ms hit.
Instruments gives microscopically tiny tracks that are all blue with no text in the os_signpost widget. I have to hover over every track, each about 2 pixels high, to see the timings, and the timings for each frame are 400ms instead of the actual 50ms.
Is there a better method to see scoped cpu timings for macOS/iOS considering dtrace isn't available, or somehow improve the performance hit there?
Post not yet marked as solved
By the time I background the app, hit the capture button, wait for the UI popup to appear, and then hit the "capture" button in the popup, the event that I was trying to capture has already passed. Can we get a button, or a double-click on the slanted M icon, that just does the capture instead of asking me to confirm it? All told, it takes about 5s to get a capture to execute, and that is too long when running at 60 or 120Hz.
I know there's programmatic capture too, but we don't have that hooked up yet.
Post not yet marked as solved
We are using first pass depth. I know it's not recommended, but we have one and need it. Deferred renderers use this, and we do too.
We've tried setting [[invariant]] on the position, and are now resorting to slope and depth biasing the second pass. We even set -fpreserve-invariance on the compiler. This whole construct is confusing. "invariant" was added in MSL 2.1, but setting that compiler flag requires iOS 13, and then other code states that the flag must be set for iOS 14 and macOS 11 SDK use (minSDK? buildSDK?). We also tried disabling fast math (-fno-fast-math) to no avail.
But why is a simple v = v * m calculation different once polys hit the near plane or the viewport edges? The polys then seem to z-fight per tile. Some tiles have stripes of z, and some are just completely missing. These are the same tris going through two shaders that do the same vertex calc.
That shouldn't be happening, unless the tiles are computing gradients per tile incorrectly from the one pass to the next. On long clipped tris, it looks like a hardware/driver bug computing consistent depths across the same triangles. This was tested on older (iPhone 6) and newer iOS devices and M1 MBP.
Post not yet marked as solved
We have this on many of our platforms, but Apple doesn't appear to expose this in Metal. Nvidia/AMD have had this for a long time. We can work around it for now with a gather followed by a component min/max on a single channel. For large scale multi-channel downsampling, having access to the sampler setting would be better. This would even work with 3d volumes, etc.
VK_EXT_sampler_filter_minmax
These are the three modes
WeightedAverage - basic nearest/bilinear/trilinear
Min
Max
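For reference, here is a minimal CPU-side sketch in C of what the gather-then-min/max workaround computes for a single channel on a 2x downsample. The names (ReduceMode, reduce4, downsample2x) are made up for illustration and are not part of Metal or Vulkan. The WeightedAverage case uses equal weights, which is an assumption matching bilinear sampling at the center of the 2x2 footprint.

```c
#include <stddef.h>

/* Reduction modes, named after the three modes listed above. */
typedef enum { kReduceWeightedAverage, kReduceMin, kReduceMax } ReduceMode;

/* Reduce one 2x2 footprint, i.e. the four values a gather returns
   for a single channel. */
float reduce4(const float t[4], ReduceMode mode)
{
    float r = t[0];
    for (int i = 1; i < 4; ++i) {
        switch (mode) {
        case kReduceMin: if (t[i] < r) r = t[i]; break;
        case kReduceMax: if (t[i] > r) r = t[i]; break;
        default:         r += t[i];              break;
        }
    }
    /* Equal weights: assumes sampling at the exact footprint center. */
    return (mode == kReduceWeightedAverage) ? r * 0.25f : r;
}

/* Downsample a single-channel image by 2x in each dimension.
   src is srcW x srcH (both assumed even); dst is (srcW/2) x (srcH/2). */
void downsample2x(const float *src, size_t srcW, size_t srcH,
                  float *dst, ReduceMode mode)
{
    size_t dstW = srcW / 2, dstH = srcH / 2;
    for (size_t y = 0; y < dstH; ++y) {
        for (size_t x = 0; x < dstW; ++x) {
            const float *p = src + (2 * y) * srcW + 2 * x;
            float t[4] = { p[0], p[1], p[srcW], p[srcW + 1] };
            dst[y * dstW + x] = reduce4(t, mode);
        }
    }
}
```

With a sampler-level reduction mode, the inner 2x2 fetch and reduce would collapse into a single filtered sample per channel, which is why the sampler setting scales better for multi-channel and 3d cases.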
Post not yet marked as solved
I know how to do this with macOS 12/iOS 15, but how do we determine the split prior? I know most phones are 2/4, but A10 is 2/2 exclusive.
This is the new way below, but what is the old way? Especially with Alder Lake chips using 8HT/8 configs with 24 threads, this info is important to identify.
sysctlbyname( "hw.nperflevels", &perfLevelCount, &countSize, NULL, 0 );
sysctlbyname( "hw.perflevel0.physicalcpu", &info.bigCores, &countSize, NULL, 0 );
sysctlbyname( "hw.perflevel1.physicalcpu", &info.littleCores, &countSize, NULL, 0 );
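Here is a slightly fuller sketch of the calls above, in C, with error checking and a fallback. It resets the size argument before every call, since sysctlbyname writes the returned length back into it. The CoreInfo struct and function names are hypothetical. The fallback to "hw.physicalcpu" only yields the total core count; it is an assumption that nothing better is available on older systems, and it does not recover the big/little split that the question asks about.

```c
#include <stddef.h>
#include <string.h>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

typedef struct {
    int bigCores;    /* performance cores, or all physical cores if the split is unknown */
    int littleCores; /* efficiency cores, 0 if the split is unknown */
    int splitKnown;  /* 1 only when the hw.perflevel* keys were readable */
} CoreInfo;

/* Read one int-valued sysctl; returns 1 on success, 0 otherwise. */
static int read_sysctl_int(const char *name, int *out)
{
#if defined(__APPLE__)
    size_t size = sizeof(*out); /* reset per call; sysctlbyname overwrites it */
    return sysctlbyname(name, out, &size, NULL, 0) == 0 && size == sizeof(*out);
#else
    (void)name; (void)out;      /* no sysctlbyname here; report failure */
    return 0;
#endif
}

/* Returns 1 if any core count could be read, 0 otherwise. */
int query_core_info(CoreInfo *info)
{
    int levels = 0;
    memset(info, 0, sizeof(*info));

    /* macOS 12 / iOS 15 path: per-performance-level counts. */
    if (read_sysctl_int("hw.nperflevels", &levels) && levels >= 2 &&
        read_sysctl_int("hw.perflevel0.physicalcpu", &info->bigCores) &&
        read_sysctl_int("hw.perflevel1.physicalcpu", &info->littleCores)) {
        info->splitKnown = 1;
        return 1;
    }

    /* Fallback: total physical cores only, split unknown. */
    info->littleCores = 0;
    if (read_sysctl_int("hw.physicalcpu", &info->bigCores))
        return 1;

    info->bigCores = 0;
    return 0;
}
```

Callers can check splitKnown before trusting littleCores, and treat bigCores as "all cores" otherwise.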
Post not yet marked as solved
I can't figure out why macOS keeps updating itself without my consent. I have "automatically download" and "automatically update" turned off. But macOS is constantly indicating an update is available, and then on reboot, the new macOS installs itself anyway. Since this often tends to break Xcode or GPU capture, I'd really like to prevent this.
Post not yet marked as solved
When we build our C++ code in Visual Studio, IntelliSense finds all of the types and functions. When we build in Xcode, it finds about 90%.
There seems to be no consistent pattern to why Xcode skips some things, and then that daisychains into the next header that includes that prior header.
We have a class where the If/Else function calls are found, but the Add calls are skipped. Even one header with the struct defined in the same header isn't highlighted as a type within that header.
Sources are built with GNU makefiles, but ultimately the .o and .d files are all compiled and linked together by clang using Xcode 13.3, and we use the new build system. What could we be doing wrong here? This isn't a recent problem, and has happened with all prior Xcode builds.
Post not yet marked as solved
I see reasonable numbers from this on macOS, but on iPad I see really large numbers from this and in the GPU capture, and they don't add up. This is Xcode 12.2 and an iPad on 14.0.1.
Textures and Buffers add up to 261MB which is close to the macOS. The memory summary, and the "other" area in the buffers area report 573MB when I hover over that. Also device.currentAllocatedSize reports 868MB total. I assume the buffer size is skewing the memory totals, since Xcode reports 620MB for the entire app.
I would attach a screenshot of the GPU capture showing the memory capture, but it seems that the new forums don't support this, and not being able to search categories anymore is rather limiting.
Non-volatile 261
Volatile 0
Textures 195
Buffers 66 <- but hover over "other" reports 573
Private 184
Shared 77
Used 166
Unused 95
Post not yet marked as solved
For keyboard handling on iOS (and iOS on macOS M1), the iOS 13.4 keyboard constants are missing the command keys. We need to be able to detect key up/down on all the modifiers. I realize there's a modifiers field on UIKey, but this seems inconsistent.
case UIKeyboardHIDUsageKeyboardLeftShift: b = kButton_Shift; break;
case UIKeyboardHIDUsageKeyboardRightShift: b = kButton_Shift; break;
case UIKeyboardHIDUsageKeyboardLeftAlt: b = kButton_Alt; break;
case UIKeyboardHIDUsageKeyboardRightAlt: b = kButton_Alt; break;
// ? case kVK_Command: b = kButton_Command; break;
// ? case kVK_RightCommand: b = kButton_Command; break;
case UIKeyboardHIDUsageKeyboardLeftControl: b = kButton_Ctrl; break;
case UIKeyboardHIDUsageKeyboardRightControl: b = kButton_Ctrl; break;
Post not yet marked as solved
This is the latest Intel Mac running with an AMD 5500, and it can't sample timings at stage boundaries? How are we supposed to write timing consistently for macOS and iOS if that's the case? So I then have to add several thousand samples per draw call and accumulate them? I don't remember the docs or sample code pointing this out.
Our app compiles to deploy on macOS 10.15. Does setting that higher help with this?
MTLCounterSamplingPointAtStageBoundary is not supported, startOfVertexSampleIndex must be MTLCounterDontSample.
MTLCounterSamplingPointAtStageBoundary is not supported, startOfFragmentSampleIndex must be MTLCounterDontSample
Post not yet marked as solved
We can drop our compiles from AVX to SSE4.2, but we also use f16c ops to handle fp16 <-> fp32 conversions. Neon already has similar routines to f16c support, so why are these missing from Rosetta2?
Until we can generate universal apps, we need to fall back to running our tools under Rosetta2. It also looks like popcount is missing. These limits should be posted in Apple's Rosetta2 documents.
Here's my MBP 16" Intel
sysctl -a | grep machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
And an M1 comparison:
sysctl -a | grep machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTSE64 MON DSCPL VMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1 SSE4.2 AES SEGLIM64
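Until those features are exposed, a scalar fallback is one option. Below is a hedged sketch in C of software fp16 <-> fp32 conversion (round to nearest even) and a bit-twiddling popcount. The function names are made up for illustration; this is not how Rosetta2 or F16C implements them, and it will be much slower than the hardware instructions.

```c
#include <stdint.h>
#include <string.h>

/* Software stand-in for an F16C-style half -> float conversion. */
float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    float f;

    if (exp == 0 && man == 0) {
        bits = sign;                                 /* signed zero */
    } else if (exp == 0) {
        int shift = 0;                               /* subnormal: renormalize */
        while ((man & 0x400u) == 0) { man <<= 1; ++shift; }
        bits = sign | ((uint32_t)(113 - shift) << 23) | ((man & 0x3FFu) << 13);
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13);     /* inf or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    }
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Software float -> half conversion, rounding to nearest even. */
uint16_t float_to_half(float f)
{
    uint32_t bits, absb, man, shift, q, rem, halfway, result;
    uint16_t sign;
    int32_t exp;

    memcpy(&bits, &f, sizeof bits);
    sign = (uint16_t)((bits >> 16) & 0x8000u);
    absb = bits & 0x7FFFFFFFu;

    if (absb > 0x7F800000u) return (uint16_t)(sign | 0x7E00u);  /* NaN */
    if (absb >= 0x477FF000u) return (uint16_t)(sign | 0x7C00u); /* overflow or inf */
    if (absb <= 0x33000000u) return sign;                       /* rounds to zero */

    exp = (int32_t)(absb >> 23) - 127;               /* unbiased exponent */
    man = (absb & 0x7FFFFFu) | 0x800000u;            /* add the implicit 1 */
    shift = (exp < -14) ? (uint32_t)(13 + (-14 - exp)) : 13u;

    q = man >> shift;
    rem = man & ((1u << shift) - 1u);
    halfway = 1u << (shift - 1u);
    if (rem > halfway || (rem == halfway && (q & 1u))) ++q;

    /* For normals q carries the implicit bit at bit 10, so adding it to
       (exponent - 1) yields the final exponent and handles rounding carries. */
    result = (exp < -14) ? q : (((uint32_t)(exp + 14) << 10) + q);
    return (uint16_t)(sign | result);
}

/* Software stand-in for POPCNT on a 32-bit value. */
int popcount32(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;
    return (int)((v * 0x01010101u) >> 24);
}
```

These could sit behind a runtime check of the feature list shown above, so native Intel machines keep using the hardware paths.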
Post not yet marked as solved
This breaks shader hotloading and has been a persistent bug in Metal for the past many years. Metal holds onto some existing lib, returns it, without checking that the data content has changed. Similar bugs happen with Metal's shader cache not checking modification timestamps.
In my case, I'm just changing a color in the shader from float3(1,0,0) to float3(1,1,0) and then never seeing the result of the shader change. The new metallib is loaded from disk, and handed off to newLibraryWithData.
I can tell that it's returning a cached metallib, because we set a label on the MTLFunction that is returned. That label is nil on the first load of the shader, but after the hotload of the new metallib the label is already non-nil. So we just see the old shader content.
This is a very important Radar to fix.
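One way to rule out our own loader when reproducing this is to log a content hash of the bytes right before they are handed to newLibraryWithData. The sketch below is plain C using FNV-1a; the function name is hypothetical and the hash choice is arbitrary. It only shows that the input data changed between loads, not what Metal does with it.

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a over a byte buffer, e.g. the metallib contents read from disk. */
uint64_t fnv1a64(const void *data, size_t size)
{
    const unsigned char *p = (const unsigned char *)data;
    uint64_t h = 14695981039346656037ull; /* FNV offset basis */
    for (size_t i = 0; i < size; ++i) {
        h ^= p[i];
        h *= 1099511628211ull;            /* FNV prime */
    }
    return h;
}
```

If the hash differs between the first load and the hotload while the rendered result stays the same, the stale result is coming from below the app.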
Post not yet marked as solved
Why is there no count to any of these draw indirect directives? I am appending draws to a single MTLBuffer on the cpu, but can't limit how many are drawn out of the buffer. An offset isn't enough to specify a range. Can this be supplied in some bind call?
- (void)drawIndexedPrimitives:(MTLPrimitiveType)primitiveType indexType:(MTLIndexType)indexType indexBuffer:(id <MTLBuffer>)indexBuffer indexBufferOffset:(NSUInteger)indexBufferOffset indirectBuffer:(id <MTLBuffer>)indirectBuffer indirectBufferOffset:(NSUInteger)indirectBufferOffset API_AVAILABLE(macos(10.11), ios(9.0));
Contrast this with the Vulkan call, which has an offset and a count (plus a stride).
vkCmdDrawIndexedIndirect( m_encoder, indirectBuffer, drawBufferOffset, drawCount, sizeof( VkDrawIndexedIndirectCommand ) );
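Lacking a count, the obvious CPU-side fallback is one indirect draw per entry, stepping the offset by the argument stride. This is a sketch in C under that assumption. DrawIndexedArgs mirrors the five 32-bit fields of MTLDrawIndexedPrimitivesIndirectArguments; the callback type, DrawLog, and log_draw are hypothetical stand-ins for the real encoder call.

```c
#include <stddef.h>
#include <stdint.h>

/* Same field order as MTLDrawIndexedPrimitivesIndirectArguments. */
typedef struct {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t indexStart;
    int32_t  baseVertex;
    uint32_t baseInstance;
} DrawIndexedArgs;

/* Hypothetical hook: the real version would call drawIndexedPrimitives
   with indirectBufferOffset set to the given offset. */
typedef void (*IssueDrawFn)(size_t indirectBufferOffset, void *user);

/* Emulates Vulkan's (offset, drawCount, stride) with one draw per entry.
   Returns the offset just past the last entry consumed. */
size_t issue_indirect_draws(size_t firstOffset, size_t drawCount,
                            IssueDrawFn issue, void *user)
{
    for (size_t i = 0; i < drawCount; ++i)
        issue(firstOffset + i * sizeof(DrawIndexedArgs), user);
    return firstOffset + drawCount * sizeof(DrawIndexedArgs);
}

/* Example callback that just records the offsets it was given. */
typedef struct { size_t offsets[16]; size_t count; } DrawLog;

void log_draw(size_t indirectBufferOffset, void *user)
{
    DrawLog *log = (DrawLog *)user;
    if (log->count < 16)
        log->offsets[log->count++] = indirectBufferOffset;
}
```

This costs one encoder call per draw, which is exactly the overhead a count parameter would avoid.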
Post not yet marked as solved
We target MSL 1.1 on iOS 9, and are seeing non-equivalence between the two forms below. The upper code generates bad pixels on iOS but is the more efficient form. macOS (on AMD 5500M) is fine. I will log this to Feedback Assistant, but am posting it here too.
The code was also compiled with -O2. So could be an iOS optimizer bug.
#if 1
if ( all( greaterThanEqual(pos.xy, v_clip.xy )) &&
all( lessThanEqual(pos.xy, v_clip.zw )) )
#else
if ( pos.x >= v_clip.x && pos.x <= v_clip.z &&
pos.y >= v_clip.y && pos.y <= v_clip.w )
#endif
This is codegen out of spirv-cross. Mac and iOS codegen is the same for this chunk.
These are on iOS
With #if 1, this doesn't work:
fsmain_out out = {};
float4 color = float4(0.0);
float2 pos = gl_FragCoord.xy;
bool _35 = all(pos >= in.v_clip.xy);
bool _43;
if (_35)
{
_43 = all(pos <= in.v_clip.zw);
}
else
{
_43 = _35;
}
if (_43) ...
With #if 0, this works:
fsmain_out out = {};
float4 color = float4(0.0);
float2 pos = gl_FragCoord.xy;
bool _38 = pos.x >= in.v_clip.x;
bool _47;
if (_38)
{
_47 = pos.x <= in.v_clip.z;
}
else
{
_47 = _38;
}
bool _56;
if (_47)
{
_56 = pos.y >= in.v_clip.y;
}
else
{
_56 = _47;
}
bool _65;
if (_56)
{
_65 = pos.y <= in.v_clip.w;
}
else
{
_65 = _56;
}
if (_65)
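For comparison, here is a CPU reference in C of the two forms, which should agree for every input, including points exactly on the clip edges. The Float2/Float4 types and function names are made up for this sketch. It only pins down the intended semantics and says nothing about what the iOS compiler emits.

```c
#include <stdbool.h>

typedef struct { float x, y; } Float2;
typedef struct { float x, y, z, w; } Float4;

/* Vector form: all(pos >= clip.xy) && all(pos <= clip.zw). */
bool in_clip_vector(Float2 pos, Float4 clip)
{
    bool ge[2] = { pos.x >= clip.x, pos.y >= clip.y };
    bool le[2] = { pos.x <= clip.z, pos.y <= clip.w };
    return (ge[0] && ge[1]) && (le[0] && le[1]);
}

/* Scalar form: the four comparisons chained with &&. */
bool in_clip_scalar(Float2 pos, Float4 clip)
{
    return pos.x >= clip.x && pos.x <= clip.z &&
           pos.y >= clip.y && pos.y <= clip.w;
}
```

Feeding the same pos and v_clip values that produce bad pixels on device through both functions gives a known-good answer to compare against.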
Post not yet marked as solved
The push/popDebugGroup calls are captured by GPU capture and display a folder around a series of draw calls. But when you select the folder, the previous draw call results and attachments are displayed. This makes walking through a deep hierarchy of draw calls confusing, especially to people new to GPU capture.
It would be a simple change: selecting a folder like this, or any command after a draw, should really display the results from the next draw call instead of the previous one.