I'm new to the Mac side of things, but definitely not to UE. Packaging on Windows is a long process, but it can be done. Documentation from Epic, and from the internet in general, on exactly how to package a project from UE on Mac is basically non-existent. I have Xcode installed (which makes sense), agreed to the terms, and installed the macOS platform support. I've been working on a project for several weeks now and want to package a test build for my friends to play on Windows. Now I just get this in the log:
UATHelper: Packaging (Mac): ERROR: Failed to finalize the .app with Xcode. Check the log for more information
UATHelper: Packaging (Mac): Trace written to file /Users/rileysleger/Library/Logs/Unreal Engine/LocalBuildLogs/UBA-ProjectNightTerror-Mac-Development.uba with size 12.6kb
UATHelper: Packaging (Mac): Total time in Unreal Build Accelerator local executor: 8.12 seconds
UATHelper: Packaging (Mac): Result: Failed (OtherCompilationError)
UATHelper: Packaging (Mac): Total execution time: 9.71 seconds
PackagingResults: Error: Failed to finalize the .app with Xcode. Check the log for more information
UATHelper: Packaging (Mac): Took 9.77s to run dotnet, ExitCode=6
UATHelper: Packaging (Mac): UnrealBuildTool failed. See log for more details. (/Users/rileysleger/Library/Logs/Unreal Engine/LocalBuildLogs/UBA-ProjectNightTerror-Mac-Development.txt)
UATHelper: Packaging (Mac): AutomationTool executed for 0h 0m 10s
UATHelper: Packaging (Mac): AutomationTool exiting with ExitCode=6 (6)
UATHelper: Packaging (Mac): RunUAT ERROR: AutomationTool was unable to run successfully. Exited with code: 6
PackagingResults: Error: AutomationTool was unable to run successfully. Exited with code: 6
PackagingResults: Error: Unknown Error
This makes absolutely no sense to me. Does anyone have any ideas?
Metal
Render advanced 3D graphics and perform data-parallel computations using graphics processors with Metal.
Hi there,
I'm wondering whether it's possible under the iOS 28 developer beta to enable the MetalFX scaling info in the Metal HUD by setting {"MTL_HUD_ENABLED": "1"} for my app.
This information has been added on Mac, but appears to be absent on iPhone / iPad.
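For reference, here is a minimal Swift sketch of the usual way the HUD is opted into (setting MTL_HUD_ENABLED as an environment variable, either in the Xcode scheme or programmatically); whether the MetalFX scaling rows then show up on iPhone/iPad is exactly the open question here.
import Darwin
import Metal

// Minimal sketch: opt the process into the Metal Performance HUD.
// This must run before the first Metal device / CAMetalLayer is created.
func enableMetalHUD() -> MTLDevice? {
    setenv("MTL_HUD_ENABLED", "1", 1)
    return MTLCreateSystemDefaultDevice()
}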
Hey, I've been struggling with this for some days now.
I am trying to write to a sparse texture in a compute shader. I'm performing the following steps:
Set up a sparse heap and create a texture from it
Map the whole area of the sparse texture using updateTextureMapping(..)
Overwrite every value with the value "4" in a compute shader
Blit the texture to a shared buffer
Assert that the values in the buffer are "4".
I have a minimal example (which is still pretty long unfortunately).
It works perfectly when removing the line heapDesc.type = .sparse.
What am I missing? I could not find any information that writes to sparse textures are unsupported. Any help would be greatly appreciated.
import Metal

func sparseTexture64x64Demo() throws {
    // ── Metal objects
    guard let device = MTLCreateSystemDefaultDevice()
    else { throw NSError(domain: "SparseNotSupported", code: -1) }
    let queue = device.makeCommandQueue()!
    let lib = device.makeDefaultLibrary()!
    let pipeline = try device.makeComputePipelineState(function: lib.makeFunction(name: "addOne")!)

    // ── Texture descriptor
    let width = 64, height = 64
    let format: MTLPixelFormat = .r32Uint // 4 B per texel
    let desc = MTLTextureDescriptor()
    desc.textureType = .type2D
    desc.pixelFormat = format
    desc.width = width
    desc.height = height
    desc.storageMode = .private
    desc.usage = [.shaderWrite, .shaderRead]

    // ── Sparse heap
    let bytesPerTile = device.sparseTileSizeInBytes
    let meta = device.heapTextureSizeAndAlign(descriptor: desc)
    let heapBytes = ((bytesPerTile + meta.size + bytesPerTile - 1) / bytesPerTile) * bytesPerTile
    let heapDesc = MTLHeapDescriptor()
    heapDesc.type = .sparse
    heapDesc.storageMode = .private
    heapDesc.size = heapBytes
    let heap = device.makeHeap(descriptor: heapDesc)!
    let tex = heap.makeTexture(descriptor: desc)!

    // ── CPU buffers
    let bytesPerPixel = MemoryLayout<UInt32>.stride
    let rowStride = width * bytesPerPixel
    let totalBytes = rowStride * height
    let dstBuf = device.makeBuffer(length: totalBytes, options: .storageModeShared)!

    let cb = queue.makeCommandBuffer()!
    let fence = device.makeFence()!

    // Map the whole sparse texture, then signal the fence
    let rse = cb.makeResourceStateCommandEncoder()!
    rse.updateTextureMapping(
        tex,
        mode: .map,
        region: MTLRegionMake2D(0, 0, width, height),
        mipLevel: 0,
        slice: 0)
    rse.updateFence(fence) // ← capture all work so far
    rse.endEncoding()

    // Overwrite every texel with 4 in the compute shader
    let ce = cb.makeComputeCommandEncoder()!
    ce.waitForFence(fence)
    ce.setComputePipelineState(pipeline)
    ce.setTexture(tex, index: 0)
    let threadsPerTG = MTLSize(width: 8, height: 8, depth: 1)
    let tgCount = MTLSize(width: (width + 7) / 8,
                          height: (height + 7) / 8,
                          depth: 1)
    ce.dispatchThreadgroups(tgCount, threadsPerThreadgroup: threadsPerTG)
    ce.updateFence(fence)
    ce.endEncoding()

    // Blit the texture into the shared buffer
    let blit = cb.makeBlitCommandEncoder()!
    blit.waitForFence(fence)
    blit.copy(
        from: tex,
        sourceSlice: 0,
        sourceLevel: 0,
        sourceOrigin: MTLOrigin(x: 0, y: 0, z: 0),
        sourceSize: MTLSize(width: width, height: height, depth: 1),
        to: dstBuf,
        destinationOffset: 0,
        destinationBytesPerRow: rowStride,
        destinationBytesPerImage: totalBytes)
    blit.endEncoding()

    cb.commit()
    cb.waitUntilCompleted()
    assert(cb.error == nil, "GPU error: \(String(describing: cb.error))")

    // ── Verify a few texels
    let out = dstBuf.contents().bindMemory(to: UInt32.self, capacity: width * height)
    print("first three texels:", out[0], out[1], out[width]) // indices 0, 1, and width
    assert(out[0] == 4 && out[1] == 4 && out[width] == 4)
}
Metal shader:
#include <metal_stdlib>
using namespace metal;

kernel void addOne(texture2d<uint, access::write> tex [[texture(0)]],
                   uint2 gid [[thread_position_in_grid]])
{
    tex.write(4, gid);
}
I am trying to load some PNG data with MTKTextureLoader's newTextureWithData:, but the result looks wrong in the alpha areas.
Here is the code. I have an image URL; after it downloads successfully, I try to use either the raw data or UIImagePNGRepresentation(image), and both come out wrong.
// Inside the NSURLSession download completion handler:
UIImage *tempImg = [UIImage imageWithData:data];
CGImageRef cgRef = tempImg.CGImage;
MTKTextureLoader *loader = [[MTKTextureLoader alloc] initWithDevice:device];
NSDictionary *options = @{MTKTextureLoaderOptionSRGB: @(NO),
                          MTKTextureLoaderOptionTextureUsage: @(MTLTextureUsageShaderRead),
                          MTKTextureLoaderOptionTextureCPUCacheMode: @(MTLCPUCacheModeWriteCombined)};
id<MTLTexture> temp1 = [loader newTextureWithData:data options:options error:nil];
NSData *tempData = UIImagePNGRepresentation(tempImg);
id<MTLTexture> temp2 = [loader newTextureWithData:tempData options:options error:nil];
id<MTLTexture> temp3 = [loader newTextureWithCGImage:cgRef options:options error:nil];
}] resume];
I am building a macOS desktop app (https://anukari.com) that uses Metal compute for real-time audio/DSP processing, since the problem is highly parallelizable and too computationally expensive for the CPU.
However, with the way I am using the GPU, the OS never raises the GPU power/performance state, even when my app is fully compute-limited. Because this is a real-time audio synthesis application, not being able to take advantage of the full clock speeds the GPU is capable of is a huge problem: the app can't keep up with real time.
I discovered this issue while profiling the app using Instruments' Metal tracing (and Game tracing) modes. In the profiling configuration under "Metal Application" there is a drop-down to select the "Performance State." If I run the application under Instruments with Performance State set to Maximum, it runs amazingly well and all my problems go away.
For comparison, when I run the app on its own, outside of Instruments, the expensive GPU computation it's doing takes around 2x as long to complete, meaning that the app performs half as well.
I've done a ton of work to micro-optimize my Metal compute code, based on every scrap of information from the WWDC videos, etc. A problem I'm running into is that I think that the more efficient I make my code, the less it signals to the OS that I want high GPU clock speeds!
I think part of why the OS is confused is that in most use cases, my computation can be done using only a small number of Metal threadgroups. I'm guessing that the OS heuristics see that only a small fraction of the GPU is saturated and fail to scale up the power/clock state.
I'm not sure what to do here; I'm in a bit of a bind. One possibility is that I intentionally schedule busy work -- spin threadgroups just to waste energy and signal to the OS that I need higher clock speeds. This is obviously a really bad idea, but it might work.
Is there any other (better) way for my app to signal to the OS that it is doing real-time latency-sensitive computation on the GPU and needs the clock speeds to be scaled up?
Note that game mode is not really an option, as my app also runs as an AU plugin inside hosts like Garageband, so it can't be made fullscreen, etc.
Hi,
When analyzing our game in Instruments, I've always been confused by the two items "Drawable Present" and "Drawable Presented" in the GPU track. The timing of "Drawable Present" seems to correspond to when the CPU calls [commandBuffer presentDrawable:], rather than to when the encoded work actually completes on the GPU. Also, what does "Drawable Presented" specifically mean? In our case, when a CPU stall occurs, the vsync interval appears to change in the next frame, and a surface that has already been rendered is not displayed. Why is this happening?
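For what it's worth, here is a minimal Swift sketch (assuming a CAMetalLayer-based renderer) of the two moments involved: the present call is encoded on the CPU timeline, which is presumably what "Drawable Present" reflects, while the presented handler fires when the drawable actually reaches the screen, which seems closer to what "Drawable Presented" marks.
import Metal
import QuartzCore

// Sketch: schedule presentation and observe when the frame is actually shown.
func encodeAndPresent(commandBuffer: MTLCommandBuffer, layer: CAMetalLayer) {
    guard let drawable = layer.nextDrawable() else { return }
    drawable.addPresentedHandler { presented in
        // presentedTime is the host time at which the frame reached the display
        // (0 if the frame was never shown).
        print("presented at \(presented.presentedTime)")
    }
    commandBuffer.present(drawable)   // only schedules presentation; returns immediately
    commandBuffer.commit()
}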
Hello!
I'm a developer working on a plugin for the Elgato Stream Deck called GPU Metrics. The plugin currently only works on Windows, but I'd like to bring it to macOS. However, based on the forum posts (and StackOverflow threads) I've read, there isn't a very clear path to querying GPU metrics like utilization, temperature, used GPU memory, and power consumption. There are some tools out there that do similar things, but I wanted to ask what Apple's engineering team would recommend for getting this data via a public API.
Requirements:
Access GPU utilization, temperature, memory usage, power usage
C/C++ based API for querying the metrics so I can expose the data to JavaScript via Node Addon
No need for compatibility with Intel-based Macs, as Apple silicon will be fine for now
Plugin GitHub
Thank you!
Noah
In the Creating A 3D Application With Hydra Rendering tutorial on the Apple Developer website, on the last step where I execute this command:
cmake -S ~/Users/macuser/CreatingA3DApplicationWithHydraRendering/ -B ~/Users/macuser/CreatingA3DApplicationWithHydraRendering/
I keep getting an error:
CMake Error at CMakeLists.txt:5 (include):
include could not find requested file:
/Users/macuser/USDInstall/bin/pxrConfig.cmake
I've followed the instructions in the project's README.md at least five times, and I've also tried moving the pxrConfig.cmake file around and copying it into different folders before re-running the command, but I still can't generate the files needed to compile and render the HydraPlayer renderer. How do I get CMake to generate the Xcode project for HydraPlayer?
I have run into an issue where I am trying to use atomic_float in a Swift package, but I cannot get things to compile because it appears that the Swift Package Manager doesn't support Metal 3 (atomic_float is Metal 3 functionality). Is there any way around this? I am using
// swift-tools-version: 6.1
and my Metal code includes:
#include <metal_stdlib>
#include <metal_geometric>
#include <metal_math>
#include <metal_atomic>

using namespace metal;

kernel void test(device atomic_float* imageBuffer [[buffer(1)]],
                 uint id [[ thread_position_in_grid ]]) {
}
But I get an error on the definition of atomic_float.
Any help, and more importantly a pointer to where I could have found information about this limitation, would be appreciated.
-RadBobby
Hi,
It seems MSL is missing support for a clock() shader instruction, which is available in other graphics APIs such as Vulkan or OpenGL.
It would be useful for counting the cost, in clock cycles, of some code inside a shader with much finer granularity than launching a micro-kernel with the same instructions and measuring the cycle cost from the CPU.
It would also be useful for MoltenVK, which could then support the corresponding extensions.
Thanks.
The game physics work as expected with GPTK 2.0 under CrossOver 24 or Whisky. However, with GPTK 2.1 under CrossOver 25, the player and camera physics misbehave. See https://www.reddit.com/r/WWEGames/comments/1jx9mph/the_siamese_elbow/ and https://www.reddit.com/r/WWEGames/comments/1jx9ow4/camera_glitch/
Full video also linked in the Reddit post.
I have also submitted this bug via the feedback assistant.
I'm implementing an optimized matmul on Metal: https://github.com/crynux-ai/metal-matmul/blob/main/metal/1_shared_mem.metal
I notice that performance differs significantly depending on the threadgroup memory length set with
[computeEncoder setThreadgroupMemoryLength:atIndex:]
All other lines are exactly the same; the only difference is this parameter.
Matmul performance is roughly 250 GFLOPS if I set 32768 (the maximum allowed on this M1 Max),
but 400 GFLOPS if I set 8192.
Why does this happen? How can I optimize it?
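For context, here is a minimal Swift sketch of the host-side call being compared; only threadgroupBytes differs between the two measurements quoted above, and the encoder/pipeline names are placeholders.
import Metal

// Sketch: the only knob varied between the 250 and 400 GFLOPS runs.
func encodeMatmul(encoder: MTLComputeCommandEncoder,
                  pipeline: MTLComputePipelineState,
                  threadgroupBytes: Int) {       // 32768 vs. 8192 in the tests above
    encoder.setComputePipelineState(pipeline)
    // Reserving more threadgroup memory than the kernel really needs can lower
    // occupancy, since fewer threadgroups fit on a GPU core at once.
    encoder.setThreadgroupMemoryLength(threadgroupBytes, index: 0)
}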
Hi,
Introducing Swift Concurrency to my Metal app has been a bit challenging, as Swift Concurrency is limited by the cooperative thread pool.
GPU work is obviously not CPU-bound, yet waiting on it (especially with waitUntilCompleted on the command buffer) can block forward progress. For concurrent render work this has the potential to under-utilize the CPU and even create deadlocks.
My question is: what is the Metal team's general recommendation when it comes to concurrency? It seems to me that Dispatch or OperationQueues are still the preferred way to drive Metal work for maximum performance?
To integrate with Swift Concurrency, my idea is to use continuations that kick off render jobs via Dispatch or queues. Would this be the best way to bridge async tasks with Metal work?
Thanks!
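As a minimal Swift sketch of the continuation idea described above (assuming one command buffer per render job): instead of blocking a cooperative-pool thread with waitUntilCompleted, the task suspends and is resumed from the command buffer's completion handler.
import Metal

// Sketch: await GPU completion without blocking a Swift Concurrency thread.
func run(_ commandBuffer: MTLCommandBuffer) async {
    await withCheckedContinuation { (continuation: CheckedContinuation<Void, Never>) in
        commandBuffer.addCompletedHandler { _ in
            continuation.resume()   // called on Metal's completion thread
        }
        commandBuffer.commit()
    }
}
Whether this beats driving Metal from Dispatch or OperationQueues directly is the performance question above; the sketch only shows the bridging mechanism.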
The code is pretty simple
kernel void naive(
    constant RunParams *param [[ buffer(0) ]],
    const device float *A [[ buffer(1) ]], // [N, K]
    device float *output [[ buffer(2) ]],
    uint2 gid [[ thread_position_in_grid ]]) {
    float val = 0.0f;                      // accumulator (missing from the original snippet)
    uint a_ptr = gid.x * param->K;
    for (uint i = 0; i < param->K; i++, a_ptr++) {
        val += A[a_ptr];
    }
    output[gid.x] = val;                   // write index assumed to follow the row being summed
}
When uint a_ptr = gid.x * param->K, the code gets 150 GFLOPS;
when uint a_ptr = gid.y * param->K, the code gets 860 GFLOPS.
param->K = 256;
thread per group: [16, 16]
I'd like to understand why the performance is so different, and how can I profile/diagnose this to help with further optimization.
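For context, here is a hypothetical host-side Swift sketch of the dispatch described above (16x16 threads per threadgroup, K = 256; n is a placeholder for the grid extent). Within a threadgroup, gid.x typically varies fastest across neighbouring SIMD lanes, so gid.x * K makes adjacent lanes read addresses K floats apart, while gid.y * K makes all lanes in a row read the same addresses.
import Metal

// Sketch: dispatch matching the [16, 16] threadgroup quoted above.
func dispatchNaive(encoder: MTLComputeCommandEncoder,
                   pipeline: MTLComputePipelineState,
                   n: Int) {
    encoder.setComputePipelineState(pipeline)
    let threadsPerThreadgroup = MTLSize(width: 16, height: 16, depth: 1)
    let threadgroups = MTLSize(width: (n + 15) / 16,
                               height: (n + 15) / 16,
                               depth: 1)
    encoder.dispatchThreadgroups(threadgroups, threadsPerThreadgroup: threadsPerThreadgroup)
}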
So I've been trying out GPTK with Elite Dangerous: Horizons, and from what I can tell, VRAM usage keeps going up until it goes over the limit, at which point the FPS drops to 1-3 and the game crashes. From the Performance HUD I can see that, when using GPTK, VRAM usage just keeps climbing and I never saw it drop at all. I did some limited testing, and from that I think I can conclude that it is probably not a VRAM leak; it might be caching. The reason is that if I go back to an area I've already visited, VRAM usage doesn't increase.
So either there is something wrong with the freeing of VRAM, or GPTK might not be reporting the right amount of VRAM available, which would explain why it keeps allocating until it runs out of memory and the game crashes.
Just to test, I tried running the game with the DXVK+MoltenVK combo, and that works just fine: VRAM is freed when it's no longer used.
Is this a known issue in some games?
Hello
I am trying to get threadgroup memory access in a fragment shader. In essence, I would like all the fragments in a tile to bitwise-OR some value. My idea was to use simd_or across the SIMD group, then have thread 0 of each SIMD group atomically OR the value into threadgroup memory. Finally, the very first thread of the tile would be tasked with writing the value down to a texture with write access.
Now, I can declare the threadgroup memory argument to the fragment function all right. MTLRenderCommandEncoder has the setThreadgroupMemoryLength call, which I am using the following way:
[renderEncoder setThreadgroupMemoryLength:16 offset:0 atIndex:0];
Unfortunately, all I am getting is the following error (runtime assertion):
-[MTLDebugRenderCommandEncoder setThreadgroupMemoryLength:offset:atIndex:]:3487: failed assertion Set Threadgroup Memory Length Validation
offset + length(16) must be <= threadgroupMemoryLength(0).
What am I doing wrong? How can I get threadgroup memory in the fragment shader? I know I could use tile shading and a compute function, but the problem is that here I'd really like to use the fragment stuff. Will be grateful for help.
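One assumption worth checking, sketched below in Swift: the validation message compares against a threadgroupMemoryLength of 0, and on Apple GPUs that per-tile allocation is also declared on the render pass descriptor, not only on the encoder. This is a hedged sketch under that assumption, not a confirmed fix.
import Metal

// Sketch: declare per-tile threadgroup memory on the pass, then on the encoder.
func makeTileEncoder(commandBuffer: MTLCommandBuffer,
                     passDesc: MTLRenderPassDescriptor) -> MTLRenderCommandEncoder? {
    passDesc.tileWidth = 16
    passDesc.tileHeight = 16
    passDesc.threadgroupMemoryLength = 16          // bytes of persistent per-tile memory
    guard let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: passDesc) else {
        return nil
    }
    encoder.setThreadgroupMemoryLength(16, offset: 0, index: 0)
    return encoder
}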
Hi,
I am working with a large project. We compile each material to its own .metallib. They all include many common files full of inline functions, and at the end we link everything together into a single big pathtrace kernel. Everything works as expected, however the compile times have gotten completely out of hand and it takes multiple minutes to compile to native code at runtime. I gather that I can do this offline using metal-tt, but I am wondering whether there is a way to reduce the compile times in such a scenario, and how to investigate the root cause of the problem. I suspect it has to do with the fact that every material's metallib contains duplicates of all the inline functions. Any ideas on how to profile and debug this?
Thanks,
Rasmus
Hello. In the iOS app I'm working on, we are very tight on memory budget, and I was looking at ways to reduce our texture memory usage. However, I noticed that comparing ASTC 8x8 to ASTC 12x12, there is no actual difference in allocated memory for most of our textures, despite ASTC 12x12 having less than half the bits per pixel of 8x8. The difference between the two only becomes apparent for textures 1024x1024 and larger, and even then the actual texture data is sometimes only 60% of the allocation size. I understand there must be some alignment and padding going on, but this seems extreme. For an example scene in my app with ASTC 12x12 used for most textures, there is over a 100 MB difference between the ASTC size on disk and the size when loaded, so I would love to be able to recover even a portion of that memory.
Here is some test code with measurements I've taken on an iPhone 11:
for(int i = 0; i < 11; i++) {
    MTLTextureDescriptor *texDesc = [[MTLTextureDescriptor alloc] init];
    texDesc.pixelFormat = MTLPixelFormatASTC_12x12_LDR;
    int dim = 12;
    int n = 2 << i;
    int mips = i+1;
    texDesc.width = n;
    texDesc.height = n;
    texDesc.mipmapLevelCount = mips;
    texDesc.resourceOptions = MTLResourceStorageModeShared;
    texDesc.usage = MTLTextureUsageShaderRead;

    // Calculate the equivalent astc texture size
    int blocks = 0;
    if(mips == 1) {
        blocks = n/dim + (n%dim>0? 1 : 0);
        blocks *= blocks;
    } else {
        for(int j = 0; j < mips; j++) {
            int a = 2 << j;
            int cur = a/dim + (a%dim>0? 1 : 0);
            blocks += cur*cur;
        }
    }

    auto tex = [objCObj newTextureWithDescriptor:texDesc];
    printf("%dx%d, mips %d, Astc: %d, Metal: %d\n", n, n, mips, blocks*16, (int)tex.allocatedSize);
}
MTLPixelFormatASTC_12x12_LDR
128x128, mips 7, Astc: 2768, Metal: 6016
256x256, mips 8, Astc: 10512, Metal: 32768
512x512, mips 9, Astc: 40096, Metal: 98304
1024x1024, mips 10, Astc: 158432, Metal: 262144
128x128, mips 1, Astc: 1936, Metal: 4096
256x256, mips 1, Astc: 7744, Metal: 16384
512x512, mips 1, Astc: 29584, Metal: 65536
1024x1024, mips 1, Astc: 118336, Metal: 147456
MTLPixelFormatASTC_8x8_LDR
128x128, mips 7, Astc: 5488, Metal: 6016
256x256, mips 8, Astc: 21872, Metal: 32768
512x512, mips 9, Astc: 87408, Metal: 98304
1024x1024, mips 10, Astc: 349552, Metal: 360448
128x128, mips 1, Astc: 4096, Metal: 4096
256x256, mips 1, Astc: 16384, Metal: 16384
512x512, mips 1, Astc: 65536, Metal: 65536
1024x1024, mips 1, Astc: 262144, Metal: 262144
I also tried using MTLHeaps (placement and automatic) hoping they might be better, but saw nearly the same numbers.
Is there any way to have metal allocate these textures in a more compact way to save on memory?
Anyone else unable to download the "Rendering a Scene with Deferred Lighting in C++" (https://developer.apple.com/documentation/metal/rendering-a-scene-with-deferred-lighting-in-c++?language=objc)?
I just get an error page.
Is there another place to download this sample?
Hello!
I have a question about how threadgroups work with tile shading. When running "traditional" compute, I get to choose both the threadgroup size and the grid size. However, when using a tile shading kernel I only have the dispatchThreadsPerTile method, which controls how many threads run in each tile. So far so good, but what about threadgroups?
The examples in the video "Tile Shading on A11" seem to suggest that there will be only one threadgroup per tile. In the video, [[thread_index_in_threadgroup]] is called "local_id" and it is used to index into the image block.
I assume this is the default configuration. So when one does the following:
Creates MTLRenderPassDescriptor with tileWidth set to W and tileHeight set to H
Fires up the tile shading kernel using dispatchThreadsPerTile with MTLSize size = { W, H, 1 }
I understand that the result is a 1-to-1 mapping between the tile "pixels" and kernel threads. Now, what I would like to do is to have more than one threadgroup there. I want this for performance reasons: I have a certain compute kernel which I know executes very well with a small threadgroup size; in fact, { 32, 1, 1 } seems to be the fastest. My understanding is that even if I set the tile size to 16x16, and so execute 256 threads there, there will only be one SIMD group active in the threadgroup, meaning that this SIMD group has to execute 8 times over the tile.
Is this possible somehow? Or perhaps the limitations of the API point to limitations of the hardware itself, and if I want to execute with SIMD-group-sized threadgroups I have to use a "traditional" compute encoder?
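For concreteness, here is a hypothetical Swift sketch of the setup described above (a tile pipeline with a W x H tile and one thread per tile pixel); tileFunction and the pixel format are placeholders, and the sketch doesn't itself answer the multiple-threadgroups question.
import Metal

// Sketch: tile-shading dispatch with a 1-to-1 thread/pixel mapping per tile.
func encodeTilePass(device: MTLDevice,
                    commandBuffer: MTLCommandBuffer,
                    passDesc: MTLRenderPassDescriptor,
                    tileFunction: MTLFunction,
                    tileWidth W: Int,
                    tileHeight H: Int) throws {
    passDesc.tileWidth = W
    passDesc.tileHeight = H

    let tileDesc = MTLTileRenderPipelineDescriptor()
    tileDesc.tileFunction = tileFunction
    tileDesc.colorAttachments[0].pixelFormat = .bgra8Unorm   // placeholder format
    tileDesc.threadgroupSizeMatchesTileSize = true
    let pipeline = try device.makeRenderPipelineState(tileDescriptor: tileDesc,
                                                      options: [],
                                                      reflection: nil)

    guard let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: passDesc) else { return }
    encoder.setRenderPipelineState(pipeline)
    encoder.dispatchThreadsPerTile(MTLSize(width: W, height: H, depth: 1))
    encoder.endEncoding()
}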
Will be grateful for help.
Michał