I'm trying to obtain the intrinsic matrix for each video frame of an AVCaptureSession (the same intrinsic matrix as ARKit provides); however, the isCameraIntrinsicMatrixDeliverySupported property of AVCaptureConnection is false in my use case. The documentation of the property says: "This property's value is true only if both the connection's input device format and output class support delivery of camera intrinsics." How do I know which device formats support delivery of the intrinsic matrix? What do I need to do to enable intrinsic matrix delivery?

Simple code to illustrate my problem:

import UIKit
import AVFoundation
class ViewController: UIViewController {
    var sess: AVCaptureSession!
    var sessOut: AVCaptureVideoDataOutput!
    var prevLayer: AVCaptureVideoPreviewLayer!

    override func viewDidLoad() {
        super.viewDidLoad()
        sess = AVCaptureSession()
        let device = AVCaptureDevice.default(.builtInWideAngleCamera, for: AVMediaType.video, position: .back)
        let input = try! AVCaptureDeviceInput(device: device!)
        sess.addInput(input)
        sessOut = AVCaptureVideoDataOutput()
        sess.addOutput(sessOut)
        sessOut.connections.first?.videoOrientation = .landscapeRight
        sessOut.connections.first?.preferredVideoStabilizationMode = .cinematic
        print(sessOut.connections.first?.isCameraIntrinsicMatrixDeliverySupported) // <-- false - why?
        prevLayer = AVCaptureVideoPreviewLayer(session: sess)
        prevLayer.frame = self.view.frame
        prevLayer.videoGravity = .resizeAspectFill
        prevLayer.connection?.videoOrientation = .landscapeRight
        self.view.layer.addSublayer(prevLayer)
        sess.startRunning()
    }
}
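For reference, a minimal sketch of how intrinsic delivery is usually requested once a connection reports support for it. One assumption worth checking: stabilization modes such as .cinematic remap pixels and may be incompatible with intrinsic delivery, so trying with stabilization off could change the result. The delegate method below reads the matrix from the sample-buffer attachment:

```swift
// Sketch, not a confirmed fix: enable intrinsic delivery on the
// connection when it reports support.
if let connection = sessOut.connections.first,
   connection.isCameraIntrinsicMatrixDeliverySupported {
    connection.isCameraIntrinsicMatrixDeliveryEnabled = true
}

// In the AVCaptureVideoDataOutputSampleBufferDelegate callback, the
// intrinsic matrix arrives as a CFData attachment on each sample buffer:
func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    if let data = CMGetAttachment(sampleBuffer,
                                  key: kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix,
                                  attachmentModeOut: nil) as? Data {
        let intrinsics = data.withUnsafeBytes { $0.load(as: matrix_float3x3.self) }
        print(intrinsics)
    }
}
```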
Situation (simplified):

There are 3 Metal buffers (created on app initialization via MTLDevice, not via MTLHeap):
- Buffer A: contains a total of n float3 elements (which may represent particle positions)
- Buffer B: a double-buffered copy of A
- Buffer C: contains a total of n float elements (the i-th element of C corresponds to the i-th element of A (and also B))

I also have 3 kernels (Metal compute functions), which manipulate the 3 buffers:
- Kernel preprocess(A, B): its parameters are A and B. It reads (old) values from A and writes updated values into B. [1 thread = 1 element]
- Kernel process(B, C): a very computation-expensive/time-consuming kernel. (Just to be clear: each element of buffer B represents a particle position.) For each element/particle of B, the kernel finds the particle's nearest neighbors (i.e. their positions). After the neighbors are found, it reduces their positions into one float value, which is written to buffer C. [1 thread = 1 element (i.e. particle)]
- Kernel postprocess(A, B, C): it reads elements from B and C, combines them in a simple way and writes the result back to A (something like A[i] = B[i]*C[i]). [1 thread = 1 element]

Problem:

The kernel process takes a serious amount of time to complete, because for each particle it iterates over a range of B's values. Particles that are near each other (i.e. adjacent threads) iterate over very similar ranges of B, so there is heavy memory reuse of B between adjacent threads. The kernel is declared similarly to this:

kernel void process(..., device float3* B, ...) {...}

Question:

Since the kernel process repeatedly loads the same regions of B (even though I use threadgroup memory) and B is read-only in this kernel, I thought it would be a good idea for performance to cache buffer B, i.e. to transfer it into a highly cached portion of memory optimized for repeated reads. AFAIK such a portion of memory is the constant address space. Is it possible to do the following, and if so, how?

1. Call the kernel preprocess(A, B) [this updates the contents of buffer B, so buffer B cannot be in the constant address space]
2. Transfer buffer B (containing the new, updated values from the previous step) into the constant address space. <--- How do I do this?
3. Call the kernel process(B, C) [read values repeatedly from buffer B, which is now in the constant address space, improving performance]
4. Call the kernel postprocess(A, B, C)

Extra question:

When I create a buffer at app initialization, I don't specify which address space it belongs to, so when is it decided whether a buffer belongs to the device or the constant address space? Is it per command buffer, per encoder, or per individual pipeline state of one encoder? Or does a buffer's position among the address spaces change throughout the app's lifetime?

Example: say I have 1 buffer and 2 kernels:

// CPU side:
var data = device.makeBuffer(...)
commandEncoder.setComputePipelineState(fooPipeline)
commandEncoder.setBuffer(data, offset: 0, at: 0)
commandEncoder.dispatchThreadgroups(...)
commandEncoder.setComputePipelineState(barPipeline)
commandEncoder.setBuffer(data, offset: 0, at: 0)
commandEncoder.dispatchThreadgroups(...)
commandEncoder.endEncoding()
// GPU side:
kernel void foo(device float* data) {...}   // foo writes to data
kernel void bar(constant float& data) {...} // bar only reads from data

Where is the buffer data stored: in the device or the (cached) constant address space? What would happen if I swapped the order in which foo and bar are dispatched?
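Not an authoritative answer, but my understanding of the mechanics is that the address space is declared per kernel parameter, not per MTLBuffer: the same buffer object can be bound to a device pointer in one pipeline and a constant pointer in another, with no explicit transfer in between. Assuming that holds, step 2 above would not be a separate operation at all; the process kernel would simply declare B as constant (the body below is a placeholder, not the actual neighbor search):

```metal
// Sketch: B is written through a `device` pointer in preprocess, then the
// very same MTLBuffer is bound to this `constant` parameter for process.
kernel void process(constant float3 *B  [[buffer(0)]],
                    device   float  *C  [[buffer(1)]],
                    uint id [[thread_position_in_grid]])
{
    // ... read-only neighbor search over B would go here ...
    C[id] = B[id].x; // placeholder for the actual reduction
}
```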
I'm trying to profile my Metal compute shaders on iOS (macOS 10.13.3, Xcode 9.3 beta 3, iOS 11.3 public beta 3), but after the GPU Capture shows the list of profiled kernels, there are no execution times for them and I cannot see the percentage of time spent on particular lines of code (as I used to). When I select the "Performance" tool under a specific kernel in the profiler list (in the image below), the explorer shows "Waiting for Shader Profiler…" and nothing happens (I tried waiting 5+ minutes, but nothing changed and no times were reported). The issue isn't present when I profile vertex/fragment shaders: times are displayed correctly for them.

I use MTLCaptureManager to start capturing a GPU frame:

MTLCaptureManager.shared().startCapture(commandQueue: queue)
let cmdbuff = queue.makeCommandBuffer()!
// encode commands into the buffer using a ComputeCommandEncoder
cmdbuff.commit()
// Wait until Xcode's GPU Capture button (with the camera-like icon)
// starts flashing a red circle, then manually click it.

Screens:
i.imgur.com/VrOsuDu.png
i.imgur.com/QJlowFA.png
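Unrelated to the missing times, but the manual click at the end can be replaced by ending the capture in code (a sketch, assuming the same queue and command buffer as above):

```swift
// Sketch: end the programmatic capture once the GPU work has finished,
// instead of clicking the capture button in Xcode.
cmdbuff.waitUntilCompleted()
MTLCaptureManager.shared().stopCapture()
```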
I need to perform an atomic-max operation on 64-bit integers. I know that no types other than int, uint and bool have official support for atomic operations, but given that you can perform (officially unsupported) 64-bit integer math in Metal on iOS, I guess there might be undocumented AIR assembly instructions for 64-bit atomic operations (I tried the obvious ones, but without any success: Xcode outputs "Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED"). How can I list all supported AIR assembly operations?
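No answer on the AIR side, but one workaround that avoids 64-bit atomics entirely is a hierarchical reduction: each threadgroup computes its local maximum in threadgroup memory, one thread per group writes that maximum to a partial-results buffer, and the small set of per-group results is reduced in a second pass or on the CPU. A sketch, assuming the threadgroup size is a power of two and relying on the same unofficial 64-bit integer (ulong) support mentioned above:

```metal
// Sketch: per-threadgroup max of 64-bit values without any 64-bit atomics.
// groupMax is threadgroup memory sized to threads_per_threadgroup;
// partial[grp] receives the max of group grp for a later reduction pass.
kernel void group_max(device const ulong *input    [[buffer(0)]],
                      device ulong       *partial  [[buffer(1)]],
                      threadgroup ulong  *groupMax [[threadgroup(0)]],
                      uint gid [[thread_position_in_grid]],
                      uint lid [[thread_position_in_threadgroup]],
                      uint gsz [[threads_per_threadgroup]],
                      uint grp [[threadgroup_position_in_grid]])
{
    groupMax[lid] = input[gid];
    threadgroup_barrier(mem_flags::mem_threadgroup);
    // Tree reduction: halve the active range each step.
    for (uint stride = gsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            groupMax[lid] = max(groupMax[lid], groupMax[lid + stride]);
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
    if (lid == 0)
        partial[grp] = groupMax[0];
}
```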
I need to render a scene at 60 FPS and at the same time run a Metal kernel function. The problem is that while the rendering pipeline is fast (5 ms) and can easily hit 60 FPS, the kernel can take up to 90 ms to complete. This obviously causes extreme stuttering. I tried creating 2 MTLCommandQueues, one for the rendering pipeline and one for the compute task, which didn't help with concurrent execution at all: the render and compute tasks still execute sequentially. I think I'd need some kind of QoS; given the 16.67 ms time budget per frame, always execute the rendering task and run the compute task for the remaining (16.67-5) ms, effectively spreading the compute task over several frames. Has anyone successfully tackled this problem?
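One approach I can sketch (assuming the kernel's work items are independent, so the grid can be partitioned) is to encode only a slice of the compute grid each frame alongside the render pass. The slice offset here is passed to the kernel via setBytes; the names, the slice count, and the offset convention are all made up for illustration:

```swift
// Sketch: spread a ~90 ms compute grid over several frames by
// dispatching one slice per frame next to the render pass.
let totalThreadgroups = 4096
let slices = 8               // ~90 ms / 8 slices ≈ 11 ms of compute per frame
var nextSlice = 0

func encodeFrame(commandBuffer: MTLCommandBuffer,
                 computePipeline: MTLComputePipelineState) {
    // ... encode the render pass first (~5 ms) ...
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(computePipeline)
    let groupsPerSlice = totalThreadgroups / slices
    // The kernel reads this offset to know which part of the grid it owns.
    var offset = UInt32(nextSlice * groupsPerSlice)
    encoder.setBytes(&offset, length: MemoryLayout<UInt32>.size, index: 1)
    encoder.dispatchThreadgroups(
        MTLSize(width: groupsPerSlice, height: 1, depth: 1),
        threadsPerThreadgroup: MTLSize(width: 64, height: 1, depth: 1))
    encoder.endEncoding()
    nextSlice = (nextSlice + 1) % slices
}
```

The trade-off is latency: a full pass of the compute task now takes `slices` frames to complete, so this only fits workloads that tolerate results arriving over several frames.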
AFAIK there's no way an app can require the presence of the TrueDepth camera on a device via UIRequiredDeviceCapabilities in Info.plist. What's the logic behind this? What if an app relies solely on the TrueDepth camera and couldn't possibly run without it? I suppose one can't simply show an error message when the app is run on a device without a TrueDepth camera, so how should this issue be dealt with?
It seems that the SIMD-group shuffle instructions (e.g. simd_shuffle(), simd_shuffle_down(), simd_shuffle_xor() etc. – subsection 5.13) aren't supported on iOS (#include <metal_simdgroup> throws an error). I'd like to know why these instructions aren't available on iOS: is it because the GPU architectures don't allow these instructions to be implemented (and if so, why?), or is there some other reason? Furthermore, is there a plan to bring these instructions to iOS?