Fundamentally, my question is: is there a known transform I can apply to a given (pixel) position (passed into a Metal fragment function) to correctly sample a texture provided by the main cameras and processed by a Vision request? If so, what is it? If not, how can I accurately sample my masks?
My goal is to highlight people in a Vision Pro app using Compositor Services.
To start, I asynchronously receive camera frames for the main left and right cameras. This is the breakdown of the specific CameraVideoFormat I pass along to the CameraFrameProvider:
- minFrameDuration: 0.03
- maxFrameDuration: 0.033333335
- frameSize: (1920.0, 1080.0)
- pixelFormat: 875704422
- cameraType: main
- cameraPositions: [left, right]
- cameraRectification: mono
From each camera frame sample, I extract the left and right buffers (CVReadOnlyPixelBuffer.withUnsafeBuffer ==> CVPixelBuffer).
I asynchronously process the extracted buffers by performing a VNGeneratePersonSegmentationRequest on both of them:
// NOTE: This block of code and all following code blocks contain simplified representations of my code for clarity's sake.
let request = VNGeneratePersonSegmentationRequest()
request.qualityLevel = .balanced
request.outputPixelFormat = kCVPixelFormatType_OneComponent8
...
let lHandler = VNSequenceRequestHandler()
let rHandler = VNSequenceRequestHandler()
...
func processBuffers() async throws {
    try lHandler.perform([request], on: lBuffer)
    guard let lMask = request.results?.first?.pixelBuffer else {...}
    try rHandler.perform([request], on: rBuffer)
    guard let rMask = request.results?.first?.pixelBuffer else {...}
    appModel.latestPersonMasks = (lMask, rMask)
}
I store the two resulting CVPixelBuffers in my appModel. Both of these buffers (the grayscale masks) have the following properties:
- width (in pixels) = 512
- height (in pixels) = 384
- bytes per row = 512
- plane count = 0
- pixel format type = 1278226488
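The numeric pixel format values listed for the camera frames and the masks are four-character codes packed into a 32-bit integer. As a minimal sketch (the helper name fourCC is hypothetical, not part of any Apple framework), they can be decoded like this to confirm which formats are in play:

```swift
// Decode a CoreVideo pixel format code (an OSType) into its
// four-character string, most significant byte first.
// `fourCC` is an illustrative helper, not a framework API.
func fourCC(_ code: UInt32) -> String {
    let bytes: [UInt8] = [
        UInt8((code >> 24) & 0xFF),
        UInt8((code >> 16) & 0xFF),
        UInt8((code >> 8) & 0xFF),
        UInt8(code & 0xFF),
    ]
    return String(decoding: bytes, as: UTF8.self)
}

print(fourCC(875704422))   // camera frames: prints "420f"
print(fourCC(1278226488))  // Vision masks: prints "L008"
```

'420f' corresponds to kCVPixelFormatType_420YpCbCr8BiPlanarFullRange, and 'L008' corresponds to kCVPixelFormatType_OneComponent8, which matches the outputPixelFormat set on the request.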
I am using Compositor Services to render my content in an Immersive Space. My implementation of Compositor Services is based on the sample code from "Interacting with virtual content blended with passthrough".
Within Shaders.metal, the tint's fragment shader is now passed the grayscale masks (converted from CVPixelBuffer to MTLTexture via CVMetalTextureCacheCreateTextureFromImage() at the beginning of the main render pipeline):
fragment float4 tintFragmentShader(
                                   TintInOut in [[stage_in]],
                                   ushort amp_id [[amplification_id]],
                                   texture2d<uint> leftMask [[texture(0)]],
                                   texture2d<uint> rightMask [[texture(1)]]
                                   )
{
    if (in.color.a <= 0.0) {
        discard_fragment();
    }
    float2 uv;
    
    if (amp_id == 0) { // LEFT
        uv = ??????????????????????;
    } else { // RIGHT
        uv = ??????????????????????;
    }
    
    constexpr sampler linearSampler (mip_filter::linear, mag_filter::linear, min_filter::linear);
    
    // Sample the PersonSegmentation grayscale mask
    float maskValue = 0.0;
    
    if (amp_id == 0) { // LEFT
        if (leftMask.get_width() > 0) {
            maskValue = leftMask.sample(linearSampler, uv).r;
        }
    } else { // RIGHT
        if (rightMask.get_width() > 0) {
            maskValue = rightMask.sample(linearSampler, uv).r;
        }
    }
    
    if (maskValue > 250) {
        return float4(1.0, 1.0, 1.0, 0.5);
    }
    return in.color;
}
I need to correctly sample the masks for a given fragment.
The LayerRenderer.Layout is set to .layered. From the developer documentation:
A layout that specifies each view’s content as a slice of a single texture.
Using the Metal debugger, I know that the final render target texture for each view / eye is 1888 x 1792 pixels, giving an aspect ratio of 59:56.
The initial CVPixelBuffer provided by the main left and right cameras is 1920x1080 (16:9).
The grayscale CVPixelBuffer returned by the VNGeneratePersonSegmentationRequest is 512x384 (4:3).
All of these aspect ratios are different.
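To make the mismatch concrete, here is a small sketch of the arithmetic. The PixelSize type and remap function are hypothetical names used only for illustration. The remap step assumes that the mask covers the full camera frame edge to edge (so that normalized coordinates carry over directly between the two), which I have not verified; it says nothing about how the render target relates to the camera frame.

```swift
// Illustrative value type; not part of any Apple framework.
struct PixelSize {
    let width: Double
    let height: Double
    var aspectRatio: Double { width / height }
}

let renderTarget = PixelSize(width: 1888, height: 1792)  // per-eye render target
let cameraFrame = PixelSize(width: 1920, height: 1080)   // main camera buffer
let personMask = PixelSize(width: 512, height: 384)      // Vision output mask

// Map a pixel position in one image to the matching pixel in another,
// ASSUMING both images cover the same field of view edge to edge.
func remap(x: Double, y: Double,
           from src: PixelSize, to dst: PixelSize) -> (x: Double, y: Double) {
    let u = x / src.width   // normalized horizontal coordinate, 0...1
    let v = y / src.height  // normalized vertical coordinate, 0...1
    return (u * dst.width, v * dst.height)
}

print(renderTarget.aspectRatio)  // 59:56, about 1.054
print(cameraFrame.aspectRatio)   // 16:9, about 1.778
print(personMask.aspectRatio)    // 4:3, about 1.333

// The center of the camera frame lands on the center of the mask.
let center = remap(x: 960, y: 540, from: cameraFrame, to: personMask)
print(center)  // (x: 256.0, y: 192.0)
```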
My question comes down to: is there a known transform I can apply to a given (pixel) position to correctly sample a texture provided by the main cameras and processed by a Vision request? If so, what is it? If not, how can I accurately sample my masks?
Within the tint's vertex shader, after applying the modelViewProjectionMatrix, I have tried every formula I have been able to find that takes the pixel-space position (= vertices[vertexID].position.xy) and the viewport size (1888x1792) to compute the correct clip-space position of the grayscale masks (maybe = pixel space position.xy / (viewport size * 0.5)???), but nothing has worked. The "highlight" of the person segmentations is off: scaled a little too big, and offset a little too far up and off to the side.
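For reference, this is a sketch of the standard conversions between viewport pixels, Metal normalized device coordinates, and texture coordinates, written in Swift for clarity (the function names are hypothetical). Note that dividing by half the viewport size on its own gives a 0...2 range, so an offset of 1 and a flip of the y axis are also needed. This only covers the viewport conversion; it does not account for any difference between the rendered view's projection and the camera's, so it may not be the whole answer.

```swift
// Convert a pixel position in a viewport (origin top-left, y down) to
// Metal normalized device coordinates (origin center, y up, range -1...1).
func pixelToNDC(x: Double, y: Double,
                viewportWidth: Double, viewportHeight: Double) -> (x: Double, y: Double) {
    return (2.0 * x / viewportWidth - 1.0,
            1.0 - 2.0 * y / viewportHeight)
}

// Convert normalized device coordinates to texture coordinates
// (origin top-left, y down, range 0...1).
func ndcToUV(x: Double, y: Double) -> (u: Double, v: Double) {
    return (0.5 * x + 0.5, 0.5 - 0.5 * y)
}

// The center pixel of the 1888x1792 render target maps to the NDC origin
// and to the middle of the texture.
let ndc = pixelToNDC(x: 944, y: 896, viewportWidth: 1888, viewportHeight: 1792)
let uv = ndcToUV(x: ndc.x, y: ndc.y)
print(ndc)  // (x: 0.0, y: 0.0)
print(uv)   // (u: 0.5, v: 0.5)
```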
