How to highlight people in a Vision Pro app using Compositor Services

Fundamentally, my questions are: is there a known transform I can apply to a given (pixel) position (passed into a Metal fragment function) to correctly sample a texture provided by the main cameras and processed by a Vision request? If so, what is it? If not, how can I accurately sample my masks?

My goal is to highlight people in a Vision Pro app using Compositor Services.

To start, I asynchronously receive camera frames for the main left and right cameras. This is the breakdown of the specific CameraVideoFormat I pass along to the CameraFrameProvider:

  • minFrameDuration: 0.03
  • maxFrameDuration: 0.033333335
  • frameSize: (1920.0, 1080.0)
  • pixelFormat: 875704422
  • cameraType: main
  • cameraPositions: [left, right]
  • cameraRectification: mono
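
To make the setup concrete, here is a rough sketch of my frame loop (simplified like the other snippets below; the exact CameraFrameProvider / CameraVideoFormat call signatures are approximations and may differ slightly by SDK version):

import ARKit

let session = ARKitSession()
let cameraFrameProvider = CameraFrameProvider()

func startCameraFrames() async throws {
    // Pick the 1920x1080 main-camera format described above.
    let formats = CameraVideoFormat.supportedVideoFormats(for: .main,
                                                          cameraPositions: [.left, .right])
    guard let format = formats.first else { return }

    try await session.run([cameraFrameProvider])

    guard let updates = cameraFrameProvider.cameraFrameUpdates(for: format) else { return }
    for await frame in updates {
        guard let left = frame.sample(for: .left),
              let right = frame.sample(for: .right) else { continue }
        // Extract the left/right pixel buffers here (see below) and hand them
        // off to the person segmentation pass.
    }
}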

From each camera frame sample, I extract the left and right buffers (CVReadOnlyPixelBuffer.withUnsafeBuffer ==> CVPixelBuffer).

I asynchronously process the extracted buffers by performing a VNGeneratePersonSegmentationRequest on both of them:

// NOTE: This block of code and all following code blocks contain simplified representations of my code for clarity's sake.

import Vision

let request = VNGeneratePersonSegmentationRequest()
request.qualityLevel = .balanced
request.outputPixelFormat = kCVPixelFormatType_OneComponent8
...
let lHandler = VNSequenceRequestHandler()
let rHandler = VNSequenceRequestHandler()
...

func processBuffers() async throws {
    try lHandler.perform([request], on: lBuffer)
    guard let lMask = request.results?.first?.pixelBuffer else {...}

    try rHandler.perform([request], on: rBuffer)
    guard let rMask = request.results?.first?.pixelBuffer else {...}

    appModel.latestPersonMasks = (lMask, rMask)
}

I store the two resulting CVPixelBuffers in my appModel. Both of these buffers (the grayscale masks) have the following properties:

  • width (in pixels) = 512
  • height (in pixels) = 384
  • bytes per row = 512
  • plane count = 0
  • pixel format type = 1278226488

I am using Compositor Services to render my content in an Immersive Space. My implementation of Compositor Services is based on the sample code from Interacting with virtual content blended with passthrough.

Within Shaders.metal, the tint's fragment shader is now passed the grayscale masks (converted from CVPixelBuffer to MTLTexture via CVMetalTextureCacheCreateTextureFromImage() at the beginning of the main render pipeline).
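
That conversion step looks roughly like this (simplified; makeMaskTexture is an illustrative name, and it assumes a CVMetalTextureCache created earlier with CVMetalTextureCacheCreate() plus .r8Unorm for the one-component masks):

import CoreVideo
import Metal

// Simplified sketch: wrap a kCVPixelFormatType_OneComponent8 mask as an MTLTexture.
func makeMaskTexture(from mask: CVPixelBuffer,
                     cache: CVMetalTextureCache) -> MTLTexture? {
    var cvTexture: CVMetalTexture?
    let status = CVMetalTextureCacheCreateTextureFromImage(kCFAllocatorDefault,
                                                           cache,
                                                           mask,
                                                           nil,
                                                           .r8Unorm, // one 8-bit channel
                                                           CVPixelBufferGetWidth(mask),  // 512
                                                           CVPixelBufferGetHeight(mask), // 384
                                                           0,        // plane index
                                                           &cvTexture)
    guard status == kCVReturnSuccess, let cvTexture else { return nil }
    return CVMetalTextureGetTexture(cvTexture)
}

The tint's fragment shader then samples those mask textures: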

fragment float4 tintFragmentShader(
                                   TintInOut in [[stage_in]],
                                   ushort amp_id [[amplification_id]],
                                   texture2d<float> leftMask [[texture(0)]],
                                   texture2d<float> rightMask [[texture(1)]]
                                   )
{
    if (in.color.a <= 0.0) {
        discard_fragment();
    }

    float2 uv;
    
    if (amp_id == 0) { // LEFT
        uv = ??????????????????????;
    } else { // RIGHT
        uv = ??????????????????????;
    }
    
    constexpr sampler linearSampler (mip_filter::linear, mag_filter::linear, min_filter::linear);
    
    // Sample the PersonSegmentation grayscale mask
    float maskValue = 0.0;
    
    if (amp_id == 0) { // LEFT
        if (leftMask.get_width() > 0) {
            maskValue = leftMask.sample(linearSampler, uv).r;
        }
    } else { // RIGHT
        if (rightMask.get_width() > 0) {
            maskValue = rightMask.sample(linearSampler, uv).r;
        }
    }
    
    if (maskValue > 0.98) { // ~250/255 once the mask is sampled as a normalized float
        return float4(1.0, 1.0, 1.0, 0.5);
    }

    return in.color;
}

I need to correctly sample the masks for a given fragment.

The LayerRenderer.Layout is set to .layered. From the developer documentation:

A layout that specifies each view’s content as a slice of a single texture.

Using the Metal debugger, I know that the final render target texture for each view / eye is 1888 x 1792 pixels, giving an aspect ratio of 59:56.

The initial CVPixelBuffer provided by the main left and right cameras is 1920x1080 (16:9).

The grayscale CVPixelBuffer returned by the VNGeneratePersonSegmentationRequest is 512x384 (4:3).

All of these aspect ratios are different.

My questions come down to: is there a known transform I can apply to a given (pixel) position to correctly sample a texture provided by the main cameras and processed by a Vision request? If so, what is it? If not, how can I accurately sample my masks?

Within the tint's vertex shader, after applying the modelViewProjectionMatrix, I have tried every formulation I could find that takes the pixel-space position (= vertices[vertexID].position.xy) and the viewport size (1888x1792) to compute the correct clip-space position (maybe = pixel-space position.xy / (viewport size * 0.5)?) for sampling the grayscale masks, but nothing has worked. The "highlight" of the person segmentation is off: scaled a little too big, and offset a little too far up and off to the side.

You seem to be facing two separate problems, and I'll try to provide guidance to address both.

  1. You need to map the frames returned by CameraFrameProvider to what needs to be rendered so it matches the passthrough in Compositor Services screen space.

To solve this, see if you can use the following method to render the original frame, as returned by CameraFrameProvider, with 0.5 alpha on top of the passthrough and make it match.

For each CameraFrameUpdates you get a CameraFrame, which has Parameters containing:

/// The camera intrinsics.
public var intrinsics: simd_float3x3

/// The camera extrinsics.
public var extrinsics: simd_float4x4

With these, you can apply the following transformation sequence to each corner of the image:

CameraFrame Pixel -> (use intrinsics) -> 3D Point in Camera Space -> (use extrinsics inverse) -> 3D Point in World Space -> (use CS view/projection matrices) -> Compositor Services Screen Coordinate

This lets you overlay the quad that contains the CameraFrame on top of your passthrough.
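
As a rough illustration of that chain (a sketch only, not drop-in code; it assumes the ARKit-style intrinsics layout with fx/fy on the diagonal and cx/cy in the last column, and that extrinsics maps world space into camera space, so its inverse goes the other way):

import simd

// Sketch: take a pixel in the 1920x1080 CameraFrame and return a point in world
// space at an arbitrary distance along the corresponding camera ray.
func worldPoint(forCameraPixel pixel: SIMD2<Float>,
                intrinsics: simd_float3x3,
                extrinsics: simd_float4x4,
                depth: Float) -> SIMD3<Float> {
    let fx = intrinsics.columns.0.x
    let fy = intrinsics.columns.1.y
    let cx = intrinsics.columns.2.x
    let cy = intrinsics.columns.2.y

    // Pixel -> camera-space ray, pushed out to the chosen depth.
    // NOTE: sign conventions (image y-down, camera forward axis) may need
    // adjusting for the actual camera space ARKit uses.
    let xCam = (pixel.x - cx) / fx * depth
    let yCam = (pixel.y - cy) / fy * depth
    let cameraPoint = SIMD4<Float>(xCam, yCam, depth, 1)

    // Camera space -> world space via the inverse extrinsics.
    let world = extrinsics.inverse * cameraPoint
    return SIMD3<Float>(world.x, world.y, world.z)
}

From world space, the Compositor Services view and projection matrices for each drawable view take each corner the rest of the way to screen space, exactly as they do for your scene geometry.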

Please do note that this will result in a quad that doesn't cover the full passthrough. It'll be a slightly smaller crop of it. This is a currently known limitation of how CameraFrameProvider works for producing these images, and there's no easy way to fix that.

Also, if you need frames with a different aspect ratio than 1920 x 1080 for this use case, please submit a separate feedback request.

  2. The person segmentation request alters the aspect ratio of the frames obtained from CameraFrameProvider.

The output mask's resolution is always the same regardless of the input image; it is determined by the underlying segmentation model, which is why the aspect ratio changes.

In a use case with no shaders, you could use vImageScale_Planar8 or some other means to resize the mask back to the original aspect ratio, and then it would match your input.
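
For example, something along these lines (a sketch; it assumes both buffers are kCVPixelFormatType_OneComponent8 and that the destination buffer was created ahead of time at the camera frame's 1920 x 1080 size):

import Accelerate
import CoreVideo

// Sketch: stretch the 512x384 segmentation mask back up to the camera frame size.
func resizeMask(_ mask: CVPixelBuffer, into output: CVPixelBuffer) {
    CVPixelBufferLockBaseAddress(mask, .readOnly)
    CVPixelBufferLockBaseAddress(output, [])
    defer {
        CVPixelBufferUnlockBaseAddress(mask, .readOnly)
        CVPixelBufferUnlockBaseAddress(output, [])
    }

    var src = vImage_Buffer(data: CVPixelBufferGetBaseAddress(mask),
                            height: vImagePixelCount(CVPixelBufferGetHeight(mask)),
                            width: vImagePixelCount(CVPixelBufferGetWidth(mask)),
                            rowBytes: CVPixelBufferGetBytesPerRow(mask))
    var dst = vImage_Buffer(data: CVPixelBufferGetBaseAddress(output),
                            height: vImagePixelCount(CVPixelBufferGetHeight(output)),
                            width: vImagePixelCount(CVPixelBufferGetWidth(output)),
                            rowBytes: CVPixelBufferGetBytesPerRow(output))

    // High-quality resampling softens the upscaled mask edges somewhat.
    vImageScale_Planar8(&src, &dst, nil, vImage_Flags(kvImageHighQualityResampling))
}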

In your use case, you can probably apply the same transformation I detailed in step 1 to match the mask to the passthrough image directly in the shader.

Do beware that if the original image is much larger than the mask, it can sometimes cause a pixelated effect around the subject edges in the resized mask. This can be avoided by using a quality level of .accurate, which produces a larger mask, but is also slower.

Please do let us know if you have followup questions.
