How to find the camera transform (or view matrix) in world coordinates from a camera frame

I'm trying to implement a prototype that renders virtual objects in a mixed immersive space onto the camera frames captured by CameraFrameProvider.

Here's what I've done so far (a condensed sketch of steps 1-3 follows the list):

  1. Get the camera's intrinsics from frame.primarySample.parameters.intrinsics
  2. Get the camera's extrinsics from frame.primarySample.parameters.extrinsics
  3. Get the device anchor via worldTrackingProvider.queryDeviceAnchor(atTimestamp: CACurrentMediaTime())
  4. Set up a RealityKit.RealityRenderer to render virtual objects onto the captured camera frames
        let realityRenderer = try RealityKit.RealityRenderer()
        realityRenderer.cameraSettings.colorBackground = .outputTexture()
        let cameraEntity = PerspectiveCamera()
        // see https://developer.apple.com/forums/thread/770235 
        let cameraTransform = deviceAnchor.originFromAnchorTransform * extrinsics.inverse
        
        cameraEntity.setTransformMatrix(cameraTransform, relativeTo: nil)
        cameraEntity.camera.near = 0.01
        cameraEntity.camera.far = 100
        cameraEntity.camera.fieldOfViewOrientation = .horizontal
        // manually calculated based on camera intrinsics
        cameraEntity.camera.fieldOfViewInDegrees = 105 

        realityRenderer.entities.append(cameraEntity)
        realityRenderer.activeCamera = cameraEntity
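
For reference, here's a condensed sketch of how steps 1-3 are wired up in my prototype (the enterprise camera-access entitlement, authorization, and error handling are omitted, and the video format choice is just an example):

import ARKit
import QuartzCore

let session = ARKitSession()
let worldTrackingProvider = WorldTrackingProvider()
let cameraFrameProvider = CameraFrameProvider()
try await session.run([worldTrackingProvider, cameraFrameProvider])

// Pick a supported format for the main (left) camera; the choice here is arbitrary.
let formats = CameraVideoFormat.supportedVideoFormats(for: .main, cameraPositions: [.left])
guard let updates = cameraFrameProvider.cameraFrameUpdates(for: formats[0]) else {
    fatalError("no camera frame updates")
}

for await frame in updates {
    let intrinsics = frame.primarySample.parameters.intrinsics      // step 1
    let extrinsics = frame.primarySample.parameters.extrinsics      // step 2
    let deviceAnchor = worldTrackingProvider.queryDeviceAnchor(     // step 3
        atTimestamp: CACurrentMediaTime())
    // step 4: feed these into the RealityRenderer setup above
}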

With this camera transform, virtual objects that should be visible in the camera frames are clipped out.

If I use deviceAnchor.originFromAnchorTransform alone as the camera transform, virtual objects are rendered on the camera frames, but at the wrong positions (I believe this is because the camera extrinsics aren't used to adjust the camera to its correct position).

My question is: how should the camera extrinsic matrix be used for this purpose?

Do the camera extrinsics describe an orientation similar to the device anchor's, with some minor rotation and position offset? Here are the extrinsics from one camera frame. It looks like the directions of the Y-axis and Z-axis are flipped by the extrinsics, so the camera points in the wrong direction (a guessed correction follows the matrix below).

simd_float4x4([[0.9914258, 0.012555369, -0.13006608, 0.0], // X-axis
[-0.0009778949, -0.9946325, -0.10346654, 0.0], // Y-axis
[-0.13066702, 0.10270659, -0.98609203, 0.0],  // Z-axis
[0.024519, -0.019568002, -0.058280986, 1.0]]) // translation
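
If it helps frame the question: the flip matches the usual difference between computer-vision camera conventions (+Z forward, +Y down) and RealityKit's camera convention (-Z forward, +Y up). Purely as a guess on my part (not confirmed by any docs), a 180° rotation about X applied in the camera's local frame would account for it:

import simd

// Assumption: the extrinsics use a vision-style camera frame (+Z forward,
// +Y down). Rotating 180 degrees about X negates Y and Z, mapping it into
// RealityKit's camera frame (-Z forward, +Y up).
let flipYZ = simd_float4x4(
    simd_float4(1,  0,  0, 0),
    simd_float4(0, -1,  0, 0),
    simd_float4(0,  0, -1, 0),
    simd_float4(0,  0,  0, 1))

// Guessed correction, applied on the right so it acts in the camera's local frame:
let cameraTransform = deviceAnchor.originFromAnchorTransform
    * extrinsics.inverse
    * flipYZ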

Hi @hale_xie

I'm investigating this. In the meantime, the Passthrough in screen capture enterprise API provides access to a composite feed of what a Vision Pro wearer sees: both the physical world and overlaid digital content. Could this API help you achieve your goal?

Thank you for replying. We have tried Passthrough in screen capture, and it works. However, we are researching two further points:

  1. Whether we can capture only the required objects and exclude unwanted ones.

  2. Whether we can unproject a point on a captured frame back into world space (see the sketch after this list).
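
For point 2, what we have in mind is the standard pinhole unprojection. A generic sketch of the math (our own assumption about the camera model, not a documented visionOS API); it still needs a depth value, and the axis-convention caveat from the original post applies:

import simd

// Recover a world-space point from a pixel (u, v), a depth d along the
// view ray, the 3x3 intrinsics K, and the camera's world transform.
func unproject(u: Float, v: Float, depth: Float,
               intrinsics K: simd_float3x3,
               worldFromCamera: simd_float4x4) -> simd_float3 {
    let rayCamera = K.inverse * simd_float3(u, v, 1)   // ray direction in camera space
    let pointCamera = rayCamera * depth                // scale by depth
    let pointWorld = worldFromCamera * simd_float4(pointCamera, 1)
    return simd_float3(pointWorld.x, pointWorld.y, pointWorld.z)
}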

Accepted Answer

Hi @hale_xie

I did some prototyping over the weekend and came up with something that's close, but not perfect. Specifically, the misalignment increases as the angle between an object and the camera increases. I'd appreciate it if you'd file a feedback report requesting an abstraction to simplify offline rendering with passthrough. Be sure to detail your use case.

Now on to the solution, which uses ProjectiveTransformCameraComponent instead of PerspectiveCamera.

Here's a class to render a scene with passthrough. Construct it with the root entity you want to render. When CameraFrameProvider delivers an update, call render to obtain a UIImage of the scene. (A sketch of a call site follows the code.)

import SwiftUI
import RealityKit
import ARKit

@MainActor
final class EntityToImage {
    let renderer: RealityRenderer?
    let cameraEntity = Entity()

    init(root: Entity) {
        renderer = try? RealityRenderer()
        renderer?.entities.append(root)
        renderer?.entities.append(cameraEntity)
    }
    
    // Build a 4x4 projection matrix from the 3x3 intrinsics and the
    // rotation/translation carried by the extrinsics.
    private func computeProjectionMatrix(
        intrinsics: simd_float3x3,
        extrinsics: simd_float4x4
    ) -> simd_float4x4 {
        // The upper-left 3x3 of the extrinsics is the rotation; the last column is the translation.
        let rotation = simd_float3x3(extrinsics.columns.0.xyz, extrinsics.columns.1.xyz, extrinsics.columns.2.xyz)
        let translation = extrinsics.columns.3.xyz
        let projectionMatrix3x4 = intrinsics * rotation
        let projectionTranslation = intrinsics * translation

        // Expand to 4x4, placing each component of the projected translation
        // in the w slot of the corresponding column.
        return simd_float4x4(
            simd_float4(projectionMatrix3x4.columns.0, projectionTranslation.x),
            simd_float4(projectionMatrix3x4.columns.1, projectionTranslation.y),
            simd_float4(projectionMatrix3x4.columns.2, projectionTranslation.z),
            simd_float4(0, 0, 0, 1)
        )
    }
    
    // Adjust the intrinsics for ProjectiveTransformCameraComponent: center the
    // principal point and rescale the focal lengths via the PhysicalMetricsConverter.
    private func fixIntrinsics(_ intrinsics: simd_float3x3, physicalMetrics: PhysicalMetricsConverter) -> simd_float3x3 {
        let cx: Float = 0
        let cy: Float = 0
        let fx: Float = -physicalMetrics.convert(intrinsics.columns.0.x, to: .meters) * 2.0
        let fy: Float = -physicalMetrics.convert(intrinsics.columns.1.y, to: .meters) * 2.0

        return simd_float3x3([[fx, 0, cx],
                              [0, fy, cy],
                              [0, 0, 1]])
    }
    
    func render(sample: CameraFrame.Sample,
                deviceAnchor: DeviceAnchor,
                physicalMetrics: PhysicalMetricsConverter) async throws -> UIImage? {
        
        guard let renderer = renderer else { return nil }
        // Copies the rendered texture to CPU memory, wraps it in a CGImage,
        // and composites it over the passthrough camera frame.
        func textureImage(from texture: MTLTexture) -> UIImage? {
            let componentCount = 4
            let bitmapInfo = CGImageByteOrderInfo.order32Big.rawValue | CGImageAlphaInfo.premultipliedLast.rawValue
            let bitsPerComponent = 8
            let colorSpace = CGColorSpace(name: CGColorSpace.sRGB)!
            
            let bytesPerRow = texture.width * componentCount
            guard let pixelBuffer = malloc(texture.height * bytesPerRow) else {
                return nil
            }
            
            defer {
                free(pixelBuffer)
            }
            
            let region = MTLRegionMake2D(0, 0, texture.width, texture.height)
            texture.getBytes(pixelBuffer, bytesPerRow: bytesPerRow, from: region, mipmapLevel: 0)
            let ctx = CGContext(data: pixelBuffer,
                                width: texture.width,
                                height: texture.height,
                                bitsPerComponent: bitsPerComponent,
                                bytesPerRow: bytesPerRow,
                                space: colorSpace,
                                bitmapInfo: bitmapInfo)
            
            guard let cgImage = ctx?.makeImage() else {
                return nil
            }
            
            let ciImage = CIImage(cgImage: cgImage)
            let passThroughImage = CIImage(cvPixelBuffer: sample.pixelBuffer)
            let compositedCIImage = ciImage.composited(over: passThroughImage)
            let context = CIContext(options: nil)
            guard let composited = context.createCGImage(compositedCIImage, from: compositedCIImage.extent) else {
                return nil
            }
            return UIImage(cgImage: composited)
        }
        
       
        // Derive the projection from this frame's intrinsics and extrinsics.
        let intrinsics = fixIntrinsics(sample.parameters.intrinsics, physicalMetrics: physicalMetrics)
        let extrinsics = sample.parameters.extrinsics
        let projectionMatrix = computeProjectionMatrix(intrinsics: intrinsics, extrinsics: extrinsics)
        let projectiveTransformCameraComponent = ProjectiveTransformCameraComponent(projectionMatrix: projectionMatrix)

        cameraEntity.components.set(projectiveTransformCameraComponent)
       
        // Position the camera at the device pose; the camera extrinsics are
        // already folded into the projection matrix above.
        cameraEntity.transform.matrix = deviceAnchor.originFromAnchorTransform
        renderer.activeCamera = cameraEntity
        // Use a transparent background so the render composites cleanly over passthrough.
        renderer.cameraSettings.colorBackground = .color(.init(gray: 0.0, alpha: 0.0))
        renderer.cameraSettings.antialiasing = .none
        
        // TODO: Enable an image-based light (IBL) here if you need one.
//        renderer.lighting.resource = try await EnvironmentResource(named: "ImageBasedLighting")
        
        // Recover the image size from the intrinsics (principal point assumed centered).
        let imageWidth: Double = Double(sample.parameters.intrinsics.columns.0.z) * 2.0
        let imageHeight: Double = Double(sample.parameters.intrinsics.columns.1.z) * 2.0
        
        let contentSize = CGSize(width: imageWidth, height: imageHeight)
        let descriptor = MTLTextureDescriptor()
        descriptor.width = Int(contentSize.width)
        descriptor.height = Int(contentSize.height)
        descriptor.pixelFormat = .rgba8Unorm_srgb
        descriptor.sampleCount = 1
        descriptor.usage = [.renderTarget, .shaderRead, .shaderWrite]
        
        guard let texture = MTLCreateSystemDefaultDevice()?.makeTexture(descriptor: descriptor) else {
            return nil
        }
        
        // Render one frame into the texture, then read it back as a UIImage.
        let image: UIImage? = await withCheckedContinuation { (continuation: CheckedContinuation<UIImage?, Never>) in
            do {
                let output = try RealityRenderer.CameraOutput(RealityRenderer.CameraOutput.Descriptor.singleProjection(colorTexture: texture))
                try renderer.updateAndRender(deltaTime: 0.1, cameraOutput: output, onComplete: { _ in
                    let uiImage = textureImage(from: texture)
                    continuation.resume(returning: uiImage)
                })
            } catch {
                continuation.resume(returning: nil)
            }
        }
        return image
    }
}

extension simd_float4 {
    /// The first three components, as a simd_float3.
    var xyz: simd_float3 {
        simd_float3(x, y, z)
    }
}
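
For completeness, here's a sketch of a call site (names like root, updates, and worldTrackingProvider are placeholders for your own setup; the PhysicalMetricsConverter comes from SwiftUI's \.physicalMetrics environment value):

// Hypothetical call site. `root` is the entity tree to render and `updates`
// is the CameraFrameProvider's async frame sequence from your session setup.
let entityToImage = EntityToImage(root: root)

for await frame in updates {
    guard let deviceAnchor = worldTrackingProvider.queryDeviceAnchor(
        atTimestamp: CACurrentMediaTime()) else { continue }
    let image = try await entityToImage.render(
        sample: frame.primarySample,
        deviceAnchor: deviceAnchor,
        physicalMetrics: physicalMetrics)   // from @Environment(\.physicalMetrics)
    // Display or save `image`.
}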