Create Anchor on Objects from 2D Data

We're developing a visionOS application where we would like to do product recognition (for example, food items).

We have enterprise entitlements and therefore main camera access on visionOS. We send the live camera frames to a trained Core ML model, which returns 2D coordinates for each detection.

Now we would like to create a 3D anchor on each detected item so it is visible to the user. The anchor's content will be the class name of the detected item.

How do we transform the 2D coordinate from the model prediction into a 3D anchor?

Use the camera's intrinsics and extrinsics parameters to convert image coordinates to 3D coordinates in camera space:

float u = imagePoint.x;
float v = imagePoint.y;

// Back-project the pixel through the pinhole model: (u - cx) * depth / fx, (v - cy) * depth / fy
simd_float3 cameraPoint;
cameraPoint.x = (u - intrinsics.columns[2].x) * depth / intrinsics.columns[0].x;
cameraPoint.y = (v - intrinsics.columns[2].y) * depth / intrinsics.columns[1].y;
cameraPoint.z = depth;

As that post says, the extrinsics do not define the transformation from the device anchor to the camera, but from the camera to the device anchor. (The extrinsics are actually a constant value!)

So then transform the 3D camera point into device anchor coordinates:

simd_float4 cameraPoint4D = simd_make_float4(cameraPoint.x, cameraPoint.y, cameraPoint.z, 1.0);

// Transform the camera-space point into device anchor coordinates.
simd_float4x4 extrinsicsInverse = simd_inverse(extrinsics);
simd_float4 devicePoint = simd_mul(extrinsicsInverse, cameraPoint4D);

Then use the device anchor to transform it to a world position:

simd_float4 worldPoint = simd_mul(weakSelf.deviceTransform, devicePoint);

Everything seems correct, but it doesn't work. I don't know why; please help.

Hi @christiandevin

It sounds like you want to convert a 2D point on an image (the left camera frame) to its corresponding location in 3D space.

Before we discuss this, I want to bring your attention to a bug: The camera intrinsic matrix is row major instead of column major. I suspect this bug is the cause of the unexpected behavior @tsia observed. To account for this, look for the principal point and focal length at different positions in the intrinsic matrix (see snippet).
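To make the layout concrete, here's a minimal sketch of where the values end up, assuming the transposed (row-major) storage described above; sample is a CameraFrame.Sample and the variable names are illustrative.

// Reading the pinhole parameters from the (transposed) intrinsic matrix.
let intrinsics = sample.parameters.intrinsics
let fx = intrinsics.columns.0.x   // focal length x (same position either way)
let fy = intrinsics.columns.1.y   // focal length y (same position either way)
let cx = intrinsics.columns.0.z   // principal point x (would be columns.2.x if the matrix were column major)
let cy = intrinsics.columns.1.z   // principal point y (would be columns.2.y if the matrix were column major)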

Now let's turn to your goal. I'll refer to the 2D point on the image as the "observation point". Using the camera intrinsic and extrinsic data together with queryDeviceAnchor, you can convert the observation point to a 3D point (in world space) that represents the observation's location relative to the left camera's projection plane. That's not the same as its position in 3D space. Imagine seeing the world through a piece of glass (which represents the projection plane): the former is a point on that glass, and the latter is the actual point behind it. To get the observation's position in 3D space you need a map from 2D points (on a projection plane) to depth. Depth data is not provided by CameraFrameProvider. I encourage you to file an enhancement request via Feedback Assistant explaining your use case and how it would benefit from depth data.

In the meantime, consider one of the following alternatives to obtain the missing depth (z) value:

  • Use SceneReconstructionProvider to create collision shapes for real-world objects, then raycast along the vector from the device to the observation point (see the sketch after this list). This works best on nearby, stationary objects.
  • Use monocular depth. I've not tried this and it doesn't appear trivial to implement.
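For the first option, here's a rough sketch of the raycast step. It assumes you've already generated collision shapes from SceneReconstructionProvider mesh anchors (for example with ShapeResource.generateStaticMesh(from:)) and added them to the scene, and that planePoint is the observation's world-space position on the projection plane (computed as in the snippet further down); the function name and parameters are illustrative, not part of any API.

import RealityKit
import simd

// Sketch: raycast from the device toward the observation to recover its depth.
// Assumes scene-reconstruction entities with CollisionComponents are already in the scene.
func worldPosition(of planePoint: SIMD3<Float>,
                   devicePosition: SIMD3<Float>,
                   in scene: RealityKit.Scene) -> SIMD3<Float>? {
    let direction = simd_normalize(planePoint - devicePosition)
    let hits = scene.raycast(origin: devicePosition,
                             direction: direction,
                             length: 5,            // only consider nearby objects
                             query: .nearest,
                             relativeTo: nil)
    // The nearest hit on the reconstructed mesh is the observation's world-space position.
    return hits.first?.position
}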

Sometimes code is easier to understand. Here's a snippet that covers the first observation returned from DetectBarcodesRequest with a plane. Note this positions an entity at the xy position of the observation relative to the left camera's projection plane then scales it to match the size of the barcode; it does not place the plane at the barcode's xyz position.

// in AppModel (requires import Vision and import CoreImage)

guard let pixelBuffer = sample?.pixelBuffer else { return }

let image = CIImage(cvPixelBuffer: pixelBuffer)
let request = DetectBarcodesRequest()

// observations is a property on appModel
do {
    observations = try await request.perform(on: image, orientation: .downMirrored)
} catch {
    observations = []
}
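In case it helps, sample and start() above come from CameraFrameProvider (part of the Enterprise APIs). Here's a minimal sketch of what that part of AppModel might look like, assuming the main camera entitlement and license are in place; aside from the sample and observations properties used above, the names are illustrative.

import ARKit
import Vision
import CoreImage
import Observation

@Observable
@MainActor
class AppModel {
    let arkitSession = ARKitSession()
    let cameraFrameProvider = CameraFrameProvider()

    var sample: CameraFrame.Sample?
    var observations: [BarcodeObservation] = []

    func start() async {
        // Pick a format for the left main camera.
        guard let format = CameraVideoFormat
            .supportedVideoFormats(for: .main, cameraPositions: [.left])
            .first else { return }

        do {
            try await arkitSession.run([cameraFrameProvider])
        } catch {
            return
        }

        guard let updates = cameraFrameProvider.cameraFrameUpdates(for: format) else { return }

        for await frame in updates {
            guard let leftSample = frame.sample(for: .left) else { continue }
            sample = leftSample
            // ... run DetectBarcodesRequest on leftSample.pixelBuffer as shown above ...
        }
    }
}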

Position a plane at the observation's xy coordinates relative to the projection plane for the left camera:

import SwiftUI
import RealityKit
import ARKit
import Vision
import QuartzCore

struct ImmersiveView: View {
    @Environment(AppModel.self) var appModel
    @Environment(\.physicalMetrics) var physicalMetrics
    @State var arkitSession = ARKitSession()
    @State var worldTrackingProvider = WorldTrackingProvider()
    @State var observationRoot = Entity()
    
    // Entity to represent the barcode
    @State var observationEntity = Entity()
    
    var body: some View {
        @Bindable var appModel = appModel
        
        RealityView { content in
            observationEntity.components.set(ModelComponent(
                mesh: .generateBox(width: 2, height: 2, depth: 0.001),
                materials: [SimpleMaterial(color: .green, isMetallic: false)]
            ))
            observationEntity.components.set(OpacityComponent(opacity: 0.5))
            observationEntity.isEnabled = false

            observationRoot.addChild(observationEntity)
            content.add(observationRoot)
        }
        update: { content in
            
            guard
                // rect is the first observation of a barcode.
                let rect = appModel.observations.first?.boundingBox.cgRect,
                // sample is the sample returned from CameraFrameProvider.
                let sample = appModel.sample,
                let deviceAnchor = worldTrackingProvider.queryDeviceAnchor(atTimestamp: CACurrentMediaTime()) else {
                
                observationEntity.isEnabled = false

                return
            }

            observationEntity.isEnabled = true
            
            let intrinsics = sample.parameters.intrinsics
            let focalLength = physicalMetrics.convert(intrinsics.columns.0.x, to: .meters)
            let focalLengthTransform = Transform(translation: [0, 0, focalLength]).matrix

            // Position an entity to represent the projection plane.
            observationRoot.transform.matrix = deviceAnchor.originFromAnchorTransform
            * sample.parameters.extrinsics.inverse
            * focalLengthTransform

            // Position the barcode relative to the projection plane.
            // Note: you have to account for the different coordinate systems (in this case top,left to Cartesian).
            let centerX = physicalMetrics.convert(intrinsics.columns.0.z, to: .meters)
            let centerY = physicalMetrics.convert(intrinsics.columns.1.z, to: .meters)
            
            observationEntity.position.x = remap(value: Float(rect.midX), fromRange: [0, 1], toRange: [-centerX, centerX])
            observationEntity.position.y = remap(value: Float(rect.midY), fromRange: [0, 1], toRange: [-centerY, centerY])
                        
            observationEntity.scale.x = Float(rect.width) * centerX
            observationEntity.scale.y = Float(rect.height) * centerY
            
        }
        .task {
            try? await arkitSession.run([worldTrackingProvider])
            await appModel.start()
        }
    }
    
    func remap(value: Float, fromRange: SIMD2<Float>, toRange: SIMD2<Float>) -> Float {
        toRange.x + (value - fromRange.x) * (toRange.y - toRange.x) / (fromRange.y - fromRange.x)
    }
}
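A note on the remap helper: the Vision observation's boundingBox is normalized (0 to 1), so remap maps those normalized coordinates onto the extent of the projection plane, which the snippet derives from the principal point converted to meters.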