Hi @christiandevin
It sounds like you want to convert a 2D point on an image (from the left camera frame) to its corresponding location in 3D space.
Before we discuss that, I want to bring a bug to your attention: the camera intrinsic matrix is row major instead of column major. I suspect this bug is the cause of the unexpected behavior @tsia observed. To account for it, read the focal length and principal point from different positions in the intrinsic matrix (see the snippet below).
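To make the layout concrete, here's a minimal sketch of reading the pinhole parameters under that row-major layout (the helper and the tuple it returns are my own, not part of any API):

```swift
import simd

// Assumes the row-major layout described above: the focal lengths stay on the
// diagonal, but the principal point moves from the third column
// (columns.2.x / columns.2.y) into the third element of the first two columns.
func pinholeParameters(from intrinsics: simd_float3x3)
    -> (focalX: Float, focalY: Float, centerX: Float, centerY: Float) {
    (focalX: intrinsics.columns.0.x,
     focalY: intrinsics.columns.1.y,
     centerX: intrinsics.columns.0.z,
     centerY: intrinsics.columns.1.z)
}
```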
Now let's turn to your goal. I'll refer to the 2D point on the image as the "observation point". Using the camera intrinsics and extrinsics together with queryDeviceAnchor, you can convert the observation point to a 3D point (in world space) that represents the observation's location on the left camera's projection plane. That's not the same as its position in 3D space: imagine seeing the world through a piece of glass (the projection plane); the former is a point on that glass, the latter is the actual point behind it. To get the observation's position in 3D space you need a mapping from 2D points (on the projection plane) to depth, and CameraFrameProvider does not provide depth data. I encourage you to file an enhancement request via Feedback Assistant explaining your use case and how it would benefit from depth data.
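To make the distinction concrete, here's a purely illustrative sketch (the function and its parameters are mine, not any API): every candidate 3D position lies on the ray from the camera through the point on the glass, and only a depth value along that ray picks out the real one.

```swift
import simd

// Purely illustrative: the point on the projection plane and the actual object
// look identical from the camera's point of view; they differ only in how far
// along the ray (the depth `t`) you travel.
func pointAlongObservationRay(cameraOrigin: SIMD3<Float>,
                              pointOnProjectionPlane: SIMD3<Float>,
                              depth t: Float) -> SIMD3<Float> {
    let direction = simd_normalize(pointOnProjectionPlane - cameraOrigin)
    return cameraOrigin + t * direction
}
```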
In the meantime, consider one of the following alternatives to obtain the depth (z) component:
- Use SceneReconstructionProvider to create collision shapes for real-world objects, then raycast along the vector from the device to the observation point (see the sketch after this list). This works best on nearby, stationary objects.
- Use monocular depth estimation. I haven't tried this, and it doesn't appear trivial to implement.
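Here's a minimal sketch of the raycast approach, assuming collision shapes for the real-world geometry already exist in the scene (a setup sketch follows the main snippet at the end of this post). The helper name and the 5 m cast length are my own choices, not an existing API:

```swift
import RealityKit
import simd

// A sketch: cast a ray from the device through the observation point on the
// projection plane and return the first collision hit, which approximates the
// observation's position in world space.
func estimatedWorldPosition(of observationEntity: Entity,
                            devicePosition: SIMD3<Float>,
                            in scene: RealityKit.Scene) -> SIMD3<Float>? {
    let pointOnProjectionPlane = observationEntity.position(relativeTo: nil)
    let direction = simd_normalize(pointOnProjectionPlane - devicePosition)
    let hits = scene.raycast(origin: devicePosition,
                             direction: direction,
                             length: 5, // assumed maximum cast distance, in meters
                             query: .nearest,
                             mask: .all,
                             relativeTo: nil)
    return hits.first?.position
}
```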
Sometimes code is easier to understand, so here's a snippet that covers the first observation returned by DetectBarcodesRequest with a plane. Note that this positions an entity at the observation's x/y position relative to the left camera's projection plane and scales it to match the size of the barcode; it does not place the plane at the barcode's x/y/z position.

First, run barcode detection on the left camera frame:
```swift
// Detect barcodes in the most recent left camera frame.
guard let pixelBuffer = sample?.pixelBuffer else { return }

let image = CIImage(cvPixelBuffer: pixelBuffer)
let request = DetectBarcodesRequest()

do {
    observations = try await request.perform(on: image, orientation: .downMirrored)
} catch {
    observations = []
}
```
Then position a plane at the observation's x/y coordinates relative to the left camera's projection plane:
```swift
import SwiftUI
import RealityKit
import ARKit
import QuartzCore

struct ImmersiveView: View {
    @Environment(AppModel.self) var appModel
    @Environment(\.physicalMetrics) var physicalMetrics
    @State var arkitSession = ARKitSession()
    @State var worldTrackingProvider = WorldTrackingProvider()
    // Root entity that carries the projection-plane transform for the left camera.
    @State var observationRoot = Entity()
    // Plane entity that marks the observation on the projection plane.
    @State var observationEntity = Entity()

    var body: some View {
        @Bindable var appModel = appModel

        RealityView { content in
            observationEntity.components.set(ModelComponent(
                mesh: .generateBox(width: 2, height: 2, depth: 0.001),
                materials: [SimpleMaterial(color: .green, isMetallic: false)]
            ))
            observationEntity.components.set(OpacityComponent(opacity: 0.5))
            observationEntity.isEnabled = false

            observationRoot.addChild(observationEntity)
            content.add(observationRoot)
        } update: { content in
            // Hide the plane unless there's an observation, a camera sample,
            // and a device anchor to work with.
            guard
                let rect = appModel.observations.first?.boundingBox.cgRect,
                let sample = appModel.sample,
                let deviceAnchor = worldTrackingProvider.queryDeviceAnchor(atTimestamp: CACurrentMediaTime()) else {
                observationEntity.isEnabled = false
                return
            }

            observationEntity.isEnabled = true

            // Offset by the focal length (converted to meters) to reach the projection plane.
            let intrinsics = sample.parameters.intrinsics
            let focalLength = physicalMetrics.convert(intrinsics.columns.0.x, to: .meters)
            let focalLengthTransform = Transform(translation: [0, 0, focalLength]).matrix

            // World from device, then device from camera (inverse extrinsics),
            // then out to the projection plane.
            observationRoot.transform.matrix = deviceAnchor.originFromAnchorTransform
                * sample.parameters.extrinsics.inverse
                * focalLengthTransform

            // Principal point, read from the row-major positions (columns.0.z and
            // columns.1.z) rather than the expected columns.2.x and columns.2.y.
            let centerX = physicalMetrics.convert(intrinsics.columns.0.z, to: .meters)
            let centerY = physicalMetrics.convert(intrinsics.columns.1.z, to: .meters)

            // Map the normalized bounding box onto the projection plane.
            observationEntity.position.x = remap(value: Float(rect.midX), fromRange: [0, 1], toRange: [-centerX, centerX])
            observationEntity.position.y = remap(value: Float(rect.midY), fromRange: [0, 1], toRange: [-centerY, centerY])

            // Scale the 2 m box down to the barcode's apparent size.
            observationEntity.scale.x = Float(rect.width) * centerX
            observationEntity.scale.y = Float(rect.height) * centerY
        }
        .task {
            try? await arkitSession.run([worldTrackingProvider])
            await appModel.start()
        }
    }

    func remap(value: Float, fromRange: SIMD2<Float>, toRange: SIMD2<Float>) -> Float {
        toRange.x + (value - fromRange.x) * (toRange.y - toRange.x) / (fromRange.y - fromRange.x)
    }
}
```
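If you try the first alternative above (SceneReconstructionProvider plus a raycast), you'll also need collision geometry for the ray to hit. Here's a minimal sketch of that setup following the usual scene-reconstruction pattern; `collisionRoot` is an assumed container entity you'd add to your RealityView content, and the function itself is mine, not an existing API:

```swift
import ARKit
import RealityKit

// A sketch: turn MeshAnchor updates into invisible collision entities so the
// raycast alternative above has geometry to hit.
func processSceneReconstruction(_ provider: SceneReconstructionProvider,
                                collisionRoot: Entity) async {
    var entities: [UUID: Entity] = [:]
    for await update in provider.anchorUpdates {
        switch update.event {
        case .added, .updated:
            guard let shape = try? await ShapeResource.generateStaticMesh(from: update.anchor) else { continue }
            let entity: Entity
            if let existing = entities[update.anchor.id] {
                entity = existing
            } else {
                entity = Entity()
                entities[update.anchor.id] = entity
                collisionRoot.addChild(entity)
            }
            entity.components.set(CollisionComponent(shapes: [shape], isStatic: true))
            entity.setTransformMatrix(update.anchor.originFromAnchorTransform, relativeTo: nil)
        case .removed:
            entities[update.anchor.id]?.removeFromParent()
            entities[update.anchor.id] = nil
        }
    }
}
```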