Extract images of face and eyes using ARKit via the ARFrame's CVPixelBuffer?

I am currently tracking faces using ARKit. ARKit knows the world coordinates of the face via the ARFaceAnchor, and the positions of the user's eyes via the leftEyeTransform and rightEyeTransform relative to the face anchor.

I can get the live pixel buffer via

    guard let frame = sceneView.session.currentFrame else { return }
    // frame.capturedImage is the raw camera CVPixelBuffer for this frame
    let ciImage = CIImage(cvPixelBuffer: frame.capturedImage)
    let context = CIContext(options: nil)
    let cgImage = context.createCGImage(ciImage, from: ciImage.extent)!
    let myImage = UIImage(cgImage: cgImage)

How do I combine frame.capturedImage with ARKit's internal knowledge of where the user's eyes, nose, etc. are, and create cropped images of the user's eyes? I am trying to construct images of the user's eyes in real time.

Vision is likely better suited for this problem for a couple of reasons:

  1. Eyes are typically quite small in a frame, so you likely want to capture at the highest resolution possible. ARKit limits the resolution that you can capture at, whereas Vision requests can be applied to high-resolution output from an AVCaptureSession.

  2. ARKit will give you the eye transforms, but Vision will give you the points that make up a region outlining each detected eye. From that set of points you can calculate a bounding box and use it to crop the image (see the sketch after this list).
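
For illustration, here's a rough sketch of that flow. Treat it as a starting point only: the helper name, the padding amount, and the orientation handling are placeholders of mine rather than anything prescribed by Vision.

    import Vision
    import CoreImage

    // Rough sketch: detect face landmarks in a pixel buffer and crop out
    // each eye. The padding below is an arbitrary choice to tune.
    func cropEyes(in pixelBuffer: CVPixelBuffer,
                  orientation: CGImagePropertyOrientation) -> [CIImage] {
        var eyeImages: [CIImage] = []

        let request = VNDetectFaceLandmarksRequest { request, _ in
            guard let face = (request.results as? [VNFaceObservation])?.first,
                  let landmarks = face.landmarks else { return }

            let ciImage = CIImage(cvPixelBuffer: pixelBuffer)
            let imageSize = CGSize(width: CVPixelBufferGetWidth(pixelBuffer),
                                   height: CVPixelBufferGetHeight(pixelBuffer))

            for eye in [landmarks.leftEye, landmarks.rightEye].compactMap({ $0 }) {
                // Landmark points come back normalized; convert them to pixel
                // coordinates (lower-left origin, same as Core Image uses).
                let points = eye.pointsInImage(imageSize: imageSize)
                guard let minX = points.map({ $0.x }).min(),
                      let maxX = points.map({ $0.x }).max(),
                      let minY = points.map({ $0.y }).min(),
                      let maxY = points.map({ $0.y }).max() else { continue }

                // Bounding box of the eye outline, padded so the crop isn't
                // right at the eyelid.
                let box = CGRect(x: minX, y: minY,
                                 width: maxX - minX, height: maxY - minY)
                    .insetBy(dx: -10, dy: -10)

                eyeImages.append(ciImage.cropped(to: box))
            }
        }

        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                            orientation: orientation,
                                            options: [:])
        try? handler.perform([request])
        return eyeImages
    }

The same function works on frames from an AVCaptureVideoDataOutput, which is one way to get the higher-resolution input mentioned above.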

So, it is likely that you will want to use VNDetectFaceLandmarksRequest instead of ARKit for this problem.

If you have some other requirement that forces you to use ARKit, you can still submit the Vision request using the pixel buffer from ARKit, but again it will potentially be at a lower resolution than you'd like.
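
For example, if you're already running an ARSession, wiring the buffer into the same sketch above might look like this (the orientation value here is a guess that you'd need to match to your device orientation and the front camera):

    import ARKit

    // Hypothetical wiring from an ARSessionDelegate callback; reuses the
    // cropEyes sketch above. The orientation passed in is a guess.
    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        let eyeImages = cropEyes(in: frame.capturedImage, orientation: .leftMirrored)
        // ... hand eyeImages to your rendering or recording pipeline
    }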

Or (again, if you need to use ARKit for some other reason) you could try to identify the vertices in the ARFaceGeometry that surround the eyes, project those positions into image space, and calculate a bounding box from those points. This approach isn't really recommended, though: the ordering of those vertices could change, so you'd be relying on undocumented behavior, which is fragile.
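
If you do end up on the ARKit path, the projection step itself is straightforward with ARCamera's projectPoint. As a sketch, this projects the documented leftEyeTransform / rightEyeTransform positions rather than geometry vertices (the vertex ordering is the undocumented part); the fixed crop size is an arbitrary placeholder, and the orientation and y-flip assume capturedImage's usual landscape layout:

    import ARKit

    // Sketch of projecting ARKit's eye positions into the captured image.
    // Assumptions: capturedImage is in landscape sensor orientation, and the
    // 120-pixel crop size is a placeholder to tune.
    func eyeRects(for faceAnchor: ARFaceAnchor, in frame: ARFrame) -> [CGRect] {
        let camera = frame.camera
        let imageSize = camera.imageResolution

        return [faceAnchor.leftEyeTransform, faceAnchor.rightEyeTransform].map { eye in
            // The eye transform is relative to the face anchor, so bring it
            // into world space first.
            let world = simd_mul(faceAnchor.transform, eye)
            let position = simd_float3(world.columns.3.x,
                                       world.columns.3.y,
                                       world.columns.3.z)

            // Project into the captured image. The returned point has a
            // top-left origin, so flip y before cropping a CIImage (which
            // uses a lower-left origin).
            let point = camera.projectPoint(position,
                                            orientation: .landscapeRight,
                                            viewportSize: imageSize)

            let side: CGFloat = 120   // placeholder crop size in pixels
            return CGRect(x: point.x - side / 2,
                          y: imageSize.height - point.y - side / 2,
                          width: side,
                          height: side)
        }
    }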

Thanks!

I have just noticed that ARKit seems to track the pupils constantly with near 100% accuracy in real time, whereas the Vision tracking seems to go astray occasionally. So whatever ARKit is doing under the hood seems more accurate. We are also already using ARKit, but it sounds like Vision is the way to go.

Any idea how to create a bounding box to capture a picture of the user's head? Essentially I'd like to constantly be capturing images of the user's head and cropped pictures of the eyes in real time.

The VNFaceObservation that you receive from the VNDetectFaceLandmarksRequest also has a boundingBox property, which you may be able to use to derive a region that covers the head in the image (see the sketch below).
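
As a rough sketch (the padding factors here are guesses to tune), you could convert the normalized boundingBox to pixel coordinates and expand it to take in the whole head:

    import Vision
    import CoreImage

    // Sketch only: pads the detected face box outward to cover the head.
    func cropHead(from pixelBuffer: CVPixelBuffer, face: VNFaceObservation) -> CIImage {
        // boundingBox is normalized with a lower-left origin; convert to pixels.
        var box = VNImageRectForNormalizedRect(face.boundingBox,
                                               CVPixelBufferGetWidth(pixelBuffer),
                                               CVPixelBufferGetHeight(pixelBuffer))

        // The face box hugs the face itself, so pad it out to cover the head.
        box = box.insetBy(dx: -box.width * 0.25, dy: -box.height * 0.35)

        return CIImage(cvPixelBuffer: pixelBuffer).cropped(to: box)
    }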

You may find this sample code useful as a real time example: https://developer.apple.com/documentation/vision/tracking_the_user_s_face_in_real_time
