Summary:
I am using the Vision framework, in conjunction with AVFoundation, to detect facial landmarks of each face in the camera feed (by way of the VNDetectFaceLandmarksRequest). From here, I am taking the found observations and unprojecting each point to a SceneKit View (SCNView), then using those points as the vertices to draw a custom geometry that is textured with a material over each found face.
Effectively, I am working to recreate how an ARFaceTrackingConfiguration functions. In general, this task is functioning as expected, but only when my device is using the front camera in landscape right orientation. When I rotate my device, or switch to the rear camera, the unprojected points do not properly align with the found face as they do in landscape right/front camera.
Problem:
When testing this code, the mesh appears properly (that is, appears affixed to a user's face), but again, only when using the front camera in landscape right. While the code runs as expected (that is, generating the face mesh for each found face) in all orientations, the mesh is wildly misaligned in all other cases.
I believe this issue stems from how I convert the face's bounding box: I call VNImageRectForNormalizedRect with the width/height of my SCNView rather than those of my pixel buffer, which is typically much larger. However, every modification I have tried results in the same issue.
Alternatively, this could be an issue with my SCNCamera, as I am a bit unsure how the transform/projection matrix works and whether it is needed here.
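Since the view and the pixel buffer usually differ in size and aspect ratio, one thing worth checking is how a normalized Vision coordinate lands in the view. The following is a minimal sketch, not the poster's code: `aspectFillRect` and `viewPoint` are hypothetical helpers, and it assumes the camera image is displayed aspect-fill and centered, that the normalized point already refers to the upright (orientation-corrected) image, and that the normalized origin is the lower-left corner while the view origin is the upper-left.

```swift
import Foundation

// Hypothetical helper: the rectangle the image occupies in view coordinates
// when scaled aspect-fill and centered (parts of it may lie outside the view).
func aspectFillRect(imageSize: CGSize, viewSize: CGSize) -> CGRect {
    let scale = max(viewSize.width / imageSize.width,
                    viewSize.height / imageSize.height)
    let scaled = CGSize(width: imageSize.width * scale,
                        height: imageSize.height * scale)
    return CGRect(x: (viewSize.width - scaled.width) / 2,
                  y: (viewSize.height - scaled.height) / 2,
                  width: scaled.width,
                  height: scaled.height)
}

// Hypothetical helper: maps a normalized image point (origin lower-left)
// to a view point (origin upper-left).
func viewPoint(forNormalized p: CGPoint, imageSize: CGSize, viewSize: CGSize) -> CGPoint {
    let r = aspectFillRect(imageSize: imageSize, viewSize: viewSize)
    return CGPoint(x: r.minX + p.x * r.width,
                   y: r.minY + (1 - p.y) * r.height)
}

// Example sizes are made up: a landscape image shown in a portrait view.
let center = viewPoint(forNormalized: CGPoint(x: 0.5, y: 0.5),
                       imageSize: CGSize(width: 1920, height: 1080),
                       viewSize: CGSize(width: 400, height: 800))
print(center) // the image center should land on the view center
```

If the view and the image have different aspect ratios, scaling by the view's width and height alone stretches the coordinates, which is one possible source of misalignment.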
Sample of Vision Request Setup:
// Setup Vision request options
var requestHandlerOptions: [VNImageOption: AnyObject] = [:]
// Setup Camera Intrinsics (only when the sample buffer carries them)
if let cameraIntrinsicData = CMGetAttachment(sampleBuffer, key: kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix, attachmentModeOut: nil) {
    requestHandlerOptions[VNImageOption.cameraIntrinsics] = cameraIntrinsicData
}
// Set EXIF orientation
let exifOrientation = self.exifOrientationForCurrentDeviceOrientation()
// Setup vision request handler
let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                    orientation: exifOrientation,
                                    options: requestHandlerOptions)
// Setup the completion handler
let completion: VNRequestCompletionHandler = { request, error in
    // Return early instead of crashing if the request produced no face observations
    guard let observations = request.results as? [VNFaceObservation] else {
        return
    }
    // Draw faces
    DispatchQueue.main.async {
        self.drawFaceGeometry(observations: observations)
    }
}
// Setup the image request
let request = VNDetectFaceLandmarksRequest(completionHandler: completion)
// Handle the request
do {
    try handler.perform([request])
} catch {
    print(error)
}
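The helper exifOrientationForCurrentDeviceOrientation() is not shown above. As a rough sketch of what such a helper might look like, the mapping below is one that is commonly used for the front camera. Everything here is an assumption: `DeviceOrientation` and `ExifOrientation` are hypothetical stand-ins (so the sketch runs without UIKit or ImageIO), and the table applies to the mirrored front camera only.

```swift
// Hypothetical stand-in for UIDeviceOrientation (face-up/face-down omitted).
enum DeviceOrientation {
    case portrait, portraitUpsideDown, landscapeLeft, landscapeRight
}

// Hypothetical stand-in whose raw values follow the EXIF orientation tag (1...8),
// the same numbering CGImagePropertyOrientation uses.
enum ExifOrientation: UInt32 {
    case up = 1, upMirrored, down, downMirrored, leftMirrored, right, rightMirrored, left
}

// Assumed mapping for the FRONT camera only. The rear camera is not mirrored
// and would need its own table.
func exifOrientationForFrontCamera(_ device: DeviceOrientation) -> ExifOrientation {
    switch device {
    case .portraitUpsideDown: return .rightMirrored
    case .landscapeLeft:      return .downMirrored
    case .landscapeRight:     return .upMirrored
    case .portrait:           return .leftMirrored
    }
}

print(exifOrientationForFrontCamera(.landscapeRight).rawValue)
```

Under this mapping, landscape right is the only case in which the buffer is mirrored but not rotated, which may be related to why it is the one orientation where the mesh lines up. If the helper returns front-camera values while the rear camera is active, the orientation passed to Vision would be wrong for every rear-camera frame.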
Sample of SCNView Setup:
// Setup SCNView
let scnView = SCNView()
scnView.translatesAutoresizingMaskIntoConstraints = false
self.view.addSubview(scnView)
scnView.showsStatistics = true
NSLayoutConstraint.activate([
    scnView.leadingAnchor.constraint(equalTo: self.view.leadingAnchor),
    scnView.topAnchor.constraint(equalTo: self.view.topAnchor),
    scnView.bottomAnchor.constraint(equalTo: self.view.bottomAnchor),
    scnView.trailingAnchor.constraint(equalTo: self.view.trailingAnchor)
])
// Setup scene
let scene = SCNScene()
scnView.scene = scene
// Setup camera
let cameraNode = SCNNode()
let camera = SCNCamera()
cameraNode.camera = camera
scnView.scene?.rootNode.addChildNode(cameraNode)
cameraNode.position = SCNVector3(x: 0, y: 0, z: 16)
// Setup light
let ambientLightNode = SCNNode()
ambientLightNode.light = SCNLight()
ambientLightNode.light?.type = SCNLight.LightType.ambient
ambientLightNode.light?.color = UIColor.darkGray
scnView.scene?.rootNode.addChildNode(ambientLightNode)
Sample of "face processing":
func drawFaceGeometry(observations: [VNFaceObservation]) {
    // An array of face nodes, one SCNNode for each detected face
    var faceNodes = [SCNNode]()
    // The origin point
    let projectedOrigin = scnView.projectPoint(SCNVector3Zero)
    // Iterate through each found face
    for observation in observations {
        // Setup the found bounds
        let faceBounds = VNImageRectForNormalizedRect(observation.boundingBox, Int(self.scnView.bounds.width), Int(self.scnView.bounds.height))
        // Verify we have landmarks
        if let landmarks = observation.landmarks {
            // Landmarks are relative to and normalized within face bounds
            let affineTransform = CGAffineTransform(translationX: faceBounds.origin.x, y: faceBounds.origin.y)
                .scaledBy(x: faceBounds.size.width, y: faceBounds.size.height)
            // Add all points as vertices
            var vertices = [SCNVector3]()
            // Verify we have points
            if let allPoints = landmarks.allPoints {
                // Iterate through each point
                for point in allPoints.normalizedPoints {
                    // Apply the transform to convert each point to the face's bounding box range
                    let normalizedPoint = point.applying(affineTransform)
                    let projected = SCNVector3(normalizedPoint.x, normalizedPoint.y, CGFloat(projectedOrigin.z))
                    let unprojected = scnView.unprojectPoint(projected)
                    vertices.append(unprojected)
                }
            }
            // Setup Indices
            var indices = [UInt16]()
            // Add indices
            // ... Removed for brevity ...
            // Setup texture coordinates
            var coordinates = [CGPoint]()
            // Add texture coordinates
            // ... Removed for brevity ...
            // Setup texture image (assumed square, so both axes divide by the width)
            let imageWidth = 2048.0
            let normalizedCoordinates = coordinates.map { coord -> CGPoint in
                let x = coord.x / CGFloat(imageWidth)
                let y = coord.y / CGFloat(imageWidth)
                let textureCoord = CGPoint(x: x, y: y)
                return textureCoord
            }
            // Setup sources
            let sources = SCNGeometrySource(vertices: vertices)
            let textureCoordinates = SCNGeometrySource(textureCoordinates: normalizedCoordinates)
            // Setup elements
            let elements = SCNGeometryElement(indices: indices, primitiveType: .triangles)
            // Setup Geometry
            let geometry = SCNGeometry(sources: [sources, textureCoordinates], elements: [elements])
            geometry.firstMaterial?.diffuse.contents = textureImage
            // Setup node
            let customFace = SCNNode(geometry: geometry)
            // Append the face to the face nodes array
            faceNodes.append(customFace)
        }
    }
    // Iterate the face nodes and append to the scene
    for node in faceNodes {
        scnView.scene?.rootNode.addChildNode(node)
    }
}
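The landmark conversion in the function above can be isolated as a small piece of arithmetic. The sketch below is a hypothetical helper (`imagePoint` is not part of Vision or of the original code) that assumes landmark points are normalized to the face bounding box, that the bounding box is normalized to the image, that both use a lower-left origin, and that the target space has a top-left origin. Whether the y flip is needed depends on the coordinate space expected by unprojectPoint, which the post does not settle.

```swift
import Foundation

// Hypothetical helper: converts a landmark point normalized to the face
// bounding box into a top-left-origin point in an image of the given size.
func imagePoint(landmark: CGPoint, faceBox: CGRect, imageSize: CGSize) -> CGPoint {
    // Landmark -> normalized image coordinates (both lower-left origin)
    let nx = faceBox.minX + landmark.x * faceBox.width
    let ny = faceBox.minY + landmark.y * faceBox.height
    // Normalized -> pixels, flipping y for a top-left origin
    return CGPoint(x: nx * imageSize.width,
                   y: (1 - ny) * imageSize.height)
}

// Made-up values: a face box covering the middle of the upper half of the image.
let box = CGRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25)
let p = imagePoint(landmark: CGPoint(x: 0.5, y: 1.0),
                   faceBox: box,
                   imageSize: CGSize(width: 1000, height: 800))
print(p) // x = 500, y = 200
```

The posted code applies the translate-and-scale step but never flips y, so comparing its output against a helper like this for a single known point may show whether the mesh is vertically mirrored.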
Hi,
I have a custom object detection CoreML model and I notice something strange when using the model with the Vision framework.
I have tried two different approaches as to how to process an image and do inference on the CoreML model.
The first one is using CoreML "raw": initialising the model, getting the input image ready and using the model's .prediction() function to get the model's output.
The second one is using Vision to wrap the CoreML model in a VNCoreMLModel, creating a VNCoreMLRequest and using the VNImageRequestHandler to actually perform the model inference. The result of the VNCoreMLRequest is of type VNRecognizedObjectObservation.
The issue I now face is in the difference in the output of both methods. The first method gives back the raw output of the CoreML model: confidence and coordinates. The confidence is an array with size equal to the number of classes in my model (3 in my case). The second method gives back the boundingBox, confidence and labels. However here the confidence is only the confidence for the most likely class (so size is equal to 1). But the confidence I get from the second approach is quite different from the confidence I get during the first approach.
I can use either one of the approaches in my application. However, I really want to find out what is going on and understand how this difference occurred.
Thanks!
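To compare the two approaches on the same image, the raw per-class confidences from the first approach can be reduced to the single top class that the second approach reports. A minimal sketch, with made-up labels and values and a hypothetical helper name:

```swift
// Hypothetical helper: returns the label and confidence of the most likely class.
// Assumes `confidences` holds one value per class, in the same order as `labels`.
func topClass(labels: [String], confidences: [Float]) -> (label: String, confidence: Float)? {
    guard labels.count == confidences.count,
          let best = confidences.indices.max(by: { confidences[$0] < confidences[$1] })
    else { return nil }
    return (labels[best], confidences[best])
}

// Made-up raw output for a 3-class model.
if let top = topClass(labels: ["classA", "classB", "classC"],
                      confidences: [0.12, 0.81, 0.07]) {
    print(top.label, top.confidence)
}
```

If the top class matches but the number does not, one thing to rule out is preprocessing: VNCoreMLRequest has an imageCropAndScaleOption property, so the model may not be seeing the same pixels in both paths.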
Hi,
When using VNFeaturePrintObservation and then computing the distance between two images, the values it returns vary heavily. When two identical images (the same image file) are passed into the function below, which I use to compare the images, the distance is not 0, although I would expect it to be, since the images are identical.
Also, what is the upper limit of computeDistance? I am trying to find the percentage similarity between the two images. (Of course, this cannot be done unless the issue above is resolved.)
The code that I have used is below:
func featureprintObservationForImage(image: UIImage) -> VNFeaturePrintObservation? {
    // A UIImage that is not backed by a CGImage has no cgImage, so avoid force-unwrapping
    guard let cgImage = image.cgImage else { return nil }
    let requestHandler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    let request = VNGenerateImageFeaturePrintRequest()
    request.usesCPUOnly = true // Simulator Testing
    do {
        try requestHandler.perform([request])
        return request.results?.first as? VNFeaturePrintObservation
    } catch {
        print("Vision Error: \(error)")
        return nil
    }
}
func compare(origImg: UIImage, drawnImg: UIImage) -> Float? {
    guard let oImgObservation = featureprintObservationForImage(image: origImg) else {
        print("Original Image Observation found Nil")
        return nil
    }
    guard let dImgObservation = featureprintObservationForImage(image: drawnImg) else {
        print("Drawn Image Observation found Nil")
        return nil
    }
    var distance: Float = -1
    do {
        try oImgObservation.computeDistance(&distance, to: dImgObservation)
    } catch {
        // Report the failure to the caller instead of crashing
        print("Failed to Compute Distance: \(error)")
        return nil
    }
    return distance
}
Thanks for all the help!
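The expectation that identical inputs give a distance of 0 holds for any plain vector distance. The sketch below illustrates this with a Euclidean (L2) distance over Float arrays. This is an assumption made for illustration only: the post does not say which metric computeDistance uses, and `euclideanDistance` is a hypothetical helper, not a Vision API.

```swift
// Assumption: feature prints behave like fixed-length Float vectors compared
// with a Euclidean (L2) distance. Vision's actual metric is not stated in the post.
func euclideanDistance(_ a: [Float], _ b: [Float]) -> Float? {
    // Vectors of different lengths are not comparable
    guard a.count == b.count else { return nil }
    var sum: Float = 0
    for (x, y) in zip(a, b) {
        let d = x - y
        sum += d * d
    }
    return sum.squareRoot()
}

let v: [Float] = [0.1, 0.4, 0.9]
print(euclideanDistance(v, v) as Any) // identical vectors give a distance of 0
```

Under that assumption, a nonzero distance for the same file would suggest the two feature prints were not computed from identical inputs or settings, rather than a problem in the comparison itself.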
I saw there is a way to track hands with Vision, but is there also a way to record that movement and export it to FBX? Also, is there a way to record only one hand, or both at the same time? The implementation will be in SwiftUI.
Hello,
I am reaching out for some assistance regarding integrating a CoreML action classifier into a SwiftUI app. Specifically, I am trying to implement this classifier to work with the live camera of the device. I have been doing some research, but unfortunately, I have not been able to find any relevant information on this topic.
I was wondering if you could provide me with any examples, resources, or information that could help me achieve this integration? Any guidance you can offer would be greatly appreciated.
Thank you in advance for your help and support.
I just grabbed the portal code made available for testing and ran into this error when trying to run it in the Vision Pro simulator:
Thread 1: Fatal error: SwiftUI Scene ImmersiveSpace requires a UISceneSessionRole of "UISceneSessionRoleImmersiveSpaceApplication" for key UIApplicationPreferredDefaultSceneSessionRole in the Application Scene Manifest.
hi there,
i'm not sure if i'm missing something, but i've tried passing a variety of CGImages into SCSensitivityAnalyzer, incl ones which should be flagged as sensitive, and it always returns false. it doesn't throw an exception, and i have the Sensitive Content Warning enabled in settings (confirmed by checking the analysisPolicy at run time).
i've tried both the async and callback versions of analyzeImage.
this is with Xcode 15 beta 5.
i'm primarily testing on iOS/iPad simulators - is that a known issue?
cheers,
Mike
Can you share the source code for the demo of the Vision Face Detector with the metrics (roll, yaw and pitch) displayed? You provide some code online but not for this portion of the presentation.
When I customize a gesture interaction, how do I set the key value? It depends on the accuracy of finger joint recognition and distance detection. What is the accuracy of finger joint detection and of distance detection?
I'm trying to create a sky mask on pictures taken with my iPhone. I've seen in the documentation that Core Image supports semantic segmentation for sky, among other types for people (skin, hair, etc.).
So far, I haven't found the proper workflow to use it.
First, I watched https://developer.apple.com/videos/play/wwdc2019/225/
I understood that images must be captured with the segmentation with this kind of code:
photoSettings.enabledSemanticSegmentationMatteTypes = self.photoOutput.availableSemanticSegmentationMatteTypes
photoSettings.embedsSemanticSegmentationMattesInPhoto = true
I capture the image on my iPhone, save it in HEIC format, and later try to load the matte like this:
let skyMatte = CIImage(contentsOf: imageURL, options: [.auxiliarySemanticSegmentationSkyMatte: true])
Unfortunately, self.photoOutput.availableSemanticSegmentationMatteTypes always gives me a list of person types only and never a sky type.
In any case, AVSemanticSegmentationMatte.MatteType only offers [Hair, Skin, Teeth, Glasses], with no sky.
So how am I supposed to use semanticSegmentationSkyMatteImage? Is there a simple workaround?
Hi,
I want to control a hand model via hand motion capture.
I know there is a sample project and some articles about Rigging a Model for Motion Capture in the ARKit documentation, but the solution is quite encapsulated in BodyTrackedEntity. I can't find an appropriate Entity for controlling just a hand model.
By using VNDetectHumanHandPoseRequest provided by Vision framework, I can get hand joint info, but I don't know how to use that info in RealityKit to control a 3d hand model.
Do you know how to do that or do you have any idea on how should it be implemented?
Thanks
I am trying to use VNDetectFaceRectanglesRequest to detect face bounding boxes on frames obtained by ARKit callbacks.
I have my app in Portrait Device Orientation and I am passing the .right orientation to perform method on VNSequenceRequestHandler
something like:
private let requestHandler = VNSequenceRequestHandler()
private var facePoseRequest: VNDetectFaceRectanglesRequest!
// ...
try? self.requestHandler.perform([self.facePoseRequest], on: currentBuffer, orientation: orientation)
I'm setting .right for the orientation above, in the hope that the Vision framework will re-orient the buffer before running inference.
I'm trying to draw the returned BB on top of the image. Here's my results-processing code:
guard let faceRes = self.facePoseRequest.results?.first as? VNFaceObservation else {
    return
}
// Buffer dimensions, declared before either option uses them
let currBufWidth = CVPixelBufferGetWidth(currentBuffer)
let currBufHeight = CVPixelBufferGetHeight(currentBuffer)
// Option 1: Assuming the reported BB is in the coordinate space of the orientation-adjusted pixel buffer
// Problems/Observations:
// The bounding box turns into a square with equal width and height
// Also, the BB does not cover the entire face, only from chin to eyes
// Notice height and width are flipped below
let flippedBB = VNImageRectForNormalizedRect(faceRes.boundingBox, currBufHeight, currBufWidth)
// vs.
// Option 2: Assuming the reported BB is in the coordinate system of the original un-oriented pixel buffer
// Problems/Observations:
// While the drawn BB does look like a rectangle and covers most of the face, it is not always centered on the face.
// It moves around the screen when I tilt the device or my face.
let reportedBB = VNImageRectForNormalizedRect(faceRes.boundingBox, currBufWidth, currBufHeight)
In Option 1 above:
The bounding box becomes a square, with width and height becoming equal. I noticed that the reported normalized BB has the same aspect ratio as the input pixel buffer, which is 1.33. This is why, when I flip the width and height params in VNImageRectForNormalizedRect, the width and height become equal.
In Option 2 above:
The BB seems to be roughly the right height, but it jumps around when I tilt the device or my head.
What coordinate system are the reported bounding boxes in?
Do I need to adjust for y-flippedness of Vision framework before I perform above operations?
What's the best way to draw these BB on the captured-frame and or ARview?
Thank you
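For the drawing step, the scale-and-flip arithmetic can be written out on its own. The sketch below is a hypothetical helper (`topLeftRect` is not a Vision function) under two assumptions that the questions above are themselves asking about: that the normalized rect is relative to the orientation-corrected image, and that its origin is the lower-left corner, so y has to be flipped before drawing in a top-left-origin space.

```swift
import Foundation

// Hypothetical helper: scales a normalized rect (lower-left origin) to pixels
// and flips it to a top-left origin for drawing.
func topLeftRect(fromNormalized r: CGRect, width: Int, height: Int) -> CGRect {
    let w = CGFloat(width), h = CGFloat(height)
    return CGRect(x: r.minX * w,
                  y: (1 - r.maxY) * h,
                  width: r.width * w,
                  height: r.height * h)
}

// Assumption: a 1920x1440 buffer passed with .right is treated as 1440x1920 upright.
let bb = CGRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25) // made-up normalized box
let drawn = topLeftRect(fromNormalized: bb, width: 1440, height: 1920)
print(drawn) // origin (360, 480), size 720 x 480
```

Drawing the result over a still frame with a known face position is one way to check which of the two assumptions actually holds.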
Hello, I have created a view with a 360 image full view, and I need to perform a task when the user clicks anywhere on the screen (leave the dome), but no matter what I try, it just does not work, it doesn't print anything at all.
import SwiftUI
import RealityKit
import RealityKitContent
struct StreetWalk: View {
    @Binding var threeSixtyImage: String
    @Binding var isExitFaded: Bool
    var body: some View {
        RealityView { content in
            // Create a material with a 360 image
            guard let url = Bundle.main.url(forResource: threeSixtyImage, withExtension: "jpeg"),
                  let resource = try? await TextureResource(contentsOf: url) else {
                // If the asset isn't available, something is wrong with the app.
                fatalError("Unable to load starfield texture.")
            }
            var material = UnlitMaterial()
            material.color = .init(texture: .init(resource))
            // Attach the material to a large sphere.
            let streeDome = Entity()
            streeDome.name = "streetDome"
            streeDome.components.set(ModelComponent(
                mesh: .generatePlane(width: 1000, depth: 1000),
                materials: [material]
            ))
            // Ensure the texture image points inward at the viewer.
            streeDome.scale *= .init(x: -1, y: 1, z: 1)
            content.add(streeDome)
        }
        update: { updatedContent in
            // Create a material with a 360 image
            guard let url = Bundle.main.url(forResource: threeSixtyImage,
                                            withExtension: "jpeg"),
                  let resource = try? TextureResource.load(contentsOf: url) else {
                // If the asset isn't available, something is wrong with the app.
                fatalError("Unable to load starfield texture.")
            }
            var material = UnlitMaterial()
            material.color = .init(texture: .init(resource))
            updatedContent.entities.first?.components.set(ModelComponent(
                mesh: .generateSphere(radius: 1000),
                materials: [material]
            ))
        }
        .gesture(tap)
    }
    var tap: some Gesture {
        SpatialTapGesture().targetedToAnyEntity().onChanged { value in
            // Access the tapped entity here.
            print(value.entity)
            print("maybe you can tap the dome")
            // isExitFaded.toggle()
        }
    }
}
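One thing that may be worth trying: entities generally need both an InputTargetComponent and a CollisionComponent before entity-targeted gestures can hit them, and a tap is typically reported through onEnded rather than onChanged. The sketch below is a stripped-down, hypothetical view (`TappableDome` is not from the original code, and it uses a plain material instead of the 360 texture). Whether a tap made from inside the sphere hits its collision shape is an open assumption here, so a smaller shape or a different shape may be needed.

```swift
import SwiftUI
import RealityKit

// Hypothetical minimal view: a sphere entity set up to receive tap gestures.
struct TappableDome: View {
    var body: some View {
        RealityView { content in
            let dome = Entity()
            dome.name = "dome"
            dome.components.set(ModelComponent(
                mesh: .generateSphere(radius: 1000),
                materials: [UnlitMaterial()]
            ))
            // Mark the entity as an input target and give it a collision shape
            // so that gesture hit testing can find it.
            dome.components.set(InputTargetComponent())
            dome.components.set(CollisionComponent(
                shapes: [.generateSphere(radius: 1000)]
            ))
            content.add(dome)
        }
        .gesture(
            SpatialTapGesture()
                .targetedToAnyEntity()
                .onEnded { value in
                    // Runs once when the tap completes
                    print("tapped \(value.entity.name)")
                }
        )
    }
}
```

If the print still does not fire with both components in place, the collision shape is the next thing to examine.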
Hello,
I am looking for something to allow me to Anchor a webview component to the user, as in it follows their line of vision as they move.
I tried using RealityView with an Anchor Entity, but it raises an error of "Presentations are not permitted within volumetric window scene". Can I anchor the Window instead?
Hello, I am Pieter Bikkel. I study Software Engineering at the HAN, University of Applied Sciences, and I am working on an app that can recognize volleyball actions using Machine Learning. A volleyball coach can put an iPhone on a tripod and analyze a volleyball match. For example, where the ball always lands in the field, how hard the ball is served. I was inspired by this session and wondered if I could interview one of the experts in this field. This would allow me to develop my App even better. I hope you can help me with this.
I just downloaded the latest Xcode beta, Version 15.0 (15A240d) and ran into some issues:
On start up, I was not given an option to download the Vision simulator.
I cannot create a project targeted at visionOS
I cannot build/run a hello world app for Vision.
In my previous Xcode-beta (Version 15.0 beta 8 (15A5229m)), there was an option to download the vision simulator, and I can create projects for the visionOS and run the code in the vision simulator.
The Xcode file downloaded was named "Xcode" instead of "Xcode-beta". I didn't want to get rid of the existing Xcode, so I selected Keep Both. Now I have 3 Xcodes in the Applications folder:
Xcode
Xcode copy
Xcode-beta
That is the only thing I see that might have been different about my install.
Hardware: Mac Studio 2022 with M1 Max
macOS Ventura 13.5.2
Any idea what I did wrong?
var accessibilityComponent = AccessibilityComponent()
accessibilityComponent.isAccessibilityElement = true
accessibilityComponent.traits = [.button, .playsSound]
accessibilityComponent.label = "Cloud"
accessibilityComponent.value = "Grumpy"
cloud.components[AccessibilityComponent.self] = accessibilityComponent
// ...
var isHappy: Bool {
    didSet {
        cloudEntities[id].accessibilityValue = isHappy ? "Happy" : "Grumpy"
    }
}
Hi,
I am developing a fitness app that detects technique mistakes during workout. Can we use 3D data from VNDetectHumanBodyPose3DRequest with ML model?
I am trying to use Vision framework in iOS but getting below error in logs.
Not able to find any resources in Developer Forums.
Any help would be appreciated!
ABPKPersonIDTracker not supported on this device
Failed to initialize ABPK Person ID Tracker
public func runHumanBodyPose3DRequest() {
    // Avoid force-unwrapping the file URL
    guard let filePath = filePath else { return }
    let request = VNDetectHumanBodyPose3DRequest()
    let requestHandler = VNImageRequestHandler(url: filePath)
    do {
        try requestHandler.perform([request])
        if let returnedObservation = request.results?.first {
            self.humanObservation = returnedObservation
            print(humanObservation)
        }
    } catch {
        print(error.localizedDescription)
    }
}
Hello!
I would like to develop a visionOS application that tracks a single object in a user's environment. Skimming through the documentation I found out that this feature is currently unsupported in ARKit (we can only recognize images). But it seems it should be doable by combining CoreML and Vision frameworks. So I have a few questions:
Is it the best approach or is there a simpler solution?
What is the best way to train a CoreML model without access to the device? Will videos recorded by iPhone 15 be enough?
Thank you in advance for all the answers.