Apply computer vision algorithms to perform a variety of tasks on input images and video using Vision.

Vision Documentation

Posts under Vision tag

80 Posts
Post marked as solved
3 Replies
440 Views
I would like to extract depth data for a given point in ARSession.currentFrame.smoothedSceneDepth. Optimally this would end up looking something like:

ARView.depth(at point: CGPoint)

with the point being in UIKit coordinates, just like the points passed to the raycasting methods. I ultimately want to use this depth data to convert a 2D normalized landmark from a Vision image request into a 3D world-space coordinate in the scene; I only lack accurate depth data for a given 2D point. What I have available is:

- The normalized landmark from the Vision request.
- The ability to convert this to AVFoundation coordinates.
- The ability to convert this to screen-space/display coordinates.

When the depth data is provided correctly, I can combine the 2D position in UIKit/screen-space coordinates with the depth (in meters) to produce an accurate 3D world position via ARView.ray(through:). What I have not been able to figure out is how to get the depth value for a given coordinate on screen. I can index the pixel buffer like this:

extension CVPixelBuffer {
    func value(for point: CGPoint) -> Float32 {
        CVPixelBufferLockBaseAddress(self, .readOnly)
        defer { CVPixelBufferUnlockBaseAddress(self, .readOnly) }

        assert(CVPixelBufferGetPixelFormatType(self) == kCVPixelFormatType_DepthFloat32)

        let width = CVPixelBufferGetWidth(self)
        let height = CVPixelBufferGetHeight(self)

        // Something potentially going wrong here.
        let pixelX = Int(CGFloat(width) * point.x)
        let pixelY = Int(CGFloat(height) * point.y)

        let bytesPerRow = CVPixelBufferGetBytesPerRow(self)
        let baseAddress = CVPixelBufferGetBaseAddress(self)!
        let rowData = baseAddress + pixelY * bytesPerRow
        return rowData.assumingMemoryBound(to: Float32.self)[pixelX]
    }
}

And then try to use this method like so:

guard let depthMap = (currentFrame.smoothedSceneDepth ?? currentFrame.sceneDepth)?.depthMap else { return nil }
// The depth at this coordinate, in meters.
let depthValue = depthMap.value(for: myGivenPoint)

The frame semantics [.smoothedSceneDepth, .sceneDepth] have been set properly on my ARConfiguration, and the depth data is available. If I hard-code the coordinates like so:

let pixelX = width / 2
let pixelY = height / 2

I get the correct depth value for the center of the screen. I have only been testing in portrait mode, but I do not know how to index the depth data for an arbitrary point.
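For reference, a minimal sketch of the index math on its own, with clamping so edge points can't run past the buffer. The bigger caveat (an assumption worth verifying) is orientation: the scene-depth map is delivered in the sensor's landscape orientation, so a portrait UIKit point generally needs to be transformed first, e.g. via ARFrame.displayTransform(for:viewportSize:), before being scaled into depth-map pixels.

```swift
import CoreGraphics

// Sketch: convert a normalized (0...1) point into clamped integer pixel
// indices for a depth map of the given size. The normalized point must
// already be in the depth map's own (landscape) coordinate space.
func depthMapIndices(width: Int, height: Int, normalizedPoint: CGPoint) -> (x: Int, y: Int) {
    let x = min(max(Int(CGFloat(width) * normalizedPoint.x), 0), width - 1)
    let y = min(max(Int(CGFloat(height) * normalizedPoint.y), 0), height - 1)
    return (x, y)
}
```

The clamping matters because a landmark exactly on the right or bottom edge would otherwise index one past the last row or column.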
Posted by CodeName. Last updated.
Post not yet marked as solved
1 Reply
228 Views
Hello everyone, I am working on a simple ML project. I trained a custom model to classify images of US dollar bills. Everything seems fine to me, but I don't know why the classification label isn't being updated with any value. Files: https://codeshare.io/OdXzMW
Posted. Last updated.
Post not yet marked as solved
0 Replies
165 Views
Hi there, I am trying to combine the code I have for stereo vision with the available hand-tracking code (drawing when pinching), but I'm running into trouble. I believe it has to do with the fact that the hand-tracking code sets up an AVCaptureSession, whereas the stereo vision code uses SceneKit. Could you please give me some feedback on how to start integrating these two very different sets of code? StereoVision Hand Tracking
Posted by alpl. Last updated.
Post not yet marked as solved
1 Reply
178 Views
I am writing an iOS app in Swift and need to analyse many photos in parallel. To do that, I plan to use a single CIDetector instance (as recommended by Apple). Is it safe to use the same CIDetector instance from different threads? I tried to find documentation about this but had no luck.
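I can't point to documentation that explicitly guarantees CIDetector is thread-safe either, so a conservative pattern (a sketch of a general technique, not an Apple recommendation) is to funnel every use of the shared instance through one serial queue:

```swift
import Dispatch

// Sketch: a generic wrapper that serializes all access to a shared
// resource. Wrap the single CIDetector in this and call `use` from any
// thread; the serial queue guarantees one caller at a time.
final class Serialized<T> {
    private let queue = DispatchQueue(label: "serialized.resource")
    private var resource: T

    init(_ resource: T) { self.resource = resource }

    func use<R>(_ body: (inout T) -> R) -> R {
        queue.sync { body(&resource) }
    }
}
```

Usage would look something like `let detector = Serialized(CIDetector(ofType: CIDetectorTypeFace, context: nil, options: nil)!)` and then `let features = detector.use { $0.features(in: ciImage) }`. This serializes detection, so it trades away some parallelism for safety; if the docs do guarantee thread safety, the wrapper can be dropped.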
Posted by n01z. Last updated.
Post not yet marked as solved
0 Replies
142 Views
I am trying to get the feature points' coordinates in the image from a VNFeaturePrintObservation, but I can't find a way to get them. Is there any way to do that? The only things I can see are the metrics, the number of elements, and the data in a form I don't understand.
Posted by alexnikol. Last updated.
Post marked as solved
1 Reply
260 Views
In my app, I am performing a VNDetectFaceLandmarksRequest with a VNSequenceRequestHandler. The video that serves as my input is from my iPhone's selfie camera. The request returns VNFaceLandmarkRegion2D objects, from which I get all the landmarks as an array of CGPoints via VNFaceLandmarkRegion2D.normalizedPoints. I want to compare all the CGPoint arrays over time, but I am not sure whether a point at a certain index always represents the same landmark. Can I assume that a specific landmark, e.g. the left-most landmark of the right eye, always has the same index in the CGPoint array?
Posted. Last updated.
Post not yet marked as solved
1 Reply
371 Views
In Vision's hand detection, we can work with one hand's landmarks to classify the pose. Is it possible to detect both hands' landmarks at the same time, so that we can detect a two-hand pose?
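For what it's worth, VNDetectHumanHandPoseRequest has a maximumHandCount property, and each detected hand comes back as its own observation, so a two-hand pose is just the pair of observations from one frame. A sketch (assuming an SDK where `results` is typed as hand-pose observations; on older SDKs you would cast from `[VNObservation]`):

```swift
import Vision

// Sketch: ask Vision for up to two hands in a single frame.
func detectHands(in pixelBuffer: CVPixelBuffer) throws -> [VNHumanHandPoseObservation] {
    let request = VNDetectHumanHandPoseRequest()
    request.maximumHandCount = 2

    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .up)
    try handler.perform([request])

    // Each element is one hand's full landmark set.
    return request.results ?? []
}
```

Note that Vision doesn't label which observation is the left or right hand (the `chirality` property exists only on newer SDKs), so a two-hand classifier may need to order the observations itself, e.g. by wrist x-position.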
Posted. Last updated.
Post not yet marked as solved
0 Replies
357 Views
Hello developers, I am currently struggling with providing the right data for a deep learning model I want to integrate into my Swift app. (I am fairly new to Swift and iOS dev, please bear with me.) The app is supposed to run on an iPad Pro with a LiDAR sensor, no other devices. I'm working with DenseFusion 6DoF pose estimation, which requires an RGB-D image as input (it's trained on the YCB-Video dataset). I already looked up different examples of how to stream depth data from the camera (the fog example) and also how to capture an image with depth information. So I started a session that gets video and depth data as CVPixelBuffers. However, I don't really know what to do with them after that, or how to fuse them together into one RGB-D image. I also want to use the predicted pose to place a 3D model into a scene so I can attach AR content to it. (The app's purpose is to check whether deep learning can outperform RealityKit in things like poor lighting conditions, etc.) I'd be glad about every little bit of help. Thanks in advance!
Posted by MiriamJo. Last updated.
Post not yet marked as solved
1 Reply
608 Views
Hello, is it possible to calculate the distance from the iPhone to objects recognized by ML models? If so, how? I suppose AVDepthData is not suitable for measuring distances in real time. Perhaps you know of projects that do this. Thanks in advance!
Posted by Sarliefer. Last updated.
Post not yet marked as solved
0 Replies
290 Views
My activity classifier is used in tennis sessions, where there are necessarily multiple people on the court. There is also a decent chance other courts' players will be in the shot, depending on the angle and lens. For my training data, would it be best to crop out adjacent courts?
Posted by Curiosity. Last updated.
Post not yet marked as solved
0 Replies
290 Views
For a Create ML activity classifier, I’m classifying “playing” tennis (the points or rallies) and a second class, “not playing”, as the negative class. I’m not sure what to specify for the action duration parameter given how variable a tennis point or rally can be, but I went with 10 seconds since it seems like the average duration for both the “playing” and “not playing” labels. When choosing this parameter, however, I’m wondering whether it affects performance, both speed of video processing and accuracy. Would the Vision framework return more results with smaller action durations?
Posted by Curiosity. Last updated.
Post not yet marked as solved
0 Replies
213 Views
I am using VNRecognizeTextRequest in my app to process text from a picture captured by the user. Everything works fine except that the observations returned by the request are in a strange order. For the sake of simplicity, imagine that the app is processing a square with a word in each of the four quadrants. Instead of going upper-left, upper-right, lower-left, lower-right, it goes upper-left, lower-left, upper-right, lower-right. This strange order only occurs in some places in the image, not over the whole image (it generally does go in the expected order and first picks up the higher-up words). I have checked the coordinates of the observations in a few example images, and there is no clear pattern to why this is happening (I originally thought the issue might be due to the tilt of the camera or something similar, but it appears not). I would be super grateful if anyone has an idea of why this might be happening.
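For what it's worth, Vision doesn't promise a reading order, so one workaround is to sort the observations yourself by bounding box. A sketch (note that Vision's normalized coordinates put the origin at the lower-left, so a larger y means higher in the image; the same-line tolerance is an assumption to tune for tilted photos):

```swift
import CoreGraphics

// Sketch: order boxes top-to-bottom, and left-to-right within a "line".
// Generic over the element so it can sort VNRecognizedTextObservation
// values via { $0.boundingBox } or plain CGRects in a test.
func readingOrder<T>(_ items: [T], box: (T) -> CGRect) -> [T] {
    items.sorted { a, b in
        let (ra, rb) = (box(a), box(b))
        // Boxes whose vertical centers are close count as the same row.
        if abs(ra.midY - rb.midY) < 0.02 {
            return ra.minX < rb.minX        // same row: left to right
        }
        return ra.midY > rb.midY            // otherwise: top (larger y) first
    }
}
```

Usage would be something like `let ordered = readingOrder(request.results ?? []) { $0.boundingBox }` in the request's completion handler.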
Posted by agaS95. Last updated.
Post marked as solved
2 Replies
850 Views
Did something change in face detection / the Vision framework on iOS 15? Using VNDetectFaceLandmarksRequest and reading the VNFaceLandmarkRegion2D to detect eyes does not work on iOS 15 as it did before. I am running the exact same code on an iOS 14 and an iOS 15 device, and the coordinates are different, as seen in the screenshot. Any ideas?
Posted by Ships66. Last updated.
Post not yet marked as solved
0 Replies
225 Views
What might be a good approach to estimating the progress of a VNVideoProcessor operation? I'd like to show a progress bar that's genuinely useful, like one based on the progress Apple vends for the photo picker or exports. This would make a world of difference compared to a UIActivityIndicatorView, but I'm not sure how to approach hand-rolling it (or whether that would even be a good idea). I filed an API enhancement request for this, FB9888210.
Posted by Curiosity. Last updated.
Post not yet marked as solved
1 Reply
475 Views
Below, the sampleBufferProcessor closure is where the Vision body pose detection occurs.

/// Transfers the sample data from the AVAssetReaderOutput to the AVAssetWriterInput,
/// processing via a CMSampleBufferProcessor.
///
/// - Parameters:
///   - readerOutput: The source sample data.
///   - writerInput: The destination for the sample data.
///   - queue: The DispatchQueue.
///   - completionHandler: The completion handler to run when the transfer finishes.
/// - Tag: transferSamplesAsynchronously
private func transferSamplesAsynchronously(from readerOutput: AVAssetReaderOutput,
                                           to writerInput: AVAssetWriterInput,
                                           onQueue queue: DispatchQueue,
                                           sampleBufferProcessor: SampleBufferProcessor?,
                                           completionHandler: @escaping () -> Void) {
    /* The writerInput continuously invokes this closure until finished or cancelled.
       It throws an NSInternalInconsistencyException if called more than once for
       the same writer. */
    writerInput.requestMediaDataWhenReady(on: queue) {
        var isDone = false

        /* While the writerInput accepts more data, process the sampleBuffer and
           then transfer the processed sample to the writerInput. */
        while writerInput.isReadyForMoreMediaData {
            if self.isCancelled {
                isDone = true
                break
            }

            // Get the next sample from the asset reader output.
            guard let sampleBuffer = readerOutput.copyNextSampleBuffer() else {
                // The asset reader output has no more samples to vend.
                isDone = true
                break
            }

            // Process the sample, if requested.
            do {
                try sampleBufferProcessor?(sampleBuffer)
            } catch {
                // The `readingAndWritingDidFinish()` function picks up this error.
                self.sampleTransferError = error
                isDone = true
            }

            // Append the sample to the asset writer input.
            guard writerInput.append(sampleBuffer) else {
                /* The writer could not append the sample buffer. The
                   `readingAndWritingDidFinish()` function handles any error
                   information from the asset writer. */
                isDone = true
                break
            }
        }

        if isDone {
            /* Calling `markAsFinished()` on the asset writer input does the following:
               1. Unblocks any other inputs needing more samples.
               2. Cancels further invocations of this "request media data" callback block. */
            writerInput.markAsFinished()

            /* Tell the caller the reader output and writer input finished
               transferring samples. */
            completionHandler()
        }
    }
}

The processor closure runs body pose detection on every sample buffer so that later, in the VNDetectHumanBodyPoseRequest completion handler, the VNHumanBodyPoseObservation results are fed into a custom Core ML action classifier.

private func videoProcessorForActivityClassification() -> SampleBufferProcessor {
    let videoProcessor: SampleBufferProcessor = { sampleBuffer in
        do {
            let requestHandler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer)
            try requestHandler.perform([self.detectHumanBodyPoseRequest])
        } catch {
            print("Unable to perform the request: \(error.localizedDescription).")
        }
    }
    return videoProcessor
}

How could I improve the performance of this pipeline? After testing with an hour-long 4K video at 60 FPS, it took several hours to process running as a Mac Catalyst app on an M1 Max.
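Since the question is about throughput: one cheap, commonly used lever (a sketch, not part of the pipeline above) is to run the Vision request on only every Nth buffer, since neighboring 60 FPS frames carry nearly identical pose information, while still appending every buffer to the writer so output quality is preserved. The stride value is an assumption to tune against classifier accuracy.

```swift
// Sketch: decide whether a given frame index should go through Vision.
// All frames are still written out; only the pose requests are thinned.
func shouldRunPoseRequest(frameIndex: Int, stride: Int = 4) -> Bool {
    frameIndex % stride == 0
}
```

In the processor closure this would guard the `requestHandler.perform` call, cutting Vision work to a quarter at stride 4 while the reader/writer transfer runs at full rate.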
Posted by Curiosity. Last updated.
Post not yet marked as solved
0 Replies
338 Views
I'm having trouble reasoning about and modifying the Detecting Human Actions in a Live Video Feed sample code, since I'm new to Combine.

// ---- [MLMultiArray?] -- [MLMultiArray?] ----
// Make an activity prediction from the window.
.map(predictActionWithWindow)
// ---- ActionPrediction -- ActionPrediction ----
// Send the action prediction to the delegate.
.sink(receiveValue: sendPrediction)

These are the final two operators of the video-processing pipeline, where the action prediction occurs. In the implementation of either

private func predictActionWithWindow(_ currentWindow: [MLMultiArray?]) -> ActionPrediction

or

private func sendPrediction(_ actionPrediction: ActionPrediction)

how might I access the results of a VNHumanBodyPoseRequest that's retrieved and scoped in a function called earlier in the daisy chain? When I did this imperatively, I accessed the results in the VNDetectHumanBodyPoseRequest completion handler, but I'm not sure how data flow would work with Combine's programming model. I want to associate predictions with the observation results they're based on so that I can store the time range of a given prediction label.
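One way to keep the source data available downstream is to map to a tuple instead of a bare value, so each operator passes the pairing along and the sink receives both the prediction and its context. The sketch below demonstrates the pattern with plain stand-in types (String for ActionPrediction, ClosedRange<Int> for a time range) so it runs anywhere; in the sample code the same shape would apply to the operators above, with the operator that performs the pose request emitting its observations alongside the multiarray window.

```swift
import Combine

// Stand-ins for the sample's types, so the pattern is runnable anywhere.
typealias Window = [Int]          // stands in for [MLMultiArray?]
typealias Prediction = String     // stands in for ActionPrediction

var received: [(Prediction, ClosedRange<Int>)] = []

let cancellable = [([1, 2, 3], 0...3), ([4, 5, 6], 3...6)]
    .publisher
    // Map to a tuple so the source context survives the prediction step.
    .map { (window: Window, frames: ClosedRange<Int>) -> (Prediction, ClosedRange<Int>) in
        let prediction = window.reduce(0, +) > 10 ? "playing" : "not playing"
        return (prediction, frames)
    }
    .sink { prediction, frames in
        // The sink now knows which span of input produced each label.
        received.append((prediction, frames))
    }
```

The point-free `.map(predictActionWithWindow)` style has to give way to closures once tuples are involved, but the data flow stays declarative: context rides along the pipeline instead of being read out of a captured request object.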
Posted by Curiosity. Last updated.
Post not yet marked as solved
0 Replies
342 Views
I have made a Scan to Text app with the help of sources from the internet, but I can’t figure out a way to make my output text editable. Here’s my code:

private func makeScannerView() -> ScannerView {
    ScannerView(completion: { textPerPage in
        if let outputText = textPerPage?.joined(separator: "\n").trimmingCharacters(in: .whitespacesAndNewlines) {
            let newScanData = ScanData(content: outputText)
            self.texts.append(newScanData)
        }
        self.showScannerSheet = false
    })
}
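One way to make the recognized text editable is to bind it to a SwiftUI TextEditor (available iOS 14+) instead of a read-only Text view. A sketch, assuming ScanData is the post's model type with a String `content` property:

```swift
import SwiftUI

// Assumed to match the model type used in the post.
struct ScanData: Identifiable {
    let id = UUID()
    var content: String
}

// Sketch: seed local editable state from the scan, then let the
// user edit it in a scrollable multi-line TextEditor.
struct EditableScanView: View {
    @State private var content: String

    init(scan: ScanData) {
        _content = State(initialValue: scan.content)
    }

    var body: some View {
        TextEditor(text: $content)
            .padding()
    }
}
```

To persist edits back into the `texts` array, the view would instead take a `Binding<ScanData>` from the parent list rather than local `@State`.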
Posted by YashSinha. Last updated.
Post not yet marked as solved
2 Replies
463 Views
I just got an app feature working where the user imports a video file, each frame is fed to a custom action classifier, and then only frames with a certain action classified are exported. However, I'm finding that testing a one hour 4K video at 60 FPS is taking an unreasonably long time - it's been processing for 7 hours now on a MacBook Pro with M1 Max running the Mac Catalyst app. Are there any techniques or general guidance that would help with improving performance? As much as possible I'd like to preserve the input video quality, especially frame rate. One hour length for the video is expected, as it's of a tennis session (could be anywhere from 10 minutes to a couple hours). I made the body pose action classifier with Create ML.
Posted by Curiosity. Last updated.
Post not yet marked as solved
1 Reply
293 Views
I'm adopting and transitioning to VNVideoProcessor, away from performing Vision requests on individual frames, since it does the same thing more concisely. However, I'm not sure how to detect when analysis of a video is finished. Previously, when reading frames with AVFoundation, I could check with:

// Get the next sample from the asset reader output.
guard let sampleBuffer = readerOutput.copyNextSampleBuffer() else {
    // The asset reader output has no more samples to vend.
    isDone = true
    break
}

What would be the equivalent when using VNVideoProcessor?
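As far as I can tell (worth verifying against the current headers), VNVideoProcessor.analyze(_:) is a synchronous, throwing call: it returns once the requested time range has been processed, so "finished" is simply the call returning. A sketch of wrapping it with a completion handler:

```swift
import AVFoundation
import Vision

// Sketch: run the blocking analyze(_:) call off the main queue and
// report completion (or the thrown error) back on the main queue.
func analyzeVideo(with processor: VNVideoProcessor,
                  duration: CMTime,
                  completion: @escaping (Error?) -> Void) {
    DispatchQueue.global(qos: .userInitiated).async {
        do {
            try processor.analyze(CMTimeRange(start: .zero, duration: duration))
            // analyze(_:) returned, so the whole time range has been processed.
            DispatchQueue.main.async { completion(nil) }
        } catch {
            DispatchQueue.main.async { completion(error) }
        }
    }
}
```

Per-frame results still arrive through each request's completion handler during the call; this wrapper only signals the end of the whole pass.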
Posted by Curiosity. Last updated.