Apply computer vision algorithms to perform a variety of tasks on input images and video using Vision.

Vision Documentation

Posts under Vision tag

80 Posts
Post not yet marked as solved
0 Replies
165 Views
Hi there, I am trying to combine my stereo vision code with the available hand tracking code (drawing on pinch), but I'm running into trouble. I believe the problem is that the hand tracking code sets up an AV session, whereas the stereo vision code uses SceneKit. Could you please give me some feedback on how to start integrating these two very different sets of code? StereoVision Hand Tracking
Post not yet marked as solved
1 Replies
179 Views
I am writing an iOS app in Swift and need to analyse many photos in parallel. To do that, I plan to use a single CIDetector instance (as recommended by Apple). Is it safe to use the same CIDetector instance from different threads? I tried to find documentation about this but had no luck.
Post not yet marked as solved
0 Replies
143 Views
I am trying to get the coordinates of the feature points in the image from a VNFeaturePrintObservation, but I can't find a way to do it. Is there any way to do that? The only things I can see how to get are the metrics, the number of elements, and the data in a form that I don't understand.
Post marked as solved
1 Replies
260 Views
In my app, I am performing a VNDetectFaceLandmarksRequest with a VNSequenceRequestHandler. The input video comes from my iPhone's selfie camera. The request returns a VNFaceLandmarkRegion2D, from which I get all the landmarks as an array of CGPoints via VNFaceLandmarkRegion2D.normalizedPoints. I want to compare the CGPoint arrays over time, but I am not sure whether the point at a given index always represents the same landmark. Can I assume that a specific landmark, e.g. the left-most landmark of the right eye, always has the same index in the CGPoint array?
Post not yet marked as solved
0 Replies
358 Views
Hello developers, I have recently been struggling with providing the right data for a deep learning model I want to integrate into my Swift app. (I am fairly new to Swift and iOS development, so please bear with me.) The app is supposed to run only on an iPad Pro with a LiDAR sensor, no other devices. I'm working with DenseFusion 6DoF pose estimation, which requires an RGB-D image as input (it's trained on the YCB-Video dataset). I have already looked at different examples of how to stream depth data from the camera (the fog example) and how to capture an image with depth information. So I started a session that gets video and depth data from the CVPixelBuffer as input. However, I don't really know what to do with them after that, or how to fuse them together into one RGB-D image. I also want to use the predicted pose to place a 3D model into a scene so I can attach AR content to it. (The app's purpose is to check whether deep learning can outperform RealityKit in situations such as poor lighting conditions.) I'd be grateful for any help. Thanks in advance!
Post not yet marked as solved
0 Replies
290 Views
My activity classifier is used in tennis sessions, where there are necessarily multiple people on the court. There is also a decent chance other courts' players will be in the shot, depending on the angle and lens. For my training data, would it be best to crop out adjacent courts?
Post not yet marked as solved
0 Replies
290 Views
For a Create ML activity classifier, I’m classifying “playing” tennis (the points or rallies) and a second class “not playing” to be the negative class. I’m not sure what to specify for the action duration parameter given how variable a tennis point or rally can be, but I went with 10 seconds since it seems like the average duration for both the “playing” and “not playing” labels. When choosing this parameter however, I’m wondering if it affects performance, both speed of video processing and accuracy. Would the Vision framework return more results with smaller action durations?
Post not yet marked as solved
0 Replies
213 Views
I am using VNRecognizeTextRequest in my app to process text from a picture captured by the user. Everything works fine except that the observations returned by the request are in a strange order. For the sake of simplicity, imagine that the app is processing a square with a word in each of the four quadrants. Instead of going upper-left, upper-right, lower-left, lower-right, it goes upper-left, lower-left, upper-right, lower-right. This strange order only occurs in some places in the image, not over the whole image (it generally goes in the expected order and picks up the higher words first). I have checked the coordinates of the observations in a few example images, and there is no clear pattern to explain why this is happening (I originally thought the issue might be due to the tilt of the camera or something similar, but it appears not). I would be super grateful if anyone has an idea of why this might be happening.
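
One possible workaround is to sort the observations yourself by their bounding boxes instead of relying on the order in which results are returned. The sketch below is only an illustration under stated assumptions: `TextBox` is a hypothetical stand-in for the text and `boundingBox` of each observation (normalized coordinates with the origin at the lower left), and `rowTolerance` is an arbitrary threshold for deciding that two boxes sit on the same line.

```swift
import Foundation

// Hypothetical stand-in for a recognized-text observation: the text plus its
// normalized bounding box (origin at the lower-left corner).
struct TextBox {
    let text: String
    let minX: Double
    let minY: Double
    let width: Double
    let height: Double

    var midY: Double { minY + height / 2 }
}

// Sorts boxes top-to-bottom, then left-to-right within a row.
// Boxes whose vertical centers differ by less than `rowTolerance` from the
// first box of a row are treated as part of that row (an assumed heuristic).
func sortedInReadingOrder(_ boxes: [TextBox], rowTolerance: Double = 0.05) -> [TextBox] {
    // Highest box first, because y grows upward in normalized coordinates.
    let byHeight = boxes.sorted { $0.midY > $1.midY }

    var rows: [[TextBox]] = []
    for box in byHeight {
        if let last = rows.last, let anchor = last.first,
           abs(anchor.midY - box.midY) < rowTolerance {
            rows[rows.count - 1].append(box)
        } else {
            rows.append([box])
        }
    }

    // Order each row left-to-right and flatten the rows into one array.
    return rows.flatMap { row in row.sorted { $0.minX < $1.minX } }
}

let quadrants = [
    TextBox(text: "upper-left", minX: 0.1, minY: 0.70, width: 0.3, height: 0.1),
    TextBox(text: "lower-left", minX: 0.1, minY: 0.20, width: 0.3, height: 0.1),
    TextBox(text: "upper-right", minX: 0.6, minY: 0.72, width: 0.3, height: 0.1),
    TextBox(text: "lower-right", minX: 0.6, minY: 0.18, width: 0.3, height: 0.1),
]
let ordered = sortedInReadingOrder(quadrants).map { $0.text }
print(ordered)
```

A fixed tolerance will not suit every layout; tilted or multi-column text would likely need a different grouping rule.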
Post not yet marked as solved
0 Replies
226 Views
What might be a good approach to estimating the progress of a VNVideoProcessor operation? I'd like to show a progress bar that's reasonably useful, like one based on the progress Apple vends for the photo picker or exports. This would make a world of difference compared to a UIActivityIndicatorView, but I'm not sure how to approach hand-rolling this (or if that would even be a good idea). I filed an API enhancement request for this, FB9888210.
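
A rough hand-rolled estimate might compare the timestamp of the most recently analyzed frame with the total duration of the video. The sketch below assumes you can obtain such a timestamp in seconds (for example from the time range of the latest observation) and the asset's duration; `estimatedProgress` and its parameter names are hypothetical, not part of Vision.

```swift
import Foundation

// Estimates progress in 0...1 from the timestamp of the latest processed
// frame and the total duration of the video, both in seconds.
func estimatedProgress(latestTimestamp: Double, videoDuration: Double) -> Double {
    // Guard against a zero, negative, or unknown duration and bad timestamps.
    guard videoDuration > 0, latestTimestamp.isFinite else { return 0 }
    // Clamp so that rounding or out-of-range timestamps never leave 0...1.
    return min(max(latestTimestamp / videoDuration, 0), 1)
}

let quarterDone = estimatedProgress(latestTimestamp: 90, videoDuration: 360)
print(quarterDone)
```

The resulting value could drive a UIProgressView on the main queue. It only reflects how far through the video the analysis has read, so it is an approximation when frames vary in processing cost.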
Post not yet marked as solved
0 Replies
339 Views
I'm having trouble reasoning about and modifying the Detecting Human Actions in a Live Video Feed sample code since I'm new to Combine.

    // ---- [MLMultiArray?] -- [MLMultiArray?] ----
    // Make an activity prediction from the window.
    .map(predictActionWithWindow)
    // ---- ActionPrediction -- ActionPrediction ----
    // Send the action prediction to the delegate.
    .sink(receiveValue: sendPrediction)

These are the final two operators of the video processing pipeline, where the action prediction occurs. In either the implementation for private func predictActionWithWindow(_ currentWindow: [MLMultiArray?]) -> ActionPrediction or for private func sendPrediction(_ actionPrediction: ActionPrediction), how might I access the results of a VNDetectHumanBodyPoseRequest that's retrieved and scoped in a function called earlier in the daisy chain? When I did this imperatively, I accessed the results in the VNDetectHumanBodyPoseRequest completion handler, but I'm not sure how data flow would work with Combine's programming model. I want to associate predictions with the observation results they're based on so that I can store the time range of a given prediction label.
Post not yet marked as solved
0 Replies
342 Views
I have made a Scan to Text app with the help of sources from the internet, but I can't figure out a way to make my output text editable. Here's my code:

    private func makeScannerView() -> ScannerView {
        ScannerView(completion: { textPerPage in
            if let outputText = textPerPage?.joined(separator: "\n").trimmingCharacters(in: .whitespacesAndNewlines) {
                let newScanData = ScanData(content: outputText)
                self.texts.append(newScanData)
            }
            self.showScannerSheet = false
        })
    }
Post not yet marked as solved
1 Replies
293 Views
I'm transitioning to VNVideoProcessor and away from performing Vision requests on individual frames, since it does the same thing more concisely. However, I'm not sure how to detect when analysis of a video is finished. Previously, when reading frames with AVFoundation, I could check with

    // Get the next sample from the asset reader output.
    guard let sampleBuffer = readerOutput.copyNextSampleBuffer() else {
        // The asset reader output has no more samples to vend.
        isDone = true
        break
    }

What would be an equivalent when using VNVideoProcessor?
Post not yet marked as solved
1 Replies
475 Views
Below, the sampleBufferProcessor closure is where the Vision body pose detection occurs.

    /// Transfers the sample data from the AVAssetReaderOutput to the AVAssetWriterInput,
    /// processing via a CMSampleBufferProcessor.
    ///
    /// - Parameters:
    ///   - readerOutput: The source sample data.
    ///   - writerInput: The destination for the sample data.
    ///   - queue: The DispatchQueue.
    ///   - completionHandler: The completion handler to run when the transfer finishes.
    /// - Tag: transferSamplesAsynchronously
    private func transferSamplesAsynchronously(from readerOutput: AVAssetReaderOutput,
                                               to writerInput: AVAssetWriterInput,
                                               onQueue queue: DispatchQueue,
                                               sampleBufferProcessor: SampleBufferProcessor,
                                               completionHandler: @escaping () -> Void) {

        /*
         The writerInput continuously invokes this closure until finished or cancelled.
         It throws an NSInternalInconsistencyException if called more than once for the
         same writer.
         */
        writerInput.requestMediaDataWhenReady(on: queue) {
            var isDone = false

            /*
             While the writerInput accepts more data, process the sampleBuffer and then
             transfer the processed sample to the writerInput.
             */
            while writerInput.isReadyForMoreMediaData {

                if self.isCancelled {
                    isDone = true
                    break
                }

                // Get the next sample from the asset reader output.
                guard let sampleBuffer = readerOutput.copyNextSampleBuffer() else {
                    // The asset reader output has no more samples to vend.
                    isDone = true
                    break
                }

                // Process the sample, if requested.
                do {
                    try sampleBufferProcessor?(sampleBuffer)
                } catch {
                    /*
                     The `readingAndWritingDidFinish()` function picks up this error.
                     */
                    self.sampleTransferError = error
                    isDone = true
                }

                // Append the sample to the asset writer input.
                guard writerInput.append(sampleBuffer) else {
                    /*
                     The writer could not append the sample buffer.
                     The `readingAndWritingDidFinish()` function handles any
                     error information from the asset writer.
                     */
                    isDone = true
                    break
                }
            }

            if isDone {
                /*
                 Calling `markAsFinished()` on the asset writer input does the following:
                 1. Unblocks any other inputs needing more samples.
                 2. Cancels further invocations of this "request media data" callback block.
                 */
                writerInput.markAsFinished()

                /*
                 Tell the caller the reader output and writer input finished transferring samples.
                 */
                completionHandler()
            }
        }
    }

The processor closure runs body pose detection on every sample buffer so that later, in the VNDetectHumanBodyPoseRequest completion handler, the VNHumanBodyPoseObservation results are fed into a custom Core ML action classifier.

    private func videoProcessorForActivityClassification() -> SampleBufferProcessor {
        let videoProcessor: SampleBufferProcessor = { sampleBuffer in
            do {
                let requestHandler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer)
                try requestHandler.perform([self.detectHumanBodyPoseRequest])
            } catch {
                print("Unable to perform the request: \(error.localizedDescription).")
            }
        }
        return videoProcessor
    }

How could I improve the performance of this pipeline? When I tested with an hour-long 4K video at 60 FPS, it took several hours to process, running as a Mac Catalyst app on an M1 Max.
Post not yet marked as solved
2 Replies
463 Views
I just got an app feature working where the user imports a video file, each frame is fed to a custom action classifier, and then only frames with a certain action classified are exported. However, I'm finding that testing a one hour 4K video at 60 FPS is taking an unreasonably long time - it's been processing for 7 hours now on a MacBook Pro with M1 Max running the Mac Catalyst app. Are there any techniques or general guidance that would help with improving performance? As much as possible I'd like to preserve the input video quality, especially frame rate. One hour length for the video is expected, as it's of a tennis session (could be anywhere from 10 minutes to a couple hours). I made the body pose action classifier with Create ML.
Post not yet marked as solved
0 Replies
361 Views
Modifying guidance given in an answer on AVFoundation + Vision trajectory detection, I'm instead saving time ranges of frames that have a specific ML label from my custom action classifier:

    private lazy var detectHumanBodyPoseRequest: VNDetectHumanBodyPoseRequest = {
        let detectHumanBodyPoseRequest = VNDetectHumanBodyPoseRequest(completionHandler: completionHandler)
        return detectHumanBodyPoseRequest
    }()

    var timeRangesOfInterest: [Int : CMTimeRange] = [:]

    private func readingAndWritingDidFinish(assetReaderWriter: AVAssetReaderWriter,
                                            asset completionHandler: @escaping FinishHandler) {
        if isCancelled {
            completionHandler(.success(.cancelled))
            return
        }

        // Handle any error during processing of the video.
        guard sampleTransferError == nil else {
            assetReaderWriter.cancel()
            completionHandler(.failure(sampleTransferError!))
            return
        }

        // Evaluate the result reading the samples.
        let result = assetReaderWriter.readingCompleted()
        if case .failure = result {
            completionHandler(result)
            return
        }

        /*
         Finish writing, and asynchronously evaluate the results from writing
         the samples.
         */
        assetReaderWriter.writingCompleted { result in
            self.exportVideoTimeRanges(timeRanges: self.timeRangesOfInterest.map { $0.value }) { result in
                completionHandler(result)
            }
        }
    }

    func exportVideoTimeRanges(timeRanges: [CMTimeRange],
                               completion: @escaping (Result<OperationStatus, Error>) -> Void) {
        let inputVideoTrack = self.asset.tracks(withMediaType: .video).first!

        let composition = AVMutableComposition()
        let compositionTrack = composition.addMutableTrack(withMediaType: .video,
                                                           preferredTrackID: kCMPersistentTrackID_Invalid)!

        var insertionPoint: CMTime = .zero
        for timeRange in timeRanges {
            try! compositionTrack.insertTimeRange(timeRange, of: inputVideoTrack, at: insertionPoint)
            insertionPoint = insertionPoint + timeRange.duration
        }

        let exportSession = AVAssetExportSession(asset: composition,
                                                 presetName: AVAssetExportPresetHighestQuality)!
        try? FileManager.default.removeItem(at: self.outputURL)
        exportSession.outputURL = self.outputURL
        exportSession.outputFileType = .mov
        exportSession.exportAsynchronously {
            var result: Result<OperationStatus, Error>
            switch exportSession.status {
            case .completed:
                result = .success(.completed)
            case .cancelled:
                result = .success(.cancelled)
            case .failed:
                // The `error` property is non-nil in the `.failed` status.
                result = .failure(exportSession.error!)
            default:
                fatalError("Unexpected terminal export session status: \(exportSession.status).")
            }
            print("export finished: \(exportSession.status.rawValue) - \(exportSession.error)")
            completion(result)
        }
    }

This worked fine with results vended from Apple's trajectory detection, but using my custom action classifier TennisActionClassifier (a Core ML model exported from Create ML), I get the console error

    getSubtractiveDecodeDuration signalled err=-16364 (kMediaSampleTimingGeneratorError_InvalidTimeStamp) (Decode timestamp is earlier than previous sample's decode timestamp.) at MediaSampleTimingGenerator.c:180

Why might this be?
Post not yet marked as solved
0 Replies
394 Views
I use VNDetectHumanBodyPoseRequest to detect a body in an image that is in my Xcode assets (I downloaded it from an image website), but I get the error below:

    2021-12-24 21:50:19.945976+0800 Guess My Exercise[91308:4258893] [espresso] [Espresso::handle_ex_plan] exception=Espresso exception: "I/O error": Missing weights path cnn_human_pose.espresso.weights status=-2
    Unable to perform the request: Error Domain=com.apple.vis Code=9 "Unable to setup request in VNDetectHumanBodyPoseRequest" UserInfo={NSLocalizedDescription=Unable to setup request in VNDetectHumanBodyPoseRequest}.

Below is my code:

    let image = UIImage(named: "image2")
    guard let cgImage = image?.cgImage else { return }
    let requestHandler = VNImageRequestHandler(cgImage: cgImage)
    let request = VNDetectHumanBodyPoseRequest(completionHandler: bodyPoseHandler)
    do {
        // Perform the body pose-detection request.
        try requestHandler.perform([request])
    } catch {
        print("Unable to perform the request: \(error).")
    }

    func bodyPoseHandler(request: VNRequest, error: Error?) {
        guard let observations = request.results as? [VNHumanBodyPoseObservation] else { return }
        let poses = Pose.fromObservations(observations)
        self.drawPoses(poses, onto: self.simage!)
        // Process each observation to find the recognized body pose points.
    }
Post not yet marked as solved
0 Replies
257 Views
My goal is to mark any tennis video's timestamps of both the start of each rally/point and the end of each rally/point. I tried trajectory detection, but the "end time" is when the ball bounces rather than when the rally/point ends. I'm not quite sure what direction to go from here to improve on this. Would action classification of body poses in each frame (two classes, "playing" and "not playing") be the best way to split the video into segments? A different technique?
Post not yet marked as solved
0 Replies
281 Views
I'm building a feature to automatically edit out all the downtime of a tennis video. I have a partial implementation that stores the start and end times of Vision trajectory detections and writes only those segments to an AVFoundation export session. I've encountered a major issue, which is that the trajectories returned end whenever the ball bounces, so each segment is just one tennis shot and nowhere close to an entire rally with multiple bounces. I'm unsure if I should continue down the trajectory route, maybe stitching together the trajectories and somehow only splitting at the start and end of a rally. Any general guidance would be appreciated. Is there a different Vision or ML approach that would more accurately model the start and end time of a rally? I considered creating a custom action classifier to classify frames as either "playing tennis" or "inactivity," but I started with Apple's trajectory detection since it was already built and trained. Maybe a custom classifier would be needed, but I'm not sure.
Post not yet marked as solved
0 Replies
249 Views
I am saving time ranges from an input video asset where trajectories are found, then exporting only those segments to an output video file. Currently I track these time ranges in a stored property var timeRangesOfInterest: [Double : CMTimeRange], which is set in the trajectory request's completion handler:

    func completionHandler(request: VNRequest, error: Error?) {
        guard let request = request as? VNDetectTrajectoriesRequest else { return }

        if let results = request.results,
           results.count > 0 {
            for result in results {
                var timeRange = result.timeRange
                timeRange.start = timeRange.start - self.assetWriterStartTime
                self.timeRangesOfInterest[timeRange.start.seconds] = timeRange
            }
        }
    }

Then these time ranges of interest are used in an export session to export only those segments:

    /*
     Finish writing, and asynchronously evaluate the results from writing
     the samples.
     */
    assetReaderWriter.writingCompleted { result in
        self.exportVideoTimeRanges(timeRanges: self.timeRangesOfInterest.map { $0.1 }) { result in
            completionHandler(result)
        }
    }

Unfortunately, I'm getting repeated trajectory video segments in the output video. Is this maybe because trajectory requests return "in progress" repeated trajectory results with slightly different time range start times? What might be a good strategy for avoiding or removing them? I noticed that trajectory segments also appear out of order in the output.
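
If the repeats come from overlapping results, one possible clean-up step before exporting is to sort the saved ranges by start time and merge any that overlap. Swift dictionaries have no defined iteration order, so mapping timeRangesOfInterest straight into an array does not keep the ranges chronological, which may account for the out-of-order segments. The sketch below is an assumption-based illustration: `Segment` is a hypothetical stand-in for CMTimeRange that uses plain seconds so the example is self-contained, and `mergedSegments` is not an AVFoundation or Vision API.

```swift
import Foundation

// Hypothetical stand-in for CMTimeRange, using seconds for simplicity.
struct Segment: Equatable {
    var start: Double
    var end: Double
}

// Sorts segments chronologically and merges any that overlap or touch,
// so each moment of the source video is exported at most once.
func mergedSegments(_ segments: [Segment]) -> [Segment] {
    let sorted = segments.sorted { $0.start < $1.start }
    var merged: [Segment] = []
    for segment in sorted {
        if let last = merged.last, segment.start <= last.end {
            // Overlaps the previous segment: extend it instead of appending.
            merged[merged.count - 1].end = max(last.end, segment.end)
        } else {
            merged.append(segment)
        }
    }
    return merged
}

let saved = [
    Segment(start: 12.0, end: 14.0),
    Segment(start: 3.0, end: 5.0),
    Segment(start: 3.2, end: 5.5),
    Segment(start: 5.5, end: 6.0),
]
let cleaned = mergedSegments(saved)
print(cleaned)
```

The same sort-then-merge logic could be applied to CMTimeRange values before they are passed to exportVideoTimeRanges.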