Detect people, faces, and poses using Vision


Discuss the WWDC21 session Detect people, faces, and poses using Vision.


Posts under wwdc21-10040 tag

3 Posts
Post not yet marked as solved
0 Replies
289 Views
I followed Apple's guidance in their articles Creating an Action Classifier Model, Gathering Training Videos for an Action Classifier, and Building an Action Classifier Data Source. With this Core ML model file now imported in Xcode, how do I use it to classify video frames? For each video frame I call:

    do {
        let requestHandler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer)
        try requestHandler.perform([self.detectHumanBodyPoseRequest])
    } catch {
        print("Unable to perform the request: \(error.localizedDescription).")
    }

But it's unclear to me how to use the results of the VNDetectHumanBodyPoseRequest, which come back as the type [VNHumanBodyPoseObservation]?. How would I feed the results into my custom classifier, whose model class TennisActionClassifier.swift was automatically generated by Xcode? The classifier makes predictions on each frame's body poses, labeling the action as either playing a rally/point or not playing.
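A possible answer, sketched under assumptions: Create ML action classifiers are trained on a fixed-length window of pose keypoints, so the usual pattern is to convert each VNHumanBodyPoseObservation to an MLMultiArray with keypointsMultiArray(), buffer a window of them, concatenate along the frame axis, and pass that to the generated model class. The window size of 60 frames, the input name `poses`, and the output property `label` below are assumptions based on how Create ML typically generates action classifier interfaces; check the generated TennisActionClassifier.swift for the exact names and expected shape.

```swift
import Vision
import CoreML

final class RallyClassifier {
    // Assumption: the classifier was trained with a 60-frame prediction window.
    private let windowSize = 60
    private var poseWindow: [MLMultiArray] = []
    private let classifier: TennisActionClassifier

    init() throws {
        classifier = try TennisActionClassifier(configuration: MLModelConfiguration())
    }

    /// Call once per frame with the first observation from the
    /// VNDetectHumanBodyPoseRequest results.
    func add(_ observation: VNHumanBodyPoseObservation) throws {
        // Each observation converts to a multiarray of keypoints
        // ([x, y, confidence] per joint).
        poseWindow.append(try observation.keypointsMultiArray())
        guard poseWindow.count == windowSize else { return }

        // Concatenate the buffered frames along the frame axis and classify.
        let input = MLMultiArray(concatenating: poseWindow, axis: 0, dataType: .float32)
        let output = try classifier.prediction(poses: input)
        print("Predicted action: \(output.label)")
        poseWindow.removeAll()
    }
}
```

For smoother output you could slide the window (drop the oldest frame instead of clearing the whole buffer) so you get a prediction every frame once the buffer fills.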
Post not yet marked as solved
0 Replies
281 Views
I'm building a feature to automatically edit out all the downtime of a tennis video. I have a partial implementation that stores the start and end times of Vision trajectory detections and writes only those segments to an AVFoundation export session. I've encountered a major issue: the returned trajectories end whenever the ball bounces, so each segment is just one tennis shot, nowhere close to an entire rally with multiple bounces. I'm unsure if I should continue down the trajectory route, maybe stitching the trajectories together and somehow only splitting at the start and end of a rally. Any general guidance would be appreciated. Is there a different Vision or ML approach that would more accurately model the start and end time of a rally? I considered creating a custom action classifier to classify frames as either "playing tennis" or "inactivity," but I started with Apple's trajectory detection since it was already built and trained. Maybe a custom classifier is needed, but I'm not sure.
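One way to stitch the per-shot trajectories into rally-length segments is a simple interval merge: treat each trajectory's time range as an interval, and merge consecutive intervals whenever the gap between them is below a threshold (a few seconds with no detected trajectory likely means the rally ended). A minimal sketch using plain seconds rather than CMTime, with a hypothetical Segment type and a maxGap value you would tune against real footage:

```swift
struct Segment {
    var start: Double  // seconds
    var end: Double
}

/// Merge trajectory segments whose gaps are shorter than `maxGap`
/// seconds, approximating one merged segment per rally.
func mergeIntoRallies(_ segments: [Segment], maxGap: Double = 2.0) -> [Segment] {
    var rallies: [Segment] = []
    for seg in segments.sorted(by: { $0.start < $1.start }) {
        if var last = rallies.last, seg.start - last.end <= maxGap {
            // Close enough to the previous trajectory: extend the rally.
            last.end = max(last.end, seg.end)
            rallies[rallies.count - 1] = last
        } else {
            // A long gap: start a new rally segment.
            rallies.append(seg)
        }
    }
    return rallies
}
```

The merged ranges can then be converted to CMTimeRange values and passed to the export session as before. This keeps the already-working trajectory pipeline and only post-processes its output, so it may be worth trying before training a custom action classifier.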
Post not yet marked as solved
0 Replies
498 Views
Could I please ask what, at least broadly, the deep learning architecture of Apple's pose models available through Vision is (for example, the model behind VNDetectHumanBodyPoseRequest)? Is it based on a publicly known architecture (such as ResNet) with modifications, or a custom Apple architecture and dataset? I was not able to find this information anywhere in Apple's documentation, and it would be highly beneficial to know, as we are using this data in research we intend to publish. Thanks in advance!