Hi there, I am trying to combine this code I have for stereo vision with the available hand tracking code (drawing when the user pinches), but I'm running into trouble. I believe it is due to the fact that the hand tracking code sets up an AV capture session, whereas the stereo vision code uses SceneKit. Could you please give me some feedback on how to start integrating these two very different sets of code?
StereoVision
Hand Tracking
I am writing an iOS app in Swift and need to analyze many photos in parallel.
To do that, I plan to use a single CIDetector instance (as recommended by Apple).
Is it safe to use the same CIDetector instance from different threads?
I tried to find documentation on this but had no luck.
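Absent documented thread-safety guarantees, one conservative approach is to funnel all access to the shared instance through a serial queue. A minimal sketch, with a generic `Resource` standing in for the detector type (the wrapper itself is hypothetical, not an Apple API):

```swift
import Dispatch

// Serialize access to a shared resource that is not documented as
// thread-safe (such as a CIDetector) behind a private serial queue.
final class SerialGuarded<Resource> {
    private let queue = DispatchQueue(label: "detector.serial")
    private var resource: Resource

    init(_ resource: Resource) {
        self.resource = resource
    }

    // Run work against the resource, one caller at a time.
    func use<T>(_ body: (inout Resource) -> T) -> T {
        queue.sync { body(&resource) }
    }
}
```

The cost is that detection calls no longer overlap; whether that matters depends on how much of the per-photo work (decoding, preprocessing) can still run in parallel outside the guarded section.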
I am trying to get the feature points' coordinates in the image from a VNFeaturePrintObservation, but I can't find a way to get them. Is there any way to do that? The only things I can see are the metrics, the element count, and the data in a form I don't understand.
In my app, I am performing a VNDetectFaceLandmarksRequest with a VNSequenceRequestHandler. The video that serves as my input comes from my iPhone's selfie camera.
The request returns the VNFaceLandmarkRegion2D from where I get all the landmarks as an array of CGPoints via
VNFaceLandmarkRegion2D.normalizedPoints
I want to compare all the CGPoint-arrays over time, but I am not sure if a point at a certain index is always representing the same landmark.
Can I assume that a specific landmark, e.g. the left-most landmark of the right eye, always has the same index in the CGPoint-array?
Hello developers,
I am struggling to provide the right input data for a deep learning model I want to integrate into my Swift app. (I am fairly new to Swift and iOS development, please bear with me.)
The app is supposed to run on an iPad Pro with a LiDAR sensor, no other devices. I'm working with DenseFusion 6DoF pose estimation, which requires an RGB-D image as input (it's trained on the YCB-Video dataset).
I have already looked at different examples of how to stream depth data from the camera (the fog example) and how to capture an image with depth information.
So I started a session that delivers both the video and depth data as CVPixelBuffers.
However, I don't really know what to do with them after that, or how to fuse them together into one RGB-D image.
I also want to use the predicted pose to place a 3D model into a scene so I can attach AR content to it.
(The app's purpose is to check whether deep learning can outperform RealityKit in situations like poor lighting conditions, etc.)
I'd be glad for any help. Thanks in advance!
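One way to think about the "fusion" step: once the color frame is copied into an RGB byte buffer and the depth map into a float buffer of the same width and height (both names and layouts here are assumptions, since pixel formats vary), fusing is just interleaving them into one four-channel array per pixel, which could then back an MLMultiArray. A sketch:

```swift
// Interleave a packed RGB buffer ([UInt8], 3 bytes per pixel) and a depth
// buffer ([Float], 1 value per pixel) into one RGBD array, 4 Floats per
// pixel, with color channels normalized to 0...1. Assumes both buffers
// were already extracted from their CVPixelBuffers at matching resolution.
func makeRGBD(rgb: [UInt8], depth: [Float], pixelCount: Int) -> [Float]? {
    guard rgb.count == pixelCount * 3, depth.count == pixelCount else { return nil }
    var rgbd = [Float]()
    rgbd.reserveCapacity(pixelCount * 4)
    for i in 0..<pixelCount {
        rgbd.append(Float(rgb[i * 3]) / 255.0)     // R
        rgbd.append(Float(rgb[i * 3 + 1]) / 255.0) // G
        rgbd.append(Float(rgb[i * 3 + 2]) / 255.0) // B
        rgbd.append(depth[i])                      // depth, e.g. meters
    }
    return rgbd
}
```

In practice you would also need to align the depth map to the color image's resolution (the LiDAR depth map is typically much smaller), and match whatever channel ordering and normalization the model was trained with.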
My activity classifier is used in tennis sessions, where there are necessarily multiple people on the court. There is also a decent chance other courts' players will be in the shot, depending on the angle and lens.
For my training data, would it be best to crop out adjacent courts?
For a Create ML activity classifier, I'm classifying “playing” tennis (the points or rallies) and a second class, “not playing,” as the negative class. I'm not sure what to specify for the action duration parameter given how variable a tennis point or rally can be, but I went with 10 seconds, since that seems like the average duration for both the “playing” and “not playing” labels.
When choosing this parameter, however, I'm wondering whether it affects performance, both the speed of video processing and accuracy. Would the Vision framework return more results with smaller action durations?
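The count of prediction windows, and hence of results, follows from simple arithmetic over the video length, the action duration, and the stride between windows. A sketch with hypothetical numbers (the actual stride Create ML / Vision uses is not assumed here):

```swift
// Number of sliding prediction windows that fit in a video: one window
// every `strideSeconds`, each `actionDuration` long. Shorter durations or
// strides mean more windows, so more results and more classifier work,
// but less temporal context per prediction.
func windowCount(videoSeconds: Double, actionDuration: Double, strideSeconds: Double) -> Int {
    guard strideSeconds > 0, videoSeconds >= actionDuration else { return 0 }
    return Int((videoSeconds - actionDuration) / strideSeconds) + 1
}
```

For example, a 60-second clip yields 6 non-overlapping 10-second windows but 12 non-overlapping 5-second windows, so halving the duration roughly doubles the results returned and the classification work, under these assumptions.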
I am using VNRecognizeTextRequest in my app to process text from a picture captured by the user. Everything works fine except that the observations returned by the request are in a strange order. For the sake of simplicity, imagine that the app is processing a square with a word in each of the four quadrants.
Instead of going upper-left, upper-right, lower-left, lower-right, it goes upper-left, lower-left, upper-right, lower-right.
This strange order only occurs in some places in the image, not over the whole image (it does generally go in the expected order and first picks up on the higher up words).
I have checked the coordinates of the observations in a few example images, and there is no clear pattern of why this is happening (I originally thought the issue might be due to the tilt of the camera or something similar but it appears not).
I'd be super grateful if anyone has an idea of why this might be happening.
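One workaround, rather than relying on the request's result order, is to impose reading order yourself from each observation's bounding box (remembering that Vision's normalized coordinates put the origin at the lower-left corner). A sketch with a row tolerance, since boxes on the same line rarely share an exact y value; the tolerance value is a guess to tune:

```swift
import Foundation

// Return the indices of `boxes` in reading order: top-to-bottom, then
// left-to-right, treating boxes whose vertical centers are within
// `rowTolerance` (normalized units) as being on the same row.
// Note: a simple pairwise comparator like this can misgroup long chains
// of slightly offset boxes; bucketing into rows first would be sturdier.
func readingOrder(_ boxes: [CGRect], rowTolerance: CGFloat = 0.02) -> [Int] {
    let indexed = boxes.enumerated().map { ($0.offset, $0.element) }
    return indexed.sorted { a, b in
        if abs(a.1.midY - b.1.midY) > rowTolerance {
            return a.1.midY > b.1.midY  // higher on the page comes first
        }
        return a.1.minX < b.1.minX
    }.map { $0.0 }
}
```

Applying this to the observations' `boundingBox` values before reading their strings should give upper-left, upper-right, lower-left, lower-right regardless of the order Vision vends them in.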
What might be a good approach to estimating the progress of a VNVideoProcessor operation? I'd like to show a progress bar that's actually useful, like one based on the progress Apple vends for the photo picker or exports. This would make a world of difference compared to a UIActivityIndicatorView, but I'm not sure how to approach hand-rolling it (or whether that would even be a good idea).
I filed an API enhancement request for this, FB9888210.
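One hand-rolled estimate, assuming you can observe each frame's presentation timestamp as results arrive (for example in a request's completion handler), is the fraction of the asset's duration covered so far. A sketch in plain seconds, with names that are mine rather than any Vision API:

```swift
// Estimate progress through a video by the latest timestamp seen so far,
// as a fraction of the asset's total duration. Monotonic: out-of-order
// or repeated results never move the bar backward.
struct TimelineProgress {
    let totalSeconds: Double
    private(set) var fractionCompleted: Double = 0

    init(totalSeconds: Double) {
        self.totalSeconds = totalSeconds
    }

    mutating func update(latestTimestampSeconds: Double) {
        guard totalSeconds > 0 else { return }
        fractionCompleted = max(fractionCompleted,
                                min(latestTimestampSeconds / totalSeconds, 1))
    }
}
```

The caveat is that this only advances when a request delivers a result, so sparse results make the bar chunky; that may still beat an indeterminate spinner.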
I'm having trouble reasoning about and modifying the Detecting Human Actions in a Live Video Feed sample code since I'm new to Combine.
// ---- [MLMultiArray?] -- [MLMultiArray?] ----
// Make an activity prediction from the window.
.map(predictActionWithWindow)
// ---- ActionPrediction -- ActionPrediction ----
// Send the action prediction to the delegate.
.sink(receiveValue: sendPrediction)
These are the final two operators of the video processing pipeline, where the action prediction occurs. In either the implementation of private func predictActionWithWindow(_ currentWindow: [MLMultiArray?]) -> ActionPrediction or of private func sendPrediction(_ actionPrediction: ActionPrediction), how might I access the results of a VNDetectHumanBodyPoseRequest that's retrieved and scoped in a function called earlier in the chain?
When I did this imperatively, I accessed results in the VNDetectHumanBodyPoseRequest completion handler, but I'm not sure how data flow would work with Combine's programming model. I want to associate predictions with the observation results they're based on so that I can store the time range of a given prediction label.
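One pattern that fits Combine's model is to widen what flows through the pipeline: have the upstream stage emit a (window, context) pair instead of the bare window, and thread the context through each operator. Sketched here with plain `map` over an array so the shape is easy to see; Combine's `map`/`sink` pass values the same way, so the identical change applies to the sample's chain. The types are stand-ins for the MLMultiArray windows and the time range / observation metadata you want to keep:

```swift
typealias Window = [Double]
typealias TimeRange = (start: Double, end: Double)

// Counterpart of predictActionWithWindow: classify the window, but pass
// its context (here, a time range) along with the prediction.
func predictActionWithWindow(_ input: (window: Window, range: TimeRange))
    -> (label: String, range: TimeRange) {
    // Stand-in for running the Core ML action classifier on the window.
    let label = input.window.reduce(0, +) > 0 ? "playing" : "not playing"
    return (label, input.range)
}

// Counterpart of sendPrediction: the delegate now receives the prediction
// together with the context it was derived from.
var received: [(label: String, range: TimeRange)] = []
func sendPrediction(_ prediction: (label: String, range: TimeRange)) {
    received.append(prediction)
}
```

With this shape, the stage that assembles the window from pose observations is the natural place to attach the observations' time range, since that is where both values are in scope.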
I have made a Scan to Text app with the help of sources from the internet, but I can't figure out a way to make the output text editable. Here's my code:
private func makeScannerView() -> ScannerView {
    ScannerView(completion: { textPerPage in
        if let outputText = textPerPage?.joined(separator: "\n").trimmingCharacters(in: .whitespacesAndNewlines) {
            let newScanData = ScanData(content: outputText)
            self.texts.append(newScanData)
        }
        self.showScannerSheet = false
    })
}
I'm adopting VNVideoProcessor and transitioning away from performing Vision requests on individual frames, since it does the same thing more concisely. However, I'm not sure how to detect when analysis of a video has finished. Previously, when reading frames with AVFoundation, I could check with
// Get the next sample from the asset reader output.
guard let sampleBuffer = readerOutput.copyNextSampleBuffer() else {
    // The asset reader output has no more samples to vend.
    isDone = true
    break
}
What would be an equivalent when using VNVideoProcessor?
Below, the sampleBufferProcessor closure is where the Vision body pose detection occurs.
/// Transfers the sample data from the AVAssetReaderOutput to the AVAssetWriterInput,
/// processing via a CMSampleBufferProcessor.
///
/// - Parameters:
/// - readerOutput: The source sample data.
/// - writerInput: The destination for the sample data.
/// - queue: The DispatchQueue.
/// - completionHandler: The completion handler to run when the transfer finishes.
/// - Tag: transferSamplesAsynchronously
private func transferSamplesAsynchronously(from readerOutput: AVAssetReaderOutput,
                                           to writerInput: AVAssetWriterInput,
                                           onQueue queue: DispatchQueue,
                                           sampleBufferProcessor: SampleBufferProcessor?,
                                           completionHandler: @escaping () -> Void) {
    /*
     The writerInput continuously invokes this closure until finished or
     cancelled. It throws an NSInternalInconsistencyException if called more
     than once for the same writer.
     */
    writerInput.requestMediaDataWhenReady(on: queue) {
        var isDone = false
        /*
         While the writerInput accepts more data, process the sampleBuffer
         and then transfer the processed sample to the writerInput.
         */
        while writerInput.isReadyForMoreMediaData {
            if self.isCancelled {
                isDone = true
                break
            }
            // Get the next sample from the asset reader output.
            guard let sampleBuffer = readerOutput.copyNextSampleBuffer() else {
                // The asset reader output has no more samples to vend.
                isDone = true
                break
            }
            // Process the sample, if requested.
            do {
                try sampleBufferProcessor?(sampleBuffer)
            } catch {
                // The `readingAndWritingDidFinish()` function picks up this error.
                self.sampleTransferError = error
                isDone = true
                break
            }
            // Append the sample to the asset writer input.
            guard writerInput.append(sampleBuffer) else {
                /*
                 The writer could not append the sample buffer.
                 The `readingAndWritingDidFinish()` function handles any
                 error information from the asset writer.
                 */
                isDone = true
                break
            }
        }
        if isDone {
            /*
             Calling `markAsFinished()` on the asset writer input does the
             following:
             1. Unblocks any other inputs needing more samples.
             2. Cancels further invocations of this "request media data"
                callback block.
             */
            writerInput.markAsFinished()
            /*
             Tell the caller the reader output and writer input finished
             transferring samples.
             */
            completionHandler()
        }
    }
}
The processor closure runs body pose detection on every sample buffer so that later in the VNDetectHumanBodyPoseRequest completion handler, VNHumanBodyPoseObservation results are fed into a custom Core ML action classifier.
private func videoProcessorForActivityClassification() -> SampleBufferProcessor {
    let videoProcessor: SampleBufferProcessor = { sampleBuffer in
        do {
            let requestHandler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer)
            try requestHandler.perform([self.detectHumanBodyPoseRequest])
        } catch {
            print("Unable to perform the request: \(error.localizedDescription).")
        }
    }
    return videoProcessor
}
How could I improve the performance of this pipeline? After testing with an hour-long 4K video at 60 FPS, it took several hours to process when running as a Mac Catalyst app on an M1 Max.
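One common mitigation, sketched with hypothetical numbers, is to run the pose request on every Nth frame instead of all 60 per second. A stride of 2 on 60 FPS input halves the Vision and Core ML work, and since pose-based classifiers are often trained near 30 FPS it may cost little accuracy, though that is an assumption to test against your model:

```swift
// Decide which frames to classify: every `stride`-th frame, by index.
// With 60 fps input, stride 2 processes at an effective 30 fps.
struct FrameStrider {
    let stride: Int

    func shouldProcess(frameIndex: Int) -> Bool {
        stride > 0 && frameIndex % stride == 0
    }
}
```

In the sample-buffer loop, you would keep a frame counter and skip the `requestHandler.perform` call when `shouldProcess` returns false; the frame still gets appended to the writer, so output quality and frame rate are untouched.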
I just got an app feature working where the user imports a video file, each frame is fed to a custom action classifier, and then only frames with a certain action classified are exported.
However, I'm finding that testing a one-hour 4K video at 60 FPS takes an unreasonably long time - it's been processing for 7 hours now on a MacBook Pro with M1 Max running the Mac Catalyst app. Are there any techniques or general guidance that would help improve performance? As much as possible I'd like to preserve the input video quality, especially the frame rate. One hour is an expected length, since the video is of a tennis session (which could be anywhere from 10 minutes to a couple of hours). I made the body pose action classifier with Create ML.
Modifying guidance given in an answer on AVFoundation + Vision trajectory detection, I'm instead saving time ranges of frames that have a specific ML label from my custom action classifier:
private lazy var detectHumanBodyPoseRequest: VNDetectHumanBodyPoseRequest = {
let detectHumanBodyPoseRequest = VNDetectHumanBodyPoseRequest(completionHandler: completionHandler)
return detectHumanBodyPoseRequest
}()
var timeRangesOfInterest: [Int : CMTimeRange] = [:]
private func readingAndWritingDidFinish(assetReaderWriter: AVAssetReaderWriter,
                                        completionHandler: @escaping FinishHandler) {
    if isCancelled {
        completionHandler(.success(.cancelled))
        return
    }
    // Handle any error during processing of the video.
    guard sampleTransferError == nil else {
        assetReaderWriter.cancel()
        completionHandler(.failure(sampleTransferError!))
        return
    }
    // Evaluate the result of reading the samples.
    let result = assetReaderWriter.readingCompleted()
    if case .failure = result {
        completionHandler(result)
        return
    }
    /*
     Finish writing, and asynchronously evaluate the results from writing
     the samples.
     */
    assetReaderWriter.writingCompleted { result in
        self.exportVideoTimeRanges(timeRanges: self.timeRangesOfInterest.map { $0.value }) { result in
            completionHandler(result)
        }
    }
}
func exportVideoTimeRanges(timeRanges: [CMTimeRange], completion: @escaping (Result<OperationStatus, Error>) -> Void) {
    let inputVideoTrack = self.asset.tracks(withMediaType: .video).first!
    let composition = AVMutableComposition()
    let compositionTrack = composition.addMutableTrack(withMediaType: .video,
                                                       preferredTrackID: kCMPersistentTrackID_Invalid)!
    var insertionPoint: CMTime = .zero
    for timeRange in timeRanges {
        try! compositionTrack.insertTimeRange(timeRange, of: inputVideoTrack, at: insertionPoint)
        insertionPoint = insertionPoint + timeRange.duration
    }
    let exportSession = AVAssetExportSession(asset: composition, presetName: AVAssetExportPresetHighestQuality)!
    try? FileManager.default.removeItem(at: self.outputURL)
    exportSession.outputURL = self.outputURL
    exportSession.outputFileType = .mov
    exportSession.exportAsynchronously {
        var result: Result<OperationStatus, Error>
        switch exportSession.status {
        case .completed:
            result = .success(.completed)
        case .cancelled:
            result = .success(.cancelled)
        case .failed:
            // The `error` property is non-nil in the `.failed` status.
            result = .failure(exportSession.error!)
        default:
            fatalError("Unexpected terminal export session status: \(exportSession.status).")
        }
        print("export finished: \(exportSession.status.rawValue) - \(String(describing: exportSession.error))")
        completion(result)
    }
}
This worked fine with results vended from Apple's trajectory detection, but using my custom action classifier TennisActionClassifier (Core ML model exported from Create ML), I get the console error getSubtractiveDecodeDuration signalled err=-16364 (kMediaSampleTimingGeneratorError_InvalidTimeStamp) (Decode timestamp is earlier than previous sample's decode timestamp.) at MediaSampleTimingGenerator.c:180. Why might this be?
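A guess at the cause, sketched with Double pairs standing in for CMTimeRange: a dictionary's `.map { $0.value }` yields ranges in no particular order, and classifier-derived ranges can overlap, so the composition may receive segments whose decode times jump backward. Normalizing the ranges (sort by start, merge overlaps) before inserting them may avoid the decode-timestamp error:

```swift
// Sort ranges by start time and merge any that overlap or touch, so the
// composition receives strictly forward-moving, non-overlapping segments.
func normalized(_ ranges: [(start: Double, end: Double)]) -> [(start: Double, end: Double)] {
    let sorted = ranges.sorted { $0.start < $1.start }
    var merged: [(start: Double, end: Double)] = []
    for range in sorted {
        if let last = merged.last, range.start <= last.end {
            // Overlapping or adjacent: extend the previous range.
            merged[merged.count - 1].end = max(last.end, range.end)
        } else {
            merged.append(range)
        }
    }
    return merged
}
```

The same normalization expressed over CMTimeRange (using CMTimeCompare and CMTimeRangeGetUnion) would slot in just before the insertTimeRange loop.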
I use VNDetectHumanBodyPoseRequest to detect a body in an image from my Xcode asset catalog (which I downloaded from an image website), but I get the error below:
2021-12-24 21:50:19.945976+0800 Guess My Exercise[91308:4258893] [espresso] [Espresso::handle_ex_plan] exception=Espresso exception: "I/O error": Missing weights path cnn_human_pose.espresso.weights status=-2
Unable to perform the request: Error Domain=com.apple.vis Code=9 "Unable to setup request in VNDetectHumanBodyPoseRequest" UserInfo={NSLocalizedDescription=Unable to setup request in VNDetectHumanBodyPoseRequest}.
Below is my code:
let image = UIImage(named: "image2")
guard let cgImage = image?.cgImage else { return }
let requestHandler = VNImageRequestHandler(cgImage: cgImage)
let request = VNDetectHumanBodyPoseRequest(completionHandler: bodyPoseHandler)
do {
    // Perform the body pose-detection request.
    try requestHandler.perform([request])
} catch {
    print("Unable to perform the request: \(error).")
}

func bodyPoseHandler(request: VNRequest, error: Error?) {
    // Process each observation to find the recognized body pose points.
    guard let observations = request.results as? [VNHumanBodyPoseObservation] else {
        return
    }
    let poses = Pose.fromObservations(observations)
    self.drawPoses(poses, onto: self.simage!)
}
Is it possible to use SNAudioFileAnalyzer with a live HLS (m3u8) stream? Maybe we need to somehow extract the audio from it first?
Also, can we use SNAudioFileAnalyzer with a remote URL, or only with files in the local file system?
My goal is to mark, in any tennis video, the timestamps of both the start and the end of each rally/point. I tried trajectory detection, but the "end time" it reports is when the ball bounces rather than when the rally/point ends. I'm not quite sure what direction to take from here. Would action classification of body poses in each frame (two classes, "playing" and "not playing") be the best way to split the video into segments? A different technique?
I'm building a feature to automatically edit out all the downtime of a tennis video. I have a partial implementation that stores the start and end times of Vision trajectory detections and writes only those segments to an AVFoundation export session.
I've encountered a major issue: the trajectories returned end whenever the ball bounces, so each segment is just one tennis shot, nowhere close to an entire rally with multiple bounces. I'm unsure whether I should continue down the trajectory route, maybe stitching the trajectories together and somehow splitting only at the start and end of a rally.
Any general guidance would be appreciated.
Is there a different Vision or ML approach that would more accurately model the start and end time of a rally? I considered creating a custom action classifier to classify frames to be either "playing tennis" or "inactivity," but I started with Apple's trajectory detection since it was already built and trained. Maybe a custom classifier would be needed, but not sure.
I am saving time ranges from an input video asset where trajectories are found, then exporting only those segments to an output video file.
Currently I track these time ranges in a stored property, var timeRangesOfInterest: [Double : CMTimeRange], which is set in the trajectory request's completion handler:
func completionHandler(request: VNRequest, error: Error?) {
    guard let request = request as? VNDetectTrajectoriesRequest else { return }
    if let results = request.results, results.count > 0 {
        for result in results {
            var timeRange = result.timeRange
            timeRange.start = timeRange.start - self.assetWriterStartTime
            self.timeRangesOfInterest[timeRange.start.seconds] = timeRange
        }
    }
}
Then these time ranges of interest are used in an export session to export only those segments:
/*
 Finish writing, and asynchronously evaluate the results from writing
 the samples.
 */
assetReaderWriter.writingCompleted { result in
    self.exportVideoTimeRanges(timeRanges: self.timeRangesOfInterest.map { $0.1 }) { result in
        completionHandler(result)
    }
}
Unfortunately, I'm getting repeated trajectory segments in the output video. Is this perhaps because trajectory requests return repeated "in progress" results with slightly different time range start times? What might be a good strategy for avoiding or removing them? I've noticed trajectory segments appear out of order in the output as well.
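One dedupe strategy, assuming the repeated "in progress" results for a trajectory share roughly the same start time (an assumption worth checking against real results): key the stored ranges by the start rounded to a coarse grain, and keep only the longest range seen per key, so later, fuller updates replace earlier partial ones. Plain Doubles stand in for CMTime here, and the 0.5-second grain is a guess to tune:

```swift
// Store trajectory time ranges keyed by a coarsened start time, keeping
// only the longest range observed per key. Repeated partial results with
// slightly different starts collapse into one entry.
struct RangeStore {
    private(set) var ranges: [Double: (start: Double, duration: Double)] = [:]
    let grain: Double

    init(grain: Double = 0.5) {
        self.grain = grain
    }

    mutating func record(start: Double, duration: Double) {
        let key = (start / grain).rounded() * grain
        if let existing = ranges[key], existing.duration >= duration { return }
        ranges[key] = (start, duration)
    }
}
```

For the out-of-order segments, sorting the surviving ranges by start time before handing them to the export session should help, since a dictionary's values carry no ordering.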