Post not yet marked as solved
Since the existing YOLOv3 Core ML model has only 80 class labels, I want to add more categories for use in my app. Is that possible, and if so, how can I do it?
Post not yet marked as solved
How should I think about video quality (if it's important) when gathering training videos? Does higher video quality of training data make for better predictions, or should it more closely match the common use case (1080p I suppose, thinking about iPhones broadly)?
Post not yet marked as solved
I'm trying to implement a 5-D input tensor version of a custom grid_sample based on this work. To encode one of the input tensors, grid, into an MTLTexture as shown in the official guide, I need to transpose grid from (N×D_grid×W_grid×H_grid×3) to (N×3×D_grid×W_grid×H_grid) with builder.add_transpose, similar to what's shown here. But in my implementation, I find that adding this transpose op makes the custom layer always run on the CPU. Without this layer, the data does reach the GPU. Could builder.add_transpose be causing this?
System Information
Xcode: 13.2
coremltools: 4.1/5.1.0
test device: iPhone 11
Post not yet marked as solved
I have an mlmodel prediction function in my script. When building for iOS 15, an error appears and the app crashes. The same build works under iOS 14. The current Xcode version is 13.
Post not yet marked as solved
I'm trying to implement the PyTorch custom layer [grid_sampler](https://pytorch.org/docs/1.9.1/generated/torch.nn.functional.grid_sample.html) on the GPU. Both of its inputs, input and grid, can be 5-D. My implementation of encodeToCommandBuffer, the MLCustomLayer protocol function, is shown below. In my attempts so far, neither the value of id<MTLTexture> input nor id<MTLTexture> grid meets expectations. So I wonder: can an MTLTexture store a 5-D input tensor as an input to encodeToCommandBuffer? Or can anybody show me how to use MTLTexture correctly here? Thanks a lot!
- (BOOL)encodeToCommandBuffer:(id<MTLCommandBuffer>)commandBuffer
                       inputs:(NSArray<id<MTLTexture>> *)inputs
                      outputs:(NSArray<id<MTLTexture>> *)outputs
                        error:(NSError * _Nullable *)error {
    NSLog(@"Dispatching to GPU");
    NSLog(@"inputs count %lu", (unsigned long)inputs.count);
    NSLog(@"outputs count %lu", (unsigned long)outputs.count);
    id<MTLComputeCommandEncoder> encoder =
        [commandBuffer computeCommandEncoderWithDispatchType:MTLDispatchTypeSerial];
    assert(encoder != nil);
    id<MTLTexture> input = inputs[0];
    id<MTLTexture> grid = inputs[1];
    id<MTLTexture> output = outputs[0];
    NSLog(@"inputs shape %lu, %lu, %lu, %lu", (unsigned long)input.width,
          (unsigned long)input.height, (unsigned long)input.depth,
          (unsigned long)input.arrayLength);
    NSLog(@"grid shape %lu, %lu, %lu, %lu", (unsigned long)grid.width,
          (unsigned long)grid.height, (unsigned long)grid.depth,
          (unsigned long)grid.arrayLength);
    if (encoder) {
        [encoder setTexture:input atIndex:0];
        [encoder setTexture:grid atIndex:1];
        [encoder setTexture:output atIndex:2];
        NSUInteger wd = grid_sample_Pipeline.threadExecutionWidth;
        NSUInteger ht = grid_sample_Pipeline.maxTotalThreadsPerThreadgroup / wd;
        MTLSize threadsPerThreadgroup = MTLSizeMake(wd, ht, 1);
        MTLSize threadgroupsPerGrid = MTLSizeMake((input.width + wd - 1) / wd,
                                                  (input.height + ht - 1) / ht,
                                                  input.arrayLength);
        [encoder setComputePipelineState:grid_sample_Pipeline];
        [encoder dispatchThreadgroups:threadgroupsPerGrid
                threadsPerThreadgroup:threadsPerThreadgroup];
        [encoder endEncoding];
    } else {
        return NO;
    }
    *error = nil;
    return YES;
}
Post not yet marked as solved
I'm having trouble reasoning about and modifying the Detecting Human Actions in a Live Video Feed sample code since I'm new to Combine.
// ---- [MLMultiArray?] -- [MLMultiArray?] ----
// Make an activity prediction from the window.
.map(predictActionWithWindow)
// ---- ActionPrediction -- ActionPrediction ----
// Send the action prediction to the delegate.
.sink(receiveValue: sendPrediction)
These are the final two operators of the video processing pipeline, where the action prediction occurs. In either the implementation for private func predictActionWithWindow(_ currentWindow: [MLMultiArray?]) -> ActionPrediction or for private func sendPrediction(_ actionPrediction: ActionPrediction), how might I access the results of a VNHumanBodyPoseRequest that's retrieved and scoped in a function called earlier in the daisy chain?
When I did this imperatively, I accessed results in the VNDetectHumanBodyPoseRequest completion handler, but I'm not sure how data flow would work with Combine's programming model. I want to associate predictions with the observation results they're based on so that I can store the time range of a given prediction label.
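One pattern that can help (sketched here with plain arrays and hypothetical stand-in types rather than Combine publishers and the real Vision/Core ML types): instead of mapping each window straight to a prediction, map it to a tuple that carries the source observations along, so the final sink receives both and can associate each prediction with the observations it came from.

```swift
// Hypothetical stand-ins for the Vision/Core ML types in the post.
struct PoseObservation { let confidence: Double }
struct ActionPrediction { let label: String }

// Placeholder "classifier": real code would call the Core ML model here.
func predictAction(window: [PoseObservation?]) -> ActionPrediction {
    let valid = window.compactMap { $0 }
    return ActionPrediction(label: valid.count > window.count / 2 ? "playing" : "other")
}

let windows: [[PoseObservation?]] = [
    [PoseObservation(confidence: 0.9), PoseObservation(confidence: 0.8), nil],
    [nil, nil, PoseObservation(confidence: 0.5)],
]

// The window and its prediction travel down the chain together, mirroring
// `.map { ($0, predictAction(window: $0)) }` in a Combine pipeline.
let paired: [(window: [PoseObservation?], prediction: ActionPrediction)] =
    windows.map { (window: $0, prediction: predictAction(window: $0)) }

for item in paired {
    print(item.prediction.label, item.window.compactMap { $0 }.count)
}
```

In the actual pipeline this would correspond to replacing `.map(predictActionWithWindow)` with a closure that returns the tuple, and changing `sendPrediction` to accept the tuple so the delegate can store both pieces together.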
Post not yet marked as solved
I convert a PyTorch model to an mlmodel with a custom layer, and created a test app in Swift to test the model. When I implement the custom layer in Swift, it works well. However, when I implement the custom layer in Objective-C, the load fails with:
2022-01-14 17:58:49.964377+0800 CustomLayers[2547:968723] [coreml] Error in adding network -1.
2022-01-14 17:58:49.965023+0800 CustomLayers[2547:968723] [coreml] MLModelAsset: load failed with error Error Domain=com.apple.CoreML Code=0 "Error in declaring network." UserInfo={NSLocalizedDescription=Error in declaring network.}
2022-01-14 17:58:49.965085+0800 CustomLayers[2547:968723] [coreml] MLModelAsset: modelWithError: load failed with error Error Domain=com.apple.CoreML Code=0 "Error in declaring network." UserInfo={NSLocalizedDescription=Error in declaring network.}
Fatal error: 'try!' expression unexpectedly raised an error: Error Domain=com.apple.CoreML Code=0 "Error in declaring network." UserInfo={NSLocalizedDescription=Error in declaring network.}: file CustomLayers/model_2.swift, line 114
2022-01-14 17:58:49.966267+0800 CustomLayers[2547:968723] Fatal error: 'try!' expression unexpectedly raised an error: Error Domain=com.apple.CoreML Code=0 "Error in declaring network." UserInfo={NSLocalizedDescription=Error in declaring network.}: file CustomLayers/model_2.swift, line 114
(lldb)
It seems the model fails to load with the Objective-C custom layer. So I wonder: is an Objective-C custom layer implementation unable to work in a Swift project? Although I set up CustomLayers-Bridging-Header.h, it still doesn't work.
System Information
macOS: 11.6.1 Big Sur
Xcode: 12.5.1
coremltools: 5.1.0
test device: iPhone 11
Post not yet marked as solved
I implemented a custom PyTorch layer on both CPU and GPU following [Hollemans' amazing blog](https://machinethink.net/blog/coreml-custom-layers). The CPU version works well, but the GPU version never activates the encodeToCommandBuffer function; the layer always runs on the CPU. I have checked the coremltools.convert() options with compute_units=coremltools.ComputeUnit.CPU_AND_GPU, but it still doesn't work. This problem is also mentioned in https://stackoverflow.com/questions/51019600/why-i-enabled-metal-api-but-my-coreml-custom-layer-still-run-on-cpu and https://developer.apple.com/forums/thread/695640. Any help with this would be greatly appreciated.
System Information
macOS: 11.6.1 Big Sur
Xcode: 12.5.1
coremltools: 5.1.0
test device: iPhone 11
Post not yet marked as solved
Good day people!
I'm currently working on my master thesis in media informatics. I'd really appreciate to discuss my topic with you guys, so I may get some interesting ideas or new information.
The goal is to implement an app, specifically designed for places like museums, where the environment isn't ideal for AR tracking (darkness, no network connection, maybe exhibits made of glass...).
Therefore, I'd like to develop a neural network for the new iPad Pro that takes RGB-D data and predicts the pose of an object in a scene, so that a virtual model matches the real-world object perfectly. The placed object will be an exact 3D model replica of the real object (hand-modeled, or scanned and revised).
This should allow me to place AR content precisely over the real-world object, even in difficult lighting conditions. Maybe it will improve occlusion, too. I can imagine that the neural network may also detect structures, edges, and semantic coherences better than the usual approach.
My first thought was to work with Core ML, Metal, maybe Vision, and ARKit. I will also be trying out Xcode for the first time.
Maybe you have interesting ideas for improvement or can guide me a little, since I feel a bit lost at the moment.
Would you use rather point clouds or the raw depth buffer to train the model? Would you also train with edge filter images and stuff? Why or why not?
Thanks in advance, it would mean the world to me!
Kind regards, Miri :-)
Post not yet marked as solved
I am excited about Create ML and tried to train a detector for feet. I gave it training data with two sets of objects: the left foot and the right foot. However, I was surprised that a model trained on just one class (left or right) detected both types of feet.
I really have no deep understanding of ML, but I was wondering whether this means the resulting model cannot be trained to distinguish a mirrored object. Do you see any way to train a model that finds an object but not its mirrored counterpart?
Post not yet marked as solved
Below, the sampleBufferProcessor closure is where the Vision body pose detection occurs.
/// Transfers the sample data from the AVAssetReaderOutput to the AVAssetWriterInput,
/// processing via a CMSampleBufferProcessor.
///
/// - Parameters:
/// - readerOutput: The source sample data.
/// - writerInput: The destination for the sample data.
/// - queue: The DispatchQueue.
/// - completionHandler: The completion handler to run when the transfer finishes.
/// - Tag: transferSamplesAsynchronously
private func transferSamplesAsynchronously(from readerOutput: AVAssetReaderOutput,
                                           to writerInput: AVAssetWriterInput,
                                           onQueue queue: DispatchQueue,
                                           sampleBufferProcessor: SampleBufferProcessor,
                                           completionHandler: @escaping () -> Void) {
    /*
     The writerInput continuously invokes this closure until finished or
     cancelled. It throws an NSInternalInconsistencyException if called more
     than once for the same writer.
     */
    writerInput.requestMediaDataWhenReady(on: queue) {
        var isDone = false
        /*
         While the writerInput accepts more data, process the sampleBuffer
         and then transfer the processed sample to the writerInput.
         */
        while writerInput.isReadyForMoreMediaData {
            if self.isCancelled {
                isDone = true
                break
            }
            // Get the next sample from the asset reader output.
            guard let sampleBuffer = readerOutput.copyNextSampleBuffer() else {
                // The asset reader output has no more samples to vend.
                isDone = true
                break
            }
            // Process the sample, if requested.
            do {
                try sampleBufferProcessor(sampleBuffer)
            } catch {
                // The `readingAndWritingDidFinish()` function picks up this error.
                self.sampleTransferError = error
                isDone = true
            }
            // Append the sample to the asset writer input.
            guard writerInput.append(sampleBuffer) else {
                /*
                 The writer could not append the sample buffer.
                 The `readingAndWritingDidFinish()` function handles any
                 error information from the asset writer.
                 */
                isDone = true
                break
            }
        }
        if isDone {
            /*
             Calling `markAsFinished()` on the asset writer input does the
             following:
             1. Unblocks any other inputs needing more samples.
             2. Cancels further invocations of this "request media data"
                callback block.
             */
            writerInput.markAsFinished()
            /*
             Tell the caller the reader output and writer input finished
             transferring samples.
             */
            completionHandler()
        }
    }
}
The processor closure runs body pose detection on every sample buffer so that later in the VNDetectHumanBodyPoseRequest completion handler, VNHumanBodyPoseObservation results are fed into a custom Core ML action classifier.
private func videoProcessorForActivityClassification() -> SampleBufferProcessor {
    let videoProcessor: SampleBufferProcessor = { sampleBuffer in
        do {
            let requestHandler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer)
            try requestHandler.perform([self.detectHumanBodyPoseRequest])
        } catch {
            print("Unable to perform the request: \(error.localizedDescription).")
        }
    }
    return videoProcessor
}
How could I improve the performance of this pipeline? After testing with an hour-long 4K video at 60 FPS, it took several hours to process, running as a Mac Catalyst app on an M1 Max.
Post not yet marked as solved
I just got an app feature working where the user imports a video file, each frame is fed to a custom action classifier, and then only frames with a certain action classified are exported.
However, I'm finding that processing a one-hour 4K video at 60 FPS takes an unreasonably long time - it's been running for 7 hours now on a MacBook Pro with M1 Max running the Mac Catalyst app. Are there any techniques or general guidance that would help improve performance? As much as possible I'd like to preserve the input video quality, especially the frame rate. One hour is an expected video length, as it's of a tennis session (anywhere from 10 minutes to a couple of hours). I made the body pose action classifier with Create ML.
Post not yet marked as solved
After creating a custom action classifier in Create ML, previewing it (see the bottom of the page) with an input video shows the label associated with each segment of the video. What would be a good way to store the duration for a given label - say, the CMTimeRange of each segment of video frames classified as containing "Jumping Jacks"?
I previously found that storing time ranges of trajectory results was convenient, since each VNTrajectoryObservation vended by Apple had an associated CMTimeRange.
However, using my custom action classifier instead, each VNObservation result's CMTimeRange has a duration value that's always 0.
func completionHandler(request: VNRequest, error: Error?) {
    guard let results = request.results as? [VNHumanBodyPoseObservation] else {
        return
    }
    if let result = results.first {
        storeObservation(result)
    }
    do {
        for result in results where try self.getLastTennisActionType(from: [result]) == .playing {
            var fileRelativeTimeRange = result.timeRange
            fileRelativeTimeRange.start = fileRelativeTimeRange.start - self.assetWriterStartTime
            self.timeRangesOfInterest[Int(fileRelativeTimeRange.start.seconds)] = fileRelativeTimeRange
        }
    } catch {
        print("Unable to perform the request: \(error.localizedDescription).")
    }
}
In this case I'm interested in frames with the label "Playing" and successfully classify them, but I'm not sure where to go from here to track the duration of video segments with consecutive frames that have that label.
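Since each per-frame observation's timeRange has zero duration, one option is to derive the segment ranges yourself: walk the per-frame labels in order and close a range whenever the label of interest stops. A minimal sketch (using plain Double seconds instead of CMTime; all names here are hypothetical):

```swift
// Group consecutive frames carrying the target label into (start, end) segments.
// Timestamps are in seconds; in the real pipeline they would come from each
// sample buffer's presentation timestamp (CMTime).
func segments(labels: [(time: Double, label: String)],
              ofInterest target: String) -> [(start: Double, end: Double)] {
    var result: [(start: Double, end: Double)] = []
    var currentStart: Double? = nil
    for (time, label) in labels {
        if label == target {
            if currentStart == nil { currentStart = time }
        } else if let start = currentStart {
            result.append((start: start, end: time))   // label changed: close the segment
            currentStart = nil
        }
    }
    if let start = currentStart, let last = labels.last {
        result.append((start: start, end: last.time))  // segment runs to the final frame
    }
    return result
}

let frames: [(time: Double, label: String)] = [
    (0.0, "other"), (1.0, "playing"), (2.0, "playing"),
    (3.0, "other"), (4.0, "playing"), (5.0, "playing"),
]
print(segments(labels: frames, ofInterest: "playing"))
```

The resulting (start, end) pairs could then be converted to CMTimeRange values for storage or export.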
Post not yet marked as solved
Modifying guidance given in an answer on AVFoundation + Vision trajectory detection, I'm instead saving time ranges of frames that have a specific ML label from my custom action classifier:
private lazy var detectHumanBodyPoseRequest: VNDetectHumanBodyPoseRequest = {
    let detectHumanBodyPoseRequest = VNDetectHumanBodyPoseRequest(completionHandler: completionHandler)
    return detectHumanBodyPoseRequest
}()
var timeRangesOfInterest: [Int : CMTimeRange] = [:]
private func readingAndWritingDidFinish(assetReaderWriter: AVAssetReaderWriter,
asset
completionHandler: @escaping FinishHandler) {
if isCancelled {
completionHandler(.success(.cancelled))
return
}
// Handle any error during processing of the video.
guard sampleTransferError == nil else {
assetReaderWriter.cancel()
completionHandler(.failure(sampleTransferError!))
return
}
// Evaluate the result reading the samples.
let result = assetReaderWriter.readingCompleted()
if case .failure = result {
completionHandler(result)
return
}
/*
Finish writing, and asynchronously evaluate the results from writing
the samples.
*/
assetReaderWriter.writingCompleted { result in
self.exportVideoTimeRanges(timeRanges: self.timeRangesOfInterest.map { $0.value }) { result in
completionHandler(result)
}
}
}
func exportVideoTimeRanges(timeRanges: [CMTimeRange], completion: @escaping (Result<OperationStatus, Error>) -> Void) {
    let inputVideoTrack = self.asset.tracks(withMediaType: .video).first!
    let composition = AVMutableComposition()
    let compositionTrack = composition.addMutableTrack(withMediaType: .video,
                                                       preferredTrackID: kCMPersistentTrackID_Invalid)!
    var insertionPoint: CMTime = .zero
    for timeRange in timeRanges {
        try! compositionTrack.insertTimeRange(timeRange, of: inputVideoTrack, at: insertionPoint)
        insertionPoint = insertionPoint + timeRange.duration
    }
    let exportSession = AVAssetExportSession(asset: composition, presetName: AVAssetExportPresetHighestQuality)!
    try? FileManager.default.removeItem(at: self.outputURL)
    exportSession.outputURL = self.outputURL
    exportSession.outputFileType = .mov
    exportSession.exportAsynchronously {
        var result: Result<OperationStatus, Error>
        switch exportSession.status {
        case .completed:
            result = .success(.completed)
        case .cancelled:
            result = .success(.cancelled)
        case .failed:
            // The `error` property is non-nil in the `.failed` status.
            result = .failure(exportSession.error!)
        default:
            fatalError("Unexpected terminal export session status: \(exportSession.status).")
        }
        print("export finished: \(exportSession.status.rawValue) - \(exportSession.error)")
        completion(result)
    }
}
This worked fine with results vended from Apple's trajectory detection, but using my custom action classifier TennisActionClassifier (Core ML model exported from Create ML), I get the console error getSubtractiveDecodeDuration signalled err=-16364 (kMediaSampleTimingGeneratorError_InvalidTimeStamp) (Decode timestamp is earlier than previous sample's decode timestamp.) at MediaSampleTimingGenerator.c:180. Why might this be?
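One thing worth checking (an assumption, since the rest of the pipeline isn't shown): timeRangesOfInterest is a dictionary, and Swift dictionary iteration order is unspecified, so `.map { $0.value }` can hand the composition its ranges out of chronological order - which would produce exactly a "decode timestamp is earlier than previous sample's decode timestamp" failure. Sorting by start time before inserting avoids it. A sketch with a stand-in range type:

```swift
// Stand-in for CMTimeRange, with times in seconds.
struct TimeSegment { let start: Double; let duration: Double }

// Ranges keyed by second, as in the post; dictionary order is unspecified.
let timeRangesOfInterest: [Int: TimeSegment] = [
    42: TimeSegment(start: 42.0, duration: 3.0),
     7: TimeSegment(start: 7.0, duration: 2.0),
    19: TimeSegment(start: 19.0, duration: 5.0),
]

// Sort by start time so the composition receives ranges chronologically.
let ordered = timeRangesOfInterest.values.sorted { $0.start < $1.start }

// Insertion points now advance monotonically, as AVMutableComposition expects.
var insertionPoint = 0.0
for range in ordered {
    print("insert range starting at \(range.start) at composition time \(insertionPoint)")
    insertionPoint += range.duration
}
```

In the real code this would mean sorting the CMTimeRange values (e.g. by `$0.start.seconds`) before the `insertTimeRange` loop.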
Post not yet marked as solved
Is it possible to do any of the following:
1. Export a model created using MetalPerformanceShadersGraph to a Core ML file;
2. Failing 1., save a trained MetalPerformanceShadersGraph model in any other way for deployment;
3. Import a Core ML model and use it as part of a MetalPerformanceShadersGraph model.
Thanks!
Post not yet marked as solved
I am using a Core ML model from https://github.com/PeterL1n/RobustVideoMatting.
I have an M1 MacBook Pro 13" (16 GB) and an M1 Max MacBook Pro 16" (64 GB).
With computeUnits set to .all or the default, the M1 Max 16" is much slower than the M1 13": one prediction takes 0.202 s versus 0.155 s.
Using .cpuOnly, the M1 Max 16" is slightly faster: 0.129 s versus 0.146 s.
Using .cpuAndGPU, the M1 Max 16" is much faster than the M1 13": 0.057 s versus 0.086 s.
And when I use .all or the default, the M1 Max prints error messages like this:
H11ANEDevice::H11ANEDeviceOpen IOServiceOpen failed result= 0xe00002e2
H11ANEDevice::H11ANEDeviceOpen kH11ANEUserClientCommand_DeviceOpen call failed result=0xe00002bc
Error opening LB - status=0xe00002bc.. Skipping LB and retrying
But the M1 13" doesn't show any errors.
So I want to know: is this a bug in Core ML or in the M1 Max?
My code is like this:
let config = MLModelConfiguration()
config.computeUnits = .all
let model = try rvm_mobilenetv3_1920x1080_s0_25_int8_ANE(configuration: config)
let image1 = NSImage(named: "test1")?.cgImage(forProposedRect: nil, context: nil, hints: nil)
let input = try? rvm_mobilenetv3_1920x1080_s0_25_int8_ANEInput(srcWith:image1!, r1i: MLMultiArray(), r2i: MLMultiArray(), r3i: MLMultiArray(), r4i: MLMultiArray())
_ = try? model.prediction(input: input!)
Post not yet marked as solved
I followed Apple's guidance in the articles Creating an Action Classifier Model, Gathering Training Videos for an Action Classifier, and Building an Action Classifier Data Source. With this Core ML model file now imported in Xcode, how do I use it to classify video frames?
For each video frame I call
do {
    let requestHandler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer)
    try requestHandler.perform([self.detectHumanBodyPoseRequest])
} catch {
    print("Unable to perform the request: \(error.localizedDescription).")
}
But it's unclear to me how to use the results of the VNDetectHumanBodyPoseRequest, which come back as the type [VNHumanBodyPoseObservation]?. How would I feed the results into my custom classifier, which has an automatically generated model class TennisActionClassifier.swift? The classifier makes predictions on the frame's body poses, labeling the actions as either playing a rally/point or not playing.
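The usual pattern (a sketch with placeholder types and logic - real code would store each observation's keypointsMultiArray() and feed the stacked MLMultiArray to the generated TennisActionClassifier prediction API): buffer each frame's pose features into a fixed-length window, and run one prediction per full window.

```swift
// Collect per-frame pose features into a fixed-size window, then "classify".
// Real code would append each VNHumanBodyPoseObservation's keypointsMultiArray()
// and hand the stacked window to the Core ML action classifier.
struct PoseWindow {
    let size: Int
    private var frames: [[Double]] = []
    var predictions: [String] = []

    init(size: Int) { self.size = size }

    mutating func add(frame: [Double]) {
        frames.append(frame)
        if frames.count == size {
            predictions.append(classify(frames))
            frames.removeAll()   // non-overlapping windows; real code may stride
        }
    }

    // Placeholder classifier: real code calls the Core ML model here.
    private func classify(_ window: [[Double]]) -> String {
        let mean = window.flatMap { $0 }.reduce(0, +) / Double(size * window[0].count)
        return mean > 0.5 ? "playing" : "not playing"
    }
}

var window = PoseWindow(size: 3)
for frame in [[0.9, 0.8], [0.7, 0.9], [0.6, 0.8], [0.1, 0.2], [0.0, 0.1], [0.2, 0.3]] {
    window.add(frame: frame)
}
print(window.predictions)
```

The window size here is arbitrary; in practice it would match the prediction-window length the Create ML action classifier was trained with.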
Post not yet marked as solved
My goal is to mark any tennis video's timestamps of both the start of each rally/point and the end of each rally/point. I tried trajectory detection, but the "end time" is when the ball bounces rather than when the rally/point ends. I'm not quite sure what direction to go from here to improve on this. Would action classification of body poses in each frame (two classes, "playing" and "not playing") be the best way to split the video into segments? A different technique?
Post not yet marked as solved
While trying to learn about Core ML, I ran into an issue. Using a model from Apple's website (MobileNetV2), I had no issues. When I tried to use a model I created myself, I got an error whose localized description was "Could not create inference context" when using the iPhone simulator. After a quick search I tested this with the arm64 simulator and it worked just fine. I believe this is an M1-related bug, because another forum post said it worked without any issues on an Intel Mac, but not on their M1.
if let data = self.animal.imageData {
    do {
        let modelFile = try DogorCatmodel1(configuration: MLModelConfiguration())
        let model = try VNCoreMLModel(for: modelFile.model)
        let handler = VNImageRequestHandler(data: data)
        let request = VNCoreMLRequest(model: model) { (request, error) in
            guard let results = request.results as? [VNClassificationObservation] else {
                print("Could not classify")
                return
            }
            for classification in results {
                var identifier = classification.identifier
                identifier = identifier.prefix(1).capitalized + identifier.dropFirst()
                print(identifier)
                print(classification.confidence)
            }
        }
        do {
            try handler.perform([request])
        } catch {
            print(error.localizedDescription)
            print("Invalid")
        }
    } catch {
        print(error.localizedDescription)
    }
}
}
Post not yet marked as solved
A Core ML model with Conv3d layers runs very slowly on my Mac. If I set usesCPUOnly to true, it takes about the same running time, so it seems the Conv3d layer only supports CPU mode in Core ML. When I replace the Conv3d layers with Conv2d layers, the model is many times faster than before.
The macOS version is 11.2.3. The Core ML model is converted from a PyTorch model.