CreateML crashes with Unexpected Error on Feature Extraction

Note: I posted this to Feedback Assistant but haven't gotten a response for 3 months =( FB13482199

I am trying to train a large image classifier on ~300000 images across 381 classes. Each class has its own folder, and the file names within the folders are somewhat random. I am on an M2 Pro running Sonoma 14.0 with CreateML Version 5.0 (121.1). I would prefer not to pursue the pytorch/HF -> coremltools route.

CreateML consistently crashes ~25000-30000 images into the feature extraction phase with "Unexpected Error". It does not appear to be an out-of-memory issue. I am looking for guidance, since it seems impossible to debug why this consistently crashes.

My initial assumption was that it could be due to blank/corrupt files, but I do not think that is the case. I also checked for special characters in the file and folder names; I wasn't able to go through all of them, but I did run some programmatic regex checks and don't think this is the cause either.
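In case it helps others running the same check, here is a rough Foundation-only sketch of the kind of scan I mean: it flags empty files and files whose leading bytes match no common image signature. The helper name and the signature list are my own illustration, nothing CreateML-specific, and a flagged file is only a candidate for manual inspection.

```swift
import Foundation

// Hypothetical helper (not a CreateML API): walk a labeled-directories
// dataset and flag empty or unrecognized files, a common cause of silent
// failures during feature extraction.
func suspectImages(under root: URL) throws -> [URL] {
    let signatures: [[UInt8]] = [
        [0xFF, 0xD8, 0xFF],       // JPEG
        [0x89, 0x50, 0x4E, 0x47], // PNG
        [0x47, 0x49, 0x46, 0x38], // GIF
        [0x49, 0x49, 0x2A, 0x00], // TIFF, little-endian
        [0x4D, 0x4D, 0x00, 0x2A], // TIFF, big-endian
    ]
    var suspects: [URL] = []
    let fm = FileManager.default
    for sub in try fm.subpathsOfDirectory(atPath: root.path) {
        let url = root.appendingPathComponent(sub)
        var isDir: ObjCBool = false
        guard fm.fileExists(atPath: url.path, isDirectory: &isDir),
              !isDir.boolValue else { continue }
        guard let handle = try? FileHandle(forReadingFrom: url) else {
            suspects.append(url) // not readable at all
            continue
        }
        let head = [UInt8](handle.readData(ofLength: 12))
        handle.closeFile()
        let knownSignature = signatures.contains { head.starts(with: $0) }
        // HEIC/HEIF (and other ISO BMFF formats) carry "ftyp" at offset 4.
        let isISOBMFF = head.count >= 8 && Array(head[4..<8]) == Array("ftyp".utf8)
        if head.isEmpty || !(knownSignature || isISOBMFF) {
            // Empty, truncated, or not an image (this also catches strays
            // like .DS_Store, which shouldn't be in the dataset anyway).
            suspects.append(url)
        }
    }
    return suspects
}
```

Anything it prints out is worth opening by hand before deleting, since the signature list is deliberately short.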

I attached the sysdiagnose results in Feedback Assistant after the crash happened. Going through /var/logs, I did notice a write issue saying the Mac had written too much to disk. Note: this time I also tried Xcode 15.2-beta and the associated CoreML version.

My questions:

  1. How can I fix this?
  2. How should I go about debugging CreateML errors in the future?
  3. 'Unexpected Error' is far too broad an error message. Where can I find the exact CreateML logs on my device?

Please let me know. As a note, I did successfully train a past model on ~100000 images, and I am planning to 10-15x that if this run is successful. I have spent a lot of time gathering the extra data and have been an occasional power user of CreateML to date, but I haven't heard back from Apple since December =/. I assume I'm not the only one with this problem, so I'm looking for any instructions to debug this hands-on and help others. Thx!

Replies

Additionally, in case it's a lower barrier to solve, I put together this script for my own training run, using something similar to what I imagine CreateML does behind the scenes. My problem is that it does not seem to save the model or checkpoints to the ~/.mlmodel file path specified. I added comments where I think my code is messed up; I'd appreciate it if anyone could take a look!

import CreateML
import Foundation
import Combine

let trainingData = MLImageClassifier.DataSource.labeledDirectories(at: URL(fileURLWithPath: "/Users/giovannizinzi/Desktop/FoodData/Train", isDirectory: true))
let parameters = MLImageClassifier.ModelParameters(
    validation: .split(strategy: .automatic),
    augmentation: [],
    algorithm: .transferLearning(
        featureExtractor: .scenePrint(revision: 2),
        classifier: .logisticRegressor
    )
)
var cancellables = Set<AnyCancellable>()
// Session parameters control where checkpoints and training state are
// written; without an explicit sessionDirectory, CreateML uses a temporary
// location rather than any path passed later. (Directory below is just an
// example path.)
let sessionParameters = MLTrainingSessionParameters(
    sessionDirectory: URL(fileURLWithPath: "/Users/giovannizinzi/Desktop/FoodDataSession", isDirectory: true),
    checkpointInterval: 5
)
let trainingSession = try MLImageClassifier.makeTrainingSession(
    trainingData: trainingData,
    parameters: parameters,
    sessionParameters: sessionParameters
)
print(trainingSession.iteration)
print(trainingSession.checkpoints)
print(trainingSession.phase.rawValue)

Task {
    do {
        let trainingJob = try MLImageClassifier.resume(trainingSession)
        let progress = trainingJob.progress
        
        progress.publisher(for: \.fractionCompleted)
            .sink { fractionCompleted in
                print("Training progress: \(fractionCompleted * 100)%")
            }
            .store(in: &cancellables)
          
        trainingJob.phase
            .sink { phase in
                print("Current phase: \(phase)")
            }
            .store(in: &cancellables)
          
        trainingJob.checkpoints
            .sink { checkpoint in
                print("Checkpoint: \(checkpoint)")
                print("Model URL: \(checkpoint.url)")
// do I need to save the model here for a given checkpoint?
            }
            .store(in: &cancellables)
          
        trainingJob.result
            .sink { completion in
                switch completion {
                case .finished:
                    print("Training finished successfully.")
                case .failure(let error):
                    print("Training failed with error: \(error)")
                }
            } receiveValue: { classifier in
                // This closure only runs if the process stays alive until
                // training completes; also pass the metadata to
                // write(to:metadata:), otherwise it is never used.
                let modelURL = URL(fileURLWithPath: "/Users/giovannizinzi/Desktop/avofeb24.mlmodel")
                let metadata = MLModelMetadata(author: "Gio", shortDescription: "Avo", version: "1.0")
                do {
                    try classifier.write(to: modelURL, metadata: metadata)
                    print("Model saved successfully at the end of training.")
                } catch {
                    print("Failed to save model: \(error)")
                }
            }
            .store(in: &cancellables)
          
    } catch {
        print("Error occurred: \(error)")
    }
}

// These run immediately, before the Task above has done anything.
print(trainingSession.checkpoints)
print(trainingSession.phase)
print(trainingSession.iteration)

// Keep the process alive until the asynchronous job finishes; without this
// the script exits before any sink fires, so no model is ever written.
RunLoop.main.run()

Please try extracting the features separately from training. If you can't extract features for all of your images in memory, write the extracted features to disk instead, then use the saved features for training. This has the added advantage that you only need to do the extraction once, and can then train multiple models, or one model over multiple iterations. For reference, please see the WWDC session and the ImageFeaturePrint documentation.
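A minimal sketch of that save-features-to-disk workflow, assuming a placeholder extractor: the function names, the cache layout, and the 512-float feature size below are illustrative stand-ins, not CreateML API, and on macOS the placeholder would be replaced with a real feature-print call such as the Vision/ImageFeaturePrint APIs mentioned above.

```swift
import Foundation

// Placeholder extractor (hypothetical): swap in a real feature-print call.
// The 512-element size is an arbitrary stand-in for this sketch.
func extractFeatures(for image: URL) throws -> [Float] {
    return [Float](repeating: 0, count: 512)
}

// Cache each image's features on disk under a stable key (here just the
// file name; a real run would include the class folder to avoid
// collisions). A crash partway through only costs the unprocessed
// remainder, and later training runs reuse the saved features.
func cachedFeatures(for image: URL, cacheDirectory: URL) throws -> [Float] {
    let cacheURL = cacheDirectory.appendingPathComponent(image.lastPathComponent + ".features")
    if let data = try? Data(contentsOf: cacheURL) {
        // Cache hit: reinterpret the raw bytes as floats.
        return data.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }
    }
    let features = try extractFeatures(for: image)
    let data = features.withUnsafeBufferPointer { Data(buffer: $0) }
    try data.write(to: cacheURL, options: .atomic)
    return features
}
```

Because each result is cached by key, re-running after a crash skips everything already extracted, which is also what makes "train multiple models from one extraction pass" cheap.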

Thanks for reporting the issue, we are looking into it.