How to Fine-Tune the SNSoundClassifier for Custom Sound Classification in iOS?

Hi Apple Developer Community,

I’m exploring ways to fine-tune the SNSoundClassifier to allow users of my iOS app to personalize the model by adding custom sounds or adjusting predictions. While Apple’s WWDC session on sound classification explains how to train from scratch, I’m specifically interested in using SNSoundClassifier as the base model and building/fine-tuning on top of it.

Here are a few questions I have:

1. Fine-Tuning on SNSoundClassifier:

  • Is there a way to fine-tune this model programmatically through APIs? The manual approach using macOS, as shown in this documentation, is clear, but how can it be done dynamically, either within the app for users or in a cloud backend (AWS/iCloud)?

  • Are there APIs or classes that support such on-device/cloud-based fine-tuning or incremental learning? If not directly, can the classifier’s embeddings be used to train a lightweight custom layer?

  • Training is likely computationally intensive and would drain too much battery on device, so doing it in the cloud may be the right approach, but I need the right APIs to get this done. Sample code would be very helpful.

2. Recommended Approach for In-App Model Customization:

  • If SNSoundClassifier doesn’t support fine-tuning, would transfer learning on models like MobileNetV2, YAMNet, OpenL3, or FastViT be more suitable?

  • Given these models (SNSoundClassifier, MobileNetV2, YAMNet, OpenL3, FastViT), which one would be best for accuracy and performance/efficiency on iOS? I aim to maintain real-time performance without sacrificing battery life. It is also important to understand how well the architecture and accuracy are retained after conversion to a Core ML model.

3. Cost-Effective Backend Setup for Training:

  • Mac EC2 instances on AWS have a 24-hour minimum billing period, which can become expensive for limited user requests. Are there better alternatives for deploying and training models on demand when a user uploads files (training data)?

4. TensorFlow vs PyTorch:

  • Between TensorFlow and PyTorch, which framework would you recommend for iOS Core ML integration? TensorFlow Lite offers mobile-optimized models, but I’m also curious about PyTorch’s performance when converted to Core ML.

5. Metrics:

  • The metrics I have in mind while picking a model are: publisher, accuracy, fine-tuning capability, real-time/live use, suitability for iPhone 16, architectural retention after Core ML conversion, reasons for unsuitability, and recommended use case.

Any insights or recommended approaches would be greatly appreciated.

Thanks in advance!

Answered by Frameworks Engineer in 811750022

Thanks for the detailed feedback. In fact, the underlying embedding that supports the Create ML sound classifier can be programmatically accessed through the AudioFeaturePrint API.

You can compose a pipeline by connecting this embedding with a logistic regression classifier. From there, you can do either in-app training from scratch using .fitted() or incremental training using .update().
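
For concreteness, here is a rough sketch of what such a pipeline could look like with the CreateMLComponents framework. Treat it as an illustration under assumptions rather than verified code: the generic parameters, the AnnotatedFeature shape, and the AVAudioPCMBuffer input type are assumptions to be checked against the current documentation, not details confirmed in this answer.

```swift
import AVFoundation
import CreateMLComponents

// Sketch only: compose the sound-classifier embedding with a logistic
// regression classifier, as described above. Exact generic parameters and
// input types are assumptions.
func trainAndClassify(
    trainingData: [AnnotatedFeature<AVAudioPCMBuffer, String>],  // user audio + labels (assumed shape)
    newBuffer: AVAudioPCMBuffer
) async throws {
    // Compose: sound-classifier embedding -> logistic regression classifier.
    let estimator = AudioFeaturePrint()
        .appending(LogisticRegressionClassifier<Float, String>())

    // Train from scratch on device with the user's labeled recordings.
    let model = try await estimator.fitted(to: trainingData)

    // Classify a new buffer with the fitted pipeline.
    let prediction = try await model.applied(to: newBuffer)
    print(prediction)

    // Incremental training: an updatable classifier exposes update(_:with:);
    // otherwise, re-run fitted(to:) on the combined old and new data.
}
```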

If you choose the Python route (TF or PyTorch), you will need to use coremltools to convert the model to a Core ML supported format. Once converted, Core ML is agnostic to where the model came from, and it leverages all available compute units on device to deliver the best performance. If you see any issue with performance, feel free to file feedback or post on the forum here.
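
On the consumption side, a coremltools-converted model loads like any other Core ML model, and you can opt in to every available compute unit. A minimal sketch, where "CustomSoundModel" is a placeholder name for the compiled model in the app bundle:

```swift
import Foundation
import CoreML

// Sketch: load a coremltools-converted model and let Core ML schedule it
// across CPU, GPU, and the Neural Engine.
func loadConvertedModel() throws -> MLModel {
    let configuration = MLModelConfiguration()
    configuration.computeUnits = .all  // CPU + GPU + Neural Engine

    // "CustomSoundModel" is a placeholder for the compiled model resource.
    guard let url = Bundle.main.url(forResource: "CustomSoundModel", withExtension: "mlmodelc") else {
        throw CocoaError(.fileNoSuchFile)
    }
    return try MLModel(contentsOf: url, configuration: configuration)
}
```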

Thank you for the insights on fine-tuning SNSoundClassifier with AudioFeaturePrint and logistic regression.

However, I’m still unclear on how to effectively integrate embeddings from SNSoundClassifier into this pipeline, given that they aren’t directly accessible.

Are there specific steps or methodologies to consider for augmenting the base model with user-supplied audio data, and how can I ensure the classifier accurately reflects custom sound classes?

What specific pipeline do you recommend? A base model seems to be necessary when fine-tuning in Create ML. If SNSoundClassifier can be used, how? And if it cannot be used as the base model, then it will need to be either a TF or a PyTorch model (which one?).

Any additional guidance would be greatly appreciated!

Accepted Answer

Hello @Blume,

I recommend that you watch this WWDC video (timestamped to the most relevant section) to get a better idea of how MLSoundClassifier relates to AudioFeaturePrint, and how you can create an updatable model.

You've mentioned "SNSoundClassifier" several times in your post. That isn't a type in our system, but I think you are effectively asking how you can fine-tune the built-in classification model that Sound Analysis provides via SNClassifierIdentifier. There is no way to do that, so if that is your goal, please file an enhancement request using Feedback Assistant.
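
For reference, this is roughly what using that built-in model through Sound Analysis looks like (a sketch; the observer class and function names here are arbitrary). It classifies only against the fixed built-in label set, which is why fine-tuning it is not currently possible:

```swift
import AVFoundation
import SoundAnalysis

// Observes classification results produced by the built-in classifier.
final class SoundResultsObserver: NSObject, SNResultsObserving {
    func request(_ request: SNRequest, didProduce result: SNResult) {
        guard let result = result as? SNClassificationResult,
              let top = result.classifications.first else { return }
        print("Heard \(top.identifier) (confidence: \(top.confidence))")
    }
}

// Sketch: attach the built-in classifier (selected via SNClassifierIdentifier)
// to a stream analyzer. Its label set is fixed and cannot be fine-tuned.
func makeAnalyzer(format: AVAudioFormat, observer: SoundResultsObserver) throws -> SNAudioStreamAnalyzer {
    let request = try SNClassifySoundRequest(classifierIdentifier: .version1)
    let analyzer = SNAudioStreamAnalyzer(format: format)
    try analyzer.add(request, withObserver: observer)
    return analyzer
}
```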

Best regards,

Greg

Hi Greg,

Thank you for taking the time to respond and clarifying the capabilities of the built-in sound classifier. I now understand that fine-tuning the built-in classification model isn’t supported, and I appreciate you confirming that.

Looking back, I realize my query was unclear because I mistakenly referred to "SNSoundClassifier" instead of the built-in pre-trained sound classifier. I got mixed up with the terms from the Sound Analysis framework, such as SNRequest, SNClassifySoundRequest, SNAudioStreamAnalyzer, SNResultsObserving, and SNClassificationResult. Thank you for interpreting my question correctly despite the confusion.

I’ll mark your reply as accepted, as it fully answers the specific question about fine-tuning the sound classifier. However, I have a broader question regarding Apple’s ML ecosystem:

Are there any other models within Vision, Natural Language, or Speech APIs that can be fine-tuned or extended for custom use cases? I’m exploring the general feasibility of designing custom ML solutions within Apple’s ecosystem and would appreciate any guidance or insights.

Thank you again for your help!

Hello @Blume,

Are there any other models within Vision, Natural Language, or Speech APIs that can be fine-tuned or extended for custom use cases?

To "fine-tune" a model, you need direct access to it. The Vision, Natural Language, and Speech frameworks do not offer direct access to the underlying models they use for the functionality they provide.

Best regards,

Greg

Thank you, Greg, for the clarification.

It's clear now that Apple's frameworks focus on providing streamlined functionality without exposing the underlying models for direct access or fine-tuning. This approach is understandable for maintaining consistency and security.

For those looking to fine-tune models, leveraging platforms like AWS for training/fine-tuning PyTorch models, followed by conversion to Core ML using coremltools, seems to be a viable path. This aligns with what Benoit Dupin highlighted at re:Invent [1] about Apple's use of AWS for pre-training.

Of course, the conversion process with coremltools remains a critical step, and ensuring compatibility can sometimes feel like a gating factor.

Thanks again for your insights—this helps close the thread on a clear note!

[1] https://aws.amazon.com/events/aws-reinvent-2024-recap/

Best regards
