Core ML Models
Build intelligence into your apps using machine learning models from the research community designed for Core ML.
Models are in Core ML format and can be integrated into Xcode projects. You can select different versions of a model to optimize for size and architecture.
Image Classification
A Fast Hybrid Vision Transformer architecture trained to classify the dominant object in a camera frame or image.
FastViT is a general-purpose, hybrid vision transformer model, trained on the ImageNet dataset, that provides a state-of-the-art accuracy/latency trade-off.
The model's high performance, low latency, and robustness against out-of-distribution samples result from three novel architectural strategies.
FastViT consistently outperforms competing robust architectures on mobile and desktop GPU platforms across a wide range of computer vision tasks such as image classification, object detection, semantic segmentation, and 3D mesh regression.
Use cases: image classification, object detection, semantic segmentation, 3D mesh regression
Variant | Parameters | Size (MB) | Weight Precision | Activation Precision |
---|---|---|---|---|
T8 | 3.6M | 7.8 | Float16 | Float16 |
MA36 | 42.7M | 84 | Float16 | Float16 |
Variant | Device | OS Version | Inference Time (ms) | Dominant Compute Unit |
---|---|---|---|---|
T8 | iPhone 12 Pro Max | 17.5 | 0.79 | Neural Engine |
T8 | M3 Max | 14.4 | 0.62 | Neural Engine |
MA36 | iPhone 12 Pro Max | 18.0 | 4.50 | Neural Engine |
MA36 | M3 Max | 15.0 | 2.99 | Neural Engine |
Preprocess photos using the Vision framework and classify them with a Core ML model.
Depth Estimation
The Depth Anything model performs monocular depth estimation.
Depth Anything v2 is a foundation model for monocular depth estimation. It maintains the strengths and rectifies the weaknesses of the original Depth Anything by refining the powerful data curation engine and teacher-student pipeline.
To train a teacher model, Depth Anything v2 uses purely synthetic, computer-generated images. This avoids the problems of real images, whose noisy annotations and low resolution can limit the performance of monocular depth-estimation models. The teacher model then predicts depth for unlabeled real images, and only that pseudo-labeled data is used to train a student model. This helps avoid distribution shift between synthetic and real images.
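The teacher-student flow described above can be sketched in a few lines. This is a toy illustration only: a one-dimensional least-squares line fit stands in for training a depth network, and the names `fit_line`, `predict`, and all the data values are hypothetical rather than taken from Depth Anything v2.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; a stand-in for model training."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx

def predict(model, xs):
    """Apply a fitted (slope, intercept) model to a list of inputs."""
    a, b = model
    return [a * x + b for x in xs]

# Step 1: train the teacher on synthetic inputs with exact labels.
synthetic_x = [0.0, 1.0, 2.0, 3.0]
synthetic_depth = [1.0, 3.0, 5.0, 7.0]  # toy ground truth: 2*x + 1
teacher = fit_line(synthetic_x, synthetic_depth)

# Step 2: the teacher pseudo-labels unlabeled "real" inputs.
real_x = [0.5, 1.5, 4.0, 6.0]
pseudo_depth = predict(teacher, real_x)

# Step 3: the student is trained only on the pseudo-labeled real data.
student = fit_line(real_x, pseudo_depth)
print("teacher:", teacher, "student:", student)
```

The point of the sketch is the data flow: the synthetic set never reaches the student, which sees only real inputs paired with teacher predictions.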
On the depth estimation task, Depth Anything v2 improves on v1, especially in robustness, inference speed, and the handling of fine-grained details, transparent objects, reflections, and complex scenes. Its refined data curation approach results in competitive performance on standard datasets (including KITTI, NYU-D, Sintel, ETH3D, and DIODE) and a more than 9% accuracy improvement over v1 and other community models on the new DA-2k evaluation set built for depth estimation.
Depth Anything v2 provides varied model scales and inference efficiency to support extensive applications and is generalizable for fine tuning to downstream tasks. It can be used in any application requiring depth estimation, such as 3D reconstruction, navigation, autonomous driving, and image or video generation.
Use cases: depth estimation, semantic segmentation
Variant | Parameters | Size (MB) | Weight Precision | Activation Precision |
---|---|---|---|---|
F32 | 24.8M | 99.2 | Float32 | Float32 |
F16 | 24.8M | 49.8 | Float16 | Float16 |
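The sizes in the table above track the parameter count multiplied by the bytes per weight. The following sketch makes that arithmetic explicit. It assumes the sizes are in megabytes (1 MB = 10^6 bytes) and ignores any metadata stored in the model file; `estimated_size_mb` is a hypothetical helper, not part of Core ML.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"Float32": 4, "Float16": 2}

def estimated_size_mb(parameters: float, weight_precision: str) -> float:
    """Approximate weight storage in MB (assumption: 1 MB = 10**6 bytes)."""
    return parameters * BYTES_PER_WEIGHT[weight_precision] / 1e6

# 24.8M parameters, as listed for both variants in the table.
for variant, precision in [("F32", "Float32"), ("F16", "Float16")]:
    print(f"{variant}: about {estimated_size_mb(24.8e6, precision):.1f} MB")
# prints "F32: about 99.2 MB" and "F16: about 49.6 MB"
```

The F32 estimate matches the listed 99.2, and the F16 estimate lands close to the listed 49.8. The small difference is presumably due to rounding of the parameter count or to non-weight data in the file.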
Variant | Device | OS Version | Inference Time (ms) | Dominant Compute Unit |
---|---|---|---|---|
Small F16 | iPhone 12 Pro Max | 18.0 | 31.10 | Neural Engine |
Small F16 | iPhone 15 Pro Max | 17.4 | 33.90 | Neural Engine |
Small F16 | MacBook Pro (M1 Max) | 15.0 | 32.80 | Neural Engine |
Small F16 | MacBook Pro (M3 Max) | 15.0 | 24.58 | Neural Engine |
Semantic Segmentation
The DEtection TRansformer (DETR) model, trained for object detection and panoptic segmentation, configured to return semantic segmentation masks.
The DETR model is an encoder/decoder transformer with a convolutional backbone trained on the COCO 2017 dataset. It blends a set of proven ML strategies to detect and classify objects in images more elegantly than standard object detectors can, while matching their performance.
The model is trained with a loss function that performs bipartite matching between predicted and ground-truth objects. At inference time, DETR applies self-attention to an image globally to predict all objects at once. Thanks to global attention, the model outperforms standard object detectors on large objects but underperforms on small objects. Despite this limitation, DETR demonstrates accuracy and run-time performance on par with other highly optimized architectures when evaluated on the challenging COCO dataset.
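To make the bipartite matching step concrete, here is a minimal sketch that pairs each ground-truth object with a distinct prediction so that the total cost is as small as possible. It rests on several simplifying assumptions: the pair cost (a class-mismatch penalty plus a one-dimensional center distance) is hypothetical, and a brute-force search over permutations is swapped in for the Hungarian algorithm that is normally used for this kind of assignment. It is only practical for a handful of objects.

```python
from itertools import permutations

def match_cost(pred, truth):
    """Hypothetical pair cost: class mismatch penalty plus center distance."""
    class_cost = 0.0 if pred["label"] == truth["label"] else 1.0
    return class_cost + abs(pred["center"] - truth["center"])

def bipartite_match(preds, truths):
    """Return (total_cost, assignment); assignment[i] is the index of the
    prediction matched to truths[i]. Requires len(preds) >= len(truths)."""
    best = None
    # Try every way of giving each ground-truth object its own prediction.
    for perm in permutations(range(len(preds)), len(truths)):
        cost = sum(match_cost(preds[p], t) for p, t in zip(perm, truths))
        if best is None or cost < best[0]:
            best = (cost, perm)
    return best

preds = [
    {"label": "dog", "center": 0.8},
    {"label": "cat", "center": 0.2},
    {"label": "cat", "center": 0.9},
]
truths = [
    {"label": "cat", "center": 0.25},
    {"label": "dog", "center": 0.75},
]
total_cost, assignment = bipartite_match(preds, truths)
print(assignment)  # prints (1, 0)
```

Predictions left unmatched (index 2 here) would be treated as "no object" when the loss is computed.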
DETR can be easily reproduced in any framework that contains standard CNN and transformer classes. It can also be easily generalized to accommodate more complex tasks, such as panoptic segmentation and other tasks requiring a simple segmentation head trained on top of a pre-trained DETR.
DETR avoids clunky surrogate tasks and hand-designed components that traditional architectures require to achieve acceptable performance and instead provides a conceptually simple, easily reproducible approach that streamlines the object detection pipeline.
Use cases: object detection, panoptic segmentation
Variant | Parameters | Size (MB) | Weight Precision | Activation Precision |
---|---|---|---|---|
F32 | 43M | 171 | Float32 | Float32 |
F16 | 43M | 86 | Float16 | Float16 |
Variant | Device | OS Version | Inference Time (ms) | Dominant Compute Unit |
---|---|---|---|---|
F16 | iPhone 15 Pro Max | 17.5 | 40 | Neural Engine |
F16 | MacBook Pro (M1 Max) | 14.5 | 43 | Neural Engine |
F16 | iPhone 12 Pro Max | 18.0 | 52 | Neural Engine |
F16 | MacBook Pro (M3 Max) | 15.0 | 29 | Neural Engine |
Image Segmentation
Segment the pixels of a camera frame or image into a predefined set of classes.
Drawing Classification
Classify a single handwritten digit (supports digits 0-9).
Image Classification
The MobileNetV2 architecture trained to classify the dominant object in a camera frame or image.
Preprocess photos using the Vision framework and classify them with a Core ML model.
Image Classification
A Residual Neural Network trained to classify the dominant object in a camera frame or image.
Preprocess photos using the Vision framework and classify them with a Core ML model.
Drawing Classification
A drawing classifier that learns to recognize new drawings using a k-nearest neighbors (KNN) model.
Model Name | Size |
---|---|
UpdatableDrawingClassifier.mlmodel | 382 KB |
Learn to map drawings from a user to custom stickers by updating a drawing classification model on device.
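The idea of updating a KNN classifier with new drawings can be sketched as follows. This is a minimal illustration, not the model's actual implementation: it assumes each drawing has already been reduced to a numeric feature vector, and the class `UpdatableKNN`, its methods, and the sticker labels are hypothetical names.

```python
import math
from collections import Counter

class UpdatableKNN:
    """Toy k-nearest neighbors classifier that learns by storing examples."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []  # list of (feature_vector, label) pairs

    def update(self, features, label):
        """'Training' a KNN model is just remembering the new example."""
        self.examples.append((list(features), label))

    def predict(self, features):
        """Return the majority label among the k closest stored examples."""
        if not self.examples:
            return None
        nearest = sorted(
            self.examples, key=lambda ex: math.dist(ex[0], features)
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

knn = UpdatableKNN(k=3)
for features in [(0.0, 0.0), (0.1, 0.1), (0.0, 0.2)]:
    knn.update(features, "star")
for features in [(1.0, 1.0), (0.9, 1.0), (1.0, 0.8)]:
    knn.update(features, "heart")

print(knn.predict((0.05, 0.1)))  # prints "star"
print(knn.predict((0.95, 0.9)))  # prints "heart"
```

Because an update only appends examples, a model of this kind can learn a new class from a few user drawings without retraining the rest of the pipeline.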
Object Detection
Locate and classify 80 different types of objects present in a camera frame or image.
Apply Vision algorithms to identify objects in real-time video.
Question Answering
Find answers to questions about paragraphs of text.
Model Name | Size |
---|---|
BERTSQUADFP16.mlmodel | 217.8 MB |
Locate relevant passages in a document by asking the Bidirectional Encoder Representations from Transformers (BERT) model a question.
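One common way to turn a question-answering model's output into an answer is to pick the highest-scoring token span. The sketch below assumes the model emits one start score and one end score per token, as SQuAD-style BERT models typically do. The function `best_answer_span`, the `max_len` limit, and the example scores are hypothetical and are not taken from the model listed above.

```python
def best_answer_span(start_scores, end_scores, max_len=30):
    """Return (start, end) token indices maximizing start + end score,
    with start <= end and at most max_len tokens in the span."""
    best = None
    for s, s_score in enumerate(start_scores):
        last = min(s + max_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    if best is None:
        raise ValueError("no tokens to search")
    return best[1], best[2]

tokens = ["core", "ml", "runs", "models", "on", "device"]
start_scores = [0.1, 0.2, 0.1, 2.5, 0.3, 0.2]  # made-up values
end_scores = [0.0, 0.1, 0.2, 0.4, 0.3, 3.0]    # made-up values

start, end = best_answer_span(start_scores, end_scores)
print(" ".join(tokens[start : end + 1]))  # prints "models on device"
```

Requiring the start index to be no greater than the end index rules out spans that would read backwards, and the length cap keeps the search from returning an entire paragraph as the answer.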