Integrate on-device AI models into your app using Core AI
Discover a curated collection of popular open-source models — including Qwen, Mistral, SAM3, and more — optimized for Apple silicon using the new Core AI Framework. Learn how to download, run, and benchmark models on your Mac, and integrate them into your app with just a few lines of code. Explore a new workflow for model compilation and on-device specialization to speed up first-time model load. Find out how to profile and optimize runtime performance with Core AI tools in Xcode.
Chapters
- 0:00 - Introduction
- 1:16 - App concept: camera-based vocab learning
- 2:52 - Model discovery
- 7:40 - Getting models with the Core AI models repository
- 8:37 - Integration
- 10:55 - Writing the Swift integration code
- 13:05 - Diagnosing model specialization latency
- 14:40 - Deployment
- 17:00 - Ahead-of-time (AOT) compilation
- 18:03 - iOS demo
- 19:57 - Multiplatform
- 23:06 - Next steps
Resources
- Core AI PyTorch Extensions
- Core AI Python
- Core AI Optimization
- Core AI
- Compiling Core AI models ahead of time
Related Videos
WWDC26
Hi everyone, welcome! My name is Carina and I am from the Core AI team. Today, let's dive into the world of on-device intelligence.
In this talk, I will explore how to add some exciting new features to your app with Core AI. I'll show you how I built a language-learning app that uses a vision-transformer model and a large language model working together, running entirely on device.
Core AI is a new set of technologies that lets you bring advanced on-device AI capabilities directly into your apps.
With Core AI, you can build app experiences where your users' data never leaves their device. There's no server to manage, no cost per token, and no latency to the cloud. If you haven't already, check out "Meet Core AI". You will learn the high-level ideas behind our framework and design philosophy and the best ways to use our APIs.
Let's start simple. I'm developing an iOS app for students to learn vocabulary in a new language, starting with Mandarin Chinese. I have a set of vocab cards that I've already curated by hand; each one gives the word, translation, and example usage. But this is hard to scale. I would need to include all of these statically in my app.
I'd like to bring AI to my app. How cool would it be if students could point their camera at something they see in their garden, or an object on the street, and just ask the app to pull it right out of the scene? From that, it generates a vocab card in the language they're learning. No curated deck can keep up with a curious student. But a camera and an on-device model can.
Every card features something from their own life. They learn wherever they are, whenever they want, and their collection grows with them. And this all runs locally on device. I will begin by identifying some models that can help power this experience. Then I'll write the code to use those models in the app. Next, I will explore some practical considerations of model deployment. And finally, I'll expand on my idea by building a macOS version of the app, re-using the same code and unlocking some new features with larger models. Let's start with model discovery.
First, I need to define the core capabilities of my app. It starts with a picture and a prompt from the user on what they want to learn about.
Given the input, the app needs to highlight and extract what the user requested from the image. This segmented image becomes the graphic on the card.
And from the text input in their native language, the app will reason about the word and generate all the vocab information: the translation, the natural example usage in the language being learned, and the English meaning of that usage.
With these in mind, I have three requirements for my use case. First is content. This app is about real-world learning, so it needs to handle settings like kitchens, streets, and offices. Second is languages. The model architecture needs to support multiple languages from the start. For my initial release, I'm scoping to Mandarin Chinese.
Third is device constraints. Everything runs on-device on iPhone, so I need to keep both storage and memory footprint small. That means being deliberate about model size and how many models I ship.
I explored a few directions here, reading through model documentation, running some prototypes, and bouncing ideas off an AI assistant.
The conclusion was clear: decompose the problem into two small models.
The first is a dedicated vision model that handles image segmentation. The second is a multi-lingual large language model that takes that English label and generates vocab, translation, and example sentences.
Why two models on device? Task-specific models give me better quality, smaller individual sizes, and the ability to upgrade them independently. I'm targeting variants under one billion parameters each, which keeps the total on-device footprint manageable.
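To get a feel for what that parameter budget means in bytes, here is a back-of-the-envelope sketch. It is an illustrative estimate rather than a figure from the talk: weight storage is roughly the parameter count times the bits per weight, and the two bit widths below are assumptions chosen only to show the range. The function name is hypothetical.

```swift
import Foundation

/// Rough weight-storage estimate: parameters x bits per weight, in gigabytes (10^9 bytes).
/// Ignores activations, caches, tokenizer files, and metadata.
func estimatedWeightGB(parameters: Double, bitsPerWeight: Double) -> Double {
    return parameters * bitsPerWeight / 8.0 / 1_000_000_000.0
}

// A one-billion-parameter model at two assumed precisions.
let sixteenBit = estimatedWeightGB(parameters: 1e9, bitsPerWeight: 16)
let fourBit = estimatedWeightGB(parameters: 1e9, bitsPerWeight: 4)
print(String(format: "16-bit: %.1f GB, 4-bit: %.1f GB", sixteenBit, fourBit))
// prints "16-bit: 2.0 GB, 4-bit: 0.5 GB"
```

The actual size of an exported model depends on the compression applied during conversion and on everything else packaged with the weights, so treat this only as an upper-level sanity check.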
For image segmentation, I am interested in SAM 3, the Segment Anything Model 3. SAM 3 is a vision-transformer-based model for promptable image segmentation. It's a powerful model that does exactly what my app needs. A student points their camera at something and SAM 3 precisely isolates the object according to their prompt. It provides a clean cutout for the card graphic. The prompt can also provide an English label for the language model.
For the language model, the flow would be simple: an English label like "Hummingbird" goes in, and the model generates vocab information in the target language. So I need four things. Multilingual, so it handles translations accurately. Reasoning, so I get contextual example sentences. Structured output, so it fills typed fields reliably. And compact, so it fits on device alongside the vision model.
Many open source language models have strong reasoning capabilities in this size range. I did some quick tests and Qwen stood out — it supports one hundred nineteen languages and dialects, and it is a reasoning model, which means it can generate contextual examples, not just translations. A great starting point for vocab card generation.
There is even a 0.6 billion parameter version of the model, which should work great for my app. I found these models and documentation about them on Hugging Face and GitHub. So the next question is: how do I bring them into my app with Core AI? One path is to convert them directly from their PyTorch representation using the Core AI PyTorch extensions package.
I could also incorporate model compression with the Core AI optimization package. To learn more about this process, check out the talk "Dive into Core AI model authoring and optimization". In that session we even show how to convert the SAM 3 model!
Core AI has powerful tools for model optimization, conversion, and even direct authoring. However, for many popular models there is another path.
The Core AI Models repo is a great resource to check out. It contains many popular models, each with conversion scripts that yield optimized versions of those models in the Core AI format, along with optional platform specific variants. Let's head to the Core AI models repository.
models/ is the catalog. Browse what's available, find the model you want, and follow its export recipe. python/ gives you reusable primitives and utilities for exporting. Here I found the SAM 3 and Qwen family models, and I followed the export recipe to get our Core AI models.
Now let's talk about integration.
After our model export, we get these .aimodel files in Finder. Let's see what's inside of the SAM3 model.
In Xcode, I can inspect everything about it. I can see it's 623 MB and, importantly for my use case, that it targets iOS 27.0 and macOS 27.0. You can find useful information about the model, such as the size, metadata, and more.
If I click into the Functions tab, I can see this model's interface. It actually exposes three separate functions. For instance, let's look at the imageEncode function.
The input isn't just an image, it's a tensor with a specific shape and data type. And the output is a dense feature embedding.
Another function is detect. It takes those image features plus a text prompt, and outputs raw masks, bounding boxes, and confidence scores. So to use this model directly, I'd need to write all the pre-processing to get my camera frame into the right format and all the post-processing to turn these raw tensors into something meaningful.
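As a small illustration of the kind of post-processing described above, the sketch below picks the single best detection from a list of candidates by confidence. The `Detection` type, its fields, and the 0.5 threshold are all hypothetical; the real tensor layout is model-specific and is what the runtime libraries hide from you.

```swift
import Foundation

/// Hypothetical container for one decoded detection; not a Core AI type.
struct Detection {
    let label: String
    let confidence: Double
    let boundingBox: (x: Double, y: Double, width: Double, height: Double)
}

/// Returns the highest-confidence detection at or above `threshold`, or nil if none qualifies.
func bestDetection(in detections: [Detection], threshold: Double = 0.5) -> Detection? {
    return detections
        .filter { $0.confidence >= threshold }
        .max { $0.confidence < $1.confidence }
}

// Made-up scores standing in for decoded model output.
let raw = [
    Detection(label: "flower", confidence: 0.42, boundingBox: (10, 10, 50, 50)),
    Detection(label: "flower", confidence: 0.91, boundingBox: (120, 40, 80, 90)),
    Detection(label: "flower", confidence: 0.77, boundingBox: (300, 200, 60, 60)),
]
if let best = bestDetection(in: raw) {
    print("Best confidence: \(best.confidence)")
}
```

Even this simplified step leaves out decoding the raw tensors and resizing masks back to the source image, which is why a library that handles it is attractive.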
The Core AI Models repository can help me with these model-specific pre- and post-processing tasks. In addition to the models and Python conversion utilities, the repo also hosts a Swift package for a set of runtime libraries. The libraries abstract things such as text encoding on the way in, the mask extraction and labeling on the way out. So instead of wrangling tensor shapes, you just call a clean Swift API.
I already cloned the repo so we can easily add coreai-models as a dependency to my project to try it out.
Once we add the coreai-models URL as a Swift Package, we can add the CoreAILM and CoreAISegmentation libraries to our app target, as easy as that.
Now let's see the code we write to integrate these two models into my app. Importing CoreAIImageSegmenter brings in the image segmentation library that provides the SAM 3 model functionality, which allows us to load the SAM 3 model from disk. Then we perform text-prompted segmentation with an input text prompt, such as "flower", and lastly we extract the best segmentation mask.
Now for the language model. To load, it's just one line. I create a CoreAILanguageModel, point it at my model bundle and it's ready. One line — asset loading, engine creation, tokenizer setup — all abstracted away for you.
Notice we're importing FoundationModels here. This is the same framework you may already be familiar with.
Here's the beautiful part. To use it, I create a LanguageModelSession. This is the same API that gives you access to Apple's on-device large language model. The difference is that now you'll pass in your own model to use. Same session.respond(to:) call, same streaming support, same structured output capabilities. You get the ergonomics of the Foundation Models API with the flexibility of choosing exactly which model runs underneath.
We also support guided generation. This is important for our use case. Instead of letting the model generate free-form text, I can provide a type annotated with the @Generable macro that describes exactly what a vocabulary card looks like: a word field, a translation field, an example sentence field.
Now let's see it in action. I'll take a photo... and we're waiting. The segmentation hasn't come back yet, so we can't get to card generation. Something is clearly slow here.
I know from my code that I show this spinner when I'm first instantiating my SAM 3 model and sending it a prompt. Let's see what's going on.
I took a trace with the new Core AI instruments, and sure enough there's a model load event right at that point, with a large sub-event for specialization.
Specialization is the process that prepares a Core AI model for execution on device. When your model is loaded, it is checked to see if it has already been specialized and cached. If it hasn't, specialization runs at that point, and this can take a significant amount of time for very large models. That is what we were seeing in our Instruments trace.
While future loads are from the cache and are fast, that first time is something I need to plan for.
Having that happen right in the middle of the user experience is... probably not great. So when should I do it? I could kick it off at launch or run it in the background but that feels wasteful if the user isn't even interested in this feature yet.
I think a better idea is to create a dedicated first-run experience, where I can move this work to happen while the user is learning about the feature for the first time. This keeps model loading and specialization out of the interactive flow. Before I make that change though, I want to step back and think more broadly about my deployment strategy for this feature.
There are a few things I want to get right. I'm shipping this as an update to my existing app, so I want the feature to be discoverable but not required. Users who try it should have a great experience, and users who don't should feel just as great about the app as before.
My first-run experience gives me a natural place to explain the feature and prepare for a smooth first launch. But I'd been assuming the models would just be bundled with the app and when I checked, they're adding over 1 GB to my download size. That hits everyone who updates, even people who'll never touch this feature. So instead, I'll have my feature introduction screen include a button that only triggers the model download if the user actually wants to try it. I'll use Background Assets for this. If you want to dig into the details, check out "Discover Apple-Hosted Background Assets" from last year's WWDC.
Now let's look at how that plays out. When a user says they want to give the feature a try, I request the model assets and show them the download progress. Once that's done, I kick off specialization.
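One way to keep that opt-in flow easy to reason about is a small state machine that the UI observes. The sketch below is only an illustration under that assumption: the state and event names are hypothetical, and the actual download and specialization calls (Background Assets and Core AI) are deliberately left out.

```swift
import Foundation

/// Hypothetical states for the opt-in first-run flow.
enum ModelSetupState: Equatable {
    case notRequested
    case downloading(progress: Double)
    case specializing
    case ready
    case failed(String)
}

/// Events the UI or background work might report (names are illustrative).
enum ModelSetupEvent {
    case userOptedIn
    case downloadProgress(Double)
    case downloadFinished
    case specializationFinished
    case error(String)
}

/// Pure transition function; events that do not apply to the current state are ignored.
func nextState(from state: ModelSetupState, on event: ModelSetupEvent) -> ModelSetupState {
    switch (state, event) {
    case (.notRequested, .userOptedIn):
        return .downloading(progress: 0)
    case (.downloading, .downloadProgress(let fraction)):
        return .downloading(progress: min(max(fraction, 0), 1))
    case (.downloading, .downloadFinished):
        return .specializing
    case (.specializing, .specializationFinished):
        return .ready
    case (.downloading, .error(let message)), (.specializing, .error(let message)):
        return .failed(message)
    default:
        return state
    }
}

// Walk through the happy path: opt in, download, specialize, ready.
var state = ModelSetupState.notRequested
let events: [ModelSetupEvent] = [.userOptedIn, .downloadProgress(0.5), .downloadFinished, .specializationFinished]
for event in events {
    state = nextState(from: state, on: event)
}
print(state)
```

Keeping the transitions in a pure function means nothing is downloaded or specialized until the user opts in, and the progress UI can be driven from a single value.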
The specialization is no longer interrupting the main experience but it's still taking a while. That's a bit of an awkward waiting time for the user experience.
Fortunately, Core AI has an awesome feature that can help here. During specialization the model goes through two main transformations. First it goes through a core set of compilation steps. Second, executable artifacts are generated. These artifacts are tied to the device and OS version they were generated on. Of these two steps, compilation is the more expensive and takes the most time.
The Core AI toolchain lets me do some of that compilation ahead-of-time on my development machine, producing a compiled version of the model. While that compiled model still needs to be specialized for the specific user's device, there is now much less work to do, and it finishes significantly faster.
This is done with the coreai-build command. You give it a model as input, and depending on your options, it generates one or more compiled models targeting specific device architectures.
I did this with my model and created a background asset for each compiled model. There is a small amount of code I add to my app to detect the architecture of the device it's running on and then request the appropriate asset based on that.
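The selection logic can be as simple as a lookup table. The sketch below is a guess at its shape, not the code from the talk: the architecture identifiers and asset names are placeholders, how the app detects its architecture is omitted, and falling back to an uncompiled model is an assumption added for illustration.

```swift
import Foundation

/// Hypothetical mapping from a device-architecture identifier to an asset name.
/// Both the identifiers and the asset names are placeholders.
let compiledAssets: [String: String] = [
    "arch-a": "sam3-compiled-arch-a",
    "arch-b": "sam3-compiled-arch-b",
]

/// Placeholder asset holding the uncompiled model, used when no compiled variant matches.
let fallbackAsset = "sam3-uncompiled"

/// Picks the compiled asset for `architecture`, falling back to the uncompiled model.
func assetName(forArchitecture architecture: String) -> String {
    return compiledAssets[architecture] ?? fallbackAsset
}

print(assetName(forArchitecture: "arch-a"))
print(assetName(forArchitecture: "arch-z"))
```

With a fallback in place, a device the table doesn't know about would still work; it would simply pay the full on-device compilation cost on first load.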
You can find all the details in the "Compiling Core AI models ahead of time" article on developer.apple.com.
I've integrated this and now we have the ahead-of-time compilation already done. On my desk, I have some rocks I've collected from my travels. Let's see this in action.
Now the model preparation step should be a fraction of what it was before, and the user can get started quickly.
The model gave me an example usage, and I can save it to my collection.
Let's try a few more objects. Here I have a piece of wood gifted from my college roommate, and a sunflower from my little sister.
These are meaningful objects to me, and I want to capture them as I learn a new language.
And on subsequent inferences, we are using the cached model asset so the user experience is seamless.
So I've been really enjoying this feature myself. I think it could seriously streamline building more curated card sets. Way easier than typing them out one-by-one. The thing is, I do most of my content creation on my Mac. So... What if I brought this there as well? Let's talk about multiplatform.
Here's what we've built so far on iOS. SAM 3 handles segmentation, and the Qwen 0.6B model generates the vocab cards. With Core AI, I can reuse all the same code and just build from there on Mac.
On Mac, I'm not learning one word at a time. I'm curating. I might have a folder of photos from a recent trip, and I want to generate cards for all of them in one go. So I add a batch processing layer on top. What took an afternoon of typing can now be completely automated.
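A batch layer like this can be sketched with Swift structured concurrency. In the example below, the `Card` type and the `makeCard` closure are hypothetical stand-ins for the real segmentation and generation pipeline, so the snippet runs without any model.

```swift
import Foundation

/// Hypothetical card type; the real app would also carry the segmented image.
struct Card: Equatable, Sendable {
    let photo: String
    let word: String
}

/// Runs `makeCard` for every photo concurrently and returns the cards in input order.
/// `makeCard` stands in for the segmentation + generation pipeline.
func makeCards(
    for photos: [String],
    using makeCard: @escaping @Sendable (String) async -> Card
) async -> [Card] {
    await withTaskGroup(of: (Int, Card).self, returning: [Card].self) { group in
        for (index, photo) in photos.enumerated() {
            group.addTask { (index, await makeCard(photo)) }
        }
        // Child tasks finish in any order, so collect results with their index and sort.
        var indexed: [(Int, Card)] = []
        for await result in group {
            indexed.append(result)
        }
        return indexed.sorted { $0.0 < $1.0 }.map { $0.1 }
    }
}

// Stand-in pipeline: derive the "word" from the file name.
let cards = await makeCards(for: ["lake.jpg", "bird.jpg"]) { photo in
    Card(photo: photo, word: photo.replacingOccurrences(of: ".jpg", with: ""))
}
print(cards.map(\.word))
```

This version starts one task per photo. With large models, a real app would likely want to cap how many photos are in flight at once to keep memory use in check.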
And because I have more memory and processing power on the Mac, I can step up to a larger model variant of the same model. More parameters means better reasoning and higher-quality output. For curation, that matters. I can give the model richer prompts, ask for multiple example sentences instead of one, or even have it generate pinyin in Chinese. The same code, calling the same API, just a more capable model underneath.
And with longer context, I can go beyond individual cards. I can hand the model an entire category of words and ask it to build a curriculum: sequence them from simple to complex, group them into lessons, and write example sentences that reuse earlier vocab to reinforce what the student already learned. One prompt, and I have a structured lesson plan.
I went on a road trip recently and I'd like to bring in a few photos I took to include in my iOS app.
I want to segment butterflies, rock, flower, lake, bird, and so on. Right away, the app parallelizes the segmentation workload to find all the objects in all my photos, so I can reuse a photo to create multiple cards. Once that's done, we kick off generation with our Qwen3 8-billion-parameter model. It is a more powerful reasoning model, so you can see that it is thinking before it gives me the outputs. In fact, it is checking whether the pinyin is correct for each word and example usage, since those are easy to mess up. Once that's done, we get cards with multiple images that I can now distribute to my apps, and even a curriculum to help guide my teaching! There are many new features I'd like to develop, and I should get back to developing because my agents are calling me, so let's wrap up here.
With Core AI, you can build a multiplatform app experience where your user's data never leaves their device. There's no server to manage, no cost per token, and no latency to the cloud. The models are ready. The tools are ready. With Core AI you have everything you need to bring powerful, private intelligence to every Apple platform. Now, let's go build something powerful on device!
11:01 - Load and run SAM3 image segmentation
import CoreAIImageSegmenter

// Load
let segmenter = try await ImageSegmenter(resourcesAt: sam3ModelURL)

// Use
let response = try await segmenter.segment(image: inputImage, prompt: "flower")
let mask = response.segments.first?.mask
11:28 - Load a language model and create a session
import FoundationModels
import CoreAILanguageModels

// Create model instance
let model = try await CoreAILanguageModel(resourcesAt: qwen3ModelURL)

// Create session using the model
let session = LanguageModelSession(model: model)

// Generate response
let response = try await session.respond(to: "...")
12:29 - Generate structured output with @Generable
import FoundationModels
import CoreAILanguageModels

@Generable
struct VocabCard {
    let chineseWord: String
    let englishMeaning: String
    let exampleSentence: String
}

let model = try await CoreAILanguageModel(resourcesAt: modelURL)
let session = LanguageModelSession(model: model)

let response = try await session.respond(
    to: "Create a vocab card for flower",
    generating: VocabCard.self
)
let card: VocabCard = response.content
17:22 - Compile a Core AI model ahead of time
$ xcrun coreai-build compile MyModel.aimodel --platform iOS
- 0:00 - Introduction
Overview of Core AI — a new set of technologies that lets you bring advanced on-device AI capabilities to your apps with no server, no cost per token, and no cloud latency.
- 1:16 - App concept: camera-based vocab learning
Introduction to the demo app — an iOS language-learning app where students point their camera at real-world objects to generate vocab cards with translations, example sentences, and segmented images, all running on-device.
- 2:52 - Model discovery
How to define your app's AI requirements — content, language, and device constraints — and select the right models: SAM3 for text-prompted image segmentation and Qwen 0.6B (a 119-language reasoning model) for vocab card generation.
- 7:40 - Getting models with the Core AI models repository
How to use the coreai-models GitHub repository to find popular models with ready-made export recipes — browsing the catalog, running the export script for SAM3 and Qwen, and getting optimized .aimodel files.
- 8:37 - Integration
How to inspect .aimodel files in Xcode (size, platform targets, function signatures, tensor shapes), add the coreai-models Swift package, and select the CoreAILM and CoreAISegmentation libraries as app dependencies.
- 10:55 - Writing the Swift integration code
How to write the Swift code to use both models — loading SAM3 and running text-prompted segmentation, loading Qwen with a single CoreAILanguageModel line, and using the familiar LanguageModelSession API from Foundation Models with structured @Generable output for typed vocab card fields.
- 13:05 - Diagnosing model specialization latency
Using the new Core AI Instruments template to identify that first-run latency is caused by model specialization — the process that compiles a Core AI model for the specific device — and understanding when and how to handle it gracefully.
- 14:40 - Deployment
How to design a deliberate deployment strategy: using a first-run experience to introduce the feature, keeping models out of the app bundle to avoid bloating update size for all users, and triggering on-demand model download via Background Assets only when the user opts in.
- 17:00 - Ahead-of-time (AOT) compilation
How to use the coreai-build command to perform compilation ahead-of-time on your development machine — generating device-architecture-specific compiled model assets that dramatically reduce on-device specialization time during the first-run experience.
- 18:03 - iOS demo
Live demo of the complete iOS experience: fast model preparation with AOT compilation, SAM3 segmenting real objects (rocks, wood, sunflower), and Qwen generating Mandarin vocab cards — with seamless subsequent inferences from the cached model.
- 19:57 - Multiplatform
How the same Swift code runs on macOS with no changes — adding batch processing for folders of photos, stepping up to Qwen3 8B for higher-quality reasoning and pinyin generation, using longer context for curriculum generation, and a live macOS demo processing road trip photos into a full lesson plan.
- 23:06 - Next steps
Summary: Core AI gives you everything you need to build private, multi-platform on-device AI experiences — no server, no cost per token, no cloud latency.