Meet Core AI
Discover Core AI, Apple's new framework for on-device AI model deployment. Tour the ecosystem, from Python libraries for converting, authoring, and optimizing models, to a Swift API for simple plug-and-play inference and advanced use cases with strict latency and memory requirements. Explore the new Core AI models repository with ready-to-run examples for popular architectures. See how deep Xcode integration, including ahead-of-time model compilation, streamlines the workflow so you can deliver smarter, more responsive app experiences.
Chapters
- 0:00 - Introduction
- 0:33 - What is Core AI
- 4:57 - Model conversion
- 6:16 - App integration
- 10:48 - Profiling with Instruments
- 11:15 - Optimizing performance
- 14:13 - Additional features
- 15:34 - Specialization
- 20:07 - Next steps
Resources
- Core AI PyTorch Extensions
- Core AI Python
- Core AI Optimization
- Core AI
- Compiling Core AI models ahead of time
- Managing model specialization and caching
Related Videos
WWDC26
Hi everyone, my name is Ben and I'm an engineer on the Core AI team.
Today I'll be giving an introduction to Core AI, and showing how you can use it to add intelligent features into your apps.
AI is advancing faster than ever. New models and capabilities that previously seemed out of reach are emerging constantly.
Core AI is built to help you harness that momentum and build on top of it. Core AI marks the next evolution of on-device AI execution across Apple platforms. It's built from the ground up for modern workloads, and delivers the high-performance inference you need to build advanced AI features. Core AI is the inference framework powering on-device Apple Intelligence. And now, it's available for you to use, bringing that same power to your app's own intelligence. Core AI is more than just a framework. It's a complete set of technologies, covering the model deployment lifecycle, from model optimization and conversion to debugging and integration into your app. All designed to support the fast, iterative cycle that building great AI features requires.
Core AI allows you to leverage all of Apple Silicon. It provides blazing fast inference across the CPU, GPU, and Neural Engine.
The framework comes with a modern Swift API. It's an expressive API that delivers the performance your app demands without compromising on memory safety.
The broader set of technologies fit naturally into common ML engineering workflows, reusing familiar Python and PyTorch foundations for model authoring, optimization and conversion.
Core AI also supports extensive customization from fine-grained inference management and model specialization to custom GPU kernels.
And all of this is tightly integrated into a new developer toolchain, with ahead-of-time compilation, dedicated Core AI Instruments, and a powerful visual Debugger to trace tensor values directly back to your original Python source code. Core AI is designed to scale to your needs and available compute.
Whether you want your app to identify who's talking in a live meeting with a small speaker diarization model, let your users point their camera at anything, ask a question, and instantly get an answer with a larger vision language model, or let them hand off complex, multi-step tasks to a powerful agentic assistant powered by a 70-billion-parameter LLM, Core AI has you covered, with all of it running locally on Apple devices, with no server and no cost per token.
In this talk, I'll start by showing how to get your model into the Core AI format.
Then I'll go over how to integrate the converted model into your app.
I'll then dive a little deeper into optimizing the performance of your model and app. And lastly I'll highlight some additional features of Core AI and its associated tools that you may find useful.
Let's get started. Every great app experience starts with an idea. Maybe you want to build something that feels a little magical, something that responds intelligently, or makes a decision that would otherwise require a human or hard coded rules.
Machine learning and AI are what make those kinds of experiences possible. Once you have that idea, the next step is finding or building a model that can power it. Just like your idea itself will evolve over time, finding the right model is an iterative process. You'll try things, evaluate them against your requirements and refine. Core AI is designed to support that iteration and make it as fast and frictionless as possible.
So to make this concrete, I'll implement a fun game idea I had. It's an app that lets you play a two player snake game where one snake is powered by an AI model run through Core AI. The app will follow traditional snake rules where snakes can grow by eating food and must avoid hitting walls, themselves and the other snake. The last snake standing wins. At each time step, the AI model will see a set of features describing the current board state, and those features will be accumulated into the full game history that gets fed to the model. It will then predict the best direction to move. While snake is a simple game, the tools and APIs used to create this experience are the same foundation that scale all the way up to the larger, more complex use cases.
I was curious to see what I could put together with PyTorch for this project. With a little help from an AI coding assistant I was able to sketch out a simple snake action prediction model pretty quickly. To train it, I used a naive simulation to generate training data, just running the game and recording states and actions. The idea was to start simple and get the model working in my app.
So the next step is taking this PyTorch model and converting it to Core AI.
I'll use the new Core AI Torch Python package to easily perform the conversion.
First I'll load the trained checkpoint of the SnakeTransformer module, and prepare a sample input.
Then I'll export the torch program using torch.export and also make sure to use the dynamic_shapes argument to specify that the sequence length of the features is dynamic, that way it doesn't get traced with the static sample length of 5. Also I'll run decompositions on the converted program using Core AI's decomposition table.
Next I'll run Core AI's TorchConverter, specify the names of the inputs and outputs, and finally save the converted Core AI model to disk. Before leaving the Python environment, one more thing I'll do is run a test to verify that the converted Core AI model matches the numerics of my original PyTorch model. This can be done easily with the Core AI framework Python bindings. First I'll load the PyTorch and Core AI models. Then prepare a sample snake game input.
Then run that same input through both the PyTorch module and the Core AI inference function. And finally assert a sufficiently small delta for my use case between the PyTorch and Core AI outputs.
Now that I have the converted AI model, the next step is to hop into Xcode and integrate the model into my app. First I'll open the AI model file with Xcode, which shows information about the model.
It includes the model size, the distribution of operations and other helpful metadata. Also in the Functions tab it shows you the exact function signature of each unique function in the model.
In this case the model just has one function, which takes the features of the game board as an input and produces logits as an output which indicate which direction the model thinks would be best to move. Also note that the question mark in the NDArray values denotes that the dimension has a dynamic shape, which matches how I converted the model with a dynamic sequence length. Now that I've included the AI model file in my Xcode project and have examined its structure, the next step is to use the Core AI framework to run the model.
The Core AI framework is a new Swift API surface for loading and running Core AI models.
It offers a progressively disclosing set of APIs, which makes it simple to get things up and running, while also having deeper layers of flexibility for supporting performance critical applications.
Also, it uses modern Swift language features like non-escapable types, to offer memory-safe APIs while not sacrificing performance.
Let's begin by discussing the core types within the framework.
An AIModel is initialized from a URL to a .aimodel file and is used primarily to inspect and load one or more inference functions.
An InferenceFunction is the runnable object which represents a single loaded compute graph. In the common case, your AIModel will only have a single main InferenceFunction, though you can convert a single model with multiple functions. The AIModel and InferenceFunction are typically objects you'll construct when preparing your app's AI feature. For example this could be on app initialization.
NDArray is the type which holds your multi-dimensional input and output data and you use the run method on an InferenceFunction to run inference with that data.
Finally you can read and process the outputs of the inference.
So for implementing the snake game, I'll start by making the ModelPlayer type. At app initialization time, it'll be initialized with the URL to the AI model file that it should use. Then it will initialize the AIModel, and load the main inference function from it.
Next is the logic for the model player to make decisions. It'll conform to the SnakePlayer protocol that I've defined in my app.
The main protocol requirement is the chooseAction function which is passed in the game's history, and returns the next action that the snake should take.
The first thing to do is create an NDArray to populate with the input features.
For this inference function, the expected structure of the NDArray is 2 dimensional with float32 data, where the first dimension of the shape is the current sequence length, and the second is the fixed hidden dimension size.
Then it'll write the features into that NDArray using this writeFeatures helper function which takes the game and a mutable view of the NDArray. The NDArray.MutableView type is a non-escapable type which provides safe and efficient access to the backing storage of the NDArray.
After preparing the inputs, it'll run inference with them, and extract the expected output logits ndarray.
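The selection step that follows inference can be sketched in a few lines of plain Python. This is an illustrative sketch rather than code from the session: the `Direction` enum, the `predicted_direction` name, and the assumption that the four logits are ordered up, down, left, right are all hypothetical.

```python
from enum import Enum


class Direction(Enum):
    # Assumed ordering of the model's four output logits (hypothetical).
    UP = 0
    DOWN = 1
    LEFT = 2
    RIGHT = 3


def predicted_direction(logits):
    """Return the Direction whose logit is largest (greedy argmax)."""
    if len(logits) != len(Direction):
        raise ValueError("expected one logit per direction")
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return Direction(best_index)


choice = predicted_direction([0.1, 2.5, -1.0, 0.3])
print(choice.name)
```

Greedy selection keeps the player deterministic; sampling from a softmax over the logits would be one alternative if more varied play were wanted.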
The last step is to sample the output logits to pick the next direction that the snake will move, by passing an ndarray view into the helper function which will read the values and choose the direction with the largest corresponding logit. The writeFeatures function is what's populating the input features. Let's briefly go over what these features include.
They have the normalized distance of the AI snake's head to all the walls. The normalized relative X and Y distance to the nearest food.
Four elements encoding its current direction.
The normalized distance to the other snake.
And finally the opponent's direction.
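Taken together, the features listed above add up to 16 values per time step (4 + 2 + 4 + 2 + 4), which matches the 16-wide sample input used during conversion. A rough Python sketch of that layout follows; the function names, argument names, and the index assigned to each direction are assumptions made for illustration, not the app's actual code.

```python
def one_hot(index, size=4):
    """Return `size` floats with a single 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


def step_features(wall_distances, food_offset, direction,
                  opponent_offset, opponent_direction):
    """Assemble one time step of features in the order described above."""
    if len(wall_distances) != 4 or len(food_offset) != 2 or len(opponent_offset) != 2:
        raise ValueError("unexpected feature group size")
    features = []
    features += wall_distances                # 4 wall distances, each in [0, 1]
    features += food_offset                   # x/y offset to nearest food, in [-1, 1]
    features += one_hot(direction)            # own direction, one-hot
    features += opponent_offset               # x/y offset to the other snake
    features += one_hot(opponent_direction)   # opponent direction, one-hot
    return features


example = step_features([0.2, 0.8, 0.5, 0.5], [0.3, -0.1], 0, [-0.4, 0.6], 1)
print(len(example))
```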
Now with this put together I'm going to try a test run with both snakes powered by the AI model to see how it does.
Running it shows that the model is working. However, I see that the game is getting slower as it goes on.
Alongside the Core AI framework, there's a new instrument in Xcode to help you profile the Core AI models running in your app. In this case I've run the app with Instruments and I can see the inference intervals getting notably larger over time, which means the inference calls are increasing in latency. This makes sense because transformer models have quadratic time complexity with respect to the sequence length. And in our game the sequence length is increasing with every move the model makes. The next step in this case is to optimize the performance of the model usage.
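To get a feel for how quickly that quadratic cost adds up over a game, here is a back-of-the-envelope Python sketch. It only counts query-key score computations in a single attention layer and ignores everything else the model does, so treat the numbers as a rough illustration; the function names are made up for this example.

```python
def scores_full_history(num_steps):
    """Total scores when each move re-runs attention over all t positions (t * t)."""
    return sum(t * t for t in range(1, num_steps + 1))


def scores_newest_only(num_steps):
    """Total scores when each move only scores the newest position against t keys."""
    return sum(t for t in range(1, num_steps + 1))


for moves in (10, 100):
    print(moves, scores_full_history(moves), scores_newest_only(moves))
# prints:
# 10 385 55
# 100 338350 5050
```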
Each time the input sequence is increased, the transformer model recomputes a set of internal key and value embeddings for every element in the sequence. A common strategy used to improve the performance of decoding loops like this when using transformers is to cache keys and values that are computed for each element in the sequence, as opposed to re-computing them all from scratch with each inference. This can be achieved through Core AI by using states.
States are inputs to the model which are both read and updated in-place during inference. By introducing the key and value caches as states on the model, we both avoid recomputing them on each inference and remove the need to provide the full history of the game as an input, since the data needed from older steps is stored in the states.
So after the first input, each subsequent step uses the cache for history and only takes the new features of the latest board state.
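The idea can be checked on a toy example. The sketch below is a minimal single-head attention in plain Python with made-up 2x2 projection weights; it is not the snake model and does not use Core AI. It also lets the cache grow as a list, whereas the model in this session uses fixed-size cache buffers. It shows that attending with cached keys and values gives the same result for the newest position as recomputing everything from the full history.

```python
import math

# Toy projection weights (hypothetical values, chosen only for illustration).
W_Q = [[0.5, -0.2], [0.1, 0.3]]
W_K = [[0.2, 0.4], [-0.3, 0.1]]
W_V = [[1.0, 0.0], [0.5, 0.5]]


def project(vec, weights):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def attend(query, keys, values):
    """Softmax-weighted average of values, scored by scaled query-key dot products."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def step_without_cache(history):
    """Recompute keys and values for the whole history on every call."""
    keys = [project(x, W_K) for x in history]
    values = [project(x, W_V) for x in history]
    return attend(project(history[-1], W_Q), keys, values)


class CachedAttention:
    """Keep keys and values between calls; each step takes only the newest features."""

    def __init__(self):
        self.k_cache = []
        self.v_cache = []

    def step(self, new_features):
        self.k_cache.append(project(new_features, W_K))
        self.v_cache.append(project(new_features, W_V))
        return attend(project(new_features, W_Q), self.k_cache, self.v_cache)


history = [[0.1, 0.9], [0.4, -0.2], [0.7, 0.3]]
cached = CachedAttention()
for t, features in enumerate(history, start=1):
    out_cached = cached.step(features)
    out_full = step_without_cache(history[:t])
    assert all(math.isclose(a, b) for a, b in zip(out_cached, out_full))
```

The cached path projects one new position per step, while the uncached path re-projects the entire history each time, which is where the savings come from.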
To implement the key/value caching, I'll go back to the original authoring code and make a few changes to add in the key and value caches. First I'll update the torch module by adding key and value cache tensors as buffers within the transformer module, by using the torch register_buffer API. This will later result in these tensors being mutable buffers in the exported torch program, which Core AI will convert to states. Then in the forward function of the module, I'll add the logic to actually use the caches. This involves reading the previous features' keys and values out of the cache, then writing the computed keys and values for the new features back into the cache. Lastly, I'll rerun the same code from before to re-convert the model, but now adding in the state_names argument to the convert call to specify the names of the new state arguments.
Now that I've re-converted the model with the new function signature, I'll update the app code to handle it. To start, I'll update the ModelPlayer to store the key and value cache NDArrays which will be the state arguments passed to each inference. I'll initialize them with the expected shape for the transformer. In this case I converted the model such that it expects the key and value caches to always be a fixed size for a maximum possible context length.
Then when it's time to run inference, I'll construct a collection of MutableViews containing both views of the key and value caches. Then provide those as the states argument of the InferenceFunction.run method. Now the caches will be both read and updated in-place during each inference. Now with the updated model, I'll re-run the app. This time I can see it maintains a steady speed, no longer slowing down over time.
When tracing the updated app in Instruments, I can confirm that the inference latency is growing at a much slower rate.
Before wrapping up, I'll show some features that I didn't use while making the snake game, but that you may find useful when developing your own apps. When converting the snake game models, I used the coreai-torch package to directly convert the PyTorch module. This flow is simple and works great for many use cases, but sometimes you may need more control over how your model is authored, and potentially even how the operations within the model are run.
We've only touched the surface of what the Core AI Python package has to offer. It also has support for directly authoring your model with Core AI APIs, optimizing the model for Apple Silicon, and defining custom kernel implementations with Metal 4. To learn more about these advanced model authoring flows, see the talk "Dive into Core AI model authoring and optimization". In addition to debugging performance, it's also crucial to be able to debug the numerics of your converted model. For this you can use the Core AI Debugger which allows you to visualize your converted model, easily inspect intermediate tensor values, and trace back operations in the converted model to the Python source code which introduced them.
There is also a convenient Core AI debug gauge which shows you streaming Core AI activity while your app is running in Xcode. This is a great place to spot performance issues before jumping into Instruments.
One thing that was glossed over in the snake game implementation is the process of model specialization.
When you ship an AI model with your app, that is a source representation of the model, which can be run on any Apple device. However, to actually load and run the model within your app, it must be specialized for the device that the app is running on. When your model is loaded it is checked to see if it has already been specialized and cached. The specialization process can take a significant amount of time for very large models.
While future loads are from the cache and fast, that first time is something you may need to plan for. It is recommended you avoid having model specialization occur within user interactive flows.
Core AI can help you with that. First, Core AI gives you programmatic access to the default model cache for your app. You can request to load models directly from it. If nil is returned, it is not present and requires specialization. You can use this to gate features or inform the users that they may need to wait a bit while your app prepares the model.
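The check-then-prepare flow described here follows a common cache-gating pattern, sketched below in plain Python. Nothing in this sketch is Core AI API: `SpecializedModelCache`, `load_model`, and the `specialize` and `inform_user` callbacks are hypothetical stand-ins that only illustrate the control flow.

```python
class SpecializedModelCache:
    """Toy stand-in for a per-app cache of specialized models (hypothetical)."""

    def __init__(self):
        self._entries = {}

    def model(self, model_id):
        """Return the cached model, or None if it has not been specialized yet."""
        return self._entries.get(model_id)

    def store(self, model_id, model):
        self._entries[model_id] = model


def load_model(cache, model_id, specialize, inform_user):
    """Load from the cache when possible; otherwise warn the user and specialize."""
    model = cache.model(model_id)
    if model is None:
        inform_user("Preparing AI features. This may take a while…")
        model = specialize(model_id)  # the slow path, only taken on a cache miss
        cache.store(model_id, model)
    return model


messages = []
cache = SpecializedModelCache()
fake_specialize = lambda model_id: f"specialized:{model_id}"

first = load_model(cache, "SnakeTransformer", fake_specialize, messages.append)
second = load_model(cache, "SnakeTransformer", fake_specialize, messages.append)
print(len(messages))
```

In a real app the slow path would typically run off the main thread, so the message is shown while the work happens in the background.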
Second, you can request model specialization explicitly in your app independent of it being loaded.
You can do this after downloading assets or when the user opts in to a feature so the model is ready to go ahead of time. And there is a lot more control available. SpecializationOptions help configure how you want your model to be optimized for inference.
With the AIModelCache you can also delete entries you no longer need, and control the policy on how long entries persist.
You can even share a cache between multiple apps in the same app group.
Check out the "Managing model specialization and caching" article on developer.apple.com to learn more.
Independent of when specialization occurs, it still takes time. Let's take a quick peek inside. During specialization, the model goes through two main transformations. First, it goes through a core set of compilation steps which segment, plan and optimize compute. Second, executable artifacts are generated for the compute units used. These artifacts are tied to the device and OS version they were generated on. Of these two steps, compilation is the one which incurs most of the latency.
The Core AI toolchain can help you reduce that time by allowing some compilation to occur ahead of time on your development machine, producing a compiled version of the model.
While that compiled model still needs to be specialized for the specific user's device, there is now much less work to do, and it finishes significantly faster. To learn more about this option, check out the "Compiling Core AI models ahead of time" article on developer.apple.com. Controlling when, where, and how specialization happens is one way to help you optimize your users' experience. Another area you may want to optimize is removing any overheads in tight inference loops using your model. The Core AI Framework has several APIs to help you here.
You can dynamically check the optimal memory layout of NDArray arguments and allocate them with that structure to avoid layout conversions at inference time.
You can also pre-allocate output values for the framework to write into, to avoid allocating new output values during inference.
And you can also use asynchronous values to efficiently pipeline execution of multiple inference functions together. For most use cases, the higher-level inference APIs will get you exactly where you need to be.
But when you're optimizing a tight inference loop or integrating a model into a complex compute pipeline, these lower-level APIs are there when you need them.
Whether you're just getting started or diving deep, the Core AI Models repository is a great place to find what you need.
It has a collection of popular models, each just a single command away from being converted and optimized for your app.
It has AI skills that are experts in Core AI model authoring, optimization, and conversion.
And a Swift package with libraries for specific families of models that give you higher-level APIs that already have many of those low-level inference optimizations built in.
It also provides an API for creating a Core AI Language model, which plugs right in to the Foundation Models framework, letting you bring your own custom models and token sampling strategies.
To wrap things up: Core AI is available on all Apple Silicon to help you build cutting edge AI experiences on all Apple platforms.
It has tight integration with the existing Python tools that you're already familiar with, a modern Swift framework for running your models efficiently within your app, and state of the art debugging tools to help you understand how your models are running on Apple devices.
We can't wait to see what sorts of experiences you build.
5:08 - Convert a PyTorch model to Core AI
import torch
import coreai_torch

# Load trained snake model and sample input for tracing
pt_model = SnakeTransformer().load_checkpoint("snake.pt")
example = torch.randn(1, 5, 16)

# Export the torch program including dynamic shape for input sequence
seq_len = torch.export.Dim("seq_len", min=1, max=256)
exported = torch.export.export(
    pt_model,
    args=(example,),
    dynamic_shapes={"features": {1: seq_len}},
)
exported = exported.run_decompositions(coreai_torch.get_decomp_table())

# Convert torch graph → Core AI graph
ai_program = coreai_torch.TorchConverter().add_exported_program(
    exported,
    input_names=["features"],
    output_names=["logits"],
).to_coreai()

# Save as a .aimodel asset the runtime can load
ai_program.save_asset("SnakeTransformer.aimodel")
5:44 - Verify converted model numerics
import torch
import numpy as np
from coreai.runtime import AIModel, NDArray

# Load models
pt_model = SnakeTransformer().load_checkpoint("snake.pt")
ai_model = await AIModel.load("SnakeTransformer.aimodel")
function = ai_model.load_function("main")

# Assemble input sample - 10 frames of 16-dim game features, shape (1, 10, 16)
features = np.array([extract_features(game) for _ in range(10)], dtype=np.float32)[np.newaxis]

# PyTorch reference
with torch.no_grad():
    pytorch_logits = pt_model(torch.from_numpy(features)).numpy()[0, -1]

# Core AI inference
result = await function({"features": NDArray(data=features)})
coreai_logits = result["logits"].numpy()[0, -1]

# Validate
max_diff = np.max(np.abs(pytorch_logits - coreai_logits))
assert max_diff < 0.01
7:41 - Core AI framework core types
// Core types within Core AI
import CoreAI

// Load the '.aimodel' file
let model = try await AIModel(contentsOf: modelURL)

// Load the main inference function
let mainFunction: InferenceFunction = try model.loadFunction(named: "main")!

// Construct the n-dimensional input data
let inputNDArray: NDArray = nextInput()

// Run inference
var outputs = try await mainFunction.run(inputs: ["input": inputNDArray])
guard let outputNDArray = outputs.remove("output")?.ndArray else {
    // Handle unexpected missing output
}
8:33 - Initialize ModelPlayer with AIModel
// Initialize the player by loading the AIModel and InferenceFunction
struct ModelPlayer {
    let nextActionFunction: InferenceFunction

    init(modelURL: URL) async throws {
        let model = try await AIModel(contentsOf: modelURL)
        self.nextActionFunction = try model.loadFunction(named: "main")!
    }
}
8:49 - Run inference with NDArray inputs
extension ModelPlayer: SnakePlayer {
    mutating func chooseAction(game: SnakeGame) async throws -> Direction {
        // Create an NDArray for the next input and write board features into it
        var inputFeatures = NDArray(shape: [game.stepCount, hiddenDim], scalarType: .float32)
        writeFeatures(of: game, into: inputFeatures.mutableView())

        // Run inference and extract the expected logits output NDArray
        var outputs = try await nextActionFunction.run(inputs: ["features": inputFeatures])
        guard let logits = outputs.remove("logits")?.ndArray else {
            throw ModelError.missingOutput
        }
        return predictedDirection(from: logits.view())
    }

    func writeFeatures(of game: SnakeGame, into view: consuming NDArray.MutableView<Float>) { … }
    func predictedDirection(from logits: NDArray.View<Float>) -> Direction { … }
}
10:10 - Input features for the snake model
// Features at each time step
var features = [Float]()

// Distance to wall in all directions, normalized between [0, 1]
features += [dWallUp, dWallDown, dWallLeft, dWallRight]

// Distance to nearest food, normalized between [-1, 1]
features += [dFoodX, dFoodY]

// Direction encoded as one-hot: [1,0,0,0]=up, [0,1,0,0]=down, etc.
features += dir.oneHotEncoding

// Distance to the other snake, normalized to [-1, 1]
features += [dUserX, dUserY]

// Direction of the opponent snake
features += dirU.oneHotEncoding
12:18 - Add KV cache buffers to PyTorch module
# Update torch module to include key and value caches
# Use register_buffer to later make the exported torch program treat them as mutable
class SnakeTransformerStateful(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.register_buffer(
            "k_cache", torch.zeros(N_LAYERS, 1, MAX_SEQ_LEN, D_MODEL))
        self.register_buffer(
            "v_cache", torch.zeros(N_LAYERS, 1, MAX_SEQ_LEN, D_MODEL))
        # …
12:50 - Update forward pass to read/write KV caches
# During forward pass, read/write KV caches
class SnakeTransformerStateful(nn.Module):
    def forward(self, features, position_ids):
        new_k, new_v = [], []
        for i, block in enumerate(self.blocks):
            # read previous keys/values from caches
            k_prev = self.k_cache[i]
            v_prev = self.v_cache[i]
            # ... compute q/k/v for the new token, attend over valid prefix ...
            new_k.append(k_updated)
            new_v.append(v_updated)
        # Update key/value caches
        self.k_cache.copy_(torch.stack(new_k))
        self.v_cache.copy_(torch.stack(new_v))
        return self.action_head(self.ln_final(x))
12:59 - Re-convert model with state names
# Updated coreai-torch conversion code using key/value cache states
import torch
import coreai_torch

exported = torch.export.export(
    stateful_model,
    args=(example_features, example_position_ids),
    dynamic_shapes={"position_ids": {1: seq_len}},
)
exported = exported.run_decompositions(coreai_torch.get_decomp_table())

ai_program = coreai_torch.TorchConverter().add_exported_program(
    exported,
    input_names=["features", "position_ids"],
    state_names=["keyCache", "valueCache"],
    output_names=["logits"],
).to_coreai()

ai_program.save_asset("SnakeTransformer.aimodel")
13:17 - Store KV cache NDArrays in ModelPlayer
// Add stored properties for the key and value caches
struct ModelPlayer {
    let nextActionFunction: InferenceFunction
    var keyCache: NDArray
    var valueCache: NDArray

    init(modelURL: URL) async throws {
        let model = try await AIModel(contentsOf: modelURL)
        self.nextActionFunction = try model.loadFunction(named: "main")!
        self.keyCache = NDArray(shape: [layers, maxContext, hiddenDim], scalarType: .float32)
        self.valueCache = NDArray(shape: [layers, maxContext, hiddenDim], scalarType: .float32)
    }
}
13:45 - Pass state views to inference function
extension ModelPlayer: SnakePlayer {
    mutating func chooseAction(game: SnakeGame, snakeID: Int) async throws -> Direction {
        // …
        var stateViews = InferenceFunction.MutableViews()
        stateViews.insert(&keyCache, for: "keyCache")
        stateViews.insert(&valueCache, for: "valueCache")

        // Run inference and extract the expected logits output NDArray
        var outputs = try await nextActionFunction.run(
            inputs: ["features": inputFeatures],
            states: stateViews)
        // …
    }
}
16:22 - Check model cache before loading
// Check if your model can be loaded from the cache
let cache = AIModelCache.default
guard let model = try cache.model(for: modelURL, options: .default) else {
    Task { @MainActor in
        informUser("Preparing AI features. This may take a while…")
    }
}
16:42 - Request model specialization
// Explicitly request specialization
try await AIModel.specialize(contentsOf: modelURL)
- 0:00 - Introduction
Introduction to Core AI and an overview of what the session covers: model conversion, app integration, performance optimization, and additional features.
- 0:33 - What is Core AI
Core AI is the inference framework powering on-device Apple Intelligence, now available to developers. It covers the full model deployment lifecycle, leverages all of Apple Silicon (CPU, GPU, ANE), and comes with a modern Swift API, Python tooling, and a dedicated developer toolchain.
- 4:57 - Model conversion
How to convert a PyTorch model to the Core AI format using the coreai-torch Python package — including exporting with torch.export, specifying dynamic shapes, running the converter, and verifying numerical correctness of the converted model.
- 6:16 - App integration
How to load and run a Core AI model in your app using the CoreAI Swift framework — inspecting the model in Xcode's model viewer, initializing an AIModel, preparing inputs as NDArrays, running inference, and extracting outputs.
- 10:48 - Profiling with Instruments
How to use the new Core AI instrument in Xcode to profile model latency and identify performance bottlenecks, such as growing inference times caused by quadratic complexity in transformer models.
- 11:15 - Optimizing performance
How to eliminate inference slowdowns by adding a key-value cache as a stateful input to your model — authoring the cache in PyTorch, re-converting with state_names, and updating your app to pass MutableViews of the cache buffers at inference time.
- 14:13 - Additional features
A tour of Core AI tools not used in the demo: the rich Python authoring experience, the Core AI Debugger for numeric debugging of converted models, and the Core AI debug gauge in Xcode for streaming activity monitoring.
- 15:34 - Specialization
How Core AI specializes models for the target device — what happens during specialization, how to manage it with programmatic cache access and SpecializationOptions, and how ahead-of-time (AOT) compilation can shift work off the user's device.
- 20:07 - Next steps
Summary of Core AI's capabilities: on-device inference across all Apple Silicon, Python tooling integration, and debugging tools — with an invitation to explore the Core AI Models repository.