Bring an LLM provider to the Foundation Models framework
Extend the Foundation Models framework by implementing a LanguageModelExecutor for new models. Explore how to interface with the LanguageModelSession's transcript, manage session state effectively, and optimize KV cache utilization. Find out how to support custom segment types and unlock advanced capabilities for your generative AI features.
Chapters
- 0:00 - Introduction
- 3:37 - Packaging
- 4:48 - Protocol
- 14:50 - Authentication
- 15:51 - Customization
- 19:47 - Next steps
WWDC26
Hi there!
I'm Christopher Webb, an engineer on the Machine Learning Research team, and I'm excited to talk to you about a new way to use the Foundation Models framework. We previously introduced the Foundation Models framework to give you access to Apple's on-device language model. And now, we're opening up the framework to work with nearly any LLM, local or server-based. This allows anyone, from large companies to individual developers, to easily build their own model integration on top of the framework.
The on-device System Language Model has been rebuilt from the ground up: it's smarter, better at instruction following, and accepts images directly in your prompts. Beyond the system model, we've added three more options. Private Cloud Compute brings you the model behind many Apple Intelligence features: now with reasoning, a 32K token context window, and the privacy guarantees you'd expect.
Core AI lets you run local models efficiently and take advantage of the ANE. And MLX unlocks the thousands of models available via the MLX-Community on Hugging Face.
And because these are built on top of a brand new public protocol, developers can bring frontier AI models into their apps using the same framework. Anthropic and Google will soon extend the Foundation Models framework with Swift packages of their own, making state-of-the-art Claude and Gemini models available to all Swift developers. Whichever model you use, Apple's, yours, or the community's, you call them the same way, because every model conforms to the Language Model protocol.
For app developers, I'll show you how to call any of these models through the same familiar API. For model providers, I'll walk you through how to create a Language Model package of your own.
But first, let me show you a preview of how easy it is to use.
Here's our on-device Foundation Model. Create it, pass it into a session, and call the respond function. And there are even more model options. If you need more horsepower, try Private Cloud Compute. Just swap the model.
If you want to ship your own model, just point Core AI at your resources.
And if you want to try the latest open source models, simply pass in a model ID, and let the framework handle the rest.
And using a model built on top of the Language Model protocol means you get access to all kinds of great Foundation Models features, like Dynamic Profiles.
For an overview of everything we’re adding, check out "What’s new in the Foundation Models framework".
The reason it’s possible to swap out models so easily is that every LanguageModel honors the same protocol: the System Language Model, PCC, Core AI, MLX, and those built by the community. If you're a model provider, you should join the fun! Let me show you how.
There are four steps to bring your model into the framework. We’ll start with packaging. A well-crafted Swift package makes it easy for developers to get started.
Then you implement the protocol by defining the types that describe your model and the executor that runs it.
Next, we’ll talk about how to implement authentication for server-based models, including some best practices.
And finally, customization. If you need to tailor the protocol’s building blocks to meet your needs, you can do that. From attaching response metadata, all the way to defining entirely new modalities.
First up, packaging.
We recommend using Swift package manager so that developers can simply add your package as a dependency of their app. We'll cover how to set up Package.swift, and how to publish a release.
An important consideration is which platforms you'll want to support. Foundation Models supports iOS, macOS, visionOS, and watchOS, allowing developers to create a variety of experiences. We recommend you try to do the same.
And because the Foundation Models framework is being released as open source, your package could also be useful to developers who deploy Swift on their servers, so consider supporting Linux too.
Third, your dependencies. Every dependency translates to bytes that a developer ships to their users. Carefully consider what dependencies are linked by your package.
Publishing your package is as easy as creating a git tag. Swift Package Manager is decentralized, so your repo URL is your distribution channel. Developers can paste the URL into Xcode and start integrating your model into their apps. For more, see "Creating Swift Packages".
With the package in place, we move on to the protocol. The protocol is the bridge between your model and the Foundation Models framework.
The protocol has two key pieces. The first is LanguageModel. It describes the model to the framework. It declares what the model can do, through capabilities, and provides the configuration the framework needs to set up the model's executor.
The second piece is LanguageModelExecutor, where the work happens. It has an initializer that takes a Configuration, a prewarm function for preparing resources ahead of the first request, and a respond function that streams generation back to the session.
The Configuration is what links the two types: the Model provides it, and the framework uses it to construct the executor.
Now that you've seen the protocol in code, let's build an intuition for how the model's configuration links it to the executor.
Each session holds an executor store.
When Model1 arrives, the framework checks the store using the model's configuration, but there's no matching executor.
So, the LanguageModelSession creates a new executor and stores it.
Model2 produces the same configuration, and because Configuration is Hashable, the framework knows it matches, and resolves to the same executor.
The configuration is the lookup key, not the model.
Model3 produces a different configuration, so it gets its own executor. Each unique configuration maps to exactly one executor in the store.
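One way to picture the store is as a dictionary keyed by configuration. The sketch below is a minimal, self-contained illustration of that lookup; `SketchConfiguration`, `SketchExecutor`, and `ExecutorStore` are hypothetical stand-ins rather than the framework's own types, and the framework's real store may be implemented differently.

```swift
// Hypothetical stand-in for an executor configuration. Hashable is what
// lets it act as the lookup key.
struct SketchConfiguration: Hashable {
    var modelID: String
    var endpoint: String
}

// Hypothetical stand-in for an executor; a class so identity can be compared.
final class SketchExecutor {
    let configuration: SketchConfiguration
    init(configuration: SketchConfiguration) {
        self.configuration = configuration
    }
}

// Hypothetical per-session store: one executor per unique configuration.
struct ExecutorStore {
    private var executors: [SketchConfiguration: SketchExecutor] = [:]

    var count: Int { executors.count }

    // Returns the stored executor for this configuration, creating it on a miss.
    mutating func executor(for configuration: SketchConfiguration) -> SketchExecutor {
        if let existing = executors[configuration] {
            return existing
        }
        let created = SketchExecutor(configuration: configuration)
        executors[configuration] = created
        return created
    }
}

var store = ExecutorStore()
let first = store.executor(for: SketchConfiguration(modelID: "a", endpoint: "x"))
let second = store.executor(for: SketchConfiguration(modelID: "a", endpoint: "x"))
let third = store.executor(for: SketchConfiguration(modelID: "b", endpoint: "x"))
```

Here `first` and `second` are the same instance because their configurations are equal, while `third` gets an executor of its own.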
So what does this look like in your code? Here's a LanguageModel implementation. It declares its capabilities and returns the configuration the framework uses to find its executor.
The executor is where the real work lives: loading weights, managing resources, and streaming tokens back to the session.
The framework constructs it from a configuration your model provides, then hands the model in on every request.
That split is what keeps your Model trivial to construct.
When the session deallocates, the store goes with it. Every stored executor gets released, your deinit runs, weights are freed, and connections are closed, all automatically. You don't write any of that teardown code yourself.
Within that lifecycle, your executor has one more function: prewarm. Before a request arrives, the developer can ask the framework to prewarm. It's your chance to do expensive setup ahead of time, like loading weights, opening connections, or anything that would otherwise slow down that first response. Let's look at how to use it.
One approach is to put that setup in a private helper that loads the weights once and caches them. prewarm calls the helper eagerly, so the weights are ready before the first request arrives. But prewarm isn't guaranteed to run, so respond calls the same helper before generating.
Either way, weights load exactly once. If your executor has no expensive setup, like a server-backed model, prewarm can simply be a no-op.
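The load-once idea can also be sketched with an actor, which serializes access to the cache. This is only an illustration under stated assumptions: `WeightStore` and its placeholder weights are hypothetical, and loading is assumed to be synchronous. If loading were asynchronous, actor reentrancy would allow two loads to start, so you would cache the in-flight task instead.

```swift
// Hypothetical holder for a model's weights; not a framework type.
actor WeightStore {
    private var cached: [Float]?
    private(set) var loadCount = 0

    // Returns the cached weights, loading them only on first use.
    func weights() -> [Float] {
        if let cached = cached {
            return cached
        }
        loadCount += 1
        let loaded: [Float] = [0.1, 0.2, 0.3]  // stands in for reading weights from disk
        cached = loaded
        return loaded
    }
}

let weightStore = WeightStore()
_ = await weightStore.weights()  // what prewarm would do, if it runs
_ = await weightStore.weights()  // what respond does on every request
let loads = await weightStore.loadCount
print(loads)  // prints 1
```

Whether or not the prewarm call happens, the second call finds the cache populated and skips the load.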
Once your respond function is called, your executor goes to work. It converts the transcript of the conversation into the format your model expects.
It applies the options the developer has set and it streams generation events to the session.
From the developer's side, the session is the entire interaction surface. They initialize the model, create the session, call respond, and wait. Your executor and the rest of your package, all of that lives behind the session, out of sight. The developer never sees that machinery, but here's what's happening behind the scenes.
The framework hands you transcript entries, but your inference engine can only process its native types.
So your executor sits in the middle, translates the entries into messages your inference engine understands, and passes them along for inference.
When your inference engine answers, the same translation runs in reverse: your messages back to transcript entries, streamed to the session.
For now, let's focus on the transcripts that flow in and out of the executor.
A transcript is the conversation so far, expressed as a sequence of entries. Each entry plays a role: instructions, set by the developer; prompts, from the user; tool calls your model made, and the outputs they returned; and the responses your model has produced.
Zooming back out: your executor's job is to turn each transcript entry into a message your inference engine knows how to read.
So, what's inside a transcript? Foundation Models defines these six entry types.
Your model defines its own roles. Your executor's job is to map between the two, no matter the shape your model takes.
In this example, instructions, prompt, and response map to system, user, and assistant.
Here, tool calls, tool outputs, and reasoning all map to assistant too. They're part of what the model did during its turn, and since this model doesn't have dedicated roles for these, we just map them to assistant.
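The mapping just described can be sketched as a single switch. The enums below are hypothetical stand-ins for the framework's entry types and for a model with only three roles; the names are assumptions for illustration, not the framework's API.

```swift
// Hypothetical stand-in for the kinds of transcript entries mentioned above.
enum EntryKind {
    case instructions, prompt, toolCalls, toolOutput, reasoning, response
}

// Hypothetical roles for a model that only knows system, user, and assistant.
enum ModelRole: String {
    case system, user, assistant
}

// Maps each entry kind to the role this particular model understands.
func role(for kind: EntryKind) -> ModelRole {
    switch kind {
    case .instructions:
        return .system
    case .prompt:
        return .user
    case .toolCalls, .toolOutput, .reasoning, .response:
        // No dedicated roles for these, so they count as part of the model's turn.
        return .assistant
    }
}

let toolRole = role(for: .toolOutput)
print(toolRole.rawValue)  // prints "assistant"
```

Because the switch is exhaustive, adding a new entry kind forces a compile-time decision about where it should be routed.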
If your model does define something like a dedicated tool role, you can route there instead. Either way, your executor stays in control.
Your executor reads the conversation. But every request carries more than history, it carries the developer's intent for how the model should respond, expressed through two additional properties.
Every request object can include ContextOptions and GenerationOptions. ContextOptions control what goes into the prompt, like the reasoning level you want the model to use, or a response schema. GenerationOptions control the decoder loop: sampling strategy, temperature, and maximum response length.
Here's what that looks like inside respond. Both types of options come in on the request; your executor pulls them out and passes them along when calling the model.
So that's everything coming in: transcript, options, all parsed. Now for the half your developer sees: the response.
On the response side, there are a few things to send: the text your inference engine generates, any tool calls or reasoning, and the metadata that travels with them. They all go out as events on the channel.
Each chunk that the inference engine emits, a token or tool-call fragment, becomes an event. A textDelta, a toolCallDelta, and so on.
The framework writes them to the transcript. Foundation Models exposes both one-shot and streaming responses, but the implementation is always streaming; the one-shot API just collects the deltas internally.
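The idea that a one-shot call is a stream collected to completion can be sketched with `AsyncStream`. This is a simplified illustration, not the framework's implementation: `collectText(from:)` is a hypothetical helper, and plain strings stand in for text-delta events.

```swift
// Hypothetical helper: gathers streamed text deltas into one final string.
func collectText(from deltas: AsyncStream<String>) async -> String {
    var text = ""
    for await delta in deltas {
        text += delta
    }
    return text
}

// A stand-in stream that yields three deltas and then finishes.
let deltas = AsyncStream<String> { continuation in
    for piece in ["It's ", "65°F ", "and sunny"] {
        continuation.yield(piece)
    }
    continuation.finish()
}

let fullText = await collectText(from: deltas)
print(fullText)  // prints "It's 65°F and sunny"
```

A streaming consumer would render each delta as it arrives, while a one-shot consumer waits for the collected result.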
So far we've looked at this from your model's side, events going out as the model produces them. But put yourself in the developer's seat for a moment. They've called respond and they're waiting. What do they need first?
Here's your executor's side of the handshake with the developer. There's a deliberate order to it.
First, a metadata update: model and request IDs the developer can use for logging and debugging.
Then a usage update: prompt token counts for accounting. Sending these upfront means the developer isn't waiting through the whole stream to learn what each request costs.
Finally, for each token your model produces, send a text delta the moment it arrives. The framework streams those deltas to the session as they arrive, so users see the response appear word-by-word instead of all at once.
Earlier we saw how the framework caches executors by configuration.
If your integration is stateful, holding a KV cache or persistent session between calls, that caching is what lets you minimize network churn and avoid redoing work. Now let's look at how to design yours to take advantage of that, and how your executor can preserve work across calls.
Your executor receives the full transcript on every call to respond. Here's what you processed last time: an instruction, a prompt, and the response you generated.
When the next call comes in you compare the new transcript to the one you saved from last time.
In most cases, new entries have simply been appended, a new prompt after the last response.
When that's the case, you can preserve your existing state and only process what's new.
But sometimes your comparison finds that entries have been removed or modified, for example, when the developer trims older entries to save context.
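The comparison can be sketched as a shared-prefix check between the saved transcript and the incoming one. The types below are hypothetical stand-ins (a real executor would compare the framework's transcript entries, and may define "match" differently); the point is only the shape of the decision.

```swift
// Hypothetical stand-in for a transcript entry.
struct Entry: Equatable {
    var role: String
    var text: String
}

// What the executor decides to do with its saved state.
enum CachePlan: Equatable {
    case reuse(processFrom: Int)   // saved state is intact; process only new entries
    case invalidate(backTo: Int)   // state is valid only up to this index
}

// Counts how many leading entries the two transcripts have in common.
func sharedPrefixLength(_ saved: [Entry], _ incoming: [Entry]) -> Int {
    var length = 0
    while length < saved.count, length < incoming.count,
          saved[length] == incoming[length] {
        length += 1
    }
    return length
}

func plan(saved: [Entry], incoming: [Entry]) -> CachePlan {
    let shared = sharedPrefixLength(saved, incoming)
    return shared == saved.count
        ? .reuse(processFrom: shared)
        : .invalidate(backTo: shared)
}

let saved = [
    Entry(role: "system", text: "You are a helpful assistant"),
    Entry(role: "user", text: "What's the weather?"),
    Entry(role: "assistant", text: "65°F and sunny"),
]
let appended = saved + [Entry(role: "user", text: "And tomorrow?")]
let trimmed = [saved[0], Entry(role: "user", text: "And tomorrow?")]

let appendPlan = plan(saved: saved, incoming: appended)  // .reuse(processFrom: 3)
let trimPlan = plan(saved: saved, incoming: trimmed)     // .invalidate(backTo: 1)
```

In the second case the shared prefix is shorter than the saved transcript, which is the situation where entries were removed or modified.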
When that happens, you'll need to invalidate back to where the transcripts diverge. The framework gives you the full transcript on every call. Your executor decides what counts as a match, and how to handle any changes.
Sometimes your model can't do exactly what the developer asked. When that happens, your executor has two choices: approximate or throw.
Be flexible where you can, and honor the developer's intent.
But sometimes there's no honest approximation. If a developer sets a token limit, but also specifies a schema with required fields, there might not be a way to satisfy both. So you throw. Foundation Models ships LanguageModelError for exactly these cases: context window overflows, rate limits, refusals, and more. Throw one of these, and any developer who's used the framework already knows how to handle it.
When the built-in LanguageModelError cases don't cover your situation, define your own error type. Some failures only make sense in the context of your service: your subscription tiers, your features, your account states. A purpose-built case name carries the intent, so a developer catching it knows exactly what happened. Custom errors are powerful, and sometimes you need them. But each one is a new case developers must learn, catch, and handle in their app. Try to use a built-in LanguageModelError when it fits, and save the custom ones for failures only your service can produce.
We've finished implementing the protocol requirements. Let's discuss how to handle authentication next.
Your job as a package author is to make it easy for developers to do the right thing. If your initializer takes an API key as a string, developers will be tempted to take the path of least resistance. Instead, help developers do the right thing by offering a token provider or sign in flow.
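A token-provider design might look like the sketch below. Every name here (`TokenProvider`, `ServiceClient`, `FixedTokenProvider`) is hypothetical and chosen for illustration; the point is that the package asks for a way to obtain a token rather than for a raw key string.

```swift
// Hypothetical protocol: the app supplies short-lived tokens on demand,
// so no long-lived API key needs to be embedded in the app binary.
protocol TokenProvider: Sendable {
    func accessToken() async throws -> String
}

// Hypothetical client an executor could use to authorize each request.
struct ServiceClient {
    let tokenProvider: any TokenProvider

    // Fetches a fresh token and builds the authorization header from it.
    func authorizationHeaders() async throws -> [String: String] {
        let token = try await tokenProvider.accessToken()
        return ["Authorization": "Bearer \(token)"]
    }
}

// A fixed-token provider, intended for tests and previews only.
struct FixedTokenProvider: TokenProvider {
    let token: String
    func accessToken() async throws -> String { token }
}

let client = ServiceClient(tokenProvider: FixedTokenProvider(token: "test-token"))
let headers = try await client.authorizationHeaders()
```

A production provider would typically fetch the token from the developer's own backend or a sign-in flow, which keeps the secret out of the shipped app.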
And if your package fetches access tokens on behalf of developers, make sure to persist them securely using Keychain.
Credential handling is half the story. Device attestation is the other half. If you're shipping a cloud-based LanguageModel package, this is worth a deep look.
This related session walks through verifying the device, catching tampered builds, signing payloads, and using Apple's fraud signal to keep bad traffic off your service. Check it out in "Secure your apps with App Attest".
You've packaged your model, implemented the protocol, and handled authentication. That means you've got a solid package for your LanguageModel, with all the fundamentals covered. Now it's time to differentiate. The protocol gives you room to shape LanguageModelSession around the abilities only your model offers.
Response metadata is a lightweight option to attach additional information to your responses, and give developers clear ways to access it.
You can attach your own custom metadata to the response. Here, after streaming completes, our executor sends tokensPerSecond and timeToFirstToken through the channel.
We recommend providing utilities or documentation that make it easy for developers to work with your metadata: clear keys, typed accessors, whatever makes sense. Underneath, metadata is just a dictionary. It can contain strings, numbers, and other built-in types. But in some cases, you may need something more flexible.
Custom segments are the answer. You'll define a new segment type, receive it in your executor, and stream results back through the same channel, and the developer never has to leave LanguageModelSession to use them. Custom segment types let you extend the protocol. When a new modality comes along, audio, video, whatever's next, developers have a typed, structured way to send that data to your model.
Here's how it works. First, you'll define a type that conforms to custom segment. Because custom segments are required to be PromptRepresentable, developers can pass it directly in their prompts, just like text.
In your executor, you'll receive this as a customSegment in the transcript, alongside the text entries you're already handling. When your model responds, you emit the result back through the channel as a custom segment update.
The segment ID controls whether you're adding a new segment, or updating one you've already started streaming. This gives you full control over how results stream into the app.
With custom segments in hand, there's one more thing worth calling out: a recommendation for server-side tools.
Server-side tools are capabilities your model runs on its own, like web search, code execution, or image generation. The model invokes them, the server runs them, and your executor watches the results stream in. We'll walk through three levels of detail, each surfacing more of the tool's work, using web search as an example.
Server-side tools are named, typed values on your model. The developer constructs the model with the tools they want, and your executor receives them through the model on every request, the same way it receives every other capability the model declares.
First, the simplest pattern: run the tool privately and stream only the answer back. The tool grounds the model's response, but its work stays inside your executor.
Each text delta you append gets streamed into the transcript by the framework, with no trace of the tool that produced it.
In addition to grounding the answer on the tool's output, you can also attach additional metadata to the response.
When a text delta carries metadata, like a citation, forward both to the channel, and the framework attaches the metadata to the text segment in the transcript.
And finally, you can choose to surface the tool's work itself. With custom segments, you forward the tool's structured output to the channel, alongside the text and any metadata, giving apps everything the model produced along the way.
Through one channel, the events you forward, the metadata you attach, and the custom segments you design, server-side tools shape what apps using your package can show their users.
There's one more thing to keep in mind: whether you're choosing a package or shipping one, make sure everyone in the chain understands the privacy implications of the model behind it. On-device and cloud-based models have very different privacy characteristics, and your users deserve to know which they're getting.
You've seen how to bring your model to the framework. These sessions show what developers will build with it. Check out "Integrate On-Device AI Models into Your App Using Core AI" for bundling local models directly into an app.
"Build with the new Apple Foundation Model on Private Cloud Compute" goes deep on server-scale inference with Apple's privacy guarantees. And "Build agentic app experiences with the Foundation Models framework" shows how developers use dynamic profiles to build multi-step, tool-using workflows on top of models like yours.
We're excited about what's ahead. We hope to see a thriving ecosystem of LanguageModel packages, giving Swift developers the freedom to choose the model that's right for their app. We can't wait to see what you build.
2:00 - Choose a language model
import FoundationModels
import MLXFoundationModels

// On-device Apple Foundation Model
let model = SystemLanguageModel()

// Private Cloud Compute model
// let model = PrivateCloudComputeLanguageModel()

// Custom Core AI model
// let model = try await CoreAILanguageModel(resourcesAt: modelURL)

// Open-source MLX model from HuggingFace
// let model = MLXLanguageModel(modelID: "mlx-community/my-model")

let session = LanguageModelSession(model: model)
let response = try await session.respond(to: "...")
print(response.content)
3:46 - Configure Package.swift for your model package
// Package.swift
import PackageDescription

let package = Package(
    name: "MyModel",
    platforms: [
        .macOS(.v27),
        .iOS(.v27),
        .visionOS(.v27),
        .watchOS(.v27)
    ],
    products: [
        .library(name: "MyModel", targets: ["MyModel"])
    ],
    dependencies: [
        .package(url: "...", .upToNextMinor(from: "1.0.0"))
    ],
    targets: [
        .target(name: "MyModelRuntime"),
        // public: LanguageModel conformance
        .target(name: "MyModel", dependencies: ["MyModelRuntime"]),
        .testTarget(name: "MyModelTests", dependencies: ["MyModel"])
    ]
)
4:56 - LanguageModel and LanguageModelExecutor protocols
// LanguageModel protocol
public protocol LanguageModel: Sendable {
    var capabilities: LanguageModelCapabilities { get }
    var executorConfiguration: Executor.Configuration { get }
}

// LanguageModelExecutor protocol
public protocol LanguageModelExecutor: Sendable {
    init(configuration: Configuration) throws

    func prewarm(model: Model, transcript: Transcript)

    func respond(
        to request: LanguageModelExecutorGenerationRequest,
        model: Model,
        streamingInto channel: LanguageModelExecutorGenerationChannel
    ) async throws
}
6:25 - Implement LanguageModel and Executor conformances
// LanguageModel conformance
public struct MyLanguageModel: LanguageModel {
    typealias Executor = MyLanguageModelExecutor

    public var capabilities: LanguageModelCapabilities {
        LanguageModelCapabilities(capabilities: [
            .toolCalling,
            .guidedGeneration,
            .reasoning
        ])
    }

    public var executorConfiguration: Executor.Configuration {
        Executor.Configuration(/* ... */)
    }
}

// Executor conformance
public struct MyLanguageModelExecutor: LanguageModelExecutor {
    public typealias Model = MyLanguageModel

    public struct Configuration: Hashable, Sendable { /* ... */ }

    public init(configuration: Configuration) throws { /* ... */ }

    public func respond(
        to request: LanguageModelExecutorGenerationRequest,
        model: MyLanguageModel,
        streamingInto channel: LanguageModelExecutorGenerationChannel
    ) async throws { /* ... */ }
}
7:28 - Manage model resources with prewarm and respond
// One approach to managing resources
struct MyLanguageModelExecutor: LanguageModelExecutor {
    private mutating func loadModelIfNeeded() throws -> LoadedWeights {
        let weights = try loadedModel ?? loadWeights()
        loadedModel = weights
        return weights
    }

    func prewarm(transcript: Transcript) {
        loadedModel = try? loadModelIfNeeded()
    }

    func respond( ... ) async throws {
        let weights = try loadModelIfNeeded()
        // ...generate with 'weights'...
    }
}
9:00 - Map Transcript entries to model messages
// Transcript entries
let transcript = Transcript(entries: [
    .instructions( ... ),  // "You are a helpful assistant"
    .prompt( ... ),        // "What's the weather in Pittsburgh?"
    .toolCalls( ... ),     // getWeather(location: "Pittsburgh")
    .toolOutput( ... ),    // 65°F, sunny
    .response( ... ),      // "It's 65°F and sunny in Pittsburgh"
    .prompt( ... ),        // "What's the address of Apple Park?"
    .response( ... ),      // "One Apple Park Way, Cupertino, CA 95014"
])
10:42 - Read generation and context options from the request
// Parse generation and context options
func respond(
    to request: LanguageModelExecutorGenerationRequest,
    model: MyLanguageModel,
    streamingInto channel: LanguageModelExecutorGenerationChannel
) async throws {
    let reasoningLevel = request.contextOptions.reasoningLevel
    let temperature = request.generationOptions.temperature
    let maxTokens = request.generationOptions.maximumResponseTokens
}
11:47 - Stream tokens and metadata through the channel
// Streaming text tokens
func respond( ... ) async throws {
    // 1. Report metadata
    await channel.send(.response(action: .updateMetadata([
        "modelID": "my-model-2026-06-08",
        "requestID": request.id.uuidString
    ])))

    // 2. Report prompt token usage before generating
    await channel.send(.response(action: .updateUsage(
        input: .init(totalTokenCount: promptTokens, cachedTokenCount: cachedTokens),
        output: .init(totalTokenCount: 0, reasoningTokenCount: 0)
    )))

    // 3. Stream text deltas as the model generates
    for try await token in tokens {
        await channel.send(.response(action: .appendText(token)))
    }
}
13:33 - Honor the developer's intent or throw
// Honor the developer's intention where possible
// The developer set sampling: .greedy, but our service only takes temperature
if request.generationOptions.sampling?.kind == .greedy {
    serviceRequest.temperature = 0
}

// Otherwise, throw an error
// The token budget is too small to satisfy the schema
if let schema = request.schema,
   let budget = request.generationOptions.maximumResponseTokens,
   budget < minimumTokens(for: schema) {
    throw LanguageModelError.unsupportedCapability(
        .init(
            capability: .guidedGeneration,
            debugDescription: "Token budget too small to satisfy this schema."
        )
    )
}
13:57 - Built-in errors that any model can throw
// Built-in errors that any model can throw
public enum LanguageModelError: LocalizedError, CustomDebugStringConvertible {
    // Transcript grew past the model's context window. Trim entries and retry.
    case contextSizeExceeded( )

    // Too many requests in a short window. Space them out or reduce load.
    case rateLimited( )

    // Model declined to answer. Fall back to a message of your choosing.
    case refusal( )

    // Safety guardrails tripped on the prompt or the response.
    case guardrailViolation( )

    // Model lacks a feature you used, such as guided generation or tools.
    case unsupportedCapability( )

    // Prompt contains content the model can't process (bad files, unknown formats).
    case unsupportedTranscriptContent( )

    // A generation guide (e.g., a regex pattern) isn't supported by this model.
    case unsupportedGenerationGuide( )

    // Prompt asked for output in a language or locale the model doesn't support.
    case unsupportedLanguageOrLocale( )

    // Request timed out before the model produced a response.
    case timeout( )
}
14:14 - Handle errors from your model executor
// Custom errors
public enum MyModelError: Error, LocalizedError {
    // User hit monthly token limit. Prompt upgrade or wait for reset.
    case exceededSubscriptionTierLimit

    // Model variant isn't enabled on this account.
    case modelNotProvisioned

    // Billing or policy review locked this account.
    case accountSuspended

    public var errorDescription: String? {
        switch self {
        case .exceededSubscriptionTierLimit:
            String(localized: "Your plan limit has been reached.")
        // ...
        }
    }
}
16:08 - Attach custom metadata to responses
// Attach service-specific performance metadata
let elapsed = Date().timeIntervalSince(startTime)
let tokensPerSecond = Double(tokenCount) / elapsed
let timeToFirstToken = firstTokenTime?.timeIntervalSince(startTime) ?? 0

await channel.send(.metadataUpdate([
    "tokensPerSecond": tokensPerSecond,
    "timeToFirstToken": timeToFirstToken
]))
17:05 - Define and use custom Transcript segments
// Define a custom segment
public struct AudioSegment: Transcript.CustomSegment {
    public var id: String
    public var content: URL
}

// Pass it in a prompt
let recording = AudioSegment(id: UUID().uuidString, content: URL(filePath: "/path/to/recording.m4a"))
let response = try await session.respond {
    "Where was Frank Lloyd Wright's original architecture school located?"
    recording
}

// Emit a custom segment from the executor
for try await event in stream {
    switch event {
    case .audioFileGenerated(let file):
        await channel.send(.response(action: .updateCustomSegment(
            AudioSegment(id: file.id, content: file.url)
        )))
    }
}
18:09 - Implement server-side tools in your model
// Configure server-side tools
public struct MyLanguageModel: LanguageModel {
    public struct ServerTool: Sendable {
        public static let webSearch: ServerTool = ...
    }

    public init(serverTools: [ServerTool] = []) { }
}

// Surface tool results through the channel
let client = MyServerClient(serverTools: model.serverTools)
let response = try await client.send(prompt: .init(request))

for try await chunk in response {
    switch chunk {
    case .webSearch(let webSearch):
        await channel.send(.response(action: .updateCustomSegment(
            WebSearchSegment(url: webSearch.url, content: webSearch.html)
        )))
    case .textDelta(let textDelta):
        await channel.send(.response(action: .appendText(
            textDelta.text,
            tokenCount: textDelta.tokenCount
        )))
    }
}
- 0:00 - Introduction
Overview of the Foundation Models framework opening to nearly any LLM. Covers improvements to the on-device System Language Model, three new model options (Private Cloud Compute, Core AI, and MLX), upcoming Anthropic and Google partner integrations, and a code preview showing how any model can be swapped into a LanguageModelSession using the same Swift API.
- 3:37 - Packaging
How to package your LLM provider as a Swift package — configuring Package.swift with the right platform targets (iOS, macOS, visionOS, watchOS, and Linux), being deliberate about dependencies to minimize shipped bytes, and publishing a release via a git tag that developers can paste directly into Xcode.
- 4:48 - Protocol
The two core protocol types bridging your model to the framework: LanguageModel (declares capabilities and provides a Configuration) and LanguageModelExecutor (handles prewarm, translates Transcript entries to your inference engine's native format, applies ContextOptions and GenerationOptions, and streams responses with metadata-first ordering). Covers executor caching by configuration and KV cache state reuse across calls, plus how to approximate unsupported options or throw LanguageModelError when needed.
- 14:50 - Authentication
Best practices for credential handling — designing initializers that guide developers toward secure usage rather than plain API key strings, persisting tokens securely via Keychain, and using App Attest for device attestation to verify devices, catch tampered builds, and protect cloud-based language model services.
- 15:51 - Customization
How to differentiate your model package beyond the protocol fundamentals — attaching custom response metadata (e.g., tokensPerSecond, timeToFirstToken), defining custom segment types for new input and output modalities (audio, video, and beyond), and implementing server-side tools (web search, code execution, image generation) at three levels of visibility: privately grounded, metadata-enriched, or fully surfaced through custom segments.
- 19:47 - Next steps
Privacy considerations when choosing or shipping a model package — on-device versus cloud-based models have very different characteristics and users deserve to know which they're getting. Pointers to companion sessions on Core AI model integration, Private Cloud Compute, and building agentic app experiences on top of the new model ecosystem.