Debug and profile agentic app experiences with Instruments

Back to WWDC26

Debug and profile agentic app experiences with Instruments

Explore the enhanced FoundationModels instrument in Xcode to inspect behavior and optimize the performance of agentic flows. Learn how to inspect prompts, analyze latency, and trace control flow in advanced use cases that leverage multiple LanguageModelSessions and profiles.

Chapters
- 0:00 - Introduction
- 1:57 - LLM app development mindset
- 3:59 - Inspect and diagnose an agentic experience
- 5:02 - Recording a trace with Instruments
- 6:04 - Navigating the Instruments UI
- 12:07 - Performance metrics
- 13:04 - Next steps
Resources
- Analyzing the runtime performance of your Foundation Models app
- - HD Video
  - SD Video
Related Videos

WWDC26
Hi, I'm Erik an AI Tools Engineer.
In this session, I'll show you how to use Instruments to debug and develop features built with the Foundation Models framework.
The Foundation Models APIs give your app direct access to on-device and server-based generative AI. With them, you can build features that understand natural language, generate content, and respond to what the person is doing.
The features that create the best experiences aren't static. They adapt based on context. That's what the Foundation Models APIs are designed for. DynamicInstructions lets you specify exactly which instructions and tools the model can access. It re-evaluates before every request, so the model always has the right context for the task at hand. That flexibility is what makes these features so responsive, and also what makes them harder to debug. Building with Large Language Models or LLMs is different from traditional development. Traditional code is predictable. LLMs are non-deterministic - the same input can produce different outputs. When a feature loses context or responds too slowly, tracking down the cause isn't straightforward. Good tooling makes the difference. By the end of this session, you'll know how to use Instruments to identify and fix those issues and ship fast, reliable experiences with confidence. First, we'll start by comparing and contrasting traditional versus LLM app development concepts to get us into the right mindset. Then, we'll use Instruments to inspect and debug an agentic experience I'm developing in my Craft app. Before getting started, we recommend you check out "What's new in the Foundation Models framework" and "Build agentic app experiences with the Foundation Models framework" to gain a better understanding of the latest additions.
Building apps with LLMs introduces three challenges you won't find in traditional software development.
The first is probabilistic output. Give a traditional function the same input twice, and you get the same output. LLMs don't work that way. The same prompt can produce two completely different responses which means standard unit testing breaks down. You can't assert that an output matches a hardcoded string. You have to evaluate the quality and intent of the response instead.
The second is model-to-model communication. Powerful features often rely on multiple models working together. For example, in a recipe app, one model might identify ingredients in a photo, while a second generates a recipe from that result. Getting data to flow reliably between those models, and recovering gracefully when something goes wrong, is where real complexity lives.
And the third is observability. When something breaks in a multi-model pipeline, it can be very hard to know where it went wrong. You need visibility into each step: what the model received, what it decided, and why. That's exactly what this session is about.
At its core, an LLM application does three things: a person sends a prompt, the model reasons about it, and the person gets a response. Simple, fast, and for many features (a summarization tool, a writing assistant, a Q&A interface), exactly what you need. Many useful features need more than text generation. Sometimes the model needs information it doesn't have: the current time, a database record, or a search result. That's where tool calls come in. The loop works like this: the person sends a prompt, the model reasons about it and calls a tool, that tool performs an action, the model takes the result and generates a final response, which can kick off the loop again. Each extra step adds latency. Each step is a new place for failure. Understanding this loop is the basis for everything the Foundation Models Instrument shows you. Now that I have covered the mindset required for LLM app development, I'll use Instruments to debug and inspect the brainstorming feature I'm developing for my Craft app.
I'm working on a crafting companion app where you can keep a journal of your craft projects.
The app lets you record craft progress, ask questions about specific crafts, and generate tutorials. Recently, I had an idea for an interactive brainstorming feature that gives people suggestions on what to craft. The crafter can speak with the model to refine its ideas and when they're ready to commit, the app generates a detailed tutorial for that craft.
This feature uses two sets of instructions: one for brainstorming ideas, and a second for tutorial generation. The brainstorming instructions include two tools: a GenerateCraftIdeaTool and a SwitchToTutorialModeTool. Both sets of instructions use the server model on Private Cloud Compute, one for quick idea generation and the other to generate more detailed tutorials. Let's see this in action with Instruments.
The project is already open in Xcode. To begin profiling, I'll open the Product menu and select Profile. Xcode will build the app locally. From the template chooser, I'll select the Foundation Models template and click Record. This instrument captures prompt and response data from your device, which can include sensitive information. Logging is off in production but it's on for the duration of your trace so keep your trace files somewhere safe. Select "Record Anyway" to get started.
Now that the app has launched, let's give it a try. As soon as we land here, the model suggests a few project ideas: Yarn PomPom, Fabric Pouch, and Paper Butterfly. Paper Butterfly sounds fun - let's go with that.
Hm. That's not right. The model was supposed to kick off a tutorial but instead it just offered more ideas. Something's off. Let's end the recording and dig into the trace to find out what happened.
Instruments shows a lot at once, so let's walk through it together. The top section holds the tracks. Tracks show activity on the timeline, and each track can contain multiple lanes with charts that show levels or regions.
Below the timeline is the detail view. It shows summary information about the range you're currently inspecting.
If you click a bar in the timeline or a row in the detail view, the inspector opens up on the right giving you a closer look at what you've selected.
The Foundation Models Instrument has 6 lanes in the timeline. These give you a quick overview of session structure and latencies. Alongside the timeline, there's a tree detail view. That's where you can really dig into the model's chain of thought.
The Instructions lane shows how long a given set of instructions and tools was active. One set can cover multiple requests. Looking at this lane, it's clear only one set of instructions was active for the entire session but the feature was supposed to use two, so something went wrong during the handoff.
The Model Inference lane has two types of bars: yellow and orange.
Yellow bars represent how long the system spent processing the input prompt.
Orange bars represent how long it took to generate the response.
The timeline gives you a quick overview but the real power is in the tree view. It takes everything logged during this recording and organizes it into a hierarchy: sessions, requests, model inferences, instructions, prompts, and responses. Let's use it to track down why the instruction set never changed.
Session 1 had two requests. The first one was kicked off by the prompt starting with "Please generate 3 craft ideas." That request was made up of two model inferences and a few tool calls. Every model inference should have instructions, a prompt, and either a response or an error. Click any node in the tree to pull it up in the inspector.
The model inference detail shows a summary of the instructions, prompt, and response that made up this call.
Scroll down and you'll find duration visualizations and token usage metrics. We'll come back to those later when we talk about optimizing for reliability and performance.
Getting back to the failure, the timeline already told us the instruction set never changed, and here in the inspector for this model inference node, I can see the prompt tied to those instructions. Let's select the Instructions node to see how they're set up.
The inspector shows that this instruction only had one tool associated with it. The prompt references the switchToTutorialMode tool but that tool isn't actually configured with this instruction.
Without it, the app has no way to switch from brainstorm mode to tutorial mode, so the crafter gets stuck in a loop.
Looking at the subsequent nodes in the tree, this was a silent failure. The model kept accepting input and making tool calls but never threw an error.
There was no clear signal that anything had gone wrong. That makes it a hard bug to catch. Now that the root cause is clear, I'll jump into Xcode to fix it. Based on what I found in Instruments, I'll look at the BrainstormDynamicInstructions definition. In the Instructions block, the SwitchToTutorialMode tool is mentioned in the prompt but only the GenerateCraftIdeasTool is listed in the toolset, so let's add it.
Now, I'll recompile and re-run with Instruments to make sure the fix actually worked.
Back in the app, I'll head to the Ideas tab, and just like before, the model suggests some new crafts. I'll go with... necklace.
And there it is. The UI has switched to tutorial mode. The model made the transition and generated a full tutorial for this craft. Now let's jump back into Instruments and take a look at this new recording to make sure everything ran efficiently.
The Instructions lane now shows two distinct instructions active during this experience.
The first is a brainstorming instruction and the second is a tutorial generation instruction.
That lines up exactly with the brainstorm experience design we covered earlier. Let's dig into the tree view to see how that transition actually happened.
The first set of instructions now includes both the generateCraftIdea and switchToTutorialMode tools. That confirms the model had everything it needed to make the switch. The fix worked. The instruction change happened after the second model inference of Request 2.
That inference resulted in a tool call to switchToTutorialMode, passing the selected craft as an argument.
And in the following request, the instructions correctly switched over to the tutorial generator, with the selected craft passed along as context.
The info column is a great way to quickly flag nodes worth a closer look: things like errors, long durations, and large token counts. Request 1's first model inference took a bit longer than I was expecting, so let's take a look.
The metrics and duration sections break down token usage for this inference. These numbers are your starting point for understanding and improving the efficiency of an experience.
You can measure performance using three key metrics. Time to First Token measures how long it takes for the model to begin generating a response after receiving a prompt. A high Time to First Token means people are staring at a blank screen. To reduce it, shorten your prompt.
Tokens per Second measures overall generation speed of the response. Use it to benchmark performance across different prompt configurations and catch regressions after changes.
Total Latency is the complete time from sending the request to receiving the final response. This is the number people feel most directly. To reduce perceived Total Latency, utilize streaming to surface partial results sooner.
Running a trace is where optimization starts. These metrics tell you exactly where time and resources are going and point you toward the right fix. Use the model inference node to get a clear picture of your token usage.
In this session, I showed you how to use Instruments to debug an agentic experience developed with the Foundation Models framework. Once you've ironed out the bugs, the next thing to explore is evaluation. Watch "Meet the Evaluations framework" to see how you can measure and improve the quality of your prompts by using structured evaluation.
To get started with the improved Foundation Models Instrument, install Xcode 27. Then, on the device you'd like to run and profile your app on, update to the latest OS releases. Its important to note that this Instrument supports using any model you use with the Foundation Models framework.
The Foundation Models APIs are your starting point. Experiment, build, and see what's possible. When something isn't working as expected, the Foundation Models Instrument is there to help you debug, giving you direct visibility into framework behavior right in context. Go further with related sessions on agentic app experiences and the Evaluations framework and explore the full documentation to unlock everything the framework can do. Thank you for joining us! We're excited to see you develop and debug your intelligent experiences using the improved Foundation Models Instrument.
- 0:00 - Introduction
- Overview of how the Foundation Models Instruments template helps debug and profile agentic app experiences built with the Foundation Models framework, including Dynamic Instructions and tool call loops.
- 1:57 - LLM app development mindset
- The three challenges unique to LLM app development: probabilistic output (non-deterministic responses that break standard unit testing), model-to-model communication (coordinating data flow across multiple models), and observability (knowing where things went wrong in a multi-model pipeline).
- 3:59 - Inspect and diagnose an agentic experience
- Introduction to the craft companion demo app — a journaling app with an interactive brainstorming feature that uses two sets of Dynamic Instructions: one for idea generation and one for tutorial creation, both backed by the server model on Private Cloud Compute.
- 5:02 - Recording a trace with Instruments
- How to start profiling with the Foundation Models template in Instruments — selecting the template, recording a session, and an important note about sensitive prompt data in trace files.
- 6:04 - Navigating the Instruments UI
- A walkthrough of the Foundation Models instrument layout: tracks and lanes on the timeline (including the instructions lane and model inference lane with yellow/orange bars), the detail view, and the inspector — and how to use the tree view to inspect sessions, requests, inferences, and tool calls.
- 12:07 - Performance metrics
- How to measure and optimize LLM experience performance using three key metrics: time-to-first-token (reduce by shortening prompts), tokens-per-second (benchmark across configurations), and total latency (reduce perceived wait with streaming).
- 13:04 - Next steps
- Summary of what was covered, requirements to get started (Xcode 27 and latest OS), and pointers to related sessions on the Evaluations Framework and Agentic App Experiences.

Explore Get Started

Stay Updated

Explore Platforms

Featured

Explore Technologies

Featured

Explore Community

Featured

Explore Documentation

Release Notes

Explore Downloads

Featured

Explore Support

Featured

Quick Links

Chapters

Resources

Related Videos

WWDC26