Create robust evaluations for agentic apps

Learn how to leverage advanced features of the Evaluations framework to build robust evaluations for your app. Explore evaluating flows with tool calling and dynamic conditions, and how to define what correct behavior means for your use case. Discover how to generate synthetic data, use judges effectively, and validate your datasets for reliable results.

Chapters
- 0:00 - Introduction
- 2:21 - The dataset problem in BookTracker
- 3:46 - Generating synthetic data with makeSamples
- 6:27 - Customizing generation with SampleGenerator
- 8:38 - Sampling strategies
- 10:11 - Validating synthetic samples
- 13:04 - Comparing evaluation results
- 15:09 - Tool calling and tool evaluations
- 18:54 - Trajectory expectations
- 21:26 - Building a tool call evaluation
- 22:02 - Synthetic data for tool evaluations
- 23:49 - Next steps
Hi, I'm Ada! And I'm Kyle! And we're engineers on the Evaluations team! Today, we are so excited to walk you through some of the advanced features in the Evaluations framework! The Evaluations framework introduces a way to assess intelligence-powered features in Swift apps, track improvements over time, and ensure quality in production.
This framework is new in Xcode 27 and supports macOS, iOS, watchOS and visionOS.
If you haven't already, check out the "Meet the Evaluations framework" video to learn about the building blocks of the Evaluations framework and our other video "Improve your prompts by hill climbing with Evaluations" to explore different strategies to improve your intelligence features.
In this video, we'll discuss how to address complexity and scalability with your evaluations. We'll begin by exploring how to grow your evaluation dataset by generating and validating synthetic data. And then we'll cover how to build robust evaluations for agentic workflows that incorporate a special kind of model behavior known as tool calling.
In the "Meet the Evaluations framework" video, we introduced the hill-climbing process. This illustrates the process of how we build, test and ship intelligent features.
In this video, we will primarily focus on the Develop and Evaluate steps. In the Develop step, we often start with a handful of samples for our evaluations, but your feature is almost always more complex than your dataset can cover. A hand-written dataset takes time to build, is hard to scale, and rarely captures the variety you need to truly understand how your feature behaves in the real world.
The quality of your evaluation results is only as good as the data behind them. And writing good evaluation data is hard. That's where synthetic data comes in. The Evaluations framework exposes APIs that let you define sample generation entirely in code, so you can build your own generation pipeline, run it from the command line, or plug it directly into your existing workflows. It supports text-based data and leverages the generable macro to generate structured synthetic data.
My colleagues and I have been working on BookTracker which is a personal library app that uses intelligence-backed features to auto-tag books based on written reviews. Let's examine how we define each book.
We have a class named Book that includes the title, author, review, tags, and rating. We define other variables used to support the cover design. We also define sampleBooks, which is an array of 13 Book samples, like this one here about Pride and Prejudice. These 13 samples might feel like a reasonable starting point, but this small dataset only gives us a narrow window into how our feature performs. Our evaluation results could look great and still be completely misleading. Think about the variety of possible data used to evaluate our tag generation feature! There are countless books, hundreds of genres, and a wide variety of ways a user might review what they just read. We're also talking about the real world, where summaries can be vague or incomplete. Thirteen samples can't capture all of that. We need more coverage, and we need it without spending days writing examples by hand!
So let's discuss how to expand our dataset to capture more of that variety. We'll start simple. The makeSamples API requires three components: a prompt, a dataset, and a target count, which is the size of the full dataset you'd like to end up with, including the samples you provide.
Here, I've defined a prompt that asks the model to suggest more diverse book review samples. To write a well-defined prompt, consider what information the model may need in order to best understand the task and handle the range of inputs your users might provide. For our dataset, I'm passing in our sampleBooks which includes our 13 initial samples.
Here we leverage the new ModelSample API, which includes the book's review as the prompt and the book's tags as the expected output.
And for the target count, I've set it to one hundred samples to start! Remember, the targetCount is the size of the full resulting dataset, including the samples we started with, so the model will actually generate 87 new ones. Now you might be wondering, how much data is enough? And the answer is, it depends. For our BookTracker app, a target count of one hundred is just the starting point! Synthetic data generation is often an iterative process: define an initial dataset, generate synthetic data, validate the samples, analyze whether the data is representative enough, and repeat this cycle until you are confident! So the right target count for your evaluation dataset depends entirely on your feature: what it does, who uses it, and how many different ways people interact with it.
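The target-count arithmetic above can be sketched in plain Swift. This is a stand-alone illustration, not the framework's API: `newSampleCount` is a hypothetical helper, and clamping at zero when the seeds already meet the target is an assumption about sensible behavior rather than something the session states.

```swift
// Stand-alone sketch; `newSampleCount` is a hypothetical helper,
// not part of the Evaluations framework.
func newSampleCount(targetCount: Int, seedCount: Int) -> Int {
    // The target is the size of the full dataset, so seeds count toward it.
    // Assumption: if the seeds already meet the target, nothing new is needed.
    max(0, targetCount - seedCount)
}

let toGenerate = newSampleCount(targetCount: 100, seedCount: 13)
print(toGenerate) // 87
```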
What matters far more than quantity is coverage! So instead of asking "how many samples do I need?", ask yourself, "have I covered the meaningful variety of ways this feature will actually be used?" Now that I've defined the required variables, I can use the makeSamples method, which returns an async stream of newly generated samples.
As I iterate over it, each new sample gets appended to a variable called expandedDataset that I've initialized with the starting dataset. By default, the framework uses the on-device model for generation. The on-device model is a great option in most cases, but you might want to bring your own model, or customize the instructions the model operates under.
The framework provides the flexibility to define your own configurations for sample generation. Let's go over how to do that! For more complex configurations beyond the prompt, dataset, and target count, the framework provides the SampleGenerator, which gives you full control over the generation process. Let's go over some of these configurations! The sessionProvider is a closure that returns a LanguageModelSession. This is where you control which model drives generation and what system-level instructions frame the task.
For our synthetic data generation, I'll use the PrivateCloudComputeLanguageModel since the context size is larger and then I'll add custom instructions to focus generation on specific books, genres and moods.
I also specify a list of rules that the generated samples are expected to follow. I'll go over these later. Just a note about how the session is used: the framework automatically handles the batch size, which is the number of samples processed at a time during generation.
The generator calls your sessionProvider once at the start of a run and then reuses that session across batches, which helps the model maintain context as generation progresses.
But a session has a limit for how large it can grow. If you're making a lot of requests, giving it a large prompt, or getting large outputs, you can exhaust the session's context window mid-run, which will throw an error. In that case, the generator calls sessionProvider again to get a fresh session and continue generation, but the new session won't contain context from the previous one. So make sure the instructions in your sessionProvider are self-contained and don't assume it'll only be called once.
To learn more ways to mitigate context size limits, watch the video "Build agentic app experiences with Foundation Models".
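The reuse-then-replace behavior described above can be pictured with a small simulation. Everything here is a hypothetical stand-in (`FakeSession`, `ContextWindowExceeded`, `runBatches`, and the token costs); it does not use LanguageModelSession or SampleGenerator, and it only illustrates why the provider closure must be safe to call more than once.

```swift
// Hypothetical stand-ins; none of these are framework types.
struct ContextWindowExceeded: Error {}

final class FakeSession {
    let instructions: String
    private let limit: Int
    private var usedTokens = 0

    init(instructions: String, limit: Int) {
        self.instructions = instructions
        self.limit = limit
    }

    // Each batch consumes part of the session's (simulated) context window.
    func generateBatch(cost: Int) throws {
        guard usedTokens + cost <= limit else { throw ContextWindowExceeded() }
        usedTokens += cost
    }
}

func runBatches(
    count: Int,
    cost: Int,
    sessionProvider: () -> FakeSession
) -> (batches: Int, sessionsCreated: Int) {
    var session = sessionProvider()   // called once at the start of the run
    var sessionsCreated = 1
    var completed = 0
    var justReplaced = false

    while completed < count {
        do {
            try session.generateBatch(cost: cost)
            completed += 1
            justReplaced = false
        } catch {
            // Stop if even a brand-new session cannot fit a single batch.
            if justReplaced { break }
            // Otherwise ask the provider for a fresh session. It carries no
            // context from the old one, so its instructions must stand alone.
            session = sessionProvider()
            sessionsCreated += 1
            justReplaced = true
        }
    }
    return (completed, sessionsCreated)
}

let result = runBatches(count: 5, cost: 40) {
    FakeSession(instructions: "Self-contained rules go here.", limit: 100)
}
print(result.batches, result.sessionsCreated) // 5 3
```

With a simulated limit of 100 and a cost of 40 per batch, each session fits two batches, so five batches need three sessions.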
Along with the custom session provider, you can also use the SampleGenerator to customize samplingStrategy, which controls how the generator selects examples from your initial dataset to show the model as in-context examples. There are two types of sampling strategies you can specify. The first one is random sampling.
This strategy selects a random subset of your initial samples as examples to show the model, making sure there are no duplicates. This keeps the output varied without requiring us to think carefully about the order of our initial samples. The second type of sampling strategy you can use is sliding window.
This strategy steps through your initial samples sequentially, skipping duplicates as it goes. If your dataset has meaningful order, consider using this sliding window strategy.
For our generator, we'll use the random strategy because our initial samples are not meaningfully ordered. And since it's the default strategy we don't need to explicitly define it here.
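To make the difference between the two strategies concrete, here is a plain-Swift sketch. The helper names `randomExamples` and `slidingWindows` are hypothetical and this is not how the framework implements samplingStrategy; in particular, treating the sliding window as consecutive, non-overlapping chunks is an assumption.

```swift
// Hypothetical helpers; not the framework's samplingStrategy implementation.

// Random strategy: pick `count` seeds in random order.
// shuffled() visits each position once, so no seed is picked twice.
func randomExamples<T>(from seeds: [T], count: Int) -> [T] {
    Array(seeds.shuffled().prefix(count))
}

// Sliding-window strategy (assumed form): walk the seeds in their original
// order, handing out consecutive chunks of `size` items.
func slidingWindows<T>(over seeds: [T], size: Int) -> [[T]] {
    guard size > 0 else { return [] }
    return stride(from: 0, to: seeds.count, by: size).map { start in
        Array(seeds[start..<min(start + size, seeds.count)])
    }
}

let seeds = [1, 2, 3, 4, 5]
print(slidingWindows(over: seeds, size: 2)) // [[1, 2], [3, 4], [5]]
print(randomExamples(from: seeds, count: 3).count) // 3
```

The sliding window preserves order, which is why it suits datasets where the sequence of seeds carries meaning.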
So, now that we've configured the generator with our custom sessionProvider, we can call the .run function, which returns a stream of newly synthesized samples.
As we iterate through each one, it gets added to our expandedDataset defined earlier.
Now that we've set up our configuration, let's explore how we can ensure our synthetic data looks the way we expect.
That's where the validator closure comes in handy. The validator lets you define your own logic to accept or reject every generated sample. We've already defined a set of rules in the instructions in the session provider earlier, but that doesn't guarantee the output will actually follow the rules. Let's review them.
The first rule we defined is that the review must be at least 100 characters long. Each review should also cover a wide range of genres, moods, and tones. And the reviews need to vary in length.
The model should also generate between 3 and 8 book tags. And tags must be lowercase. To decide what to validate, we need to consider which of these rules we can check systematically. Also, the validation closure checks each generated sample in isolation and has no access to the other samples. Reviewing these rules, I can tell that the diversity of reviews will require more judgement than a simple validation check, and the variation in review length requires assessing across all samples.
For the other rules, we can assess them systematically using the validation closure.
For the first rule, we can define a review length validation. Let's take a classic book we all know, "Frankenstein" by Mary Shelley, for example. We can check if the generated sample defines a review with a length of at least 100 characters. The model also generates tags for each review. This means we can validate that there are between 3 and 8 tags.
And lastly, we can check if the tags are all lowercase.
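The rule that reviews should vary in length cannot be checked inside a per-sample validator, because it is a property of the whole dataset. One way to check it after generation is sketched below in plain Swift. `GeneratedReview`, `reviewLengthsVary`, and the `minimumSpread` threshold are all hypothetical; the session does not say how, or whether, the framework measures this.

```swift
// Hypothetical stand-in for a generated sample; not a framework type.
struct GeneratedReview {
    let review: String
    let tags: [String]
}

// Dataset-level check: the shortest and longest reviews must differ by at
// least `minimumSpread` characters. The threshold is an arbitrary assumption.
func reviewLengthsVary(_ samples: [GeneratedReview], minimumSpread: Int) -> Bool {
    let lengths = samples.map { $0.review.count }
    guard let shortest = lengths.min(), let longest = lengths.max() else {
        return false // an empty dataset has no variety to measure
    }
    return longest - shortest >= minimumSpread
}

let uniform = [
    GeneratedReview(review: String(repeating: "a", count: 120), tags: ["gothic"]),
    GeneratedReview(review: String(repeating: "b", count: 125), tags: ["classic"]),
]
let varied = uniform + [
    GeneratedReview(review: String(repeating: "c", count: 600), tags: ["horror"]),
]

print(reviewLengthsVary(uniform, minimumSpread: 200)) // false
print(reviewLengthsVary(varied, minimumSpread: 200))  // true
```

A check like this runs once over the finished dataset, complementing the validator that runs on each sample in isolation.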
Here I've already defined these 3 validation metrics in the SampleGenerator to check that the samples meet our expected structure. So where do the results end up? Well, as generation progresses, valid samples are collected in the samples property on the SampleGenerator. Any sample that fails these validators gets set aside automatically in invalidSamples. Both are updated in real time throughout the run, so you can access them at any point, either during iteration to check progress or after the loop completes. You can then use these results directly in your app or save the dataset locally.
Now let's review our evaluation with the 13 initial samples. In Xcode 27, we introduced a new Evaluations Report to visualize your results. This is the BookTaggingEvaluation with the 13 initial samples.
As you can see we got pretty high scores for tag quality evaluating both relevance and usefulness.
I went ahead and ran the evaluation with our new dataset of 100 samples.
Now, we can compare the two evaluations using the Compare button, and we're expecting the scores to drop! And we were correct! The quality scores have dropped. Our tag generation feature looked like it was performing well earlier because we weren't testing it with a comprehensive dataset. When we run our evaluation on a larger dataset, a drop in scores could signal many different things. Consider what this signal could suggest. Score changes could be due to problems with our prompt or instructions. You could refine one or both to better capture your needs.
You could also consider gaps in your intelligence feature. Or you may want to adjust your evaluation to understand what you are actually evaluating on.
And lastly, your dataset may still not be representative enough and need to capture more variation. You can continue to increase the dataset or include more edge cases using the synthetic data APIs. These are the core ways to further improve your results.
Now that we have a solid approach for building a robust evaluation dataset using synthetic data, I want to take it one step further.
So far we've been evaluating our book tagging feature, but what happens when our app becomes more complex and needs to take multiple actions to complete a task like search? That's where tool calling comes in. I'll hand it over to Kyle to show how that works!
Thanks Ada! Now let's keep our evaluation-driven development going and cover tool evaluations.
So far, we've been evaluating what the model generates — for our feature that's tags for books.
But intelligence features often take many behind-the-scenes steps to create their output. They perform multiple actions in your app that each contribute to the results.
Tools add structure to model workflows when they're completing a task for people using your app.
You use them to operate on real data that people use daily.
They can operate using any custom business logic you define.
They can call functionality a user can invoke directly or entirely new logic for your intelligence feature, or a combo of both.
Here's the thing. A model might give you a reasonable-sounding answer without ever calling the right tool.
The final output can look correct while the path to get there isn't right. So let's talk about those challenges and how tool evaluations can help you handle them. First, instruction following: you need to tell a model how to use each tool, and the attention you pay to the details matters.
Try following the instructions word-by-word yourself to see if you miss a step.
Then there's tool complexity: tools can accept simple instructions or require fine-tuning parameter ranges.
Then there are edge cases. A tool might seem to work well on common inputs, but behave surprisingly on the rare ones.
That's why we need tool evaluations. They let you verify the how, not just the what.
The model should call the correct tools, with the correct arguments in the order you expect.
And along the way, you'll double check that there weren't any unexpected tool calls in the middle.
Let's take a look at this in practice and build our first tool evaluation. In the BookTracker app, we've added a library assistant. A user can search for a book and instead of just filtering books based on the title and other strings, the model uses our app's custom tools to find relevant books.
There's a searchBooks tool to find books that might have similar tags. Then there's a getBookDetails tool to extract book metadata, like publication date from the searches.
Then there's the findSimilarBooks tool that performs a semantic search for similar books, so we're chaining together multiple steps, each one a tool call. Here's SearchBooksTool.
It conforms to the Tool protocol, it has a name the model sees and a description that tells it when this tool is useful.
The arguments are a Generable struct. Notice these are all optional, the model decides which filters to use based on what the user asked for.
If you prompt a model with find gothic books, we'd expect it to populate the tag argument.
If you prompt a model with show me something cheerful, we'd expect to generate a mood search. These are exactly the kinds of decisions we want to evaluate.
OK, so that's a refresher on the tools. Now let's write our first tool evaluation and see how they perform. The main component of a tool evaluation is a trajectory expectation. A session transcript has tool calls among the prompts and responses. A trajectory expectation checks the order and kind of each tool call in a language model session. You can think of a trajectory expectation check like going over the list of decisions you made when planning a route. Cars, bikes, and buses are all tools that have their time and place in getting somewhere, but you can evaluate their utility for each segment in a specific trip.
The expectation looks for all of the tool calls.
Then, for each one, it checks the call against the expectations you write into your evaluations.
Here's a simple case in code form. Our prompt is "Find books tagged gothic". We expect one tool call "searchBooks".
This is a TrajectoryExpectation. It describes the tool calls we expect to see in the model's transcript. The unordered here means we don't care when this tool call happens, just that it happens.
We can further refine this by adding arguments to the expectation.
Here I'm adding an argument to expect the tag "gothic". An exact match isn't always what you want.
If the prompt is "Find something cheerful", the model might pass uplifting, happy, cheerful — any of those are fine.
The .naturalLanguage matcher checks whether the value matches the intent, not the exact string.
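The gap between exact matching and more tolerant matching can be pictured without a model. The sketch below defines a hypothetical `ArgumentMatcher` enum in plain Swift; the case names echo matchers mentioned in the session, but the type, its signatures, and its behavior are assumptions and not the framework's implementation. Intent-based matching such as .naturalLanguage presumably needs a model's judgement, so it is not reproduced here.

```swift
import Foundation

// Hypothetical matcher type for illustration; not a framework type.
enum ArgumentMatcher {
    case exact(String)       // value must equal the expected string
    case oneOf([String])     // value must be one of several accepted strings
    case contains(String)    // value must contain the given fragment

    func matches(_ value: String) -> Bool {
        switch self {
        case .exact(let expected):
            return value == expected
        case .oneOf(let options):
            return options.contains(value)
        case .contains(let fragment):
            return value.contains(fragment)
        }
    }
}

// "Find something cheerful": several mood values are acceptable.
let moodMatcher = ArgumentMatcher.oneOf(["uplifting", "happy", "cheerful"])
print(moodMatcher.matches("happy"))                       // true
print(ArgumentMatcher.exact("cheerful").matches("happy")) // false
```

A fixed list only accepts values you thought of in advance, which is the limitation an intent-based matcher is meant to address.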
And there's a whole set of matchers for different situations — contains, oneOf, pattern, range, and more. Check out the developer documentation for more information. For multistep tasks, order matters.
Here the model must first call "searchBooks", then call "getBookDetails".
If an agent tries to get details first, it doesn't have a bookId yet — that's a bug.
Trajectory expectations catch it because we're checking the journey, not just the destination.
Sometimes what an agent shouldn't do is just as important.
If a prompt includes ideas like don't look for similar books, the model should follow instructions.
The disallowed parameter specifies tools that must not appear in the transcript. If an agent calls "findSimilarBooks" anyway — that's a failure. Here's where all of the trajectory expectations come together in the full evaluation.
We define a dataset of samples, each with a prompt and a trajectory expectation and use ToolCallEvaluator to score them.
The ToolCallEvaluator combines a LanguageModelSession with the tools, gets a response, and captures the structured transcript.
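The checks a trajectory expectation performs (ordered, unordered, and disallowed tool calls) and the two summary metrics can be approximated in plain Swift. `TrajectoryRule` and `summarize` are hypothetical names, and the matching semantics are assumptions: this sketch treats ordered expectations as a subsequence of the recorded calls and therefore tolerates extra calls in between, which may be more lenient than the framework.

```swift
// Hypothetical types for illustration; not the framework's TrajectoryExpectation
// or ToolCallEvaluator.
struct TrajectoryRule {
    var ordered: [String] = []
    var unordered: [String] = []
    var disallowed: [String] = []

    // `calls` is the list of tool names recorded in a session transcript.
    func passes(calls: [String]) -> Bool {
        // Disallowed tools must never appear.
        if calls.contains(where: { disallowed.contains($0) }) { return false }
        // Unordered tools must appear somewhere, in any position.
        if !unordered.allSatisfy({ calls.contains($0) }) { return false }
        // Ordered tools must appear in sequence (as a subsequence of `calls`).
        var remaining = ordered[...]
        for call in calls where call == remaining.first {
            remaining = remaining.dropFirst()
        }
        return remaining.isEmpty
    }
}

// Summary metrics in the spirit of "All Passed" and "Percentage Passed".
func summarize(_ results: [Bool]) -> (allPassed: Bool, percentagePassed: Double) {
    guard !results.isEmpty else { return (false, 0) }
    let passed = results.filter { $0 }.count
    return (passed == results.count, Double(passed) / Double(results.count) * 100)
}

let detailsRule = TrajectoryRule(ordered: ["searchBooks", "getBookDetails"])
let noSimilarRule = TrajectoryRule(unordered: ["searchBooks"],
                                   disallowed: ["findSimilarBooks"])

let results = [
    detailsRule.passes(calls: ["searchBooks", "getBookDetails"]),    // right order
    detailsRule.passes(calls: ["getBookDetails", "searchBooks"]),    // wrong order
    noSimilarRule.passes(calls: ["searchBooks"]),                    // allowed only
    noSimilarRule.passes(calls: ["searchBooks", "findSimilarBooks"]) // disallowed
]
let summary = summarize(results)
print(summary.allPassed, summary.percentagePassed) // false 50.0
```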
Tool call evaluation results show up in the Xcode assistant alongside the rest of your results, and you can get the whole picture of how your intelligence-based feature behaves. But wait! We can also use the Evaluations APIs to generate synthetic data for your tool evaluations!
Oh yes, let's do that! Trajectory expectations are generable too. Expanding a dataset for your tool evaluations can be quite complex, and with the Evaluations framework we've made it a lot easier to do just that! Since our tool call evaluation leverages ModelSample and TrajectoryExpectation, which are both generable, we can synthetically generate more samples using SampleGenerator like before.
I went ahead and defined a prompt and custom instructions for the sessionProvider. Keep in mind when creating synthetic data for tool evaluations that the model doesn't know what tools you've defined or what order the tools need to be called in.
So here I've specified the available tools, explaining their purpose, any order expectations, and other context the model might need. Then we can define the SampleGenerator and use our existing dataset as our initial samples, with a targetCount of 100.
We can also specify validation metrics here as well! Here I've made sure there's always an expectation, and I've also made sure the synthetic samples include at least one tool. And lastly, I check that any tools called are actual tools we've already defined.
And that's how you can generate and validate synthetic samples for your tool evaluations! The synthetic data APIs are a powerful way to expand your existing dataset beyond what you could write by hand! And the more representative your data, the more your scores reflect reality. Alright Kyle, back to you!
This is where it all comes together. Earlier we built the book tagging evaluation, which checks what the model produces: tag count, genre coverage, quality scores.
Now we have tool evaluations — they check how the model gets there. The right tools, right arguments and right order.
Run both in the same evaluation suite and you'll have built end-to-end confidence in your feature. Now that we've covered some ways to make your evaluations even more robust, you can start applying these ideas to your apps and evaluation datasets.
To get started, try making your own synthetic data, evaluate the custom tools in your app and check out the sample app and other articles in the developer documentation.
Wow Ada, we've covered a lot today! Yeah, we definitely did! But the real plot twist is what you build with it. No spoilers though! And we hope you enjoyed learning about the Evaluations framework!

5:16 - Generate synthetic data with makeSamples

// Synthetic data
  let prompt = Prompt("""
      Generate a diverse range of book reviews and corresponding tags.
      Cover a wide range of genres, time periods, cultures, and
      reader personas. Do not repeat books already in the dataset.
      """)
  
  let dataset = Book.sampleBooks.map { book in
      ModelSample(prompt: book.review, expected: BookTags(tags: book.tags))
  }
  
  let targetCount = 100
  var expandedDataset = dataset

  for try await sample in dataset.makeSamples(prompt, targetCount: targetCount) {
      expandedDataset.append(sample)
      print("Generated \(expandedDataset.count) samples so far.")
  }

5:53 - Configure a custom SampleGenerator

// Define your own configuration
  let generator = SampleGenerator<ModelSample<BookTags>>(
      prompt,
      samples: dataset,
      targetCount: targetCount,
      sessionProvider: {
          LanguageModelSession( 
              model: PrivateCloudComputeLanguageModel(),
              instructions: """
                  You are a synthetic data generator for a book-tracking app's evaluation suite.
                  Your job is to produce realistic, diverse book entries that will stress-test
                  a tagging system.

                  Rules:
                  - Review must be at least 100 characters long.
                  - Review should cover a mix of genre, mood/tone, and themes.
                  - Reviews should vary in length.
                  - Create between 3 and 8 tags.
                  - Tags must be lowercase.
                  """ 
          )
      }
  )

10:37 - Validate generated samples

// Define validation metrics
  validator: { sample in
      guard let book = sample.expected else { return false }

      // Review must be at least 100 characters
      guard sample.promptDescription.count >= 100 else { return false }

      // Must have between 3 and 8 tags
      guard (3...8).contains(book.tags.count) else { return false }

      // All tags must be lowercase
      guard book.tags.allSatisfy({ $0 == $0.lowercased() }) else { return false }

      return true
  }

10:58 - Access valid and invalid results

// Accessing results
  for try await sample in generator.run() {
      // During iteration
      expandedDataset.append(sample)
  }

  // After iteration
  let allSamples = await generator.samples
  let invalidSamples = await generator.invalidSamples
  
  print("Generated \(allSamples.count) new samples. Total: \(expandedDataset.count)")

15:30 - Define a tool's Generable argument

@Generable
  struct SearchBooksArguments {
      @Guide(description: "A freeform search term to match against titles, reviews, or tags")
      var query: String?
  
      @Guide(description: "Filter results to books with this specific tag")
      var tag: String?

      @Guide(description: "Filter results by mood")
      var mood: String?

      @Guide(description: "Filter results by genre")
      var genre: String?

      @Guide(description: "Maximum number of results to return. Defaults to 5.")
      var limit: Int? 
  }

16:37 - A basic trajectory expectation

// "Find books tagged gothic"
  TrajectoryExpectation(
      unordered: [
          ToolExpectation(
              "searchBooks",
              arguments: [
                  .exact(argumentName: "tag", value: .string("gothic"))
              ]
          )
      ]
  )

17:07 - Match arguments by intent (naturalLanguage)

// "Find something cheerful"
  TrajectoryExpectation(
      unordered: [
          ToolExpectation(
              "searchBooks",
              arguments: [
                  .naturalLanguage(
                      argumentName: "mood",
                      criteria: "Should relate to uplifting, hopeful, or positive feelings"
                  )
              ]
          )
      ]
  )
  // Other matchers available: .contains, .oneOf, .pattern, .range, and more.

17:34 - Expect tool calls in order

// "Find gothic books and show details on the first"
  TrajectoryExpectation(
      ordered: [
          ToolExpectation(
              "searchBooks",
              arguments: [
                  .exact(argumentName: "tag", value: .string("gothic"))
              ]
          ),
          ToolExpectation(
              "getBookDetails",
              arguments: [
                  .keyOnly(argumentName: "bookId")
              ]
          )
      ]
  )

17:55 - Disallow specific tool calls

// "Show only sci-fi books. Don't look for similar ones."
  TrajectoryExpectation(
      unordered: [
          ToolExpectation(
              "searchBooks",
              arguments: [
                  .naturalLanguage(
                      argumentName: "genre",
                      criteria: "Should refer to science fiction")
              ]
          )
      ],
      disallowed: [
          ToolExpectation("findSimilarBooks")
      ]
  )

18:14 - Build a tool call evaluation

// Tool call evaluations
  let samples = SampleArrayLoader(samples: [
      ModelSample(
          prompt: "Find all the books tagged with 'gothic'.",
          instructions: "Help the user explore their book collection.",
          expectations: TrajectoryExpectation(  )
      )
  ])

  struct BookLibraryToolCallEval: Evaluation {
      var dataset = samples

      let pass = Metric("All Passed")
      let percent = Metric("Percentage Passed")

      var evaluators: Evaluators { 
          ToolCallEvaluator(allPass: pass, percentagePass: percent)
      }
  }

19:20 - Synthesize tool-evaluation samples

// Tool call evaluations
  let prompt = Prompt("""
      Generate diverse user queries for a personal book library assistant.
      Each sample needs a prompt (what the user says), and a trajectory
      expectation describing which tools should be called and in what order.
      """)

  let instructions = """
      AVAILABLE TOOLS:
      - searchBooks(query?, tag?, mood?, genre?, limit?): search the library
      - getBookDetails(bookId): full details for one book
      - findSimilarBooks(bookId, maxResults?): find books sharing tags
      ORDER REQUIREMENTS:
      - searchBooks must come before getBookDetails or findSimilarBooks
      - Use TrajectoryExpectation(ordered:) when sequence matters, else (unordered:)
      USE THESE ARGUMENT MATCHERS:
      - .exact for precise values, .naturalLanguage for fuzzy matching
      - .keyOnly when any value is acceptable, .range for numeric constraints
      - .contains/.hasPrefix/.hasSuffix for partial string matching
      """

19:51 - Validate tool-evaluation samples

// Tool call evaluations
  validator: { sample in
      // Must have expectations defined (unwrap safely instead of force-unwrapping)
      guard let expectations = sample.output.expectations else { return false }

      // Must reference at least one tool
      let totalExpectations = expectations.ordered.count + expectations.unordered.count
      guard totalExpectations > 0 else { return false }

      // All tool names must be from the valid set
      let validTools: Set<String> = ["searchBooks", "getBookDetails", "findSimilarBooks"]
      let allExpectations = expectations.ordered + expectations.unordered + expectations.disallowed
      for expectation in allExpectations {
          guard validTools.contains(expectation.name) else { return false }
      }
  
      return true
  }

  ---

- 0:00 - Introduction
- Ada Wong and Kyle Murray introduce advanced features of the Evaluations framework (new in Xcode 27). Outlines the agenda: growing your dataset with synthetic data, then building robust evaluations for agentic, tool-calling workflows, focused on the develop-and-evaluate step of hill-climbing.
- 2:21 - The dataset problem in BookTracker
- The BookTracker app auto-tags books from reviews, but its 13 hand-written sampleBooks give only a narrow view. Real-world reviews span countless books, genres, lengths, and styles, too much variety to capture by hand.
- 3:46 - Generating synthetic data with makeSamples
- The makeSamples API takes a prompt, a dataset (ModelSample with review to tags), and a target count (the full resulting size, including your seeds). It returns an async stream of new samples; coverage of real usage matters more than raw quantity.
- 6:27 - Customizing generation with SampleGenerator
- For more control, SampleGenerator exposes a sessionProvider closure to pick the model (such as Private Cloud Compute) and instructions. The session is reused across batches but can exhaust its context window mid-run, so make instructions self-contained since the provider may be called again.
- 8:38 - Sampling strategies
- The samplingStrategy controls which seed samples are shown to the model as in-context examples: random (a varied subset, the default) or slidingWindow (sequential, for datasets with meaningful order).
- 10:11 - Validating synthetic samples
- A validator closure accepts or rejects each generated sample in isolation against systematic rules: review length at least 100 characters, 3 to 8 tags, lowercase tags. Valid samples collect in samples, rejects in invalidSamples, both updated in real time.
- 13:04 - Comparing evaluation results
- Using the Xcode 27 Evaluations Report, compare the 13-sample run against the 100-sample run. The quality scores drop because the feature only looked good on the small dataset. A drop can signal issues in the prompt, the feature, the evaluation, or the dataset.
- 15:09 - Tool calling and tool evaluations
- Features often take multiple behind-the-scenes tool calls, and a plausible answer can come from the wrong path. Tool evaluations verify the how: correct tools, correct arguments, correct order, no surprises, illustrated with searchBooks, getBookDetails, and findSimilarBooks.
- 18:54 - Trajectory expectations
- A TrajectoryExpectation checks the kind and order of tool calls in a session transcript. Refine with argument matchers (exact, naturalLanguage, contains, oneOf, pattern, range), plus ordered expectations and a disallowed set for tools that must not be called.
- 21:26 - Building a tool call evaluation
- Bring the trajectory expectations together: a dataset of samples (each a prompt plus expectation) scored by ToolCallEvaluator, which combines a LanguageModelSession with the tools, captures the structured transcript, and reports alongside your other results in Xcode.
- 22:02 - Synthetic data for tool evaluations
- Because ModelSample and TrajectoryExpectation are Generable, you can synthesize tool-evaluation samples too, describing the available tools, order expectations, and context in the session instructions, then validating that each sample has an expectation, at least one tool, and only real tools.
- 23:49 - Next steps
- Run BookTaggingEvaluation (what the model produces) and tool evaluations (how it gets there) in one suite for end-to-end confidence. Next steps: create your own synthetic data, evaluate your app's custom tools, and explore the sample app and documentation.
