Improve your prompts by hill-climbing with Evaluations

Learn comparative evaluation techniques to guide your prompt engineering and select the right model for your app. Explore how to baseline performance, expand your evaluation strategy, and convert results to JSON for integration with other tools. Discover when to apply different prompting strategies and how to iteratively refine prompts for best results.

Chapters
- 0:00 - Introduction
- 2:42 - BookTracker's tagging problem
- 5:27 - Analyzing the evaluation results
- 8:26 - Drift between judge and human
- 9:37 - Measuring drift with Cohen's kappa
- 12:26 - Building a judge alignment evaluation
- 15:16 - Analyzing alignment failures
- 17:16 - Comparative evaluation: control vs experimental
- 19:12 - Refining the scoring dimensions
- 21:23 - Adding few-shot examples to the judge
- 23:38 - Going beyond prompts: adding a tool
- 27:17 - Next steps
Hi! My name is Marcus, a manager on the Evaluations framework team.
I'm excited to show you how to use Evaluations to improve your intelligence-powered feature. As you probably know by now, using AI in your app is a powerful way to provide new levels of personalization to your users. This technology can add a level of depth to your app that was not previously possible with traditional software. However, it's also a challenge to know whether or not your intelligence-powered feature behaves as you'd expect in all cases. To help with this, we're releasing the Evaluations framework to provide you with the tools you need to ship with confidence. Shipping with confidence takes more than just a framework. The Evaluations framework also allows you to hill climb, which is a process of iteratively improving the quality of your feature using the scores of your evaluation as a guide. Hill-climbing starts with development, that is, making some change you want to measure against your existing feature.
Once all your changes are made, you then need to run the evaluation and see whether the results meet your expectations.
From there, you analyze the results to understand how your feature could be further improved.
Leveraging the hill-climbing process is a great way to systematically improve your feature, but effective hill-climbing takes a little bit more than just following the loop. It also takes a little bit of… Science! So, in this video, I'll walk you through how to improve prompts by following the hill-climbing loop, while incorporating some scientific thinking along the way.
Next, I'll walk you through how to conduct comparative Evaluations to make the process of hill-climbing easier. And finally, we'll go beyond changing prompts, by improving other aspects of an intelligence-powered feature.
But, before we go any further, this video is about the process of hill-climbing an existing evaluation. That means you've already written the foundations of an evaluation pipeline, which provides a holistic understanding of the strengths and weaknesses of your intelligence-powered feature.
If you're unfamiliar with how to do that, go check out our other video, "Meet the Evaluations framework". It covers everything you need to know in order to build a great evaluation pipeline.
With that covered, let's get started.
In the "Meet the Evaluations framework" video we introduced you to Book Tracker. In case you forgot, Book Tracker allows readers to catalog and review books.
Recently, I've been reading a lot of classics, so I've added the books to my catalog. In fact, I just finished reading "Treasure Island"! It's such a thought-provoking read that covers the tension between loyalty and betrayal.
One of Book Tracker's new features is a tagging service, which uses a model to generate tags based on the reader's review. While the tags for this review cover the general themes of the book, I feel like something is missing. I would have expected to see tags like "tense" or "morally grey", which speak to the themes of the story. The tags generated for "Little Women" had a similar problem. A tag like poignant has more to do with how the reader felt than with the contents of the book. Emotion in book reviews is great, but it should not make it into the list of tags. Also, a tag like quiet-steadiness, which is pulled directly from the review, isn't going to be a very useful tag when I want to search my library later.
It seems like Book Tracker's tag generator isn't as good as I think it could be.
Fortunately, when my colleagues wrote this feature, they wrote an Evaluation, which measures the tags against a set of criteria.
And here is my Evaluation for Book Tracker's book tags. I'm particularly interested in how we are judging the quality of my tags so I'll scroll down and take a look.
The qualitative aspects of my app are captured by the score dimensions type.
Relevance tracks how well the tags represent information about the book's plot, theme, or other relevant information. Usefulness measures how good the tags are as search terms.
The ModelJudgeEvaluator uses the score dimensions and a prompt to generate a score for each set of tags.
My plan is to add these two books to my Evaluation, and review the tags I get back.
It will also be a good opportunity to see how my ratings stack up against the ratings of my model judge.
Wanting to improve some part of an intelligence-powered feature is how the hill-climbing process starts. So, to start the loop you begin in the development phase. Here, you make any changes you need to your feature as well as your Evaluation.
In this case, I'll add my review of "Treasure Island" to my Evaluation dataset. And I've done the same thing for "Little Women".
With those two entries now in my dataset, I now want to run my Evaluation and see what tags the model produces and how the judge scores them.
But, to get there we need to first ask if the Evaluation we ran met our expectations.
As a reminder, your expectations can be defined using Swift Testing's expect macro.
That way, you can tell if your expectations are met by whether or not your tests pass.
In this case, my Evaluation met all of my expectations; however, because I know the tags aren't as good as I'd like them to be, I need to investigate further.
And that brings us to the analyze phase.
Xcode's new evaluations report gives me in-depth information about my last Evaluation run.
To see more details I can click on my BookTaggingEvaluation run.
This brings up the Evaluation detail view.
On the top are our aggregate metric charts.
And below is our table of results.
What I'd like to do now is compare the response of the model to the list of expected tags I generated before. I can do that by opening up the Assistant Editor.
And now I can see detailed information about the tags generated for each item in my dataset.
I want to focus on the difference between what the model generated and what I expected.
I can review that in detail in this table. The collection of terms isn't bad but it leaves out a number of key details from the story which the user might want to search for.
Because of that, I would have scored these tags a 4 for relevance and a 2 for usefulness.
My model judge also gave the tags a relevance score of 4 which is great, but it also gave usefulness a score of 4, which isn't right.
I should see if I feel the same way about the tags for my review of "Little Women".
The tags don't really contain all the useful information that I'd expect.
And it turns out I disagree with the judge's scores here as well.
Once again, I think relevance should be a 4 and usefulness should be a 2.
Doing this analysis has made it clear to me that there is a discrepancy between how my model judge and I rate tags.
This discrepancy between model and human is known as drift, and it is a problem faced by all developers trying to evaluate intelligent features. Here's why. Say I have an evaluation with 10 samples. I then ask a model judge and a person to rate each sample.
The model and person then give their ratings on a scale from 1 to 4, and at the end we average those scores to build an aggregate.
If the model and the human tend to disagree in their ratings, then their average scores will diverge from one another, hence the name drift.
As your dataset continues to grow, the drift will get wider and wider, at which point it'll be hard for you to know whether or not your feature is being properly evaluated.
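As a rough illustration of drift, the sketch below averages two sets of ratings for the same ten samples and reports the gap between them. The ratings are made-up numbers for this example, not data from Book Tracker, and `mean` is a helper written just for this sketch.

```swift
// Hypothetical ratings for the same 10 samples, on the 1 to 4 scale.
let humanRatings = [4, 2, 3, 2, 4, 1, 3, 2, 4, 2]
let judgeRatings = [4, 4, 4, 3, 4, 3, 4, 4, 4, 3]

/// Arithmetic mean of a list of integer ratings (0 for an empty list).
func mean(_ ratings: [Int]) -> Double {
    guard !ratings.isEmpty else { return 0 }
    return Double(ratings.reduce(0, +)) / Double(ratings.count)
}

let humanMean = mean(humanRatings)
let judgeMean = mean(judgeRatings)

// A positive gap means the judge scores higher than the human on average.
let drift = judgeMean - humanMean
print("human: \(humanMean), judge: \(judgeMean), drift: \(drift)")
```

Here the judge agrees with the human on only a few samples, and its consistently higher scores show up as a gap of about one full point between the two averages.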
To help with this, you can align your judge to a person's expert opinion. Now that we know drift is a problem, we need a way to know how much our model judge has drifted from our expert ratings.
One way to accomplish this would be to line up the ratings of the expert and the judge and mark where the two match. You can then use this to generate a percentage. This percentage is called accuracy, and it is a great way to measure alignment if every value in your scoring scale is equally likely to appear. However, it's more likely that the scores in your dataset will be unevenly distributed. Think about it: datasets often contain examples of high-quality output.
Therefore it is often the case that a human rater is likely to rate items in the dataset with higher scores.
If a model then happens to judge your smaller dataset with high scores, it may seem like the two are aligned. But then when unleashed on a larger dataset with more variation in scores, its tendency to score high will still result in drift.
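To make the weakness of accuracy concrete, here is a minimal sketch with hypothetical data: the expert rates most samples a 4, and a judge that answers 4 for everything still agrees on most of them. The `accuracy` helper is written for this example and is not part of the Evaluations framework.

```swift
/// Fraction of samples on which two raters gave the same score.
/// Returns nil when the inputs are empty or differ in length.
func accuracy(_ first: [Int], _ second: [Int]) -> Double? {
    guard first.count == second.count, !first.isEmpty else { return nil }
    let matches = zip(first, second).filter { pair in pair.0 == pair.1 }.count
    return Double(matches) / Double(first.count)
}

// Hypothetical skewed dataset: the expert rates most samples a 4.
let expert = [4, 4, 4, 4, 4, 4, 4, 4, 2, 1]

// A judge that always answers 4, regardless of the tags it is shown.
let alwaysFourJudge = Array(repeating: 4, count: expert.count)

// 8 of the 10 ratings match, even though the judge never discriminates.
let agreement = accuracy(expert, alwaysFourJudge)
```

The high agreement here says more about the skew of the dataset than about the judge, which is the gap a chance-corrected measure is meant to close.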
So we need an alternative to accuracy, one that accounts for the weighted nature of our dataset and the chance that the model might guess the right answer. Fortunately there is a solution! Cohen's kappa coefficient is a mathematical formula made popular by statistician and psychologist Jacob Cohen in 1960.
Cohen's kappa measures alignment, that is how often do two raters agree.
To do that, we need to know what percentage of the time the raters agreed, better known as accuracy.
And this is exactly what the accuracy metric from before was calculating. But now we need to calculate a new value. Coincidence, which represents the chance that one rater might get lucky and happen to align.
This luck is then weighted based on the chances certain answers are more likely to appear. So now the question is, how do you calculate it? To calculate alignment, we start with our accuracy score.
From the accuracy score we subtract the possibility of two raters randomly agreeing.
Finally, we divide the difference by the complement of random agreement, that is, one minus the chance that the two raters agreed at random. This is the share of agreement that cannot be explained by luck.
The result of that gives us alignment.
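Written out, the steps above correspond to the standard definition of Cohen's kappa. The symbols are notation chosen for this note: $p_o$ is the observed agreement (accuracy), $p_e$ is the agreement expected by chance, and $p_{A,k}$ and $p_{B,k}$ are the fractions of samples that raters $A$ and $B$ assigned to score $k$.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_e = \sum_{k} p_{A,k}\, p_{B,k}
```

As a small worked case with made-up numbers: if the expert gives a 4 to 80% of samples and the judge gives a 4 to every sample, then $p_o = 0.8$ and $p_e = 0.8 \times 1.0 = 0.8$, so $\kappa = (0.8 - 0.8) / (1 - 0.8) = 0$. The accuracy looks high, but the alignment is no better than chance.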
Cohen's kappa is a powerful way to measure the alignment between a model judge and your expert opinion. I can use this to hill climb the alignment scores between me and my model judge.
So now, we start back at the beginning of the hill-climbing loop in the develop phase.
To do this, I am going to set up an evaluation to compare my ratings against my judge and produce an alignment score. To do that, I need to write an evaluation, which is made up of four components. First is my dataset.
Then the subject of my evaluation. Then, I need to define my evaluators. And finally, I need to aggregate my results. So let's start with the dataset.
For this evaluation to work properly, both my model judge and I need to evaluate the exact same dataset. In this case the model judge reviews tags, so I need to produce a common set of tags for the judge and me to review. And I have just the perfect dataset.
My evaluation from before contains a collection of reviews and tags.
Because I ran this evaluation in a test, Xcode generated an attachment containing all of the evaluation data that was generated. I can retrieve that attachment and extract summary and tag pairs. Now, with the summary and tag pairs extracted, I need to add my ratings. After that, I can pass the contents of this file as the input to my evaluation.
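One way to read such a file back in is with `Codable`. The sketch below assumes each entry keeps the `input` and `response` keys of the extracted data and that the expert ratings were added by hand under the keys `expertRelevanceScore` and `expertUsefulnessScore`; those two key names, the `RatedEntry` type, and the sample entry are assumptions made for this example.

```swift
import Foundation

/// One entry of the extracted file, with expert ratings added by hand.
/// Hypothetical type: the rating key names are assumed, not confirmed.
struct RatedEntry: Decodable {
    let input: String
    let response: String            // tags, stored as a JSON array inside a string
    let expertRelevanceScore: Int
    let expertUsefulnessScore: Int

    /// Decodes the nested JSON array held in `response`; empty on failure.
    var tags: [String] {
        guard let data = response.data(using: .utf8),
              let decoded = try? JSONDecoder().decode([String].self, from: data)
        else { return [] }
        return decoded
    }
}

// Made-up entry used only to exercise the decoder.
let json = #"""
[
  {
    "input": "A hypothetical review used only for this sketch.",
    "response": "[\"adventure\", \"pirates\", \"coming-of-age\"]",
    "expertRelevanceScore": 4,
    "expertUsefulnessScore": 2
  }
]
"""#

let entries = try JSONDecoder().decode([RatedEntry].self, from: Data(json.utf8))
print(entries[0].tags, entries[0].expertUsefulnessScore)
```

Because the tags arrive as a JSON array encoded inside a string, they need a second decoding pass, which the `tags` property handles.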
Next, I need to capture the subject of my evaluation. Normally, the subject method is for calling API related to your feature, but since the generated model responses are part of our dataset, we can simply return the already generated tags. Now, I need to define my evaluators. As you might have guessed, my evaluator is the exact same model judge evaluator as in our book tags evaluation. This is where the judge provides its rating.
Finally, I need to aggregate my results.
Here is where we compare my ratings against the judge's. To do that, we need to calculate Cohen's kappa, which I can do with a custom aggregation method. In addition to just Cohen's kappa, I'll also calculate the mean and standard deviation of each score dimension. This will be helpful to know if the scores of the judge are going up or down. Now, I can set up my test with my evaluation. For this test, I've set an expectation that my ratings and the judge's ratings should produce an alignment score of at least 0.6. We've chosen this number because according to statisticians, an alignment score of 0.6 represents a meaningful level of agreement. Now, it's time to evaluate and get a baseline for our alignment. And then determine if my evaluation has passed my expectations.
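The custom aggregation relies on a kappa function whose body isn't shown in this session. The following is a minimal sketch of what a function with the signature `cohensKappa(ratings1:ratings2:)` might look like; it is an assumption, not the session's implementation. It computes the unweighted form of kappa, which treats each score as a separate category and ignores how far apart two scores are.

```swift
/// Unweighted Cohen's kappa for two raters scoring the same samples.
/// Returns nil when the inputs are empty, differ in length, or when
/// chance agreement is 1 (kappa is undefined in that case).
func cohensKappa(ratings1: [Double], ratings2: [Double]) -> Double? {
    guard ratings1.count == ratings2.count, !ratings1.isEmpty else { return nil }
    let n = Double(ratings1.count)

    // Observed agreement: fraction of samples with identical ratings.
    let matches = zip(ratings1, ratings2).filter { pair in pair.0 == pair.1 }.count
    let observed = Double(matches) / n

    // Count how often each rater used each score.
    var counts1: [Double: Double] = [:]
    var counts2: [Double: Double] = [:]
    for rating in ratings1 { counts1[rating, default: 0] += 1 }
    for rating in ratings2 { counts2[rating, default: 0] += 1 }

    // Chance agreement: sum over scores of the product of both raters' frequencies.
    var expected = 0.0
    for (score, count1) in counts1 {
        expected += (count1 / n) * ((counts2[score] ?? 0) / n)
    }

    guard expected < 1 else { return nil }
    return (observed - expected) / (1 - expected)
}

// Hypothetical ratings: a judge that always answers 4 on a skewed dataset.
let expertScores: [Double] = [4, 4, 4, 4, 4, 4, 4, 4, 2, 1]
let judgeScores = [Double](repeating: 4, count: expertScores.count)
let kappa = cohensKappa(ratings1: expertScores, ratings2: judgeScores)
```

With these made-up ratings the two raters match on 8 of 10 samples, yet kappa comes out at zero because that much agreement is expected by chance alone. For an ordinal scale like 1 to 4, a weighted variant of kappa that gives partial credit for near misses may be worth considering.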
It appears that the tests failed, which means my expectations weren't met. So once again, it's time to analyze the results in detail. I now know that my alignment scores didn't match my expectations. I can now go to the evaluation report to get more information. As I expected, the scores for both usefulness and relevance are quite low, meaning my model judge and I aren't aligned.
Now, I want to get more information about how each sample in my dataset performed. To do that, I need to open the assistant and view the results in detail.
As I scanned through the results, this review of "Frankenstein" caught my eye. I can see a pretty large discrepancy between my rating of the tags and the judge's.
It seems like our judge thinks tags like self-help and self-improvement are relevant to the story. Also psychological is an okay search term, but probably not a term a user is likely to search for.
I then started looking through other items in my dataset that had a similar problem and came across this review of "The Ramakien".
The judge and I agree that this collection of terms is helpful and relevant to the contents of the book.
Where we disagree is on usefulness.
Terms like visual-dimension and quaint-dignity are way too specific.
So what's the problem here? I believe the model doesn't have enough knowledge on its own to distinguish between a good tag and a bad one. That's likely because the prompt of my judge doesn't provide enough context. To fix that, I need to develop a new prompt. That way I can compare the alignment scores of my current prompt against the scores of my new prompt. Fortunately, in Xcode 27, we've made it so you can compare the results of two evaluations against each other. When doing comparisons, some scientific thinking can go a long way. In a science experiment, you have two groups. The control group, which represents the baseline, and the experimental group, which represents the change we are trying to compare against. We can think of the two versions of our instructions in the same way, where the control group is represented by our base prompt and our experimental group is represented by our newly changed prompt. I now need to create a second version of our evaluation with an experimental prompt. For our baseline, we will use the same evaluation with the same model judge prompt as before. For our experimental prompt, I've written a more thorough description about how to judge the set of tags.
It starts by providing the judge context about the app and what it's about to be judging. Then it gives examples of good tags.
As well as ways to identify bad tags.
With both prompts written, I can add both evaluations to a test suite, which will run both evaluations. So, I'll run that suite now and compare the results. With the evaluation now finished I can return to the evaluation report.
Looks like my alignment score for relevance improved, while my alignment score for usefulness dropped considerably.
Balancing tradeoffs like this is tricky, so I need to think carefully about how to proceed.
But before in-depth analysis comes checking if we passed.
And my test confirms the obvious, we haven't.
After thinking about it further, I am going to keep this prompt change and focus the next round of iteration on improving my usefulness score. Therefore, the most effective way to review my results is to compare the usefulness scores of both judges against one another. To do that, I can use the new comparison view in the evaluation report.
From the evaluation report I can open the comparison button and open my baseline evaluation.
Here, I can review the scores of the two prompts side by side.
One thing that jumped out to me immediately is the discrepancy between usefulness scores of this review of "Picture of Dorian Gray". It seems to me that the model may be judging too harshly on usefulness. The usefulness column of the experimental evaluation seems to corroborate my guess.
I noticed that all the scores are either a 3 or 2, which is way too harsh.
I think what could help here is being more specific about how to grade each scoring dimension.
To do that, I'll need to make some changes to my experimental evaluation.
But before I can make changes to the experimental evaluation, I applied the new prompt from my experimental evaluation into my baseline. This ensures there's only one different variable.
Namely, the changes to my scoring dimensions. For relevance, I've provided a slightly longer description which emphasizes the need for a genre tag.
And here is the one for usefulness, which emphasizes being more critical of overly specific tags.
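The idea of changing only one variable per round can be checked mechanically. The sketch below uses a hypothetical `JudgeSetup` type and `changedVariables` helper, neither of which belongs to the Evaluations framework, to list what differs between a control and an experiment before running them.

```swift
/// Hypothetical description of one judge setup; not a framework type.
struct JudgeSetup {
    var instructions: String
    var scoreDimensions: [String: String]   // dimension name -> description
    var fewShotExamples: [String]
}

/// Lists which parts of the setup differ between control and experiment.
func changedVariables(control: JudgeSetup, experiment: JudgeSetup) -> [String] {
    var changed: [String] = []
    if control.instructions != experiment.instructions { changed.append("instructions") }
    if control.scoreDimensions != experiment.scoreDimensions { changed.append("scoreDimensions") }
    if control.fewShotExamples != experiment.fewShotExamples { changed.append("fewShotExamples") }
    return changed
}

// The control adopts the improved instructions from the previous round.
let control = JudgeSetup(
    instructions: "Improved judge instructions",
    scoreDimensions: ["Relevance": "Short description", "Usefulness": "Short description"],
    fewShotExamples: []
)

// The experiment starts as a copy and changes only the score dimensions.
var experiment = control
experiment.scoreDimensions = [
    "Relevance": "Longer description that stresses the genre tag",
    "Usefulness": "Longer description that penalizes overly specific tags"
]

let changed = changedVariables(control: control, experiment: experiment)
print(changed)
```

If the list contains more than one entry, any movement in the alignment score can't be attributed to a single change.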
And once again I'll wait for my evaluation to run.
And the scores both improved greatly over the baseline. It looks like these specific scoring dimensions are going to be a lot more helpful.
But, we've still not quite hit our alignment goals. So now, I need to do another comparison to see where we might be able to make improvements. To analyze further, I've gone back to the experimental evaluation. I want to review the results in detail so I'll bring up the assistant view. Thumbing through the results brought me to the review of "Moby Dick".
My relevance score is starting to align.
But my usefulness score could still use some work.
While some results are looking promising, others are still way off. This review of "Frankenstein" continues to give our judge trouble.
What I think our judge needs now is some examples of the way I judge things, which should give it a pattern for how to judge according to my scale.
That means we need another round of hill-climbing.
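As a sketch of what giving the judge examples could look like, the code below appends a short list of hand-rated tag sets to a base instruction string. The `RatedExample` type, the `judgeInstructions` helper, and the ratings are all hypothetical and chosen for illustration; only the tag names come from the results discussed earlier.

```swift
/// A hand-rated example for the judge. Hypothetical type for this sketch.
struct RatedExample {
    let tags: [String]
    let relevance: Int
    let usefulness: Int
    let rationale: String
}

/// Appends a numbered list of rated examples to the base instructions.
func judgeInstructions(base: String, examples: [RatedExample]) -> String {
    var lines = [base, "", "Examples of how an expert rated earlier tag sets:"]
    for (index, example) in examples.enumerated() {
        lines.append("\(index + 1). Tags: \(example.tags.joined(separator: ", "))")
        lines.append("   Relevance: \(example.relevance), Usefulness: \(example.usefulness)")
        lines.append("   Why: \(example.rationale)")
    }
    return lines.joined(separator: "\n")
}

// Keep the list short; the ratings here are illustrative, not real data.
let examples = [
    RatedExample(
        tags: ["visual-dimension", "quaint-dignity"],
        relevance: 4,
        usefulness: 2,
        rationale: "Describes the book, but too specific to group several books."
    )
]

let instructions = judgeInstructions(
    base: "You are evaluating automatically generated tags for a book tracking app.",
    examples: examples
)
print(instructions)
```

Keeping the examples in a typed list makes it easy to see how many the judge is given and to hold some rated samples back for checking alignment.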
I've already added the new score dimensions to my baseline evaluation. Now, I've reworked my main judge prompt to give it more detail about the goal of the tag generation feature to help ground the model in the problem space.
From there, I've written out a number of examples for the model to use as a guideline for reviewing. I've made sure to only give the model a few examples. With a longer list, I'd be prone to overfitting the alignment score, which would make it hard to tell if my judge is actually aligned with me. Now that it's a fair comparison, I need to run my evaluation and view the results. And now finally my scores are over my expected value! Which means I've finally passed and can exit out of the loop! This means that when my model judge provides ratings, I can confidently say that the tags are good or bad according to my standards. That means I can now put my judge to work on evaluating Book Tracker's Book Tagging Service. So far we've seen how to hill climb on prompts, making them incrementally better and better. Now I'd like to show you how to improve your feature through something other than your prompts.
To generate its tags, Book Tracker uses the on-device model. We use it because readers tend to be in all kinds of places when cataloging books, so using the on-device model ensures they can generate tags no matter where they are.
What I want to do is give the model some more context about the book it's generating tags for. I think the additional context will help the model generate more relevant and useful tags.
Better still, Book Tracker already has the data needed for this because we store the author's name and book title when they write their review. So, to help the tag generator, I've created a tool to get additional information on the book, which provides the book title and author if they are available. Adding this tool is a form of hill-climbing because we are attempting to improve the quality of our feature through an incremental change.
For this evaluation we will use the book tagging Evaluation, now with an improved model judge.
But I need a way to compare the quality of my feature without the tool to the quality of my feature with it. So to do that, I'll need to make a change to BookTaggingService. BookTaggingService now takes a list of tools as input. I also set the default to an empty array so my existing evaluation won't need any changes. But now I need to write a new evaluation to compare the service with the tool to the service without the tool.
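The default-argument idea can be sketched in plain Swift. The `BookTool` protocol and `BookLookupTool` type below are hypothetical stand-ins rather than the framework's tool types, and this `BookTaggingService` is a simplified instance-based version written only to show the shape of the change.

```swift
/// Hypothetical stand-in for a tool the model can call; not a framework type.
protocol BookTool {
    var name: String { get }
}

/// Hypothetical tool that would look up a book's title and author.
struct BookLookupTool: BookTool {
    let name = "lookupBook"
}

/// Simplified service for this sketch; the real one generates tags.
struct BookTaggingService {
    let tools: [any BookTool]

    // Defaulting to an empty array keeps existing call sites unchanged.
    init(tools: [any BookTool] = []) {
        self.tools = tools
    }
}

// Control: the service as it was, with no tools.
let control = BookTaggingService()

// Experiment: the same service with the lookup tool added.
let experiment = BookTaggingService(tools: [BookLookupTool()])

print(control.tools.count, experiment.tools.map(\.name))
```

Because the only difference between the two instances is the tools array, any change in evaluation scores can be attributed to the tool.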
Here is the new evaluation I wrote. It's exactly the same as the other evaluation. The only difference is I now pass my new lookup tool in the tools array.
So all I have to do is define two instances of my evaluation. One without the tool and one with it.
And now, let's evaluate it and determine if I'm ready to ship.
Well, my service which uses tools met all my expectations, so things are looking good. But my dataset for Book Tracker contains only 13 book and review pairs, which doesn't cover the wide variety of books and reviews a user might submit for tagging.
In addition, I was looking at the results of the evaluation of my service with the tool. I can see that the service with the tool is performing better; however, it does seem like my tool isn't being called in all the places I think it needs to be. What I really need is a way to tell whether or not my tool has been called in the right situations. Fortunately, the Evaluations framework can help with both of those problems. To learn more about our APIs for evaluating tool usage and generating comprehensive datasets, take a look at the "Create robust evaluations for agentic apps" video. There, you'll learn about tool call Evaluators and how to use the Sample Generator API to test the wide variety of use cases your app might see. Before we wrap up, I'd like to recap what we covered today.
First, hill-climbing works best when you focus on making one change at a time. To do this, treat every iteration of the loop like a science experiment.
Being able to isolate your changes will help you to understand how each part of your feature contributes to the overall quality.
Knowing how each part works individually will also help you to know where you might need to make changes to resolve a bug or unwanted pattern later down the line.
Second, this process takes time. Not every change you make will result in positive change. However, failed experiments tell you just as much as successful ones.
Third, good experiments require creativity. In an intelligent feature there are so many things you can change.
In your feature you can change the instructions, the tools, as well as the model or models you use to generate responses.
On the evaluation side you can change the dataset, aggregation methods, and even the evaluators themselves. Everything is fair game. Make sure to consider all of these when thinking about how to hill climb.
Finally, watch out for drift. It can feel a bit meta to evaluate your evaluators but a well tuned model evaluator will save you time in the long run. Models can generate ratings much faster than humans can. So by keeping them aligned, you get useful signal as your dataset grows to cover more and more use cases.
If you want to learn more about what we've covered here today, you can review the Book Tracker app I've been using as well as the evaluations for aligning the model judge. You can also get a comprehensive rundown of all our new APIs on the developer documentation website. Thank you for taking the time to learn about how to improve your evaluation scores by hill-climbing. Your dedication will pay off as you deliver high quality experiences to your users. Thanks for watching and happy hill-climbing!

// MARK: - Evaluation
  struct BookTaggingEvaluation: Evaluation {
      func subject(from sample: ModelSample<BookTags>) async throws -> ModelSubject<BookTags> {
          let result = try await BookTaggingService.generateTags(for: sample.promptDescription)
          return ModelSubject(value: result)
      }

      // MARK: - Dataset
      var dataset = ArrayLoader(samples:
          Book.sampleBooks.map { book in
              ModelSample(prompt: book.review, expected: BookTags(tags: book.tags))
          }
      )

      // MARK: - Evaluators & Metrics
      let tagCount = Metric("Tag Count")
      let hasGenreTag = Metric("Has Genre Tag")
      let noDuplicates = Metric("No Duplicates")

      let relevance = ScoreDimension(
          "Relevance",
          description: """
              Whether each tag describes a quality, theme, or tone of the
              book itself rather than incidental details or the reader's
              personal reactions.
              """,
          scale: .numeric([
              4: "Every tag describes the book itself",
              3: "Most tags describe the book, one picks up a reader reaction or minor detail",
              2: "Most tags are surface details or personal reactions, not book descriptors",
              1: "Tags don't meaningfully describe the book"
          ])
      )

      let usefulness = ScoreDimension(
          "Usefulness",
          description: """
              Whether tags are at the right granularity for browsing — broad
              enough that multiple books could share the tag, specific enough
              to help filter.
              """,
          scale: .numeric([
              4: "Every tag could group multiple books while still narrowing a search",
              3: "Most tags are at the right level, one is either too broad or too narrow",
              2: "Most tags are too broad to filter or too narrow to group",
              1: "Tags would not help with browsing"
          ])
      )

      var evaluators: Evaluators {
          // 1. Tag count is within the required 3–8 range
          Evaluator { _, subject in
              let count = subject.value.tags.count
              if (3...8).contains(count) {
                  return tagCount.passing(rationale: "\(count) tags")
              }
              return tagCount.failing(rationale: "Got \(count) tags, expected 3–8")
          }
  
          // 2. At least one tag identifies the genre or literary form
          Evaluator { _, subject in
              let tags = subject.value.tags.map { $0.lowercased() }
              let knownGenres = await BookTaggingService.knownGenres
              for tag in tags {
                  if knownGenres.contains(tag) {
                      return hasGenreTag.passing(rationale: "Matched \(tag)")
                  }
              }
              return hasGenreTag.failing()
          }

          // 3. No duplicate tags
          Evaluator { _, subject in
              let uniqueCount = Set(subject.value.tags.map { $0.lowercased() }).count
              if (subject.value.tags.count - uniqueCount) > 0 {
                  return noDuplicates.failing(rationale: "Found \(subject.value.tags.count - uniqueCount) duplicates")
              }
              return noDuplicates.passing()
          }
  
          // 4. Overall tag quality: relevance and usefulness, scored by the model judge
          ModelJudgeEvaluator(
              judge: .default,
              dimensions: [relevance, usefulness],
              prompt: ModelJudgePrompt(
                  instructions: """
                      You are evaluating automatically generated tags for Shelf, a personal
                      book tracking app. Users write a short summary of their reading
                      experience, and the app generates tags to make their library browsable.
                      A good tag describes the book itself — its genre, themes, tone, or
                      setting. A bad tag picks up incidental details or the reader's personal
                      reactions that don't describe the book.
                      """,
                  evaluationTarget: { output in output.tags.joined(separator: ", ") },
                  reference: { input, _ in
                      ["Expected Tags": input.expected?.tags.joined(separator: ", ") ?? ""]
                  }
              )
          )
      }

      // MARK: - Analysis
      func aggregateMetrics(using aggregator: inout MetricsAggregator) {
          aggregator.group("Heuristics") { group in
              group.computeMean(of: tagCount)
              group.computeMean(of: hasGenreTag)
              group.computeMean(of: noDuplicates)
          }
          aggregator.group("Quality") { group in
              group.computeMean(of: relevance.metric)
              group.computeMean(of: usefulness.metric)
          }
      }
  }

4:05 - Refined Relevance & Usefulness score dimensions

let relevance = ScoreDimension(
      "Relevance",
      description: """
          Whether each tag describes the book itself — its genre, themes,
          tone, or setting — rather than the reader's reactions, meta-
          commentary about the review, or facts about the author. A book
          can be "suspenseful" (a property of the text); a reader is
          "exhausted" (a reaction). Mis-labeling the genre is a serious failure.
          """,
      scale: .numeric([
          4: "Every tag describes the book itself",
          3: "Most tags describe the book, one picks up a reader reaction or minor detail",
          2: "Most tags are surface details or personal reactions, not book descriptors",
          1: "Tags don't meaningfully describe the book"
      ])
  )

  let usefulness = ScoreDimension(
      "Usefulness",
      description: """
          Whether tags work as library shelf labels — broad enough that
          several books could plausibly share the tag, specific enough to
          meaningfully narrow a search. Standard genre and theme tags work;
          made-up phrases, character names, hyper-specific descriptors, and
          overly generic words like "interesting" don't.
          """,
      scale: .numeric([
          4: "Every tag could group multiple books while still narrowing a search",
          3: "Most tags are at the right level, one is either too broad or too narrow",
          2: "Most tags are too broad to filter or too narrow to group",
          1: "Tags would not help with browsing"
      ])
  )

11:56 - The alignment dataset, extracted to JSON

// Model judge alignment dataset
  [
    {
      "input": "I have read this book more times than I can count…",
      "response": "[\"literary-fiction\", \"historical-fiction\", \"family-drama\", \"romantic-drama\", \"character-driven\", \"emotional-intensity\", \"multigenerational-narrative\", \"penned-by-a-woman\"]"
    }
    // ... add your expert ratings to each entry
  ]

12:31 - The judge alignment evaluation: dataset, subject, evaluator

// Model judge alignment evaluation
  struct BookTagJudgmentCalibration: Evaluation {

      // MARK: Dataset — load the extracted summary/tag pairs
      static let samples: [ModelSample<BookTagJudgmentValue>] = {
          guard let url = Bundle(for: BundleToken.self).url(
                  forResource: "BookTaggingEvaluation-extracted", withExtension: "json"),
                let data = try? Data(contentsOf: url) else { return [] }
          // Build ModelSample array (adding expert ratings)
          // ...
      }()

      var dataset: some Loader { ArrayLoader(samples: Self.samples) }
  
      // MARK: Capture Subject — tags are already generated, so just return them
      func subject(from sample: ModelSample<BookTagJudgmentValue>) async throws -> ModelSubject<BookTagJudgmentValue> {
          ModelSubject(value: sample.expected ?? BookTagJudgmentValue(
              tags: [], expertRelevanceScore: 0, expertUsefulnessScore: 0))
      }

      // MARK: Evaluators — the same model judge as the book-tags evaluation
      var evaluators: Evaluators {
          ModelJudgeEvaluator(
              judge: .default,
              dimensions: [relevance, usefulness],
              prompt: ModelJudgePrompt(
                  instructions: "You are evaluating automatically generated tags for Book Tracker…",
                  evaluationTarget: { output in output.tags.joined(separator: ", ") },
                  reference: { input, _ in
                      ["Expected Tags": input.expected?.tags.joined(separator: ", ") ?? ""]
                  }
              )
          )
      }
  }

13:00 - Cohen's kappa aggregation

func aggregateMetrics(using aggregator: inout MetricsAggregator) {
      let expertRelevance = Self.samples.map { Double($0.expected?.expertRelevanceScore ?? 0) }
      let expertUsefulness = Self.samples.map { Double($0.expected?.expertUsefulnessScore ?? 0) }

      aggregator.group("Relevance") { group in
          group.computeMean(of: relevance.metric)
          group.computeStandardDeviation(of: relevance.metric)
          group.custom(of: relevance.metric, label: "Relevance Alignment Score") { judge in
              cohensKappa(ratings1: expertRelevance, ratings2: judge) ?? 0
          }
      }
      aggregator.group("Usefulness") { group in
          group.computeMean(of: usefulness.metric)
          group.computeStandardDeviation(of: usefulness.metric)
          group.custom(of: usefulness.metric, label: "Usefulness Alignment Score") { judge in
              cohensKappa(ratings1: expertUsefulness, ratings2: judge) ?? 0
          }
      }
  }
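
The aggregation above calls a `cohensKappa(ratings1:ratings2:)` helper that is not shown. A minimal sketch of one possible implementation follows, assuming both arguments are `[Double]` arrays of equal length and that the unweighted form of kappa is intended: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The helper in the sample project may differ, for example by using a weighted variant for ordinal scales.

```swift
/// Unweighted Cohen's kappa for two raters scoring the same items.
/// Returns nil when the inputs are empty, differ in length, or when chance
/// agreement is 1 (both raters used one identical category), where kappa is undefined.
func cohensKappa(ratings1: [Double], ratings2: [Double]) -> Double? {
    guard !ratings1.isEmpty, ratings1.count == ratings2.count else { return nil }
    let n = Double(ratings1.count)

    // Observed agreement: fraction of items on which both raters gave the same score.
    let observed = Double(zip(ratings1, ratings2).filter { $0.0 == $0.1 }.count) / n

    // How often each rater used each score.
    var counts1: [Double: Double] = [:]
    var counts2: [Double: Double] = [:]
    for rating in ratings1 { counts1[rating, default: 0] += 1 }
    for rating in ratings2 { counts2[rating, default: 0] += 1 }

    // Chance agreement: sum over scores of the product of the two raters' proportions.
    let expected = counts1.reduce(0.0) { sum, entry in
        sum + (entry.value / n) * ((counts2[entry.key] ?? 0) / n)
    }

    guard expected < 1 else { return nil }
    return (observed - expected) / (1 - expected)
}

// A skewed case: the expert gives nine 4s and one 2, the judge gives 4 every time.
let expertScores: [Double] = [4, 4, 4, 4, 4, 4, 4, 4, 4, 2]
let judgeScores = [Double](repeating: 4, count: 10)
let skewedKappa = cohensKappa(ratings1: expertScores, ratings2: judgeScores)
```

In the skewed case the two raters agree on 9 of 10 items, so accuracy is 0.9, but chance agreement is also 0.9 and kappa comes out at 0. This is the situation where a judge that scores everything highly looks aligned by accuracy alone.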

13:24 - The judge calibration test

// Model judge alignment tests
  @Suite("Book Tag Judge Calibration")
  struct BookTagJudgmentCalibrationTests {
      static let evaluation = BookTagJudgmentCalibration()

      @Test("Judge Calibration", .evaluates(evaluation))
      func evaluateJudgeCalibration() async throws {
          let result = EvaluationContext.current.result

          let usefulnessMetric = BookTagJudgmentCalibrationTests.evaluation.usefulness.metric
          let relevanceMetric = BookTagJudgmentCalibrationTests.evaluation.relevance.metric

          // Labels must match the ones registered in aggregateMetrics(using:).
          #expect(result.aggregateValue(.custom(label: "Relevance Alignment Score")) > 0.6)
          #expect(result.aggregateValue(.custom(label: "Usefulness Alignment Score")) > 0.6)
      }
  }

16:33 - The experimental judge prompt

// Experimental evaluation
  struct BookTagJudgmentCalibrationExperimental: Evaluation {
      var evaluators: Evaluators {
          ModelJudgeEvaluator(
              judge: .default,
              dimensions: [relevance, usefulness],
              prompt: ModelJudgePrompt(
                  instructions: """
                      You are an experienced reader and librarian evaluating tags
                      automatically generated for Book Tracker... Score the tag set on two
                      independent dimensions: Relevance and Usefulness.

                      ## What a good tag looks like
                      - Genre/form, theme/subject, tone/atmosphere, setting/era

                      ## Common failure modes
                      - Reader reactions, meta-commentary, author facts, genre contradictions
                      """,   // ← full prompt is ~40 lines; abbreviated here
                  evaluationTarget: { output in output.tags.joined(separator: ", ") },
                  reference: { input, _ in
                      ["Book Review": input.promptDescription,
                       "Tags Generated for the Review": input.expected?.tags.joined(separator: ", ") ?? ""]
                  }
              )
          )
      }
  }

20:12 - Few-shot worked examples in the judge prompt

struct ExperimentalBookTagJudgmentCalibration: Evaluation {
      var evaluators: Evaluators {
          ModelJudgeEvaluator(
              judge: SystemLanguageModel(),
              dimensions: [relevance, usefulness],
              prompt: ModelJudgePrompt(
                  instructions: """
                      You are calibrating with an expert librarian who scores
                      automatically generated tags for Book Tracker... Your goal is to
                      match how the librarian scores. Use the worked examples to calibrate.

                      ## Worked examples
                      ### Example A — clean fit (Pride and Prejudice)
                      Tags: romance, historical-fiction, love, redemption, passion
                      Librarian: Relevance 4, Usefulness 4

                      ### Example E — flat genre contradiction (Frankenstein)
                      Tags: horror, science-fiction, ... self-help, self-improvement
                      Librarian: Relevance 2, Usefulness 3
                      ... (6 examples A–F; keep the set small to avoid overfitting)
                      """,   // ← full prompt is ~60 lines; abbreviated here
                  evaluationTarget: { output in output.tags.joined(separator: ", ") },
                  reference: { input, _ in
                      ["Book Review": input.promptDescription,
                       "Tags Generated for the Review": input.expected?.tags.joined(separator: ", ") ?? ""]
                  }
              )
          )
      }
  }

22:03 - The BookLookupTool

// Book Information Lookup Tool
  struct BookLookupTool: Tool {
      let name = "lookupBook"
      let description = "Looks up the title and author of a book given distinguishing details — such as character names, settings, quoted lines, or notable plot points — extracted from a reader's review."

      @Generable
      struct Arguments {
          @Guide(description: "Distinguishing details from the review that identify the book, such as character names, settings, quoted lines, or notable plot points.")
          var details: String
      }
  
      @Generable
      struct Output {
          @Guide(description: "The title of the identified book, or an empty string if no match was found.")
          var title: String

          @Guide(description: "The author of the identified book, or an empty string if no match was found.")
          var author: String
      }
  
      func call(arguments: Arguments) async throws -> Output {
          let needles = arguments.details
              .lowercased()
              .split(whereSeparator: { !$0.isLetter && !$0.isNumber })
              .map(String.init)
              .filter { $0.count >= 4 }

          let best = Book.sampleBooks
              .map { book -> (book: Book, score: Int) in
                  let review = book.review.lowercased()
                  let score = needles.reduce(0) { partial, needle in
                      partial + (review.contains(needle) ? 1 : 0)
                  }
                  return (book, score)
              }
              .max(by: { $0.score < $1.score })

          guard let match = best, match.score > 0 else {
              return Output(title: "", author: "")
          }
          return Output(title: match.book.title, author: match.book.author)
      }
  }
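
The matching logic inside `call(arguments:)` can be tried without a model session. The sketch below pulls the same keyword-overlap idea into plain functions so its behavior is easy to inspect. `SampleBook`, `keywords(in:)`, and `bestMatch(for:in:)` are hypothetical names, and the two books are made-up test data rather than the contents of `Book.sampleBooks`.

```swift
import Foundation

// Hypothetical stand-in for the sample project's Book type.
struct SampleBook {
    let title: String
    let author: String
    let review: String
}

/// Splits free text into lowercase tokens of four or more characters.
func keywords(in text: String) -> [String] {
    text.lowercased()
        .split(whereSeparator: { !$0.isLetter && !$0.isNumber })
        .map(String.init)
        .filter { $0.count >= 4 }
}

/// Returns the book whose review contains the most keywords, or nil if none match.
func bestMatch(for details: String, in books: [SampleBook]) -> SampleBook? {
    let needles = keywords(in: details)
    let scored = books.map { book -> (book: SampleBook, score: Int) in
        let review = book.review.lowercased()
        // Substring test, so "ship" would also count as a hit inside "friendship".
        return (book, needles.filter { review.contains($0) }.count)
    }
    guard let best = scored.max(by: { $0.score < $1.score }), best.score > 0 else {
        return nil
    }
    return best.book
}

let shelf = [
    SampleBook(title: "Frankenstein", author: "Mary Shelley",
               review: "Victor builds a creature in his laboratory and flees to the Arctic."),
    SampleBook(title: "Pride and Prejudice", author: "Jane Austen",
               review: "Elizabeth Bennet and Mr. Darcy spar at Pemberley.")
]
let match = bestMatch(for: "Elizabeth, Darcy, Pemberley", in: shelf)
```

Because the score is a simple count of substring hits, short or common keywords can produce false matches on a larger catalog. That is one reason to check which tool calls were actually made before trusting a score improvement.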

22:36 - BookTaggingService with a tools parameter

// Book Tagging Service
  struct BookTaggingService {
      static func generateTags(for review: String, tools: [any Tool] = []) async throws -> BookTags {
          let prompt = tagsPrompt(review: review)
          let session = LanguageModelSession(
              model: SystemLanguageModel(guardrails: .permissiveContentTransformations),
              tools: tools,
              instructions: instructions
          )
          let response = try await session.respond(to: prompt, generating: BookTags.self)
          return response.content
      }
  }

22:57 - Evaluation with the lookup tool

// Evaluation of tags with tool
  struct BookTaggingWithLookupEvaluation: Evaluation {
      func subject(from sample: ModelSample<BookTags>) async throws -> ModelSubject<BookTags> {
          let result = try await BookTaggingService.generateTags(
              for: sample.promptDescription,
              tools: [BookLookupTool()]
          )
          return ModelSubject(value: result)
      }
      // ... same dataset, evaluators, and aggregation as BookTaggingEvaluation
  }

23:09 - Compare with/without the tool in one suite

@Suite("Book Tag Evaluations")
  struct BookTagEvaluationTests {
      static let evaluation = BookTaggingEvaluation()
      static let lookupEvaluation = BookTaggingWithLookupEvaluation()

      @Test("Book Tag Evaluations", .evaluates(evaluation, info: evaluationInfo))
      func evaluateBookTagging() async throws {
          let result = EvaluationContext.current.result
          let rangeMetric = BookTagEvaluationTests.evaluation.tagCount
          let dupeMetric = BookTagEvaluationTests.evaluation.noDuplicates
          #expect(result.aggregateValue(.mean(of: rangeMetric)) >= 0.8)
          #expect(result.aggregateValue(.mean(of: dupeMetric)) == 1)
      }

      @Test("Book Tag Evaluations (with BookLookupTool)", .evaluates(lookupEvaluation, info: lookupEvaluationInfo))
      func evaluateBookTaggingWithLookup() async throws {
          let result = EvaluationContext.current.result
          let rangeMetric = BookTagEvaluationTests.lookupEvaluation.tagCount
          let dupeMetric = BookTagEvaluationTests.lookupEvaluation.noDuplicates
          #expect(result.aggregateValue(.mean(of: rangeMetric)) >= 0.8)
          #expect(result.aggregateValue(.mean(of: dupeMetric)) == 1)
      }
  }

- 0:00 - Introduction
- Hill-climbing — iteratively improving an intelligence feature using evaluation scores as a guide (develop, run, analyze) — framed around bringing scientific thinking to that loop. Assumes you've already built an evaluation pipeline (see "Meet the Evaluations framework").
- 2:42 - BookTracker's tagging problem
- Revisits BookTracker, whose tag generator produces tags that miss key themes or reflect the reader's feelings rather than the book. The existing evaluation judges tag quality via score dimensions (Relevance, Usefulness) and a ModelJudgeEvaluator.
- 5:27 - Analyzing the evaluation results
- Adds two reviews to the dataset, runs the evaluation (Swift Testing #expect), and uses the Xcode evaluation report and assistant editor to compare generated tags against expected ones, revealing the human and model judge disagree on usefulness.
- 8:26 - Drift between judge and human
- That disagreement is drift, the divergence between a model judge's ratings and an expert's. As the dataset grows, drift widens, making it hard to trust the evaluation, so the judge must be aligned to expert opinion.
- 9:37 - Measuring drift with Cohen's kappa
- Accuracy alone misleads when scores are unevenly distributed: a judge that rates almost everything highly can look aligned by chance. Cohen's kappa corrects for this by subtracting the probability of chance agreement from the observed agreement and normalizing by the largest possible improvement over chance, which makes it a more robust drift metric.
- 12:26 - Building a judge alignment evaluation
- Builds an evaluation comparing the presenter's ratings to the judge's over a shared dataset: extract summary/tag pairs from the prior run's attachment, add human ratings, return the already-generated tags as the subject, reuse the same ModelJudgeEvaluator, and aggregate Cohen's kappa (plus mean and standard deviation), targeting an alignment score above 0.6.
- 15:16 - Analyzing alignment failures
- The alignment test fails. Drilling into the report (for example Frankenstein, The Ramakien) shows the judge rating overly specific or off-theme tags too highly: the judge's prompt lacks the context to tell a good tag from a bad one.
- 17:16 - Comparative evaluation: control vs experimental
- Xcode 27 can compare two evaluations like a controlled experiment: a baseline (control) prompt versus an experimental prompt that adds app context plus examples of good and bad tags. Running both shows relevance improved while usefulness dropped, a tradeoff to weigh.
- 19:12 - Refining the scoring dimensions
- The prompt change is kept, and the side-by-side comparison view reveals that the judge grades usefulness too harshly. The new prompt becomes the baseline so that only one variable changes at a time, and the ScoreDimension descriptions are then sharpened (emphasizing genre tags and being critical of overly specific ones), which improves both scores.
- 21:23 - Adding few-shot examples to the judge
- With scores still short of the goal, the judge prompt is grounded with the feature's purpose and a few worked examples of how the presenter rates; the set is kept deliberately small to avoid overfitting the alignment score. Scores finally exceed expectations, so the judge can be trusted and the loop exits.
- 23:38 - Going beyond prompts: adding a tool
- Hill-climbing isn't only about prompts: to give the on-device tag model more context, a BookLookupTool supplies the title and author. BookTaggingService gains a tools parameter (defaulting to empty), and a second evaluation compares the feature with and without the tool. The tool version scores better, though the small 13-sample dataset and unobserved tool calls point to "Create robust evaluations for agentic apps."
- 27:17 - Next steps
- Think like a scientist (one change at a time), invest the time (failed experiments still inform), be creative (instructions, tools, models, datasets, aggregations, and evaluators are all fair game), and watch for drift. Download the Book Tracker sample and review the documentation.
