Meet the Evaluations framework

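Learn how to evaluate model-powered experiences with the Evaluations framework. In a probabilistic world, unit tests alone are not enough. Find out how to define metrics, grade outputs automatically, and aggregate statistics so your AI-powered features behave reliably across Apple platforms.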

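Chapters
- 0:00 - Introduction
- 3:10 - Demo app Book Tracker: manual evaluation
- 4:31 - Build your first evaluation
- 8:06 - Run the evaluation and read the report
- 10:57 - Build a robust dataset
- 14:20 - Refine metrics and evaluators
- 15:41 - Evaluation-driven development and hill climbing
- 16:12 - Model judges: qualitative metrics
- 18:42 - Build a model judge
- 21:19 - Improve with score dimensions
- 23:45 - Review dimension results
- 24:20 - Best practices
- 25:38 - Next steps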
Hi, I'm Yada. And I'm Rob. We're excited to introduce the Evaluations framework, a new framework that measures the quality of your intelligent features so you can deliver your apps with confidence.
Last year we introduced the Foundation Models framework, which helped you add intelligent features to your apps, using our on-device models.
The same models which power Apple Intelligence.
Building app features with generative AI poses new testing challenges, because the same input can produce different outputs. These models break a contract that is fundamental to software testing.
Consider traditional software, where a particular input always produces a particular output. You can easily verify this behavior with a unit test.
You're guaranteed the same input will produce the same output on any device, including your customers'.
With intelligent software, you cannot rely on functional consistency to verify behavior, which means that unit tests are insufficient. Unverified behavior can erode customer confidence. Your customers expect intelligent features in your app, like any feature, to be safe, trustworthy, and reliable. Shipping a feature with unpredictable behavior can have adverse consequences on your app's reputation.
It's important we measure our intelligent features and understand how they respond to different inputs. And since functional tests can't verify probabilistic behavior, we need a new form of test that is more robust.
We need to know: how often does my app produce unexpected results? How often does the agent take an unexpected path to generate answers? And under what circumstances does the feature produce unsafe results? The challenge of testing intelligent features powered by generative AI is exactly why we built the Evaluations framework.
The Evaluations framework is a flexible system of provided types and protocols. This video will focus on evaluating intelligent features powered by language models.
But you can evaluate any stochastic system, such as classifiers and linear regression models.
Yada and I will introduce you to several types in the framework.
We'll cover data loading and building a diverse dataset.
Building quantitative metrics with Evaluator and Metric.
And refining your measurements using model judges and score dimensions to create qualitative metrics. In this video, you'll get started with Evaluations. After building your first evaluation, we'll show you how to scale that evaluation, with more data, and more measurements.
Then we'll teach you how to build powerful model judges using our simple API.
Let's get started with Evaluations.
Yada and I are building an app called Book Tracker. We both love books and wanted an app to manage our libraries. Yada just added a new feature, called BookTaggingService. It automatically tags books based on a review we've written in Book Tracker.
I can't wait to open Xcode and try it out.
Let's add a #Playground macro to BookTaggingService.swift. Here's the review of "Pride & Prejudice" Yada added to Book Tracker. Have to say I'm a fan myself. Let's see what tags we get back.
This is a good start, but as I'm reading some of the tags, I can see our service will need a little work.
9 tags is more than I was expecting.
And I don't want the book's name as a tag, either.
Multi-word tags are gonna be a problem in the UI, so we should avoid those as well.
Let's see if we have better luck with another review: "Dracula".
7 tags is within our expected amount. Let's take a closer look at them. There are some behaviors that I'd like to see more of.
It identified literary genres, and some categories that would help me browse a larger library.
Okay, we've just completed our first evaluation of the service.
We created a list of expectations and used our human judgement to measure how the service performed. Every evaluation measures how well an intelligent feature performs against our expectations.
Unfortunately human judgement doesn't scale.
But we've created a way to automate and scale evaluations. All you have to do is add import Evaluations, and implement the Evaluation protocol.
Let's build an evaluation in code.
And we'll start with our first expectation: measuring that our service generates the correct number of tags. There are five steps to building and running an evaluation. You define what code you're measuring. Then, define what data you're sending the code. Next, define what measurements you're making and how.
Then, summarize your measurements. And then, finally, create a test to run your evaluation.
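The five steps can be sketched in plain Swift before bringing in the framework. This is a minimal illustration only: `Sample`, `generateTags(for:)`, and `tagCountInRange(_:)` are hypothetical stand-ins, the tagging service is stubbed with canned output, and none of it uses the Evaluations API.

```swift
import Foundation

// Hypothetical stand-in for an input sample; not a framework type.
struct Sample {
    let review: String
}

// Step 1: the code being measured (stubbed so the sketch runs offline).
func generateTags(for review: String) -> [String] {
    if review.contains("Dracula") {
        return ["gothic", "horror", "vampire", "classic"]
    }
    // Nine canned tags, one more than the allowed maximum.
    return ["classic", "romance", "wit", "regency", "satire",
            "family", "marriage", "class", "pride"]
}

// Step 2: the input data.
let samples = [
    Sample(review: "A review of Pride & Prejudice"),
    Sample(review: "A review of Dracula"),
]

// Step 3: one measurement per sample (true means the tag count is in range).
func tagCountInRange(_ tags: [String]) -> Bool {
    (3...8).contains(tags.count)
}

// Step 4: summarize the per-sample measurements as a pass rate.
let results = samples.map { tagCountInRange(generateTags(for: $0.review)) }
let passRate = Double(results.filter { $0 }.count) / Double(results.count)

// Step 5: compare the summary against a target, as a test would.
let meetsTarget = passRate >= 0.8
print("pass rate: \(passRate), meets target: \(meetsTarget)")
```

With the stubbed output, one of the two samples is in range, so the pass rate is 0.5 and the 80% target is missed.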
First, we add the call to the BookTaggingService, and return its output inside of the subject(from:) method.
These generated tags are the subject of our evaluation. Next, define the input samples we'll feed the code we're measuring.
Then, we'll use ModelSample to wrap the same reviews we tested in the #Playground earlier: "Pride & Prejudice" and "Dracula". Notice we define expected tags as well. These are the ideal tags we'd like to see from the service.
Now, it's time to define our measurements, using the Metric type.
We add a Metric called "TagCount", which will track the number of generated tags returned by the service. We need something to measure the generated tags. Evaluator takes a closure, that gets passed the output from the service, for a given sample. We can check the number of generated tags by using the count of the tags property.
If the length of the tags array is between 3 and 8, we return a passing metric from our Evaluator.
If not, we return a failing metric.
Evaluators run over a single sample at a time. But we can measure trends and look for patterns measured over all of our samples in the aggregateMetrics(using:) method. Let's calculate the average number of times the service generates the correct number of tags.
Then we'll have a ratio for how often the service behaves correctly.
Okay, we've written our first evaluation. Next, let's write some code to run it.
Evaluations integrates with Swift Testing, so you can run your evaluations in your app's test targets. Here we instantiate our BookTaggingEvaluation inside of a Test Suite. We add some notes to our evaluation run, so we can keep track of the configuration we're evaluating. This will be helpful later, when we compare across different evaluation runs.
Next, we add a test function, using the @Test macro, and a new @Test trait .evaluates. This trait takes our evaluation and a notes dictionary, like the one we've created earlier in the @Suite.
Inside our @Test, we can access an evaluation results bundle. This contains all of the metrics and aggregate metrics from our evaluation run. Let's grab all of our tagCount metrics from the results, and assert against their average value. We'll use the aggregateValue method on the results bundle. Then, assert against the average in an #expect macro.
Here, I expect the service to produce the correct number of tags 80% of the time.
Why 80%? If the service performance dips below 80%, I want to know, and a failing test is a great signal. But what if I want even more insight into what happened during the evaluation? We have a new test report for evaluations. It's a great way to dive into the details of your evaluation and analyze further.
Let's run our test, and I'll walk you through the report. Based on what the service returned earlier in my #Playground, specifically how many tags it generated for "Pride & Prejudice", I don't expect the test to pass.
Okay, the test didn't pass. Let's go to the report and review what happened. Click on the report navigator, and then select Evaluations in the test report.
Here's the evaluation report for the test suite. Let's double click the row to find out more.
And I see my TagCount metric only passed 50% of the time. And a quick look at the full results table shows me that my "Pride & Prejudice" sample produced a failure. But my "Dracula" sample produced the correct number of tags.
I can select each row in the table to see more details, using the assistant editor in Xcode. The detail panel shows the prompt, and each measurement for the ModelSample. At the bottom, you see the entire response from the model.
Let's recap a little. We built an evaluation for BookTaggingService. Ran that evaluation and it failed to meet our optimization target.
Remember back in our test definition? This is where we defined our optimization target. We're saying the feature behaves as expected, if the correct number of tags were generated, 80% of the time.
Beyond the automated check of our optimization target, we need to dig deeper into our results and gather insights. Specifically, think about the changes that could be made to improve the feature's performance.
I have a hunch, so I look back at the @Generable type, BookTags, that the service is generating.
We already have a @Guide macro giving the model additional instructions for the tags property.
I could specify a count property in that @Guide, which can take a range. That should instruct the model to only generate between 3 and 8 tags.
This is an interesting theory. Let's make that change.
Then re-run the evaluation to see if I'm right. We call this process hill-climbing.
All right, I made the change and I re-ran the evaluation. My test passed, and my TagCount passes 100% of the time. But I notice a potentially strange behavior: after my change, the service always generates eight tags. Hmmm.
Now that we have the Evaluations set up, let's collect more measurements across more samples, and let's see if that strange behavior persists.
We started our evaluation with only two data samples. As we saw, that only gave us two measurements to extract trends.
Good evaluations have thousands of samples to extract trends, but also to exercise your feature in many different ways.
We should consider variety in our dataset. For example… We want the service to recognize different genres. We can't assume every user will give it a verbose review, so our reviews should be different lengths.
You browse for fiction and non-fiction using different categories, so your samples should represent that variety.
Finally, you should consider different forms: novels, short stories, and essays.
Let's make it hard on the model too. Sprinkle in personal opinions, so we can measure how well the service ignores those in the reviews.
If you want to teach the feature how to write tags like you, start by including more of your personal style in the expected values of the samples.
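One way to check that a dataset really has this variety is to tally samples along each axis. The sketch below is plain Swift under stated assumptions: `ReviewSample`, its `genre` and `form` fields, and the 10-word cutoff for a "short" review are all hypothetical and not part of the framework or the talk.

```swift
// Hypothetical sample metadata; the field names are assumptions for this sketch.
struct ReviewSample {
    let review: String
    let genre: String
    let form: String   // for example "novel", "short story", or "essay"
}

let dataset = [
    ReviewSample(review: "Loved the roses and the walled garden.",
                 genre: "children", form: "novel"),
    ReviewSample(review: "Read it aloud to my son over a week. He adored the pirates, and I kept losing my place because he wanted every map described twice.",
                 genre: "adventure", form: "novel"),
    ReviewSample(review: "The doctor narrates well.",
                 genre: "mystery", form: "short story"),
]

// Bucket each review by word count so short and long reviews are both visible.
func lengthBucket(_ review: String) -> String {
    let words = review.split(separator: " ").count
    return words < 10 ? "short" : "long"
}

// Count how many samples fall into each bucket for a given key.
func coverage(_ key: (ReviewSample) -> String) -> [String: Int] {
    Dictionary(grouping: dataset, by: key).mapValues { $0.count }
}

let byGenre = coverage { $0.genre }
let byForm = coverage { $0.form }
let byLength = coverage { lengthBucket($0.review) }
print(byGenre, byForm, byLength)
```

A bucket with a count of zero or one is a hint that the dataset does not yet exercise that kind of input.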
Let's look at a few examples in code. This review of "The Secret Garden" reads very differently from the reviews we started with because we wrote it as though we were an avid gardener.
Here we challenge the model, including a personal review from a mother reading "Treasure Island" to her son. Lots of personal opinions in this review.
This board game enthusiast needed multiple paragraphs for their review of the Chinese classic, "Romance of the Three Kingdoms".
While this casual reader described a famous British detective's sidekick in a single sentence.
The game is afoot, when the model tries to decipher this one.
And while it's fun to come up with these examples, human data creation doesn't scale, either.
Consider these sentence completion pairs, where the output of the feature is compared directly to the expected answer. You need thousands of examples for this evaluation to be effective.
Fortunately, we include a SampleGenerator as part of the Evaluations framework. You can call it directly on an array of ModelSamples and it will synthetically generate more samples using a model of your choice.
To hear more about how you can synthesize larger datasets, and learn more about advanced uses of ModelSample, please check out our video "Create robust evaluations for agentic apps".
Back to BookTagging. I'm going to update my dataset property to include all of the book reviews from our library, including the four we showed earlier.
When I re-run my evaluation with the expanded dataset, my test passes, my TagCount average is still 100%, and the service generated eight tags for all of them. Now we know there's a weird behavior in the service.
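A frequency table of tag counts makes this kind of behavior easy to spot, because a pass/fail metric alone hides it. The following is a plain-Swift sketch with illustrative values, not output from the talk, and it does not use the Evaluations API.

```swift
// Tag counts per sample, as they might be collected from an evaluation run.
// The values are illustrative only.
let tagCounts = [8, 8, 8, 8, 8, 8]

// Frequency table: how many samples produced each tag count.
var histogram: [Int: Int] = [:]
for count in tagCounts {
    histogram[count, default: 0] += 1
}

// Every sample is in range, yet only one distinct count ever appears.
let allInRange = tagCounts.allSatisfy { (3...8).contains($0) }
let isDegenerate = histogram.count == 1
print("in range: \(allInRange), distinct counts: \(histogram.count)")
```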
Looking back at my expectations, I've built an evaluator to track if the number of tags is in range. I think I still need to refine that a little. Here's my current Metric and Evaluators setup.
First, I define a new Metric, "TagTotal", that will record the number of generated tags.
Then I build a simple Evaluator, which records the length of the generated tags array. Then, we record a measurement using a scoring value, instead of a pass/fail value.
Using the "TagTotal" and "TagCount" metrics we evaluate range compliance and the distribution of generated tags.
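To make the distribution idea concrete, here is a small plain-Swift sketch of the summary statistics involved. It assumes population (not sample) variance, which may differ from what the framework computes, and `summarize(_:)` is a hypothetical helper.

```swift
// Population mean, variance, and standard deviation of a list of scores.
// Using the population formula is an assumption of this sketch.
func summarize(_ values: [Double]) -> (mean: Double, variance: Double, standardDeviation: Double) {
    precondition(!values.isEmpty, "need at least one value")
    let mean = values.reduce(0, +) / Double(values.count)
    let squaredGaps = values.map { ($0 - mean) * ($0 - mean) }
    let variance = squaredGaps.reduce(0, +) / Double(values.count)
    return (mean, variance, variance.squareRoot())
}

// A service that varies its tag count versus one stuck at eight.
let varied = summarize([4, 6, 8, 6])
let stuck = summarize([8, 8, 8, 8])
print(varied, stuck)
```

Both runs would pass a 3 to 8 range check on every sample, but a standard deviation of zero singles out the run that always returns eight tags.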
We can follow a similar pattern for checking the number of words in tags. Here, we check each tag for a space, then return a failing metric if it contains one. Identifying a literary genre is equally straightforward, assuming you're looking for a known set of genres. We check the BookTaggingService for knownGenres. Then compare each of the generated tags for a match.
Our evaluation is really filling out. We can already measure three of our original five expectations. And our evaluation report provides a rich picture of how our tagging service is performing. We track our three expectations using five aggregate metrics. Here, we can see the distribution of tags, along with range compliance and containing genre tags.
Using our hill-climbing methodology, we've iterated on our instructions for the service. Here's where we started at the beginning.
And here's where we are after several updates to our evaluation and multiple runs through our loop.
And we can track each change to our instructions, by an expectation we added to our evaluation to verify that change.
When you take our hill-climbing feedback loop, and center your development process around it, we call it evaluation-driven development.
But we're not done getting our service up to spec.
We still expect our tags to be informative, relevant to the book and helpful for browsing your library. Here's Yada to tell you about model judges, and how they'll take your evaluation to the next level.
Thanks, Rob. Model judges are how we measure qualitative metrics at scale. Let me show you how to build and refine one.
Let's take a look at a concrete example. Here's a review of "Alice in Wonderland" that Rob wrote in Book Tracker.
And here are the tags that our service generated.
Six tags, single word or hyphenated, with tags identifying genre. Every quantitative metric we built with Rob passed.
But look closer. 'Overrated' and 'pretentious' don't describe the book; they describe how the reader felt about it. And 'whodunit' isn't even the right genre. The model picked it up from 'riddles he never answers.' It latched onto the language of the review without understanding the book. Our metrics are passing, but they're not giving us the right signals back.
But, I think we can ask a model to help us here. If a person can read these tags and tell us which ones work, maybe a model can too.
Oh nice! The model actually captured that certain tags are not helpful.
I think I can ask the model to evaluate all of the tags that my feature generated! And that's exactly what a Model Judge is. A Model Judge is a language model used to score your feature's output. It gives you a subjective rating — the kind of judgment call a person would make — but applied consistently across your entire dataset.
So let's talk about how this works. Here's the model powering your intelligent feature. Our BookTaggingService runs on-device because it needs to be fast and local for every user interaction. You can use a second model as a judge to evaluate your feature. Your judge should be at least as capable as the model you're evaluating. In our case, we can use a more capable model from Private Cloud Compute.
The model judge has a few key components.
The instruction tells the model it will be given book reviews, and how it should evaluate them.
The feature input is the prompt given to the feature being judged; in our case, it's the book review.
The feature output is the tags our service generated.
And finally, the scoring guide tells the model how to evaluate and score the feature. The Evaluations framework handles most of this for you, so you can focus on the scoring guide.
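To picture how the four components fit together, here is a plain-Swift sketch that renders them into one prompt. `JudgeRequest` and its layout are hypothetical; the framework assembles its own prompt, and the format it actually uses is not shown in this session.

```swift
import Foundation

// Hypothetical container for the four parts of a judge prompt.
struct JudgeRequest {
    let instruction: String
    let featureInput: String
    let featureOutput: [String]
    let scoringGuide: [Int: String]

    // Render the parts into one prompt, listing the highest score first.
    func render() -> String {
        let guide = scoringGuide
            .sorted { $0.key > $1.key }
            .map { "\($0.key): \($0.value)" }
            .joined(separator: "\n")
        return """
        \(instruction)

        Review:
        \(featureInput)

        Generated tags: \(featureOutput.joined(separator: ", "))

        Scoring guide:
        \(guide)
        """
    }
}

let request = JudgeRequest(
    instruction: "You will be given a book review and the tags generated for it. Score the tags.",
    featureInput: "A strange, dreamlike story full of riddles.",
    featureOutput: ["fantasy", "whimsical", "whodunit"],
    scoringGuide: [4: "Tags are relevant and helpful for browsing",
                   1: "Unhelpful or irrelevant"]
)
let prompt = request.render()
print(prompt)
```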
Putting it all together, here's a simple model judge. We've defined a "TagQuality" metric on a 1 to 4 scale, with each level describing what that score means. An even number of options prevents the judge from defaulting to a neutral middle score.
Four levels provides just enough distinction without diluting the meaning of each rating.
And finally, we've specified Private Cloud Compute as our judge model, giving us a more capable evaluator than the on-device model we're evaluating.
In the Evaluations framework, a model judge is just another Evaluator. It conforms to the same protocol as the quantitative evaluators and produces the same Metric type. So you can mix them freely within a single evaluation. Alright, let's run it! Every sample received a 3 or 4 quality score.
Let's go back to our "Alice in Wonderland" sample.
The model judge gave this a quality score of 3.
If we look at the rationale, we can identify that the model flagged 'whodunit' and 'detective-fiction' as not relevant to the book. But, we also expected it to flag all of these other tags that either reflect the reader's opinion or are not helpful for browsing.
With model judges, rationales are essential. They give you a window into why the judge scored what it scored.
And here's the thing: by the scale we wrote, the judge is actually right. Every tag connects to something that the user wrote. The judge is faithfully following the scoring guide we provided. We meant something specific by relevant and useful for browsing, and the judge interpreted those words differently than we did.
By asking the model to provide judgement for my feature, in my place, I expected it to provide a similar score to how I would have scored these tags.
When there is a mismatch between the model judge and us, we can refine the model judge until it can stand in for our own judgement.
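One simple way to quantify that mismatch is to score a handful of samples by hand and compare them with the judge. The sketch below is plain Swift with made-up scores; exact agreement and mean absolute difference are a choice made for this illustration, not measures named in the talk.

```swift
// Scores for the same samples from a person and from the judge (illustrative values).
let humanScores = [2, 4, 3, 1, 4]
let judgeScores = [3, 4, 3, 3, 4]

precondition(humanScores.count == judgeScores.count, "scores must be paired per sample")
let pairs = Array(zip(humanScores, judgeScores))

// Share of samples where the judge matched the person exactly.
let exactAgreement = Double(pairs.filter { $0.0 == $0.1 }.count) / Double(pairs.count)

// Average size of the disagreement, in scale points.
let totalGap = pairs.map { abs($0.0 - $0.1) }.reduce(0, +)
let meanAbsoluteDifference = Double(totalGap) / Double(pairs.count)

print("agreement: \(exactAgreement), mean gap: \(meanAbsoluteDifference)")
```

Tracking these two numbers across revisions of the scoring guide gives a rough sense of whether the judge is getting closer to your own judgement.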
Looking back, the problem with our first model judge was that it was too broad. It was asking two different questions.
When you find yourself disagreeing with a score, you should try and see if you can split the questions. In our case, relevance and usefulness are actually two different metrics. Let's take a look at defining "Relevance" as a ScoreDimension.
When we say the tags are relevant we mean that each tag describes a quality, theme, or tone of the book itself rather than small details or the reader's personal reactions.
And we can write that as the description for our ScoreDimension.
To score these tags, you'd walk through each one. Identify which tags are bad and which are good, based on whether or not they meaningfully describe the book.
You'd repeat this for every tag. In this case, all of the tags are good, which earns a score of 4 on our 1 to 4 scale. You would repeat the same process to define each scale in the scoring guide.
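The manual walk-through can be written down as a small scoring function, which helps when drafting the wording for each level of the scale. This is a plain-Swift sketch: `TagVerdict` is hypothetical, and the cutoffs that map the share of good tags to a score are assumptions, not values from the talk.

```swift
// One person's verdict on a single tag: does it describe the book itself?
struct TagVerdict {
    let tag: String
    let describesBook: Bool
}

// Map the share of good tags onto a 1 to 4 scale.
// The cutoffs are assumptions for this sketch.
func relevanceScore(for verdicts: [TagVerdict]) -> Int {
    guard !verdicts.isEmpty else { return 1 }
    let good = verdicts.filter { $0.describesBook }.count
    let share = Double(good) / Double(verdicts.count)
    if share == 1.0 { return 4 }   // every tag describes the book
    if share >= 0.75 { return 3 }  // most tags describe the book
    if share >= 0.5 { return 2 }   // a sizeable share of tags miss
    return 1                       // tags mostly fail to describe the book
}

let verdicts = [
    TagVerdict(tag: "fantasy", describesBook: true),
    TagVerdict(tag: "whimsical", describesBook: true),
    TagVerdict(tag: "overrated", describesBook: false),
    TagVerdict(tag: "whodunit", describesBook: false),
]
let score = relevanceScore(for: verdicts)
print(score)
```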
And that's our "Relevance" metric with the metric name, description, and scale that the model judge can use.
I can use the same process to define "Usefulness". Now, I can add both dimensions to the ModelJudgeEvaluator.
But dimensions alone aren't enough. They tell the judge what to measure, but not how to think about your app. Without that context, a judge evaluating tags for Book Tracker might treat a reader's criticism as a valid book descriptor. It has no way to know that Book Tracker is a personal library, not a review platform. And that's where the ModelJudgePrompt comes in.
This is an example of a ModelJudgePrompt. We can tell the judge it's evaluating tags for a personal library app in the instructions.
Format the response in the evaluationTarget, and pass the expectedTags as reference for the model to compare against.
For more details on ModelJudgePrompt, please see our documentation. Now that our model judge has the context it needs, let's rerun our evaluation.
In place of Quality we now have a relevance and usefulness score. And here is the evaluation result of our "Alice in Wonderland" book sample.
Notice how the two rationales separate the diagnosis. Relevance tells us what kind of tag is wrong. And Usefulness tells us how the wrong tags fail at browsing. With these results, I now have a clear path forward. I can update my BookTaggingService instructions, run the evaluation again, and watch the scores change. That's the feedback loop Rob walked us through, now powered by qualitative metrics.
When are you uploading to TestFlight?
Well Rob, I've been a little busy! Let's wrap up with a few best practices for evaluating your apps.
Start small. A focused dataset of 20 to 30 samples is a great place to get started. Spec out your app by thinking about how you want the model to behave.
Use heuristics to measure quantifiable traits. These rule-of-thumb metrics are a great way to start understanding your feature. The rule of thumb is: if you can measure it in code, then it's quantitative. And if you can only describe it in words, then you need a qualitative metric, using a ModelJudgeEvaluator.
Start simple with your model judge. Define your scoring dimension, run it, and read the rationales. You'll learn more from a single run than from hours of careful planning.
Use rationales to drive your next change. If the scores are all the same, your question is too broad. If you can't isolate the problem, split the dimensions. And if the judge doesn't understand your app, add context.
Well, I guess we should get back to work. Be sure to check out our documentation. And our sample code. And check out our other videos featuring the Evaluations framework: "Improve your prompts by hill climbing with Evaluations", and "Create robust evaluations for agentic apps". Later!
Bye!

// Evaluations
  import Evaluations

  struct BookTaggingEvaluation: Evaluation {
  
  }

8:02 - Run with Swift Testing and an optimization target

// Optimization Target
  @Test("Book Tag Evaluations", .evaluates(evaluation, info: evaluationInfo))
  func evaluateBookTagging() async throws {
      let result = EvaluationContext.current.result
  
      let rangeMetric = BookTagEvaluationTests.evaluation.tagCount
      #expect(result.aggregateValue(.mean(of: rangeMetric)) >= 0.8)
  }

10:09 - Constrain output with a Generable @Guide

// BookTags.swift
  @Generable
  struct BookTags: Codable {
      @Guide(description: "Descriptive tags capturing themes, genres, moods, and topics from the summary", .count(3...8))
      var tags: [String]
  }

11:15 - Define the dataset with ModelSample

// BookTaggingEvaluation
  var dataset = ArrayLoader(samples: [
      ModelSample(prompt: "okay I am OBSESSED and I need everyone to read this RIGHT NOW...",
                  expected: BookTags(tags: ["classic", "romance", "wit", "regency"])),

      ModelSample(prompt: "Read this in one sitting between midnight and 4am and I cannot...",
                  expected: BookTags(tags: ["classic", "gothic", "horror", "vampire", "suspense"])),
  ])
  
  // Or load your whole library:
  var dataset = ArrayLoader(samples:
      Book.sampleBooks.map { book in
          ModelSample(prompt: book.review, expected: BookTags(tags: book.tags))
      }
  )

12:53 - Synthesize more samples with a SampleGenerator

// Synthesizing more inputs
  // Declared with `var` because generated samples are appended below.
  var samples: [ModelSample<String>] = [
      ModelSample(prompt: "The largest planet in our solar system...", expected: "Jupiter."),
      ModelSample(prompt: "The capital of Thailand...", expected: "Bangkok."),
      ModelSample(prompt: "Swift is...", expected: "a powerful programming language."),
      ModelSample(prompt: "All those moments will be lost in time...", expected: "Like tears in rain.")
  ]
  
  for try await sample in samples.makeSamples(
      """
      Generate diverse sentence completions about the listed topics:
        - The Solar System
        - World Capitals 
      """,
      targetCount: 1000) {
          samples.append(sample)
  }

14:02 - More evaluators: word count and genre

let wordCount = Metric("WordCount")

  Evaluator { _, subject in
      for tag in subject.value.tags {
          if tag.contains(" ") {
              return wordCount.failing(rationale: "Tag \(tag) contains multiple words")
          }
      }
      return wordCount.passing()
  }

  let hasGenreTag = Metric("HasGenreTag")
  
  Evaluator { _, subject in
      let tags = subject.value.tags.map { $0.lowercased() }
      let knownGenres = await BookTaggingService.knownGenres
      for tag in tags {
          if knownGenres.contains(tag) {
              return hasGenreTag.passing(rationale: "Matched \(tag)")
          }
      }
      return hasGenreTag.failing() 
  }

14:03 - Define a Metric and Evaluator

let tagCount = Metric("TagCount")

  var evaluators: Evaluators {

      // Tag count is within the required 3–8 range
      Evaluator { _, subject in 
          let count = subject.value.tags.count
          if (3...8).contains(count) {
              return tagCount.passing(rationale: "\(count) tags")
          } 
          return tagCount.failing(rationale: "Got \(count) tags, expected 3–8")
      }
  }

14:27 - Aggregate metrics across samples

let tagCount = Metric("TagCount")
  let tagTotal = Metric("TagTotal")
  
  func aggregateMetrics(using aggregator: inout MetricsAggregator) {
      aggregator.computeMean(of: tagCount)
      aggregator.group("Distribution of Tag Totals") { aggregator in
          aggregator.computeStandardDeviation(of: tagTotal)
          aggregator.computeMean(of: tagTotal)
          aggregator.computeVariance(of: tagTotal)
      }
  }

15:33 - Iterate the feature's instructions (hill-climbing)

// BookTaggingService.swift
  let instructions = Instructions {
      """
      You are a librarian and literary analyst. Given a reader's
      freeform summary of a book they read — describing their
      thoughts, feelings, and what stood out — generate a set of
      descriptive tags reflected in the summary.

      Rules:
       - Return between 3 and 8 tags.
       - Tags should be lowercase, concise (single word or hyphenated), and descriptive.
       - Tags should include the book's genre, chosen from the included list of known genres.
  
      Known Genres:
       - \(Self.knownGenres.joined(separator: ", "))
      """
  }

18:53 - Build a model judge

ModelJudgeEvaluator(
      "TagQuality",
      scale: .numeric([
          4: "Tags are relevant and helpful for browsing",
          3: "Mostly relevant, one tag too vague or generic",
          2: "Several tags are wrong or generic",
          1: "Unhelpful or irrelevant"
      ]),   
      judge: PrivateCloudComputeLanguageModel()
  )

22:17 - Split into score dimensions

// BookTaggingEvaluation.swift
  ScoreDimension(
      "Relevance",
      description: """
          Whether each tag describes a quality, theme, or tone
          of the book itself rather than incidental details or
          the reader's personal reactions.
          """,
      scale: .numeric([
          4: "Every tag describes the book itself",
          3: "Most tags describe the book",
          2: "Some tags describe personal reactions",
          1: "Tags don't meaningfully describe the book"
      ])    
  )
  // Define `usefulness` the same way as a second ScoreDimension.

22:32 - Add dimensions to the judge

// BookTaggingEvaluation.swift
  var evaluators: Evaluators {

      Evaluator {  }  

      Evaluator {  }

      Evaluator {  }
  
      ModelJudgeEvaluator(
          judge: PrivateCloudComputeLanguageModel(),
          dimensions: [relevance, usefulness]
      )
  }

23:17 - Add app context with a ModelJudgePrompt

// BookTaggingEvaluation.swift
  ModelJudgeEvaluator(
      judge: PrivateCloudComputeLanguageModel(),
      dimensions: [relevance, usefulness],
      prompt: ModelJudgePrompt( 
          instructions: """
              You are evaluating tags generated for a personal book-tracking app where users
              organize their library by browsing and filtering tags.
              """,
          evaluationTarget: { value in
              "\(value.tags.count) Generated tags: " + value.tags.joined(separator: ", ")
          },
          reference: { input, _ in 
              let expectedTags = input.expected?.tags.joined(separator: ", ")
              return ["Expected Tags": expectedTags ?? "No expected tags defined"]
          }
      )
  )
