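  • Learn about the Evaluations framework

    Learn how to use the Evaluations framework to evaluate model-driven experiences. In a probabilistic world, unit tests alone aren't enough. Explore how to define metrics, automatically assess output quality, and aggregate statistics so that your AI-powered features run reliably across Apple platforms.

    Chapters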

    • 0:00 - Introduction
    • 3:10 - Demo app Book Tracker: a manual evaluation
    • 4:31 - Building your first evaluation
    • 8:06 - Running the evaluation and reading the report
    • 10:57 - Building robust datasets
    • 14:20 - Refining metrics and evaluators
    • 15:41 - Evaluation-driven development and hill-climbing
    • 16:12 - Model judges: qualitative metrics
    • 18:42 - Building a model judge
    • 21:19 - Refining with score dimensions
    • 23:45 - Reviewing dimension results
    • 24:20 - Best practices
    • 25:38 - Next steps

    Resources

    • Book Tracker: Using Evaluations to evaluate an intelligent feature
    • Designing datasets to test your feature
    • Designing effective evaluations
    • Evaluating language model responses

    Code

    • 4:54 - Define an Evaluation

      // Evaluations
        import Evaluations
      
        struct BookTaggingEvaluation: Evaluation {
        
        }
    • 8:02 - Run with Swift Testing and an optimization target

      // Optimization Target
        @Test("Book Tag Evaluations", .evaluates(evaluation, info: evaluationInfo))
        func evaluateBookTagging() async throws {
            let result = EvaluationContext.current.result
        
            let rangeMetric = BookTagEvaluationTests.evaluation.tagCount
            #expect(result.aggregateValue(.mean(of: rangeMetric)) >= 0.8)
        }
    • 10:09 - Constrain output with a Generable @Guide

      // BookTags.swift
        @Generable
        struct BookTags: Codable {
            @Guide(description: "Descriptive tags capturing themes, genres, moods, and topics from the summary", .count(3...8))
            var tags: [String]
        }
    • 11:15 - Define the dataset with ModelSample

      // BookTaggingEvaluation
        var dataset = ArrayLoader(samples: [
            ModelSample(prompt: "okay I am OBSESSED and I need everyone to read this RIGHT NOW...",
                        expected: BookTags(tags: ["classic", "romance", "wit", "regency"])),
      
            ModelSample(prompt: "Read this in one sitting between midnight and 4am and I cannot...",
                        expected: BookTags(tags: ["classic", "gothic", "horror", "vampire", "suspense"])),
        ])
        
        // Or load your whole library:
        var dataset = ArrayLoader(samples:
            Book.sampleBooks.map { book in
                ModelSample(prompt: book.review, expected: BookTags(tags: book.tags))
            }
        )
    • 12:53 - Synthesize more samples with a SampleGenerator

      // Synthesizing more inputs
        // Declared with `var` because the loop below appends to it.
        var samples: [ModelSample<String>] = [
            ModelSample(prompt: "The largest planet in our solar system...", expected: "Jupiter."),
            ModelSample(prompt: "The capital of Thailand...", expected: "Bangkok."),
            ModelSample(prompt: "Swift is...", expected: "a powerful programming language."),
            ModelSample(prompt: "All those moments will be lost in time...", expected: "Like tears in rain.")
        ]
        
        for try await sample in samples.makeSamples(
            """
            Generate diverse sentence completions about the listed topics:
              - The Solar System
              - World Capitals 
            """,
            targetCount: 1000) {
                samples.append(sample)
        }
    • 14:02 - More evaluators: word count and genre

      let wordCount = Metric("WordCount")
      
        Evaluator { _, subject in
            for tag in subject.value.tags {
                if tag.contains(" ") {
                    return wordCount.failing(rationale: "Tag \(tag) contains multiple words")
                }
            }
            return wordCount.passing()
        }
      
        let hasGenreTag = Metric("HasGenreTag")
        
        Evaluator { _, subject in
            let tags = subject.value.tags.map { $0.lowercased() }
            let knownGenres = await BookTaggingService.knownGenres
            for tag in tags {
                if knownGenres.contains(tag) {
                    return hasGenreTag.passing(rationale: "Matched \(tag)")
                }
            }
            return hasGenreTag.failing() 
        }
    • 14:03 - Define a Metric and Evaluator

      let tagCount = Metric("TagCount")
      
        var evaluators: Evaluators {
      
            // Tag count is within the required 3–8 range
            Evaluator { _, subject in 
                let count = subject.value.tags.count
                if (count >= 3 && count <= 8) {
                    return tagCount.passing(rationale: "\(count) tags")
                } 
                return tagCount.failing(rationale: "Got \(count) tags, expected 3–8")
            }
        }
    • 14:27 - Aggregate metrics across samples

      let tagCount = Metric("TagCount")
        let tagTotal = Metric("TagTotal")
        
        func aggregateMetrics(using aggregator: inout MetricsAggregator) {
            aggregator.computeMean(of: tagCount)
            aggregator.group("Distribution of Tag Totals") { aggregator in
                aggregator.computeStandardDeviation(of: tagTotal)
                aggregator.computeMean(of: tagTotal)
                aggregator.computeVariance(of: tagTotal)
            }
        }
    • 15:33 - Iterate the feature's instructions (hill-climbing)

      // BookTaggingService.swift
        let instructions = Instructions {
            """
            You are a librarian and literary analyst. Given a reader's
            freeform summary of a book they read — describing their
            thoughts, feelings, and what stood out — generate a set of
            descriptive tags reflected in the summary.
      
            Rules:
             - Return between 3 and 8 tags.
             - Tags should be lowercase, concise (single word or hyphenated), and descriptive.
             - Tags should include the book's genre, chosen from the included list of known genres.
        
            Known Genres:
             - \(Self.knownGenres.joined(separator: ", "))
            """
        }
    • 18:53 - Build a model judge

      ModelJudgeEvaluator(
            "TagQuality",
            scale: .numeric([
                4: "Tags are relevant and helpful for browsing",
                3: "Mostly relevant, one tag too vague or generic",
                2: "Several tags are wrong or generic",
                1: "Unhelpful or irrelevant"
            ]),   
            judge: PrivateCloudComputeLanguageModel()
        )
    • 22:17 - Split into score dimensions

      // BookTaggingEvaluation.swift
        ScoreDimension(
            "Relevance",
            description: """
                Whether each tag describes a quality, theme, or tone
                of the book itself rather than incidental details or
                the reader's personal reactions.
                """,
            scale: .numeric([
                4: "Every tag describes the book itself",
                3: "Most tags describe the book",
                2: "Some tags describe personal reactions",
                1: "Tags don't meaningfully describe the book"
            ])    
        )
        // Define `usefulness` the same way as a second ScoreDimension.
    • 22:32 - Add dimensions to the judge

      // BookTaggingEvaluation.swift
        var evaluators: Evaluators {
      
            Evaluator {  }  
      
            Evaluator {  }
      
            Evaluator {  }
        
            ModelJudgeEvaluator(
                judge: PrivateCloudComputeLanguageModel(),
                dimensions: [relevance, usefulness]
            )
        }
    • 23:17 - Add app context with a ModelJudgePrompt

      // BookTaggingEvaluation.swift
        ModelJudgeEvaluator(
            judge: PrivateCloudComputeLanguageModel(),
            dimensions: [relevance, usefulness],
            prompt: ModelJudgePrompt( 
                instructions: """
                    You are evaluating tags generated for a personal book-tracking app where users
                    organize their library by browsing and filtering tags.
                    """,
                evaluationTarget: { value in
                    "\(value.tags.count) Generated tags: " + value.tags.joined(separator: ", ")
                },
                reference: { input, _ in 
                    let expectedTags = input.expected?.tags.joined(separator: ", ")
                    return ["Expected Tags": expectedTags ?? "No expected tags defined"]
                }
            )
        )
    Summary

    • 0:00 - Introduction
    • Rob Rhyne and Yada introduce the Evaluations framework. Generative-AI features break the "same input, same output" contract that unit tests rely on, so a new, more robust form of testing is needed to measure how often features produce unexpected or unsafe results.

    • 3:10 - Demo app Book Tracker: a manual evaluation
    • Introduces the Book Tracker demo app and its BookTaggingService, which auto-tags books from reviews. Trying it in a #playground surfaces issues (too many tags, book title as a tag, multi-word tags) and produces a first human-judged list of expectations.

    • 4:31 - Building your first evaluation
    • Implement the Evaluation protocol in five steps: define the subject (the code under test), the dataset of ModelSample inputs with expected values, a Metric and Evaluator (pass/fail on tag count), and an aggregateMetrics summary.
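
The five parts fit together as a small loop: run the subject on each sample, measure the output, then summarize. The following is a minimal plain-Swift sketch of that shape only. It does not import or reproduce the Evaluations framework; `Sample`, `Measurement`, `tagBook(from:)`, and `evaluateTagCount(_:)` are hypothetical stand-ins, and the subject is a trivial stub rather than a model call.

```swift
// Hypothetical stand-in types; none of these come from the Evaluations framework.
struct Sample {
    let prompt: String
    let expectedTags: [String]
}

struct Measurement {
    let metric: String
    let passed: Bool
    let rationale: String
}

// 1. Subject: the code under test. This stub "tags" by splitting the review into words.
func tagBook(from review: String) -> [String] {
    review.lowercased().split(separator: " ").map(String.init)
}

// 2. Dataset: inputs paired with expected values.
let dataset = [
    Sample(prompt: "gothic horror suspense", expectedTags: ["gothic", "horror", "suspense"]),
    Sample(prompt: "romance wit", expectedTags: ["romance", "wit"]),
]

// 3 and 4. Metric and evaluator: pass/fail on whether the tag count is within 3...8.
func evaluateTagCount(_ tags: [String]) -> Measurement {
    let count = tags.count
    if (3...8).contains(count) {
        return Measurement(metric: "TagCount", passed: true, rationale: "\(count) tags")
    }
    return Measurement(metric: "TagCount", passed: false,
                       rationale: "Got \(count) tags, expected 3 to 8")
}

// 5. Aggregate: summarize the per-sample measurements as a pass rate.
let measurements = dataset.map { evaluateTagCount(tagBook(from: $0.prompt)) }
let passRate = Double(measurements.filter { $0.passed }.count) / Double(measurements.count)
print("TagCount pass rate: \(passRate)")  // prints "TagCount pass rate: 0.5"
```

Keeping the rationale on every measurement mirrors the snippets above, where both `passing` and `failing` accept one; it is what makes a failed sample diagnosable later.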

    • 8:06 - Running the evaluation and reading the report
    • Run evaluations through Swift Testing with the evaluates trait and an optimization target (#expect average at least 80%). The new evaluation test report breaks down per-sample results, prompts, measurements, and the full model response.
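
An optimization target of this kind reduces to one comparison: treat each pass as 1 and each fail as 0, take the mean, and check it against the threshold. The sketch below illustrates that arithmetic in plain Swift under stated assumptions. `SampleResult`, `meanPassRate(_:)`, and `meetsTarget(_:target:)` are hypothetical names, not framework APIs, and the sample data is placeholder input.

```swift
// Hypothetical per-sample record; the real report is produced by the framework.
struct SampleResult {
    let prompt: String
    let passed: Bool
}

// Mean pass rate, counting a pass as 1 and a fail as 0. Returns nil for an empty run.
func meanPassRate(_ results: [SampleResult]) -> Double? {
    guard !results.isEmpty else { return nil }
    let passes = results.filter { $0.passed }.count
    return Double(passes) / Double(results.count)
}

// True when the run meets or exceeds the optimization target.
func meetsTarget(_ results: [SampleResult], target: Double) -> Bool {
    guard let mean = meanPassRate(results) else { return false }
    return mean >= target
}

// Placeholder run: four passing samples and one failing sample.
let run = [
    SampleResult(prompt: "review A", passed: true),
    SampleResult(prompt: "review B", passed: true),
    SampleResult(prompt: "review C", passed: true),
    SampleResult(prompt: "review D", passed: true),
    SampleResult(prompt: "review E", passed: false),
]

// The failing prompts are where analysis starts when the target is missed.
let failingPrompts = run.filter { !$0.passed }.map { $0.prompt }
print(meetsTarget(run, target: 0.8), failingPrompts)  // prints "true ["review E"]"
```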

    • 10:57 - Building robust datasets
    • Two samples aren't enough; good datasets have thousands with variety (genres, review lengths, fiction/non-fiction, forms, personal opinions). Hand-authoring doesn't scale, so the framework's SampleGenerator synthesizes more samples from a seed set.
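
Before synthesizing more samples, it can help to see which kinds of input the seed set is missing. The following plain-Swift sketch counts samples per genre and lists genres with no coverage. It is an illustration only: `ReviewSample` and `genreCoverage(of:expecting:)` are hypothetical, the reviews are placeholders, and the framework's own dataset types are not used.

```swift
// Hypothetical sample type for illustration; not the framework's ModelSample.
struct ReviewSample {
    let review: String
    let genre: String
}

// Counts samples per genre, including expected genres that have no samples yet.
func genreCoverage(of samples: [ReviewSample], expecting genres: [String]) -> [String: Int] {
    var counts: [String: Int] = [:]
    for genre in genres {
        counts[genre] = 0
    }
    for sample in samples {
        counts[sample.genre, default: 0] += 1
    }
    return counts
}

let seedSamples = [
    ReviewSample(review: "placeholder review one", genre: "romance"),
    ReviewSample(review: "placeholder review two", genre: "horror"),
    ReviewSample(review: "placeholder review three", genre: "horror"),
]

let coverage = genreCoverage(of: seedSamples, expecting: ["romance", "horror", "biography"])

// Genres with zero samples are candidates for hand-written or synthesized additions.
let uncovered = coverage.filter { $0.value == 0 }.keys.sorted()
print(uncovered)  // prints "["biography"]"
```

The same counting approach extends to other axes named above, such as review length or fiction versus non-fiction.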

    • 14:20 - Refining metrics and evaluators
    • Add metrics for deeper insight: TagTotal with a scoring (not pass/fail) evaluator, range compliance and distribution, a word-count check, and a genre check against knownGenres. Together they cover three of the five original expectations and are tracked alongside instruction changes.
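
For a scoring metric such as TagTotal, the distribution summary comes down to three statistics. The sketch below computes them in plain Swift for a placeholder list of tag totals. It assumes population variance (dividing by n); whether the framework's `computeVariance` uses the population or the sample formula is not stated in the source. The function names are hypothetical.

```swift
// Arithmetic mean. Requires at least one value.
func mean(_ values: [Double]) -> Double {
    precondition(!values.isEmpty, "mean requires at least one value")
    return values.reduce(0, +) / Double(values.count)
}

// Population variance: the average squared distance from the mean (assumption: divide by n).
func variance(_ values: [Double]) -> Double {
    let m = mean(values)
    let squaredDeviations = values.map { ($0 - m) * ($0 - m) }
    return squaredDeviations.reduce(0, +) / Double(values.count)
}

// Standard deviation: the square root of the variance.
func standardDeviation(_ values: [Double]) -> Double {
    variance(values).squareRoot()
}

// Placeholder tag totals, one per evaluated sample.
let tagTotals: [Double] = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(tagTotals), variance(tagTotals), standardDeviation(tagTotals))
// prints "5.0 4.0 2.0"
```

A mean inside the 3 to 8 range with a large standard deviation would still indicate that individual samples often fall outside it, which is why the distribution is reported alongside the pass/fail metric.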

    • 15:41 - Evaluation-driven development and hill-climbing
    • Recap the loop: a failing optimization target prompts analysis and a change (adding a count range to the @Guide on the BookTags Generable). Re-running to check the result is hill-climbing; centering development on it is evaluation-driven development.
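
The loop described here is a greedy search: try a change, re-run, and keep the change only if the score improves. The following is a minimal plain-Swift sketch of that control flow. Everything in it is hypothetical, including `Candidate`, `hillClimb(from:trying:score:)`, and especially `standInScore(_:)`, which merely counts rule keywords in the instructions as a stand-in for an actual evaluation run.

```swift
import Foundation

// Hypothetical record for one version of the feature's instructions.
struct Candidate {
    let name: String
    let instructions: String
}

// Stand-in for running the evaluation: the fraction of three rule keywords
// that the instructions mention. A real score would come from an evaluation run.
func standInScore(_ candidate: Candidate) -> Double {
    let keywords = ["between 3 and 8", "lowercase", "genre"]
    let hits = keywords.filter { candidate.instructions.contains($0) }.count
    return Double(hits) / Double(keywords.count)
}

// Greedy hill-climbing: adopt a candidate only if it scores strictly higher
// than the best version seen so far.
func hillClimb(from baseline: Candidate,
               trying candidates: [Candidate],
               score: (Candidate) -> Double) -> (best: Candidate, accepted: [String]) {
    var best = baseline
    var bestScore = score(baseline)
    var accepted = [baseline.name]
    for candidate in candidates {
        let candidateScore = score(candidate)
        if candidateScore > bestScore {
            best = candidate
            bestScore = candidateScore
            accepted.append(candidate.name)
        }
    }
    return (best, accepted)
}

let baseline = Candidate(name: "v0", instructions: "Generate tags for the book.")
let candidates = [
    Candidate(name: "v1", instructions: "Return between 3 and 8 tags."),
    Candidate(name: "v2", instructions: "Tags should be lowercase."),
    Candidate(name: "v3", instructions: "Return between 3 and 8 tags. Tags should be lowercase and include the genre."),
]

let outcome = hillClimb(from: baseline, trying: candidates, score: standInScore)
print(outcome.best.name, outcome.accepted)  // prints "v3 ["v0", "v1", "v3"]"
```

Here v2 is rejected because it scores the same as v1, not higher. Requiring a strict improvement keeps changes that make no measurable difference out of the feature.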

    • 16:12 - Model judges: qualitative metrics
    • Quantitative metrics can pass while tags are still wrong (reader opinions, mis-inferred genres). A model judge uses a second, at-least-as-capable model (here, Private Cloud Compute) to score output the way a person would, consistently across the dataset.

    • 18:42 - Building a model judge
    • A ModelJudgeEvaluator is just another Evaluator producing the same Metric type. Define a TagQuality metric on a 1-to-4 scale (an even number of levels avoids a neutral default), specify the judge model, run it, and read the rationales.
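
The point about an even number of levels can be checked mechanically: a scale with an odd number of levels has a middle value a judge can default to, and an even one does not. The sketch below shows this in plain Swift; `neutralLevel(of:)` is a hypothetical helper and not part of the framework.

```swift
// Returns the middle level of an integer scale, or nil when the scale has an
// even number of levels and therefore no neutral midpoint.
func neutralLevel(of scale: ClosedRange<Int>) -> Int? {
    let levels = scale.count
    guard levels % 2 == 1 else { return nil }
    return scale.lowerBound + levels / 2
}

let fourPoint = neutralLevel(of: 1...4)  // nil: every score leans toward good or bad
let fivePoint = neutralLevel(of: 1...5)  // 3: a neutral option exists
```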

    • 21:19 - Refining with score dimensions
    • When you disagree with a score, the question is often too broad. Split it into ScoreDimensions (Relevance vs. Usefulness), each with its own description and scale, and add a ModelJudgePrompt to give the judge context about your app.

    • 23:45 - Reviewing dimension results
    • Re-running yields separate relevance and usefulness scores whose rationales split the diagnosis: relevance shows what kind of tag is wrong, usefulness shows how it fails at browsing, giving a clear path back into the hill-climbing loop.
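
Once scores are split by dimension, a per-dimension mean shows which aspect of the output to work on first. The following plain-Swift sketch groups placeholder judge scores by dimension and picks the weakest one. `DimensionScore` and `meanScores(_:)` are hypothetical, and the scores and rationales are illustrative placeholders, not results from the session.

```swift
// Hypothetical record of one judge score for one sample and one dimension.
struct DimensionScore {
    let dimension: String
    let score: Int
    let rationale: String
}

// Groups scores by dimension and returns the mean score for each.
func meanScores(_ scores: [DimensionScore]) -> [String: Double] {
    let grouped = Dictionary(grouping: scores) { $0.dimension }
    return grouped.mapValues { group in
        Double(group.map { $0.score }.reduce(0, +)) / Double(group.count)
    }
}

// Placeholder judge output on the 1-to-4 scale.
let judged = [
    DimensionScore(dimension: "Relevance", score: 4, rationale: "placeholder rationale"),
    DimensionScore(dimension: "Relevance", score: 2, rationale: "placeholder rationale"),
    DimensionScore(dimension: "Usefulness", score: 2, rationale: "placeholder rationale"),
    DimensionScore(dimension: "Usefulness", score: 1, rationale: "placeholder rationale"),
]

let means = meanScores(judged)

// The lowest-scoring dimension is the first place to read rationales.
let weakest = means.min { $0.value < $1.value }?.key
print(weakest ?? "none")  // prints "Usefulness"
```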

    • 24:20 - Best practices
    • Start small (20 to 30 focused samples), use heuristics for quantitative traits (if you can measure it in code), use ModelJudgeEvaluator for qualitative ones, start simple with the judge, and let rationales drive the next change.

    • 25:38 - Next steps
    • Pointers to the Evaluations framework documentation, the Shelf sample code, and the companion sessions on hill-climbing prompts and creating robust evaluations for agentic apps.

  • Videos
  • WWDC26
  • Learn about the Evaluations framework
    Copyright © 2026 Apple Inc. All rights reserved.