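  • Build robust evaluations for agentic apps

    Learn how to use advanced features of the Evaluations framework to build robust evaluations for your app. Explore evaluation flows that involve tool calling and dynamic conditions, and how to define correct behavior for your use case. Learn how to generate synthetic data, use judge models effectively, and validate your dataset to get reliable results.

    Chapters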

    • 0:00 - Introduction
    • 2:21 - The dataset problem in BookTracker
    • 3:46 - Generating synthetic data with makeSamples
    • 6:27 - Customizing generation with SampleGenerator
    • 8:38 - Sampling strategies
    • 10:11 - Validating synthetic samples
    • 13:04 - Comparing evaluation results
    • 15:09 - Tool calling and tool evaluations
    • 18:54 - Trajectory expectations
    • 21:26 - Building a tool call evaluation
    • 22:02 - Synthetic data for tool evaluations
    • 23:49 - Next steps

    Resources

    • Book Tracker: Using Evaluations to evaluate an intelligent feature
    • Generating synthetic datasets
    • Evaluating tool-calling behavior
    • Scoring with model-as-judge evaluators
    • 5:16 - Generate synthetic data with makeSamples

      // Synthetic data
        let prompt = Prompt("""
            Generate a diverse range of book reviews and corresponding tags.
            Cover a wide range of genres, time periods, cultures, and
            reader personas. Do not repeat books already in the dataset.
            """)
        
        let dataset = Book.sampleBooks.map { book in
            ModelSample(prompt: book.review, expected: BookTags(tags: book.tags))
        }
        
        let targetCount = 100
        var expandedDataset = dataset
      
        for try await sample in dataset.makeSamples(prompt, targetCount: targetCount) {
            expandedDataset.append(sample)
            print("Generated \(expandedDataset.count) samples so far.")
        }
      
    • 5:53 - Configure a custom SampleGenerator

      // Define your own configuration
        let generator = SampleGenerator<ModelSample<BookTags>>(
            prompt,
            samples: dataset,
            targetCount: targetCount,
            sessionProvider: {
                LanguageModelSession( 
                    model: PrivateCloudComputeLanguageModel(),
                    instructions: """
                        You are a synthetic data generator for a book-tracking app's evaluation suite.
                        Your job is to produce realistic, diverse book entries that will stress-test
                        a tagging system.
      
                        Rules:
                        - Review must be at least 100 characters long.
                        - Review should cover a mix of genre, mood/tone, and themes.
                        - Reviews should vary in length.
                        - Create between 3 and 8 tags.
                        - Tags must be lowercase.
                        """ 
                )
            }
        )
    • 10:37 - Validate generated samples

      // Define validation metrics
        validator: { sample in
            guard let book = sample.expected else { return false }
      
            // Review must be at least 100 characters
            guard sample.promptDescription.count >= 100 else { return false }
      
            // Must have between 3 and 8 tags
            guard (3...8).contains(book.tags.count) else { return false }
      
            // All tags must be lowercase
            guard book.tags.allSatisfy({ $0 == $0.lowercased() }) else { return false }
      
            return true
        }
    • 10:58 - Access valid and invalid results

      // Accessing results
        for try await sample in generator.run() {
            // During iteration
            expandedDataset.append(sample)
        }
      
        // After iteration
        let allSamples = await generator.samples
        let invalidSamples = await generator.invalidSamples
        
        print("Generated \(allSamples.count) new samples. Total: \(expandedDataset.count)")
    • 15:30 - Define a tool's Generable argument

      @Generable
        struct SearchBooksArguments {
            @Guide(description: "A freeform search term to match against titles, reviews, or tags")
            var query: String?
        
            @Guide(description: "Filter results to books with this specific tag")
            var tag: String?
      
            @Guide(description: "Filter results by mood")
            var mood: String?
      
            @Guide(description: "Filter results by genre")
            var genre: String?
      
            @Guide(description: "Maximum number of results to return. Defaults to 5.")
            var limit: Int? 
        }
    • 16:37 - A basic trajectory expectation

      // "Find books tagged gothic"
        TrajectoryExpectation(
            unordered: [
                ToolExpectation(
                    "searchBooks",
                    arguments: [
                        .exact(argumentName: "tag", value: .string("gothic"))
                    ]
                )
            ]
        )
    • 17:07 - Match arguments by intent (naturalLanguage)

      // "Find something cheerful"
        TrajectoryExpectation(
            unordered: [
                ToolExpectation(
                    "searchBooks",
                    arguments: [
                        .naturalLanguage(
                            argumentName: "mood",
                            criteria: "Should relate to uplifting, hopeful, or positive feelings"
                        )
                    ]
                )
            ]
        )
        // Other matchers available: .contains, .oneOf, .pattern, .range, and more.
    • 17:34 - Expect tool calls in order

      // "Find gothic books and show details on the first"
        TrajectoryExpectation(
            ordered: [
                ToolExpectation(
                    "searchBooks",
                    arguments: [
                        .exact(argumentName: "tag", value: .string("gothic"))
                    ]
                ),
                ToolExpectation(
                    "getBookDetails",
                    arguments: [
                        .keyOnly(argumentName: "bookId")
                    ]
                )
            ]
        )
    • 17:55 - Disallow specific tool calls

      // "Show only sci-fi books. Don't look for similar ones."
        TrajectoryExpectation(
            unordered: [
                ToolExpectation(
                    "searchBooks",
                    arguments: [
                        .naturalLanguage(
                            argumentName: "genre",
                            criteria: "Should refer to science fiction")
                    ]
                )
            ],
            disallowed: [
                ToolExpectation("findSimilarBooks")
            ]
        )
    • 18:14 - Build a tool call evaluation

      // Tool call evaluations
        let samples = SampleArrayLoader(samples: [
            ModelSample(
                prompt: "Find all the books tagged with 'gothic'.",
                instructions: "Help the user explore their book collection.",
                expectations: TrajectoryExpectation(  )
            )
        ])
      
        struct BookLibraryToolCallEval: Evaluation {
            var dataset = samples
      
            let pass = Metric("All Passed")
            let percent = Metric("Percentage Passed")
      
            var evaluators: Evaluators { 
                ToolCallEvaluator(allPass: pass, percentagePass: percent)
            }
        }
    • 19:20 - Synthesize tool-evaluation samples

      // Tool call evaluations
        let prompt = Prompt("""
            Generate diverse user queries for a personal book library assistant.
            Each sample needs a prompt (what the user says), and a trajectory
            expectation describing which tools should be called and in what order.
            """)
      
        let instructions = """
            AVAILABLE TOOLS:
            - searchBooks(query?, tag?, mood?, genre?, limit?): search the library
            - getBookDetails(bookId): full details for one book
            - findSimilarBooks(bookId, maxResults?): find books sharing tags
            ORDER REQUIREMENTS:
            - searchBooks must come before getBookDetails or findSimilarBooks
            - Use TrajectoryExpectation(ordered:) when sequence matters, else (unordered:)
            USE THESE ARGUMENT MATCHERS:
            - .exact for precise values, .naturalLanguage for fuzzy matching
            - .keyOnly when any value is acceptable, .range for numeric constraints
            - .contains/.hasPrefix/.hasSuffix for partial string matching
            """
    • 19:51 - Validate tool-evaluation samples

      // Tool call evaluations
        validator: { sample in
            // Must have expectations defined
            guard let expectations = sample.output.expectations else { return false }
      
            // Must reference at least one tool
            let totalExpectations = expectations.ordered.count + expectations.unordered.count
            guard totalExpectations > 0 else { return false }
      
            // All tool names must be from the valid set
            let validTools: Set<String> = ["searchBooks", "getBookDetails", "findSimilarBooks"]
            let allExpectations = expectations.ordered + expectations.unordered + expectations.disallowed
            for expectation in allExpectations {
                guard validTools.contains(expectation.name) else { return false }
            }
        
            return true
        }
      
    • 0:00 - Introduction
    • Ada Wong and Kyle Murray introduce advanced features of the Evaluations framework (new in Xcode 27) and outline the agenda: growing your dataset with synthetic data, then building robust evaluations for agentic, tool-calling workflows. The session focuses on the develop-and-evaluate step of hill-climbing.

    • 2:21 - The dataset problem in BookTracker
    • The BookTracker app auto-tags books from reviews, but its 13 hand-written sampleBooks give only a narrow view. Real-world reviews span countless books, genres, lengths, and styles, too much variety to capture by hand.

    • 3:46 - Generating synthetic data with makeSamples
    • The makeSamples API takes a prompt, a seed dataset of ModelSample values that map a review to its expected tags, and a target count (the full resulting size, including your seeds). It returns an async stream of new samples. Coverage of real usage matters more than raw quantity.
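
    To make the target-count arithmetic concrete, here is a minimal sketch that does not touch the Evaluations framework. `newSamplesNeeded` is a hypothetical helper, and it assumes the reading given in the summary above: the target count is the size of the final dataset, seeds included.

```swift
// Hypothetical helper, not part of the Evaluations framework.
// Assumes targetCount is the size of the final dataset, seeds included.
func newSamplesNeeded(seedCount: Int, targetCount: Int) -> Int {
    // Never ask for a negative number of samples when the seeds already meet the target.
    return max(0, targetCount - seedCount)
}

let seedCount = 13      // the hand-written sampleBooks
let targetCount = 100   // desired size of the expanded dataset
print(newSamplesNeeded(seedCount: seedCount, targetCount: targetCount)) // prints 87
```

    Under that assumption, appending every streamed sample to a copy of the 13 seeds ends with 100 entries.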

    • 6:27 - Customizing generation with SampleGenerator
    • For more control, SampleGenerator exposes a sessionProvider closure that picks the model (such as Private Cloud Compute) and the instructions. The session is reused across batches, but it can exhaust its context window mid-run, and the provider may then be called again. Keep the instructions self-contained so that a fresh session needs nothing from the previous one.
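
    The reason for self-contained instructions can be illustrated with a toy model. Everything below is a stand-in written for this sketch (`StandInSession`, `runBatches`, and the numeric "budget" that plays the role of a context window); none of it is the framework's API, and how the real generator decides to request a new session is not described in the session.

```swift
// Stand-in types for illustration only; not Foundation Models or Evaluations API.
final class StandInSession {
    let instructions: String
    private(set) var remainingBudget: Int

    init(instructions: String, budget: Int) {
        self.instructions = instructions
        self.remainingBudget = budget
    }

    // Returns false when the batch does not fit in what is left of the budget.
    func generateBatch(cost: Int) -> Bool {
        guard cost <= remainingBudget else { return false }
        remainingBudget -= cost
        return true
    }
}

// Runs `batchCount` batches and asks the provider for a fresh session whenever the
// current one runs out of budget. Returns how many sessions were created.
// Assumes a single batch always fits in a fresh session.
func runBatches(batchCount: Int, batchCost: Int, provider: () -> StandInSession) -> Int {
    var session = provider()
    var sessionsCreated = 1
    for _ in 0..<batchCount {
        if !session.generateBatch(cost: batchCost) {
            // The old session's history is gone; only the instructions carry over.
            session = provider()
            sessionsCreated += 1
            _ = session.generateBatch(cost: batchCost)
        }
    }
    return sessionsCreated
}

let sessionsUsed = runBatches(batchCount: 5, batchCost: 40) {
    StandInSession(instructions: "Create between 3 and 8 tags. Tags must be lowercase.", budget: 100)
}
print(sessionsUsed) // prints 3
```

    Because each replacement session starts from the provider's instructions alone, any rule that lived only in earlier conversation turns would be lost at that point.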

    • 8:38 - Sampling strategies
    • The samplingStrategy controls which seed samples are shown to the model as in-context examples: random (a varied subset, the default) or slidingWindow (sequential, for datasets with meaningful order).
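
    A rough sketch of the two selection styles follows. `SeedSelection` and `selectSeeds` are hypothetical names, and the window here advances by its own size and wraps around, which is an assumption; the session does not say how the framework's slidingWindow steps through the seeds.

```swift
// Hypothetical stand-ins; the framework's samplingStrategy API is not reproduced here.
enum SeedSelection {
    case random
    case slidingWindow
}

// Picks `count` seeds to show the model as in-context examples for one batch.
// Assumes batchIndex is non-negative.
func selectSeeds<T>(from seeds: [T], count: Int, batchIndex: Int, strategy: SeedSelection) -> [T] {
    guard !seeds.isEmpty, count > 0 else { return [] }
    let size = min(count, seeds.count)
    switch strategy {
    case .random:
        // A varied subset: shuffle, then take the first `size` elements.
        return Array(seeds.shuffled().prefix(size))
    case .slidingWindow:
        // A contiguous run that advances with each batch and wraps around, preserving order.
        let start = (batchIndex * size) % seeds.count
        return (0..<size).map { seeds[(start + $0) % seeds.count] }
    }
}

let seeds = ["a", "b", "c", "d", "e"]
print(selectSeeds(from: seeds, count: 2, batchIndex: 0, strategy: .slidingWindow)) // ["a", "b"]
print(selectSeeds(from: seeds, count: 2, batchIndex: 1, strategy: .slidingWindow)) // ["c", "d"]
print(selectSeeds(from: seeds, count: 2, batchIndex: 2, strategy: .slidingWindow)) // ["e", "a"]
```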

    • 10:11 - Validating synthetic samples
    • A validator closure accepts or rejects each generated sample in isolation against systematic rules: review length at least 100 characters, 3 to 8 tags, lowercase tags. Valid samples collect in samples, rejects in invalidSamples, both updated in real time.
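
    The accept-or-reject split can be tried out without the framework. In this sketch `StandInSample`, `isValid`, and `partition` are hypothetical names; only the three rules themselves come from the session.

```swift
// Stand-in sample type for illustration; the framework's ModelSample is not used here.
struct StandInSample {
    let review: String
    let tags: [String]
}

// Applies the three rules from the session: review length, tag count, lowercase tags.
func isValid(_ sample: StandInSample) -> Bool {
    guard sample.review.count >= 100 else { return false }
    guard (3...8).contains(sample.tags.count) else { return false }
    return sample.tags.allSatisfy { $0 == $0.lowercased() }
}

// Splits generated samples into accepted and rejected, judging each one on its own.
func partition(_ generated: [StandInSample]) -> (valid: [StandInSample], invalid: [StandInSample]) {
    var valid: [StandInSample] = []
    var invalid: [StandInSample] = []
    for sample in generated {
        if isValid(sample) { valid.append(sample) } else { invalid.append(sample) }
    }
    return (valid, invalid)
}

let longReview = String(repeating: "A slow, eerie story. ", count: 6) // 126 characters
let generated = [
    StandInSample(review: longReview, tags: ["gothic", "slow-burn", "eerie"]), // passes
    StandInSample(review: "Too short.", tags: ["gothic", "slow-burn", "eerie"]), // review too short
    StandInSample(review: longReview, tags: ["Gothic", "eerie", "dark"]),        // uppercase tag
    StandInSample(review: longReview, tags: ["gothic"])                          // too few tags
]
let result = partition(generated)
print(result.valid.count, result.invalid.count) // prints "1 3"
```

    Keeping the rejects around, rather than discarding them, makes it possible to see which rule the generator breaks most often.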

    • 13:04 - Comparing evaluation results
    • Using the Xcode 27 Evaluations Report, compare the 13-sample run against the 100-sample run. The quality scores drop: the feature only looked good on the small dataset. A drop can signal issues in the prompt, the feature, the evaluation, or the dataset.

    • 15:09 - Tool calling and tool evaluations
    • Features often make multiple tool calls behind the scenes, and a plausible answer can come from the wrong path. Tool evaluations verify the how: correct tools, correct arguments, correct order, and no surprises. The session illustrates this with searchBooks, getBookDetails, and findSimilarBooks.
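
    To show what "correct arguments" could mean mechanically, here is a small sketch. `StandInMatcher` and `callSatisfies` are hypothetical; the case names echo matchers mentioned in the session, but the real types, their parameters, and how argument values are represented are assumptions (arguments are plain strings here).

```swift
import Foundation

// Hypothetical matcher type for illustration; not the framework's API.
enum StandInMatcher {
    case exact(name: String, value: String)
    case keyOnly(name: String)
    case contains(name: String, substring: String)

    // Checks this matcher against the arguments of one recorded tool call.
    func matches(_ arguments: [String: String]) -> Bool {
        switch self {
        case .exact(let name, let value):
            return arguments[name] == value
        case .keyOnly(let name):
            // Any value is acceptable as long as the argument is present.
            return arguments[name] != nil
        case .contains(let name, let substring):
            return arguments[name]?.contains(substring) ?? false
        }
    }
}

// A call satisfies an expectation only when every matcher agrees.
func callSatisfies(_ matchers: [StandInMatcher], arguments: [String: String]) -> Bool {
    return matchers.allSatisfy { $0.matches(arguments) }
}

let recordedArguments = ["tag": "gothic", "limit": "5"]
print(callSatisfies([.exact(name: "tag", value: "gothic")], arguments: recordedArguments))      // true
print(callSatisfies([.contains(name: "tag", substring: "goth")], arguments: recordedArguments)) // true
print(callSatisfies([.keyOnly(name: "bookId")], arguments: recordedArguments))                  // false
```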

    • 18:54 - Trajectory expectations
    • A TrajectoryExpectation checks the kind and order of tool calls in a session transcript. Refine with argument matchers (exact, naturalLanguage, contains, oneOf, pattern, range), plus ordered expectations and a disallowed set for tools that must not be called.
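
    The ordering and disallowed checks can be sketched over a plain list of tool names. `StandInTrajectory` and `trajectoryPasses` are hypothetical, and two choices here are assumptions rather than documented behavior: ordered tools are matched as a subsequence (other calls may sit between them), and unordered tools need to appear at least once.

```swift
// Hypothetical stand-in for illustration; not the framework's TrajectoryExpectation.
struct StandInTrajectory {
    var ordered: [String] = []
    var unordered: [String] = []
    var disallowed: [String] = []
}

// Checks a recorded sequence of tool names against the expectation.
func trajectoryPasses(_ expectation: StandInTrajectory, calls: [String]) -> Bool {
    // No disallowed tool may appear anywhere in the recorded calls.
    guard !calls.contains(where: { expectation.disallowed.contains($0) }) else { return false }
    // Every unordered tool must appear at least once, in any position.
    guard expectation.unordered.allSatisfy({ calls.contains($0) }) else { return false }
    // Ordered tools must appear in the same relative order; gaps are allowed.
    var nextExpected = expectation.ordered.startIndex
    for call in calls where nextExpected < expectation.ordered.endIndex {
        if call == expectation.ordered[nextExpected] { nextExpected += 1 }
    }
    return nextExpected == expectation.ordered.endIndex
}

let searchThenDetails = StandInTrajectory(ordered: ["searchBooks", "getBookDetails"])
print(trajectoryPasses(searchThenDetails, calls: ["searchBooks", "getBookDetails"])) // true
print(trajectoryPasses(searchThenDetails, calls: ["getBookDetails", "searchBooks"])) // false

let searchOnly = StandInTrajectory(unordered: ["searchBooks"], disallowed: ["findSimilarBooks"])
print(trajectoryPasses(searchOnly, calls: ["searchBooks", "findSimilarBooks"]))      // false
```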

    • 21:26 - Building a tool call evaluation
    • Bring the trajectory expectations together: a dataset of samples (each a prompt plus expectation) scored by ToolCallEvaluator, which combines a LanguageModelSession with the tools, captures the structured transcript, and reports alongside your other results in Xcode.
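
    The two metrics named in the evaluation snippet, "All Passed" and "Percentage Passed", suggest a simple aggregation over per-sample outcomes. The sketch below is a guess at that arithmetic, not the evaluator's implementation; `summarize` is a hypothetical helper, and treating an empty run as not passed is a choice made here.

```swift
// Hypothetical aggregation for illustration; ToolCallEvaluator's internals are not shown in the session.
func summarize(_ results: [Bool]) -> (allPassed: Bool, percentagePassed: Double) {
    // An empty run is reported as not passed with 0 percent (an assumption of this sketch).
    guard !results.isEmpty else { return (false, 0) }
    let passed = results.filter { $0 }.count
    return (passed == results.count, Double(passed) / Double(results.count) * 100)
}

let summary = summarize([true, true, false, true])
print(summary.allPassed, summary.percentagePassed) // prints "false 75.0"
```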

    • 22:02 - Synthetic data for tool evaluations
    • Because ModelSample and TrajectoryExpectation are Generable, you can synthesize tool-evaluation samples too, describing the available tools, order expectations, and context in the prompt, then validating that each sample has an expectation, at least one tool, and only real tools.

    • 23:49 - Next steps
    • Run BookTaggingEvaluation (what the model produces) and tool evaluations (how it gets there) in one suite for end-to-end confidence. Next steps: create your own synthetic data, evaluate your app's custom tools, and explore the sample app and documentation.

WWDC26
Copyright © 2026 Apple Inc. All rights reserved.