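  • Build robust evaluations for agentic apps

    Learn how to build robust evaluations for your app using the advanced features of the Evaluations framework. Explore evaluation flows that involve tool calling and dynamic conditions, and learn how to define the right behavioral criteria for your use case. Discover how to generate synthetic data, use judges effectively, and validate your dataset for reliable results.

    Chapters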

    • 0:00 - Introduction
    • 2:21 - The dataset problem in BookTracker
    • 3:46 - Generating synthetic data with makeSamples
    • 6:27 - Customizing generation with SampleGenerator
    • 8:38 - Sampling strategies
    • 10:11 - Validating synthetic samples
    • 13:04 - Comparing evaluation results
    • 15:09 - Tool calling and tool evaluations
    • 18:54 - Trajectory expectations
    • 21:26 - Building a tool call evaluation
    • 22:02 - Synthetic data for tool evaluations
    • 23:49 - Next steps

    Resources

    • Book Tracker: Using Evaluations to evaluate an intelligent feature
    • Generating synthetic datasets
    • Evaluating tool-calling behavior
    • Scoring with model-as-judge evaluators
    • 5:16 - Generate synthetic data with makeSamples

      // Synthetic data
        let prompt = Prompt("""
            Generate a diverse range of book reviews and corresponding tags.
            Cover a wide range of genres, time periods, cultures, and
            reader personas. Do not repeat books already in the dataset.
            """)
        
        let dataset = Book.sampleBooks.map { book in
            ModelSample(prompt: book.review, expected: BookTags(tags: book.tags))
        }
        
        let targetCount = 100
        var expandedDataset = dataset
      
        for try await sample in dataset.makeSamples(prompt, targetCount: targetCount) {
            expandedDataset.append(sample)
            print("Dataset now holds \(expandedDataset.count) samples.")
        }
      
    • 5:53 - Configure a custom SampleGenerator

      // Define your own configuration
        let generator = SampleGenerator<ModelSample<BookTags>>(
            prompt,
            samples: dataset,
            targetCount: targetCount,
            sessionProvider: {
                LanguageModelSession( 
                    model: PrivateCloudComputeLanguageModel(),
                    instructions: """
                        You are a synthetic data generator for a book-tracking app's evaluation suite.
                        Your job is to produce realistic, diverse book entries that will stress-test
                        a tagging system.
      
                        Rules:
                        - Review must be at least 100 characters long.
                        - Review should cover a mix of genre, mood/tone, and themes.
                        - Reviews should vary in length.
                        - Create between 3 and 8 tags.
                        - Tags must be lowercase.
                        """ 
                )
            }
        )
    • 10:37 - Validate generated samples

      // Define validation rules
        validator: { sample in
            guard let book = sample.expected else { return false }
      
            // Review must be at least 100 characters
            guard sample.promptDescription.count >= 100 else { return false }
      
            // Must have between 3 and 8 tags
            guard (3...8).contains(book.tags.count) else { return false }
      
            // All tags must be lowercase
            guard book.tags.allSatisfy({ $0 == $0.lowercased() }) else { return false }
      
            return true
        }
    • 10:58 - Access valid and invalid results

      // Accessing results
        for try await sample in generator.run() {
            // During iteration
            expandedDataset.append(sample)
        }
      
        // After iteration
        let allSamples = await generator.samples
        let invalidSamples = await generator.invalidSamples
        
        print("Generated \(allSamples.count) new samples. Total: \(expandedDataset.count)")
    • 15:30 - Define a tool's Generable argument

      @Generable
        struct SearchBooksArguments {
            @Guide(description: "A freeform search term to match against titles, reviews, or tags")
            var query: String?
        
            @Guide(description: "Filter results to books with this specific tag")
            var tag: String?
      
            @Guide(description: "Filter results by mood")
            var mood: String?
      
            @Guide(description: "Filter results by genre")
            var genre: String?
      
            @Guide(description: "Maximum number of results to return. Defaults to 5.")
            var limit: Int? 
        }
    • 16:37 - A basic trajectory expectation

      // "Find books tagged gothic"
        TrajectoryExpectation(
            unordered: [
                ToolExpectation(
                    "searchBooks",
                    arguments: [
                        .exact(argumentName: "tag", value: .string("gothic"))
                    ]
                )
            ]
        )
    • 17:07 - Match arguments by intent (naturalLanguage)

      // "Find something cheerful"
        TrajectoryExpectation(
            "searchBooks",
            arguments: [
                .naturalLanguage(
                    argumentName: "mood",
                    criteria: "Should relate to uplifting, hopeful, or positive feelings"
                )
            ]
        )
        // Other matchers available: .contains, .oneOf, .pattern, .range, and more.
    • 17:34 - Expect tool calls in order

      // "Find gothic books and show details on the first"
        TrajectoryExpectation(
            ordered: [
                ToolExpectation(
                    "searchBooks",
                    arguments: [
                        .exact(argumentName: "tag", value: .string("gothic"))
                    ]
                ),
                ToolExpectation(
                    "getBookDetails",
                    arguments: [
                        .keyOnly(argumentName: "bookId")
                    ]
                )
            ]
        )
    • 17:55 - Disallow specific tool calls

      // "Show only sci-fi books. Don't look for similar ones."
        TrajectoryExpectation(
            unordered: [
                ToolExpectation(
                    "searchBooks",
                    arguments: [
                        .naturalLanguage(
                            argumentName: "genre",
                            criteria: "Should refer to science fiction")
                    ]
                )
            ],
            disallowed: [
                ToolExpectation("findSimilarBooks")
            ]
        )
    • 18:14 - Build a tool call evaluation

      // Tool call evaluations
        let samples = SampleArrayLoader(samples: [
            ModelSample(
                prompt: "Find all the books tagged with 'gothic'.",
                instructions: "Help the user explore their book collection.",
                expectations: TrajectoryExpectation(  )
            )
        ])
      
        struct BookLibraryToolCallEval: Evaluation {
            var dataset = samples
      
            let pass = Metric("All Passed")
            let percent = Metric("Percentage Passed")
      
            var evaluators: Evaluators { 
                ToolCallEvaluator(allPass: pass, percentagePass: percent)
            }
        }
    • 19:20 - Synthesize tool-evaluation samples

      // Tool call evaluations
        let prompt = Prompt("""
            Generate diverse user queries for a personal book library assistant.
            Each sample needs a prompt (what the user says), and a trajectory
            expectation describing which tools should be called and in what order.
            """)
      
        let instructions = """
            AVAILABLE TOOLS:
            - searchBooks(query?, tag?, mood?, genre?, limit?): search the library
            - getBookDetails(bookId): full details for one book
            - findSimilarBooks(bookId, maxResults?): find books sharing tags
            ORDER REQUIREMENTS:
            - searchBooks must come before getBookDetails or findSimilarBooks
            - Use TrajectoryExpectation(ordered:) when sequence matters, else (unordered:)
            USE THESE ARGUMENT MATCHERS:
            - .exact for precise values, .naturalLanguage for fuzzy matching
            - .keyOnly when any value is acceptable, .range for numeric constraints
            - .contains/.hasPrefix/.hasSuffix for partial string matching
            """
    • 19:51 - Validate tool-evaluation samples

      // Tool call evaluations
        validator: { sample in
            // Must have expectations defined
            guard let expectations = sample.output.expectations else { return false }
      
            // Must reference at least one tool
            let totalExpectations = expectations.ordered.count + expectations.unordered.count
            guard totalExpectations > 0 else { return false }
      
            // All tool names must be from the valid set
            let validTools: Set<String> = ["searchBooks", "getBookDetails", "findSimilarBooks"]
            let allExpectations = expectations.ordered + expectations.unordered + expectations.disallowed
            for expectation in allExpectations {
                guard validTools.contains(expectation.name) else { return false }
            }
        
            return true
        }
      
    • 0:00 - Introduction
    • Ada Wong and Kyle Murray introduce advanced features of the Evaluations framework (new in Xcode 27). They outline the agenda: growing your dataset with synthetic data, then building robust evaluations for agentic, tool-calling workflows, with a focus on the develop-and-evaluate step of hill-climbing.

    • 2:21 - The dataset problem in BookTracker
    • The BookTracker app auto-tags books from reviews, but its 13 hand-written sampleBooks give only a narrow view. Real-world reviews span countless books, genres, lengths, and styles, too much variety to capture by hand.

    • 3:46 - Generating synthetic data with makeSamples
    • The makeSamples API takes a prompt, a seed dataset (ModelSample values that pair a review with its expected tags), and a target count, which is the full resulting size including your seeds. It returns an async stream of new samples. Coverage of real usage matters more than raw quantity.
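The target-count arithmetic described above can be sketched in plain Swift. This is only an illustration of the stated semantics (the target is the full resulting size, seeds included); the variable names are hypothetical and nothing here calls the Evaluations framework.

```swift
// Plain-Swift illustration; these names are hypothetical, not framework API.
let seedCount = 13        // hand-written samples already in the dataset
let targetCount = 100     // desired total size, seeds included

// Only the shortfall needs to be generated; never a negative number.
let samplesToGenerate = max(0, targetCount - seedCount)
print("New samples needed: \(samplesToGenerate)")
```

With the 13 BookTracker seeds and a target of 100, the generator would need to contribute the remaining samples rather than 100 new ones.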

    • 6:27 - Customizing generation with SampleGenerator
    • For more control, SampleGenerator exposes a sessionProvider closure that picks the model (such as Private Cloud Compute) and its instructions. The session is reused across batches, but it can exhaust its context window mid-run, and the provider may then be called again. Keep the instructions self-contained so that every new session starts with the full set of rules.
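To make the "provider may be called again" point concrete, here is a minimal plain-Swift sketch. It does not use the Evaluations framework: SketchSession, its token budget, and runBatches are hypothetical stand-ins, and the real framework may decide when to request a new session differently.

```swift
// Hypothetical stand-in for a model session; not the framework's LanguageModelSession.
final class SketchSession {
    let instructions: String
    let contextLimit: Int
    private(set) var usedTokens = 0

    init(instructions: String, contextLimit: Int) {
        self.instructions = instructions
        self.contextLimit = contextLimit
    }

    /// Records a batch and returns true, or returns false if it no longer fits.
    func consume(tokens: Int) -> Bool {
        guard usedTokens + tokens <= contextLimit else { return false }
        usedTokens += tokens
        return true
    }
}

/// Runs `batches` batches, asking the provider for a fresh session whenever
/// the current one is exhausted. Returns how many sessions were created.
func runBatches(_ batches: Int, tokensPerBatch: Int,
                sessionProvider: () -> SketchSession) -> Int {
    var session = sessionProvider()
    var created = 1
    for _ in 0..<batches {
        if !session.consume(tokens: tokensPerBatch) {
            // Context exhausted: start over with a session built by the provider.
            session = sessionProvider()
            created += 1
            _ = session.consume(tokens: tokensPerBatch)
        }
    }
    return created
}

let sessionsCreated = runBatches(5, tokensPerBatch: 400) {
    // Every new session gets the complete rules, so nothing depends on earlier turns.
    SketchSession(instructions: "Create between 3 and 8 tags. Tags must be lowercase.",
                  contextLimit: 1000)
}
print("Sessions created: \(sessionsCreated)")
```

Because the closure rebuilds the session from scratch, any rule that lived only in an earlier conversation turn would be lost, which is why the instructions are written to stand alone.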

    • 8:38 - Sampling strategies
    • The samplingStrategy controls which seed samples are shown to the model as in-context examples: random (a varied subset, the default) or slidingWindow (sequential, for datasets with meaningful order).
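The difference between the two strategies can be sketched in plain Swift. SketchSamplingStrategy and pickExamples are hypothetical names, and advancing the window by one seed per batch is an assumption made for illustration; the framework's own selection logic is not shown in the source.

```swift
// Hypothetical enum mirroring the two strategies named above; not the framework's type.
enum SketchSamplingStrategy {
    case random
    case slidingWindow
}

/// Picks up to `count` seed examples to show the model for one generation batch.
func pickExamples<T, G: RandomNumberGenerator>(
    from seeds: [T],
    count: Int,
    batchIndex: Int,
    strategy: SketchSamplingStrategy,
    using rng: inout G
) -> [T] {
    guard !seeds.isEmpty, count > 0 else { return [] }
    let size = min(count, seeds.count)
    switch strategy {
    case .random:
        // A varied subset: shuffle, then keep the first `size` elements.
        return Array(seeds.shuffled(using: &rng).prefix(size))
    case .slidingWindow:
        // Sequential: the window starts one seed later per batch and wraps around.
        let start = batchIndex % seeds.count
        return (0..<size).map { seeds[(start + $0) % seeds.count] }
    }
}

var rng = SystemRandomNumberGenerator()
let seeds = ["a", "b", "c", "d", "e"]
let window0 = pickExamples(from: seeds, count: 3, batchIndex: 0,
                           strategy: .slidingWindow, using: &rng)
let window4 = pickExamples(from: seeds, count: 3, batchIndex: 4,
                           strategy: .slidingWindow, using: &rng)
let randomPick = pickExamples(from: seeds, count: 3, batchIndex: 0,
                              strategy: .random, using: &rng)
print(window0, window4, randomPick)
```

The sliding window preserves the seeds' order, which only helps when that order carries meaning; the random subset trades order for variety.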

    • 10:11 - Validating synthetic samples
    • A validator closure accepts or rejects each generated sample in isolation against systematic rules: review length at least 100 characters, 3 to 8 tags, lowercase tags. Valid samples collect in samples, rejects in invalidSamples, both updated in real time.
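The three rules and the valid/invalid split can be tried out without the framework. In this sketch SketchSample is a hypothetical stand-in for a generated sample, and the two arrays only imitate the roles of samples and invalidSamples.

```swift
// Hypothetical stand-in for a generated sample; field names are illustrative.
struct SketchSample {
    let review: String
    let tags: [String]
}

/// Applies the three rules from the session to one sample in isolation.
func isValid(_ sample: SketchSample) -> Bool {
    guard sample.review.count >= 100 else { return false }           // review length
    guard (3...8).contains(sample.tags.count) else { return false }  // tag count
    return sample.tags.allSatisfy { $0 == $0.lowercased() }          // lowercase tags
}

let longReview = String(repeating: "A slow, atmospheric mystery. ", count: 4)
let generated = [
    SketchSample(review: longReview, tags: ["mystery", "gothic", "slow-burn"]),
    SketchSample(review: "Too short.", tags: ["mystery", "gothic", "slow-burn"]),
    SketchSample(review: longReview, tags: ["Mystery", "gothic", "slow-burn"]),
    SketchSample(review: longReview, tags: ["mystery"]),
]

// Sort each sample into one of two buckets as it arrives.
var valid: [SketchSample] = []
var invalid: [SketchSample] = []
for sample in generated {
    if isValid(sample) { valid.append(sample) } else { invalid.append(sample) }
}
print("valid: \(valid.count), invalid: \(invalid.count)")
```

Keeping the rejects around, rather than discarding them, makes it possible to see which rule the generator breaks most often and tighten the instructions accordingly.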

    • 13:04 - Comparing evaluation results
    • Using the Xcode 27 Evaluations Report, compare the 13-sample run against the 100-sample run. The quality scores drop: the feature only looked good on the small dataset. A drop can signal issues in the prompt, the feature, the evaluation, or the dataset.

    • 15:09 - Tool calling and tool evaluations
    • Features often make multiple behind-the-scenes tool calls, and a plausible answer can come from the wrong path. Tool evaluations verify the how: correct tools, correct arguments, correct order, no surprises. The session illustrates this with searchBooks, getBookDetails, and findSimilarBooks.

    • 18:54 - Trajectory expectations
    • A TrajectoryExpectation checks the kind and order of tool calls in a session transcript. Refine with argument matchers (exact, naturalLanguage, contains, oneOf, pattern, range), plus ordered expectations and a disallowed set for tools that must not be called.
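As a rough mental model of what such a check does, the following plain-Swift sketch matches tool names only. SketchTrajectory and satisfies are hypothetical, argument matchers are left out, and treating ordered expectations as a subsequence (other calls may interleave) is an assumption, not documented framework behavior.

```swift
// Hypothetical, simplified model of a trajectory check over tool names only.
struct SketchTrajectory {
    var ordered: [String] = []
    var unordered: [String] = []
    var disallowed: [String] = []
}

/// Returns true when the recorded tool calls satisfy the expectation.
func satisfies(_ calls: [String], _ expectation: SketchTrajectory) -> Bool {
    // No disallowed tool may appear anywhere in the transcript.
    if calls.contains(where: { expectation.disallowed.contains($0) }) { return false }

    // Every unordered tool must appear at least once, in any position.
    if !expectation.unordered.allSatisfy({ calls.contains($0) }) { return false }

    // Ordered tools must appear in sequence; unrelated calls may sit in between.
    var remaining = calls[...]
    for name in expectation.ordered {
        guard let index = remaining.firstIndex(of: name) else { return false }
        remaining = remaining[(index + 1)...]
    }
    return true
}

let expectation = SketchTrajectory(
    ordered: ["searchBooks", "getBookDetails"],
    disallowed: ["findSimilarBooks"]
)

let rightPath = satisfies(["searchBooks", "getBookDetails"], expectation)
let wrongOrder = satisfies(["getBookDetails", "searchBooks"], expectation)
let forbiddenCall = satisfies(["searchBooks", "findSimilarBooks", "getBookDetails"], expectation)
print(rightPath, wrongOrder, forbiddenCall)
```

The second and third transcripts could still produce a plausible final answer, which is exactly the case a trajectory check is meant to catch.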

    • 21:26 - Building a tool call evaluation
    • Bring the trajectory expectations together: a dataset of samples (each a prompt plus expectation) scored by ToolCallEvaluator, which combines a LanguageModelSession with the tools, captures the structured transcript, and reports alongside your other results in Xcode.
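The two metrics shown in the code sample, "All Passed" and "Percentage Passed", suggest a simple aggregation over per-sample outcomes. The sketch below is one plausible reading in plain Swift; summarize is a hypothetical helper, and how ToolCallEvaluator actually computes its metrics is not stated in the source.

```swift
/// Aggregates per-sample pass/fail results into the two summary values.
/// An empty result set is treated as not passing (an assumption of this sketch).
func summarize(_ results: [Bool]) -> (allPassed: Bool, percentagePassed: Double) {
    guard !results.isEmpty else { return (false, 0) }
    let passed = results.filter { $0 }.count
    let percentage = Double(passed) / Double(results.count) * 100
    return (passed == results.count, percentage)
}

// Four samples, one of which took the wrong tool path.
let summary = summarize([true, true, false, true])
print("All passed: \(summary.allPassed), percentage: \(summary.percentagePassed)")
```

Reporting both values is useful because a single failing sample flips the first one while barely moving the second.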

    • 22:02 - Synthetic data for tool evaluations
    • Because ModelSample and TrajectoryExpectation are Generable, you can synthesize tool-evaluation samples too, describing the available tools, order expectations, and context in the prompt, then validating that each sample has an expectation, at least one tool, and only real tools.

    • 23:49 - Next steps
    • Run BookTaggingEvaluation (what the model produces) and tool evaluations (how it gets there) in one suite for end-to-end confidence. Next steps: create your own synthetic data, evaluate your app's custom tools, and explore the sample app and documentation.

  • WWDC26
    Copyright © 2026 Apple Inc. All rights reserved.