  • Improve prompts through hill-climbing with Evaluations

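    Learn comparative evaluation techniques that guide prompt engineering and help you choose the right model for your app. Explore how to establish a performance baseline, scale your evaluation strategy, and convert results to JSON for integration with other tools. Learn when to apply different prompting strategies and how to iteratively improve prompts for the best results.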

    Chapters

    • 0:00 - Introduction
    • 2:42 - BookTracker's tagging problem
    • 5:27 - Analyzing the evaluation results
    • 8:26 - Drift between judge and human
    • 9:37 - Measuring drift with Cohen's kappa
    • 12:26 - Building a judge alignment evaluation
    • 15:16 - Analyzing alignment failures
    • 17:16 - Comparative evaluation: control vs experimental
    • 19:12 - Refining the scoring dimensions
    • 21:23 - Adding few-shot examples to the judge
    • 23:38 - Going beyond prompts: adding a tool
    • 27:17 - Next steps

    Resources

    • Book Tracker: Using Evaluations to evaluate an intelligent feature
    • Designing effective model-as-judge evaluators
    • Designing specific, measurable criteria in an evaluation suite
    • 3:54 - The BookTaggingEvaluation

        // MARK: - Evaluation
        struct BookTaggingEvaluation: Evaluation {
            func subject(from sample: ModelSample<BookTags>) async throws -> ModelSubject<BookTags> {
                let result = try await BookTaggingService.generateTags(for: sample.promptDescription)
                return ModelSubject(value: result)
            }
      
            // MARK: - Dataset
            var dataset = ArrayLoader(samples:
                Book.sampleBooks.map { book in
                    ModelSample(prompt: book.review, expected: BookTags(tags: book.tags))
                }
            )
      
            // MARK: - Evaluators & Metrics
            let tagCount = Metric("Tag Count")
            let hasGenreTag = Metric("Has Genre Tag")
            let noDuplicates = Metric("No Duplicates")
      
            let relevance = ScoreDimension(
                "Relevance",
                description: """
                    Whether each tag describes a quality, theme, or tone of the
                    book itself rather than incidental details or the reader's
                    personal reactions.
                    """,
                scale: .numeric([
                    4: "Every tag describes the book itself",
                    3: "Most tags describe the book, one picks up a reader reaction or minor detail",
                    2: "Most tags are surface details or personal reactions, not book descriptors",
                    1: "Tags don't meaningfully describe the book"
                ])
            )
      
            let usefulness = ScoreDimension(
                "Usefulness",
                description: """
                    Whether tags are at the right granularity for browsing — broad
                    enough that multiple books could share the tag, specific enough
                    to help filter.
                    """,
                scale: .numeric([
                    4: "Every tag could group multiple books while still narrowing a search",
                    3: "Most tags are at the right level, one is either too broad or too narrow",
                    2: "Most tags are too broad to filter or too narrow to group",
                    1: "Tags would not help with browsing"
                ])
            )
      
            var evaluators: Evaluators {
                // 1. Tag count is within the required 3–8 range
                Evaluator { _, subject in
                    let count = subject.value.tags.count
                    if (3...8).contains(count) {
                        return tagCount.passing(rationale: "\(count) tags")
                    }
                    return tagCount.failing(rationale: "Got \(count) tags, expected 3–8")
                }
        
                // 2. At least one tag identifies the genre or literary form
                Evaluator { _, subject in
                    let tags = subject.value.tags.map { $0.lowercased() }
                    let knownGenres = await BookTaggingService.knownGenres
                    for tag in tags {
                        if knownGenres.contains(tag) {
                            return hasGenreTag.passing(rationale: "Matched \(tag)")
                        }
                    }
                    return hasGenreTag.failing()
                }
      
                // 3. No duplicate tags
                Evaluator { _, subject in
                    let uniqueCount = Set(subject.value.tags.map { $0.lowercased() }).count
                    if (subject.value.tags.count - uniqueCount) > 0 {
                        return noDuplicates.failing(rationale: "Found \(subject.value.tags.count - uniqueCount) duplicates")
                    }
                    return noDuplicates.passing()
                }
        
                // 4. Overall tag quality: relevance and usefulness, scored by a model judge
                ModelJudgeEvaluator(
                    judge: .default,
                    dimensions: [relevance, usefulness],
                    prompt: ModelJudgePrompt(
                        instructions: """
                            You are evaluating automatically generated tags for Book Tracker, a personal
                            book tracking app. Users write a short summary of their reading
                            experience, and the app generates tags to make their library browsable.
                            A good tag describes the book itself — its genre, themes, tone, or
                            setting. A bad tag picks up incidental details or the reader's personal
                            reactions that don't describe the book.
                            """,
                        evaluationTarget: { output in output.tags.joined(separator: ", ") },
                        reference: { input, _ in
                            ["Expected Tags": input.expected?.tags.joined(separator: ", ") ?? ""]
                        }
                    )
                )
            }
      
            // MARK: - Analysis
            func aggregateMetrics(using aggregator: inout MetricsAggregator) {
                aggregator.group("Heuristics") { group in
                    group.computeMean(of: tagCount)
                    group.computeMean(of: hasGenreTag)
                    group.computeMean(of: noDuplicates)
                }
                aggregator.group("Quality") { group in
                    group.computeMean(of: relevance.metric)
                    group.computeMean(of: usefulness.metric)
                }
            }
        }
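
The three heuristic evaluators above reduce to simple predicates over the tag list. As a rough sketch, they can be written as plain functions and tested without the Evaluations framework. The function names below are hypothetical and are not part of the sample, and `knownGenres` is assumed to hold lowercase strings.

```swift
/// True when the tag count falls in the required range of 3 to 8.
func hasValidTagCount(_ tags: [String]) -> Bool {
    (3...8).contains(tags.count)
}

/// Number of tags that duplicate another tag, ignoring case.
func duplicateCount(in tags: [String]) -> Int {
    tags.count - Set(tags.map { $0.lowercased() }).count
}

/// First tag (lowercased) that appears in the set of known genres, if any.
/// Assumes `knownGenres` already contains lowercase entries.
func firstGenreTag(in tags: [String], knownGenres: Set<String>) -> String? {
    tags.map { $0.lowercased() }.first { knownGenres.contains($0) }
}
```

Keeping the checks as pure functions makes it easy to unit test the rules themselves before wrapping them in evaluators.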
    • 4:05 - Refined Relevance & Usefulness score dimensions

        let relevance = ScoreDimension(
            "Relevance",
            description: """
                Whether each tag describes the book itself — its genre, themes,
                tone, or setting — rather than the reader's reactions, meta-
                commentary about the review, or facts about the author. A book
                can be "suspenseful" (a property of the text); a reader is
                "exhausted" (a reaction). Mis-labeling the genre is a serious failure.
                """,
            scale: .numeric([
                4: "Every tag describes the book itself",
                3: "Most tags describe the book, one picks up a reader reaction or minor detail",
                2: "Most tags are surface details or personal reactions, not book descriptors",
                1: "Tags don't meaningfully describe the book"
            ])
        )
      
        let usefulness = ScoreDimension(
            "Usefulness",
            description: """
                Whether tags work as library shelf labels — broad enough that
                several books could plausibly share the tag, specific enough to
                meaningfully narrow a search. Standard genre and theme tags work;
                made-up phrases, character names, hyper-specific descriptors, and
                overly generic words like "interesting" don't.
                """,
            scale: .numeric([
                4: "Every tag could group multiple books while still narrowing a search",
                3: "Most tags are at the right level, one is either too broad or too narrow",
                2: "Most tags are too broad to filter or too narrow to group",
                1: "Tags would not help with browsing"
            ])
        )
    • 11:56 - The alignment dataset, extracted to JSON

        // Model judge alignment dataset
        [
          {
            "input": "I have read this book more times than I can count…",
            "response": "[\"literary-fiction\", \"historical-fiction\", \"family-drama\", \"romantic-drama\", \"character-driven\", \"emotional-intensity\", \"multigenerational-narrative\", \"penned-by-a-woman\"]"
          }
          // ... add your expert ratings to each entry
        ]
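
Each entry above stores the generated tags as a JSON array encoded inside a string, so loading the file takes two decoding passes. The following is a minimal sketch using only Foundation. The `input` and `response` keys come from the JSON above; the type name, the `loadEntries` helper, and the two rating keys are assumptions (the rating names mirror the `expertRelevanceScore` and `expertUsefulnessScore` properties used elsewhere on this page). The `// ...` comment line must be removed from the file first, because `JSONDecoder` rejects comments.

```swift
import Foundation

// Hypothetical shape for one extracted entry.
struct ExtractedEntry: Decodable {
    let input: String
    let response: String             // a JSON array of tags, encoded as a string
    let expertRelevanceScore: Int?   // assumed key, added by hand
    let expertUsefulnessScore: Int?  // assumed key, added by hand

    /// Decodes the nested tag array; returns [] if the string isn't valid JSON.
    var tags: [String] {
        guard let data = response.data(using: .utf8),
              let tags = try? JSONDecoder().decode([String].self, from: data)
        else { return [] }
        return tags
    }
}

/// Decodes the whole alignment dataset from raw JSON data.
func loadEntries(from data: Data) throws -> [ExtractedEntry] {
    try JSONDecoder().decode([ExtractedEntry].self, from: data)
}
```

Making the rating fields optional lets the same type load the file both before and after the expert scores are filled in.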
    • 12:31 - The judge alignment evaluation: dataset, subject, evaluator

        // Model judge alignment evaluation
        struct BookTagJudgmentCalibration: Evaluation {
      
            // MARK: Dataset — load the extracted summary/tag pairs
            static let samples: [ModelSample<BookTagJudgmentValue>] = {
                guard let url = Bundle(for: BundleToken.self).url(
                        forResource: "BookTaggingEvaluation-extracted", withExtension: "json"),
                      let data = try? Data(contentsOf: url) else { return [] }
                // Build ModelSample array (adding expert ratings)
                // ...
            }()
      
            var dataset: some Loader { ArrayLoader(samples: Self.samples) }
        
            // MARK: Capture Subject — tags are already generated, so just return them
            func subject(from sample: ModelSample<BookTagJudgmentValue>) async throws -> ModelSubject<BookTagJudgmentValue> {
                ModelSubject(value: sample.expected ?? BookTagJudgmentValue(
                    tags: [], expertRelevanceScore: 0, expertUsefulnessScore: 0))
            }
      
            // MARK: Evaluators — the same model judge as the book-tags evaluation
            var evaluators: Evaluators {
                ModelJudgeEvaluator(
                    judge: .default,
                    dimensions: [relevance, usefulness],
                    prompt: ModelJudgePrompt(
                        instructions: "You are evaluating automatically generated tags for Book Tracker…",
                        evaluationTarget: { output in output.tags.joined(separator: ", ") },
                        reference: { input, _ in
                            ["Expected Tags": input.expected?.tags.joined(separator: ", ") ?? ""]
                        }
                    )
                )
            }
        }
    • 13:00 - Cohen's kappa aggregation

        func aggregateMetrics(using aggregator: inout MetricsAggregator) {
            let expertRelevance = Self.samples.map { Double($0.expected?.expertRelevanceScore ?? 0) }
            let expertUsefulness = Self.samples.map { Double($0.expected?.expertUsefulnessScore ?? 0) }
      
            aggregator.group("Relevance") { group in
                group.computeMean(of: relevance.metric)
                group.computeStandardDeviation(of: relevance.metric)
                group.custom(of: relevance.metric, label: "Relevance Alignment Score") { judge in
                    cohensKappa(ratings1: expertRelevance, ratings2: judge) ?? 0
                }
            }
            aggregator.group("Usefulness") { group in
                group.computeMean(of: usefulness.metric)
                group.computeStandardDeviation(of: usefulness.metric)
                group.custom(of: usefulness.metric, label: "Usefulness Alignment Score") { judge in
                    cohensKappa(ratings1: expertUsefulness, ratings2: judge) ?? 0
                }
            }
        }
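
The aggregation above calls `cohensKappa(ratings1:ratings2:)`, whose body isn't shown in the snippet. The following is one possible sketch that matches that call shape, not the sample's actual implementation. It assumes the ratings are categorical labels stored as `Double` values (for example 1.0 through 4.0) and compared for exact equality, so fractional judge scores would need rounding first.

```swift
import Foundation

/// Cohen's kappa for two raters who scored the same items.
/// Returns nil when the inputs are empty, differ in length, or when
/// chance agreement is 1 (kappa is undefined in that case).
func cohensKappa(ratings1: [Double], ratings2: [Double]) -> Double? {
    guard !ratings1.isEmpty, ratings1.count == ratings2.count else { return nil }
    let n = Double(ratings1.count)

    // Observed agreement: fraction of items where both raters gave the same label.
    let matches = zip(ratings1, ratings2).filter { pair in pair.0 == pair.1 }.count
    let observed = Double(matches) / n

    // How often each rater used each label.
    var counts1: [Double: Double] = [:]
    var counts2: [Double: Double] = [:]
    for rating in ratings1 { counts1[rating, default: 0] += 1 }
    for rating in ratings2 { counts2[rating, default: 0] += 1 }

    // Chance agreement: sum over labels of p1(label) * p2(label).
    var expected = 0.0
    for (label, count1) in counts1 {
        expected += (count1 / n) * ((counts2[label] ?? 0) / n)
    }

    guard expected < 1 else { return nil }
    return (observed - expected) / (1 - expected)
}
```

For example, an expert who rates `[4, 4, 4, 3]` and a judge who rates `[4, 4, 4, 4]` agree on 75% of items, yet kappa is 0 because that level of agreement is exactly what chance predicts.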
    • 13:24 - The judge calibration test

        // Model judge alignment tests
        @Suite("Book Tag Judge Calibration")
        struct BookTagJudgmentCalibrationTests {
            static let evaluation = BookTagJudgmentCalibration()
      
            @Test("Judge Calibration", .evaluates(evaluation))
            func evaluateJudgeCalibration() async throws {
                let result = EvaluationContext.current.result
      
                let usefulnessMetric = BookTagJudgmentCalibrationTests.evaluation.usefulness.metric
                let relevanceMetric = BookTagJudgmentCalibrationTests.evaluation.relevance.metric
      
                // Labels must match the ones registered in aggregateMetrics(using:).
                #expect(result.aggregateValue(.custom(label: "Relevance Alignment Score")) > 0.6)
                #expect(result.aggregateValue(.custom(label: "Usefulness Alignment Score")) > 0.6)
            }
        }
    • 16:33 - The experimental judge prompt

        // Experimental evaluation
        struct BookTagJudgmentCalibrationExperimental: Evaluation {
            var evaluators: Evaluators {
                ModelJudgeEvaluator(
                    judge: .default,
                    dimensions: [relevance, usefulness],
                    prompt: ModelJudgePrompt(
                        instructions: """
                            You are an experienced reader and librarian evaluating tags
                            automatically generated for Book Tracker... Score the tag set on two
                            independent dimensions: Relevance and Usefulness.
      
                            ## What a good tag looks like
                            - Genre/form, theme/subject, tone/atmosphere, setting/era
      
                            ## Common failure modes
                            - Reader reactions, meta-commentary, author facts, genre contradictions
                            """,   // ← full prompt is ~40 lines; abbreviated here
                        evaluationTarget: { output in output.tags.joined(separator: ", ") },
                        reference: { input, _ in
                            ["Book Review": input.promptDescription,
                             "Tags Generated for the Review": input.expected?.tags.joined(separator: ", ") ?? ""]
                        }
                    )
                )
            }
        }
    • 20:12 - Few-shot worked examples in the judge prompt

        struct ExperimentalBookTagJudgmentCalibration: Evaluation {
            var evaluators: Evaluators {
                ModelJudgeEvaluator(
                    judge: SystemLanguageModel(),
                    dimensions: [relevance, usefulness],
                    prompt: ModelJudgePrompt(
                        instructions: """
                            You are calibrating with an expert librarian who scores
                            automatically generated tags for Book Tracker... Your goal is to
                            match how the librarian scores. Use the worked examples to calibrate.
      
                            ## Worked examples
                            ### Example A — clean fit (Pride and Prejudice)
                            Tags: romance, historical-fiction, love, redemption, passion
                            Librarian: Relevance 4, Usefulness 4
      
                            ### Example E — flat genre contradiction (Frankenstein)
                            Tags: horror, science-fiction, ... self-help, self-improvement
                            Librarian: Relevance 2, Usefulness 3
                            ... (6 examples A–F; keep the set small to avoid overfitting)
                            """,   // ← full prompt is ~60 lines; abbreviated here
                        evaluationTarget: { output in output.tags.joined(separator: ", ") },
                        reference: { input, _ in
                            ["Book Review": input.promptDescription,
                             "Tags Generated for the Review": input.expected?.tags.joined(separator: ", ") ?? ""]
                        }
                    )
                )
            }
        }
      
    • 22:03 - The BookLookupTool

        // Book Information Lookup Tool
        struct BookLookupTool: Tool {
            let name = "lookupBook"
            let description = "Looks up the title and author of a book given distinguishing details (such as character names, settings, quoted lines, or notable plot points) extracted from a reader's review."
      
            @Generable
            struct Arguments {
                @Guide(description: "Distinguishing details from the review that identify the book, such as character names, settings, quoted lines, or notable plot points.")
                var details: String
            }
        
            @Generable
            struct Output {
                @Guide(description: "The title of the identified book, or an empty string if no match was found.")
                var title: String
      
                @Guide(description: "The author of the identified book, or an empty string if no match was found.")
                var author: String
            }
        
            func call(arguments: Arguments) async throws -> Output {
                let needles = arguments.details
                    .lowercased()
                    .split(whereSeparator: { !$0.isLetter && !$0.isNumber })
                    .map(String.init)
                    .filter { $0.count >= 4 }
      
                let best = Book.sampleBooks
                    .map { book -> (book: Book, score: Int) in
                        let review = book.review.lowercased()
                        let score = needles.reduce(0) { partial, needle in
                            partial + (review.contains(needle) ? 1 : 0)
                        }
                        return (book, score)
                    }
                    .max(by: { $0.score < $1.score })
      
                guard let match = best, match.score > 0 else {
                    return Output(title: "", author: "")
                }
                return Output(title: match.book.title, author: match.book.author)
            }
        }
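
The matching logic inside `call(arguments:)` is a keyword-overlap score: split the details into words, then count how many of them occur in each candidate review. The sketch below isolates that idea so it can be tried outside the tool. The function names are hypothetical, and plain strings stand in for the sample's `Book` values.

```swift
import Foundation

/// Splits free text into lowercase keywords of at least `minLength` characters.
func keywords(in text: String, minLength: Int = 4) -> [String] {
    text.lowercased()
        .split(whereSeparator: { !$0.isLetter && !$0.isNumber })
        .map(String.init)
        .filter { $0.count >= minLength }
}

/// Index of the candidate text that contains the most keywords,
/// or nil when no candidate contains any of them.
func bestMatchIndex(for details: String, in candidates: [String]) -> Int? {
    let needles = keywords(in: details)
    let scored = candidates.enumerated().map { pair -> (index: Int, score: Int) in
        let haystack = pair.element.lowercased()
        // Substring match, so "ring" would also count inside "bring".
        let score = needles.filter { haystack.contains($0) }.count
        return (pair.offset, score)
    }
    guard let best = scored.max(by: { $0.score < $1.score }), best.score > 0 else {
        return nil
    }
    return best.index
}
```

Because the score counts substring hits rather than whole-word matches, short or common keywords can produce false positives; the four-character minimum reduces that risk but doesn't remove it.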
    • 22:36 - BookTaggingService with a tools parameter

        // Book Tagging Service
        struct BookTaggingService {
            static func generateTags(for review: String, tools: [any Tool] = []) async throws -> BookTags {
                let prompt = tagsPrompt(review: review)
                let session = LanguageModelSession(
                    model: SystemLanguageModel(guardrails: .permissiveContentTransformations),
                    tools: tools,
                    instructions: instructions
                )
                let response = try await session.respond(to: prompt, generating: BookTags.self)
                return response.content
            }
        }
    • 22:57 - Evaluation with the lookup tool

        // Evaluation of tags with tool
        struct BookTaggingWithLookupEvaluation: Evaluation {
            func subject(from sample: ModelSample<BookTags>) async throws -> ModelSubject<BookTags> {
                let result = try await BookTaggingService.generateTags(
                    for: sample.promptDescription,
                    tools: [BookLookupTool()]
                )
                return ModelSubject(value: result)
            }
            // ... same dataset, evaluators, and aggregation as BookTaggingEvaluation
        }
    • 23:09 - Compare with/without the tool in one suite

        @Suite("Book Tag Evaluations")
        struct BookTagEvaluationTests {
            static let evaluation = BookTaggingEvaluation()
            static let lookupEvaluation = BookTaggingWithLookupEvaluation()
      
            @Test("Book Tag Evaluations", .evaluates(evaluation, info: evaluationInfo))
            func evaluateBookTagging() async throws {
                let result = EvaluationContext.current.result
                let rangeMetric = BookTagEvaluationTests.evaluation.tagCount
                let dupeMetric = BookTagEvaluationTests.evaluation.noDuplicates
                #expect(result.aggregateValue(.mean(of: rangeMetric)) >= 0.8)
                #expect(result.aggregateValue(.mean(of: dupeMetric)) == 1)
            }
      
            @Test("Book Tag Evaluations (with BookLookupTool)", .evaluates(lookupEvaluation, info: lookupEvaluationInfo))
            func evaluateBookTaggingWithLookup() async throws {
                let result = EvaluationContext.current.result
                let rangeMetric = BookTagEvaluationTests.lookupEvaluation.tagCount
                let dupeMetric = BookTagEvaluationTests.lookupEvaluation.noDuplicates
                #expect(result.aggregateValue(.mean(of: rangeMetric)) >= 0.8)
                #expect(result.aggregateValue(.mean(of: dupeMetric)) == 1)
            }
        }
    • 0:00 - Introduction
    • Introduces hill-climbing: iteratively improving an intelligence feature with evaluation scores as a guide (develop, run, analyze), and bringing scientific thinking to that loop. Assumes you've already built an evaluation pipeline (see "Meet the Evaluations framework").

    • 2:42 - BookTracker's tagging problem
    • Revisits BookTracker, whose tag generator produces tags that miss key themes or reflect the reader's feelings rather than the book. The existing evaluation judges tag quality via score dimensions (Relevance, Usefulness) and a ModelJudgeEvaluator.

    • 5:27 - Analyzing the evaluation results
    • Adds two reviews to the dataset, runs the evaluation (Swift Testing #expect), and uses the Xcode evaluation report and assistant editor to compare generated tags against expected ones, revealing that the human and the model judge disagree on usefulness.

    • 8:26 - Drift between judge and human
    • That disagreement is drift, the divergence between a model judge's ratings and an expert's. As the dataset grows, drift widens, making it hard to trust the evaluation, so the judge must be aligned to expert opinion.

    • 9:37 - Measuring drift with Cohen's kappa
    • Accuracy alone misleads when scores are unevenly distributed: a judge that hands out mostly high scores can look aligned by luck. Cohen's kappa coefficient measures true alignment by subtracting the probability of chance agreement from the observed agreement (accuracy) and normalizing the result, which makes it a more robust drift metric.
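
Written out, the standard definition of Cohen's kappa is shown below, where p_o is the observed agreement between the two raters and p_e is the agreement expected by chance given how often each rater uses each score k. The worked numbers are hypothetical and only illustrate the effect; they are not taken from the session.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_e = \sum_{k} p_{\text{expert}}(k)\, p_{\text{judge}}(k)

% Hypothetical example: 10 samples, the judge scores every one a 4,
% the expert scores nine of them 4 and one of them 3.
p_o = 0.9, \qquad
p_e = (0.9)(1.0) + (0.1)(0) = 0.9, \qquad
\kappa = \frac{0.9 - 0.9}{1 - 0.9} = 0
```

Here 90% accuracy corresponds to a kappa of 0, because a judge that always answers 4 would reach the same agreement without looking at the tags at all.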

    • 12:26 - Building a judge alignment evaluation
    • Builds an evaluation that compares the presenter's ratings with the judge's over a shared dataset: extract summary/tag pairs from the prior run's attachment, add human ratings, return the already-generated tags as the subject, reuse the same ModelJudgeEvaluator, and aggregate Cohen's kappa (plus mean and standard deviation), targeting an alignment score above 0.6.

    • 15:16 - Analyzing alignment failures
    • The alignment test fails. Drilling into the report (for example, Frankenstein and The Ramakien) shows the judge rating overly specific or off-theme tags too highly; the judge's prompt lacks the context to tell a good tag from a bad one.

    • 17:16 - Comparative evaluation: control vs experimental
    • Xcode 27 can compare two evaluations like a controlled experiment: a baseline (control) prompt versus an experimental prompt that adds app context plus examples of good and bad tags. Running both shows relevance improved while usefulness dropped, a tradeoff to weigh.
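
Outside Xcode's comparison view, the same control-versus-experimental reading can be approximated by subtracting one run's aggregate values from the other's. This is a small sketch under stated assumptions: the `deltas` function is hypothetical, and each run is assumed to be summarized as a dictionary from dimension name to a single aggregate value, such as an alignment score.

```swift
/// Per-dimension change in an aggregate value (experimental minus control).
/// Dimensions missing from either run are skipped.
func deltas(control: [String: Double], experimental: [String: Double]) -> [String: Double] {
    var result: [String: Double] = [:]
    for (dimension, controlValue) in control {
        guard let experimentalValue = experimental[dimension] else { continue }
        result[dimension] = experimentalValue - controlValue
    }
    return result
}

// Hypothetical values, chosen only to illustrate a tradeoff:
// one dimension improves while the other regresses.
let change = deltas(
    control: ["Relevance": 0.25, "Usefulness": 0.5],
    experimental: ["Relevance": 0.5, "Usefulness": 0.25]
)
```

A positive delta on one dimension paired with a negative delta on another is the kind of tradeoff that calls for changing one variable at a time in the next iteration.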

    • 19:12 - Refining the scoring dimensions
    • With the prompt change kept, the side-by-side comparison view reveals that the judge grades usefulness too harshly. The new prompt is applied to the baseline so that only one variable changes, and the ScoreDimension descriptions are then sharpened (emphasizing genre tags and being critical of overly specific ones), which improves both scores.

    • 21:23 - Adding few-shot examples to the judge
    • The scores are still short of the goal, so the judge prompt is grounded with the feature's purpose and a few worked examples of how the presenter rates. The examples are deliberately few to avoid overfitting the alignment score. Scores finally exceed expectations, so the judge is trusted and the loop exits.

    • 23:38 - Going beyond prompts: adding a tool
    • Hill-climbing isn't limited to prompts: to give the on-device tag model more context, a BookLookupTool supplies the title and author. BookTaggingService gains a tools parameter (defaulting to an empty array), and a second evaluation compares the feature with and without the tool. The tool version scores better, though the small 13-sample dataset and the unobserved tool calls point to "Create robust evaluations for agentic apps."

    • 27:17 - Next steps
    • Think like a scientist (one change at a time), invest the time (failed experiments still inform), be creative (instructions, tools, models, datasets, aggregations, and evaluators are all fair game), and watch for drift. Download the Book Tracker sample and review the documentation.

  • WWDC26
    Copyright © 2026 Apple Inc. All rights reserved.