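Establishing a reliable evaluation process for agentic apps
Learn how to use the advanced features of the Evaluations framework to establish a reliable evaluation process for your app. We show how to evaluate flows that involve tool calls and dynamic conditions, and how to define what correct behavior means for your use case. We also explain how to generate synthetic data, use judges effectively, and validate datasets so that your results are reliable.
Chapters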
- 0:00 - Introduction
- 2:21 - The dataset problem in BookTracker
- 3:46 - Generating synthetic data with makeSamples
- 6:27 - Customizing generation with SampleGenerator
- 8:38 - Sampling strategies
- 10:11 - Validating synthetic samples
- 13:04 - Comparing evaluation results
- 15:09 - Tool calling and tool evaluations
- 18:54 - Trajectory expectations
- 21:26 - Building a tool call evaluation
- 22:02 - Synthetic data for tool evaluations
- 23:49 - Next steps
5:16 - Generate synthetic data with makeSamples
    // Synthetic data
    let prompt = Prompt("""
        Generate a diverse range of book reviews and corresponding tags.
        Cover a wide range of genres, time periods, cultures, and reader personas.
        Do not repeat books already in the dataset.
        """)

    let dataset = Book.sampleBooks.map { book in
        ModelSample(prompt: book.review, expected: BookTags(tags: book.tags))
    }

    let targetCount = 100
    var expandedDataset = dataset

    for try await sample in dataset.makeSamples(prompt, targetCount: targetCount) {
        expandedDataset.append(sample)
        print("Generated \(expandedDataset.count) samples so far.")
    }
5:53 - Configure a custom SampleGenerator
    // Define your own configuration
    let generator = SampleGenerator<ModelSample<BookTags>>(
        prompt,
        samples: dataset,
        targetCount: targetCount,
        sessionProvider: {
            LanguageModelSession(
                model: PrivateCloudComputeLanguageModel(),
                instructions: """
                    You are a synthetic data generator for a book-tracking app's evaluation suite.
                    Your job is to produce realistic, diverse book entries that will stress-test a tagging system.

                    Rules:
                    - Review must be at least 100 characters long.
                    - Review should cover a mix of genre, mood/tone, and themes.
                    - Reviews should vary in length.
                    - Create between 3 and 8 tags.
                    - Tags must be lowercase.
                    """
            )
        }
    )
10:37 - Validate generated samples
    // Define validation metrics
    validator: { sample in
        guard let book = sample.expected else { return false }

        // Review must be at least 100 characters
        guard sample.promptDescription.count >= 100 else { return false }

        // Must have between 3 and 8 tags
        guard (3...8).contains(book.tags.count) else { return false }

        // All tags must be lowercase
        guard book.tags.allSatisfy({ $0 == $0.lowercased() }) else { return false }

        return true
    }
10:58 - Access valid and invalid results
    // Accessing results
    for try await sample in generator.run() {
        // During iteration
        expandedDataset.append(sample)
    }

    // After iteration
    let allSamples = await generator.samples
    let invalidSamples = await generator.invalidSamples
    print("Generated \(allSamples.count) new samples. Total: \(expandedDataset.count)")
15:30 - Define a tool's Generable argument
    @Generable
    struct SearchBooksArguments {
        @Guide(description: "A freeform search term to match against titles, reviews, or tags")
        var query: String?

        @Guide(description: "Filter results to books with this specific tag")
        var tag: String?

        @Guide(description: "Filter results by mood")
        var mood: String?

        @Guide(description: "Filter results by genre")
        var genre: String?

        @Guide(description: "Maximum number of results to return. Defaults to 5.")
        var limit: Int?
    }
16:37 - A basic trajectory expectation
    // "Find books tagged gothic"
    TrajectoryExpectation(
        unordered: [
            ToolExpectation(
                "searchBooks",
                arguments: [
                    .exact(argumentName: "tag", value: .string("gothic"))
                ]
            )
        ]
    )
17:07 - Match arguments by intent (naturalLanguage)
    // "Find something cheerful"
    TrajectoryExpectation(
        "searchBooks",
        arguments: [
            .naturalLanguage(
                argumentName: "mood",
                criteria: "Should relate to uplifting, hopeful, or positive feelings"
            )
        ]
    )

Other matchers available: .contains, .oneOf, .pattern, .range, and more.
17:34 - Expect tool calls in order
    // "Find gothic books and show details on the first"
    TrajectoryExpectation(
        ordered: [
            ToolExpectation(
                "searchBooks",
                arguments: [
                    .exact(argumentName: "tag", value: .string("gothic"))
                ]
            ),
            ToolExpectation(
                "getBookDetails",
                arguments: [
                    .keyOnly(argumentName: "bookId")
                ]
            )
        ]
    )
17:55 - Disallow specific tool calls
    // "Show only sci-fi books. Don't look for similar ones."
    TrajectoryExpectation(
        unordered: [
            ToolExpectation(
                "searchBooks",
                arguments: [
                    .naturalLanguage(
                        argumentName: "genre",
                        criteria: "Should refer to science fiction"
                    )
                ]
            )
        ],
        disallowed: [
            ToolExpectation("findSimilarBooks")
        ]
    )
18:14 - Build a tool call evaluation
    // Tool call evaluations
    let samples = SampleArrayLoader(samples: [
        ModelSample(
            prompt: "Find all the books tagged with 'gothic'.",
            instructions: "Help the user explore their book collection.",
            expectations: TrajectoryExpectation( )
        )
    ])

    struct BookLibraryToolCallEval: Evaluation {
        var dataset = samples
        let pass = Metric("All Passed")
        let percent = Metric("Percentage Passed")

        var evaluators: Evaluators {
            ToolCallEvaluator(allPass: pass, percentagePass: percent)
        }
    }
19:20 - Synthesize tool-evaluation samples
    // Tool call evaluations
    let prompt = Prompt("""
        Generate diverse user queries for a personal book library assistant.
        Each sample needs a prompt (what the user says), and a trajectory
        expectation describing which tools should be called and in what order.
        """)

    let instructions = """
        AVAILABLE TOOLS:
        - searchBooks(query?, tag?, mood?, genre?, limit?): search the library
        - getBookDetails(bookId): full details for one book
        - findSimilarBooks(bookId, maxResults?): find books sharing tags

        ORDER REQUIREMENTS:
        - searchBooks must come before getBookDetails or findSimilarBooks
        - Use TrajectoryExpectation(ordered:) when sequence matters, else (unordered:)

        USE THESE ARGUMENT MATCHERS:
        - .exact for precise values, .naturalLanguage for fuzzy matching
        - .keyOnly when any value is acceptable, .range for numeric constraints
        - .contains/.hasPrefix/.hasSuffix for partial string matching
        """
19:51 - Validate tool-evaluation samples
    // Tool call evaluations
    validator: { sample in
        // Must have expectations defined
        guard let expectations = sample.output.expectations else { return false }

        // Must reference at least one tool
        let totalExpectations = expectations.ordered.count + expectations.unordered.count
        guard totalExpectations > 0 else { return false }

        // All tool names must be from the valid set
        let validTools: Set<String> = ["searchBooks", "getBookDetails", "findSimilarBooks"]
        let allExpectations = expectations.ordered + expectations.unordered + expectations.disallowed
        for expectation in allExpectations {
            guard validTools.contains(expectation.name) else { return false }
        }

        return true
    }
- 0:00 - Introduction
Ada Wong and Kyle Murray introduce advanced features of the Evaluations framework (new in Xcode 27). They outline the agenda: growing your dataset with synthetic data, then building robust evaluations for agentic, tool-calling workflows. The session focuses on the develop-and-evaluate step of hill-climbing.
- 2:21 - The dataset problem in BookTracker
The BookTracker app auto-tags books from reviews, but its 13 hand-written sampleBooks give only a narrow view. Real-world reviews span countless books, genres, lengths, and styles, too much variety to capture by hand.
- 3:46 - Generating synthetic data with makeSamples
The makeSamples API takes a prompt, a seed dataset of ModelSample values that pair each review with its expected tags, and a target count (the full resulting size, including your seeds). It returns an async stream of new samples. Coverage of real usage matters more than raw quantity.
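Because the target count describes the size of the finished dataset rather than the number of samples to add, the generation budget is the difference between the target and the seed count. A minimal sketch of that arithmetic follows; `newSamplesNeeded` is a hypothetical helper for illustration and is not part of the Evaluations framework.

```swift
// Hypothetical helper, not a framework API: how many new samples a run
// has to produce when targetCount also counts the seed samples.
func newSamplesNeeded(seedCount: Int, targetCount: Int) -> Int {
    // Never negative: a dataset already at or above the target needs nothing.
    return max(0, targetCount - seedCount)
}

// BookTracker's numbers from this session: 13 seeds and a target of 100.
let bookTrackerBudget = newSamplesNeeded(seedCount: 13, targetCount: 100)
print(bookTrackerBudget) // prints 87
```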
- 6:27 - Customizing generation with SampleGenerator
For more control, SampleGenerator exposes a sessionProvider closure that picks the model (such as Private Cloud Compute) and the instructions. The session is reused across batches, but it can exhaust its context window mid-run, in which case the provider may be called again. Keep the instructions self-contained so that every session starts with everything it needs.
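The reason self-contained instructions matter can be illustrated with a small simulation. The sketch below does not use the framework: `FakeSession`, its numeric budget, and `runBatches` are all hypothetical stand-ins, and the way the real generator detects an exhausted context window is an assumption. It only shows the pattern of reusing one session until it runs out, then asking the provider for a fresh one that knows nothing except its instructions.

```swift
// Hypothetical stand-in for a model session with a fixed context budget.
// This is not the framework's LanguageModelSession.
final class FakeSession {
    let instructions: String
    private(set) var remainingBudget: Int

    init(instructions: String, budget: Int) {
        self.instructions = instructions
        self.remainingBudget = budget
    }

    // Returns false when the batch no longer fits in the remaining budget.
    func run(batchCost: Int) -> Bool {
        guard batchCost <= remainingBudget else { return false }
        remainingBudget -= batchCost
        return true
    }
}

// Runs `count` batches, reusing one session until it is exhausted and then
// calling the provider again. Returns how many times the provider was called.
// Assumes a single batch always fits into a fresh session.
func runBatches(count: Int, batchCost: Int, provider: () -> FakeSession) -> Int {
    var session = provider()
    var providerCalls = 1
    for _ in 0..<count {
        if !session.run(batchCost: batchCost) {
            // The fresh session has no history; only its instructions carry over.
            session = provider()
            providerCalls += 1
            _ = session.run(batchCost: batchCost)
        }
    }
    return providerCalls
}
```

With a budget of 100 and a batch cost of 40, five batches require three sessions, so anything the first session learned only through conversation history would be lost twice during the run.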
- 8:38 - Sampling strategies
The samplingStrategy controls which seed samples are shown to the model as in-context examples: random (a varied subset, the default) or slidingWindow (sequential, for datasets with meaningful order).
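The difference between the two strategies can be sketched in plain Swift. `SeedSelection` and `pickSeeds` below are hypothetical names, and the exact way the framework advances its window between batches is an assumption; the sketch only contrasts a shuffled subset with a window that moves through the seeds in their original order.

```swift
// Hypothetical illustration of the two strategies; not the framework's types.
enum SeedSelection {
    case random         // a varied subset on every batch
    case slidingWindow  // consecutive seeds, advancing with each batch
}

func pickSeeds<T>(from seeds: [T], count: Int, batchIndex: Int,
                  strategy: SeedSelection) -> [T] {
    guard !seeds.isEmpty, count > 0, batchIndex >= 0 else { return [] }
    let size = min(count, seeds.count)
    switch strategy {
    case .random:
        // Shuffle, then keep the first `size` seeds.
        return Array(seeds.shuffled().prefix(size))
    case .slidingWindow:
        // Start where the previous batch stopped, wrapping at the end.
        let start = (batchIndex * size) % seeds.count
        return (0..<size).map { seeds[(start + $0) % seeds.count] }
    }
}
```

For five seeds and a window of two, batches 0, 1, and 2 see seeds 1 and 2, then 3 and 4, then 5 and 1, which preserves any meaningful ordering in the dataset.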
- 10:11 - Validating synthetic samples
A validator closure accepts or rejects each generated sample in isolation against systematic rules: review length at least 100 characters, 3 to 8 tags, lowercase tags. Valid samples collect in samples, rejects in invalidSamples, both updated in real time.
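The same three rules can be tried outside the framework. In the sketch below, `ReviewSample`, `isValid`, and `splitByValidity` are hypothetical names introduced for illustration; the split into two arrays mirrors the idea of separate valid and invalid collections, not the generator's actual implementation.

```swift
// Hypothetical stand-in for a generated sample: a review plus its tags.
struct ReviewSample {
    let review: String
    let tags: [String]
}

// Applies the three rules from this chapter to one sample in isolation.
func isValid(_ sample: ReviewSample) -> Bool {
    // Review must be at least 100 characters.
    guard sample.review.count >= 100 else { return false }
    // Must have between 3 and 8 tags.
    guard (3...8).contains(sample.tags.count) else { return false }
    // All tags must be lowercase.
    return sample.tags.allSatisfy { $0 == $0.lowercased() }
}

// Sorts generated samples into accepted and rejected groups.
func splitByValidity(_ generated: [ReviewSample])
    -> (valid: [ReviewSample], invalid: [ReviewSample]) {
    var valid: [ReviewSample] = []
    var invalid: [ReviewSample] = []
    for sample in generated {
        if isValid(sample) {
            valid.append(sample)
        } else {
            invalid.append(sample)
        }
    }
    return (valid, invalid)
}
```

Keeping the rejected samples around, rather than discarding them, makes it possible to inspect why generation is drifting away from the rules.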
- 13:04 - Comparing evaluation results
Use the Xcode 27 Evaluations Report to compare the 13-sample run against the 100-sample run. The quality scores drop on the larger dataset, which shows that the feature only looked good on the small one. A drop can signal issues in the prompt, the feature, the evaluation, or the dataset.
- 15:09 - Tool calling and tool evaluations
Features often take multiple behind-the-scenes tool calls, and a plausible answer can come from the wrong path. Tool evaluations verify how the answer was produced: correct tools, correct arguments, correct order, and no unexpected calls. The chapter illustrates this with searchBooks, getBookDetails, and findSimilarBooks.
- 18:54 - Trajectory expectations
A TrajectoryExpectation checks the kind and order of tool calls in a session transcript. Refine with argument matchers (exact, naturalLanguage, contains, oneOf, pattern, range), plus ordered expectations and a disallowed set for tools that must not be called.
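To make the three kinds of checks concrete, here is a simplified matcher written in plain Swift. `ToolCall`, `ExpectedCall`, and `Trajectory` are hypothetical types, not the framework's, and the semantics are assumptions: only exact argument matching is modeled, ordered expectations must appear as a subsequence of the recorded calls, and one recorded call may satisfy more than one unordered expectation.

```swift
// Hypothetical record of one tool call taken from a transcript.
struct ToolCall {
    let name: String
    let arguments: [String: String]
}

// Hypothetical expectation: a tool name plus arguments that must match exactly.
struct ExpectedCall {
    let name: String
    var exactArguments: [String: String] = [:]

    func matches(_ call: ToolCall) -> Bool {
        guard call.name == name else { return false }
        return exactArguments.allSatisfy { call.arguments[$0.key] == $0.value }
    }
}

struct Trajectory {
    var ordered: [ExpectedCall] = []
    var unordered: [ExpectedCall] = []
    var disallowed: [String] = []

    func isSatisfied(by calls: [ToolCall]) -> Bool {
        // 1. No disallowed tool may appear anywhere in the transcript.
        if calls.contains(where: { disallowed.contains($0.name) }) {
            return false
        }
        // 2. Every unordered expectation must match at least one call.
        for expectation in unordered {
            if !calls.contains(where: { expectation.matches($0) }) {
                return false
            }
        }
        // 3. Ordered expectations must match in sequence; other calls may
        //    appear in between.
        var next = 0
        for call in calls {
            if next < ordered.count && ordered[next].matches(call) {
                next += 1
            }
        }
        return next == ordered.count
    }
}
```

Under these assumptions, a transcript that calls getBookDetails before searchBooks fails an ordered expectation listing them the other way around, even though both tools were called with the right arguments.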
- 21:26 - Building a tool call evaluation
Bring the trajectory expectations together: a dataset of samples (each a prompt plus expectation) scored by ToolCallEvaluator, which combines a LanguageModelSession with the tools, captures the structured transcript, and reports alongside your other results in Xcode.
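The evaluation shown in this session reports two metrics, "All Passed" and "Percentage Passed". How ToolCallEvaluator computes them is not spelled out here, so the following is only one plausible reading: given a list of pass/fail outcomes, report whether every one passed and what share passed. `summarize` is a hypothetical helper, not a framework API.

```swift
// Hypothetical aggregation over pass/fail outcomes; an assumption about what
// the two metrics mean, not the framework's implementation.
func summarize(_ outcomes: [Bool]) -> (allPassed: Bool, percentagePassed: Double) {
    // An empty run is treated as not passed rather than vacuously true.
    guard !outcomes.isEmpty else { return (false, 0) }
    let passed = outcomes.filter { $0 }.count
    let percentage = Double(passed) / Double(outcomes.count) * 100
    return (passed == outcomes.count, percentage)
}
```

Tracking both numbers is useful because a single failure flips the all-or-nothing metric, while the percentage shows whether the run is close to passing or far from it.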
- 22:02 - Synthetic data for tool evaluations
Because ModelSample and TrajectoryExpectation are Generable, you can synthesize tool-evaluation samples too. Describe the available tools, the order requirements, and the context in the prompt and instructions. Then validate that each sample has an expectation, references at least one tool, and names only tools that exist.
- 23:49 - Next steps
Run BookTaggingEvaluation (what the model produces) and tool evaluations (how it gets there) in one suite for end-to-end confidence. Next steps: create your own synthetic data, evaluate your app's custom tools, and explore the sample app and documentation.