Evals For Taste - V3 Reconstruction

This page reconstructs the readable workshop material from IMG_7451 through IMG_7459. The source images are appended beneath each section.

Title

Source: IMG_7451

Evals for taste: Hill-climbing a slide-generation agent

The title slide shows the session running from 13:00 to 13:45 and lists Koki Yoshida as presenter.

What Are Evals?

Source: IMG_7452

Systematic tests

Evals measure how well an AI system performs on a specific domain or use case.

Tasks + grading logic

They are made up of tasks that define scenarios and encode expectations through grading logic.
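As a rough illustration of "tasks + grading logic", the sketch below pairs each scenario with a function that encodes the expectation. The `Task` class, `run_eval`, the two example tasks, and `toy_system` are all hypothetical names invented for this example; they are not taken from the workshop.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """One eval task: a scenario plus the logic that grades an output."""
    name: str
    prompt: str                    # the scenario handed to the system
    grade: Callable[[str], bool]   # encodes the expectation as pass/fail


def run_eval(system: Callable[[str], str], tasks: list[Task]) -> dict[str, bool]:
    """Run the system on every task and record pass/fail per task name."""
    return {t.name: t.grade(system(t.prompt)) for t in tasks}


# Hypothetical tasks for a slide-generation agent that emits markdown.
TASKS = [
    Task("has_title", "Make a title slide about evals",
         lambda out: out.startswith("# ")),
    Task("max_five_bullets", "Summarize graders in bullets",
         lambda out: out.count("\n- ") <= 5),
]


def toy_system(prompt: str) -> str:
    """Stand-in for a real agent; always returns the same markdown."""
    return "# Evals\n- fast\n- cheap"


results = run_eval(toy_system, TASKS)
pass_rate = sum(results.values()) / len(results)
print(results, pass_rate)
```

Keeping the grader attached to the task means the expectation is written down once, next to the scenario it applies to.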

Confidence bridge

They bridge the gap between "it seems to work" and "we know it works", so a team can ship confidently.

Some Famous Evals

Source: IMG_7453

The slide groups public benchmarks into three broad families.

Category	Examples shown
Agentic coding	SWE-bench, Terminal-Bench
Tool use and agents	tau-bench, MCP Atlas, OSWorld, BrowseComp
Reasoning and knowledge	GPQA Diamond, MMMLU, MMMU, ARC-AGI-2

The benchmark table compares models across agentic coding, terminal coding, multidisciplinary reasoning, agentic search, scaled tool use, computer use, financial analysis, cybersecurity vulnerability reproduction, graduate-level reasoning, visual reasoning, and multilingual Q&A.

Why Are Evals Important?

Sources: IMG_7454, IMG_7455

Without evals

Teams are flying blind and get stuck in reactive loops:

Issues are caught only in production.
Fixing one failure creates others.
Genuine feedback cannot be distinguished from noise.
Improvements and regressions can be verified only by guess and check.

With evals

Teams can streamline AI system development:

Forcing clarity: what does success look like?
Iterating toward better agent configs.
Adopting new models quickly, with insight into performance, latency, cost, error rates, etc.
Making problems visible before launch, upholding a consistent quality bar.
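One way to make regressions visible, rather than guessed at, is to diff per-task results between two eval runs. This is a minimal sketch that assumes each run is stored as a mapping from task name to pass/fail; `compare_runs` and the sample data are hypothetical.

```python
def compare_runs(baseline: dict[str, bool],
                 candidate: dict[str, bool]) -> dict[str, list[str]]:
    """List tasks whose pass/fail status changed between two runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        # passed before, fails now
        "regressions": [t for t in shared if baseline[t] and not candidate[t]],
        # failed before, passes now
        "improvements": [t for t in shared if not baseline[t] and candidate[t]],
    }


# Hypothetical results: the same three tasks run before and after a change.
baseline = {"title": True, "layout": False, "contrast": True}
candidate = {"title": True, "layout": True, "contrast": False}

report = compare_runs(baseline, candidate)
print(report)
```

In this example the overall pass rate is unchanged at two out of three, which is exactly the case where a per-task diff shows what an aggregate number hides.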

Evals In The Prompt Engineering Lifecycle

Sources: IMG_7456, IMG_7457

Develop eval test cases, also called tasks.
Write a preliminary prompt or agent config.
Run the prompt or agent against tasks.
Refine the prompt or agent config.
Ship the polished prompt or agent config.

Evals sit in the iteration loop between running the agent against tasks and refining the prompt or agent config.

Agent config means architecture, system prompt, tool design, context engineering techniques, and related choices.
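The run-and-refine loop can be sketched as a simple hill climb: propose a changed config, score it on the tasks, and keep it only if the score improves. Everything below is an assumption made for illustration. In particular, `toy_score` stands in for a real eval run and `toy_propose` stands in for a person or model editing the config.

```python
from typing import Callable

Config = dict[str, str]


def hill_climb(initial: Config,
               propose: Callable[[Config], Config],
               score: Callable[[Config], float],
               rounds: int) -> tuple[Config, list[float]]:
    """Keep a candidate config only when it scores strictly higher."""
    best, best_score = initial, score(initial)
    history = [best_score]
    for _ in range(rounds):
        candidate = propose(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
        history.append(best_score)
    return best, history


REQUIRED = ("title", "bullets", "contrast")


def toy_score(config: Config) -> float:
    """Stand-in for an eval run: fraction of required ideas in the prompt."""
    prompt = config["system_prompt"]
    return sum(word in prompt for word in REQUIRED) / len(REQUIRED)


# A fixed sequence of hypothetical prompt edits, tried one per round.
edits = iter([" Add a title.", " Be funny.", " Limit bullets."])


def toy_propose(config: Config) -> Config:
    """Stand-in for a refinement step: append the next edit to the prompt."""
    return {**config, "system_prompt": config["system_prompt"] + next(edits)}


best, history = hill_climb({"system_prompt": "Make slides."},
                           toy_propose, toy_score, rounds=3)
print(best["system_prompt"], history)
```

The second edit is discarded because it does not raise the score, so the final prompt contains only the two edits that helped.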

Graders: Code-Based

Source: IMG_7458

String matching: exact, regex, fuzzy.
Unit tests: fail-to-pass, pass-to-pass.
Static analysis: linting, type checking.
Final state and tool call checks.

Strength

Fast, cheap, deterministic.

Weakness

Brittle, lacking in nuance.
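A few of the code-based graders listed above can be sketched with the Python standard library alone. The function names, the 0.8 fuzzy threshold, and the trace format (a list of dicts with a `name` key) are assumptions made for this example.

```python
import re
from difflib import SequenceMatcher


def exact_match(output: str, expected: str) -> bool:
    """String match, ignoring only leading and trailing whitespace."""
    return output.strip() == expected.strip()


def regex_match(output: str, pattern: str) -> bool:
    """Pass if the pattern occurs anywhere in the output."""
    return re.search(pattern, output) is not None


def fuzzy_match(output: str, expected: str, threshold: float = 0.8) -> bool:
    """Pass if the case-insensitive similarity ratio meets the threshold."""
    ratio = SequenceMatcher(None, output.lower(), expected.lower()).ratio()
    return ratio >= threshold


def called_tools_in_order(trace: list[dict], expected: list[str]) -> bool:
    """Tool call check: expected names must appear in the trace, in order."""
    names = iter(call["name"] for call in trace)
    # `tool in names` consumes the iterator, so order is enforced.
    return all(tool in names for tool in expected)


# Hypothetical agent trace for a slide-generation run.
trace = [{"name": "search_images"}, {"name": "create_slide"}, {"name": "export_pdf"}]

print(exact_match("Q3 Revenue\n", "Q3 Revenue"))
print(exact_match("Q3 revenue", "Q3 Revenue"))
print(regex_match("Slide 3 of 10", r"Slide \d+ of \d+"))
print(fuzzy_match("Quarterly revenue", "Quartely revenue"))
print(called_tools_in_order(trace, ["create_slide", "export_pdf"]))
```

The second call shows the brittleness noted above: a single change in capitalization fails the exact match, even though a reader would accept the output.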

Graders: Code, Model, Human

Source: IMG_7459

Grader type	Methods	Strength	Weakness
Code-based graders	String match, regex, fuzzy; unit tests; static analysis; final state and tool call checks.	Fast, cheap, deterministic.	Brittle, lacking in nuance.
Model-based graders	Rubric-based scoring, pairwise comparison, multi-judge consensus.	Flexible, scalable, nuanced.	Non-deterministic, costs money, requires calibration.
Human graders	SME review, crowdsourced judgment, spot-check sampling, A/B testing.	Flexible, high quality, nuanced.	Slow and expensive.
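Pairwise comparison with multi-judge consensus, plus a basic calibration check against human labels, might look like the sketch below. The three judges are deterministic stand-ins for model calls, written only so the example runs offline. A real judge would prompt a model with a rubric, and none of these names come from the workshop.

```python
from collections import Counter
from typing import Callable

# A judge takes two candidate outputs and returns "A" or "B".
Judge = Callable[[str, str], str]


def pairwise_consensus(judges: list[Judge], a: str, b: str) -> tuple[str, float]:
    """Majority vote across judges; returns the winner and its vote share.

    Use an odd number of judges so that ties cannot occur.
    """
    votes = Counter(judge(a, b) for judge in judges)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(judges)


def agreement_rate(model_labels: list[str], human_labels: list[str]) -> float:
    """Calibration check: share of items where model and human agree."""
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(human_labels)


# Hypothetical stand-in judges (a real one would call a model).
def prefers_shorter(a: str, b: str) -> str:
    return "A" if len(a) <= len(b) else "B"


def prefers_title(a: str, b: str) -> str:
    return "A" if a.startswith("# ") else "B"


def prefers_more_words(a: str, b: str) -> str:
    return "A" if len(a.split()) >= len(b.split()) else "B"


slide_a = "# Plan\n- one\n- two"
slide_b = "Plan: one, two, three, four, five and more text"

winner, agreement = pairwise_consensus(
    [prefers_shorter, prefers_title, prefers_more_words], slide_a, slide_b)
calibration = agreement_rate(["A", "B", "A", "A"], ["A", "B", "B", "A"])
print(winner, agreement, calibration)
```

A low agreement rate against human labels is a signal to revise the judge prompt or rubric before trusting the model-based grader at scale.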