Title
Source: IMG_7451
Evals for taste: Hill-climbing a slide-generation agent
The title slide shows the session running from 13:00 to 13:45 and lists Koki Yoshida as presenter.
What Are Evals?
Source: IMG_7452
Systematic tests
Evals measure how well an AI system performs on a specific domain or use case.
Tasks + grading logic
They are made up of tasks that define scenarios and encode expectations through grading logic.
Confidence bridge
They bridge the gap between "it seems to work" and "we know it works", so a team can ship confidently.
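As a rough illustration of "tasks + grading logic", the sketch below pairs each scenario with a grader and reports a pass rate. Everything in it is an assumption made for illustration, not something shown on the slide: the `Task` class, `run_eval`, the `toy_system` stand-in, and the three example tasks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """One eval task: a scenario plus the grading logic that encodes expectations."""
    name: str
    prompt: str
    grade: Callable[[str], bool]  # True when the output meets the expectation


def run_eval(system: Callable[[str], str], tasks: list[Task]) -> dict[str, bool]:
    """Run each task through the system and record pass/fail per task."""
    return {task.name: task.grade(system(task.prompt)) for task in tasks}


def toy_system(prompt: str) -> str:
    """Hypothetical stand-in for the AI system under test."""
    return "# Evals for taste"


# Hypothetical tasks for a slide-generation agent.
tasks = [
    Task("has_title", "Make a slide about evals", lambda out: out.startswith("# ")),
    Task("on_topic", "Make a slide about evals", lambda out: "evals" in out.lower()),
    Task("has_bullets", "Make a slide about evals", lambda out: "\n- " in out),
]

results = run_eval(toy_system, tasks)
pass_rate = sum(results.values()) / len(results)
print(results)
print(f"pass rate: {pass_rate:.2f}")
```

Keeping per-task results, rather than only the aggregate, makes it easier to see which expectation failed when the pass rate moves.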
Some Famous Evals
Source: IMG_7453
The slide groups public benchmarks into three broad families.
| Category | Examples shown |
|---|---|
| Agentic coding | SWE-bench, Terminal-Bench |
| Tool use and agents | τ-bench, MCP Atlas, OSWorld, BrowseComp |
| Reasoning and knowledge | GPQA Diamond, MMMLU, MMMU, ARC-AGI-2 |
The benchmark table compares models across agentic coding, terminal coding, multidisciplinary reasoning, agentic search, scaled tool use, computer use, financial analysis, cybersecurity vulnerability reproduction, graduate-level reasoning, visual reasoning, and multilingual Q&A.
Why Are Evals Important?
Sources: IMG_7454, IMG_7455
Without evals
Teams fly blind and get stuck in reactive loops:
- Issues are caught only in production.
- Fixing one failure creates others.
- Genuine feedback cannot be distinguished from noise.
- Improvements and regressions cannot be verified except by guess and check.
With evals
Teams can streamline AI system development by:
- Forcing clarity: what does success look like?
- Iterating toward better agent configs.
- Adopting new models quickly, with insight into performance, latency, cost, and error rates.
- Making problems visible before launch and upholding a consistent quality bar.
Evals In The Prompt Engineering Lifecycle
Sources: IMG_7456, IMG_7457
- Develop eval test cases, also called tasks.
- Write a preliminary prompt or agent config.
- Run the prompt or agent against tasks.
- Refine the prompt or agent config.
- Ship the polished prompt or agent config.
Evals sit in the iteration loop between running the agent against tasks and refining the prompt or agent config.
Agent config means architecture, system prompt, tool design, context engineering techniques, and related choices.
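The run-and-refine loop described above could be sketched roughly as follows: score a baseline config against the tasks, then keep a candidate config only when its score improves. The `AgentConfig` fields, the `run_agent` stand-in, and the example tasks are hypothetical and chosen only to make the loop runnable. A real agent would call a model and tools.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical agent config: only a system prompt and a tool list."""
    system_prompt: str
    tools: tuple[str, ...] = ()


# A task here is (input, grader); the grader returns True on success.
Task = tuple[str, Callable[[str], bool]]


def run_agent(config: AgentConfig, task_input: str) -> str:
    """Toy stand-in for the agent: its output depends only on the config."""
    first = f"# {task_input}" if "title" in config.system_prompt else task_input
    lines = [first]
    if "chart" in config.tools:
        lines.append("[chart]")
    return "\n".join(lines)


def score(config: AgentConfig, tasks: list[Task]) -> float:
    """Fraction of tasks whose grader accepts the agent's output."""
    return sum(grade(run_agent(config, inp)) for inp, grade in tasks) / len(tasks)


tasks: list[Task] = [
    ("Q3 revenue", lambda out: out.startswith("# ")),
    ("Q3 revenue", lambda out: "[chart]" in out),
]

baseline = AgentConfig("You make slides.")
candidates = [
    AgentConfig("You make slides. Always include a title."),
    AgentConfig("You make slides.", tools=("chart",)),
    AgentConfig("You make slides. Always include a title.", tools=("chart",)),
]

# Hill-climb: adopt a candidate only when it beats the current best score.
best, best_score = baseline, score(baseline, tasks)
for candidate in candidates:
    candidate_score = score(candidate, tasks)
    if candidate_score > best_score:
        best, best_score = candidate, candidate_score

print(best, best_score)
```

Requiring a strict improvement means a change that merely trades one passing task for another (the second candidate here) is not adopted.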
Graders: Code-Based
Source: IMG_7458
- String match, regex, fuzzy.
- Unit tests: fail-to-pass, pass-to-pass.
- Static analysis: lint, type.
- Final state and tool call checks.
Strength
Fast, cheap, deterministic.
Weakness
Brittle, lacking in nuance.
Graders: Code, Model, Human
Source: IMG_7459
| Grader type | Methods | Strength | Weakness |
|---|---|---|---|
| Code-based graders | String match, regex, fuzzy; unit tests; static analysis; final state and tool call checks. | Fast, cheap, deterministic. | Brittle, lacking in nuance. |
| Model-based graders | Rubric-based scoring, pairwise comparison, multi-judge consensus. | Flexible, scalable, nuanced. | Non-deterministic, costs money, requires calibration. |
| Human graders | SME review, crowdsourced judgment, spot-check sampling, A/B testing. | Flexible, high quality, nuanced. | Slow and expensive. |
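To make "pairwise comparison" and "multi-judge consensus" concrete, here is a small sketch that aggregates votes from several judges by strict majority. The judges below are deterministic stand-ins written for this example. In practice each would be a model call with its own prompt, and the aggregation rule (strict majority, otherwise "tie") is an assumption, not something stated on the slide.

```python
from collections import Counter
from typing import Callable

# A judge compares two candidate outputs and votes "A" or "B".
Judge = Callable[[str, str], str]


def pairwise_consensus(judges: list[Judge], output_a: str, output_b: str) -> str:
    """Return the output preferred by a strict majority of judges, else "tie"."""
    votes = Counter(judge(output_a, output_b) for judge in judges)
    winner, count = votes.most_common(1)[0]
    return winner if count > len(judges) / 2 else "tie"


# Hypothetical stand-in judges; real ones would be LLM calls.
def prefers_shorter(a: str, b: str) -> str:
    return "A" if len(a) <= len(b) else "B"


def prefers_title(a: str, b: str) -> str:
    return "A" if a.startswith("# ") else "B"


def prefers_bullets(a: str, b: str) -> str:
    return "A" if "\n- " in a else "B"


slide_a = "# Evals\n- tasks\n- graders"
slide_b = "Evals"

three_judge_verdict = pairwise_consensus(
    [prefers_shorter, prefers_title, prefers_bullets], slide_a, slide_b
)
two_judge_verdict = pairwise_consensus([prefers_shorter, prefers_title], slide_a, slide_b)
print(three_judge_verdict)  # A
print(two_judge_verdict)    # tie
```

Using an odd number of judges avoids most ties, and spot-checking verdicts against human judgment is one way to approach the calibration the table mentions.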