Systematic testing and measurement for AI applications. Understand what evals are, why they matter, and how AI Stack helps you ship with confidence.

AI Evals Overview

Your AI is only as good as your evals. Without systematic evaluation, you're shipping on vibes — and vibes don't scale.

What are AI Evals?

AI evaluations (evals) are systematic tests that measure the quality, accuracy, and reliability of AI outputs. Unlike traditional unit tests with binary pass/fail outcomes, evals deal with the inherent nondeterminism of language models.

Every time you call an LLM, you might get a different response. There's rarely a single "right" answer. And worst of all, quality can degrade silently — a prompt change that improves one case might break ten others.

Evals give you a structured way to measure what "good" looks like, track it over time, and catch regressions before your users do.
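To make the idea of a silent regression concrete, here is a minimal sketch that compares per-case scores from two runs of the same eval. The case IDs, the scores, and the `find_regressions` helper are all hypothetical and chosen for illustration; they are not part of any particular tool.

```python
# Hypothetical per-case scores (0.0 to 1.0) from two runs of the same eval.
baseline = {"refund-policy": 1.0, "greeting": 0.0, "order-status": 1.0}
candidate = {"refund-policy": 0.0, "greeting": 1.0, "order-status": 1.0}


def mean_score(scores):
    """Average score across all cases in a run."""
    return sum(scores.values()) / len(scores)


def find_regressions(baseline, candidate, tolerance=0.0):
    """Return the IDs of cases whose score dropped by more than `tolerance`."""
    return sorted(
        case_id
        for case_id, old_score in baseline.items()
        if case_id in candidate and old_score - candidate[case_id] > tolerance
    )


# The averages are identical, so an aggregate number alone hides the change.
print(mean_score(baseline) == mean_score(candidate))
# The per-case comparison surfaces the case that got worse.
print(find_regressions(baseline, candidate))
```

In this made-up example the prompt change fixed one case and broke another, so the average stayed flat. Tracking scores per case, not only in aggregate, is what exposes the regression.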

The Three Pillars of an Eval

Every evaluation consists of three components:

1. Data

Your test cases — a set of inputs paired with expected outputs (or reference answers). These can come from:

Hand-curated examples for critical scenarios
Production logs sampled from real usage
Synthetic data generated for edge cases
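As a rough sketch, a dataset can be as simple as a list of records. The `EvalCase` type, its field names, and the example rows below are assumptions made for illustration, not a required schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    input: str     # what gets sent to the task
    expected: str  # reference answer a scorer compares against
    source: str    # "curated", "production", or "synthetic"


# Hypothetical cases, one from each of the three sources listed above.
dataset = [
    EvalCase("What is your refund window?", "30 days", "curated"),
    EvalCase("can i get my money back??", "30 days", "production"),
    EvalCase("", "Please enter a question.", "synthetic"),
]

# Counting cases per source helps spot a dataset that leans on one origin.
cases_per_source = Counter(case.source for case in dataset)
print(cases_per_source)
```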

2. Task

The AI function you're evaluating. This is whatever you'd normally call in production — your prompt + model combination, a RAG pipeline, an agent workflow, etc.
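One way to picture a task is as a plain function from input to output. In the sketch below, `support_bot` and `fake_model` are hypothetical names; `fake_model` is a canned stand-in so the example runs without network access, and a real task would call your model or pipeline there instead.

```python
from typing import Callable

# A task, under this sketch's assumptions, is any function from input text to output text.
Task = Callable[[str], str]


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned answer."""
    if "refund" in prompt.lower():
        return "Refunds are accepted within 30 days."
    return "I'm not sure."


def support_bot(question: str) -> str:
    """The prompt + model combination under evaluation."""
    prompt = f"Answer the customer question concisely.\n\nQuestion: {question}"
    return fake_model(prompt)


task: Task = support_bot
print(task("What is your refund window?"))
```

Keeping the task behind a single function signature means the same eval can be pointed at a different prompt, model, or pipeline without changing the data or the scorers.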

3. Scores

Functions that measure the quality of each output. Scores can be:

Code-based: Deterministic checks like exact match, string containment, regex patterns, or JSON schema validation
LLM-as-Judge: Another model assesses quality on dimensions like helpfulness, accuracy, or tone
Custom rubrics: Domain-specific scoring tailored to your use case

Offline vs. Online Evaluation

Offline Evals

Run before deployment, against a fixed dataset:

Run in CI/CD pipelines
Compare model versions
Test prompt changes
Validate with labeled data
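A minimal sketch of how these checks might fit together: run two versions of a task over the same labeled data, then turn each mean score into a pass/fail exit code that a CI job could act on. The dataset, both task versions, the `run_eval` and `gate` helpers, and the 0.9 threshold are all hypothetical.

```python
from statistics import mean

# Hypothetical labeled data: inputs paired with expected outputs.
dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]


def exact_match(output, expected):
    return 1.0 if output.strip() == expected.strip() else 0.0


def task_v1(question):
    """Stand-in for the current prompt or model version."""
    return {"2 + 2": "4", "capital of France": "Lyon"}[question]


def task_v2(question):
    """Stand-in for the candidate prompt or model version."""
    return {"2 + 2": "4", "capital of France": "Paris"}[question]


def run_eval(dataset, task, scorer):
    """Run `task` on every case and return the mean score."""
    scores = [scorer(task(case["input"]), case["expected"]) for case in dataset]
    return mean(scores)


def gate(score, threshold):
    """Return a process exit code: 0 to allow the deploy, 1 to block it."""
    return 0 if score >= threshold else 1


score_v1 = run_eval(dataset, task_v1, exact_match)
score_v2 = run_eval(dataset, task_v2, exact_match)
print(f"v1={score_v1:.2f} exit={gate(score_v1, 0.9)}")
print(f"v2={score_v2:.2f} exit={gate(score_v2, 0.9)}")
```

In a real pipeline, the script would pass the gate's return value to `sys.exit` so that a failing score fails the build.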

Online Evals

Run in production against live traffic:

Monitor real-world quality
Catch distribution shifts
Score a sample of production responses
Feed results back into your offline datasets
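The steps above can be sketched as a sampling loop over logged responses. Everything here is an assumption for illustration: the log record shape, the `not_empty` reference-free scorer, the review threshold, and the choice to sample by hashing the request ID so that a given request is always either in or out of the sample.

```python
import hashlib


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def not_empty(output: str) -> float:
    """Reference-free scorer: live traffic has no expected answer to compare to."""
    return 1.0 if output.strip() else 0.0


def score_live_traffic(logs, scorer, rate, review_below=0.5):
    """Score a sample of logged responses and collect low scorers for review."""
    scored, needs_review = [], []
    for record in logs:
        if not should_sample(record["id"], rate):
            continue
        score = scorer(record["output"])
        scored.append((record["id"], score))
        if score < review_below:
            # Candidates to label and add to the offline dataset.
            needs_review.append({"input": record["input"], "output": record["output"]})
    return scored, needs_review


# Hypothetical production log records.
logs = [
    {"id": "req-1", "input": "Refund window?", "output": "30 days."},
    {"id": "req-2", "input": "Where is my order?", "output": ""},
]

scored, needs_review = score_live_traffic(logs, not_empty, rate=1.0)
print(scored)
print(needs_review)
```

The example uses a rate of 1.0 so that both records are scored; a production setup would typically use a much lower rate to limit cost.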

The two are complementary, so use both: offline evals gate deployments, and online evals monitor production.

Next Steps

Quickstart — Get your first eval running in 5 minutes
Running Evaluations — CLI usage, configuration, and the eval UI
Scorers — Code-based, LLM-as-Judge, and custom scoring
Datasets — Building and managing test data
CI/CD Integration — Automate evals in your deployment pipeline
