AI Evals Overview

Systematic testing and measurement for AI applications. Understand what evals are, why they matter, and how AI Stack helps you ship with confidence.

Your AI is only as good as your evals. Without systematic evaluation, you're shipping on vibes — and vibes don't scale.

What are AI Evals?

AI evaluations (evals) are systematic tests that measure the quality, accuracy, and reliability of AI outputs. Unlike traditional unit tests, which usually have binary pass/fail outcomes, evals typically produce graded scores and have to account for the inherent nondeterminism of language models.

Every time you call an LLM, you might get a different response. There's rarely a single "right" answer. And worst of all, quality can degrade silently — a prompt change that improves one case might break ten others.

Evals give you a structured way to measure what "good" looks like, track it over time, and catch regressions before your users do.
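
Because the same input can produce different outputs, one way to make nondeterminism measurable is to run a case several times and report a pass rate instead of a single pass/fail. The sketch below assumes a hypothetical `flaky_task` that stands in for a real model call and picks a canned answer at random; `pass_rate` and its parameters are illustrative names, not part of any AI Stack API.

```python
import random


def flaky_task(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call: usually right, sometimes wrong."""
    return rng.choice(["Paris", "Paris", "Paris", "Lyon"])


def pass_rate(task_fn, prompt: str, expected: str, trials: int, rng: random.Random) -> float:
    """Run the task `trials` times and return the fraction of exact matches."""
    passes = sum(task_fn(prompt, rng) == expected for _ in range(trials))
    return passes / trials


# Seeding the generator only makes this demo repeatable; a real model call
# would not take an rng argument at all.
rate = pass_rate(flaky_task, "Capital of France?", "Paris", trials=20, rng=random.Random(0))
print(f"pass rate over 20 trials: {rate:.2f}")
```

Tracking a rate like this over time turns "it sometimes fails" into a number you can compare before and after a change.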

The Three Pillars of an Eval

Every evaluation consists of three components:

1. Data

Your test cases — a set of inputs paired with expected outputs (or reference answers). These can come from:

  • Hand-curated examples for critical scenarios
  • Production logs sampled from real usage
  • Synthetic data generated for edge cases

2. Task

The AI function you're evaluating. This is whatever you'd normally call in production — your prompt + model combination, a RAG pipeline, an agent workflow, etc.

3. Scores

Functions that measure the quality of each output. Scores can be:

  • Code-based: Deterministic checks like exact match, string containment, regex patterns, or JSON schema validation
  • LLM-as-Judge: Use another model to assess quality on dimensions like helpfulness, accuracy, or tone
  • Custom rubrics: Domain-specific scoring tailored to your use case
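
The three components above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the AI Stack API: `Case`, `task`, `exact_match`, `contains`, and `run_eval` are hypothetical names, the dataset is invented, and `task` returns canned strings where a real model call would go.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Case:
    input: str
    expected: str


# 1. Data: hand-curated cases (invented for this example).
DATASET = [
    Case(input="What is 2 + 2?", expected="4"),
    Case(input="Capital of France?", expected="Paris"),
]


# 2. Task: stand-in for your real prompt + model call.
def task(prompt: str) -> str:
    canned = {
        "What is 2 + 2?": "4",
        "Capital of France?": "The capital of France is Paris.",
    }
    return canned.get(prompt, "")


# 3. Scores: code-based scorers that return 0.0 or 1.0.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0


def contains(output: str, expected: str) -> float:
    return 1.0 if expected.lower() in output.lower() else 0.0


Scorer = Callable[[str, str], float]


def run_eval(dataset: list[Case], task_fn: Callable[[str], str],
             scorers: dict[str, Scorer]) -> dict[str, float]:
    """Run the task once per case and average each scorer over the dataset."""
    outputs = [(case, task_fn(case.input)) for case in dataset]
    return {
        name: mean(score(out, case.expected) for case, out in outputs)
        for name, score in scorers.items()
    }


results = run_eval(DATASET, task, {"exact_match": exact_match, "contains": contains})
print(results)  # {'exact_match': 0.5, 'contains': 1.0}
```

The two scorers disagree on the second case: the answer is correct but phrased as a sentence, so exact match fails while string containment passes. Choosing a scorer that matches what "good" means for your use case is part of designing the eval.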

Why Evals Matter

Vibes don't scale

When you have 3 prompts and test them by hand, intuition works. When you have 30 prompts serving 10,000 users, you need data.

Silent regressions are real

Model updates, prompt tweaks, and context changes can degrade quality without any error being thrown. Evals that run on every change surface these drops as falling scores.

Model migrations need proof

Switching from GPT-4 to Claude or to a fine-tuned model? Without evals, you're guessing whether the new model is actually better.

Production drift happens

The inputs your AI sees in production evolve over time. Refreshing your eval datasets with sampled production data keeps your tests representative of what users actually send.

You need a baseline

Before you can improve, you need to know where you stand. Evals establish a measurable baseline for iteration.

Offline vs. Online Evaluation

Offline Evals

Run before deployment against a fixed dataset. Think of these as your test suite:

  • Run in CI/CD pipelines
  • Compare model versions
  • Test prompt changes
  • Validate outputs against labeled data
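
As a rough sketch of how an offline eval might gate a change in CI, the example below compares per-case scores from a candidate run against a stored baseline. The function names, case IDs, scores, and thresholds are all hypothetical; it assumes each run produces a mapping from case ID to a score between 0 and 1.

```python
from statistics import mean


def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.0) -> list[str]:
    """Return IDs of cases whose score dropped by more than `tolerance`.

    A case missing from the candidate run is scored 0.0.
    """
    return sorted(
        case_id
        for case_id, base_score in baseline.items()
        if candidate.get(case_id, 0.0) < base_score - tolerance
    )


def gate(baseline: dict[str, float], candidate: dict[str, float],
         min_mean: float = 0.7, tolerance: float = 0.0) -> bool:
    """Pass only if the mean score is high enough AND no case regressed."""
    regressions = find_regressions(baseline, candidate, tolerance)
    return mean(candidate.values()) >= min_mean and not regressions


# Invented scores: the candidate fixes two cases but breaks one.
baseline = {"case-1": 1.0, "case-2": 0.0, "case-3": 0.0, "case-4": 1.0}
candidate = {"case-1": 1.0, "case-2": 1.0, "case-3": 1.0, "case-4": 0.0}

print(mean(baseline.values()), mean(candidate.values()))  # 0.5 0.75
print(find_regressions(baseline, candidate))              # ['case-4']
print(gate(baseline, candidate))                          # False
```

Here the average improves from 0.5 to 0.75, yet the gate still fails because one previously passing case now fails. In a CI job, a failed gate would typically translate into a nonzero exit code so the pipeline stops; whether a single regression should block a deploy is a policy choice, and `tolerance` is one way to loosen it.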

Online Evals

Run in production against live traffic:

  • Monitor real-world quality
  • Catch distribution shifts
  • Score a sample of production responses
  • Feed results back into your offline datasets
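
Scoring every production response is often too expensive, so online evals commonly score a sample. The sketch below shows one possible approach: hash the request ID so the same request is always in or out of the sample, score the sampled responses, and collect low scorers as candidates for the offline dataset. `should_score`, `non_empty`, and `score_sampled` are hypothetical names, and the scorer is a deliberately trivial placeholder.

```python
import hashlib
from typing import Callable, Iterable


def should_score(request_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


def non_empty(output: str) -> float:
    """Placeholder scorer: 1.0 if the response has visible content, else 0.0."""
    return 1.0 if output.strip() else 0.0


def score_sampled(responses: Iterable[tuple[str, str]], sample_rate: float,
                  scorer: Callable[[str], float]) -> tuple[int, list[str]]:
    """Score a sample of (request_id, output) pairs.

    Returns the number of responses scored and the IDs that scored below 1.0.
    """
    scored = 0
    flagged = []
    for request_id, output in responses:
        if not should_score(request_id, sample_rate):
            continue
        scored += 1
        if scorer(output) < 1.0:
            flagged.append(request_id)
    return scored, flagged


# With sample_rate=1.0 every response is scored; "r2" is flagged as empty.
scored, flagged = score_sampled([("r1", "hello"), ("r2", "   ")], 1.0, non_empty)
print(scored, flagged)  # 2 ['r2']
```

The flagged IDs are the natural feed for your offline datasets: review them, label the expected output, and add them as new test cases.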

The best teams use both. Offline evals gate deployments; online evals monitor production.

Next Steps