AI Evals

Your AI is only as good as your evals

AI systems are unpredictable: the same prompt can produce different outputs from one run to the next. Evals are how you measure, test, and systematically improve quality, so you ship with confidence instead of hope.

01 — The problem

AI doesn't work like regular software

Traditional code is deterministic — the same input always gives the same output. AI isn't. This changes everything about how you test and maintain quality.

Input: "Summarize this article"
Possible AI outputs:
  • Concise, accurate summary
  • Verbose but mostly correct
  • Hallucinated facts included

Nondeterministic

Same input, different outputs. Run a prompt twice and you might get two different answers. Unlike a pure function f(x) = y, AI output can vary from run to run.
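One way to make that variation visible is to run the same input several times and count the distinct outputs. The sketch below is illustrative only: `fakeSummarize` is a hypothetical stand-in that cycles through canned outputs, where a real task would call your model.

```typescript
// Hypothetical stand-in for a model call: cycles through canned outputs
// to imitate run-to-run variation. Replace with a real model call.
const cannedOutputs: string[] = [
  "Concise, accurate summary",
  "Verbose but mostly correct",
  "Hallucinated facts included",
];
let callCount = 0;

function fakeSummarize(_input: string): string {
  const output = cannedOutputs[callCount % cannedOutputs.length] ?? "";
  callCount += 1;
  return output;
}

// Run the same input `runs` times and count how often each output appears.
function outputCounts(
  task: (input: string) => string,
  input: string,
  runs: number,
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (let i = 0; i < runs; i++) {
    const output = task(input);
    counts[output] = (counts[output] ?? 0) + 1;
  }
  return counts;
}

const summaryCounts = outputCounts(fakeSummarize, "Summarize this article", 6);
// More than one key in `summaryCounts` means the task is not deterministic.
const distinctOutputs = Object.keys(summaryCounts).length;
```

With the stub above, six runs produce three distinct outputs, two of each.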

No single right answer

"Good enough" is subjective. A summary can be accurate but too long, or concise but missing key points. Without measurement, quality is a guess.

Changes break things silently

A prompt tweak that improves one use case can degrade another. Without evals, you won't know until users complain.

02 — Anatomy of an eval

Three building blocks. Every time.

No matter the framework, every evaluation follows the same structure. Understand these three parts and you understand evals.

Data

A dataset of test cases with inputs and expected outputs. Built from production logs, user feedback, or manual curation.

Task

The AI function you're evaluating. A prompt, a multi-step agent, a retrieval pipeline — any logic that produces output.

Scores

Functions that measure quality by comparing outputs to expected results. Code-based, LLM-judged, or a combination of both.
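The three parts can be sketched in a few lines. This is a minimal illustration, not any particular framework's API: the type names, `runEval`, and the keyword-based classifier are all hypothetical, and the task is synchronous for brevity where a real model call would usually be asynchronous.

```typescript
// Data: test cases with an input and an expected output.
interface TestCase {
  input: string;
  expected: string;
}

// Task: the AI function under evaluation (stubbed here).
type Task = (input: string) => string;

// Scores: compare an output to the expected result, returning 0..1.
type Scorer = (output: string, expected: string) => number;

interface EvalResult {
  input: string;
  output: string;
  score: number;
}

// Run every test case through the task and score each output.
function runEval(
  data: TestCase[],
  task: Task,
  scorer: Scorer,
): { results: EvalResult[]; average: number } {
  const results = data.map((testCase) => {
    const output = task(testCase.input);
    return { input: testCase.input, output, score: scorer(output, testCase.expected) };
  });
  const total = results.reduce((sum, r) => sum + r.score, 0);
  return { results, average: results.length > 0 ? total / results.length : 0 };
}

// Hypothetical dataset and task, for illustration only.
const dataset: TestCase[] = [
  { input: "I love this product", expected: "positive" },
  { input: "This is terrible", expected: "negative" },
  { input: "It arrived on Tuesday", expected: "neutral" },
  { input: "Not bad at all", expected: "positive" },
];

const keywordClassifier: Task = (input) => {
  const text = input.toLowerCase();
  if (text.indexOf("love") !== -1) return "positive";
  if (text.indexOf("terrible") !== -1 || text.indexOf("bad") !== -1) return "negative";
  return "neutral";
};

const exactMatchScorer: Scorer = (output, expected) => (output === expected ? 1.0 : 0.0);

const report = runEval(dataset, keywordClassifier, exactMatchScorer);
// report.average is 0.75: the stub misreads "Not bad at all" as negative.
```

Keeping the three parts separate means you can swap the scorer or the task without touching the dataset.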

03 — Why it matters

Without evals, you're guessing

These are the problems every team building with AI hits eventually. Evals are the systematic answer.

01

"Vibes" don't scale

Manual spot-checking works for 10 outputs. Not 10,000. You need automated scoring that runs on every change, every deploy, every day.

02

Silent regressions

A prompt change that improves refund queries breaks billing queries. Aggregate metrics look fine. Individual categories suffer. You won't know until users tell you.
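A per-category breakdown is one way to surface this. In the sketch below, the scores and category names are made-up illustration data, and `averageByCategory` is a hypothetical helper rather than part of any product API.

```typescript
interface ScoredCase {
  category: string;
  score: number;
}

function average(scores: number[]): number {
  if (scores.length === 0) return 0;
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Group scores by category and average each group separately.
function averageByCategory(cases: ScoredCase[]): Record<string, number> {
  const grouped: Record<string, number[]> = {};
  for (const c of cases) {
    const bucket = grouped[c.category] ?? [];
    bucket.push(c.score);
    grouped[c.category] = bucket;
  }
  const averages: Record<string, number> = {};
  for (const category of Object.keys(grouped)) {
    averages[category] = average(grouped[category] ?? []);
  }
  return averages;
}

// Made-up scores before and after a prompt change.
const beforeChange: ScoredCase[] = [
  { category: "refund", score: 0.5 },
  { category: "refund", score: 0.5 },
  { category: "billing", score: 1 },
  { category: "billing", score: 1 },
];
const afterChange: ScoredCase[] = [
  { category: "refund", score: 1 },
  { category: "refund", score: 1 },
  { category: "billing", score: 0.5 },
  { category: "billing", score: 0.5 },
];

// The aggregate is 0.75 both times, so it hides the change.
const overallBefore = average(beforeChange.map((c) => c.score));
const overallAfter = average(afterChange.map((c) => c.score));

// The breakdown shows billing dropping from 1 to 0.5.
const categoriesBefore = averageByCategory(beforeChange);
const categoriesAfter = averageByCategory(afterChange);
```

The overall average is identical before and after, while the billing category has halved.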

03

Model migrations are blind

Switching from GPT-4o to Claude or Gemini without evals is a coin flip. The same prompt on a different model can produce very different results. Evals give you a baseline for comparison.

04

Production drift

User inputs evolve. New edge cases appear. Quality degrades over time if you're not continuously measuring. By the time you notice, the damage is done.

05

No baseline, no progress

Without evals, you can't prove you're improving. Or that you haven't gotten worse. Stakeholders ask "is the AI better now?" and you shrug.

04 — Two modes of evaluation

Test before deploy. Monitor after.

Effective evaluation happens at two stages: offline, to validate before shipping, and online, to catch what you missed in production.

Offline

Pre-deployment evaluation

Run against known datasets before code reaches production. Because the inputs are fixed, results are comparable across runs and over time.

  • Controlled inputs with expected outputs
  • Code-based or LLM-as-a-judge scoring
  • Run in CI/CD on every pull request
Best for: Prompt iteration, model comparison, regression gates
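A regression gate can be as simple as comparing each test case's score against a stored baseline. The sketch below assumes per-case scores keyed by a hypothetical `id`, with an illustrative tolerance of 0.25. A CI job would fail the build when `passed` is false.

```typescript
interface CaseScore {
  id: string;
  score: number;
}

interface GateResult {
  passed: boolean;
  regressions: string[]; // ids whose score dropped by more than the tolerance
  baselineMean: number;
  candidateMean: number;
}

function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Flag every case whose candidate score is more than `tolerance` below baseline.
function regressionGate(
  baseline: CaseScore[],
  candidate: CaseScore[],
  tolerance: number,
): GateResult {
  const baselineById: Record<string, number> = {};
  for (const b of baseline) {
    baselineById[b.id] = b.score;
  }
  const regressions: string[] = [];
  for (const c of candidate) {
    const previous = baselineById[c.id];
    if (previous !== undefined && c.score < previous - tolerance) {
      regressions.push(c.id);
    }
  }
  return {
    passed: regressions.length === 0,
    regressions,
    baselineMean: mean(baseline.map((b) => b.score)),
    candidateMean: mean(candidate.map((c) => c.score)),
  };
}

// Made-up scores: the means match, but one case regressed.
const baselineScores: CaseScore[] = [
  { id: "refund-1", score: 1 },
  { id: "billing-1", score: 1 },
  { id: "shipping-1", score: 0.5 },
];
const candidateScores: CaseScore[] = [
  { id: "refund-1", score: 1 },
  { id: "billing-1", score: 0.5 },
  { id: "shipping-1", score: 1 },
];

const gate = regressionGate(baselineScores, candidateScores, 0.25);
// gate.passed is false and gate.regressions is ["billing-1"].
```

Checking per case rather than only comparing means is a deliberate choice here: in this example the two means are equal, so a mean-only gate would pass.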
Online

Production monitoring

Score live traffic automatically as traces are logged. Scoring runs asynchronously, so it does not add latency to user requests.

  • No ground truth — LLM-as-a-judge scoring
  • Configurable sampling rates
  • Catches drift and edge cases in real time
Best for: Continuous monitoring, surfacing new failure patterns
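A configurable sampling rate can be sketched as a deterministic decision per trace. The example below is an assumption about how one might do it, not a description of any platform's behavior. It hashes a hypothetical `traceId` with FNV-1a (over UTF-16 code units) and scores the trace when the hash falls below the rate.

```typescript
// 32-bit FNV-1a hash over UTF-16 code units. The shift-and-add line is
// equivalent to multiplying by the FNV prime 16777619 modulo 2^32.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash =
      (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return hash;
}

// Map a trace id to a stable fraction in [0, 1).
function sampleFraction(traceId: string): number {
  return fnv1a(traceId) / 4294967296;
}

// Decide whether to score a trace. `rate` is the share of traffic to
// score: 0 scores nothing, 1 scores everything.
function shouldSample(traceId: string, rate: number): boolean {
  return sampleFraction(traceId) < rate;
}

// Hypothetical usage: pick which logged traces get queued for scoring.
const loggedTraceIds: string[] = ["trace-a1", "trace-b2", "trace-c3", "trace-d4"];
const queuedForScoring = loggedTraceIds.filter((id) => shouldSample(id, 0.5));
```

Hashing the trace id instead of calling a random number generator means the same trace always gets the same decision, which makes reruns and debugging easier. The scoring itself would run after the response has been returned, so it stays off the request path.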

1. Prototype
2. Experiment
3. CI / CD
4. Production
5. Feedback

05 — Scoring approaches

Three ways to measure quality

Different outputs need different scoring strategies. Choose based on how structured your expected output is.

Code-based Scorers

Exact match, regex, JSON schema validation, string containment. Fast, cheap, and completely deterministic.

output === expected ? 1.0 : 0.0
Best for structured outputs — classifications, extractions, yes/no answers.
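Each of the checks named above fits in a few lines. The function names below are hypothetical, and `hasRequiredFields` is a deliberately simplified stand-in for full JSON schema validation: it only checks that the output parses as a JSON object containing the given keys.

```typescript
// Exact match: 1 when output and expected are identical.
function exactMatch(output: string, expected: string): number {
  return output === expected ? 1.0 : 0.0;
}

// String containment, case-insensitive.
function containsText(output: string, needle: string): number {
  return output.toLowerCase().indexOf(needle.toLowerCase()) !== -1 ? 1.0 : 0.0;
}

// Regex match. Pass a pattern without the `g` flag: global regexes
// keep state between calls to test().
function matchesPattern(output: string, pattern: RegExp): number {
  return pattern.test(output) ? 1.0 : 0.0;
}

// Simplified structure check: valid JSON object with all required keys.
function hasRequiredFields(output: string, fields: string[]): number {
  let parsed: unknown;
  try {
    parsed = JSON.parse(output);
  } catch {
    return 0.0; // not valid JSON
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return 0.0;
  }
  const record = parsed as Record<string, unknown>;
  return fields.every((field) => field in record) ? 1.0 : 0.0;
}

// Example calls on made-up outputs.
const classificationScore = exactMatch("refund", "refund"); // 1
const containmentScore = containsText("Your refund was issued.", "Refund"); // 1
const orderIdScore = matchesPattern("Order #12345 shipped", /#\d{5}\b/); // 1
const extractionScore = hasRequiredFields('{"intent":"refund","amount":20}', ["intent", "amount"]); // 1
const malformedScore = hasRequiredFields("not json", ["intent"]); // 0
```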

LLM-as-a-Judge

Use a frontier model to grade outputs against custom criteria. Handles nuance, context, and subjective quality that code can't capture.

score: 0.87 · "Accurate but too verbose"
Best for open-ended generation — summaries, conversations, creative writing.
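The two pieces of a judge that do not depend on a particular model are building the grading prompt and parsing the reply. The sketch below covers only those. The prompt wording and the JSON reply format are assumptions made for illustration, and the model call itself is left out because it depends on your provider.

```typescript
interface Judgement {
  score: number; // 0 to 1
  reason: string;
}

// Build a grading prompt from plain-language criteria.
// The wording and reply format here are illustrative assumptions.
function buildJudgePrompt(criteria: string, input: string, output: string): string {
  return [
    "You are grading an AI response.",
    `Criteria: ${criteria}`,
    `Input: ${input}`,
    `Response: ${output}`,
    'Reply with JSON only: {"score": <number from 0 to 1>, "reason": "<one sentence>"}',
  ].join("\n");
}

// Parse the judge's reply. Returns null when the reply is unusable, so
// the caller can retry or skip instead of recording a bogus score.
function parseJudgeReply(reply: string): Judgement | null {
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(reply.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const candidate = parsed as { score?: unknown; reason?: unknown };
  const score = candidate.score;
  if (typeof score !== "number" || !isFinite(score)) return null;
  if (score < 0 || score > 1) return null;

  const reason = typeof candidate.reason === "string" ? candidate.reason : "";
  return { score, reason };
}

// Example with a made-up judge reply.
const judgePrompt = buildJudgePrompt(
  "The summary is accurate and concise.",
  "Summarize this article",
  "A long but correct summary...",
);
const judgement = parseJudgeReply('{"score": 0.87, "reason": "Accurate but too verbose"}');
```

Validating the score range matters because judge models sometimes reply off-format. Treating those replies as missing rather than as zero keeps them from skewing averages.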

Custom Rubrics

Domain-specific evaluation criteria tailored to your use case. Combine code checks with LLM scoring for maximum coverage.

medical_accuracy: 0.95 · compliance: 1.0
Best for regulated domains — healthcare, legal, finance, brand compliance.
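A rubric can be modeled as a list of weighted dimensions, each with its own scorer. In the sketch below everything is hypothetical: the dimension names echo the example above, the compliance check is a simple code-based rule, and the accuracy scorer is a stub returning a fixed value where an LLM judge would normally sit.

```typescript
interface Dimension {
  name: string;
  weight: number;
  score: (output: string) => number; // expected to return 0..1
}

interface RubricResult {
  perDimension: Record<string, number>;
  overall: number; // weighted average of the dimension scores
}

// Score an output on every dimension and combine by weight.
function scoreRubric(output: string, rubric: Dimension[]): RubricResult {
  const perDimension: Record<string, number> = {};
  let weightedSum = 0;
  let totalWeight = 0;
  for (const dimension of rubric) {
    // Clamp to [0, 1] so one misbehaving scorer cannot dominate.
    const score = Math.min(1, Math.max(0, dimension.score(output)));
    perDimension[dimension.name] = score;
    weightedSum += score * dimension.weight;
    totalWeight += dimension.weight;
  }
  return { perDimension, overall: totalWeight > 0 ? weightedSum / totalWeight : 0 };
}

// Hypothetical rubric mixing a code check with a stubbed judge score.
const medicalRubric: Dimension[] = [
  {
    name: "compliance",
    weight: 1,
    // Code-based rule: the required disclaimer must be present.
    score: (output) =>
      output.toLowerCase().indexOf("not medical advice") !== -1 ? 1.0 : 0.0,
  },
  {
    name: "medical_accuracy",
    weight: 3,
    // Stub standing in for an LLM judge; returns a fixed value.
    score: () => 0.75,
  },
];

const rubricResult = scoreRubric(
  "Rest and fluids usually help. This is not medical advice.",
  medicalRubric,
);
// rubricResult.overall is (1 * 1 + 0.75 * 3) / 4 = 0.8125
```

Reporting the per-dimension scores alongside the weighted total keeps a compliance failure visible even when the overall number still looks healthy.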
06 — The solution

Evals, built into the stack

AI Stack Evals isn't a standalone tool you bolt on. It's integrated into every layer — Gateway, Memory, Playground — so evaluation happens everywhere, automatically.

Coming Soon
01

LLM-as-a-Judge

Built-in frontier model scoring with structured rubrics. Define criteria in plain language. No separate eval framework needed — scoring runs inside the platform.

02

Regression Testing

Automated test suites that run on every prompt or model change. Compare against baselines. Catch regressions before they reach production — not after.

03

Custom Rubrics

Visual builder for domain-specific evaluation criteria. Define scoring dimensions, weight them, and run them across your entire dataset. Code or no-code.

04

Continuous Monitoring

Score production traces 24/7 with configurable sampling. Not just in CI — in production, on real traffic, catching drift the moment it starts.

AI Stack Evals works with Gateway, Memory, and Playground — one platform, not five tools stitched together.

Start evaluating with confidence

We're onboarding early teams now. Drop your email to get early access — no credit card, no commitment.

Early access for the first 500 developers