AI Evals
Quickstart
Get your first AI eval running in under 5 minutes with AI Stack.
Prerequisites
- Node.js 18+ installed
- An AI Stack account (sign up)
- An API key for at least one LLM provider (OpenAI, Anthropic, etc.)
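To confirm the Node.js prerequisite from a script, a minimal sketch along these lines may help. The helper name `meetsMinimumNode` is hypothetical and is not part of `@aistack/evals`; it only parses the major version from a version string.

```typescript
// Hypothetical helper (not part of @aistack/evals): returns true when a
// Node.js version string such as "18.17.0" or "v20.1.0" has a major
// version of at least `minMajor`.
function meetsMinimumNode(version: string, minMajor: number = 18): boolean {
  // parseInt reads the leading digits, so "20.1.0" parses as 20.
  const major = Number.parseInt(version.replace(/^v/, ""), 10);
  return Number.isFinite(major) && major >= minMajor;
}
```

In a Node.js script you could pass `process.versions.node` to this function and exit early with a clear message when it returns `false`.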
1. Install the SDK
npm install @aistack/evals
Or with other package managers:
# yarn
yarn add @aistack/evals
# pnpm
pnpm add @aistack/evals
# bun
bun add @aistack/evals
2. Set Your API Key
export AISTACK_API_KEY="your-api-key-here"
You can get your API key from the AI Stack dashboard.
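If the variable is missing, failing early with a readable message is usually easier to debug than an authentication error mid-run. The following is a minimal sketch of such a check. The `requireApiKey` helper and the `Env` type are hypothetical and are not part of `@aistack/evals`.

```typescript
// Hypothetical helper (not part of @aistack/evals): looks up an API key in an
// environment-like object and throws a readable error when it is missing or
// still set to the placeholder value from this guide.
type Env = Record<string, string | undefined>;

function requireApiKey(env: Env, name: string = "AISTACK_API_KEY"): string {
  const value = env[name]?.trim();
  if (!value || value === "your-api-key-here") {
    throw new Error(`${name} is not set. Export it before running the eval.`);
  }
  return value;
}
```

In a Node.js script, calling `requireApiKey(process.env)` at the top of your eval file would surface a missing key before any cases run.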
3. Write Your First Eval
Create a file called my-first-eval.ts:
import { Eval } from "@aistack/evals";

Eval("Summarization Quality", {
  // Test cases: each has an input prompt and a description of the expected output.
  data: () => [
    {
      input: "Explain quantum computing in one sentence.",
      expected: "A clear, accurate one-sentence explanation of quantum computing.",
    },
    {
      input: "What is the capital of France?",
      expected: "Paris",
    },
    {
      input: "Summarize the benefits of exercise in 2 sentences.",
      expected: "A concise, accurate summary of exercise benefits.",
    },
  ],
  task: async (input) => {
    // callYourModel is a placeholder: replace it with your own function that
    // sends `input` to your LLM provider and returns the response text.
    const response = await callYourModel(input);
    return response;
  },
  scores: {
    relevance: (output, expected) => {
      // Placeholder check: any output longer than 10 characters counts as relevant.
      // `expected` is available here for stricter comparisons.
      return output.length > 10 ? 1 : 0;
    },
    length: (output) => {
      // Penalize very short or very long responses (split on any whitespace).
      const words = output.trim().split(/\s+/).length;
      if (words < 5) return 0;
      if (words > 100) return 0.5;
      return 1;
    },
  },
});
4. Run the Eval
npx aistack evals run my-first-eval.ts
You'll see output like:
AI Stack Evals v0.1.0
Running: Summarization Quality
✓ Case 1/3 — relevance: 1.0, length: 1.0
✓ Case 2/3 — relevance: 1.0, length: 1.0
✓ Case 3/3 — relevance: 1.0, length: 1.0
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Results: 3/3 passed
Average scores: relevance=1.00, length=1.00
Experiment URL: https://aistack.run/evals/exp_abc123
5. View Results in the Dashboard
Click the experiment URL to see detailed results in the AI Stack dashboard, including:
- Score distributions per metric
- Individual case results with inputs, outputs, and scores
- Comparison with previous runs
- Trends over time
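The per-metric averages reported for a run can be approximated locally before you open the dashboard. The sketch below copies the two quickstart scorers into standalone functions and averages them over a few outputs. The names `relevanceScore`, `lengthScore`, and `averageScores`, the `Scorer` type, and the sample outputs are all illustrative assumptions. This is not how AI Stack computes its results internally.

```typescript
// Standalone copies of the quickstart scorers (assumed signature).
type Scorer = (output: string, expected?: string) => number;

// Placeholder relevance check: outputs longer than 10 characters score 1.
const relevanceScore: Scorer = (output) => (output.length > 10 ? 1 : 0);

// Word-count check: under 5 words scores 0, over 100 words scores 0.5.
const lengthScore: Scorer = (output) => {
  const words = output.trim().split(/\s+/).filter(Boolean).length;
  if (words < 5) return 0;
  if (words > 100) return 0.5;
  return 1;
};

// Average each named scorer over a list of model outputs.
function averageScores(
  outputs: string[],
  scorers: Record<string, Scorer>,
): Record<string, number> {
  const averages: Record<string, number> = {};
  for (const [metric, scorer] of Object.entries(scorers)) {
    const total = outputs.reduce((sum, out) => sum + scorer(out), 0);
    averages[metric] = outputs.length > 0 ? total / outputs.length : 0;
  }
  return averages;
}

// Hypothetical model outputs for the three quickstart cases.
const sampleOutputs = [
  "Quantum computers use qubits in superposition to explore many states at once.",
  "Paris",
  "Regular exercise strengthens the heart and muscles. It also improves mood and sleep.",
];

console.log(averageScores(sampleOutputs, { relevance: relevanceScore, length: lengthScore }));
```

With these sample outputs, the bare answer "Paris" scores 0 on both metrics (5 characters, 1 word), so a correct but terse answer is penalized. Checking scorers this way can reveal such gaps before they skew an experiment.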
What's Next?
- Running Evaluations — Learn about CLI flags, watch mode, and experiment configuration
- Scorers — Use LLM-as-Judge and custom rubrics for more nuanced evaluation
- Datasets — Build robust test sets from production data