Quickstart

Get your first eval running in under 5 minutes.

Prerequisites

Node.js 18+ installed
An AI Stack account (sign up)
An API key for at least one LLM provider (OpenAI, Anthropic, etc.)

1. Install the SDK

npm install @aistack/evals

Or with other package managers:

# yarn
yarn add @aistack/evals

# pnpm
pnpm add @aistack/evals

# bun
bun add @aistack/evals

2. Set Your API Key

export AISTACK_API_KEY="your-api-key-here"

You can get your API key from the AI Stack dashboard.
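
If an eval fails with an authentication error, the usual cause is that the variable was exported in a different shell session. The sketch below is one way to fail early with a readable message. `requireApiKey` is a hypothetical helper, not part of the SDK, and it takes the environment as a plain object so it can be tried without any setup.

```typescript
// Hypothetical helper, not part of @aistack/evals.
type Env = Record<string, string | undefined>;

function requireApiKey(env: Env, name: string = "AISTACK_API_KEY"): string {
  const value = env[name]?.trim();
  // Treat the placeholder from the export command as "not set".
  if (!value || value === "your-api-key-here") {
    throw new Error(`${name} is not set. Export it in your shell before running evals.`);
  }
  return value;
}
```

In a Node.js script you would call it as requireApiKey(process.env) before defining any evals.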

3. Write Your First Eval

Create a file called my-first-eval.ts:

import { Eval } from "@aistack/evals";

Eval("Summarization Quality", {
  data: () => [
    {
      input: "Explain quantum computing in one sentence.",
      expected: "A clear, accurate one-sentence explanation of quantum computing.",
    },
    {
      input: "What is the capital of France?",
      expected: "Paris",
    },
    {
      input: "Summarize the benefits of exercise in 2 sentences.",
      expected: "A concise, accurate summary of exercise benefits.",
    },
  ],

  task: async (input) => {
    // Replace with your actual AI call
    const response = await callYourModel(input);
    return response;
  },

  scores: {
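    relevance: (output, expected) => {
      // Placeholder check: passes any output longer than 10 characters.
      // It does not compare against `expected`; swap in a real relevance check.
      return output.length > 10 ? 1 : 0;
    },
    length: (output) => {
      // Penalize very short or very long responses
      // Split on any run of whitespace so newlines and double spaces don't skew the count
      const words = output.trim().split(/\s+/).length;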
      if (words < 5) return 0;
      if (words > 100) return 0.5;
      return 1;
    },
  },
});
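
The eval above calls callYourModel, which the snippet leaves undefined, so the file will not run as written. The sketch below is a stand-in for trying the flow end to end before wiring up a real provider, plus one scorer that actually uses `expected`. The names `cannedResponse` and `containsExpected` are hypothetical, and the scorer assumes the `(output, expected)` signature shown above.

```typescript
// Hypothetical stand-ins for illustration; none of these come from @aistack/evals.

// Returns a fixed answer for known inputs and a labeled placeholder otherwise.
function cannedResponse(input: string): string {
  const canned: Record<string, string> = {
    "What is the capital of France?": "The capital of France is Paris.",
  };
  return canned[input] ?? `Placeholder response for: ${input}`;
}

// Replace the body with a call to your OpenAI, Anthropic, or other client.
async function callYourModel(input: string): Promise<string> {
  return cannedResponse(input);
}

// Scores 1 when the output contains the expected text (case-insensitive), else 0.
function containsExpected(output: string, expected: string): number {
  const needle = expected.trim().toLowerCase();
  if (needle.length === 0) return 0;
  return output.toLowerCase().includes(needle) ? 1 : 0;
}
```

A substring scorer like this only makes sense for cases whose `expected` is a literal answer such as "Paris". The other two cases describe the desired output rather than quoting it, so they need a rubric-style or LLM-as-Judge scorer instead.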

4. Run the Eval

npx aistack evals run my-first-eval.ts

You'll see output like:

AI Stack Evals v0.1.0
Running: Summarization Quality

 ✓ Case 1/3 — relevance: 1.0, length: 1.0
 ✓ Case 2/3 — relevance: 1.0, length: 1.0
 ✓ Case 3/3 — relevance: 1.0, length: 0.5

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Results: 3/3 passed
Average scores: relevance=1.00, length=0.83
Experiment URL: https://aistack.run/evals/exp_abc123
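
The "Average scores" line appears to be the arithmetic mean of each metric across cases. That is an assumption based on the sample output, not documented behavior. The sketch below shows that arithmetic with illustrative per-case scores; `averageScores` is a hypothetical helper, not an SDK function.

```typescript
// Hypothetical helper: per-metric arithmetic mean across cases.
type CaseScores = Record<string, number>;

function averageScores(cases: CaseScores[]): Record<string, number> {
  const totals: Record<string, number> = {};
  const counts: Record<string, number> = {};
  for (const scores of cases) {
    for (const [metric, value] of Object.entries(scores)) {
      totals[metric] = (totals[metric] ?? 0) + value;
      counts[metric] = (counts[metric] ?? 0) + 1;
    }
  }
  const averages: Record<string, number> = {};
  for (const metric of Object.keys(totals)) {
    averages[metric] = (totals[metric] ?? 0) / (counts[metric] ?? 1);
  }
  return averages;
}

// Illustrative scores: one case exceeds 100 words and gets length 0.5.
const averages = averageScores([
  { relevance: 1, length: 1 },
  { relevance: 1, length: 1 },
  { relevance: 1, length: 0.5 },
]);
console.log(averages.relevance.toFixed(2), averages.length.toFixed(2)); // prints "1.00 0.83"
```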

5. View Results in the Dashboard

Click the experiment URL to see detailed results in the AI Stack dashboard, including:

Score distributions per metric
Individual case results with inputs, outputs, and scores
Comparison with previous runs
Trends over time

What's Next?

Running Evaluations — Learn about CLI flags, watch mode, and experiment configuration
Scorers — Use LLM-as-Judge and custom rubrics for more nuanced evaluation
Datasets — Build robust test sets from production data
