Build, manage, and version test datasets for your AI evals. Source data from production logs, manual curation, and synthetic generation.

Datasets

Good evals start with good data. A dataset is a collection of test cases — inputs paired with expected outputs or reference answers — that represent the scenarios your AI needs to handle well.

Anatomy of a Test Case

interface TestCase {
  input: string;          // The prompt or query
  expected?: string;      // Reference answer (optional for some scorers)
  metadata?: Record<string, unknown>; // Tags, categories, difficulty, etc.
}

Not every scorer needs an expected value. Code-based checks like JSON validation or length constraints only need the output. LLM-as-Judge scorers can work with or without reference answers.
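To make the distinction concrete, here is a minimal sketch of two output-only checks and one reference-based check. The `Scorer` type and the function names are hypothetical and chosen for illustration; the actual scorer API in AI Stack may look different.

```typescript
// Same shape as the TestCase interface above.
interface TestCase {
  input: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}

// Hypothetical scorer signature (an assumption, not the AI Stack API):
// returns 1 for pass and 0 for fail.
type Scorer = (output: string, testCase: TestCase) => number;

// Code-based check: needs only the model output.
const isValidJson: Scorer = (output) => {
  try {
    JSON.parse(output);
    return 1;
  } catch {
    return 0;
  }
};

// Code-based check: a length constraint, also independent of `expected`.
const withinLength = (maxChars: number): Scorer => (output) =>
  output.length <= maxChars ? 1 : 0;

// Reference-based check: cannot be scored without `expected`.
const exactMatch: Scorer = (output, testCase) => {
  if (testCase.expected === undefined) {
    throw new Error("exactMatch requires an expected value");
  }
  return output.trim() === testCase.expected.trim() ? 1 : 0;
};
```

Failing loudly when `expected` is missing is a design choice in this sketch: it surfaces incomplete test cases instead of silently scoring them as 0.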

Sourcing Data

Manual Curation

Start here. Hand-pick 20–50 cases that represent your most important scenarios:

Eval("Customer Support Bot", {
  data: () => [
    {
      input: "How do I reset my password?",
      expected: "Guide the user to Settings > Security > Reset Password.",
      metadata: { category: "account", difficulty: "easy" },
    },
    {
      input: "I was charged twice for my subscription",
      expected: "Apologize, confirm the duplicate charge, initiate a refund.",
      metadata: { category: "billing", difficulty: "medium" },
    },
    // ... more cases
  ],
  task: async (input) => callYourModel(input),
  scores: { /* ... */ },
});

When to use: Starting out, covering critical paths, testing edge cases you've identified.

Production Logs

Sample real inputs from production to build datasets that reflect actual usage:

import { loadDataset } from "@aistack/evals";

Eval("Production Coverage", {
  // Load a dataset stored in AI Stack
  data: () => loadDataset("production-sample-2025-q1"),

  task: async (input) => callYourModel(input),
  scores: { /* ... */ },
});

You can create datasets from production logs in the AI Stack dashboard:

1. Navigate to Logs in your project
2. Filter by date range, model, or custom tags
3. Click Create Dataset to sample and label cases
4. Optionally add expected outputs manually

When to use: Ensuring your evals reflect real-world usage patterns, catching distribution shifts.
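If you export logs and sample them yourself rather than through the dashboard, the sampling step might look like the sketch below. The `LogRecord` shape, the `sampleLogs` helper, and the de-duplication step are assumptions made for illustration and are not part of AI Stack. A seeded generator keeps the sample reproducible.

```typescript
interface TestCase {
  input: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}

// Hypothetical log record shape; real production logs will have other fields.
interface LogRecord {
  input: string;
  model: string;
  timestamp: string; // ISO 8601
}

// Small deterministic PRNG (mulberry32): the same seed yields the same sequence.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Draws up to `n` distinct inputs from the logs. `expected` is left unset
// so that it can be added during manual labeling.
function sampleLogs(logs: LogRecord[], n: number, seed: number): TestCase[] {
  // Keep the first record per input so frequent queries are not over-represented.
  const byInput = new Map<string, LogRecord>();
  for (const log of logs) {
    if (!byInput.has(log.input)) byInput.set(log.input, log);
  }
  const unique = Array.from(byInput.values());

  // Fisher-Yates shuffle driven by the seeded generator.
  const rand = mulberry32(seed);
  for (let i = unique.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = unique[i]!;
    unique[i] = unique[j]!;
    unique[j] = tmp;
  }

  return unique.slice(0, n).map((log) => ({
    input: log.input,
    metadata: { source: "production", model: log.model, timestamp: log.timestamp },
  }));
}
```

Calling `sampleLogs(logs, 50, 42)` twice on the same logs returns the same 50 cases, which makes a sampled dataset easier to review and regenerate.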

From Failed Cases

When you spot bad outputs in production or during manual review, add them to your dataset:

# Add a case from the CLI
npx aistack evals dataset add "my-dataset" \
  --input "What's your refund policy for annual plans?" \
  --expected "Explain the 30-day money-back guarantee for annual plans." \
  --metadata '{"source": "support-ticket-4521"}'

When to use: Building regression tests, ensuring fixed issues stay fixed.
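For datasets you maintain in your own code, a small helper can keep the same failure from being added twice. This is a sketch under the assumption that two cases count as duplicates when their inputs match after whitespace and case normalization; `addRegressionCase` and `normalizeInput` are hypothetical names, not AI Stack functions.

```typescript
interface TestCase {
  input: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}

// Normalize inputs so trivially different copies of the same query match.
const normalizeInput = (s: string): string =>
  s.trim().toLowerCase().replace(/\s+/g, " ");

// Returns a new dataset with the failed case appended, or the original
// dataset unchanged when an equivalent input is already present.
function addRegressionCase(dataset: TestCase[], failed: TestCase): TestCase[] {
  const key = normalizeInput(failed.input);
  const exists = dataset.some((c) => normalizeInput(c.input) === key);
  return exists ? dataset : [...dataset, failed];
}

// Usage: the same case as the CLI example above.
const regressionSet = addRegressionCase([], {
  input: "What's your refund policy for annual plans?",
  expected: "Explain the 30-day money-back guarantee for annual plans.",
  metadata: { source: "support-ticket-4521" },
});
```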

Synthetic Generation

Use an LLM to generate test cases for scenarios you haven't observed yet:

import { generateTestCases } from "@aistack/evals";

const syntheticCases = await generateTestCases({
  description: "Customer support queries about billing issues",
  count: 50,
  categories: ["refunds", "upgrades", "cancellations", "invoices"],
  difficulty: ["easy", "medium", "hard"],
});

When to use: Expanding coverage, stress-testing edge cases, bootstrapping a new eval.
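Generated cases are worth checking before you rely on them. The sketch below reports which category and difficulty combinations have no cases at all. It assumes each case records `category` and `difficulty` as strings in its metadata; whether `generateTestCases` populates those fields is not confirmed here, and `coverageGaps` is a hypothetical helper.

```typescript
interface TestCase {
  input: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}

// Lists every "category/difficulty" combination with zero cases.
// Assumption: cases carry string `category` and `difficulty` metadata fields.
function coverageGaps(
  cases: TestCase[],
  categories: string[],
  difficulties: string[],
): string[] {
  const seen = new Set<string>();
  for (const c of cases) {
    const category = c.metadata?.category;
    const difficulty = c.metadata?.difficulty;
    if (typeof category === "string" && typeof difficulty === "string") {
      seen.add(`${category}/${difficulty}`);
    }
  }

  const gaps: string[] = [];
  for (const category of categories) {
    for (const difficulty of difficulties) {
      const cell = `${category}/${difficulty}`;
      if (!seen.has(cell)) gaps.push(cell);
    }
  }
  return gaps;
}
```

An empty result means every combination has at least one case. It says nothing about the quality of those cases, so a manual read-through is still worthwhile.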

Dataset Management

Versioning

Datasets in AI Stack are versioned automatically. Each modification creates a new version, which you can list and pin from the CLI:

# List dataset versions
npx aistack evals dataset versions "my-dataset"

# Pin an eval to a specific version
npx aistack evals run my-eval.ts --dataset-version 3

File-Based Datasets

You can also keep datasets as local files:

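import { readFileSync } from "node:fs";

Eval("File-Based Dataset", {
  data: () => {
    // The path is resolved relative to the process working directory.
    const raw = readFileSync("./test-cases.json", "utf-8");
    // Expects a JSON array of { input, expected, metadata } objects.
    return JSON.parse(raw);
  },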
  task: async (input) => callYourModel(input),
  scores: { /* ... */ },
});

Supported formats:

JSON — Array of { input, expected, metadata } objects
CSV — Columns for input, expected, and any metadata fields
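Because `JSON.parse` returns untyped data, a malformed file only fails later, inside the task or a scorer. A minimal validation sketch for the JSON format follows, assuming the `TestCase` shape defined earlier; `isTestCase` and `parseDataset` are hypothetical helpers, not part of `@aistack/evals`.

```typescript
interface TestCase {
  input: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}

// Type guard: checks that an unknown value has the TestCase shape.
function isTestCase(value: unknown): value is TestCase {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  const v = value as Record<string, unknown>;
  const metadataOk =
    v.metadata === undefined ||
    (typeof v.metadata === "object" &&
      v.metadata !== null &&
      !Array.isArray(v.metadata));
  return (
    typeof v.input === "string" &&
    (v.expected === undefined || typeof v.expected === "string") &&
    metadataOk
  );
}

// Parses a JSON string and fails fast, naming the first invalid entry.
function parseDataset(raw: string): TestCase[] {
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed)) {
    throw new Error("Dataset must be a JSON array");
  }
  parsed.forEach((item, i) => {
    if (!isTestCase(item)) {
      throw new Error(`Invalid test case at index ${i}`);
    }
  });
  return parsed as TestCase[];
}
```

In the file-based example, `parseDataset(raw)` could stand in for the bare `JSON.parse(raw)` call.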

Golden Datasets

Mark a specific dataset version as "golden" to use it as your canonical test suite:

npx aistack evals dataset set-golden "my-dataset" --version 5

Golden datasets are used by default in CI/CD pipelines and serve as the baseline for experiment comparisons.

Best Practices

Start small, grow organically — 20 well-chosen cases beat 1,000 random ones
Cover your failure modes — Every bug you fix should become a test case
Balance categories — Don't let one type of query dominate your dataset
Review regularly — Remove outdated cases, add new scenarios quarterly
Version everything — Always know which dataset version produced which results
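The category balance practice can be checked mechanically. The sketch below computes each category's share of the dataset and flags any category above a threshold. It assumes categories live in `metadata.category` as in the manual curation example; the helper names and the threshold are illustrative and not an AI Stack recommendation.

```typescript
interface TestCase {
  input: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}

// Fraction of the dataset in each category (0 to 1).
// Cases without a string `metadata.category` are grouped as "uncategorized".
function categoryShares(cases: TestCase[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const c of cases) {
    const raw = c.metadata?.category;
    const category = typeof raw === "string" ? raw : "uncategorized";
    counts.set(category, (counts.get(category) ?? 0) + 1);
  }
  const shares = new Map<string, number>();
  counts.forEach((count, category) => {
    shares.set(category, count / cases.length);
  });
  return shares;
}

// Categories whose share exceeds `maxShare`, for example 0.5 for "more than half".
function dominantCategories(cases: TestCase[], maxShare: number): string[] {
  const dominant: string[] = [];
  categoryShares(cases).forEach((share, category) => {
    if (share > maxShare) dominant.push(category);
  });
  return dominant;
}
```

A flagged category is not automatically a problem. If it mirrors real traffic, the skew may be intentional, so treat the output as a prompt for review.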

What's Next?

CI/CD Integration — Automate evals with your golden dataset in your pipeline
