Datasets

Build, manage, and version test datasets for your AI evals. Source data from manual curation, production logs, failed cases, and synthetic generation.

Good evals start with good data. A dataset is a collection of test cases — inputs paired with expected outputs or reference answers — that represent the scenarios your AI needs to handle well.

Anatomy of a Test Case

interface TestCase {
  input: string;          // The prompt or query
  expected?: string;      // Reference answer (optional for some scorers)
  metadata?: Record<string, unknown>; // Tags, categories, difficulty, etc.
}

Not every scorer needs an expected value. Code-based checks like JSON validation or length constraints only need the output. LLM-as-Judge scorers can work with or without reference answers.
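To make the distinction concrete, here is a minimal sketch of two code-based checks that look only at the model output and never consult an expected value. The function names, the 0-or-1 score convention, and the 280-character limit are assumptions for illustration, not part of the AI Stack API.

```typescript
// Hypothetical scorers: the names and the 0-or-1 score convention are assumptions.

// Returns 1 if the output parses as JSON, 0 otherwise. No expected value is needed.
function isValidJson(output: string): number {
  try {
    JSON.parse(output);
    return 1;
  } catch {
    return 0;
  }
}

// Returns 1 if the output has at most maxChars characters, 0 otherwise.
function withinLength(output: string, maxChars: number): number {
  return output.length <= maxChars ? 1 : 0;
}

console.log(isValidJson('{"status": "ok"}'));
console.log(withinLength("Go to Settings > Security > Reset Password.", 280));
```

A reference-based scorer, by contrast, would take both the output and the expected value, which is why expected stays optional in the TestCase interface.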

Sourcing Data

Manual Curation

Start here. Hand-pick 20–50 cases that represent your most important scenarios:

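// Assumption: Eval is exported by the @aistack/evals package.
import { Eval } from "@aistack/evals";

Eval("Customer Support Bot", {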
  data: () => [
    {
      input: "How do I reset my password?",
      expected: "Guide the user to Settings > Security > Reset Password.",
      metadata: { category: "account", difficulty: "easy" },
    },
    {
      input: "I was charged twice for my subscription",
      expected: "Apologize, confirm the duplicate charge, initiate a refund.",
      metadata: { category: "billing", difficulty: "medium" },
    },
    // ... more cases
  ],
  task: async (input) => callYourModel(input),
  scores: { /* ... */ },
});

When to use: Starting out, covering critical paths, testing edge cases you've identified.

Production Logs

Sample real inputs from production to build datasets that reflect actual usage:

// Assumption: Eval is exported by the same package as loadDataset.
import { Eval, loadDataset } from "@aistack/evals";

Eval("Production Coverage", {
  // Load a dataset stored in AI Stack
  data: () => loadDataset("production-sample-2025-q1"),

  task: async (input) => callYourModel(input),
  scores: { /* ... */ },
});

You can create datasets from production logs in the AI Stack dashboard:

  1. Navigate to Logs in your project
  2. Filter by date range, model, or custom tags
  3. Click Create Dataset to sample and label cases
  4. Optionally add expected outputs manually

When to use: Ensuring your evals reflect real-world usage patterns, catching distribution shifts.
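If you export logs and prefer to sample them in code, the dashboard steps above can be approximated with a short script. This is only a sketch: the LogEntry shape, the sampleLogs helper, and its options are hypothetical and do not reflect the actual AI Stack log schema. It filters by tag and date, drops near-duplicate inputs, and draws a seeded sample so the same seed reproduces the same dataset.

```typescript
// Hypothetical log shape; the real AI Stack log schema may differ.
interface LogEntry {
  input: string;
  timestamp: string; // ISO 8601
  tags: string[];
}

interface TestCase {
  input: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}

// Small seeded PRNG (mulberry32-style) so a given seed always yields the same sample.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function sampleLogs(
  logs: LogEntry[],
  opts: { tag?: string; since?: string; count: number; seed: number },
): TestCase[] {
  const seen = new Set<string>();
  const candidates = logs.filter((log) => {
    if (opts.tag !== undefined && !log.tags.includes(opts.tag)) return false;
    if (opts.since !== undefined && Date.parse(log.timestamp) < Date.parse(opts.since)) {
      return false;
    }
    // Treat inputs that differ only in case or surrounding whitespace as duplicates.
    const key = log.input.trim().toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });

  // Fisher-Yates shuffle driven by the seeded PRNG, then keep the first `count`.
  const rand = seededRandom(opts.seed);
  for (let i = candidates.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [candidates[i], candidates[j]] = [candidates[j], candidates[i]];
  }

  // Expected outputs are left unset; they can be added during labeling.
  return candidates.slice(0, opts.count).map((log) => ({
    input: log.input,
    metadata: { source: "production-log", timestamp: log.timestamp, tags: log.tags },
  }));
}

const exampleLogs: LogEntry[] = [
  { input: "I was charged twice", timestamp: "2025-02-01T00:00:00Z", tags: ["billing"] },
  { input: "How do I reset my password?", timestamp: "2025-01-10T00:00:00Z", tags: ["account"] },
];
console.log(sampleLogs(exampleLogs, { tag: "billing", count: 1, seed: 42 }));
```

Deduplicating before sampling keeps a frequently repeated query from crowding out rarer inputs, at the cost of no longer mirroring raw traffic frequencies.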

From Failed Cases

When you spot bad outputs in production or during manual review, add them to your dataset:

# Add a case from the CLI
npx aistack evals dataset add "my-dataset" \
  --input "What's your refund policy for annual plans?" \
  --expected "Explain the 30-day money-back guarantee for annual plans." \
  --metadata '{"source": "support-ticket-4521"}'

When to use: Building regression tests, ensuring fixed issues stay fixed.

Synthetic Generation

Use an LLM to generate test cases for scenarios you haven't observed yet:

import { generateTestCases } from "@aistack/evals";

const syntheticCases = await generateTestCases({
  description: "Customer support queries about billing issues",
  count: 50,
  categories: ["refunds", "upgrades", "cancellations", "invoices"],
  difficulty: ["easy", "medium", "hard"],
});

When to use: Expanding coverage, stress-testing edge cases, bootstrapping a new eval.
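Generated cases are not guaranteed to spread evenly across the categories and difficulty levels you asked for, so it can help to check the result before relying on it. The sketch below assumes each generated case carries metadata.category and metadata.difficulty, as in the hand-written examples earlier; that shape and the findCoverageGaps helper are assumptions, not documented behavior of generateTestCases.

```typescript
interface TestCase {
  input: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}

interface CoverageGap {
  category: string;
  difficulty: string;
  count: number;
}

// Lists every category/difficulty pair that has fewer than minPerCell cases.
function findCoverageGaps(
  cases: TestCase[],
  categories: string[],
  difficulties: string[],
  minPerCell = 1,
): CoverageGap[] {
  // Count cases per (category, difficulty) pair; JSON keys avoid separator clashes.
  const counts = new Map<string, number>();
  for (const c of cases) {
    const key = JSON.stringify([c.metadata?.category, c.metadata?.difficulty]);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }

  const gaps: CoverageGap[] = [];
  for (const category of categories) {
    for (const difficulty of difficulties) {
      const count = counts.get(JSON.stringify([category, difficulty])) ?? 0;
      if (count < minPerCell) gaps.push({ category, difficulty, count });
    }
  }
  return gaps;
}

const generated: TestCase[] = [
  { input: "Can I get a refund?", metadata: { category: "refunds", difficulty: "easy" } },
  { input: "Where is my invoice?", metadata: { category: "invoices", difficulty: "easy" } },
];
console.log(findCoverageGaps(generated, ["refunds", "invoices"], ["easy", "hard"]));
```

Any pairs reported here are candidates for a second, more targeted generation pass or for manual curation.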

Dataset Management

Versioning

Datasets in AI Stack are versioned automatically. Each modification creates a new version:

# List dataset versions
npx aistack evals dataset versions "my-dataset"

# Pin an eval to a specific version
npx aistack evals run my-eval.ts --dataset-version 3

File-Based Datasets

You can also keep datasets as local files:

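import { readFileSync } from "node:fs";
// Assumption: Eval is exported by the @aistack/evals package.
import { Eval } from "@aistack/evals";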

Eval("File-Based Dataset", {
  data: () => {
    const raw = readFileSync("./test-cases.json", "utf-8");
    return JSON.parse(raw);
  },
  task: async (input) => callYourModel(input),
  scores: { /* ... */ },
});
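JSON.parse returns untyped data, so a malformed file may only surface later as a confusing scorer failure. A minimal sketch of validating the parsed content up front is shown here; the parseTestCases helper and its error messages are hypothetical, and the rules simply mirror the TestCase interface from the top of this page.

```typescript
interface TestCase {
  input: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Parses a JSON string and throws a descriptive error if any entry is malformed.
function parseTestCases(raw: string): TestCase[] {
  const data: unknown = JSON.parse(raw);
  if (!Array.isArray(data)) throw new Error("Dataset must be a JSON array");

  return data.map((item: unknown, i: number) => {
    if (!isRecord(item)) throw new Error(`Case ${i}: expected an object`);
    const { input, expected, metadata } = item;

    if (typeof input !== "string" || input.trim() === "") {
      throw new Error(`Case ${i}: "input" must be a non-empty string`);
    }
    if (expected !== undefined && typeof expected !== "string") {
      throw new Error(`Case ${i}: "expected" must be a string when present`);
    }
    if (metadata !== undefined && !isRecord(metadata)) {
      throw new Error(`Case ${i}: "metadata" must be an object when present`);
    }

    // Copy only the known fields so stray keys do not leak into the eval.
    const testCase: TestCase = { input };
    if (expected !== undefined) testCase.expected = expected;
    if (metadata !== undefined) testCase.metadata = metadata;
    return testCase;
  });
}

console.log(parseTestCases('[{"input": "How do I reset my password?"}]'));
```

In the eval above, the data function could return parseTestCases(raw) in place of the bare JSON.parse(raw) call.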

Supported formats:

  • JSON — Array of { input, expected, metadata } objects
  • CSV — Columns for input, expected, and any metadata fields
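The example above reads JSON only. For CSV, one possible approach is sketched below: it parses the text by hand, handling quoted fields, and maps every column other than input and expected into metadata. Both helpers (parseCsv, csvToTestCases) are hypothetical and not part of the AI Stack package, and treating empty cells as absent values is an assumption about how you want blanks handled. In a real project a maintained CSV library is likely the safer choice.

```typescript
interface TestCase {
  input: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}

// Splits CSV text into rows of fields. Handles quoted fields, escaped quotes (""),
// and commas or line breaks inside quotes.
function parseCsv(text: string): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = "";
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        field += '"';
        i++; // skip the second quote of the escaped pair
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n" || ch === "\r") {
      if (ch === "\r" && text[i + 1] === "\n") i++; // treat CRLF as one line break
      row.push(field);
      rows.push(row);
      field = "";
      row = [];
    } else {
      field += ch;
    }
  }
  // Flush the final row when the text does not end with a line break.
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Maps CSV rows to test cases: "input" and "expected" are reserved column names,
// every other non-empty column becomes a metadata field.
function csvToTestCases(text: string): TestCase[] {
  const rows = parseCsv(text).filter((r) => r.some((f) => f !== ""));
  const header = (rows[0] ?? []).map((name) => name.trim());
  if (!header.includes("input")) throw new Error('CSV needs an "input" column');

  return rows.slice(1).map((record) => {
    const testCase: TestCase = { input: "" };
    const metadata: Record<string, unknown> = {};
    header.forEach((name, i) => {
      const value = record[i] ?? "";
      if (name === "input") testCase.input = value;
      else if (name === "expected") {
        if (value !== "") testCase.expected = value;
      } else if (value !== "") metadata[name] = value;
    });
    if (Object.keys(metadata).length > 0) testCase.metadata = metadata;
    return testCase;
  });
}

const csv = 'input,expected,category\n"I was charged twice, help",Apologize and refund,billing\n';
console.log(csvToTestCases(csv));
```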

Golden Datasets

Mark a dataset as "golden" to use it as your canonical test suite:

npx aistack evals dataset set-golden "my-dataset" --version 5

Golden datasets are used by default in CI/CD pipelines and serve as the baseline for experiment comparisons.

Best Practices

  1. Start small, grow organically — 20 well-chosen cases beat 1,000 random ones
  2. Cover your failure modes — Every bug you fix should become a test case
  3. Balance categories — Don't let one type of query dominate your dataset
  4. Review regularly — Remove outdated cases, add new scenarios quarterly
  5. Version everything — Always know which dataset version produced which results
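The "balance categories" practice is easy to check mechanically. As a rough sketch, the helpers below compute each category's share of a dataset and flag any category above a threshold. The helper names, the use of metadata.category, the "uncategorized" fallback, and the 50% default threshold are all assumptions chosen for illustration; pick a threshold that fits your own traffic mix.

```typescript
interface TestCase {
  input: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}

// Returns each category's share of the dataset as a fraction between 0 and 1.
// Cases without a string metadata.category are grouped as "uncategorized".
function categoryShares(cases: TestCase[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const c of cases) {
    const raw = c.metadata?.category;
    const category = typeof raw === "string" ? raw : "uncategorized";
    counts.set(category, (counts.get(category) ?? 0) + 1);
  }
  const shares = new Map<string, number>();
  counts.forEach((count, category) => shares.set(category, count / cases.length));
  return shares;
}

// Lists categories whose share exceeds maxShare (assumed default: 0.5).
function dominantCategories(cases: TestCase[], maxShare = 0.5): string[] {
  const dominant: string[] = [];
  categoryShares(cases).forEach((share, category) => {
    if (share > maxShare) dominant.push(category);
  });
  return dominant;
}

const dataset: TestCase[] = [
  { input: "I was charged twice", metadata: { category: "billing" } },
  { input: "Refund my annual plan", metadata: { category: "billing" } },
  { input: "How do I reset my password?", metadata: { category: "account" } },
];
console.log(dominantCategories(dataset));
```

Running a check like this as part of a regular dataset review gives an early signal when one query type starts to crowd out the rest.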
