Datasets
Build, manage, and version test datasets for your AI evals. Source data from production logs, manual curation, and synthetic generation.
Good evals start with good data. A dataset is a collection of test cases — inputs paired with expected outputs or reference answers — that represent the scenarios your AI needs to handle well.
Anatomy of a Test Case
interface TestCase {
  input: string; // The prompt or query
  expected?: string; // Reference answer (optional for some scorers)
  metadata?: Record<string, unknown>; // Tags, categories, difficulty, etc.
}
Not every scorer needs an expected value. Code-based checks like JSON validation or length constraints only need the output. LLM-as-Judge scorers can work with or without reference answers.
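A small runtime check can catch malformed cases before an eval runs. The sketch below assumes the TestCase shape shown above; `isTestCase` is a hypothetical helper written for illustration, not something the source says AI Stack provides.

```typescript
// Mirrors the TestCase shape described in this section.
interface TestCase {
  input: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}

// Hypothetical helper (not an AI Stack API): narrows an unknown value to TestCase.
function isTestCase(value: unknown): value is TestCase {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  const candidate = value as Record<string, unknown>;
  // `input` is the only required field.
  if (typeof candidate.input !== "string") return false;
  // `expected` is optional, but must be a string when present.
  if (candidate.expected !== undefined && typeof candidate.expected !== "string") {
    return false;
  }
  // `metadata` is optional, but must be a plain object when present.
  const metadata = candidate.metadata;
  if (metadata !== undefined) {
    if (typeof metadata !== "object" || metadata === null || Array.isArray(metadata)) {
      return false;
    }
  }
  return true;
}
```

Running a check like this over hand-written or imported cases surfaces a missing `input` early, instead of as a confusing scorer failure later.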
Sourcing Data
Manual Curation
Start here. Hand-pick 20–50 cases that represent your most important scenarios:
Eval("Customer Support Bot", {
  data: () => [
    {
      input: "How do I reset my password?",
      expected: "Guide the user to Settings > Security > Reset Password.",
      metadata: { category: "account", difficulty: "easy" },
    },
    {
      input: "I was charged twice for my subscription",
      expected: "Apologize, confirm the duplicate charge, initiate a refund.",
      metadata: { category: "billing", difficulty: "medium" },
    },
    // ... more cases
  ],
  task: async (input) => callYourModel(input),
  scores: { /* ... */ },
});
When to use: Starting out, covering critical paths, testing edge cases you've identified.
Production Logs
Sample real inputs from production to build datasets that reflect actual usage:
import { loadDataset } from "@aistack/evals";

Eval("Production Coverage", {
  // Load a dataset stored in AI Stack
  data: () => loadDataset("production-sample-2025-q1"),
  task: async (input) => callYourModel(input),
  scores: { /* ... */ },
});
You can create datasets from production logs in the AI Stack dashboard:
- Navigate to Logs in your project
- Filter by date range, model, or custom tags
- Click Create Dataset to sample and label cases
- Optionally add expected outputs manually
When to use: Ensuring your evals reflect real-world usage patterns, catching distribution shifts.
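If you export logs and sample them yourself rather than using the dashboard, a uniform random sample keeps the dataset close to the real input distribution. The following is a minimal sketch under that assumption: `LoggedRequest`, `sampleWithoutReplacement`, and the sample data are all hypothetical, and the real log format will differ.

```typescript
// Hypothetical shape of an exported log record; real logs will carry more fields.
interface LoggedRequest {
  input: string;
  model: string;
}

// Uniform sample of up to `size` items without replacement,
// using a partial Fisher-Yates shuffle on a copy of the input.
function sampleWithoutReplacement<T>(
  items: readonly T[],
  size: number,
  random: () => number = Math.random,
): T[] {
  const pool = [...items];
  const n = Math.max(0, Math.min(Math.floor(size), pool.length));
  for (let i = 0; i < n; i++) {
    // Pick a random index from the not-yet-chosen tail and swap it into place.
    const j = i + Math.floor(random() * (pool.length - i));
    const tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, n);
}

// Illustrative data only.
const exportedLogs: LoggedRequest[] = [
  { input: "How do I reset my password?", model: "model-a" },
  { input: "I was charged twice for my subscription", model: "model-a" },
  { input: "Can I change my billing date?", model: "model-b" },
];

// Turn sampled logs into test cases; expected outputs can be added later by hand.
const sampledCases = sampleWithoutReplacement(exportedLogs, 2).map((log) => ({
  input: log.input,
  metadata: { source: "production", model: log.model },
}));
```

The `random` parameter exists so the sample can be made reproducible in tests by passing a seeded generator.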
From Failed Cases
When you spot bad outputs in production or during manual review, add them to your dataset:
# Add a case from the CLI
npx aistack evals dataset add "my-dataset" \
  --input "What's your refund policy for annual plans?" \
  --expected "Explain the 30-day money-back guarantee for annual plans." \
  --metadata '{"source": "support-ticket-4521"}'
When to use: Building regression tests, ensuring fixed issues stay fixed.
Synthetic Generation
Use an LLM to generate test cases for scenarios you haven't observed yet:
import { generateTestCases } from "@aistack/evals";

const syntheticCases = await generateTestCases({
  description: "Customer support queries about billing issues",
  count: 50,
  categories: ["refunds", "upgrades", "cancellations", "invoices"],
  difficulty: ["easy", "medium", "hard"],
});
When to use: Expanding coverage, stress-testing edge cases, bootstrapping a new eval.
Dataset Management
Versioning
Datasets in AI Stack are versioned automatically. Each modification creates a new version:
# List dataset versions
npx aistack evals dataset versions "my-dataset"
# Pin an eval to a specific version
npx aistack evals run my-eval.ts --dataset-version 3
File-Based Datasets
You can also keep datasets as local files:
import { readFileSync } from "fs";

Eval("File-Based Dataset", {
  data: () => {
    const raw = readFileSync("./test-cases.json", "utf-8");
    return JSON.parse(raw);
  },
  task: async (input) => callYourModel(input),
  scores: { /* ... */ },
});
Supported formats:
- JSON: Array of { input, expected, metadata } objects
- CSV: Columns for input, expected, and any metadata fields
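To make the CSV layout concrete, here is one way the columns could map onto test cases: `input` and `expected` become fields, and every other column becomes a metadata entry. This is a rough sketch, not the loader AI Stack uses. `parseCsvLine` and `csvToTestCases` are hypothetical helpers, and the parser handles quoted fields but not line breaks inside a quoted field.

```typescript
interface TestCase {
  input: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}

// Split one CSV line into fields, honoring double-quoted fields and "" escapes.
function parseCsvLine(line: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"') {
        if (line[i + 1] === '"') {
          current += '"'; // escaped quote
          i++;
        } else {
          inQuotes = false; // closing quote
        }
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Hypothetical mapping: the header row names the columns; columns other than
// "input" and "expected" are stored as string values under metadata.
function csvToTestCases(csv: string): TestCase[] {
  const lines = csv.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) return [];
  const header = parseCsvLine(lines[0]).map((name) => name.trim());
  const inputIndex = header.indexOf("input");
  if (inputIndex === -1) throw new Error('CSV must have an "input" column');

  return lines.slice(1).map((line) => {
    const cells = parseCsvLine(line);
    const testCase: TestCase = { input: cells[inputIndex] ?? "" };
    const metadata: Record<string, unknown> = {};
    header.forEach((name, i) => {
      const value = cells[i] ?? "";
      if (name === "input" || value === "") return;
      if (name === "expected") {
        testCase.expected = value;
      } else {
        metadata[name] = value;
      }
    });
    if (Object.keys(metadata).length > 0) testCase.metadata = metadata;
    return testCase;
  });
}
```

For anything beyond simple files, a dedicated CSV library is a safer choice than a hand-rolled parser.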
Golden Datasets
Mark a dataset as "golden" to use it as your canonical test suite:
npx aistack evals dataset set-golden "my-dataset" --version 5
Golden datasets are used by default in CI/CD pipelines and serve as the baseline for experiment comparisons.
Best Practices
- Start small, grow organically — 20 well-chosen cases beat 1,000 random ones
- Cover your failure modes — Every bug you fix should become a test case
- Balance categories — Don't let one type of query dominate your dataset
- Review regularly — Remove outdated cases, add new scenarios quarterly
- Version everything — Always know which dataset version produced which results
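One quick way to check category balance is to count cases per `metadata.category`, the tag used in the manual curation example. The sketch below assumes that convention; `countByCategory` is a hypothetical helper, and cases without a string category are grouped under "uncategorized".

```typescript
interface TestCase {
  input: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}

// Count test cases per metadata.category so a skewed dataset is easy to spot.
function countByCategory(cases: readonly TestCase[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const testCase of cases) {
    const raw = testCase.metadata?.category;
    const category = typeof raw === "string" ? raw : "uncategorized";
    counts.set(category, (counts.get(category) ?? 0) + 1);
  }
  return counts;
}

// Illustrative data only.
const exampleCases: TestCase[] = [
  { input: "How do I reset my password?", metadata: { category: "account" } },
  { input: "I was charged twice", metadata: { category: "billing" } },
  { input: "Where is my invoice?", metadata: { category: "billing" } },
  { input: "Hello?" },
];

const categoryCounts = countByCategory(exampleCases);
for (const [category, count] of categoryCounts) {
  console.log(`${category}: ${count}`);
}
```

If one category holds most of the cases, that is a cue to curate or generate more for the others.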
What's Next?
- CI/CD Integration — Automate evals with your golden dataset in your pipeline