Running Evaluations

Learn how to run evals from the CLI, configure experiments, and use the AI Stack UI to explore results.

Once you've written an eval file, AI Stack gives you multiple ways to run and inspect results.

CLI Usage

The primary way to run evals is through the CLI:

npx aistack evals run <file>

Common Flags

Flag                 Description
--watch              Re-run evals when the file changes
--json               Output results as JSON for programmatic use
--no-send-logs       Run locally without uploading results to AI Stack
--verbose            Show detailed output for each test case
--filter <name>      Only run evals matching the given name
--concurrency <n>    Max parallel test cases (default: 5)

Examples

# Watch mode — great during development
npx aistack evals run my-eval.ts --watch

# Run without uploading results
npx aistack evals run my-eval.ts --no-send-logs

# JSON output for CI pipelines
npx aistack evals run my-eval.ts --json > results.json

# Run only specific evals in a file with many
npx aistack evals run my-eval.ts --filter "Summarization"
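
In a CI pipeline, a small script can check the file written by --json and fail the build on low scores. The sketch below assumes a hypothetical result shape (an array of cases, each with a name and a map of scorer names to values between 0 and 1). The real schema is not described on this page, so inspect your own results.json and adjust the types before relying on it. The names CaseResult, Failure, and findFailures are illustrative only.

```typescript
// Hypothetical shape of one entry in results.json. This is an assumption
// for illustration; check the real output of `--json` and adjust.
interface CaseResult {
  name: string;
  scores: Record<string, number>;
}

interface Failure {
  name: string;
  scorer: string;
  score: number;
}

// Return every (case, scorer) pair whose score falls below the threshold.
function findFailures(results: CaseResult[], threshold: number): Failure[] {
  const failures: Failure[] = [];
  for (const result of results) {
    for (const [scorer, score] of Object.entries(result.scores)) {
      if (score < threshold) {
        failures.push({ name: result.name, scorer, score });
      }
    }
  }
  return failures;
}

// Inline sample data; in CI you would parse results.json instead.
const sample: CaseResult[] = [
  { name: "short article", scores: { accuracy: 0.9, brevity: 0.8 } },
  { name: "long article", scores: { accuracy: 0.4, brevity: 0.95 } },
];

const failures = findFailures(sample, 0.5);
for (const f of failures) {
  console.log(`${f.name}: ${f.scorer} scored ${f.score}`);
}
```

A CI step would typically exit with a non-zero status when the returned list is not empty, so a regression blocks the pipeline.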

Experiment Configuration

Each eval run creates an experiment: a versioned snapshot of your eval results. You can tag an experiment with metadata and give it a custom name in the eval definition:

import { Eval } from "@aistack/evals";

Eval("My Eval", {
  // Tag experiments for filtering in the dashboard
  metadata: {
    model: "gpt-4o",
    promptVersion: "v2.3",
    environment: "staging",
  },

  // Set a custom experiment name (defaults to timestamp)
  experimentName: "gpt-4o-v2.3-baseline",

  data: () => [...],
  task: async (input) => {...},
  scores: {...},
});
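
If experiment names should stay consistent with the metadata, one option is to derive the name from the same values. This is a minimal sketch, not part of the AI Stack API: buildExperimentName and the ExperimentMetadata type are hypothetical helpers, and the fields simply mirror the metadata used in the example above.

```typescript
// Metadata fields mirror the example above; the type itself is hypothetical.
interface ExperimentMetadata {
  model: string;
  promptVersion: string;
  environment: string;
}

// Join the model, prompt version, and a free-form label into one name,
// replacing any internal whitespace with hyphens.
function buildExperimentName(metadata: ExperimentMetadata, label: string): string {
  return [metadata.model, metadata.promptVersion, label]
    .map((part) => part.trim().replace(/\s+/g, "-"))
    .join("-");
}

const metadata: ExperimentMetadata = {
  model: "gpt-4o",
  promptVersion: "v2.3",
  environment: "staging",
};

const experimentName = buildExperimentName(metadata, "baseline");
console.log(experimentName); // prints "gpt-4o-v2.3-baseline"
```

The resulting string could then be passed as experimentName, with the same metadata object passed as metadata, so the two never drift apart.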

Multiple Evals in One File

You can define multiple evals in a single file. Each creates its own experiment:

import { Eval } from "@aistack/evals";

// These run sequentially
Eval("Summarization", {
  data: () => [...],
  task: async (input) => {...},
  scores: {...},
});

Eval("Classification", {
  data: () => [...],
  task: async (input) => {...},
  scores: {...},
});

Comparing Experiments

After running multiple experiments, you can compare them in the AI Stack dashboard:

  1. Navigate to your project's Evals page
  2. Select two or more experiments
  3. View side-by-side score comparisons
  4. Drill into individual cases where scores differ

This is especially useful when:

  • Testing a prompt change against the current baseline
  • Comparing model performance (e.g., GPT-4o vs. Claude 3.5)
  • Validating that a cost optimization doesn't sacrifice quality
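
The drill-down into cases where scores differ amounts to a per-case score difference between two experiments. The sketch below illustrates that idea only. It assumes each experiment can be reduced to a map from case name to a single score, which is a hypothetical simplification and not the dashboard's data model; diffExperiments and the types are illustrative names.

```typescript
// Hypothetical simplification: one score per test case, keyed by case name.
type ExperimentScores = Record<string, number>;

interface ScoreDelta {
  caseName: string;
  baseline: number;
  candidate: number;
  delta: number; // candidate minus baseline; negative means a regression
}

// List the cases present in both experiments whose scores differ,
// ordered so the largest regressions come first.
function diffExperiments(
  baseline: ExperimentScores,
  candidate: ExperimentScores,
): ScoreDelta[] {
  const deltas: ScoreDelta[] = [];
  for (const [caseName, before] of Object.entries(baseline)) {
    const after = candidate[caseName];
    if (after === undefined) continue; // case missing from the candidate run
    const delta = after - before;
    if (delta !== 0) {
      deltas.push({ caseName, baseline: before, candidate: after, delta });
    }
  }
  return deltas.sort((a, b) => a.delta - b.delta);
}

const baselineRun: ExperimentScores = { a: 0.75, b: 0.5, c: 1 };
const candidateRun: ExperimentScores = { a: 0.75, b: 0.75, c: 0.5 };

const deltas = diffExperiments(baselineRun, candidateRun);
for (const d of deltas) {
  console.log(`${d.caseName}: ${d.baseline} -> ${d.candidate} (${d.delta})`);
}
```

Sorting by delta puts regressions at the top, which is usually where a review of a prompt or model change starts.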

The Eval Playground

The AI Stack dashboard includes an interactive playground where you can:

  • Manually run individual test cases
  • Edit inputs and re-run to test hypotheses
  • Add new test cases from the UI
  • Export cases to your eval dataset

What's Next?

  • Scorers — Build sophisticated scoring with LLM-as-Judge and custom rubrics
  • Datasets — Manage and version your test data
  • CI/CD Integration — Automate evals in your deployment pipeline