Learn how to run evals from the CLI, configure experiments, and use the AI Stack UI to explore results.

Running Evaluations

Once you've written an eval file, AI Stack gives you several ways to run it and inspect the results.

CLI Usage

The primary way to run evals is through the CLI:

npx aistack evals run <file>

Common Flags

Flag	Description
`--watch`	Re-run evals when the file changes
`--json`	Output results as JSON for programmatic use
`--no-send-logs`	Run locally without uploading results to AI Stack
`--verbose`	Show detailed output for each test case
`--filter <name>`	Only run evals matching the given name
`--concurrency <n>`	Max parallel test cases (default: 5)

Examples

# Watch mode — great during development
npx aistack evals run my-eval.ts --watch

# Run without uploading results
npx aistack evals run my-eval.ts --no-send-logs

# JSON output for CI pipelines
npx aistack evals run my-eval.ts --json > results.json

# Run only specific evals in a file with many
npx aistack evals run my-eval.ts --filter "Summarization"
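
The `--json` output can feed a pass/fail gate in CI. The sketch below is only an illustration: the `CaseResult` shape (a `name` plus a `scores` map of numbers between 0 and 1) and the `failingCases` helper are assumptions made for this example, not the documented schema of `results.json`, so adjust the types to match the JSON your run actually produces.

```typescript
// Hypothetical shape of one test case in results.json.
// This is an assumption for illustration, not the documented AI Stack schema.
interface CaseResult {
  name: string;
  scores: Record<string, number>;
}

// Return the names of cases where any score falls below the threshold.
function failingCases(cases: CaseResult[], threshold: number): string[] {
  return cases
    .filter((c) => Object.values(c.scores).some((s) => s < threshold))
    .map((c) => c.name);
}

// Inline sample standing in for the parsed contents of results.json.
const sample: CaseResult[] = [
  { name: "short article", scores: { accuracy: 0.9, brevity: 0.8 } },
  { name: "long article", scores: { accuracy: 0.4, brevity: 0.95 } },
];

const failed = failingCases(sample, 0.5);
console.log(
  failed.length === 0 ? "all cases passed" : `failing: ${failed.join(", ")}`
);
```

In a pipeline, you would replace `sample` with the parsed file contents and exit with a non-zero status when `failed` is not empty.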

Experiment Configuration

Each eval run creates an experiment, a versioned snapshot of your eval results. You can tag an experiment with metadata and give it a custom name:

import { Eval } from "@aistack/evals";

Eval("My Eval", {
  // Tag experiments for filtering in the dashboard
  metadata: {
    model: "gpt-4o",
    promptVersion: "v2.3",
    environment: "staging",
  },

  // Set a custom experiment name (defaults to timestamp)
  experimentName: "gpt-4o-v2.3-baseline",

  data: () => [...],
  task: async (input) => {...},
  scores: {...},
});
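
If you want experiment names to stay in sync with the metadata, one option is to derive the name from it. The helper below is a minimal sketch: `ExperimentMetadata` and `buildExperimentName` are hypothetical names introduced here, not part of `@aistack/evals`, and the metadata keys simply mirror the example above.

```typescript
// Hypothetical metadata shape mirroring the keys used in the example above.
interface ExperimentMetadata {
  model: string;
  promptVersion: string;
  environment: string;
}

// Join the model and prompt version, plus an optional suffix, with hyphens.
function buildExperimentName(
  meta: ExperimentMetadata,
  suffix?: string
): string {
  const parts = [meta.model, meta.promptVersion];
  if (suffix) parts.push(suffix);
  return parts.join("-");
}

const metadata: ExperimentMetadata = {
  model: "gpt-4o",
  promptVersion: "v2.3",
  environment: "staging",
};

const experimentName = buildExperimentName(metadata, "baseline");
console.log(experimentName); // prints "gpt-4o-v2.3-baseline"
```

The resulting string can be passed as `experimentName` alongside the same `metadata` object, so the dashboard filter values and the experiment name never drift apart.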

Multiple Evals in One File

You can define multiple evals in a single file. Each creates its own experiment:

import { Eval } from "@aistack/evals";

// Evals in the same file run sequentially; each one creates its own experiment
Eval("Summarization", {
  data: () => [...],
  task: async (input) => {...},
  scores: {...},
});

Eval("Classification", {
  data: () => [...],
  task: async (input) => {...},
  scores: {...},
});

Comparing Experiments

After running multiple experiments, you can compare them in the AI Stack dashboard:

1. Navigate to your project's Evals page
2. Select two or more experiments
3. View side-by-side score comparisons
4. Drill into individual cases where scores differ

This is especially useful when:

- Testing a prompt change against the current baseline
- Comparing model performance (e.g., GPT-4o vs. Claude 3.5)
- Validating that a cost optimization doesn't sacrifice quality
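
The same kind of comparison can be approximated locally, for example on two saved `--json` outputs. The sketch below assumes you have already reduced each experiment to a map from case name to a single score; `ScoresByCase`, `ScoreDelta`, and `diffExperiments` are hypothetical names for this illustration and do not reflect how the dashboard computes its comparison.

```typescript
// Hypothetical: one score per case, keyed by case name.
type ScoresByCase = Record<string, number>;

interface ScoreDelta {
  name: string;
  baseline: number;
  candidate: number;
  delta: number; // candidate minus baseline; negative means a regression
}

// List cases whose score changed by at least minDelta, worst regressions first.
function diffExperiments(
  baseline: ScoresByCase,
  candidate: ScoresByCase,
  minDelta = 0.05
): ScoreDelta[] {
  const deltas: ScoreDelta[] = [];
  for (const [name, base] of Object.entries(baseline)) {
    const cand = candidate[name];
    if (cand === undefined) continue; // case missing from the candidate run
    const delta = cand - base;
    if (Math.abs(delta) >= minDelta) {
      deltas.push({ name, baseline: base, candidate: cand, delta });
    }
  }
  return deltas.sort((a, b) => a.delta - b.delta);
}

// Made-up scores for illustration only.
const baselineScores: ScoresByCase = { a: 0.9, b: 0.5, c: 0.75 };
const candidateScores: ScoresByCase = { a: 0.6, b: 0.75, c: 0.75 };

const changed = diffExperiments(baselineScores, candidateScores);
for (const d of changed) {
  console.log(`${d.name}: ${d.baseline} -> ${d.candidate}`);
}
```

Sorting by `delta` puts regressions at the top, which matches the usual workflow of checking what a prompt or model change made worse before looking at what it improved.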

The Eval Playground

The AI Stack dashboard includes an interactive playground where you can:

- Manually run individual test cases
- Edit inputs and re-run to test hypotheses
- Add new test cases from the UI
- Export cases to your eval dataset

What's Next?

Scorers — Build sophisticated scoring with LLM-as-Judge and custom rubrics
Datasets — Manage and version your test data
CI/CD Integration — Automate evals in your deployment pipeline
