Running Evaluations
Learn how to run evals from the CLI, configure experiments, and use the AI Stack UI to explore results.
Once you've written an eval file, AI Stack gives you multiple ways to run and inspect results.
CLI Usage
The primary way to run evals is through the CLI:
npx aistack evals run <file>
Common Flags
| Flag | Description |
|---|---|
| `--watch` | Re-run evals when the file changes |
| `--json` | Output results as JSON for programmatic use |
| `--no-send-logs` | Run locally without uploading results to AI Stack |
| `--verbose` | Show detailed output for each test case |
| `--filter <name>` | Only run evals matching the given name |
| `--concurrency <n>` | Max parallel test cases (default: 5) |
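As one illustration of what `--json` enables, a CI step could parse the output and fail the build when an eval's mean score drops below a threshold. The sketch below is only that: the result shape (`name`, `cases[].scores`) and the function names `meanScore` and `failingEvals` are hypothetical and not taken from AI Stack, so inspect the JSON your version actually emits and adjust the types.

```typescript
// Hypothetical shape of the JSON emitted by `--json`. This is an assumption
// for illustration; check your real output and adjust the types.
interface CaseResult {
  scores: Record<string, number>; // scorer name -> score, assumed to be in [0, 1]
}

interface EvalResult {
  name: string;
  cases: CaseResult[];
}

// Mean of one scorer across all cases that report it; NaN if none do.
function meanScore(result: EvalResult, scorer: string): number {
  const values = result.cases
    .map((c) => c.scores[scorer])
    .filter((v): v is number => typeof v === "number");
  if (values.length === 0) return NaN;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Returns the names of evals whose mean score is below the threshold.
// An eval with no scores for the scorer (mean is NaN) is treated as failing.
function failingEvals(
  results: EvalResult[],
  scorer: string,
  threshold: number,
): string[] {
  return results
    .filter((r) => !(meanScore(r, scorer) >= threshold))
    .map((r) => r.name);
}
```

In a pipeline, you might load `results.json` with `JSON.parse`, call `failingEvals(results, "accuracy", 0.8)`, and exit with a non-zero status if the returned list is non-empty. The scorer name `accuracy` and the 0.8 threshold are placeholders.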
Examples
# Watch mode — great during development
npx aistack evals run my-eval.ts --watch
# Run without uploading results
npx aistack evals run my-eval.ts --no-send-logs
# JSON output for CI pipelines
npx aistack evals run my-eval.ts --json > results.json
# Run only specific evals in a file with many
npx aistack evals run my-eval.ts --filter "Summarization"
Experiment Configuration
Each eval run creates an experiment: a versioned snapshot of your eval results. You can give an experiment a custom name and attach metadata to it:
import { Eval } from "@aistack/evals";

Eval("My Eval", {
  // Tag experiments for filtering in the dashboard
  metadata: {
    model: "gpt-4o",
    promptVersion: "v2.3",
    environment: "staging",
  },
  // Set a custom experiment name (defaults to timestamp)
  experimentName: "gpt-4o-v2.3-baseline",
  data: () => [...],
  task: async (input) => {...},
  scores: {...},
});
Multiple Evals in One File
You can define multiple evals in a single file. Each creates its own experiment:
import { Eval } from "@aistack/evals";

// These run sequentially
Eval("Summarization", {
  data: () => [...],
  task: async (input) => {...},
  scores: {...},
});

Eval("Classification", {
  data: () => [...],
  task: async (input) => {...},
  scores: {...},
});
Comparing Experiments
After running multiple experiments, you can compare them in the AI Stack dashboard:
- Navigate to your project's Evals page
- Select two or more experiments
- View side-by-side score comparisons
- Drill into individual cases where scores differ
This is especially useful when:
- Testing a prompt change against the current baseline
- Comparing model performance (e.g., GPT-4o vs. Claude 3.5)
- Validating that a cost optimization doesn't sacrifice quality
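The idea behind drilling into cases where scores differ can also be sketched offline. The snippet below is a minimal illustration, not the dashboard's implementation: it assumes each experiment has already been reduced to a map from a case ID to a single score, and both that representation and the `findRegressions` helper are hypothetical.

```typescript
// Hypothetical representation: case ID -> score for one experiment.
type ExperimentScores = Record<string, number>;

interface Regression {
  caseId: string;
  baseline: number;
  candidate: number;
  delta: number; // candidate - baseline; negative means the score dropped
}

// Lists cases present in both experiments whose score dropped by more than
// `tolerance`, with the largest drop first.
function findRegressions(
  baseline: ExperimentScores,
  candidate: ExperimentScores,
  tolerance = 0,
): Regression[] {
  const regressions: Regression[] = [];
  for (const caseId of Object.keys(baseline)) {
    const before = baseline[caseId];
    const after = candidate[caseId];
    // Skip cases that are missing from either experiment.
    if (before === undefined || after === undefined) continue;
    const delta = after - before;
    if (delta < -tolerance) {
      regressions.push({ caseId, baseline: before, candidate: after, delta });
    }
  }
  // Most negative delta (worst regression) first.
  return regressions.sort((a, b) => a.delta - b.delta);
}
```

A non-zero `tolerance` helps ignore small fluctuations, which matters when scores come from non-deterministic scorers such as LLM-based judges.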
The Eval Playground
The AI Stack dashboard includes an interactive playground where you can:
- Manually run individual test cases
- Edit inputs and re-run to test hypotheses
- Add new test cases from the UI
- Export cases to your eval dataset
What's Next?
- Scorers — Build sophisticated scoring with LLM-as-Judge and custom rubrics
- Datasets — Manage and version your test data
- CI/CD Integration — Automate evals in your deployment pipeline