CI/CD Integration
Automate AI evals in your CI/CD pipeline with GitHub Actions and other CI systems. Set fail thresholds to gate deployments on quality.
The highest-leverage use of evals is running them automatically on every deploy. AI Stack integrates with your CI/CD pipeline to gate deployments on eval quality scores.
GitHub Actions
Add evals to your GitHub Actions workflow:
# .github/workflows/evals.yml
name: AI Evals
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - name: Install dependencies
        run: npm ci
      - name: Run evals
        env:
          AISTACK_API_KEY: ${{ secrets.AISTACK_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: npx aistack evals run ./evals/ --json > eval-results.json
      - name: Check thresholds
        run: npx aistack evals check eval-results.json --config evals.config.json

Threshold Configuration
Define minimum score thresholds in evals.config.json:
{
  "thresholds": {
    "Summarization Quality": {
      "relevance": 0.8,
      "accuracy": 0.9
    },
    "Customer Support Bot": {
      "helpfulness": 0.75,
      "safety": 1.0
    }
  },
  "failOnRegression": true,
  "regressionThreshold": 0.05
}

| Option | Description |
|---|---|
| thresholds | Minimum scores per eval per scorer. Fails if any score falls below. |
| failOnRegression | Fail if any score drops compared to the last passing run. |
| regressionThreshold | Allowed score drop before triggering a regression failure (e.g., 0.05 = 5% drop allowed). |
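To make the interaction between these three options concrete, here is a minimal TypeScript sketch of how such a check could work. It is not AI Stack's implementation: the `Scores` shape, the `checkScores` function, and the `Failure` type are hypothetical, the real layout of eval-results.json may differ, and the sketch assumes `regressionThreshold` is an absolute drop in score (0.05 means 0.05 score points) and that a missing score counts as a failure.

```typescript
// Hypothetical shape: eval name -> scorer name -> score in [0, 1].
type Scores = Record<string, Record<string, number>>;

interface EvalsConfig {
  thresholds: Scores;
  failOnRegression?: boolean;
  regressionThreshold?: number;
}

interface Failure {
  evalName: string;
  scorer: string;
  reason: "below_threshold" | "regression" | "missing_score";
}

function checkScores(current: Scores, config: EvalsConfig, lastPassing?: Scores): Failure[] {
  const failures: Failure[] = [];

  // 1. Absolute thresholds: every configured scorer must meet its minimum.
  for (const [evalName, scorers] of Object.entries(config.thresholds)) {
    for (const [scorer, minimum] of Object.entries(scorers)) {
      const score = current[evalName]?.[scorer];
      if (score === undefined) {
        failures.push({ evalName, scorer, reason: "missing_score" });
      } else if (score < minimum) {
        failures.push({ evalName, scorer, reason: "below_threshold" });
      }
    }
  }

  // 2. Regressions: compare against the last passing run, if one is available.
  if (config.failOnRegression && lastPassing) {
    const allowedDrop = config.regressionThreshold ?? 0;
    for (const [evalName, scorers] of Object.entries(lastPassing)) {
      for (const [scorer, previous] of Object.entries(scorers)) {
        const score = current[evalName]?.[scorer];
        if (score === undefined) continue; // already reported above if configured
        // Small epsilon so a drop of exactly allowedDrop is not flagged by float noise.
        if (previous - score - allowedDrop > 1e-9) {
          failures.push({ evalName, scorer, reason: "regression" });
        }
      }
    }
  }
  return failures;
}

// Usage with the thresholds from evals.config.json above and made-up scores.
const config: EvalsConfig = {
  thresholds: {
    "Summarization Quality": { relevance: 0.8, accuracy: 0.9 },
    "Customer Support Bot": { helpfulness: 0.75, safety: 1.0 },
  },
  failOnRegression: true,
  regressionThreshold: 0.05,
};
const current: Scores = {
  "Summarization Quality": { relevance: 0.85, accuracy: 0.92 },
  "Customer Support Bot": { helpfulness: 0.8, safety: 0.99 },
};
const lastPassing: Scores = {
  "Summarization Quality": { relevance: 0.95, accuracy: 0.93 },
  "Customer Support Bot": { helpfulness: 0.78, safety: 1.0 },
};
const failures = checkScores(current, config, lastPassing);
```

In this example `safety` fails its absolute threshold (0.99 is below 1.0), and `relevance` is flagged as a regression even though it clears its 0.8 minimum, because it dropped 0.10 from the last passing run. A CI wrapper would exit with a non-zero code whenever `failures` is non-empty.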
Other CI Systems
Generic Script
For any CI system, the core commands are:
# Run evals and output JSON
npx aistack evals run ./evals/ --json > eval-results.json
# Check results against thresholds (exits with code 1 on failure)
npx aistack evals check eval-results.json --config evals.config.json

GitLab CI
# .gitlab-ci.yml
evals:
  stage: test
  image: node:20
  script:
    - npm ci
    - npx aistack evals run ./evals/ --json > eval-results.json
    - npx aistack evals check eval-results.json --config evals.config.json
  variables:
    AISTACK_API_KEY: $AISTACK_API_KEY
    OPENAI_API_KEY: $OPENAI_API_KEY

CircleCI
# .circleci/config.yml
jobs:
  evals:
    docker:
      - image: cimg/node:20.0
    steps:
      - checkout
      - run: npm ci
      - run:
          name: Run AI Evals
          command: |
            npx aistack evals run ./evals/ --json > eval-results.json
            npx aistack evals check eval-results.json --config evals.config.json
          environment:
            AISTACK_API_KEY: ${AISTACK_API_KEY}

PR Comments
When evals run in a GitHub Actions workflow triggered by a pull request, AI Stack automatically posts a comment with:
- Score summary per eval
- Comparison against the base branch
- Links to the full experiment in the dashboard
- Pass/fail status for each threshold
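As an illustration of how those four pieces of information might fit together, the sketch below renders a markdown summary table. It does not reproduce AI Stack's actual comment format: the `ScoreRow` type, the `renderSummary` and `formatDelta` functions, and the example dashboard URL are all hypothetical.

```typescript
// Hypothetical row: one scorer result for one eval.
interface ScoreRow {
  evalName: string;
  scorer: string;
  score: number;
  baseScore?: number; // score on the base branch, if known
  threshold?: number; // configured minimum, if any
}

// Signed difference against the base branch, or "n/a" when there is no base score.
function formatDelta(score: number, baseScore?: number): string {
  if (baseScore === undefined) return "n/a";
  const delta = score - baseScore;
  return `${delta > 0 ? "+" : ""}${delta.toFixed(2)}`;
}

function renderSummary(rows: ScoreRow[], dashboardUrl: string): string {
  const lines = [
    "| Eval | Scorer | Score | vs. base | Status |",
    "|---|---|---|---|---|",
  ];
  for (const row of rows) {
    const status =
      row.threshold === undefined ? "no threshold" : row.score >= row.threshold ? "pass" : "fail";
    lines.push(
      `| ${row.evalName} | ${row.scorer} | ${row.score.toFixed(2)} | ` +
        `${formatDelta(row.score, row.baseScore)} | ${status} |`,
    );
  }
  // Link to the full experiment; the URL is supplied by the caller.
  lines.push("", `[Full experiment](${dashboardUrl})`);
  return lines.join("\n");
}

// Usage with made-up scores and a placeholder URL.
const summary = renderSummary(
  [
    { evalName: "Summarization Quality", scorer: "relevance", score: 0.85, baseScore: 0.8, threshold: 0.8 },
    { evalName: "Customer Support Bot", scorer: "safety", score: 0.99, baseScore: 1.0, threshold: 1.0 },
  ],
  "https://example.com/experiments/123",
);
```

Keeping one row per scorer makes it easy to see which threshold caused a failing check without opening the dashboard.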
To enable PR comments, pass GITHUB_TOKEN to the step that runs the evals:
- name: Run evals
  env:
    AISTACK_API_KEY: ${{ secrets.AISTACK_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: npx aistack evals run ./evals/ --json > eval-results.json

Best Practices
- Start with a smoke test: even one eval in CI is far better than none
- Set conservative thresholds initially: you can tighten them as your evals mature
- Use failOnRegression: catching regressions is often more valuable than absolute thresholds
- Cache model responses in CI: use --no-send-logs during initial testing to reduce costs
- Run expensive evals nightly: put fast code-based evals in PR checks, full LLM-judge suites in nightly runs
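One possible way to pick a conservative starting threshold is to derive it from scores you have already observed, rather than guessing. The sketch below is only an illustration of that idea: `suggestThreshold` and its `margin` parameter are hypothetical helpers, not part of the aistack CLI.

```typescript
// Suggest a starting threshold: the lowest observed score minus a safety margin,
// rounded down to two decimals so the gate never starts stricter than past runs.
function suggestThreshold(history: number[], margin = 0.05): number {
  if (history.length === 0) {
    throw new Error("need at least one historical score");
  }
  const lowest = Math.min(...history);
  const suggested = Math.max(0, lowest - margin);
  // Small epsilon guards against floating-point noise before flooring.
  return Math.floor(suggested * 100 + 1e-9) / 100;
}

// Usage: three recent relevance scores from earlier runs (made-up values).
const startingRelevance = suggestThreshold([0.91, 0.88, 0.93]);
```

With these example scores the suggested threshold is 0.83. As the suite stabilizes, shrinking `margin` tightens the gate gradually.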
What's Next?
- AI Evals Overview — Revisit the fundamentals
- Scorers — Build the right scoring strategy for your CI gates