CI/CD Integration

Automate AI evals in your CI/CD pipeline with GitHub Actions and other CI systems. Set fail thresholds to gate deployments on quality.

The highest-leverage use of evals is running them automatically on every deploy. AI Stack integrates with your CI/CD pipeline to gate deployments on eval quality scores.

GitHub Actions

Add evals to your GitHub Actions workflow:

# .github/workflows/evals.yml
name: AI Evals

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: "20"

      - name: Install dependencies
        run: npm ci

      - name: Run evals
        env:
          AISTACK_API_KEY: ${{ secrets.AISTACK_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: npx aistack evals run ./evals/ --json > eval-results.json

      - name: Check thresholds
        run: npx aistack evals check eval-results.json --config evals.config.json

Threshold Configuration

Define minimum score thresholds in evals.config.json:

{
  "thresholds": {
    "Summarization Quality": {
      "relevance": 0.8,
      "accuracy": 0.9
    },
    "Customer Support Bot": {
      "helpfulness": 0.75,
      "safety": 1.0
    }
  },
  "failOnRegression": true,
  "regressionThreshold": 0.05
}

Option               Description
thresholds           Minimum scores per eval per scorer. The check fails if any score falls below its threshold.
failOnRegression     Fail if any score drops by more than regressionThreshold compared to the last passing run.
regressionThreshold  Allowed score drop before triggering a regression failure (e.g., 0.05 = 5% drop allowed).
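
To make the interaction between these options concrete, here is a minimal TypeScript sketch of the kind of check the configuration describes. It is not AI Stack's implementation: the `Scores` shape, the `checkScores` function, and the reading of `regressionThreshold` as an absolute drop on a 0 to 1 scale are all assumptions made for illustration, since the format of eval-results.json is not shown in this guide.

```typescript
// Hypothetical shape: eval name -> scorer name -> score in [0, 1].
type Scores = Record<string, Record<string, number>>;

interface CheckConfig {
  thresholds: Scores;
  failOnRegression: boolean;
  regressionThreshold: number;
}

// Returns human-readable failure messages; an empty array means the gate passes.
function checkScores(current: Scores, baseline: Scores, config: CheckConfig): string[] {
  const failures: string[] = [];
  const epsilon = 1e-9; // guards against floating-point noise at the boundary
  for (const [evalName, scorers] of Object.entries(current)) {
    for (const [scorer, score] of Object.entries(scorers)) {
      // Absolute gate: the score must not fall below the configured minimum.
      const min = config.thresholds[evalName]?.[scorer];
      if (min !== undefined && score < min) {
        failures.push(`${evalName}/${scorer}: ${score} is below threshold ${min}`);
      }
      // Relative gate (assumption: an absolute drop versus the last passing run).
      const previous = baseline[evalName]?.[scorer];
      if (
        config.failOnRegression &&
        previous !== undefined &&
        previous - score > config.regressionThreshold + epsilon
      ) {
        failures.push(`${evalName}/${scorer}: dropped from ${previous} to ${score}`);
      }
    }
  }
  return failures;
}

const config: CheckConfig = {
  thresholds: { "Summarization Quality": { relevance: 0.8, accuracy: 0.9 } },
  failOnRegression: true,
  regressionThreshold: 0.05,
};
const baseline: Scores = { "Summarization Quality": { relevance: 0.92, accuracy: 0.9 } };
const current: Scores = { "Summarization Quality": { relevance: 0.84, accuracy: 0.88 } };

const failures = checkScores(current, baseline, config);
console.log(failures);
```

In this example, relevance still clears its 0.8 threshold but fails the regression gate (a drop of 0.08), while accuracy fails its absolute threshold with only a small drop. The two gates catch different problems, which is why the configuration offers both.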

Other CI Systems

Generic Script

For any CI system, the core commands are:

# Run evals and output JSON
npx aistack evals run ./evals/ --json > eval-results.json

# Check results against thresholds (exits with code 1 on failure)
npx aistack evals check eval-results.json --config evals.config.json

GitLab CI

# .gitlab-ci.yml
evals:
  stage: test
  image: node:20
  script:
    - npm ci
    - npx aistack evals run ./evals/ --json > eval-results.json
    - npx aistack evals check eval-results.json --config evals.config.json
  variables:
    AISTACK_API_KEY: $AISTACK_API_KEY
    OPENAI_API_KEY: $OPENAI_API_KEY

CircleCI

# .circleci/config.yml
jobs:
  evals:
    docker:
      - image: cimg/node:20.0
    steps:
      - checkout
      - run: npm ci
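      # AISTACK_API_KEY and OPENAI_API_KEY must be set as project environment
      # variables (or in a context); CircleCI injects them into every step and
      # does not interpolate ${VAR} inside an `environment` block.
      - run:
          name: Run AI Evals
          command: |
            npx aistack evals run ./evals/ --json > eval-results.json
            npx aistack evals check eval-results.json --config evals.config.json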

PR Comments

When evals run in a GitHub Actions workflow triggered by a pull request, AI Stack automatically posts a comment with:

  • Score summary per eval
  • Comparison against the base branch
  • Links to the full experiment in the dashboard
  • Pass/fail status for each threshold

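To enable PR comments, pass GITHUB_TOKEN to the step that runs the evals. If your repository's default token permissions are read-only, the job may also need the pull-requests: write permission.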

- name: Run evals
  env:
    AISTACK_API_KEY: ${{ secrets.AISTACK_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: npx aistack evals run ./evals/ --json > eval-results.json
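
AI Stack posts this comment for you, so no extra code is needed. Purely to illustrate what the comment summarizes, here is a hedged TypeScript sketch that renders a similar table. The `ScoreRow` shape, the `renderComment` function, and the example URL are hypothetical and do not reflect AI Stack's actual output format.

```typescript
// Hypothetical row: one scorer result for one eval.
interface ScoreRow {
  evalName: string;
  scorer: string;
  score: number;
  baseScore?: number; // score on the base branch, if available
  threshold?: number; // configured minimum, if any
}

// Formats a score difference with an explicit sign, e.g. "+0.03" or "-0.03".
function formatDelta(delta: number): string {
  const sign = delta >= 0 ? "+" : "";
  return `${sign}${delta.toFixed(2)}`;
}

// Builds a Markdown comment body: score, change versus base, and pass/fail per row.
function renderComment(rows: ScoreRow[], dashboardUrl: string): string {
  const lines = [
    "| Eval | Scorer | Score | vs. base | Status |",
    "| --- | --- | --- | --- | --- |",
  ];
  for (const r of rows) {
    const delta = r.baseScore === undefined ? "n/a" : formatDelta(r.score - r.baseScore);
    const status =
      r.threshold === undefined ? "n/a" : r.score >= r.threshold ? "pass" : "fail";
    lines.push(`| ${r.evalName} | ${r.scorer} | ${r.score.toFixed(2)} | ${delta} | ${status} |`);
  }
  lines.push("", `[Full experiment](${dashboardUrl})`);
  return lines.join("\n");
}

const comment = renderComment(
  [
    { evalName: "Customer Support Bot", scorer: "helpfulness", score: 0.81, baseScore: 0.78, threshold: 0.75 },
    { evalName: "Customer Support Bot", scorer: "safety", score: 0.97, baseScore: 1.0, threshold: 1.0 },
  ],
  "https://example.com/experiments/123" // placeholder URL
);
console.log(comment);
```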

Best Practices

  1. Start with a smoke test: even a single eval in CI catches failures that would otherwise ship unnoticed
  2. Set conservative thresholds initially: you can tighten them as your evals mature
  3. Use failOnRegression: catching regressions is often more valuable than absolute thresholds
  4. Cache model responses in CI: use --no-send-logs during initial testing to reduce costs
  5. Run expensive evals nightly: put fast code-based evals in PR checks and full LLM-judge suites in nightly runs
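
One way to pick a conservative starting threshold (item 2 above) is to take the lowest score from a handful of recent runs and subtract a safety margin. The TypeScript sketch below shows that idea. The `suggestThreshold` helper and the default margin of 0.05 are illustrative assumptions, not part of AI Stack.

```typescript
// Suggests a starting threshold: the lowest recent score minus a safety margin,
// clamped at 0 and rounded to two decimals.
function suggestThreshold(recentScores: number[], margin = 0.05): number {
  if (recentScores.length === 0) {
    throw new Error("need at least one score to suggest a threshold");
  }
  const lowest = Math.min(...recentScores);
  const suggested = Math.max(0, lowest - margin);
  return Math.round(suggested * 100) / 100;
}

// Scores for one scorer across three recent runs (made-up numbers).
const relevanceThreshold = suggestThreshold([0.86, 0.9, 0.88]);
console.log(relevanceThreshold); // 0.81
```

Once the eval set stabilizes and run-to-run variance is better understood, the margin can be reduced to tighten the gate.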
