Scorers

Score AI outputs with code-based checks, LLM-as-Judge, and custom rubrics. Learn the tradeoffs and when to use each approach.

Scorers are functions that measure the quality of AI outputs. AI Stack supports three approaches, and the best eval suites combine all three.

Code-Based Scorers

Deterministic functions that check specific properties of the output. Fast, cheap, and reliable — use these wherever possible.

import { Eval } from "@aistack/evals";

Eval("Code-Based Scoring", {
  data: () => [
    { input: "What is 2+2?", expected: "4" },
    { input: "Capital of Japan?", expected: "Tokyo" },
  ],

  task: async (input) => callYourModel(input),

  scores: {
    // Substring match (case-insensitive): passes when the output contains the expected answer
    containsExpected: (output, expected) => {
      return output.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0;
    },

    // Check output is valid JSON
    validJson: (output) => {
      try {
        JSON.parse(output);
        return 1;
      } catch {
        return 0;
      }
    },

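    // Length within bounds (5 to 200 words)
    reasonableLength: (output) => {
      // trim + filter so empty or whitespace-only output counts as 0 words, not 1
      const words = output.trim().split(/\s+/).filter(Boolean).length;
      return words >= 5 && words <= 200 ? 1 : 0;
    },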

    // Regex pattern matching
    noPersonalInfo: (output) => {
      const emailPattern = /\b[\w.-]+@[\w.-]+\.\w+\b/;
      const phonePattern = /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/;
      return !emailPattern.test(output) && !phonePattern.test(output) ? 1 : 0;
    },
  },
});

Best for: Format validation, safety checks, factual recall, structured outputs.
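
As a further illustration of a structured-output check, here is a minimal sketch that gives partial credit for each required key found in a JSON output. The name `hasRequiredKeys` and the key list are hypothetical and not part of `@aistack/evals`; the sketch assumes a scorer may return any number between 0 and 1.

```typescript
// Hypothetical scorer: fraction of required top-level keys present in a JSON output.
// Returns 0 when the output is not valid JSON or is not a JSON object.
function hasRequiredKeys(output: string, requiredKeys: string[]): number {
  if (requiredKeys.length === 0) return 1;

  let parsed: unknown;
  try {
    parsed = JSON.parse(output);
  } catch {
    return 0;
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return 0;
  }

  // Own-property check so inherited names like "toString" never count as present.
  const obj = parsed as Record<string, unknown>;
  const present = requiredKeys.filter((key) =>
    Object.prototype.hasOwnProperty.call(obj, key),
  );
  return present.length / requiredKeys.length;
}

// Two of the three required keys are present, so this case gets partial credit.
console.log(hasRequiredKeys('{"title": "Hi", "tags": []}', ["title", "tags", "body"]));
```

Inside an `Eval`, a function like this could be wrapped as `(output) => hasRequiredKeys(output, [...])` so it matches the scorer shape used above.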

LLM-as-Judge

Use a language model to evaluate the output. More expensive and slower, but can assess subjective qualities like helpfulness, tone, and nuance.

import { Eval, LLMJudge } from "@aistack/evals";

Eval("LLM Judge Scoring", {
  data: () => [
    {
      input: "Explain machine learning to a 10-year-old.",
      expected: "A simple, accurate, engaging explanation.",
    },
  ],

  task: async (input) => callYourModel(input),

  scores: {
    // Built-in judge for helpfulness
    helpfulness: LLMJudge({
      model: "gpt-4o",
      prompt: `You are evaluating an AI response for helpfulness.

Input: {{input}}
Expected: {{expected}}
Output: {{output}}

Rate the helpfulness from 0.0 to 1.0 where:
- 0.0 = completely unhelpful or wrong
- 0.5 = partially helpful but missing key information
- 1.0 = very helpful and complete

Return only a JSON object: {"score": <number>, "reason": "<brief explanation>"}`,
    }),

    // Judge for age-appropriate language
    simplicity: LLMJudge({
      model: "gpt-4o",
      prompt: `Would a 10-year-old understand this explanation?

Explanation: {{output}}

Rate from 0.0 to 1.0 where:
- 0.0 = uses advanced jargon, too complex
- 0.5 = mostly understandable but some hard parts
- 1.0 = perfectly age-appropriate

Return only a JSON object: {"score": <number>, "reason": "<brief explanation>"}`,
    }),
  },
});

Best for: Subjective quality (helpfulness, tone, creativity), complex reasoning assessment, cases where "correct" is hard to define programmatically.
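
Both prompts above ask the judge to return `{"score": <number>, "reason": "<brief explanation>"}`. How `LLMJudge` handles that reply internally is not described here, so the following is only a sketch of how such a reply could be validated if you call a judge model yourself. `parseJudgeReply` and `JudgeVerdict` are hypothetical names.

```typescript
interface JudgeVerdict {
  score: number;
  reason: string;
}

// Hypothetical helper: extract {"score", "reason"} from a raw judge reply.
// Returns null when no usable verdict is found, so the caller can retry or flag the case.
function parseJudgeReply(reply: string): JudgeVerdict | null {
  // Judges sometimes wrap the JSON in prose or code fences; take the outermost braces.
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(reply.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const { score, reason } = parsed as { score?: unknown; reason?: unknown };
  if (typeof score !== "number" || !Number.isFinite(score)) return null;

  return {
    // Clamp to the 0.0 to 1.0 range that the rubric asks for.
    score: Math.min(1, Math.max(0, score)),
    reason: typeof reason === "string" ? reason : "",
  };
}
```

Returning `null` instead of a default score keeps malformed judge replies visible rather than silently counting them as 0.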

LLM-as-Judge Tips

  • Use the strongest model available for judging — GPT-4o or Claude 3.5 Sonnet work well
  • Be specific in your rubric — vague prompts lead to inconsistent scores
  • Include examples in the judge prompt that show what each score level looks like
  • Test your judge — run the same cases multiple times to check consistency
  • Watch costs — each judge call is an API call, so budget accordingly
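
The consistency tip can be sketched as a small harness that runs one judge on the same case several times and reports the spread. Everything here is an assumption for illustration: `judge` stands for any async function you supply that returns a score between 0 and 1, and `checkConsistency` and `summarizeScores` are hypothetical names, not AI Stack APIs.

```typescript
interface ScoreSpread {
  mean: number;
  stdDev: number; // population standard deviation
  min: number;
  max: number;
}

// Summarize repeated judge scores for a single test case.
function summarizeScores(scores: number[]): ScoreSpread {
  if (scores.length === 0) throw new Error("need at least one score");
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const variance =
    scores.reduce((sum, s) => sum + (s - mean) * (s - mean), 0) / scores.length;
  return {
    mean,
    stdDev: Math.sqrt(variance),
    min: Math.min(...scores),
    max: Math.max(...scores),
  };
}

// Hypothetical harness: call the same judge `runs` times on the same case.
async function checkConsistency(
  judge: () => Promise<number>,
  runs = 5,
): Promise<ScoreSpread> {
  const scores: number[] = [];
  for (let i = 0; i < runs; i++) {
    scores.push(await judge()); // sequential calls, one API request per run
  }
  return summarizeScores(scores);
}
```

A wide gap between `min` and `max` on an unchanged case suggests the rubric is too vague. Each run is another judge call, so repeating every case `runs` times multiplies judge cost by the same factor.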

Custom Rubrics

For domain-specific scoring, combine code-based checks with LLM judges and custom logic:

import { Eval, LLMJudge } from "@aistack/evals";

Eval("Medical Summary Eval", {
  data: () => loadMedicalDataset(),

  task: async (input) => callYourModel(input),

  scores: {
    // Code: Must not include disclaimers in certain contexts
    noBoilerplate: (output) => {
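      const boilerplate = ["I'm not a doctor", "consult a physician", "not medical advice"];
      // Lowercase both sides; otherwise "I'm not a doctor" can never match the lowercased output
      const lowered = output.toLowerCase();
      return boilerplate.some((b) => lowered.includes(b.toLowerCase())) ? 0 : 1;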
    },

    // Code: Check required sections are present
    hasRequiredSections: (output) => {
      const required = ["Diagnosis", "Treatment", "Follow-up"];
      const present = required.filter((s) => output.includes(s));
      return present.length / required.length;
    },

    // LLM: Assess medical accuracy (use domain expert prompt)
    accuracy: LLMJudge({
      model: "gpt-4o",
      prompt: `You are a medical documentation expert. Evaluate this summary for accuracy...

Summary: {{output}}

Score from 0.0 to 1.0. Return {"score": <number>, "reason": "<explanation>"}`,
    }),
  },
});
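
One possible form of "custom logic" is folding the individual scores into a single rubric score. The sketch below assumes you have already collected per-scorer results as a plain object; `combineScores`, `RubricConfig`, the weights, and the must-pass rule are all hypothetical choices, not something AI Stack prescribes.

```typescript
type Scores = Partial<Record<string, number>>;

interface RubricConfig {
  weights: Record<string, number>; // relative importance of each scorer
  mustPass: string[]; // scorers that force the overall score to 0 when they score 0
}

// Hypothetical aggregation: hard gates first, then a weighted average.
function combineScores(scores: Scores, rubric: RubricConfig): number {
  for (const gate of rubric.mustPass) {
    if ((scores[gate] ?? 0) === 0) return 0; // missing scores count as failures
  }

  let weighted = 0;
  let totalWeight = 0;
  for (const [scorer, weight] of Object.entries(rubric.weights)) {
    weighted += weight * (scores[scorer] ?? 0);
    totalWeight += weight;
  }
  return totalWeight > 0 ? weighted / totalWeight : 0;
}

// Example with made-up scores for the three scorers above:
// accuracy is weighted three times as heavily as section coverage.
const overall = combineScores(
  { noBoilerplate: 1, hasRequiredSections: 2 / 3, accuracy: 0.9 },
  {
    weights: { hasRequiredSections: 1, accuracy: 3 },
    mustPass: ["noBoilerplate"],
  },
);
console.log(overall);
```

Treating `noBoilerplate` as a gate rather than a weighted term reflects a judgment that some failures should not be averaged away; a different domain may prefer a plain weighted mean.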

Choosing Your Approach

Approach        Speed   Cost    Best For
Code-based      Fast    Free    Format, safety, factual checks
LLM-as-Judge    Slow    $$      Subjective quality, nuance
Custom rubric   Mixed   Mixed   Domain-specific requirements

Start with code-based scorers for everything you can check deterministically, then layer in LLM judges for subjective dimensions.
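
One way to apply this layering, sketched under assumptions: run the deterministic checks first and only pay for a judge call when all of them pass. Skipping the judge on a failed check is a cost-saving choice made for this example, not documented AI Stack behavior, and `layeredScore`, `CodeCheck`, and `Judge` are hypothetical names.

```typescript
type CodeCheck = (output: string) => number;
type Judge = (output: string) => Promise<number>;

interface LayeredResult {
  codeScores: Record<string, number>;
  judgeScore: number | null; // null means the judge call was skipped
}

// Hypothetical layering: cheap deterministic checks gate the expensive judge call.
async function layeredScore(
  output: string,
  checks: Record<string, CodeCheck>,
  judge: Judge,
): Promise<LayeredResult> {
  const codeScores: Record<string, number> = {};
  for (const [checkName, check] of Object.entries(checks)) {
    codeScores[checkName] = check(output);
  }

  // Only a perfect score on every code-based check earns a judge call.
  const allPassed = Object.values(codeScores).every((score) => score === 1);
  return {
    codeScores,
    judgeScore: allPassed ? await judge(output) : null,
  };
}
```

With this shape, outputs that fail format or safety checks never reach the judge, which keeps judge spend proportional to the number of outputs worth grading subjectively.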
