Score AI outputs with code-based checks, LLM-as-Judge, and custom rubrics. Learn the tradeoffs and when to use each approach.

Scorers

Scorers are functions that measure the quality of AI outputs. AI Stack supports three approaches, and the best eval suites combine all three.

Code-Based Scorers

Deterministic functions that check specific properties of the output. Fast, cheap, and reliable — use these wherever possible.

import { Eval } from "@aistack/evals";

Eval("Code-Based Scoring", {
  data: () => [
    { input: "What is 2+2?", expected: "4" },
    { input: "Capital of Japan?", expected: "Tokyo" },
  ],

  task: async (input) => callYourModel(input),

  scores: {
    // Case-insensitive containment: passes if the output contains the expected answer
    containsExpected: (output, expected) => {
      return output.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0;
    },

    // Check output is valid JSON
    validJson: (output) => {
      try {
        JSON.parse(output);
        return 1;
      } catch {
        return 0;
      }
    },

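    // Word count within bounds (5 to 200 words)
    reasonableLength: (output) => {
      // trim + filter so empty or whitespace-only output counts as 0 words, not 1
      const words = output.trim().split(/\s+/).filter(Boolean).length;
      return words >= 5 && words <= 200 ? 1 : 0;
    },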

    // Regex check: no email addresses or US-style phone numbers in the output
    noPersonalInfo: (output) => {
      const emailPattern = /\b[\w.-]+@[\w.-]+\.\w+\b/;
      const phonePattern = /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/;
      return !emailPattern.test(output) && !phonePattern.test(output) ? 1 : 0;
    },
  },
});

Best for: Format validation, safety checks, factual recall, structured outputs.
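
The matching scorer in the example above uses substring containment, which is more lenient than strict equality. The sketch below puts the two side by side so the tradeoff is visible. The helper names (`normalize`, `exactMatch`, `containsExpected`) and the sample strings are illustrative assumptions, not part of `@aistack/evals`.

```typescript
// Both scorers follow the (output, expected) => number shape used above.
const normalize = (s: string): string => s.trim().toLowerCase();

// Strict: 1 only when the whole output equals the expected answer.
const exactMatch = (output: string, expected: string): number =>
  normalize(output) === normalize(expected) ? 1 : 0;

// Lenient: 1 when the expected answer appears anywhere in the output.
const containsExpected = (output: string, expected: string): number =>
  normalize(output).includes(normalize(expected)) ? 1 : 0;

console.log(exactMatch("Tokyo", "tokyo"));                       // 1
console.log(exactMatch("The capital is Tokyo.", "Tokyo"));       // 0
console.log(containsExpected("The capital is Tokyo.", "Tokyo")); // 1
console.log(containsExpected("The answer is 14", "4"));          // 1 (false positive)
```

Containment tolerates models that answer in full sentences, but the last case shows how it can accept a wrong answer. Strict equality avoids that at the price of rejecting correct but wordy outputs.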

LLM-as-Judge

Use a language model to evaluate the output. More expensive and slower, but can assess subjective qualities like helpfulness, tone, and nuance.

import { Eval, LLMJudge } from "@aistack/evals";

Eval("LLM Judge Scoring", {
  data: () => [
    {
      input: "Explain machine learning to a 10-year-old.",
      expected: "A simple, accurate, engaging explanation.",
    },
  ],

  task: async (input) => callYourModel(input),

  scores: {
    // Judge for helpfulness (custom prompt)
    helpfulness: LLMJudge({
      model: "gpt-4o",
      prompt: `You are evaluating an AI response for helpfulness.

Input: {{input}}
Expected: {{expected}}
Output: {{output}}

Rate the helpfulness from 0.0 to 1.0 where:
- 0.0 = completely unhelpful or wrong
- 0.5 = partially helpful but missing key information
- 1.0 = very helpful and complete

Return only a JSON object: {"score": <number>, "reason": "<brief explanation>"}`,
    }),

    // Judge for age-appropriate language
    simplicity: LLMJudge({
      model: "gpt-4o",
      prompt: `Would a 10-year-old understand this explanation?

Explanation: {{output}}

Rate from 0.0 to 1.0 where:
- 0.0 = uses advanced jargon, too complex
- 0.5 = mostly understandable but some hard parts
- 1.0 = perfectly age-appropriate

Return only a JSON object: {"score": <number>, "reason": "<brief explanation>"}`,
    }),
  },
});

Best for: Subjective quality (helpfulness, tone, creativity), complex reasoning assessment, cases where "correct" is hard to define programmatically.
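
Both prompts above ask the judge to reply with a JSON object containing `score` and `reason`. How `LLMJudge` handles that reply internally is not covered here. If you call a judge model yourself, a defensive parser along these lines may help. `parseJudgeReply` and `JudgeVerdict` are hypothetical names introduced for this sketch.

```typescript
// Hypothetical shape for a parsed judge reply.
interface JudgeVerdict {
  score: number;
  reason: string;
}

// Returns a verdict with the score clamped to [0, 1], or null if the reply is unusable.
function parseJudgeReply(raw: string): JudgeVerdict | null {
  // Models sometimes add prose around the JSON; take the span from the first "{" to the last "}".
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) return null;
  try {
    const parsed = JSON.parse(match[0]);
    if (typeof parsed.score !== "number" || !Number.isFinite(parsed.score)) {
      return null;
    }
    const score = Math.min(1, Math.max(0, parsed.score));
    const reason = typeof parsed.reason === "string" ? parsed.reason : "";
    return { score, reason };
  } catch {
    return null; // the matched span was not valid JSON
  }
}

console.log(parseJudgeReply('{"score": 0.8, "reason": "clear"}'));
console.log(parseJudgeReply('My rating: {"score": 1.4, "reason": "great"}'));
console.log(parseJudgeReply("I cannot rate this."));
```

Returning `null` instead of a default score keeps parse failures visible, so a malformed reply is not silently counted as a 0 or a 1.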

LLM-as-Judge Tips

Use the strongest model available for judging — GPT-4o or Claude 3.5 Sonnet work well
Be specific in your rubric — vague prompts lead to inconsistent scores
Include examples in the judge prompt for the scores you expect
Test your judge — run the same cases multiple times to check consistency
Watch costs — each judge call is an API call, so budget accordingly
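
The "test your judge" tip can be sketched as a small harness that scores the same output several times and reports the spread. The `Judge` signature, `checkConsistency`, and `summarize` are assumptions made for this example, and the stub judge stands in for a real model call.

```typescript
// Hypothetical judge signature: takes an output string, resolves to a score in [0, 1].
type Judge = (output: string) => Promise<number>;

interface ConsistencyReport {
  scores: number[];
  mean: number;
  stdDev: number; // population standard deviation of the scores
}

// Pure helper: summarize a list of scores from repeated runs.
function summarize(scores: number[]): ConsistencyReport {
  if (scores.length === 0) throw new Error("Need at least one score");
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((acc, s) => acc + (s - mean) * (s - mean), 0) / scores.length;
  return { scores, mean, stdDev: Math.sqrt(variance) };
}

// Runs the judge sequentially (gentler on rate limits) and summarizes the results.
async function checkConsistency(
  judge: Judge,
  output: string,
  runs = 5
): Promise<ConsistencyReport> {
  const scores: number[] = [];
  for (let i = 0; i < runs; i++) {
    scores.push(await judge(output));
  }
  return summarize(scores);
}

// Stub judge for illustration; replace with a real model call.
const stubJudge: Judge = async () => 0.5;
checkConsistency(stubJudge, "Some model output", 3).then((report) =>
  console.log(report.mean, report.stdDev) // 0.5 0
);
```

A large standard deviation on identical input suggests the rubric is too vague. Keep in mind that each run is a separate API call, so a small number of runs on a handful of cases is usually enough to spot instability.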

Custom Rubrics

For domain-specific scoring, combine code-based checks with LLM judges and custom logic:

import { Eval, LLMJudge } from "@aistack/evals";

Eval("Medical Summary Eval", {
  data: () => loadMedicalDataset(),

  task: async (input) => callYourModel(input),

  scores: {
    // Code: Output must not contain boilerplate disclaimers
    noBoilerplate: (output) => {
      // Phrases are lowercase so they can match the lowercased output
      const boilerplate = ["i'm not a doctor", "consult a physician", "not medical advice"];
      const text = output.toLowerCase();
      return boilerplate.some((b) => text.includes(b)) ? 0 : 1;
    },

    // Code: Check required sections are present
    hasRequiredSections: (output) => {
      const required = ["Diagnosis", "Treatment", "Follow-up"];
      const present = required.filter((s) => output.includes(s));
      return present.length / required.length;
    },

    // LLM: Assess medical accuracy (use domain expert prompt)
    accuracy: LLMJudge({
      model: "gpt-4o",
      prompt: `You are a medical documentation expert. Evaluate this summary for accuracy...
      
{{output}}

Score from 0.0 to 1.0. Return {"score": <number>, "reason": "<explanation>"}`,
    }),
  },
});
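
When a rubric produces several scores, you may want a single headline number. How AI Stack aggregates scores is not described here, so the following is only a sketch of one option: a weighted average computed outside the eval. `weightedOverall` and the weights are hypothetical and chosen purely for illustration.

```typescript
type ScoreMap = Record<string, number>;

// Weighted average of named scores; every weighted name must have a score.
function weightedOverall(scores: ScoreMap, weights: ScoreMap): number {
  let total = 0;
  let weightSum = 0;
  for (const [name, weight] of Object.entries(weights)) {
    const score = scores[name];
    if (score === undefined) throw new Error(`Missing score: ${name}`);
    total += score * weight;
    weightSum += weight;
  }
  if (weightSum === 0) throw new Error("Weights must not sum to zero");
  return total / weightSum;
}

// Illustrative weights: accuracy counts twice as much as each code-based check.
const weights: ScoreMap = { noBoilerplate: 1, hasRequiredSections: 1, accuracy: 2 };
const scores: ScoreMap = { noBoilerplate: 1, hasRequiredSections: 0.5, accuracy: 0.5 };

console.log(weightedOverall(scores, weights)); // 0.625
```

A weighted average can hide a failing safety check behind strong scores elsewhere, so it is worth reporting the individual scores alongside any aggregate.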

Choosing Your Approach

Approach	Speed	Cost	Best For
Code-based	Fast	Negligible (no API calls)	Format, safety, factual checks
LLM-as-Judge	Slow	One API call per score	Subjective quality, nuance
Custom rubric	Mixed	Mixed	Domain-specific requirements

Start with code-based scorers for everything you can check deterministically, then layer in LLM judges for subjective dimensions.
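
One way to layer the two approaches is to gate the judge behind the cheap checks, so the model is only called for outputs that already pass them. This is a sketch under assumed types. `gatedScorer`, `CheapCheck`, and `Judge` are hypothetical names, and the stub judge stands in for a real model call.

```typescript
// Hypothetical shapes: a cheap check returns pass/fail, a judge resolves to a score in [0, 1].
type CheapCheck = (output: string) => boolean;
type Judge = (output: string) => Promise<number>;

// Wraps a judge so it only runs when every deterministic check passes.
function gatedScorer(checks: CheapCheck[], judge: Judge) {
  return async (output: string): Promise<number> => {
    const passed = checks.every((check) => check(output));
    if (!passed) return 0; // fail fast, no judge call made
    return judge(output);
  };
}

// Example checks, reusing ideas from the code-based scorers.
const isNonEmpty: CheapCheck = (output) => output.trim().length > 0;
const hasNoEmail: CheapCheck = (output) =>
  !/\b[\w.-]+@[\w.-]+\.\w+\b/.test(output);

// Stub judge for illustration; replace with a real model call.
const stubJudge: Judge = async () => 0.9;

const score = gatedScorer([isNonEmpty, hasNoEmail], stubJudge);
score("Contact me at jane@example.com").then((s) => console.log(s)); // 0
score("Tokyo is the capital of Japan.").then((s) => console.log(s)); // 0.9
```

Gating trades information for cost: a gated output gets 0 without a judge verdict, which is reasonable when the cheap checks are hard requirements and less so when they are only preferences.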

What's Next?

Datasets — Build and manage your test data
CI/CD Integration — Automate scoring in your pipeline
