Regrada Documentation

CI gate for LLM behavior — record real model traffic, turn it into test cases, and block regressions in CI.

> Records LLM API calls via an HTTP proxy (regrada record)

> Converts recorded traces into portable YAML cases + baseline snapshots (regrada accept)

> Runs cases repeatedly, diffs vs baselines, and enforces configurable policies (regrada test)

> Produces CI-friendly reports (stdout summary, Markdown, JUnit) and a GitHub Action

> Syncs results and traces to the Regrada dashboard for centralized visibility

Installation

macOS / Linux

curl -fsSL https://downloads.regrada.com/install.sh | sh
wget -qO- https://downloads.regrada.com/install.sh | sh
regrada version

The installer downloads a prebuilt binary and installs it to /usr/local/bin/regrada when that directory is writable. Otherwise it falls back to ~/.local/bin/regrada. If regrada isn't found, add the printed install directory to your PATH.
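If the installer fell back to ~/.local/bin, a POSIX-safe way to put that directory on your PATH (add the snippet to your shell profile to make it permanent):

```shell
# Prepend ~/.local/bin to PATH only if it is not already present
case ":$PATH:" in
  *":$HOME/.local/bin:"*) ;;                  # already on PATH, nothing to do
  *) export PATH="$HOME/.local/bin:$PATH" ;;  # prepend the install directory
esac
```

After this, `regrada version` should resolve from any directory.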

curl -fsSL https://downloads.regrada.com/install.sh | sudo env REGRADA_INSTALL_DIR=/usr/local/bin sh

Windows

The installer targets macOS/Linux. On Windows, run Regrada via WSL.

Build from source

mkdir -p bin
go build -o ./bin/regrada .
./bin/regrada version

Start Here

Pick the path that matches where you are. The lowest-friction path is to validate the CLI locally with the mock provider, then switch to a real provider or live traffic once the workflow feels good.

1. Fastest Smoke Test

Best first run. No API key. No live traffic. Just prove the install, generated files, baselines, and reports work on your machine.

regrada init --non-interactive
regrada baseline
regrada test

The generated config keeps the mock provider and uses local baselines, so the first run works without an API key or a baseline branch.
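For reference, a config along these lines is what the non-interactive flow should leave you with; the exact fields here are assumptions pieced together from the configuration examples later in this document:

```yaml
version: 1

providers:
  default: mock   # no API key needed

baseline:
  mode: local     # snapshots stay on your machine
```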

2. Run Real Evals

Use this when you already know you want real model responses, not a mock smoke test.

export OPENAI_API_KEY="..."

Edit regrada.yml:

providers:
  default: openai
  openai:
    model: gpt-4o-mini

regrada baseline
regrada test --explain

Keep local baselines while iterating. Switch to baseline.mode: git once snapshots belong on your baseline branch and in CI.

3. Capture an Existing App

Start here if your app already makes LLM calls and you want to turn real traffic into test cases with minimal changes.

regrada ca init
regrada ca install
regrada record -- npm test
regrada accept
regrada test

Regrada injects proxy environment variables for the wrapped command, records the captured session, and preserves the wrapped command's exit code.
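The exact variables Regrada injects are not listed here, but the conventional HTTP proxy variables look like this if you ever need to point a process at the recorder by hand (the address is an assumption; match your capture.proxy.listen setting):

```shell
# Route a process through a local recording proxy via the conventional env vars.
# 127.0.0.1:8080 is an assumption matching the capture.proxy.listen default below.
export HTTP_PROXY="http://127.0.0.1:8080"
export HTTPS_PROXY="http://127.0.0.1:8080"
```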

Recommended rollout

Validate the workflow with the mock provider first. Then move to your real provider. Then promote baselines to git mode once the snapshots are worth reviewing and protecting in CI.

Core Concepts

Cases

A case is a YAML file (default: regrada/cases/**/*.yml) containing a prompt (chat messages or structured input) plus optional assertions.

Assertions vs Policies

  • Case assertions (assert: in a case file) mark individual runs as pass/fail and feed metrics like pass_rate.
  • Policies (policies: in regrada.yml) decide what counts as a warning or error in CI.

To fail CI on failed assertions, add an assertions policy with severity: error.

Baselines

A baseline is a stored snapshot (golden output + aggregate metrics) used for regression checks.

Regrada stores baselines under the snapshot directory (default: .regrada/snapshots/), keyed by:

  • Case ID
  • Provider + model
  • Sampling params (temperature / top_p / max tokens / stop)
  • System prompt content

Changing any of these produces a different baseline key and requires regenerating the snapshot.
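To make the keying concrete, a hypothetical implementation might hash all four ingredients together. This is an illustrative sketch, not Regrada's actual keying code:

```python
import hashlib
import json

def baseline_key(case_id, provider, model, params, system_prompt):
    """Illustrative: derive a stable key from everything that defines a baseline."""
    material = json.dumps(
        {
            "case": case_id,
            "provider": provider,
            "model": model,
            "params": params,   # temperature / top_p / max tokens / stop
            "system": system_prompt,
        },
        sort_keys=True,
    )
    return hashlib.sha256(material.encode()).hexdigest()[:16]

k1 = baseline_key("greeting.hello", "openai", "gpt-4o-mini",
                  {"temperature": 0.2}, "You are a concise assistant.")
k2 = baseline_key("greeting.hello", "openai", "gpt-4o-mini",
                  {"temperature": 0.7}, "You are a concise assistant.")
assert k1 != k2  # changing a sampling param produces a different key
```

The point the sketch illustrates: any edit to the case, model, sampling params, or system prompt yields a new key, which is why those edits require regenerating snapshots.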

CLI Commands

These are the commands most users touch repeatedly. If you only remember five, remember init, record, accept, baseline, and test.

regrada init

Creates regrada.yml, an example case, and runtime directories.

regrada init

Use --non-interactive for a fast mock-provider setup, or the default interactive flow if you want help choosing provider and CI defaults.

Flags: --path, --force, --non-interactive

regrada record

Starts an HTTP proxy to capture LLM traffic. Defaults to forward proxy with HTTPS MITM. When a subcommand is provided, the proxy exits when that process finishes.

regrada record
regrada record -- python app.py
regrada record -- npm test

For forward proxy mode, run regrada ca init and regrada ca install once on the machine first.

Traces written to .regrada/traces/ (JSONL). Sessions written to .regrada/sessions/. Auto-detects provider by host (OpenAI, Anthropic, Azure, Bedrock).
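Because traces are JSONL (one JSON object per line), they can be inspected with a few lines of Python. This sketch uses a synthetic record; the field names are assumptions, so check a real trace file for the actual schema:

```python
import json
import tempfile
from pathlib import Path

def read_traces(path):
    """Yield one parsed trace per JSONL line, skipping blank lines."""
    for line in Path(path).read_text().splitlines():
        if line.strip():
            yield json.loads(line)

# Synthetic example; real traces live under .regrada/traces/
trace_file = Path(tempfile.mkstemp(suffix=".jsonl")[1])
trace_file.write_text(
    '{"provider": "openai", "model": "gpt-4o-mini", "latency_ms": 412}\n'
)
traces = list(read_traces(trace_file))
```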

regrada accept

Converts traces from the latest (or specified) session into cases and baselines.

regrada accept
regrada accept --session .regrada/sessions/20250101-120000.json

regrada baseline

Runs all discovered cases once and writes baseline snapshots to the local snapshot directory.

regrada baseline

Use this before regrada test in local mode, or when refreshing the snapshots you commit on your baseline branch.

Supports --output text|json so scripts can consume the written case IDs and snapshot directory directly.

regrada test / regrada check

Runs cases, diffs against baselines, evaluates policies, and writes reports. regrada check is an alias.

regrada test
regrada test --concurrency 4

Baseline behavior follows baseline.mode: in local mode Regrada compares against local snapshots, while in git mode it compares against the configured git ref (default: origin/main).

Flag --concurrency N controls parallel case execution (default: 1).

regrada ca

Manages the local Root CA required for forward-proxy HTTPS interception.

regrada ca init
regrada ca install
regrada ca status
regrada ca uninstall

regrada migrate

Compares two model targets across the same test suite so you can judge a model migration before you flip traffic.

regrada migrate --from openai/gpt-4o-mini --to anthropic/claude-3-5-sonnet-20241022
regrada migrate --cases support.refund,safety.pii --output json

Outputs Markdown by default, JSON when requested, and exits non-zero if regressions are detected.

regrada fuzz

Generates adversarial mutations for your cases and measures whether the model still behaves acceptably under those variants.

regrada fuzz --case safety.pii_redaction
regrada fuzz --categories prompt_injection,jailbreak --threshold 0.9

Returns exit code 2 when any case falls below the robustness threshold. Use Markdown for human review or JSON for automation.

Configuration (regrada.yml)

Minimal working config:

version: 1

providers:
  default: openai
  openai:
    model: gpt-4o-mini

baseline:
  mode: local

policies:
  - id: assertions
    severity: error
    check:
      type: assertions
      min_pass_rate: 1.0

Case Discovery

Defaults (can be overridden under cases:):

  • Roots: ["regrada/cases"]
  • Include globs: ["**/*.yml", "**/*.yaml"]
  • Exclude globs: ["**/README.*"]

Execution Mode

execution:
  mode: replay  # replay (default) or live

  • replay — compare against baselines (requires snapshots to exist)
  • live — run without baselines; enforce invariants and checks only

Case Defaults

cases:
  defaults:
    runs: 3
    timeout_ms: 30000
    concurrency: 8

Baseline Modes

Git baseline config (recommended for CI):

baseline:
  mode: git
  git:
    ref: origin/main
    snapshot_dir: .regrada/snapshots

Reports

Enable JUnit output for CI:

report:
  format: [summary, markdown, junit]
  junit:
    path: .regrada/junit.xml

CI Behavior

By default, Regrada fails on any severity: error violation. To also fail on warnings:

ci:
  fail_on:
    - severity: error
    - severity: warn

Providers

Four real providers are supported (OpenAI, Anthropic, Azure OpenAI, and AWS Bedrock), plus a mock provider for offline testing.

OpenAI

providers:
  default: openai
  openai:
    model: gpt-4o-mini
    api_key_env: OPENAI_API_KEY  # default

Credential resolution: api_key_env → api_key → OPENAI_API_KEY

Anthropic

providers:
  default: anthropic
  anthropic:
    model: claude-3-5-sonnet-20241022
    api_key_env: ANTHROPIC_API_KEY  # default

Credential resolution: api_key_env → api_key → ANTHROPIC_API_KEY

Azure OpenAI

providers:
  default: azure_openai
  azure_openai:
    endpoint: https://my-resource.openai.azure.com
    deployment: gpt-4o-mini
    api_version: 2024-02-15-preview  # default
    api_key_env: AZURE_OPENAI_API_KEY

The deployment field is required. The endpoint URL is also resolved from AZURE_OPENAI_ENDPOINT if not set inline.

Calls {endpoint}/openai/deployments/{deployment}/chat/completions?api-version={api_version}

AWS Bedrock

providers:
  default: bedrock
  bedrock:
    model_id: anthropic.claude-3-5-sonnet-20241022-v2:0
    region: us-east-1
    # Optional: explicit credentials
    access_key_env: AWS_ACCESS_KEY_ID
    secret_key_env: AWS_SECRET_ACCESS_KEY

Uses the Bedrock Converse API. If explicit credentials are not set, it falls back to the AWS SDK default credential chain (instance profile, environment, shared credentials file). Region resolves from region → AWS_REGION → AWS_DEFAULT_REGION.

Mock

providers:
  default: mock

Returns a fixed "mock response" string. Useful for wiring up tests without real API calls.

Case Format

Example test case (regrada/cases/**/*.yml):

id: greeting.hello
tags: [smoke]

request:
  messages:
    - role: system
      content: You are a concise assistant.
    - role: user
      content: Say hello and ask for a name.
  params:
    temperature: 0.2
    top_p: 1.0
    max_output_tokens: 256

assert:
  text:
    contains: ["hello"]
    not_contains: ["error"]
    max_chars: 120
  metrics:
    max_latency_ms: 5000

> request must specify either messages or input (a YAML map)

> Roles must be system, user, assistant, or tool

> assert.json.schema and assert.json.path are parsed/validated but not enforced yet by the runner
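As an illustration of what the assert.text block above expresses, a runner could evaluate it roughly like this. This is a sketch only; whether Regrada matches case-insensitively is not specified here, so this version lowercases everything:

```python
def check_text(output, contains=(), not_contains=(), max_chars=None):
    """Illustrative evaluation of an assert.text block: substring checks
    (case-insensitive here, which is an assumption) plus a length limit."""
    lowered = output.lower()
    if any(p.lower() not in lowered for p in contains):
        return False
    if any(p.lower() in lowered for p in not_contains):
        return False
    if max_chars is not None and len(output) > max_chars:
        return False
    return True

ok = check_text("Hello! What's your name?",
                contains=["hello"], not_contains=["error"], max_chars=120)
```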

Policies

Policies turn runs/diffs into CI gates. Common setup:

policies:
  - id: assertions
    severity: error
    check:
      type: assertions
      min_pass_rate: 1.0

  - id: no_pii
    severity: error
    check:
      type: pii_leak
      detector: pii_strict
      max_incidents: 0

  - id: stable_text
    severity: warn
    check:
      type: variance
      metric: token_jaccard
      max_p95: 0.35

  - id: fast_responses
    severity: warn
    check:
      type: latency
      p95_ms:
        max: 5000

Policy Scoping

Scope policies to a subset of cases by tags, IDs, or providers:

policies:
  - id: smoke_assertions
    severity: error
    scope:
      tags: [smoke]
    check:
      type: assertions
      min_pass_rate: 1.0

Supported Policy Types

assertions — validates case-level assertion pass rate

Required: min_pass_rate

json_valid — ensures model output is valid JSON

Optional: min_pass_rate (default: 1.0)

text_contains — required phrase matching

Required: phrases. Optional: min_pass_rate

text_not_contains — negative phrase matching

Required: phrases. Optional: max_incidents (default: 0)

pii_leak — detects PII in model output

Required: detector. Optional: max_incidents

variance — controls output stability via token Jaccard similarity

Required: metric, max_p95
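Token Jaccard treats each output as a set of tokens and measures overlap. A sketch of the raw metric, assuming whitespace tokenization (the document does not say whether the policy bounds similarity or its complement):

```python
def token_jaccard(a, b):
    """Jaccard similarity of the whitespace-token sets of two strings:
    1.0 means identical token sets, 0.0 means no overlap."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

sim = token_jaccard("hello there friend", "hello there pal")  # 2 shared of 4 total
```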

refusal_rate — monitors how often the model refuses to respond

Required: max and/or max_delta

latency — enforces P95 latency thresholds

Required: p95_ms.max and/or p95_ms.max_delta
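P95 is the 95th-percentile latency across runs. A nearest-rank sketch (Regrada may interpolate differently):

```python
import math

def p95(latencies_ms):
    """Nearest-rank 95th percentile: the smallest sample such that at least
    95% of all samples are at or below it."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

value = p95([120, 130, 150, 900, 140, 160, 170, 180, 125, 135])
```

With ten samples the rank is ceil(9.5) = 10, so a single slow outlier (900 ms here) becomes the P95 and would trip a p95_ms.max of 5000 only if it exceeded that bound.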

json_schema — schema validation (scaffolded, not implemented yet)

Recording Workflow

Forward Proxy (Recommended)

1. Generate and trust the local CA:

regrada ca init
regrada ca install

2. Configure the proxy in regrada.yml:

capture:
  enabled: true
  proxy:
    mode: forward
    listen: 127.0.0.1:8080
    allow_hosts:
      - api.openai.com
      - api.anthropic.com
  redact:
    enabled: true
    presets: [pii_basic, secrets]

3. Run your app/tests through the proxy:

regrada record -- ./run-my-tests.sh

4. Convert the latest session into cases + baselines:

regrada accept

Reverse Proxy (No MITM)

Set capture.proxy.mode: reverse and configure upstream URLs. Your application must point its LLM base URL at the proxy instead of the real API.

capture:
  proxy:
    mode: reverse
    listen: 127.0.0.1:4141
    upstream:
      openai_base_url: https://api.openai.com
      anthropic_base_url: https://api.anthropic.com
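For example, recent OpenAI SDKs read the base URL from an environment variable, so repointing them at the reverse proxy can be done without code changes (OPENAI_BASE_URL is the SDK's convention, not a Regrada setting; verify it for your client, and match the port to capture.proxy.listen):

```shell
# Route OpenAI SDK traffic through the reverse proxy instead of api.openai.com.
export OPENAI_BASE_URL="http://127.0.0.1:4141"
```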

Baselines in Git (Recommended for CI)

1. Version-control your snapshot directory

By default, regrada init adds .regrada/ to .gitignore. Un-ignore the snapshots directory:

.regrada/*
!.regrada/snapshots/
!.regrada/snapshots/**

2. Generate and commit snapshots on your baseline branch

regrada baseline
git add .regrada/snapshots regrada/cases regrada.yml
git commit -m "Update Regrada baselines"

3. In PR branches/CI, run tests with git mode

Set baseline.mode: git and baseline.git.ref: origin/main.

baseline:
  mode: git
  git:
    ref: origin/main
    snapshot_dir: .regrada/snapshots

Dashboard Integration

Regrada can sync traces and test results to the Regrada web dashboard, giving you a centralized view of LLM behavior across branches and over time.

Setup

  1. Create an API key in the dashboard under Settings → API Keys
  2. Add the key to your environment (e.g., REGRADA_API_KEY)
  3. Enable backend upload in regrada.yml

project:
  name: my-project       # required: the dashboard project to sync to

backend:
  enabled: true
  api_key_env: REGRADA_API_KEY
  upload:
    traces: true         # upload traces captured during recording
    test_results: true   # upload test run results after regrada test

What Gets Synced

Traces (regrada record):

  • Trace ID, timestamp, provider, model
  • Full request (messages + params) and response
  • Token counts and latency
  • Git SHA and branch at time of recording

Test runs (regrada test):

  • Run ID, timestamp, git SHA, branch, commit message
  • CI provider (GitHub Actions, CircleCI, Jenkins)
  • Per-case results: pass rate, P95 latency
  • Diff vs baseline: metric deltas
  • All policy violations with severity and evidence

API Endpoints

The CLI communicates with https://api.regrada.com (override with REGRADA_API_URL):

  • POST /v1/ingest/traces/batch — Upload a batch of captured traces
  • POST /v1/ingest/test-runs — Upload a test run result

All requests use Authorization: Bearer <api_key>. Trace uploads during recording are non-blocking (async, background goroutine).
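If you need to exercise the ingest API outside the CLI, the request shape can be assembled with the standard library. Only the endpoint path and bearer-auth scheme come from this document; the payload field names are assumptions:

```python
import json
import os
import urllib.request

def build_trace_upload(traces, api_url="https://api.regrada.com",
                       api_key_env="REGRADA_API_KEY"):
    """Construct (but do not send) a batch trace upload request.
    The {"traces": [...]} body shape is an assumption."""
    body = json.dumps({"traces": traces}).encode()
    return urllib.request.Request(
        f"{api_url}/v1/ingest/traces/batch",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {os.environ.get(api_key_env, '')}",
            "Content-Type": "application/json",
        },
    )

req = build_trace_upload([{"trace_id": "t1"}])
```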

GitHub Action

Example workflow configuration:

name: Regrada
on:
  pull_request:

jobs:
  regrada:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # required for baseline.mode=git

      - uses: regrada-ai/regrada@v1
        with:
          config: regrada.yml
          comment-on-pr: true
          working-directory: .
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          REGRADA_API_KEY: ${{ secrets.REGRADA_API_KEY }}

Action Inputs

  • config — Path to regrada.yml/regrada.yaml (default: regrada.yml)
  • comment-on-pr — Post .regrada/report.md as a PR comment (default: true)
  • working-directory — Directory to run regrada test in (default: .)

Action Outputs

total — Total number of cases

passed — Number of passed cases

warned — Number of warned cases

failed — Number of failed cases

result — success, warning, or failure

Exit Codes

regrada test uses exit codes to help CI distinguish failure modes:

0 — No failing policy violations

1 — Internal error (provider / report / etc.)

2 — Policy violations (as configured by ci.fail_on)

3 — Invalid config / no cases discovered

4 — Missing baseline snapshot

5 — Evaluation error (provider call failed, timeout, etc.)
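A CI script can branch on these codes explicitly. A sketch using a helper function (the messages are illustrative; the code meanings come from the list above):

```shell
# Map a regrada test exit code to a CI-friendly message.
handle_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "policy violations" ;;
    4) echo "baseline missing" ;;
    *) echo "error (exit $1)" ;;
  esac
}

# In CI you would capture the real exit status, e.g.:
#   regrada test; msg=$(handle_exit $?)
msg=$(handle_exit 2)
echo "$msg"
```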

Troubleshooting

"config not found"

Create regrada.yml by running regrada init or pass --config to specify a different path.

Exit code 4 / baseline missing

Run regrada baseline on your baseline ref and commit snapshots. Ensure CI fetches baseline.git.ref with fetch-depth: 0.

Provider auth errors

  • OpenAI: set OPENAI_API_KEY or configure providers.openai.api_key
  • Anthropic: set ANTHROPIC_API_KEY or configure providers.anthropic.api_key
  • Azure OpenAI: set AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT
  • Bedrock: ensure AWS_REGION is set and valid credentials are available via the AWS credential chain

Recording HTTPS fails

Run regrada ca init + regrada ca install, and confirm capture.proxy.allow_hosts includes your provider host (e.g., api.openai.com, api.anthropic.com).

Dashboard upload fails

Verify REGRADA_API_KEY is set and that project.name is configured in regrada.yml. Check that backend.enabled: true and the relevant upload flags are set.