Regrada Documentation
CI gate for LLM behavior — record real model traffic, turn it into test cases, and block regressions in CI.
> Records LLM API calls via an HTTP proxy (regrada record)
> Converts recorded traces into portable YAML cases + baseline snapshots (regrada accept)
> Runs cases repeatedly, diffs vs baselines, and enforces configurable policies (regrada test)
> Produces CI-friendly reports (stdout summary, Markdown, JUnit) and a GitHub Action
> Syncs results and traces to the Regrada dashboard for centralized visibility
Installation
macOS / Linux
```shell
curl -fsSL https://downloads.regrada.com/install.sh | sh
```

Or with wget:

```shell
wget -qO- https://downloads.regrada.com/install.sh | sh
```

Verify the install:

```shell
regrada version
```

The installer downloads a prebuilt binary and installs it to /usr/local/bin/regrada when that directory is writable. Otherwise it falls back to ~/.local/bin/regrada. If regrada isn't found, add the printed install directory to your PATH.

To force a system-wide install:

```shell
curl -fsSL https://downloads.regrada.com/install.sh | sudo env REGRADA_INSTALL_DIR=/usr/local/bin sh
```

Windows
The installer targets macOS/Linux. On Windows, run Regrada via WSL.
Build from source
```shell
mkdir -p bin
go build -o ./bin/regrada .
./bin/regrada version
```

Start Here
Pick the path that matches where you are. The lowest-friction path is to validate the CLI locally with the mock provider, then switch to a real provider or live traffic once the workflow feels good.
1. Fastest Smoke Test
Best first run. No API key. No live traffic. Just prove the install, generated files, baselines, and reports work on your machine.
```shell
regrada init --non-interactive
regrada baseline
regrada test
```

The generated config keeps the mock provider and uses local baselines, so the first run works without an API key or a baseline branch.
2. Run Real Evals
Use this when you already know you want real model responses, not a mock smoke test.
```shell
export OPENAI_API_KEY="..."
```

Edit regrada.yml:

```yaml
providers:
  default: openai
  openai:
    model: gpt-4o-mini
```

Then regenerate baselines and run the suite:

```shell
regrada baseline
regrada test --explain
```

Keep local baselines while iterating. Switch to baseline.mode: git once snapshots belong on your baseline branch and in CI.
3. Capture an Existing App
Start here if your app already makes LLM calls and you want to turn real traffic into test cases with minimal changes.
```shell
regrada ca init
regrada ca install
regrada record -- npm test
regrada accept
regrada test
```

Regrada injects proxy environment variables for the wrapped command, records the captured session, and preserves the wrapped command's exit code.
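Conceptually, the wrapping behaves like the sketch below. This is illustrative only: the exact environment variable names are an assumption, and the proxy address is taken from the example config later in this document.

```shell
# Illustrative sketch only: regrada record handles this internally.
record_wrap() {
  # Inject proxy env vars for the wrapped command, then propagate its exit code.
  HTTPS_PROXY=http://127.0.0.1:8080 HTTP_PROXY=http://127.0.0.1:8080 "$@"
}

record_wrap true && echo "wrapped command succeeded"
```

Because the function simply returns the wrapped command's status, a failing test suite still fails the recording run, which is what keeps `regrada record -- npm test` safe to drop into existing CI steps.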
Recommended rollout
Validate the workflow with the mock provider first. Then move to your real provider. Then promote baselines to git mode once the snapshots are worth reviewing and protecting in CI.
Core Concepts
Cases
A case is a YAML file (default: regrada/cases/**/*.yml) containing a prompt (chat messages or structured input) plus optional assertions.
Assertions vs Policies
- Case assertions (assert: in a case file) mark individual runs as pass/fail and feed metrics like pass_rate.
- Policies (policies: in regrada.yml) decide what counts as a warning or error in CI.
To fail CI on failed assertions, add an assertions policy with severity: error.
Baselines
A baseline is a stored snapshot (golden output + aggregate metrics) used for regression checks.
Regrada stores baselines under the snapshot directory (default: .regrada/snapshots/), keyed by:
- Case ID
- Provider + model
- Sampling params (temperature / top_p / max tokens / stop)
- System prompt content
Changing any of these produces a different baseline key and requires regenerating the snapshot.
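As a purely illustrative sketch (Regrada's real key derivation is internal and not documented here), you can think of the key as a function of those four fields, so changing any one of them yields a different key:

```shell
# Hypothetical illustration of baseline keying; not Regrada's actual algorithm.
baseline_key() {
  # case_id | provider/model | sampling params | system prompt
  printf '%s|%s|%s|%s' "$1" "$2" "$3" "$4"
}

a=$(baseline_key greeting.hello openai/gpt-4o-mini "temperature=0.2" "You are a concise assistant.")
b=$(baseline_key greeting.hello openai/gpt-4o-mini "temperature=0.7" "You are a concise assistant.")
[ "$a" != "$b" ] && echo "sampling change -> different key -> snapshot must be regenerated"
```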
CLI Commands
These are the commands most users touch repeatedly. If you only remember five, remember init, record, accept, baseline, and test.
regrada init
Creates regrada.yml, an example case, and runtime directories.
```shell
regrada init
```

Use --non-interactive for a fast mock-provider setup, or the default interactive flow if you want help choosing provider and CI defaults.
Flags: --path, --force, --non-interactive
regrada record
Starts an HTTP proxy to capture LLM traffic. Defaults to forward proxy with HTTPS MITM. When a subcommand is provided, the proxy exits when that process finishes.
```shell
regrada record
regrada record -- python app.py
regrada record -- npm test
```

For forward proxy mode, run regrada ca init and regrada ca install once on the machine first.
Traces are written to .regrada/traces/ (JSONL) and sessions to .regrada/sessions/. The provider is auto-detected by host (OpenAI, Anthropic, Azure, Bedrock).
regrada accept
Converts traces from the latest (or specified) session into cases and baselines.
```shell
regrada accept
regrada accept --session .regrada/sessions/20250101-120000.json
```

regrada baseline
Runs all discovered cases once and writes baseline snapshots to the local snapshot directory.
```shell
regrada baseline
```

Use this before regrada test in local mode, or when refreshing the snapshots you commit on your baseline branch.
Supports --output text|json so scripts can consume the written case IDs and snapshot directory directly.
regrada test / regrada check
Runs cases, diffs against baselines, evaluates policies, and writes reports. regrada check is an alias.
```shell
regrada test
regrada test --concurrency 4
```

Baseline behavior comes from baseline.mode: in local mode, Regrada compares against local snapshots; in git mode, against the configured git ref (default: origin/main).
Flag --concurrency N controls parallel case execution (default: 1).
regrada ca
Manages the local Root CA required for forward-proxy HTTPS interception.
```shell
regrada ca init
regrada ca install
regrada ca status
regrada ca uninstall
```

regrada migrate
Compares two model targets across the same test suite so you can judge a model migration before you flip traffic.
```shell
regrada migrate --from openai/gpt-4o-mini --to anthropic/claude-3-5-sonnet-20241022
regrada migrate --cases support.refund,safety.pii --output json
```

Outputs Markdown by default, JSON when requested, and exits non-zero if regressions are detected.
regrada fuzz
Generates adversarial mutations for your cases and measures whether the model still behaves acceptably under those variants.
```shell
regrada fuzz --case safety.pii_redaction
regrada fuzz --categories prompt_injection,jailbreak --threshold 0.9
```

Returns exit code 2 when any case falls below the robustness threshold. Use Markdown for human review or JSON for automation.
Configuration (regrada.yml)
Minimal working config:
```yaml
version: 1
providers:
  default: openai
  openai:
    model: gpt-4o-mini
baseline:
  mode: local
policies:
  - id: assertions
    severity: error
    check:
      type: assertions
      min_pass_rate: 1.0
```

Case Discovery
Defaults (can be overridden under cases:):
- Roots: ["regrada/cases"]
- Include globs: ["**/*.yml", "**/*.yaml"]
- Exclude globs: ["**/README.*"]
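These defaults are roughly equivalent to the following standard-tools approximation (illustration only; Regrada applies the globs itself from the cases: config):

```shell
# Approximate the default case discovery rules with find (illustration only).
dir=$(mktemp -d)
mkdir -p "$dir/regrada/cases/support"
printf 'id: support.refund\n' > "$dir/regrada/cases/support/refund.yml"
printf 'placeholder\n' > "$dir/regrada/cases/README.yaml"  # excluded by README.*

find "$dir/regrada/cases" -type f \
  \( -name '*.yml' -o -name '*.yaml' \) ! -name 'README.*'
```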
Execution Mode
```yaml
execution:
  mode: replay # replay (default) or live
```

- replay — compare against baselines (requires snapshots to exist)
- live — run without baselines; enforce invariants and checks only
Case Defaults
```yaml
cases:
  defaults:
    runs: 3
    timeout_ms: 30000
    concurrency: 8
```

Baseline Modes
Git baseline config (recommended for CI):
```yaml
baseline:
  mode: git
  git:
    ref: origin/main
  snapshot_dir: .regrada/snapshots
```

Reports
Enable JUnit output for CI:
```yaml
report:
  format: [summary, markdown, junit]
  junit:
    path: .regrada/junit.xml
```

CI Behavior
By default, Regrada fails on any severity: error violation. To also fail on warnings:
```yaml
ci:
  fail_on:
    - severity: error
    - severity: warn
```

Providers
All four real providers below (OpenAI, Anthropic, Azure OpenAI, AWS Bedrock) are fully implemented and ready to use; a mock provider is also available for offline testing.
OpenAI
```yaml
providers:
  default: openai
  openai:
    model: gpt-4o-mini
    api_key_env: OPENAI_API_KEY # default
```

Credential resolution: api_key_env → api_key → OPENAI_API_KEY
Anthropic
```yaml
providers:
  default: anthropic
  anthropic:
    model: claude-3-5-sonnet-20241022
    api_key_env: ANTHROPIC_API_KEY # default
```

Credential resolution: api_key_env → api_key → ANTHROPIC_API_KEY
Azure OpenAI
```yaml
providers:
  default: azure_openai
  azure_openai:
    endpoint: https://my-resource.openai.azure.com
    deployment: gpt-4o-mini
    api_version: 2024-02-15-preview # default
    api_key_env: AZURE_OPENAI_API_KEY
```

The deployment field is required. The endpoint URL is also resolved from AZURE_OPENAI_ENDPOINT if not set inline.
Calls {endpoint}/openai/deployments/{deployment}/chat/completions?api-version={api_version}
AWS Bedrock
```yaml
providers:
  default: bedrock
  bedrock:
    model_id: anthropic.claude-3-5-sonnet-20241022-v2:0
    region: us-east-1
    # Optional: explicit credentials
    access_key_env: AWS_ACCESS_KEY_ID
    secret_key_env: AWS_SECRET_ACCESS_KEY
```

Uses the Bedrock Converse API. If explicit credentials are not set, falls back to the AWS SDK default credential chain (instance profile, environment, shared credentials file). Region resolves from region → AWS_REGION → AWS_DEFAULT_REGION.
Mock
```yaml
providers:
  default: mock
```
Returns a fixed "mock response" string. Useful for wiring up tests without real API calls.
Case Format
Example test case (regrada/cases/**/*.yml):
```yaml
id: greeting.hello
tags: [smoke]
request:
  messages:
    - role: system
      content: You are a concise assistant.
    - role: user
      content: Say hello and ask for a name.
  params:
    temperature: 0.2
    top_p: 1.0
    max_output_tokens: 256
assert:
  text:
    contains: ["hello"]
    not_contains: ["error"]
    max_chars: 120
  metrics:
    max_latency_ms: 5000
```

> request must specify either messages or input (a YAML map)
> Roles must be system, user, assistant, or tool
> assert.json.schema and assert.json.path are parsed/validated but not enforced yet by the runner
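Per the first note above, a case may carry a structured input map instead of chat messages. A minimal sketch — the keys inside input are invented for illustration; their shape is defined by your application:

```yaml
id: summarize.ticket
request:
  input:
    ticket_subject: "Refund request"          # illustrative key
    ticket_body: "I was charged twice for my order."  # illustrative key
assert:
  text:
    contains: ["refund"]
```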
Policies
Policies turn runs/diffs into CI gates. Common setup:
```yaml
policies:
  - id: assertions
    severity: error
    check:
      type: assertions
      min_pass_rate: 1.0
  - id: no_pii
    severity: error
    check:
      type: pii_leak
      detector: pii_strict
      max_incidents: 0
  - id: stable_text
    severity: warn
    check:
      type: variance
      metric: token_jaccard
      max_p95: 0.35
  - id: fast_responses
    severity: warn
    check:
      type: latency
      p95_ms:
        max: 5000
```

Policy Scoping
Scope policies to a subset of cases by tags, IDs, or providers:
```yaml
policies:
  - id: smoke_assertions
    severity: error
    scope:
      tags: [smoke]
    check:
      type: assertions
      min_pass_rate: 1.0
```

Supported Policy Types
- assertions — validates case-level assertion pass rate. Required: min_pass_rate
- json_valid — ensures model output is valid JSON. Optional: min_pass_rate (default: 1.0)
- text_contains — required phrase matching. Required: phrases. Optional: min_pass_rate
- text_not_contains — negative phrase matching. Required: phrases. Optional: max_incidents (default: 0)
- pii_leak — detects PII in model output. Required: detector. Optional: max_incidents
- variance — controls output stability via token Jaccard similarity. Required: metric, max_p95
- refusal_rate — monitors how often the model refuses to respond. Required: max and/or max_delta
- latency — enforces P95 latency thresholds. Required: p95_ms.max and/or p95_ms.max_delta
- json_schema — schema validation (scaffolded, not implemented yet)
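For example, the absolute and delta-based thresholds documented above can be combined in one policy list. A sketch using only fields named in this section (verify against your Regrada version):

```yaml
policies:
  - id: refusals_stable
    severity: warn
    check:
      type: refusal_rate
      max: 0.05        # absolute cap on refusal rate
      max_delta: 0.02  # allowed drift vs baseline
  - id: latency_budget
    severity: error
    check:
      type: latency
      p95_ms:
        max: 5000
        max_delta: 1000
```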
Recording Workflow
Forward Proxy (Recommended)
1. Generate and trust the local CA:
```shell
regrada ca init
regrada ca install
```

2. Configure the proxy in regrada.yml:
```yaml
capture:
  enabled: true
  proxy:
    mode: forward
    listen: 127.0.0.1:8080
    allow_hosts:
      - api.openai.com
      - api.anthropic.com
  redact:
    enabled: true
    presets: [pii_basic, secrets]
```

3. Run your app/tests through the proxy:
```shell
regrada record -- ./run-my-tests.sh
```

4. Convert the latest session into cases + baselines:

```shell
regrada accept
```

Reverse Proxy (No MITM)
Set capture.proxy.mode: reverse and configure upstream URLs. Your application must point its LLM base URL at the proxy instead of the real API.
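For example, recent official OpenAI and Anthropic SDKs read their base URL from environment variables, so many apps can be repointed without code changes. Whether these variables are honored depends on your SDK and version — older clients may need the base URL passed explicitly in code:

```shell
# Point SDK traffic at the Regrada reverse proxy (listen address per the
# example config below); support for these env vars depends on your SDK.
export OPENAI_BASE_URL="http://127.0.0.1:4141"
export ANTHROPIC_BASE_URL="http://127.0.0.1:4141"
```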
```yaml
capture:
  proxy:
    mode: reverse
    listen: 127.0.0.1:4141
    upstream:
      openai_base_url: https://api.openai.com
      anthropic_base_url: https://api.anthropic.com
```

Baselines in Git (Recommended for CI)
1. Version-control your snapshot directory
By default, regrada init adds .regrada/ to .gitignore. Un-ignore the snapshots directory:
```
.regrada/*
!.regrada/snapshots/
!.regrada/snapshots/**
```
2. Generate and commit snapshots on your baseline branch
```shell
regrada baseline
git add .regrada/snapshots regrada/cases regrada.yml
git commit -m "Update Regrada baselines"
```

3. In PR branches/CI, run tests with git mode
Set baseline.mode: git and baseline.git.ref: origin/main.
```yaml
baseline:
  mode: git
  git:
    ref: origin/main
  snapshot_dir: .regrada/snapshots
```

Dashboard Integration
Regrada can sync traces and test results to the Regrada web dashboard, giving you a centralized view of LLM behavior across branches and over time.
Setup
- Create an API key in the dashboard under Settings → API Keys
- Add the key to your environment (e.g., REGRADA_API_KEY)
- Enable backend upload in regrada.yml
```yaml
project:
  name: my-project # required: the dashboard project to sync to
backend:
  enabled: true
  api_key_env: REGRADA_API_KEY
  upload:
    traces: true # upload traces captured during recording
    test_results: true # upload test run results after regrada test
```

What Gets Synced
Traces (regrada record):
- Trace ID, timestamp, provider, model
- Full request (messages + params) and response
- Token counts and latency
- Git SHA and branch at time of recording
Test runs (regrada test):
- Run ID, timestamp, git SHA, branch, commit message
- CI provider (GitHub Actions, CircleCI, Jenkins)
- Per-case results: pass rate, P95 latency
- Diff vs baseline: metric deltas
- All policy violations with severity and evidence
API Endpoints
The CLI communicates with https://api.regrada.com (override with REGRADA_API_URL):
| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/ingest/traces/batch | Upload a batch of captured traces |
| POST | /v1/ingest/test-runs | Upload a test run result |
All requests use Authorization: Bearer <api_key>. Trace uploads during recording are non-blocking (async, background goroutine).
GitHub Action
Example workflow configuration:
```yaml
name: Regrada
on:
  pull_request:
jobs:
  regrada:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # required for baseline.mode=git
      - uses: regrada-ai/regrada@v1
        with:
          config: regrada.yml
          comment-on-pr: true
          working-directory: .
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          REGRADA_API_KEY: ${{ secrets.REGRADA_API_KEY }}
```

Action Inputs
| Input | Description | Default |
|---|---|---|
| config | Path to regrada.yml/regrada.yaml | regrada.yml |
| comment-on-pr | Post .regrada/report.md as a PR comment | true |
| working-directory | Directory to run regrada test in | . |
Action Outputs
- total — Total number of cases
- passed — Number of passed cases
- warned — Number of warned cases
- failed — Number of failed cases
- result — success, warning, or failure
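These outputs can feed later workflow steps via the standard GitHub Actions steps context. A sketch assuming the action step is given id: regrada:

```yaml
- uses: regrada-ai/regrada@v1
  id: regrada
  with:
    config: regrada.yml
- name: Summarize
  if: always()
  run: |
    echo "Regrada result: ${{ steps.regrada.outputs.result }}"
    echo "${{ steps.regrada.outputs.failed }} of ${{ steps.regrada.outputs.total }} cases failed"
```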
Exit Codes
regrada test uses exit codes to help CI distinguish failure modes:
- 0 — No failing policy violations
- 1 — Internal error (provider / report / etc.)
- 2 — Policy violations (as configured by ci.fail_on)
- 3 — Invalid config / no cases discovered
- 4 — Missing baseline snapshot
- 5 — Evaluation error (provider call failed, timeout, etc.)
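Scripts that wrap regrada test can branch on these codes explicitly; a sketch (the messages are our own, only the code meanings come from the table above):

```shell
# Map regrada test exit codes to human-readable outcomes (wrapper sketch).
describe_exit() {
  case "$1" in
    0) echo "ok: no failing policy violations" ;;
    1) echo "internal error" ;;
    2) echo "policy violations" ;;
    3) echo "invalid config or no cases discovered" ;;
    4) echo "missing baseline snapshot (run regrada baseline?)" ;;
    5) echo "evaluation error (provider call failed or timed out)" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# Usage in CI:
#   regrada test; describe_exit $?
```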
Troubleshooting
"config not found"
Create regrada.yml by running regrada init or pass --config to specify a different path.
Exit code 4 / baseline missing
Run regrada baseline on your baseline ref and commit snapshots. Ensure CI fetches baseline.git.ref with fetch-depth: 0.
Provider auth errors
- OpenAI: set OPENAI_API_KEY or configure providers.openai.api_key
- Anthropic: set ANTHROPIC_API_KEY or configure providers.anthropic.api_key
- Azure OpenAI: set AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT
- Bedrock: ensure AWS_REGION is set and valid credentials are available via the AWS credential chain
Recording HTTPS fails
Run regrada ca init + regrada ca install, and confirm capture.proxy.allow_hosts includes your provider host (e.g., api.openai.com, api.anthropic.com).
Dashboard upload fails
Verify REGRADA_API_KEY is set and that project.name is configured in regrada.yml. Check that backend.enabled: true and the relevant upload flags are set.