runlit
architecture · launch

How runlit works

March 23, 2026 · runlit team

Every AI coding assistant — Cursor, Copilot, Windsurf, Claude — writes code confidently. The problem is that confidence doesn’t correlate with correctness. Hallucinated APIs that don’t exist. Logic that technically compiles but contradicts the original intent. Security patterns that introduce vulnerabilities. Compliance violations that only surface during an audit.

runlit is the eval layer that catches these before they reach production. Here’s how it works under the hood.

The pipeline

When a pull request is opened or updated, the eval pipeline runs in three stages:

  1. Ingestion — parse the diff, detect the language, identify AI-attribution signals
  2. Evaluation — score the diff across four independent signals in parallel
  3. Delivery — post the score as a PR comment, update the check status, log the eval

The entire pipeline runs in under 4 seconds for a typical PR. No developer action required.

Four signals, one score

Every eval produces four independent signal scores, each between 0 and 1:

Signal          Weight   What it catches
Hallucination   0.30     Phantom APIs, deprecated methods, non-existent packages
Intent          0.35     Code that works but doesn’t match the issue/prompt
Security        0.25     OWASP top-10 patterns, injection, hardcoded secrets
Compliance      0.10     PCI-DSS, HIPAA, SOC2, EU AI Act rule violations

The composite score is the weighted sum of the four signals, scaled to 0–100. Scores below 50 are blocked (merge prevented), 50–70 are warned (advisory), and 70 and above pass.
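The weighting and grading logic can be sketched in a few lines. This is a minimal illustration using the weights and thresholds from the table above; the 0–100 scaling of the composite is an assumption inferred from the 50/70 cutoffs.

```python
# Sketch of composite scoring: combine the 0-1 signal scores with the
# published weights, then scale to 0-100 for grading.

WEIGHTS = {
    "hallucination": 0.30,
    "intent": 0.35,
    "security": 0.25,
    "compliance": 0.10,
}

def composite_score(signals: dict[str, float]) -> float:
    """Weighted sum of the four 0-1 signal scores, scaled to 0-100."""
    return 100 * sum(WEIGHTS[name] * score for name, score in signals.items())

def grade(score: float) -> str:
    """Map a composite score to the BLOCK / WARN / PASS bands."""
    if score < 50:
        return "BLOCK"
    if score < 70:
        return "WARN"
    return "PASS"
```

For example, signal scores of 0.9 / 0.8 / 0.7 / 1.0 yield a composite of 82.5, which passes.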

Stage 1: Ingestion

The ingestion service is written in Rust. It receives the raw unified diff and does three things:

Diff parsing — a state machine walks the diff --git format, extracting added/removed lines, file paths, and hunk boundaries. We support diffs up to 1MB.

Language detection — file extensions map to one of 14 supported languages: Python, TypeScript, JavaScript, Go, Rust, Java, Kotlin, Ruby, C#, PHP, Swift, Scala, Solidity, and Bash.

Static analysis — structural AST matching runs provider-specific rules against the code. This is where our 139-rule detection library comes in — 12 packs covering OpenAI, Anthropic, LangChain, Stripe, Google AI, HuggingFace, LlamaIndex, Pinecone, Vercel AI, plus crypto, injection, and secrets security packs. These rules catch known-bad patterns before the LLM evaluation even runs.

Stage 2: Evaluation

The engine receives the parsed diff and static findings, then fans out four parallel evaluations:

Hallucination detection cross-references every API call, method signature, and import against real documentation. When the AI writes stripe.charges.create() but the current SDK uses stripe.paymentIntents.create(), the hallucination signal catches it. The static rule engine handles known deprecated patterns; the LLM handles novel hallucinations that rules can’t catch.
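At its core this cross-reference is a set-membership check: is every referenced symbol in the documented surface of the SDK? A toy version, with a two-entry "documentation index" invented for the Stripe example above:

```python
# Hypothetical documentation index; the real one is built from SDK docs.
KNOWN_APIS = {
    "stripe.paymentIntents.create",   # current API
    "stripe.customers.create",
}

def find_phantom_apis(called: list[str]) -> list[str]:
    """Return referenced APIs that don't appear in the documentation index."""
    return [name for name in called if name not in KNOWN_APIS]
```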

Intent matching compares the diff against the original issue description, PR body, or prompt. If a ticket says “add cursor-based pagination” but the code adds offset pagination, the intent signal flags the mismatch. This prevents the most insidious class of AI bugs — code that passes tests but doesn’t do what was asked.

Security scanning runs OWASP top-10 patterns across the diff: SQL injection via string interpolation, command injection, XSS through innerHTML, hardcoded secrets, insecure deserialization, weak cryptographic patterns, and more. AI-generated code is particularly prone to these — it copies patterns from training data that may have been insecure.
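Pattern-based checks like these complement the LLM pass. Two illustrative rules (the regexes are ours, not runlit's): hardcoded secrets and SQL built by f-string interpolation.

```python
import re

# Two example security rules: each is (rule_id, compiled pattern).
SECURITY_RULES = [
    ("hardcoded-secret",
     re.compile(r"""(api_key|secret|password)\s*=\s*["'][^"']+["']""", re.I)),
    ("sql-interpolation",
     re.compile(r"""f["'](SELECT|INSERT|UPDATE|DELETE)\b[^"']*\{""", re.I)),
]

def scan_lines(lines: list[str]) -> list[tuple[int, str]]:
    """Return (line_number, rule_id) pairs for every rule hit."""
    hits = []
    for i, line in enumerate(lines, start=1):
        for rule_id, pattern in SECURITY_RULES:
            if pattern.search(line):
                hits.append((i, rule_id))
    return hits
```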

Compliance checking activates per-repo rule packs. If your .runlit.yml specifies compliance_packs: [pci-dss], the engine loads the PCI-DSS rule pack and evaluates the diff against 18 specific controls — credential handling, injection prevention, cryptographic standards, and logging restrictions. We currently ship four packs: PCI-DSS (18 rules), HIPAA (20 rules), SOC2, and EU AI Act (18 rules).

All four signals run in parallel. The engine merges results, computes the composite score, and returns.
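The fan-out/merge step can be sketched with a thread pool: run the signal evaluators concurrently and collect their 0–1 scores by name. This is an illustration of the shape of the step, not runlit's actual concurrency model.

```python
from concurrent.futures import ThreadPoolExecutor

def run_signals(evaluators: dict, diff) -> dict[str, float]:
    """Run each evaluator concurrently; `evaluators` maps signal name to a
    callable taking the parsed diff and returning a 0-1 score."""
    with ThreadPoolExecutor(max_workers=len(evaluators)) as pool:
        futures = {name: pool.submit(fn, diff) for name, fn in evaluators.items()}
        return {name: fut.result() for name, fut in futures.items()}
```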

Stage 3: Delivery

The API service receives the scored eval and delivers it:

  • PR comment — a Markdown table with signal breakdown, composite score, grade (PASS/WARN/BLOCK), and a link to the full eval in the dashboard
  • Check status — if the plan supports merge blocking and the score is below threshold, the PR check fails. Otherwise, advisory only.
  • Eval log — the full eval is stored in PostgreSQL with the diff hash, signal scores, token counts, model reasoning, and latency

This works across all four VCS providers we support: GitHub, GitLab, Azure DevOps, and Bitbucket. Each has its own webhook handler, REST client, and comment formatter.
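A sketch of what the shared Markdown body for the PR comment might look like; in this design each provider's formatter would wrap a common renderer like this one (the layout is our guess, not runlit's actual template):

```python
def format_comment(signals: dict[str, float], composite: float, grade: str) -> str:
    """Render the signal breakdown and composite grade as a Markdown table."""
    rows = "\n".join(f"| {name} | {score:.2f} |" for name, score in signals.items())
    return (
        "| Signal | Score |\n"
        "|---|---|\n"
        f"{rows}\n\n"
        f"**Composite: {composite:.0f} · {grade}**"
    )
```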

The config file

A .runlit.yml in your repo root controls behavior:

threshold: 70
compliance_packs:
  - pci-dss
  - hipaa
signals:
  hallucination: true
  intent: true
  security: true
  compliance: true

The CLI reads this file from the working directory. The GitHub App fetches it via the API. The webhook handler pulls it from the repo before running the eval.

What’s next

We’re working on:

  • Domain-specific fine-tuning — training models on labeled eval data to improve accuracy for specific frameworks and languages
  • Autonomous agent evaluation — scoring output from long-running AI agents (Devin, Claude Code, Copilot Agents) where intent drift is a bigger risk
  • Custom rule authoring — letting teams write their own detection rules in the same YAML format as our community packs

runlit is free to start — 500 evals/month, no credit card. Install the GitHub App and your first eval runs in 30 seconds.
