RAG Quality Evaluation and Safety Controls: From Rule-Based Evaluation to Release Gates
A release-quality guide for retrieval, citation, and answer evaluation, six safety layers, privacy-aware telemetry, and public RAG launch gates.
Introduction: A RAG System That Can Answer Is Not Automatically Safe to Publish
The first two guides covered architecture and retrieval implementation. Before opening a public /chat entry point, the system needs to answer a different question: when it fails, can you tell whether the failure came from retrieval, citation, answer generation, or safety policy?
A typical failure looks harmless at first. A user asks how to contact the site owner. The answer is fluent, but it cites an outdated page and points to the wrong entry point. If the only evaluation criterion is “does the answer read well?”, this failure may pass.
Public RAG needs diagnostic evidence. Retrieval quality, citation quality, answer quality, safety gates, and release checks should all leave artifacts that can be inspected after the fact.
This guide defines a quality and safety framework for a knowledge hub RAG assistant:
- three quality evaluation layers;
- a fixed evaluation set;
- rule-based evaluation;
- debug reports;
- six safety layers;
- privacy-aware telemetry;
- closeout and reverse-audit gates.
Scope: this guide fits knowledge hub assistants, documentation search, and site-level Q&A. It does not claim to cover enterprise multi-tenant authorization, PII compliance, internal knowledge hub permissions, or formal regulatory audits.
The quality, safety, and release loops converge like this:
Three-Layer Quality Evaluation
Many RAG evaluations collapse everything into one question: “Was the answer correct?” That is not enough for a public system.
The system needs to locate the failing layer:
- Did retrieval return the right sources?
- Were citations fresh, unique, and public?
- Did the final answer cover the required facts without inventing unsupported details?
Layer 1: Retrieval Quality
Retrieval quality asks whether the system found the right external knowledge.
A test case should define expected sources and required facts:
interface RetrievalTest {
  query: string;
  expectedSources: string[];
  requiredFacts: string[];
  minRecallRate: number;
}

const retrievalTests: RetrievalTest[] = [
  {
    query: "How does AI-TDD govern AI output?",
    expectedSources: [
      "https://hlluan.com/en/blog/ai-tdd-framework/"
    ],
    requiredFacts: [
      "Manifest",
      "six mental models",
      "entry gate or delivery gate"
    ],
    minRecallRate: 0.8
  }
];
The exact thresholds should match the corpus and use case. The important part is that retrieval evaluation does not wait for answer generation.
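One way such a test might be scored is sketched below, assuming retrieval returns a plain list of source URLs. The helper names `sourceRecall` and `passesRetrieval` are hypothetical, and recall is taken here to mean the fraction of expected sources that appear among the retrieved ones.

```typescript
// Hypothetical helper: fraction of expected sources present in the retrieved set.
function sourceRecall(expectedSources: string[], retrievedSources: string[]): number {
  if (expectedSources.length === 0) return 1; // nothing expected, nothing missed
  const retrieved = new Set(retrievedSources);
  const hits = expectedSources.filter((url) => retrieved.has(url)).length;
  return hits / expectedSources.length;
}

// Hypothetical helper: compare recall with the per-case threshold.
function passesRetrieval(
  test: { expectedSources: string[]; minRecallRate: number },
  retrievedSources: string[]
): boolean {
  return sourceRecall(test.expectedSources, retrievedSources) >= test.minRecallRate;
}
```

Because the check only needs the retrieved URLs, it can run without calling the answer model at all.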
Layer 2: Citation Quality
Citation quality asks whether selected sources are usable as public evidence.
Useful checks include:
- every cited URL is public and canonical;
- duplicate citations are removed;
- URLs use HTTPS;
- cited chunks are not stale;
- current-page mode cites only the current page;
- citation titles match the source manifest.
Example rule output:
{
  "citationQuality": {
    "ok": true,
    "rules": [
      { "id": "citation.url_https", "status": "pass" },
      { "id": "citation.no_duplicates", "status": "pass" },
      { "id": "citation.current_page_scope", "status": "pass" }
    ]
  }
}
This layer prevents a fluent answer from hiding weak or outdated evidence.
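Three of these rules can be sketched as a single function. This is a minimal illustration, assuming citations arrive as a list of URL strings and that current-page mode is signalled by passing the page URL; the function name `checkCitations` is hypothetical, and the freshness, canonical-URL, and title checks are left out because they depend on the source manifest.

```typescript
interface CitationRuleResult {
  id: string;
  status: "pass" | "fail";
}

// Assumed inputs: the cited URLs, plus the current page URL when in current-page mode.
function checkCitations(citedUrls: string[], currentPageUrl?: string): CitationRuleResult[] {
  const allHttps = citedUrls.every((url) => url.startsWith("https://"));
  const noDuplicates = new Set(citedUrls).size === citedUrls.length;
  // Outside current-page mode the scope rule passes trivially.
  const scoped =
    currentPageUrl === undefined || citedUrls.every((url) => url === currentPageUrl);
  const toStatus = (ok: boolean): "pass" | "fail" => (ok ? "pass" : "fail");
  return [
    { id: "citation.url_https", status: toStatus(allHttps) },
    { id: "citation.no_duplicates", status: toStatus(noDuplicates) },
    { id: "citation.current_page_scope", status: toStatus(scoped) },
  ];
}
```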
Layer 3: Answer Quality
Answer quality asks whether the generated answer used the evidence correctly.
For a public assistant, useful answer rules include:
- all required facts are covered;
- unsupported claims are absent;
- uncertainty is stated when evidence is incomplete;
- the language matches the user request or page locale;
- the answer does not expose internal files, prompts, or private draft paths;
- the answer falls back when evidence is insufficient.
The rule set can remain simple at first:
interface AnswerRuleResult {
  id: string;
  status: "pass" | "fail";
  reason?: string;
}

// Case-insensitive substring check: each required fact must appear verbatim in the answer.
function evaluateRequiredFacts(answer: string, facts: string[]): AnswerRuleResult {
  const haystack = answer.toLowerCase();
  const missing = facts.filter((fact) => !haystack.includes(fact.toLowerCase()));
  return missing.length === 0
    ? { id: "answer.required_facts", status: "pass" }
    : {
        id: "answer.required_facts",
        status: "fail",
        reason: `Missing facts: ${missing.join(", ")}`
      };
}
This does not replace human review, but it catches repeatable failure modes before release.
Evaluation Set Design
An evaluation set is a fixed list of questions that should be run before release and after relevant retrieval changes.
For a knowledge hub, the set should cover high-value and high-risk questions:
- site purpose and positioning;
- contact or owner information;
- current-page summary;
- topic lookup;
- exact error-code lookup;
- implementation detail lookup;
- cross-article synthesis;
- no-answer or weak-evidence fallback;
- safety and prompt-injection attempts.
The purpose is not to maximize the number of questions. The purpose is to preserve a stable regression surface.
Evaluation Case Schema
A compact schema is enough:
interface EvaluationCase {
  id: string;
  query: string;
  locale: "zh-CN" | "en";
  mode: "site" | "current_page";
  pageUrl?: string;
  expectedSources: string[];
  requiredFacts: string[];
  forbiddenClaims: string[];
}
Each case should explain what would count as failure. That makes review faster and prevents the team from approving a plausible but unsupported answer.
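The `forbiddenClaims` field is what turns "plausible but unsupported" into a concrete failure. A minimal sketch of such a rule follows, assuming the same case-insensitive substring matching used for required facts; the function name and the rule ID `answer.forbidden_claims` are hypothetical.

```typescript
// Hypothetical rule: fail when the answer contains any forbidden claim
// (case-insensitive substring match).
function evaluateForbiddenClaims(
  answer: string,
  forbiddenClaims: string[]
): { id: string; status: "pass" | "fail"; reason?: string } {
  const haystack = answer.toLowerCase();
  const found = forbiddenClaims.filter((claim) => haystack.includes(claim.toLowerCase()));
  return found.length === 0
    ? { id: "answer.forbidden_claims", status: "pass" }
    : {
        id: "answer.forbidden_claims",
        status: "fail",
        reason: `Forbidden claims present: ${found.join(", ")}`,
      };
}
```

Substring matching is deliberately crude: it catches exact regressions, while paraphrased claims still need human review.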
Rule-Based Evaluation
Rule-based evaluation is not a replacement for model-based scoring. It is the first release gate because it is deterministic, cheap, and easy to debug.
The rule engine should produce structured results:
{
  "caseId": "rag-ai-tdd-governance",
  "ok": false,
  "failedRules": [
    {
      "id": "retrieval.expected_source",
      "reason": "AI-TDD source was not retrieved"
    }
  ],
  "selectedSources": [
    "/en/guides/rag-system-architecture/"
  ]
}
This makes the next step obvious: fix retrieval, not answer prose.
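A report with that shape could be assembled from individual rule outcomes roughly as follows. This is a sketch under the assumption that every rule returns an `id`, a `status`, and an optional `reason`; the names `RuleOutcome`, `CaseReport`, and `buildCaseReport` are illustrative.

```typescript
interface RuleOutcome {
  id: string;
  status: "pass" | "fail";
  reason?: string;
}

interface CaseReport {
  caseId: string;
  ok: boolean;
  failedRules: { id: string; reason?: string }[];
  selectedSources: string[];
}

// Collects failed rules; the case is ok only when none failed.
function buildCaseReport(
  caseId: string,
  outcomes: RuleOutcome[],
  selectedSources: string[]
): CaseReport {
  const failedRules = outcomes
    .filter((outcome) => outcome.status === "fail")
    .map(({ id, reason }) => (reason === undefined ? { id } : { id, reason }));
  return { caseId, ok: failedRules.length === 0, failedRules, selectedSources };
}
```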
Debug Reports
Every failed evaluation case should produce a debug report that includes:
- normalized query;
- detected intent;
- Vectorize candidates;
- FTS5 candidates;
- fused ranking;
- rerank status;
- selected evidence;
- generated answer;
- failed rule IDs;
- timing and fallback state.
The report should be small enough to inspect, but complete enough to locate the failing layer.
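If rule IDs carry their layer as a prefix, as in `retrieval.expected_source` and `citation.url_https`, locating the failing layer can be automated. The sketch below assumes that naming convention and assumes that the earliest layer in the pipeline is the most useful one to report; `firstFailingLayer` is a hypothetical helper.

```typescript
type EvalLayer = "retrieval" | "citation" | "answer";

// Layers in pipeline order: an upstream failure usually explains downstream ones.
const LAYER_ORDER: EvalLayer[] = ["retrieval", "citation", "answer"];

// Assumes rule IDs are prefixed with their layer, as in "retrieval.expected_source".
function firstFailingLayer(failedRuleIds: string[]): EvalLayer | null {
  for (const layer of LAYER_ORDER) {
    if (failedRuleIds.some((id) => id.startsWith(`${layer}.`))) return layer;
  }
  return null; // no failures, or only rules outside the three layers
}
```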
Six Safety Layers
Quality evaluation asks whether answers are good. Safety controls decide whether a request should continue downstream at all.
Layer 1: Kill Switch
The first safety layer is a public-chat switch:
if (env.RAG_CHAT_ENABLED !== "true") {
  return json({ error: "chat_disabled" }, 503);
}
This switch lets the site deploy the widget without enabling model calls before the release gate is ready.
Layer 2: Origin Check
The worker should only accept requests from allowed public origins:
function isAllowedOrigin(origin: string | null, allowed: string[]) {
  if (!origin) return false;
  return allowed.includes(origin);
}
For local development, allowlists can support explicit localhost patterns. Production should remain narrow.
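A development-aware variant might look like the sketch below. The `isDevelopment` flag and the localhost pattern are assumptions made for illustration; the key point is that the pattern is anchored, so a hostname such as `localhost.evil.com` does not match.

```typescript
// Hypothetical dev-only pattern: http://localhost or http://127.0.0.1 with an optional port.
const LOCALHOST_ORIGIN = /^http:\/\/(localhost|127\.0\.0\.1)(:\d{1,5})?$/;

function isAllowedOriginForEnv(
  requestOrigin: string | null,
  allowed: string[],
  isDevelopment: boolean
): boolean {
  if (!requestOrigin) return false;
  if (allowed.includes(requestOrigin)) return true;
  // Localhost is accepted only when the caller explicitly opts in to development mode.
  return isDevelopment && LOCALHOST_ORIGIN.test(requestOrigin);
}
```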
Layer 3: Rate Limit
Rate limits protect the system from accidental loops and simple abuse:
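// Fixed-window counter keyed by client and hour.
// KV reads and writes are not atomic, so treat the count as approximate.
const key = `rate:${clientId}:${dateHour}`;
const count = Number((await env.KV.get(key)) ?? 0);
if (count >= hourlyLimit) {
  return json({ error: "rate_limited" }, 429);
}
// Record this request; the counter expires after the hourly window.
await env.KV.put(key, String(count + 1), { expirationTtl: 3600 });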
Rate-limit rejections should be visible in telemetry, and the check should run before any expensive model call.
Layer 4: Daily Budget
A daily budget is different from a per-client rate limit. It protects the owner from total cost exposure.
The budget check should happen before embedding, rerank, and answer generation.
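The core of such a check is small. The sketch below assumes one counter per UTC day and a single shared cost unit; the key scheme and both function names are hypothetical, and reading or writing the counter in storage is left out.

```typescript
// Hypothetical key scheme: one counter per UTC calendar day.
function dailyBudgetKey(now: Date): string {
  return `budget:${now.toISOString().slice(0, 10)}`;
}

// Returns true when the request may proceed to embedding, rerank, and generation.
// spentToday, estimatedCost, and dailyLimit share one unit (for example, model calls or cents).
function withinDailyBudget(spentToday: number, estimatedCost: number, dailyLimit: number): boolean {
  return spentToday + estimatedCost <= dailyLimit;
}
```

Keying by UTC date means the budget resets at the same instant for every client, regardless of their time zone.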
Layer 5: Circuit Breaker
If downstream providers fail repeatedly, the worker should stop calling them for a short period:
if (await isCircuitOpen(env.KV, "model-provider")) {
  return json({ error: "service_temporarily_unavailable" }, 503);
}
This prevents failure storms from turning into cost storms.
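The state logic behind a breaker like this can be kept pure and tested without any storage. The sketch below is one possible shape, not the original `isCircuitOpen` implementation: the threshold, the cool-down, and all names are assumptions, and persisting the state (for example in KV) is left to the caller.

```typescript
interface CircuitState {
  consecutiveFailures: number;
  openedAt: number | null; // epoch milliseconds when the circuit last opened
}

// Hypothetical thresholds; tune them to the provider's failure profile.
const FAILURE_THRESHOLD = 3;
const COOL_DOWN_MS = 60000;

// Opens (or re-opens) the circuit once enough consecutive failures accumulate.
function recordFailure(state: CircuitState, now: number): CircuitState {
  const consecutiveFailures = state.consecutiveFailures + 1;
  const shouldOpen = consecutiveFailures >= FAILURE_THRESHOLD;
  return { consecutiveFailures, openedAt: shouldOpen ? now : state.openedAt };
}

// Any successful call resets the breaker.
function recordSuccess(): CircuitState {
  return { consecutiveFailures: 0, openedAt: null };
}

// The circuit is open only during the cool-down window after it tripped.
function isOpen(state: CircuitState, now: number): boolean {
  return state.openedAt !== null && now - state.openedAt < COOL_DOWN_MS;
}
```

After the cool-down, the next request acts as a trial call: a success resets the breaker, and another failure re-opens it immediately.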
Layer 6: Input Validation
Input validation should check length, content type, empty queries, and obvious prompt-injection patterns. It should not try to solve all security problems with a single regex.
Useful checks include:
- maximum query length;
- JSON shape validation;
- page URL validation;
- locale validation;
- rejection of private path requests;
- rejection of attempts to reveal prompts, secrets, or internal files.
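Several of these checks can be combined into one validator, sketched below. The length limit, the error codes, and the pattern list are assumptions chosen for illustration, and the short denylist is only a first filter, not a complete defense against prompt injection.

```typescript
interface ChatInput {
  query: string;
  locale: "zh-CN" | "en";
  pageUrl?: string;
}

type ValidationResult = { ok: true; value: ChatInput } | { ok: false; error: string };

const MAX_QUERY_LENGTH = 500; // hypothetical limit
// Illustrative patterns only; a short denylist will not catch every injection attempt.
const SUSPICIOUS_PATTERNS = [/ignore (all )?previous instructions/i, /system prompt/i, /api[_ -]?key/i];

const isLocale = (value: unknown): value is "zh-CN" | "en" => value === "zh-CN" || value === "en";

function validateChatInput(body: unknown, siteOrigin: string): ValidationResult {
  if (typeof body !== "object" || body === null) return { ok: false, error: "invalid_json_shape" };
  const { query, locale, pageUrl } = body as Record<string, unknown>;
  if (typeof query !== "string" || query.trim() === "") return { ok: false, error: "empty_query" };
  if (query.length > MAX_QUERY_LENGTH) return { ok: false, error: "query_too_long" };
  if (!isLocale(locale)) return { ok: false, error: "invalid_locale" };
  let safePageUrl: string | undefined;
  if (pageUrl !== undefined) {
    // The page URL must be a string under the public site origin.
    if (typeof pageUrl !== "string" || !pageUrl.startsWith(`${siteOrigin}/`)) {
      return { ok: false, error: "invalid_page_url" };
    }
    safePageUrl = pageUrl;
  }
  if (SUSPICIOUS_PATTERNS.some((pattern) => pattern.test(query))) {
    return { ok: false, error: "rejected_query" };
  }
  const value: ChatInput =
    safePageUrl === undefined ? { query, locale } : { query, locale, pageUrl: safePageUrl };
  return { ok: true, value };
}
```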
Privacy-Aware Telemetry
Debuggability should not require excessive data retention.
Prefer storing:
- case ID;
- intent;
- failed rule IDs;
- selected source URLs;
- timing;
- fallback state;
- redacted query hashes where possible.
Avoid storing:
- raw secrets;
- API keys;
- private URLs;
- full prompts unless explicitly needed for a controlled debug artifact;
- personal data that is not necessary for quality evaluation.
The goal is to explain failures without expanding the privacy surface.
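One way to enforce the two lists is to build telemetry records from an explicit allowlist of fields, so that anything not named is dropped by construction. The sketch below is illustrative: the field names and the `publicOrigin` parameter are assumptions, and query hashing is omitted.

```typescript
interface TelemetryEvent {
  caseId: string;
  intent: string;
  failedRuleIds: string[];
  selectedSourceUrls: string[];
  durationMs: number;
  usedFallback: boolean;
}

// Copies only allowlisted fields, so extra properties on the input
// (raw prompts, keys, personal data) never reach storage.
function toTelemetryEvent(raw: TelemetryEvent, publicOrigin: string): TelemetryEvent {
  return {
    caseId: raw.caseId,
    intent: raw.intent,
    failedRuleIds: [...raw.failedRuleIds],
    // Keep only URLs under the public origin; private or draft URLs are dropped.
    selectedSourceUrls: raw.selectedSourceUrls.filter((url) => url.startsWith(`${publicOrigin}/`)),
    durationMs: raw.durationMs,
    usedFallback: raw.usedFallback,
  };
}
```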
Governance and Audit
RAG release governance should answer three questions:
- Which requirement was tested?
- Which command or artifact proves the result?
- Which gate allowed the release to proceed?
Requirement Traceability Matrix
A minimal trace matrix can connect requirements to evidence:
| Requirement | Evidence | Gate |
|---|---|---|
| Public chat is disabled by default | config audit | safety gate |
| Evaluation set passes | answer-quality report | closeout gate |
| Index manifest matches dist corpus | ingest verify report | index gate |
| English pages use English SVG assets | i18n audit | content gate |
The matrix does not need to be complex. It needs to be current and tied to actual commands.
Closeout Gate
The closeout gate should block release unless core checks pass:
- build succeeds;
- i18n contract passes;
- static diagram contract passes;
- RAG corpus export succeeds;
- RAG corpus audit passes;
- ingest dry run succeeds;
- answer-quality evaluation passes;
- public safety configuration is reviewed.
The gate should not rely on screenshots or informal approval.
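A gate of this kind can be reduced to a function over check results. In the sketch below, each check must both pass and name the command or artifact that proves it; the `GateCheck` shape and `evaluateCloseoutGate` are hypothetical names.

```typescript
interface GateCheck {
  name: string;
  passed: boolean;
  evidence: string; // command or artifact path that proves the result
}

function evaluateCloseoutGate(checks: GateCheck[]): { release: boolean; blockers: string[] } {
  // A check without evidence blocks the release even if it reports success.
  const blockers = checks
    .filter((check) => !check.passed || check.evidence.trim() === "")
    .map((check) => check.name);
  // An empty check list never releases: no evidence means no approval.
  return { release: checks.length > 0 && blockers.length === 0, blockers };
}
```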
Reverse Audit
Reverse audit asks the uncomfortable question: if this release were wrong, how would we know?
Examples:
- If the AI-TDD answer cites the wrong source, which rule fails?
- If current-page summary drifts into site-wide content, which test fails?
- If the English guide references a Chinese SVG, which i18n check fails?
- If rerank times out, where is that fallback recorded?
- If budget is exceeded, which response proves downstream calls stopped?
If no artifact can answer those questions, the release is not yet auditable.
Launch Checklist
Before enabling the public endpoint, verify:
- the public-chat switch is explicit;
- allowed origins are configured;
- rate limits and daily budget are active;
- circuit-breaker state is observable;
- private paths are excluded from the corpus;
- the evaluation set covers high-value questions;
- failed cases produce debug reports;
- citation URLs are public and canonical;
- current-page mode is scoped;
- no-answer fallback is tested;
- release artifacts are stored with command evidence.
This checklist should be treated as a release gate, not as documentation after the fact.
Summary
A public RAG system needs more than retrieval and generation.
It needs:
- retrieval tests that prove the right sources were found;
- citation tests that prove the sources are usable;
- answer tests that prove required facts are covered;
- safety controls that stop risky or expensive requests early;
- debug reports that identify the failing layer;
- closeout evidence that proves the release is ready.
The system does not need to be perfect before launch. It does need to be inspectable, bounded, and reversible.