RAG Retrieval Implementation Deep Dive: Chunking, Hybrid Retrieval, and Intent Routing
An implementation guide for chunking, stable chunk IDs, hybrid retrieval, rule-based intent routing, current-page summaries, rerank, fallback, and verification.
Introduction: Retrieval Quality Limits RAG Quality
The previous guide covered the system architecture. This guide moves into the implementation path: how to split long public articles into useful chunks, how to combine semantic retrieval with exact keyword search, and how to route high-value user intents without hiding the logic inside a black box.
In a knowledge hub assistant, many wrong answers start before the model generates anything. The retrieval layer may send the wrong page, a stale chunk, or a chunk that is semantically close but operationally irrelevant. A question about 429, text-embedding-v4, or chunkSize=800 can fail if exact signals are lost. A request to “summarize this page” can drift into a site-wide answer if scope is not detected explicitly.
This guide uses a public RAG Worker baseline to explain how 800-character chunks, 200-character overlap, RRF fusion, rule-based intent routing, and current-page summaries work together. The numbers are a starting point for this corpus, not a universal optimum.
Part position: the first part explained architecture and indexing boundaries. This part explains the retrieval implementation. The third part covers quality evaluation, safety gates, and release evidence.
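The retrieval path runs in this order: query embedding, vector recall, keyword recall, RRF fusion, per-document caps, optional rerank, and prompt assembly.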
Chunking Strategy: Why Start With 800 Characters?
Chunking is the indexing step that shapes every later retrieval result. A chunk that is too small loses context. A chunk that is too large mixes topics and weakens ranking.
For this knowledge hub corpus, the baseline uses:
CHUNK_SIZE = 800
OVERLAP_SIZE = 200
That gives roughly 25 percent overlap. It usually keeps one explanation or a short code example intact while reducing the risk that unrelated sections land in the same chunk.
The important point is not that 800 is universally correct. The important point is that the value is tied to an evaluation set and can be re-tested when the corpus changes.
Boundary Detection
Fixed character slicing is not enough. The indexer should prefer semantic boundaries.
One simple approach is to search for boundaries near the preferred end point:
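const CHUNK_SIZE = 800;
const OVERLAP_SIZE = 200;

// Returns the index where the chunk starting at `start` should end.
function findChunkEnd(text: string, start: number, preferredEnd: number): number {
  // Look up to 100 characters past the preferred end for a cleaner boundary.
  const window = text.slice(start, preferredEnd + 100);
  // Ordered from most to least preferred boundary.
  const candidates = [
    "\n\n", // paragraph break
    "\n#", // heading
    "```", // code fence
    ". ", // sentence end
    "; ", // clause end
    "\n", // line break
  ];
  for (const marker of candidates) {
    const index = window.lastIndexOf(marker);
    // Ignore boundaries that would leave the chunk under 65% of the target size.
    if (index > CHUNK_SIZE * 0.65) {
      // For a heading, cut after the newline so the "#" stays with its heading.
      const consumed = marker === "\n#" ? 1 : marker.length;
      return start + index + consumed;
    }
  }
  return Math.min(preferredEnd, text.length);
}
This kind of function is not perfect, but it encodes a useful preference order: paragraph, heading, code fence, sentence, clause, and then line break. On a heading match it cuts before the "#" so the heading stays attached to the section it introduces.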
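To show where a boundary finder fits, here is a minimal sketch of the surrounding chunking loop. `chunkText`, the `Chunk` shape, and the `FindEnd` type are hypothetical names for this illustration; the finder is injected so the block stays self-contained, and the default simply cuts at the preferred end.

```typescript
type Chunk = { index: number; start: number; end: number; content: string };

// Same signature as a boundary finder such as findChunkEnd.
type FindEnd = (text: string, start: number, preferredEnd: number) => number;

// Fallback finder for this sketch: plain fixed-size slicing.
const cutAtPreferredEnd: FindEnd = (text, _start, preferredEnd) =>
  Math.min(preferredEnd, text.length);

function chunkText(
  text: string,
  chunkSize = 800,
  overlap = 200,
  findEnd: FindEnd = cutAtPreferredEnd
): Chunk[] {
  const chunks: Chunk[] = [];
  let start = 0;
  while (start < text.length) {
    // Guard against a finder that returns an end at or before `start`.
    const rawEnd = Math.max(findEnd(text, start, start + chunkSize), start + 1);
    const end = Math.min(rawEnd, text.length);
    chunks.push({ index: chunks.length, start, end, content: text.slice(start, end) });
    if (end >= text.length) break;
    // Step back by the overlap, but always advance by at least one character.
    start = Math.max(end - overlap, start + 1);
  }
  return chunks;
}
```

With the default finder, a 2,000-character text yields three chunks starting at offsets 0, 600, and 1200. Passing a boundary-aware finder changes where each chunk ends but not the loop itself.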
Why Overlap Is Necessary
Overlap reduces boundary loss. If a concept starts at the end of one chunk and finishes at the beginning of the next, retrieval has a better chance of returning enough context.
The tradeoff is cost:
- larger overlap improves continuity;
- larger overlap increases chunk count;
- more chunks increase vector storage and embedding cost;
- more chunks can also increase duplicate evidence.
The practical rule is to treat overlap as a measured parameter. If evaluation failures show missing cross-boundary context, increase overlap. If retrieval returns too many near-duplicates, reduce overlap or strengthen per-document caps.
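To make the cost side of that tradeoff concrete, the sketch below estimates chunk count under fixed-stride slicing. `estimateChunkCount` is a hypothetical helper, and it assumes every chunk is exactly `chunkSize` characters, which boundary detection will violate slightly.

```typescript
// Hypothetical helper: estimates chunk count for fixed-stride slicing.
// Real counts will differ a little once boundary detection moves chunk ends.
function estimateChunkCount(textLength: number, chunkSize: number, overlap: number): number {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  if (textLength <= 0) return 0;
  if (textLength <= chunkSize) return 1;
  // Each chunk after the first contributes `stride` new characters.
  const stride = chunkSize - overlap;
  return 1 + Math.ceil((textLength - chunkSize) / stride);
}

const articleLength = 12000; // illustrative article size in characters
for (const overlap of [0, 100, 200, 400]) {
  console.log(`overlap=${overlap}: ${estimateChunkCount(articleLength, 800, overlap)} chunks`);
}
```

For this illustrative 12,000-character article, the estimate rises from 15 chunks at zero overlap to 20 chunks at 200 characters and 29 chunks at 400 characters. Embedding and storage cost scale with that count.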
Stable Chunk IDs
Stable IDs are required for incremental indexing. If IDs change on every build, the system cannot distinguish a real content update from a rebuild artifact.
A useful chunk ID should include stable source identity and chunk position:
import { createHash } from "node:crypto";

function stableChunkId(url: string, locale: string, index: number) {
  const key = `${locale}:${url}:${index}`;
  return createHash("sha256").update(key).digest("hex").slice(0, 24);
}
The chunk content hash is separate:
function contentHash(content: string) {
  return createHash("sha256").update(content).digest("hex");
}
The ID tells the system which logical chunk it is. The hash tells the system whether the content changed.
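One way to use the two values together is to diff the previous index state against the new build. The sketch below is an assumption about how such a planner could look; `planIncrementalIndex`, `IndexedChunk`, and `IndexPlan` are hypothetical names, not part of the worker.

```typescript
type IndexedChunk = { id: string; hash: string };

type IndexPlan = {
  upsert: string[]; // new or changed chunks: re-embed and write
  unchanged: string[]; // same ID and same hash: skip
  remove: string[]; // IDs that no longer exist in the new build
};

function planIncrementalIndex(previous: IndexedChunk[], next: IndexedChunk[]): IndexPlan {
  const previousHashes = new Map<string, string>();
  for (const chunk of previous) previousHashes.set(chunk.id, chunk.hash);
  const nextIds = new Set(next.map((chunk) => chunk.id));

  const plan: IndexPlan = { upsert: [], unchanged: [], remove: [] };
  for (const chunk of next) {
    if (previousHashes.get(chunk.id) === chunk.hash) {
      plan.unchanged.push(chunk.id);
    } else {
      plan.upsert.push(chunk.id);
    }
  }
  for (const chunk of previous) {
    if (!nextIds.has(chunk.id)) plan.remove.push(chunk.id);
  }
  return plan;
}
```

Because the ID includes the chunk index, inserting text near the top of an article shifts the content of later chunks and marks them as changed. The plan stays correct in that case; it is only less incremental.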
Hybrid Retrieval: From Theory to Code
The worker should not force all queries through one retrieval method.
Vector search is strong for conceptual similarity. Keyword search is strong for exact operational signals. Hybrid retrieval combines them.
The query path is:
- embed the query;
- retrieve semantic candidates from Vectorize;
- run a keyword query against D1 FTS5;
- fuse both ranked lists with RRF;
- apply per-document caps;
- optionally rerank;
- build the final answer prompt from the selected evidence.
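The steps above can be sketched as one function with injected dependencies. Everything here is illustrative: `retrieveEvidence`, `RetrievalDeps`, and the `topK` and `finalK` defaults are assumptions, and the real worker may wire Vectorize and D1 differently.

```typescript
type Hit = { chunkId: string; documentId: string };

// Each stage is injected so the orchestration can be tested without live services.
type RetrievalDeps = {
  embed: (query: string) => Promise<number[]>;
  vectorSearch: (embedding: number[], topK: number) => Promise<Hit[]>;
  keywordSearch: (query: string, topK: number) => Promise<Hit[]>;
  fuse: (lists: Hit[][]) => Hit[];
  cap: (hits: Hit[]) => Hit[];
  rerank?: (query: string, hits: Hit[]) => Promise<Hit[]>;
};

async function retrieveEvidence(query: string, deps: RetrievalDeps, topK = 20, finalK = 6) {
  const embedding = await deps.embed(query);
  // Vector and keyword recall are independent, so they can run concurrently.
  const [vectorHits, keywordHits] = await Promise.all([
    deps.vectorSearch(embedding, topK),
    deps.keywordSearch(query, topK),
  ]);
  const fused = deps.cap(deps.fuse([vectorHits, keywordHits]));
  // Rerank is optional; without it the fused order is used as-is.
  const ranked = deps.rerank ? await deps.rerank(query, fused) : fused;
  return {
    evidence: ranked.slice(0, finalK),
    debug: {
      vectorHits: vectorHits.length,
      keywordHits: keywordHits.length,
      fused: fused.length,
    },
  };
}
```

Returning the per-stage counts alongside the evidence keeps each step visible in debug output.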
Reciprocal Rank Fusion
RRF is useful because it does not require scores from different systems to be directly comparable. It rewards items that rank well in one or more lists.
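type Candidate = {
  chunkId: string;
  documentId: string;
  rank: number; // position in its source list, assumed to start at 1
  source: "vector" | "keyword";
};

type FusedCandidate = Candidate & { score: number; sources: Candidate["source"][] };

function fuseWithRrf(lists: Candidate[][], k = 60): FusedCandidate[] {
  const byChunk = new Map<string, FusedCandidate>();
  for (const list of lists) {
    for (const item of list) {
      const current = byChunk.get(item.chunkId);
      // Each list contributes 1 / (k + rank); contributions add up across lists.
      const score = (current?.score ?? 0) + 1 / (k + item.rank);
      // Track every contributing source so debug output can report "both".
      const sources = [...(current?.sources ?? []), item.source];
      byChunk.set(item.chunkId, { ...(current ?? item), score, sources });
    }
  }
  return [...byChunk.values()].sort((a, b) => b.score - a.score);
}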
This keeps the retrieval layer explainable. Debug reports can show which chunks came from vector recall, keyword recall, or both.
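A small worked example shows how the arithmetic plays out. This is a standalone sketch with made-up chunk IDs; `rrfScores` is a simplified helper that takes plain ID lists and treats list position as the 1-based rank.

```typescript
// Minimal RRF scoring over plain ID lists; list position is the rank (1-based).
function rrfScores(rankedLists: string[][], k = 60): Map<string, number> {
  const totals = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, position) => {
      totals.set(id, (totals.get(id) ?? 0) + 1 / (k + position + 1));
    });
  }
  return totals;
}

// Illustrative IDs, not real chunks.
const vectorList = ["chunk-a", "chunk-b", "chunk-c"];
const keywordList = ["chunk-c", "chunk-a"];

const fusedScores = rrfScores([vectorList, keywordList]);
const fusedOrder = [...fusedScores.entries()]
  .sort((x, y) => y[1] - x[1])
  .map(([id]) => id);

console.log(fusedOrder);
```

Here `chunk-a` scores 1/61 + 1/62 and `chunk-c` scores 1/63 + 1/61, so both outrank `chunk-b`, which only the vector list returned (1/62). The fused order is `chunk-a`, `chunk-c`, `chunk-b`.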
Per-Document Caps
Without caps, one long source document can dominate the final evidence set.
function capPerDocument<T extends { documentId: string }>(
  items: T[],
  maxChunksPerDocument = 2
) {
  const counts = new Map<string, number>();
  const output: T[] = [];
  for (const item of items) {
    const count = counts.get(item.documentId) ?? 0;
    if (count >= maxChunksPerDocument) continue;
    counts.set(item.documentId, count + 1);
    output.push(item);
  }
  return output;
}
This rule improves evidence diversity and makes the answer less likely to overfit one article.
FTS5 Query Construction
Keyword search should be conservative. User input must be normalized and escaped before it becomes an FTS query.
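function toFtsQuery(input: string) {
  return input
    .trim()
    .split(/\s+/)
    // Drop quote characters so a term cannot break out of its quoted string.
    .map((term) => term.replace(/["']/g, ""))
    // Skip empty and single-character terms.
    .filter((term) => term.length >= 2)
    // Quote each term so FTS5 treats it as a literal string, not as query syntax.
    .map((term) => `"${term}"`)
    .join(" OR ");
}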
In production code, this function should be backed by tests for punctuation, model names, numbers, multilingual queries, and empty input.
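As one example of the empty-input case, the sketch below guards the keyword query before it reaches the database. It assumes a D1-style `prepare().bind().all()` call chain, described here by a minimal structural interface so the block runs without Cloudflare types. The table name `chunks_fts` and its columns are hypothetical.

```typescript
// Minimal structural subset of a D1-style prepared-statement API.
interface StatementLike {
  bind(...values: unknown[]): StatementLike;
  all<T>(): Promise<{ results: T[] }>;
}
interface DatabaseLike {
  prepare(sql: string): StatementLike;
}

type KeywordRow = { chunk_id: string; document_id: string };

async function keywordSearch(
  db: DatabaseLike,
  ftsQuery: string,
  limit = 20
): Promise<KeywordRow[]> {
  // An empty MATCH expression is not a valid FTS5 query, so skip the call entirely.
  if (ftsQuery.trim() === "") return [];
  const { results } = await db
    .prepare(
      // `chunks_fts` and its columns are assumed names for this sketch.
      "SELECT chunk_id, document_id FROM chunks_fts WHERE chunks_fts MATCH ? ORDER BY rank LIMIT ?"
    )
    .bind(ftsQuery, limit)
    .all<KeywordRow>();
  return results;
}
```

Binding the query as a parameter keeps user input out of the SQL text. The quoting inside the FTS expression itself is still the job of the query builder.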
Intent Routing: Make High-Value Paths Explicit
Some user intents should not be treated as generic retrieval.
Examples:
- “summarize this page”;
- “how can I contact you”;
- “what is this site about”;
- “show AI-TDD related content”;
- “find articles about RAG indexing”;
- “what changed recently”;
- “help me understand this guide”.
Rule-based intent routing is a good first step because the behavior is inspectable and testable.
type Intent =
  | { type: "current_page_summary"; pageUrl: string }
  | { type: "site_search" }
  | { type: "contact" }
  | { type: "topic_lookup"; topic: string };

function detectIntent(query: string, pageUrl?: string): Intent {
  const normalized = query.toLowerCase();
  if (pageUrl && /\b(this page|current page|summarize)\b/.test(normalized)) {
    return { type: "current_page_summary", pageUrl };
  }
  if (/\b(contact|email|reach you)\b/.test(normalized)) {
    return { type: "contact" };
  }
  if (/\b(ai-tdd|rag|indexing|retrieval)\b/.test(normalized)) {
    return { type: "topic_lookup", topic: normalized };
  }
  return { type: "site_search" };
}
This is not a claim that regular expressions are the final answer. It is a claim that the first production boundary should be explicit, testable, and debuggable.
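One way to make that boundary easier to debug is to keep the rules in a table and return which rule fired. The sketch below is a hypothetical variant, not the worker's actual router; `RULES`, `routeWithTrace`, and the trace fields are illustrative names, and the patterns mirror the ones shown above.

```typescript
type RouteName = "current_page_summary" | "contact" | "topic_lookup" | "site_search";

type Rule = {
  name: Exclude<RouteName, "site_search">;
  pattern: RegExp;
  requiresPage?: boolean; // only applies when the widget sent a page URL
};

type RouteTrace = { route: RouteName; matchedRule: string | null; matchedText: string | null };

// Order matters: the first matching rule wins.
const RULES: Rule[] = [
  { name: "current_page_summary", pattern: /\b(this page|current page|summarize)\b/, requiresPage: true },
  { name: "contact", pattern: /\b(contact|email|reach you)\b/ },
  { name: "topic_lookup", pattern: /\b(ai-tdd|rag|indexing|retrieval)\b/ },
];

function routeWithTrace(query: string, pageUrl?: string): RouteTrace {
  const normalized = query.toLowerCase();
  for (const rule of RULES) {
    if (rule.requiresPage && !pageUrl) continue;
    const match = rule.pattern.exec(normalized);
    if (match) {
      return { route: rule.name, matchedRule: rule.name, matchedText: match[0] };
    }
  }
  // No rule matched: fall through to generic site search.
  return { route: "site_search", matchedRule: null, matchedText: null };
}

console.log(routeWithTrace("Summarize this page", "https://example.com/post"));
console.log(routeWithTrace("Summarize this page"));
```

The same query routes to `current_page_summary` when a page URL is present and to `site_search` when it is not, and the trace records which text triggered the rule.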
Current-Page Summaries: Keep Scope Narrow
“Summarize this page” is different from “search the site.”
If the widget passes pageUrl or page context, the worker should first look for chunks from that page. If no matching indexed content exists, it can use a controlled fallback:
- answer that the current page was not found in the index;
- ask the user to try a site-wide question;
- avoid pretending that unrelated site content is the current page.
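The scoped lookup and its controlled fallback can be sketched as follows. The names `selectCurrentPageEvidence` and `normalizePageUrl` are hypothetical, and so is the normalization rule, which ignores the query string, fragment, and trailing slash. A real index may key pages differently.

```typescript
type PageChunk = { url: string; chunkId: string; content: string };

type CurrentPageResult =
  | { status: "ok"; evidence: PageChunk[] }
  | { status: "page_not_indexed"; message: string };

// Assumed normalization: compare origin and path only, without a trailing slash.
// `new URL` throws on malformed input, so callers should validate pageUrl first.
function normalizePageUrl(raw: string): string {
  const url = new URL(raw);
  const path = url.pathname.replace(/\/+$/, "") || "/";
  return `${url.origin}${path}`;
}

function selectCurrentPageEvidence(
  index: PageChunk[],
  pageUrl: string,
  limit = 6
): CurrentPageResult {
  const target = normalizePageUrl(pageUrl);
  const evidence = index
    .filter((chunk) => normalizePageUrl(chunk.url) === target)
    .slice(0, limit);
  if (evidence.length === 0) {
    // Controlled fallback: report the miss instead of substituting other pages.
    return {
      status: "page_not_indexed",
      message:
        "The current page was not found in the index. Try asking a site-wide question instead.",
    };
  }
  return { status: "ok", evidence };
}
```

The result type forces the caller to handle the miss explicitly, so unrelated site content cannot silently stand in for the current page.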
The prompt should also differ:
function buildCurrentPagePrompt(question: string, evidence: string[]) {
  return `
You are answering from the current page only.
Do not use unrelated site-wide content.
If the provided evidence is insufficient, say that the current page evidence is insufficient.
Question:
${question}
Evidence:
${evidence.join("\n\n")}
`;
}
The key is scope discipline. Current-page mode is not a ranking preference. It is a different contract.
Rerank and Fallback
Reranking can improve final precision, but it must not become a single point of failure.
The worker should set a short timeout and fall back to fused RRF results when rerank fails:
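// withTimeout and callRerank are assumed to be defined elsewhere in the worker.
async function rerankWithFallback(candidates: Candidate[]) {
  try {
    return await withTimeout(callRerank(candidates), 1500);
  } catch {
    // On timeout or error, keep the top 6 candidates in fused RRF order.
    return candidates.slice(0, 6);
  }
}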
The fallback should be visible in debug output:
{
  "rerank": {
    "enabled": true,
    "used": false,
    "fallback": "rrf_top_6",
    "reason": "timeout"
  }
}
This makes degraded behavior auditable.
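The timeout helper and the debug record can be combined in one place. The following is a sketch under stated assumptions: `withTimeout` is written here as a plain `Promise` wrapper, the reranker is passed in as a function, and `rerankWithDebug` is a hypothetical name.

```typescript
// Rejects with Error("timeout") if `promise` does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), ms);
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}

type RerankDebug = { enabled: true; used: boolean; fallback?: string; reason?: string };

async function rerankWithDebug<T>(
  candidates: T[],
  callRerank: (items: T[]) => Promise<T[]>,
  timeoutMs = 1500,
  fallbackSize = 6
): Promise<{ results: T[]; debug: RerankDebug }> {
  try {
    const results = await withTimeout(callRerank(candidates), timeoutMs);
    return { results, debug: { enabled: true, used: true } };
  } catch (error) {
    // Distinguish a timeout from any other rerank failure in the debug record.
    const reason = error instanceof Error && error.message === "timeout" ? "timeout" : "error";
    return {
      results: candidates.slice(0, fallbackSize),
      debug: { enabled: true, used: false, fallback: `rrf_top_${fallbackSize}`, reason },
    };
  }
}
```

This wrapper stops waiting but does not cancel the underlying request. Cancelling it would need something like an `AbortController` passed to the rerank call.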
Answer Generation: Evidence First
The final prompt should make evidence boundaries explicit:
- answer only from provided sources;
- cite the selected sources;
- mark uncertainty when evidence is weak;
- return a no-answer state when required facts are missing;
- do not invent site policies, contact paths, or implementation details.
A useful answer-generation payload includes:
type AnswerContext = {
  question: string;
  intent: Intent["type"];
  evidence: Array<{
    title: string;
    url: string;
    chunkId: string;
    content: string;
  }>;
  fallbackPolicy: "no_answer_on_weak_evidence";
};
The answer model is the last step, not the source of truth. The retrieval and evidence payload define the boundary.
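A prompt builder can enforce part of that boundary before the model is called. The sketch below is illustrative only: `buildAnswerPrompt`, the `minEvidence` threshold, and the prompt wording are assumptions, and a real gate might also look at retrieval scores.

```typescript
type Evidence = { title: string; url: string; chunkId: string; content: string };

type PromptResult =
  | { kind: "no_answer"; reason: string }
  | { kind: "prompt"; prompt: string; citations: Array<{ label: string; url: string }> };

// `minEvidence` is an assumed threshold for this sketch.
function buildAnswerPrompt(
  question: string,
  evidence: Evidence[],
  minEvidence = 1
): PromptResult {
  // Ignore chunks with no usable text before applying the gate.
  const usable = evidence.filter((item) => item.content.trim().length > 0);
  if (usable.length < minEvidence) {
    return { kind: "no_answer", reason: "insufficient_evidence" };
  }
  // Number the sources so the model can cite them as [n].
  const sources = usable
    .map((item, i) => `[${i + 1}] ${item.title} (${item.url})\n${item.content}`)
    .join("\n\n");
  const prompt = [
    "Answer only from the numbered sources below and cite them as [n].",
    "If the sources do not contain the required facts, say so instead of guessing.",
    "",
    `Question: ${question}`,
    "",
    "Sources:",
    sources,
  ].join("\n");
  const citations = usable.map((item, i) => ({ label: `[${i + 1}]`, url: item.url }));
  return { kind: "prompt", prompt, citations };
}
```

Returning the citation labels alongside the prompt lets the response layer map `[n]` markers back to URLs without parsing the prompt.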
Testing and Verification
The retrieval implementation needs tests at several levels.
Chunking Tests
Test that chunking respects the size bound and produces unique IDs:
test("chunking respects the size bound and yields unique IDs", () => {
  const chunks = chunkDocument(sampleArticle);
  // 900 = CHUNK_SIZE plus the 100-character boundary lookahead.
  expect(chunks.every((chunk) => chunk.content.length <= 900)).toBe(true);
  expect(new Set(chunks.map((chunk) => chunk.id)).size).toBe(chunks.length);
});
Hybrid Retrieval Tests
Test that exact signals are not lost:
test("hybrid retrieval keeps exact error-code matches", async () => {
  const result = await retrieve("why did embedding return 429?");
  expect(result.debug.keywordHits.some((hit) => hit.content.includes("429"))).toBe(true);
});
Intent Tests
Test that current-page requests do not become site-wide answers:
test("current-page summary stays scoped", () => {
  const intent = detectIntent("summarize this page", "https://example.com/post");
  expect(intent.type).toBe("current_page_summary");
});
Summary
The implementation path has four practical lessons:
- Chunking is an evaluation-controlled contract, not a magic number.
- Hybrid retrieval is necessary because vector recall and exact keyword recall fail in different ways.
- Intent routing should be explicit for high-value paths such as current-page summaries.
- Rerank and model generation must have controlled fallback behavior.
The next part explains how to evaluate whether the system is good enough to expose as a public /chat entry point.