RAG System Architecture: Edge Runtime, Hybrid Retrieval, and Incremental Indexing
A practical architecture guide for a knowledge hub RAG assistant using Cloudflare Workers, Vectorize, D1 FTS5, KV, hybrid retrieval, and incremental indexing.
Introduction: Retrieval Is Often the Real Bottleneck
The difficult part of a public RAG system is not connecting a vector database to a model. The difficult part is making the answer stay inside the content boundary that the site can actually support.
In a knowledge hub assistant, a user may ask, “How does AI-TDD govern AI output?” A general model may know test-driven development, but it does not automatically know how this site connects Manifest contracts, gates, trace matrices, and delivery evidence. That missing site context is the gap RAG is supposed to close.
RAG matters because the model context window is limited while the available knowledge hub keeps growing. Retrieval-Augmented Generation injects selected external evidence into the model context so that answers are grounded in site content instead of relying only on the model’s general memory.
This guide reviews the architecture of a knowledge hub RAG assistant built around Cloudflare Workers, Vectorize, D1 FTS5, KV, hybrid retrieval, incremental indexing, and release gates. The design is based on the engineering baseline for this site. It is not presented as a universal enterprise reference architecture.
Scope: this guide fits personal knowledge hub Q&A, public content Q&A, documentation sites, and small knowledge hubs. It does not cover multi-tenant authorization, internal confidential documents, compliance reporting, or strict row-level permission filtering. Those concerns require additional identity, policy, audit, and governance layers.
Position in the series: this first part explains system boundaries and the indexing loop. The second part goes into chunking, hybrid retrieval, and intent routing. The third part covers quality evaluation, safety controls, and release gates.
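Start with the overall shape: the static site embeds a widget that calls the Worker's /chat endpoint. The Worker queries Vectorize for semantic candidates and D1 FTS5 for keyword candidates, keeps rate-limit and budget state in KV, and calls model providers for embedding, optional rerank, and answer generation.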
Architecture Choice: Why Not a Traditional Server Stack?
Before settling on an edge-first design, there were two reasonable directions.
Traditional server stack: run an application service on VMs or containers, add OpenSearch or Elasticsearch, add Redis, and manage the retrieval service yourself. The control surface is familiar, but operations expand quickly: search cluster maintenance, cache tuning, scaling, regional latency, and deployment coordination.
Edge-first stack: use Cloudflare Workers plus Vectorize, D1, and KV. The operational surface is smaller, but cost and capability depend on request volume, vector count, model calls, account plan, region, and provider limits.
For a knowledge hub, the edge-first stack is a better starting point because the traffic is public, the data is public, and the write path can be controlled by release scripts instead of a live editorial backend.
Benefit 1: The Public Entry Point Is Close to Readers
Workers are useful because they let the /chat entry point run near the public traffic path. A blog hosted as static files can remain static, while the RAG widget calls a worker endpoint.
That gives three useful boundaries:
- the static site does not need to become a full application server;
- Worker bindings can keep Vectorize, D1, and KV access inside one runtime boundary;
- small bursty traffic does not require a permanent self-hosted service.
Benefit 2: Cost Elasticity for Public Knowledge Hubs
Cloudflare’s managed services are a practical way to validate a small public RAG assistant. The important caveat is that “managed” does not mean “free in production.”
The actual cost model includes:
- Worker requests and CPU time;
- Vectorize storage and query volume;
- D1 reads, writes, and storage;
- KV operations for counters and safety state;
- embedding, rerank, and answer-generation calls from model providers.
The architecture should therefore keep a clear cost boundary: public chat can be enabled only after dry-run indexing, quality checks, rate limits, budgets, and fallback behavior are verified.
Benefit 3: Low-Intrusion Integration With a Static Site
A static site can attach the assistant through a widget:
<script
  src="https://rag-worker.example.com/widget.js"
  data-endpoint="https://rag-worker.example.com">
</script>
That integration is simple, but the security boundary is not optional. Production rollout still needs origin allowlists, CORS headers, rate limits, a public-chat switch, and failure behavior that is safe by default.
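The origin allowlist and CORS headers mentioned above can be sketched as a small pure function. This is a minimal illustration, not the site's actual code: the `ALLOWED_ORIGINS` set, the `https://blog.example.com` origin, and the `corsHeadersFor` name are all assumptions made for the example.

```typescript
// Hypothetical allowlist; real origins would come from Worker configuration.
const ALLOWED_ORIGINS = new Set<string>(["https://blog.example.com"]);

/**
 * Returns CORS headers for an allowed origin, or null when the request
 * should be rejected. A missing Origin header is treated as not allowed.
 */
function corsHeadersFor(
  origin: string | null,
  allowed: ReadonlySet<string> = ALLOWED_ORIGINS
): Record<string, string> | null {
  if (origin === null || !allowed.has(origin)) return null;
  return {
    "Access-Control-Allow-Origin": origin, // echo the exact origin, never "*"
    "Access-Control-Allow-Methods": "POST, OPTIONS",
    "Access-Control-Allow-Headers": "Content-Type",
    Vary: "Origin", // caches must not reuse this response across origins
  };
}
```

Returning null instead of a permissive default keeps the check safe by default: the caller has to handle the rejection path explicitly before any downstream work starts.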
Data Layer: Why Three Stores Instead of One?
A common early mistake is to force every RAG responsibility into a single database. That makes the implementation look simpler, but it mixes three different workloads:
- semantic nearest-neighbor retrieval;
- transactional metadata and keyword search;
- lightweight runtime counters and circuit-breaker state.
The cleaner design is to separate those responsibilities.
| Layer | Responsibility | Why It Exists |
|---|---|---|
| Vectorize | semantic recall | Finds conceptually similar chunks from embeddings. |
| D1 SQLite + FTS5 | manifest, metadata, keyword recall | Stores authoritative index metadata and exact-match searchable text. |
| KV | rate limits, daily budget, circuit state | Keeps fast mutable counters outside the retrieval corpus. |
Vectorize: Semantic Recall
Vectorize should answer one question: which indexed chunks are semantically close to this query?
It should not be treated as the authoritative source for all document metadata. The vector store is optimized for vector similarity. The authoritative manifest belongs in D1, where the indexer can reason about source URL, content hash, chunk hash, locale, update status, and latest indexing state.
The vector payload should remain compact:
interface VectorMetadata {
  chunkId: string;
  documentId: string;
  url: string;
  title: string;
  locale: "zh-CN" | "en";
  contentHash: string;
}
The embedding vector answers “similar to what?” The metadata only gives enough identity for the worker to join, cite, deduplicate, and debug.
D1 FTS5: Transactional Metadata and Keyword Search
D1 plays two roles in this architecture.
First, it is the manifest store. The indexer needs to know what was previously indexed, what changed, what can be updated in place, what must be deleted, and what requires re-embedding.
Second, it provides keyword retrieval through FTS5. This is important because vector retrieval can underperform on exact signals such as:
- error codes like 429;
- model identifiers like text-embedding-v4;
- configuration names such as chunkSize=800;
- route names, command names, and file names.
A minimal chunk table can look like this:
CREATE TABLE rag_chunks (
  chunk_id TEXT PRIMARY KEY,
  document_id TEXT NOT NULL,
  url TEXT NOT NULL,
  title TEXT NOT NULL,
  locale TEXT NOT NULL,
  content TEXT NOT NULL,
  content_hash TEXT NOT NULL,
  updated_at TEXT NOT NULL
);
CREATE VIRTUAL TABLE rag_chunks_fts
  USING fts5(chunk_id UNINDEXED, title, content);
The important design point is not this exact schema. The important point is that D1 owns the state that must be auditable and reproducible.
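To make the keyword path concrete, here is one possible way to query the FTS5 table above from user text. It is a sketch under assumptions: `KEYWORD_SQL`, `toFtsQuery`, and the `maxTerms` limit are hypothetical names and values, and combining terms with OR is only one of several reasonable choices.

```typescript
// Query shape assumed for the rag_chunks_fts table above. bm25() is FTS5's
// built-in ranking function; lower values mean a better match.
const KEYWORD_SQL = `
  SELECT chunk_id, bm25(rag_chunks_fts) AS score
  FROM rag_chunks_fts
  WHERE rag_chunks_fts MATCH ?
  ORDER BY score
  LIMIT ?`;

/**
 * Turns free-form user text into an FTS5 MATCH expression. Each whitespace-
 * separated term is wrapped in double quotes (inner quotes doubled) so that
 * input such as chunkSize=800 is treated as literal text, not query syntax.
 */
function toFtsQuery(input: string, maxTerms = 8): string {
  const terms = input
    .split(/\s+/)
    .filter((t) => t.length > 0)
    .slice(0, maxTerms)
    .map((t) => `"${t.replace(/"/g, '""')}"`);
  return terms.join(" OR ");
}
```

An empty input produces an empty expression, so a caller would likely skip the keyword query in that case and rely on the vector candidates alone.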
KV: Runtime Counters and Circuit State
KV is a good fit for small mutable state that should not live inside the retrieval corpus:
- request counters;
- daily budget counters;
- circuit-breaker state;
- temporary safety state for degraded runtime behavior.
Those values should be cheap to read and update, but they do not need the same lifecycle as indexed documents. The hard public-chat switch still belongs in Worker configuration, not in the retrieval corpus.
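A daily budget counter can be reduced to two small pure functions, with the KV read and write left to the caller. The key format, the function names, and the rule that unreadable values block the request are assumptions for this sketch rather than the site's documented behavior.

```typescript
/** Builds a per-day key such as "budget:2025-01-31" (UTC). Key format is hypothetical. */
function budgetKey(now: Date): string {
  return `budget:${now.toISOString().slice(0, 10)}`;
}

/**
 * Decides whether one more downstream call fits in the daily budget.
 * `stored` is the raw string read from KV (null when the key is absent).
 * Unparseable values are treated as "over budget" so the gate fails closed.
 */
function checkDailyBudget(
  stored: string | null,
  dailyLimit: number
): { allowed: boolean; nextValue: string } {
  const used = stored === null ? 0 : Number.parseInt(stored, 10);
  if (!Number.isFinite(used) || used < 0 || used >= dailyLimit) {
    return { allowed: false, nextValue: stored ?? "0" };
  }
  return { allowed: true, nextValue: String(used + 1) };
}
```

In a Worker, the caller would read the current value with the KV binding's `get`, apply `checkDailyBudget`, and write `nextValue` back with `put`. Because KV is eventually consistent and has no atomic increment, a counter like this is approximate under concurrent requests, which is usually acceptable for a soft budget but not for exact accounting.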
Hybrid Retrieval: Why Vector Search Is Not Enough
Pure vector retrieval works well when the query is conceptual. It is weaker when the query contains exact identifiers or operational signals.
For example, a question about “429 during embedding” should not depend only on semantic similarity. It should also reward chunks that contain the literal error code, the affected endpoint, or the relevant rate-limit explanation.
The retrieval layer therefore uses multiple signals:
- Vectorize returns semantic candidates.
- D1 FTS5 returns keyword candidates.
- The worker fuses results with Reciprocal Rank Fusion.
- The worker applies a per-document cap.
- Optional rerank narrows the final evidence set.
A simple RRF implementation is enough to explain the mechanism:
type RankedItem = {
  chunkId: string;
  rank: number;
  source: "vector" | "keyword";
};
function reciprocalRankFusion(lists: RankedItem[][], k = 60) {
  const scores = new Map<string, number>();
  for (const list of lists) {
    for (const item of list) {
      const current = scores.get(item.chunkId) ?? 0;
      scores.set(item.chunkId, current + 1 / (k + item.rank));
    }
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([chunkId, score]) => ({ chunkId, score }));
}
The benefit is practical: exact matches and semantic matches can reinforce each other without pretending that one retrieval method is always superior.
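The reinforcement effect is easier to see with numbers. The sketch below assumes ranks are 1-based (the first result in a list has rank 1) and uses hypothetical chunk IDs; `toRanked` and `rrfTerm` are helper names introduced only for this illustration.

```typescript
type RankedItem = {
  chunkId: string;
  rank: number;
  source: "vector" | "keyword";
};

/** Converts a best-first list of chunk IDs into 1-based ranked items. */
function toRanked(chunkIds: string[], source: RankedItem["source"]): RankedItem[] {
  return chunkIds.map((chunkId, index) => ({ chunkId, rank: index + 1, source }));
}

/** Contribution of a single list position to the fused score. */
function rrfTerm(rank: number, k = 60): number {
  return 1 / (k + rank);
}

// Hypothetical chunk IDs: "c2" appears in both lists, "c1" and "c9" in one each.
const vectorHits = toRanked(["c1", "c2"], "vector");
const keywordHits = toRanked(["c2", "c9"], "keyword");

// c2: 1/(60+2) + 1/(60+1); c1: 1/(60+1) only, so c2 ranks first after fusion.
const c2Score = rrfTerm(2) + rrfTerm(1);
const c1Score = rrfTerm(1);
```

With k = 60, a chunk that appears in both lists roughly doubles its score, so agreement between the two retrievers outweighs a single first-place result in either list.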
Per-Document Caps Prevent Topic Monopoly
Long articles can dominate retrieval if the system blindly returns many adjacent chunks from the same source. That often makes answers appear confident while narrowing the evidence set too much.
The worker should cap how many chunks a single document can contribute:
function applyDocumentCap<T extends { documentId: string }>(
  items: T[],
  maxPerDocument = 2
) {
  const counts = new Map<string, number>();
  const output: T[] = [];
  for (const item of items) {
    const count = counts.get(item.documentId) ?? 0;
    if (count >= maxPerDocument) continue;
    counts.set(item.documentId, count + 1);
    output.push(item);
  }
  return output;
}
This is not just a ranking trick. It is an evidence-diversity rule.
Incremental Indexing: From Full Rebuilds to Controlled Updates
Early RAG systems often use full rebuilds because they are easy to reason about. Full rebuilds are acceptable for prototypes, but they become wasteful and risky once the corpus grows.
The incremental indexer should distinguish at least five states:
| State | Meaning | Action |
|---|---|---|
| unchanged | content hash and metadata did not change | no write |
| metadata-only | URL, title, or routing metadata changed | update metadata without re-embedding |
| changed | content hash changed | re-chunk and re-embed |
| added | new source document | insert chunks and vectors |
| removed | source disappeared from public output | delete chunks and vectors |
This state model matters because not every site change should trigger expensive embedding work. A domain migration may need metadata updates. A corrected paragraph needs new chunks. A deleted page needs cleanup.
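The five states in the table can be derived from a simple comparison between the previous manifest entry and the current export. The `ManifestEntry` shape and the `classify` function below are assumptions for illustration; the real manifest may track more fields.

```typescript
type IndexState = "unchanged" | "metadata-only" | "changed" | "added" | "removed";

// Hypothetical manifest shape: one entry per source document.
interface ManifestEntry {
  contentHash: string;
  url: string;
  title: string;
  locale: string;
}

/**
 * Compares the previously indexed entry with the current export.
 * `previous` is undefined for new documents; `current` is undefined when
 * the document no longer appears in the public output.
 */
function classify(
  previous: ManifestEntry | undefined,
  current: ManifestEntry | undefined
): IndexState {
  if (!previous && !current) throw new Error("nothing to classify");
  if (!previous) return "added";
  if (!current) return "removed";
  if (previous.contentHash !== current.contentHash) return "changed";
  const metadataChanged =
    previous.url !== current.url ||
    previous.title !== current.title ||
    previous.locale !== current.locale;
  return metadataChanged ? "metadata-only" : "unchanged";
}
```

Checking the content hash before the metadata fields means a document that changed both its text and its URL is classified as changed, which is the more expensive but safer path.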
Two Dry Runs Before Writes
Writes should not be the first time the system discovers the diff.
The safer release path is:
- export the public dist corpus;
- run a local dry run and inspect diff plus estimated writes;
- run a worker-authoritative dry run against the remote state;
- apply incremental ingest;
- verify manifest, stats, and answer-quality checks.
This gives both local visibility and remote authority. The local script can catch obvious mistakes. The worker dry run confirms the state that the live bindings actually see.
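One way to make a dry run inspectable is to summarize the planned states and the embedding work they imply before anything is written. The `PlannedDocument` shape, the `summarizeDryRun` name, and the `maxChunksToEmbed` threshold are hypothetical and only illustrate the idea.

```typescript
type IndexState = "unchanged" | "metadata-only" | "changed" | "added" | "removed";

interface PlannedDocument {
  documentId: string;
  state: IndexState;
  chunkCount: number; // chunks in the new export (0 for removed documents)
}

interface DryRunSummary {
  counts: Record<IndexState, number>;
  chunksToEmbed: number; // only changed and added documents need embeddings
  withinLimit: boolean;
}

/** Summarizes a planned ingest without performing any write. */
function summarizeDryRun(plan: PlannedDocument[], maxChunksToEmbed: number): DryRunSummary {
  const counts: Record<IndexState, number> = {
    unchanged: 0, "metadata-only": 0, changed: 0, added: 0, removed: 0,
  };
  let chunksToEmbed = 0;
  for (const doc of plan) {
    counts[doc.state] += 1;
    if (doc.state === "changed" || doc.state === "added") chunksToEmbed += doc.chunkCount;
  }
  return { counts, chunksToEmbed, withinLimit: chunksToEmbed <= maxChunksToEmbed };
}
```

If the local and worker-side dry runs both produce a summary of this kind, a mismatch between the two is a useful signal that the local view of the manifest has drifted from the remote state.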
Metadata-Only Updates Are a First-Class Path
When only metadata changes, re-embedding is unnecessary and can even be harmful because it creates avoidable churn.
Examples:
- canonical URL changes after a domain migration;
- title or locale metadata is corrected;
- route prefixes change between Chinese and English content;
- source attribution needs to point to a new public path.
The indexer should be able to update those fields without changing vector content.
Chunk Boundaries Belong in Architecture, Not Only in Code
Chunk size is often treated as a magic constant. That is a mistake.
The architecture needs to freeze the validation loop, not merely the number. A baseline such as 800 characters with 200 characters of overlap is useful only if it is tied to evaluation evidence.
The minimum contract should answer:
- Which corpus was used to choose the baseline?
- Which evaluation questions were tested?
- Which failure cases appeared when chunks were too small?
- Which failure cases appeared when chunks were too large?
- Which command proves the current index was built with the expected parameters?
In other words, chunking is not a local utility detail. It affects retrieval quality, cost, evidence traceability, and release confidence.
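For reference, a character-window chunker using the 800/200 baseline can be very small. This is a deliberately naive sketch, not the chunker this series describes in its second part: it ignores headings and sentence boundaries, and the `chunkingBaseline` record is a hypothetical way to store the parameters alongside a build.

```typescript
/**
 * Splits text into fixed-size windows with overlap. Sizes are counted in
 * UTF-16 code units, which is what String.prototype.slice operates on.
 * Defaults mirror the 800/200 baseline mentioned above; they are a starting
 * point to validate, not a recommendation.
 */
function chunkText(text: string, size = 800, overlap = 200): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("require size > 0 and 0 <= overlap < size");
  }
  const step = size - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}

/** Parameters recorded with each index build so they can be verified later. */
const chunkingBaseline = { size: 800, overlap: 200, unit: "characters" } as const;
```

Recording the parameters next to the build output is what turns the number into a contract: an evaluation run can then state which baseline it was measured against.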
Public Entry Boundary: Fail Closed by Default
A public /chat endpoint must assume failure will happen.
The safe default is fail-closed:
- if public chat is disabled, return a controlled message;
- if origin is not allowed, reject the request;
- if daily budget is exceeded, stop downstream model calls;
- if retrieval returns weak evidence, answer with a scoped fallback;
- if model generation fails, return a clear no-answer state instead of fabricating.
The key rule is simple: a public assistant is allowed to be unavailable, but it should not be confidently wrong because an upstream dependency failed.
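The first three checks in the list above are cheap and can run before any retrieval or model call. A minimal sketch of that ordering follows; the `GateInput` fields and reason strings are assumptions, and the weak-evidence and generation-failure cases are left out because they occur later in the request.

```typescript
type GateDecision =
  | { proceed: true }
  | { proceed: false; reason: "chat-disabled" | "origin-not-allowed" | "budget-exceeded" };

interface GateInput {
  publicChatEnabled: boolean; // hard switch from Worker configuration
  originAllowed: boolean;
  budgetRemaining: number; // downstream calls still allowed today
}

/** Runs the cheap checks in order and stops at the first failure. */
function evaluateGates(input: GateInput): GateDecision {
  if (!input.publicChatEnabled) return { proceed: false, reason: "chat-disabled" };
  if (!input.originAllowed) return { proceed: false, reason: "origin-not-allowed" };
  // Written as a negated comparison so that NaN also fails closed.
  if (!(input.budgetRemaining > 0)) return { proceed: false, reason: "budget-exceeded" };
  return { proceed: true };
}
```

Every branch that is not an explicit pass returns a refusal, so a missing or malformed input cannot accidentally open the endpoint.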
Telemetry Is Part of the Architecture Boundary
Telemetry should help diagnose failures without turning the assistant into a data collection surface.
At minimum:
- do not log raw private user input unless there is a deliberate retention policy;
- store failure categories and rule IDs before storing full prompts;
- redact obvious secrets and tokens;
- keep debug reports tied to release evidence rather than ad hoc screenshots;
- document which telemetry fields are required for quality gates.
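A telemetry record that follows these rules might store a failure category and rule ID, plus a short redacted detail string instead of the full prompt. The patterns, the event fields, the category names, and the 200-character cap below are illustrative assumptions, not a complete secret-detection scheme.

```typescript
// Illustrative patterns only; a real deployment would maintain its own list.
const SECRET_PATTERNS: RegExp[] = [
  /\bBearer\s+[A-Za-z0-9._~+\/-]+=*/g, // HTTP bearer tokens
  /\bsk-[A-Za-z0-9_-]{16,}/g, // API keys with a common "sk-" prefix
];

/** Replaces anything matching a known secret pattern with a fixed marker. */
function redactSecrets(text: string): string {
  return SECRET_PATTERNS.reduce((out, re) => out.replace(re, "[REDACTED]"), text);
}

// Hypothetical event shape: categories and rule IDs first, no raw prompt.
interface TelemetryEvent {
  timestamp: string;
  failureCategory: "retrieval-weak" | "model-error" | "budget" | "origin" | "none";
  ruleId: string | null;
  detail: string; // short, already redacted
}

function buildEvent(
  failureCategory: TelemetryEvent["failureCategory"],
  ruleId: string | null,
  detail: string,
  now: Date = new Date()
): TelemetryEvent {
  return {
    timestamp: now.toISOString(),
    failureCategory,
    ruleId,
    detail: redactSecrets(detail).slice(0, 200),
  };
}
```

Redaction runs before truncation so that a secret straddling the cut-off point is replaced as a whole rather than partially stored.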
Pre-Release Metrics Worth Recording
Before enabling the public entry point, record a small set of metrics:
- corpus document count;
- chunk count by locale;
- vector count;
- D1 manifest count;
- deleted, added, changed, and metadata-only counts for the latest ingest;
- local dry-run result;
- remote dry-run result;
- answer-quality evaluation result;
- safety gate configuration;
- final ingest verification result.
These metrics make later troubleshooting possible. Without them, a future retrieval failure becomes guesswork.
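Some of these metrics can be cross-checked mechanically before release. The sketch below assumes one vector per chunk and reads the D1 manifest count as a count of chunk rows; both are assumptions, and the `ReleaseMetrics` shape is hypothetical.

```typescript
interface ReleaseMetrics {
  documentCount: number;
  chunkCountByLocale: Record<string, number>;
  vectorCount: number;
  manifestChunkCount: number; // chunk rows recorded in D1
}

/**
 * Returns human-readable problems; an empty array means the counts agree.
 * Assumes exactly one vector per chunk, which may not hold for every design.
 */
function checkIndexConsistency(m: ReleaseMetrics): string[] {
  const problems: string[] = [];
  const totalChunks = Object.values(m.chunkCountByLocale).reduce((a, b) => a + b, 0);
  if (totalChunks !== m.manifestChunkCount) {
    problems.push(`locale totals (${totalChunks}) != D1 manifest (${m.manifestChunkCount})`);
  }
  if (m.vectorCount !== m.manifestChunkCount) {
    problems.push(`vectors (${m.vectorCount}) != D1 manifest (${m.manifestChunkCount})`);
  }
  if (m.documentCount > 0 && totalChunks === 0) {
    problems.push("documents exist but no chunks were produced");
  }
  return problems;
}
```

A non-empty result is a reason to stop before enabling public chat, since a vector without a manifest row (or the reverse) tends to surface later as a missing citation or a stale answer.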
Evolution Path
The current architecture is intentionally modest. It can evolve in several directions:
- query-aware reranking for higher precision;
- per-topic evaluation sets;
- multilingual retrieval with locale-aware fallback;
- richer citation scoring;
- separate admin workflows for reindex, repair, and rollback;
- stronger privacy-preserving telemetry;
- human review of failed high-value questions.
The important part is to evolve from a stable baseline rather than from an uncontrolled prototype.
Summary
The architecture has three core decisions:
- Run the public entry point at the edge, but keep safety gates in front of downstream work.
- Split storage responsibilities across Vectorize, D1 FTS5, and KV instead of forcing all workloads into one layer.
- Treat indexing as a release loop: export, dry-run, ingest, verify, and only then enable public use.
The next part moves from architecture to implementation details: chunking, stable chunk IDs, hybrid retrieval, intent routing, current-page summaries, and fallback behavior.