
Personal Site LLM Chat with RAG: Architecture, Tradeoffs, and Production Behavior

I added a retrieval-grounded chat interface to /chat on this site. The assistant answers only from indexed public content (posts, projects, books, albums, photos, resume when enabled, and public semantic notes), and it links citations back to real pages when sources have URL-backed entries.

This write-up is the technical version: concrete request flow, retrieval/scoring behavior, caching design, SSE protocol, persistence model, and failure handling. If you're building similar "small but real" AI features, this is the stuff that actually matters.

This is part of the same direction I started in What Even Is This: I am constantly improving the site and building new experiments, and I will keep writing blog posts about how these features evolve.

Problem Definition and Constraints

Personal sites are easy to browse but awkward to query. I wanted users to ask normal questions like "What did you write about Django performance?" and get a grounded answer fast.

I set a few hard constraints up front: answers must come only from indexed content, responses must feel fast, and the assistant must say so when it lacks context.

Request Lifecycle (Both Paths)

The chat feature has two endpoints: a non-streaming request/response endpoint (the SSR baseline) and a streaming SSE endpoint.

High-level flow:

  1. Gate checks: feature flag, Turnstile (when enabled), semantic search enabled.
  2. Try answer cache (question + conversation history signature).
  3. On miss, load retrieval context (with its own cache keyed by normalized question).
  4. Build prompt from bounded conversation history + bounded source blocks.
  5. Call model (streaming or non-streaming), then persist session + DB + cache.
  6. Render citations as links when URL-backed sources exist.

This gives me one reliable SSR baseline and one richer streaming path without splitting product behavior into two different systems.
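Steps 2 through 5 of the flow above can be sketched as a small orchestration function. This is an illustrative sketch, not the site's real code: names like `handle_chat`, `answer_cache`, and the key formats are assumptions; the real implementation uses proper cache backends and providers.

```python
# Hypothetical sketch of the request lifecycle; dicts stand in for real caches.
import hashlib
import json

answer_cache = {}   # keyed by question + conversation-history signature
context_cache = {}  # keyed by the normalized question alone

def _answer_key(question, history):
    sig = hashlib.sha256(json.dumps([question, history]).encode()).hexdigest()
    return f"chat:answer:{sig}"

def _context_key(question):
    return f"chat:context:{question.strip().lower()}"

def handle_chat(question, history, retrieve, call_model):
    """Steps 2-5: answer cache, context cache, prompt build + model call."""
    akey = _answer_key(question, history)
    if akey in answer_cache:                      # step 2: answer-cache hit
        return answer_cache[akey], "answer-cache"
    ckey = _context_key(question)
    if ckey not in context_cache:                 # step 3: retrieval, cached
        context_cache[ckey] = retrieve(question)
    answer = call_model(question, history, context_cache[ckey])  # steps 4-5
    answer_cache[akey] = answer
    return answer, "model"
```

Note the asymmetry this encodes: a changed conversation history forces a fresh answer, but retrieval context for the same question is still reused.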

Retrieval Pipeline: Hybrid Search, Not Guessing

Retrieval comes from SemanticDocument and uses hybrid ranking (Postgres text rank + vector similarity). The combined score is weighted and thresholded, then top results are used to build prompt context.

combined_score = (rank * SEMANTIC_SEARCH_RANK_WEIGHT)
               + (vector_score * SEMANTIC_SEARCH_VECTOR_WEIGHT)

Default tuning:
- RAG_TOP_K = 8
- RAG_MAX_CONTEXT_CHARS = 9000
- SEMANTIC_SEARCH_RANK_WEIGHT = 0.45
- SEMANTIC_SEARCH_VECTOR_WEIGHT = 0.55
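The combine-threshold-truncate step looks roughly like this. The weights and top-k mirror the defaults above; the threshold value and the document tuple shape are assumptions for illustration.

```python
# Illustrative hybrid-score combine using the documented defaults.
RANK_WEIGHT = 0.45     # SEMANTIC_SEARCH_RANK_WEIGHT
VECTOR_WEIGHT = 0.55   # SEMANTIC_SEARCH_VECTOR_WEIGHT
TOP_K = 8              # RAG_TOP_K

def rank_documents(docs, min_score=0.15):
    """docs: [(doc_id, text_rank, vector_score)] -> top-k by combined score."""
    scored = [
        (doc_id, rank * RANK_WEIGHT + vec * VECTOR_WEIGHT)
        for doc_id, rank, vec in docs
    ]
    kept = [d for d in scored if d[1] >= min_score]  # threshold weak matches
    kept.sort(key=lambda d: d[1], reverse=True)
    return kept[:TOP_K]
```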

Context is assembled into numbered blocks so citations can map directly:

[1] Title
URL: https://example
Content: snippet...
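Assembling those numbered blocks under the character budget can be sketched like this; the field names and the greedy stop-at-budget behavior are assumptions, though the budget constant mirrors `RAG_MAX_CONTEXT_CHARS`.

```python
# Minimal sketch of numbered-block context assembly under a character budget.
MAX_CONTEXT_CHARS = 9000  # RAG_MAX_CONTEXT_CHARS

def build_context(docs):
    """docs: [{'title','url','snippet'}] -> numbered source blocks."""
    blocks, used = [], 0
    for i, doc in enumerate(docs, start=1):
        block = f"[{i}] {doc['title']}\nURL: {doc['url']}\nContent: {doc['snippet']}"
        if used + len(block) > MAX_CONTEXT_CHARS:
            break  # stop before exceeding the prompt budget
        blocks.append(block)
        used += len(block)
    return "\n\n".join(blocks)
```

Because block numbers are assigned here, the model's `[n]` citations map back to sources by position with no extra bookkeeping.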

If no relevant documents are found, the assistant says that directly rather than fabricating an answer.

Prompt Contract and Citation Semantics

The system prompt is intentionally strict: answer using only provided sources, and explicitly say "I don't know" when sources do not support the answer.

Citation markers are constrained to direct quotes. That keeps the output readable and avoids noisy "[1][2][3]" citation spam on every sentence.

Both backend and frontend linkify citation markers like [1]. Source entries missing title or URL are filtered out before rendering, so the UI never emits dead citation links.
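A sketch of the linkify-and-filter step, assuming sources are a positional list of dicts; the function name and HTML shape are illustrative, not the site's actual renderer.

```python
import re

def linkify_citations(answer_text, sources):
    """Turn [n] markers into links for URL-backed sources; leave others as text."""
    usable = {
        i: s for i, s in enumerate(sources, start=1)
        if s.get("title") and s.get("url")  # filter dead entries up front
    }
    def repl(match):
        n = int(match.group(1))
        source = usable.get(n)
        if source is None:
            return match.group(0)  # no URL-backed source: keep plain [n]
        return f'<a href="{source["url"]}">[{n}]</a>'
    return re.sub(r"\[(\d+)\]", repl, answer_text)
```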

Streaming Protocol with SSE

Streaming responses are sent as text/event-stream with explicit event types:

event: token
data: {"text":"partial answer"}

event: sources
data: [{"title":"...", "url":"...", "snippet":"..."}]

event: error
data: {"message":"..."}

event: done
data: {}
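The event frames above can be produced by a small generator. This is a sketch under assumptions: the function names are hypothetical, and the real endpoint wraps something like this in a streaming HTTP response.

```python
import json

def sse_event(event, data):
    """Format one text/event-stream frame matching the protocol above."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def stream_answer(token_iter, sources):
    """Yield token events, then sources, then done; errors become error events."""
    try:
        for text in token_iter:
            yield sse_event("token", {"text": text})
        yield sse_event("sources", sources)
    except Exception as exc:
        yield sse_event("error", {"message": str(exc)})
    yield sse_event("done", {})
```

Emitting `done` in all paths lets the frontend tear down its "Thinking..." state deterministically, whether the stream succeeded or failed.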

The frontend state machine starts with a "Thinking..." assistant bubble, appends token chunks, then retrofits citation anchors after the sources event arrives. It also handles stream failures with a visible error message while preserving the rest of the conversation UI.

One detail I like: cached answers are still emitted as chunked token events (fixed-size chunks), so repeated queries feel consistent instead of jarringly instant.
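The cached-replay chunking is trivial but worth showing; the chunk size here is an assumed value, not the site's actual constant.

```python
def chunk_cached_answer(answer, chunk_size=24):
    """Split a cached answer into fixed-size chunks so replays stream
    as token events instead of arriving jarringly all at once."""
    return [answer[i:i + chunk_size] for i in range(0, len(answer), chunk_size)]
```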

Caching Strategy: Split by Responsibility

I use separate caches: one for final answers, keyed by the question plus a conversation-history signature, and one for retrieval context, keyed by the normalized question alone.

Each cache has its own default TTL.

The important design choice is that context reuse is allowed across conversation states, but final answer reuse is not unless history matches. That avoids stale conversational answers while still reducing retrieval overhead.

State and Persistence Model

Conversation state is intentionally dual-layer: a session-backed layer for fast access plus durable rows in the database.

ChatConversation can attach to session key, authenticated user, and request fingerprint; ChatMessage stores role, text, sources JSON, and an is_error flag. History is trimmed to a bounded turn window so prompt size stays predictable.
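The bounded turn window can be sketched in a couple of lines; the window size is an assumption, and the message shape is simplified from the real `ChatMessage` model.

```python
MAX_TURNS = 6  # assumed bound; the real turn window may differ

def trim_history(messages, max_turns=MAX_TURNS):
    """Keep only the most recent turns so prompt size stays predictable.
    `messages` alternate user/assistant roles, oldest first."""
    return messages[-2 * max_turns:]  # one turn = user + assistant message
```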

Security and Guardrails

There are multiple explicit gates before model calls: the feature flag, Turnstile verification (when enabled), and the semantic-search toggle.

Failure messaging is aligned across both request paths, while persistence differs by design between streaming and non-streaming flows.

Observability and Cost Tracking

Every request emits structured latency logs including cache lookup, retrieval, prompt build, provider timing, first-token timing, total duration, source count, and error classification.
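A small timing helper makes per-stage measurement cheap to sprinkle through a request. This is an illustrative sketch; the stage names and the `timed` helper are assumptions, not the site's real logging code.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(timings, stage):
    """Record a stage's wall time (seconds) into a dict that is later
    emitted as one structured log line per request."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start
```

Usage looks like `with timed(timings, "retrieval"): docs = retrieve(q)`, with one `timings` dict per request covering cache lookup, retrieval, prompt build, provider call, and first-token latency.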

OpenAI usage is also persisted (input/output tokens) for both embeddings and chat calls. This makes it possible to analyze cost and latency regressions from real traffic instead of guesses.

Tradeoffs I Chose Deliberately

These choices (strict grounding, quote-only citations, bounded history, and split caches) are intentionally conservative. This is a content-grounded assistant, not a general chatbot product.

What I Want to Improve Next

Try It

Open /chat and ask something like "What did you write about Django performance?"

If it works correctly, you'll get fast responses with grounded citations, and clear failure messages when the system has insufficient context. That's the contract: useful, traceable, and honest.
