
Personal Site LLM Chat with RAG: Architecture, Tradeoffs, and Production Behavior

I added a retrieval-grounded chat interface to /chat on this site. The assistant answers only from indexed public content (posts, projects, books, albums, photos, resume when enabled, and public semantic notes), and it links citations back to real pages when sources have URL-backed entries.

This write-up is the technical version: concrete request flow, retrieval/scoring behavior, caching design, SSE protocol, persistence model, and failure handling. If you're building similar "small but real" AI features, this is the stuff that actually matters.

This is part of the same direction I started in What Even Is This: I am constantly improving the site and building new experiments, and I will keep writing blog posts about how these features evolve.

Problem Definition and Constraints

Personal sites are easy to browse but awkward to query. I wanted users to ask normal questions like "What did you write about Django performance?" and get a grounded answer fast.

I set a few hard constraints up front: answers must come only from indexed content, responses must feel fast, and the assistant must say so when it lacks context.

Request Lifecycle (Both Paths)

The chat feature has two endpoints: a non-streaming request/response endpoint (the SSR baseline) and a streaming SSE endpoint.

High-level flow:

  1. Gate checks: feature flag, Turnstile (when enabled), semantic search enabled.
  2. Try answer cache (question + conversation history signature).
  3. On miss, load retrieval context (with its own cache keyed by normalized question).
  4. Build prompt from bounded conversation history + bounded source blocks.
  5. Call model (streaming or non-streaming), then persist session + DB + cache.
  6. Render citations as links when URL-backed sources exist.

This gives me one reliable SSR baseline and one richer streaming path without splitting product behavior into two different systems.
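Steps 2 through 5 of the flow above can be sketched as a small orchestration function. This is an illustrative sketch, not the site's real code: names like `handle_chat`, `answer_cache`, and the key formats are assumptions; the real implementation uses proper cache backends and providers.

```python
# Hypothetical sketch of the request lifecycle; dicts stand in for real caches.
import hashlib
import json

answer_cache = {}   # keyed by question + conversation-history signature
context_cache = {}  # keyed by the normalized question alone

def _answer_key(question, history):
    sig = hashlib.sha256(json.dumps([question, history]).encode()).hexdigest()
    return f"chat:answer:{sig}"

def _context_key(question):
    return f"chat:context:{question.strip().lower()}"

def handle_chat(question, history, retrieve, call_model):
    """Steps 2-5: answer cache, context cache, prompt build + model call."""
    akey = _answer_key(question, history)
    if akey in answer_cache:                      # step 2: answer-cache hit
        return answer_cache[akey], "answer-cache"
    ckey = _context_key(question)
    if ckey not in context_cache:                 # step 3: retrieval, cached
        context_cache[ckey] = retrieve(question)
    answer = call_model(question, history, context_cache[ckey])  # steps 4-5
    answer_cache[akey] = answer
    return answer, "model"
```

Note the asymmetry this encodes: a changed conversation history forces a fresh answer, but retrieval context for the same question is still reused.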

Retrieval Pipeline: Hybrid Search, Not Guessing

Retrieval comes from SemanticDocument and uses hybrid ranking (Postgres text rank + vector similarity). The combined score is weighted and thresholded, then top results are used to build prompt context.

combined_score = (rank * SEMANTIC_SEARCH_RANK_WEIGHT)
               + (vector_score * SEMANTIC_SEARCH_VECTOR_WEIGHT)

Default tuning:
- RAG_TOP_K = 8
- RAG_MAX_CONTEXT_CHARS = 9000
- SEMANTIC_SEARCH_RANK_WEIGHT = 0.45
- SEMANTIC_SEARCH_VECTOR_WEIGHT = 0.55
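The combine-threshold-truncate step looks roughly like this. The weights and top-k mirror the defaults above; the threshold value and the document tuple shape are assumptions for illustration.

```python
# Illustrative hybrid-score combine using the documented defaults.
RANK_WEIGHT = 0.45     # SEMANTIC_SEARCH_RANK_WEIGHT
VECTOR_WEIGHT = 0.55   # SEMANTIC_SEARCH_VECTOR_WEIGHT
TOP_K = 8              # RAG_TOP_K

def rank_documents(docs, min_score=0.15):
    """docs: [(doc_id, text_rank, vector_score)] -> top-k by combined score."""
    scored = [
        (doc_id, rank * RANK_WEIGHT + vec * VECTOR_WEIGHT)
        for doc_id, rank, vec in docs
    ]
    kept = [d for d in scored if d[1] >= min_score]  # threshold weak matches
    kept.sort(key=lambda d: d[1], reverse=True)
    return kept[:TOP_K]
```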

Context is assembled into numbered blocks so citations can map directly:

[1] Title
URL: https://example
Content: snippet...
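Assembling those numbered blocks under the character budget can be sketched like this; the field names and the greedy stop-at-budget behavior are assumptions, though the budget constant mirrors `RAG_MAX_CONTEXT_CHARS`.

```python
# Minimal sketch of numbered-block context assembly under a character budget.
MAX_CONTEXT_CHARS = 9000  # RAG_MAX_CONTEXT_CHARS

def build_context(docs):
    """docs: [{'title','url','snippet'}] -> numbered source blocks."""
    blocks, used = [], 0
    for i, doc in enumerate(docs, start=1):
        block = f"[{i}] {doc['title']}\nURL: {doc['url']}\nContent: {doc['snippet']}"
        if used + len(block) > MAX_CONTEXT_CHARS:
            break  # stop before exceeding the prompt budget
        blocks.append(block)
        used += len(block)
    return "\n\n".join(blocks)
```

Because block numbers are assigned here, the model's `[n]` citations map back to sources by position with no extra bookkeeping.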

If no relevant documents are found, the assistant says that directly rather than fabricating an answer.

Prompt Contract and Citation Semantics

The system prompt is intentionally strict: answer using only provided sources, and explicitly say "I don't know" when sources do not support the answer.

Citation markers are constrained to direct quotes. That keeps the output readable and avoids noisy "[1][2][3]" citation spam on every sentence.

Both backend and frontend linkify citation markers like [1]. Source entries missing title or URL are filtered out before rendering, so the UI never emits dead citation links.
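A sketch of the linkify-and-filter step, assuming sources are a positional list of dicts; the function name and HTML shape are illustrative, not the site's actual renderer.

```python
import re

def linkify_citations(answer_text, sources):
    """Turn [n] markers into links for URL-backed sources; leave others as text."""
    usable = {
        i: s for i, s in enumerate(sources, start=1)
        if s.get("title") and s.get("url")  # filter dead entries up front
    }
    def repl(match):
        n = int(match.group(1))
        source = usable.get(n)
        if source is None:
            return match.group(0)  # no URL-backed source: keep plain [n]
        return f'<a href="{source["url"]}">[{n}]</a>'
    return re.sub(r"\[(\d+)\]", repl, answer_text)
```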

Streaming Protocol with SSE

Streaming responses are sent as text/event-stream with explicit event types:

event: token
data: {"text":"partial answer"}

event: sources
data: [{"title":"...", "url":"...", "snippet":"..."}]

event: error
data: {"message":"..."}

event: done
data: {}
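The event frames above can be produced by a small generator. This is a sketch under assumptions: the function names are hypothetical, and the real endpoint wraps something like this in a streaming HTTP response.

```python
import json

def sse_event(event, data):
    """Format one text/event-stream frame matching the protocol above."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def stream_answer(token_iter, sources):
    """Yield token events, then sources, then done; errors become error events."""
    try:
        for text in token_iter:
            yield sse_event("token", {"text": text})
        yield sse_event("sources", sources)
    except Exception as exc:
        yield sse_event("error", {"message": str(exc)})
    yield sse_event("done", {})
```

Emitting `done` in all paths lets the frontend tear down its "Thinking..." state deterministically, whether the stream succeeded or failed.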

The frontend state machine starts with a "Thinking..." assistant bubble, appends token chunks, then retrofits citation anchors after the sources event arrives. It also handles stream failures with a visible error message while preserving the rest of the conversation UI.

One detail I like: cached answers are still emitted as chunked token events (fixed-size chunks), so repeated queries feel consistent instead of jarringly instant.
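The cached-replay chunking is trivial but worth showing; the chunk size here is an assumed value, not the site's actual constant.

```python
def chunk_cached_answer(answer, chunk_size=24):
    """Split a cached answer into fixed-size chunks so replays stream
    as token events instead of arriving jarringly all at once."""
    return [answer[i:i + chunk_size] for i in range(0, len(answer), chunk_size)]
```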

Caching Strategy: Split by Responsibility

I use separate caches: one for final answers, keyed by the question plus a conversation-history signature, and one for retrieval context, keyed by the normalized question alone.

Each cache has its own default TTL.

The important design choice is that context reuse is allowed across conversation states, but final answer reuse is not unless history matches. That avoids stale conversational answers while still reducing retrieval overhead.

State and Persistence Model

Conversation state is intentionally dual-layer: a session-backed layer for fast access plus durable rows in the database.

ChatConversation can attach to session key, authenticated user, and request fingerprint; ChatMessage stores role, text, sources JSON, and an is_error flag. History is trimmed to a bounded turn window so prompt size stays predictable.
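The bounded turn window can be sketched in a couple of lines; the window size is an assumption, and the message shape is simplified from the real `ChatMessage` model.

```python
MAX_TURNS = 6  # assumed bound; the real turn window may differ

def trim_history(messages, max_turns=MAX_TURNS):
    """Keep only the most recent turns so prompt size stays predictable.
    `messages` alternate user/assistant roles, oldest first."""
    return messages[-2 * max_turns:]  # one turn = user + assistant message
```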

Security and Guardrails

There are multiple explicit gates before model calls: the feature flag, Turnstile verification (when enabled), and the semantic-search toggle.

Failure messaging is aligned across both request paths, while persistence differs by design between streaming and non-streaming flows.

Observability and Cost Tracking

Every request emits structured latency logs including cache lookup, retrieval, prompt build, provider timing, first-token timing, total duration, source count, and error classification.
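A small timing helper makes per-stage measurement cheap to sprinkle through a request. This is an illustrative sketch; the stage names and the `timed` helper are assumptions, not the site's real logging code.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(timings, stage):
    """Record a stage's wall time (seconds) into a dict that is later
    emitted as one structured log line per request."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start
```

Usage looks like `with timed(timings, "retrieval"): docs = retrieve(q)`, with one `timings` dict per request covering cache lookup, retrieval, prompt build, provider call, and first-token latency.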

OpenAI usage is also persisted (input/output tokens) for both embeddings and chat calls. This makes it possible to analyze cost and latency regressions from real traffic instead of guesses.

Tradeoffs I Chose Deliberately

These choices (strict grounding, quote-only citations, bounded history, and split caches) are intentionally conservative. This is a content-grounded assistant, not a general chatbot product.

What I Want to Improve Next

Try It

Open /chat and ask something like "What did you write about Django performance?"

If it works correctly, you'll get fast responses with grounded citations, and clear failure messages when the system has insufficient context. That's the contract: useful, traceable, and honest.
