How I Built a Full AI Stack
On a Budget of $0
There’s a chatbot on this site. You might have already noticed it. If you haven’t, it’s in the bottom right corner, and it knows things about me. Not because I hardcoded a FAQ, but because it runs a full RAG pipeline with vector search, BM25 fallback retrieval, and either a local LLM running in your browser via WebGPU or a cloud model proxied through Cloudflare Workers.
The total monthly cost to operate all of this: zero dollars.
Let me explain how.
The Stack
Here’s what’s running:
- GitHub Pages for hosting (free)
- Cloudflare Workers for the API proxy (free tier: 100k requests/day)
- Groq for cloud inference (free tier: generous rate limits on Llama 3.3 70B)
- WebLLM for in-browser inference via WebGPU (free, runs on your GPU)
- Snowflake Arctic Embed for in-browser RAG embeddings (also WebGPU, also free)
- Python + sentence-transformers for pre-computing embeddings at build time
- Jekyll + Vite + TypeScript tying it all together
No servers. No databases. No monthly bills. Static files, edge functions, and your GPU doing the heavy lifting.
The Chat Widget
The chat widget gives you a choice: cloud or local.
Cloud mode sends your message through a Cloudflare Worker that proxies to Groq’s API. The worker exists for one reason: to keep the API key server-side. Groq runs Llama 3.3 70B, which is a genuinely capable model for conversational Q&A. The latency is excellent because Groq runs on custom LPU hardware that’s optimized for inference speed.
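Here’s roughly what the client side of that looks like. This is a simplified sketch, not my production widget code: the endpoint URL and payload shape are placeholders, and it assumes the Worker relays OpenAI-style SSE events.

```typescript
// A simplified sketch, not the actual widget code. The endpoint URL is a
// placeholder, and it assumes the Worker relays OpenAI-style SSE events
// ("data: {...}" lines, terminated by "data: [DONE]").
async function streamChat(
  messages: { role: "system" | "user" | "assistant"; content: string }[],
  onToken: (token: string) => void,
): Promise<void> {
  const res = await fetch("https://chat-proxy.example.workers.dev/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  });
  if (!res.ok || !res.body) throw new Error(`Proxy error: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // SSE events are separated by blank lines.
    const events = buffer.split("\n\n");
    buffer = events.pop() ?? "";
    for (const event of events) {
      const data = event.replace(/^data:\s*/, "").trim();
      if (!data || data === "[DONE]") continue;
      const token = JSON.parse(data).choices?.[0]?.delta?.content ?? "";
      if (token) onToken(token); // append to the chat bubble as tokens arrive
    }
  }
}
```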
Local mode is where it gets interesting. When you pick a local model, WebLLM downloads a quantized LLM directly into your browser’s cache and runs inference on your GPU using WebGPU. No server involved. Your question, the context, the generation: all happening on your machine. The models range from 3B parameters (~1.8GB download) to 8B (~4.5GB). They’re quantized to 4-bit, so quality takes a hit, but for answering questions about blog posts? Good enough.
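For flavor, here’s a minimal sketch of local inference with the @mlc-ai/web-llm package. It’s not my widget code, and the model ID is just one of the library’s prebuilt quantized models; check their current prebuilt list before copying it.

```typescript
// A minimal sketch with @mlc-ai/web-llm, not the widget's actual code.
// The model ID is one of the library's prebuilt quantized models; consult
// the current prebuilt list for available names.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function askLocally(question: string, context: string): Promise<string> {
  // The first call downloads the quantized weights; after that they're served from cache.
  const engine = await CreateMLCEngine("Llama-3.2-3B-Instruct-q4f16_1-MLC", {
    initProgressCallback: (report) => console.log(report.text), // download/compile progress
  });

  // OpenAI-style chat API, but the tokens come off your own GPU via WebGPU.
  const reply = await engine.chat.completions.create({
    messages: [
      { role: "system", content: `Answer using this context:\n${context}` },
      { role: "user", content: question },
    ],
  });
  return reply.choices[0].message.content ?? "";
}
```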
The tradeoff is honest: cloud gives you 70B-parameter quality with sub-second responses. Local gives you privacy and independence at the cost of a multi-gigabyte download and a model that’s… trying its best.
RAG: Making Small Models Useful
A 3B-parameter model doesn’t know anything about me. It barely knows anything about anything. But it doesn’t have to, because the RAG pipeline feeds it exactly the context it needs before it generates a response.
Here’s how the retrieval works:
At build time, a Python script processes my blog content into chunks, generates embeddings using sentence-transformers, and writes them to a JSON file. This file ships with the site as a static asset. No vector database, no Pinecone, no Weaviate. Just a JSON file.
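For a sense of what ships to the browser, the file is roughly an array of entries like this (field names are illustrative, not my actual schema):

```typescript
// Illustrative shape of the pre-computed embeddings file; the real field
// names may differ. One entry per content chunk.
interface EmbeddedChunk {
  id: string;          // stable chunk identifier
  url: string;         // link back to the source post
  text: string;        // the chunk itself, stuffed into the prompt if retrieved
  embedding: number[]; // fixed-length vector from the build-time model
}

type EmbeddingIndex = EmbeddedChunk[];
```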
At runtime, when you ask a question, the browser loads Snowflake Arctic Embed (a small embedding model) via WebGPU, embeds your query, and does cosine similarity against the pre-computed vectors. BM25 keyword search runs as a fallback. The top chunks get stuffed into the system prompt, and the LLM generates a response grounded in actual content.
The embedding model is ~130MB. It loads once, caches in IndexedDB, and runs entirely in-browser. No API calls for retrieval.
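The vector half of that doesn’t need much code. Here’s a simplified sketch reusing the chunk shape from above; embedQuery stands in for the Arctic Embed call, and the BM25 fallback is left out.

```typescript
// A simplified sketch of the vector side of retrieval. embedQuery stands in
// for the Arctic Embed call; the BM25 fallback is omitted.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

async function retrieve(
  query: string,
  index: EmbeddingIndex,
  embedQuery: (q: string) => Promise<number[]>,
  k = 4,
): Promise<EmbeddedChunk[]> {
  const queryVec = await embedQuery(query);
  return index
    .map((chunk) => ({ chunk, score: cosine(queryVec, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)  // highest similarity first
    .slice(0, k)                        // keep the top k chunks
    .map((scored) => scored.chunk);
}
```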
The Cloudflare Worker
The entire “backend” is a Cloudflare Worker. Here’s roughly what it does:
- Receives a chat request from the browser
- Attaches the Groq API key (stored as a Worker secret)
- Forwards the request to Groq’s API
- Streams the response back via Server-Sent Events
That’s it. Maybe 50 lines of code. It exists solely to keep the API key out of the client bundle. Cloudflare’s free tier gives you 100,000 requests per day, which is more than enough for a personal blog that gets… let’s be honest, probably dozens of visitors.
The Worker runs on Cloudflare’s edge network, so it’s fast regardless of where the visitor is. No cold starts because Workers use V8 isolates instead of containers.
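For the curious, a stripped-down version looks something like this. It assumes Groq’s OpenAI-compatible chat completions endpoint, and the model ID and CORS policy are illustrative rather than my exact config.

```typescript
// A hedged sketch of the proxy, not the deployed code. Assumes Groq's
// OpenAI-compatible chat completions endpoint; the model ID is illustrative,
// so check Groq's docs for current names.
export interface Env {
  GROQ_API_KEY: string; // stored with `wrangler secret put GROQ_API_KEY`
}

const CORS = {
  "Access-Control-Allow-Origin": "*", // tighten to the site's origin in practice
  "Access-Control-Allow-Methods": "POST, OPTIONS",
  "Access-Control-Allow-Headers": "Content-Type",
};

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method === "OPTIONS") {
      return new Response(null, { headers: CORS }); // CORS preflight
    }
    if (request.method !== "POST") {
      return new Response("Method not allowed", { status: 405, headers: CORS });
    }

    const { messages } = (await request.json()) as { messages: unknown[] };

    // Attach the secret server-side and forward to Groq, asking for a stream.
    const upstream = await fetch("https://api.groq.com/openai/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${env.GROQ_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "llama-3.3-70b-versatile", messages, stream: true }),
    });

    // Pass Groq's SSE body straight back to the browser.
    return new Response(upstream.body, {
      status: upstream.status,
      headers: { ...CORS, "Content-Type": "text/event-stream" },
    });
  },
};
```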
The System Prompt Problem
There’s a catch with client-side AI: your system prompt ships to the user. If you’re using a local model, the full system prompt has to be handed to WebLLM in the browser before inference can run. That means anyone who opens DevTools can read your instructions.
For a blog chatbot, this isn’t exactly a national security concern. But the system prompt contains personality instructions, content boundaries, and behavioral guidelines that I’d rather not have trivially inspectable.
The solution: double obfuscation. The prompt is encoded and split across multiple locations in the bundle, reassembled at runtime. Is it unbreakable? No. A determined person with a debugger can extract it. But it’s not sitting in plaintext in a JavaScript file waiting to be copied, and that’s the bar I was aiming for.
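To make the idea concrete, here’s a toy version of the split-and-reassemble trick. It’s not my actual encoding, just the general shape.

```typescript
// A toy version of the idea, not the actual scheme.
// "Build" side: encode the prompt and split it into fragments that get
// scattered across different modules in the bundle.
function splitPrompt(prompt: string, parts: number): string[] {
  const encoded = btoa(prompt);
  const size = Math.ceil(encoded.length / parts);
  const fragments: string[] = [];
  for (let i = 0; i < encoded.length; i += size) {
    fragments.push(encoded.slice(i, i + size));
  }
  return fragments;
}

// Runtime side: pull the fragments back together and decode in memory only.
function assemblePrompt(fragments: string[]): string {
  return atob(fragments.join(""));
}
```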
Pre-Computing Embeddings as a Build Step
This is the part I’m most pleased with. Embedding generation runs as a build step, which means:
- No runtime cost for embedding blog content
- No external embedding API needed in production
- The embeddings update automatically when content changes
- Sensitive source material stays in the build environment, not the client
The Python script uses the Hugging Face sentence-transformers library. It chunks the content, generates embeddings with a local model, and outputs a JSON file that Vite bundles as a static asset. GitHub Actions runs this as part of the deploy pipeline. The Hugging Face model downloads are cached between builds, so it’s not pulling gigabytes on every deploy.
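On the client side, the generated file comes in like any other Vite asset (the path here is illustrative):

```typescript
// Vite parses the JSON import at build time and bundles the data with the
// client, so retrieval never touches the network. Path is illustrative.
import chunks from "./generated/embeddings.json";

const embeddingIndex = chunks as EmbeddedChunk[]; // shape sketched earlier
```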
What It Costs
Let me be specific:
| Service | Free Tier | My Usage |
|---|---|---|
| GitHub Pages | Unlimited for public repos | ~50MB static site |
| Cloudflare Workers | 100k requests/day | Maybe 100 on a good day |
| Groq API | Rate-limited free tier | Well under limits |
| WebLLM/WebGPU | Runs on visitor’s hardware | $0 |
| Domain (ellyseum.me) | Not free | ~$10/year |
The domain is the only line item. Everything else is genuinely free, not “free trial” free.
Why This Matters
I built this partially as a flex, partially to prove a point. The point: you can ship a real AI product on zero infrastructure budget. Not a demo. Not a prototype. A system with cloud and local inference, retrieval-augmented generation, streaming responses, model selection, and graceful degradation between multiple providers.
Five years ago this would have required a GPU server, a vector database, an inference API, a backend framework, and probably a DevOps engineer to keep it all running. Now it’s a static site with some clever plumbing.
The tools are free. The models are open. The browser is the runtime. The hard part isn’t access to compute anymore: it’s knowing how to wire it all together. That’s still a human skill, and it’s the same skill it’s always been. See a system, understand the protocol, automate it, make it do things it wasn’t designed to do.
Some of us have been doing that since 2400 baud.