TECHNICAL BLOG
Deep Dives for Engineers
Detailed technical articles covering the real problems we solve in embedded systems, AI, and robotics engineering.
A comprehensive guide to building, testing, and deploying production-quality AI chatbots using LangChain's retrieval-augmented generation, FastAPI, and vector databases.
Building a chatbot that impresses in a Jupyter notebook is straightforward. Building one that handles 10,000 daily queries reliably, returns factually grounded answers, maintains conversation context correctly, and fails gracefully when it should not answer — that requires serious engineering. Retrieval-Augmented Generation (RAG) is the architecture that makes the difference: instead of relying on an LLM's parametric knowledge alone, RAG retrieves relevant documents at inference time and grounds the model's response in that retrieved context.
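To make the retrieve-then-ground idea concrete, here is a minimal sketch that uses only the Python standard library. It is not the LangChain pipeline built later in this article: the three-document corpus, the bag-of-words similarity, and the `min_score` threshold of 0.3 are all illustrative assumptions standing in for a real embedding model and vector store. It does show the two behaviours that matter: the prompt is built from retrieved text, and the system declines when nothing relevant is found.

```python
import math
import re
from collections import Counter
from typing import List, Optional

# Hypothetical three-document corpus; a real system would query a vector store.
CORPUS = [
    "The watchdog timer resets the microcontroller if firmware stops responding.",
    "FastAPI streams responses to the browser using server-sent events.",
    "Chroma stores embedding vectors on disk for similarity search.",
]


def tokenize(text: str) -> Counter:
    """Lower-case bag-of-words counts, standing in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, k: int = 2, min_score: float = 0.3) -> List[str]:
    """Return up to k documents whose similarity to the query is at least min_score."""
    q = tokenize(query)
    scored = sorted(((cosine(q, tokenize(doc)), doc) for doc in CORPUS), reverse=True)
    return [doc for score, doc in scored[:k] if score >= min_score]


def build_prompt(query: str) -> Optional[str]:
    """Ground the question in retrieved context, or return None to signal a refusal."""
    context = retrieve(query)
    if not context:
        return None  # nothing relevant: the caller should decline to answer
    joined = "\n".join(f"- {doc}" for doc in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"
```

Returning `None` rather than an empty context lets the caller send a fixed "I do not have information on this topic" reply without calling the LLM at all.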
A production RAG system has five core components:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Load every PDF under ./docs, including subdirectories.
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
docs = loader.load()

# Split on paragraph breaks first, then on lines, sentences, and words.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(docs)

# Embed the chunks and store them in a local Chroma collection.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
# An explicit persist() is only needed with Chroma versions before 0.4;
# later versions write to disk automatically.
vectorstore.persist()
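Choose your chunk size based on your LLM's context window and the density of your source documents. A chunk size of 512 with an overlap of 64 is a robust starting point for technical documentation. Note that RecursiveCharacterTextSplitter measures length in characters by default, so the values above are characters, not tokens; to size chunks in tokens, build the splitter with a tokenizer-based length function, for example via RecursiveCharacterTextSplitter.from_huggingface_tokenizer.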
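To see what the overlap setting actually does, the following sketch splits text into fixed-size character windows. It is a deliberate simplification and not how RecursiveCharacterTextSplitter works internally: the real splitter prefers to break at the configured separators, whereas this hypothetical `chunk_text` helper cuts at exact character offsets.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> List[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 64 characters of one chunk reappear as the first 64 of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.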
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Return the five most similar chunks for each query.
# Note: the retriever sees only the latest message. To rewrite follow-up
# questions using chat_history, wrap it with create_history_aware_retriever.
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# The system message restricts answers to the retrieved context.
prompt = ChatPromptTemplate.from_messages([
    ("system", '''You are an expert technical assistant for Worksprout.
Answer using only the provided context. If the answer is not in the context,
say you do not have information on this topic.
Context:
{context}'''),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1, streaming=True)

# Insert the retrieved chunks into {context}, then chain retrieval and generation.
combine_docs_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, combine_docs_chain)
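The chain built above returns a dictionary that contains the retrieved documents under the `context` key alongside the `answer`, which is what makes source citations possible. The sketch below shows one way to turn that list into citation strings. It runs without LangChain by using a small stand-in `Doc` class, and it assumes each document carries `source` and `page` metadata with zero-based page numbers, which is what PyPDFLoader typically sets; verify this against your loader before relying on it. The `format_citations` helper and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Stand-in for a LangChain Document, so the sketch runs without LangChain."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def format_citations(context: List[Doc]) -> List[str]:
    """Build a de-duplicated, order-preserving list of 'source (p. N)' strings."""
    seen = set()
    citations = []
    for doc in context:
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page")
        # Assumption: page numbers are zero-based, so add 1 for display.
        label = f"{source} (p. {page + 1})" if isinstance(page, int) else str(source)
        if label not in seen:
            seen.add(label)
            citations.append(label)
    return citations


# Hypothetical chain output shaped like {"answer": ..., "context": [...]}.
result = {
    "answer": "example answer",
    "context": [
        Doc("chunk text", {"source": "docs/watchdog.pdf", "page": 3}),
        Doc("chunk text", {"source": "docs/watchdog.pdf", "page": 3}),
        Doc("chunk text", {"source": "docs/boot.pdf", "page": 0}),
    ],
}
citations = format_citations(result["context"])
```

De-duplicating matters because several of the five retrieved chunks often come from the same page, and repeating that page in the citation list adds noise without adding evidence.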
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from typing import AsyncGenerator

app = FastAPI()

class ChatRequest(BaseModel):
    message: str
    session_id: str

async def stream_response(query: str, history: list) -> AsyncGenerator[str, None]:
    # Each streamed chunk is a dict; only some chunks carry part of the answer.
    async for chunk in rag_chain.astream({"input": query, "chat_history": history}):
        if answer := chunk.get("answer"):
            yield f"data: {answer}\n\n"
    # Sentinel event so the client knows the stream is complete.
    yield "data: [DONE]\n\n"

@app.post("/chat")
async def chat(req: ChatRequest):
    # get_session_history is assumed to be defined elsewhere and to return
    # the list of prior messages for this session.
    history = get_session_history(req.session_id)
    return StreamingResponse(
        stream_response(req.message, history),
        media_type="text/event-stream",
    )
A demo chatbot becomes a production system when you address these concerns:
RAG with LangChain and FastAPI gives you a principled architecture for production chatbots that are grounded in your organisation's knowledge, auditable through source citations, and extensible as your document corpus grows. The patterns described here are the same ones we apply at Worksprout for client-facing AI assistants — start with them and customise for your domain rather than building from scratch.