Building Production AI Chatbots with LangChain, FastAPI, and RAG
Machine Learning

Worksprout Team | Aug 12, 2024 | 13 min read

A comprehensive guide to building, testing, and deploying production-quality AI chatbots using LangChain's retrieval-augmented generation, FastAPI, and vector databases.

Beyond the Demo: What Production Chatbots Actually Require

Building a chatbot that impresses in a Jupyter notebook is straightforward. Building one that handles 10,000 daily queries reliably, returns factually grounded answers, maintains conversation context correctly, and fails gracefully when it should not answer — that requires serious engineering. Retrieval-Augmented Generation (RAG) is the architecture that makes the difference: instead of relying on an LLM's parametric knowledge alone, RAG retrieves relevant documents at inference time and grounds the model's response in that retrieved context.

Architecture Overview

A production RAG system has five core components:

  1. Document ingestion pipeline — loads, chunks, and embeds source documents into a vector database
  2. Vector database — stores embeddings and enables semantic similarity search (Chroma, Pinecone, Weaviate, Qdrant)
  3. Retriever — given a user query, fetches the most semantically relevant document chunks
  4. LLM with prompt template — generates a response conditioned on the retrieved context and conversation history
  5. API layer — exposes the chatbot as a streaming HTTP endpoint with session management

Document Ingestion with LangChain

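# Document loaders live in langchain_community and splitters in langchain_text_splitters
# in current LangChain releases; the old langchain.* import paths are deprecated.
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter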
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(docs)

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
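# On Chroma 0.4 and later, data is written to persist_directory automatically and
# this call is deprecated; keep it only if you are on an older version.
vectorstore.persist()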

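Choose your chunk size based on your embedding model's input limit, your LLM's context window, and the density of your source documents. 512 tokens with a 64-token overlap is a robust starting point for technical documentation. Note, however, that RecursiveCharacterTextSplitter measures length in characters by default, so the snippet above yields chunks of up to 512 characters. To size chunks in tokens, build the splitter with a token-aware constructor such as RecursiveCharacterTextSplitter.from_huggingface_tokenizer.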
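
To make the interaction between chunk size and overlap concrete, here is a minimal sketch of a fixed-size character chunker. It is a simplification for illustration only, not LangChain's recursive splitting algorithm, and the function name chunk_text is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration: unlike RecursiveCharacterTextSplitter, it
    ignores separators and may cut through the middle of a word.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Each window starts chunk_size - chunk_overlap characters after the previous one, so the tail of one chunk reappears at the head of the next. That repetition is what lets a sentence that straddles a boundary still be retrieved in full from at least one chunk.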

The RAG Chain

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

prompt = ChatPromptTemplate.from_messages([
    ("system", '''You are an expert technical assistant for Worksprout.
Answer using only the provided context. If the answer is not in the context,
say you do not have information on this topic.

Context:
{context}'''),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1, streaming=True)
combine_docs_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, combine_docs_chain)
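
Conceptually, the chain above does two things before it calls the LLM: it ranks stored chunks by similarity to the query, then "stuffs" the top results into the {context} slot of the prompt. The sketch below shows that idea in plain Python. It is not LangChain's internal implementation: the Chunk class, the function names, and the three-dimensional toy vectors are all assumptions made for illustration, and real vector databases use approximate nearest-neighbour indexes rather than a full sort.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    vector: list[float]  # embedding of the chunk (toy values below)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float], store: list[Chunk], k: int = 5) -> list[Chunk]:
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(store, key=lambda c: cosine(query_vec, c.vector), reverse=True)
    return ranked[:k]


def build_context(chunks: list[Chunk]) -> str:
    """Join the retrieved chunks into one context string for the prompt."""
    return "\n\n".join(c.text for c in chunks)


# Toy store with hand-written vectors standing in for real embeddings.
store = [
    Chunk("Notes on chunking.", [0.9, 0.1, 0.0]),
    Chunk("Notes on streaming.", [0.1, 0.9, 0.1]),
    Chunk("Notes on persistence.", [0.0, 0.2, 0.9]),
]
top = retrieve([0.0, 1.0, 0.0], store, k=2)
context = build_context(top)
```

With k set to 5, as in the retriever above, the prompt receives up to five chunks per query; raising k improves recall at the cost of a longer and more expensive prompt.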

FastAPI Streaming Endpoint

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from typing import AsyncGenerator

app = FastAPI()

class ChatRequest(BaseModel):
    message: str
    session_id: str

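async def stream_response(query: str, history: list) -> AsyncGenerator[str, None]:
    # Each streamed chunk is a dict; only some chunks carry a piece of the answer.
    async for chunk in rag_chain.astream({"input": query, "chat_history": history}):
        if answer := chunk.get("answer"):
            # Server-Sent Events frame: "data: <payload>" followed by a blank line.
            yield f"data: {answer}\n\n"
    yield "data: [DONE]\n\n"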

@app.post("/chat")
async def chat(req: ChatRequest):
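    # get_session_history is an application-specific helper that returns the prior
    # messages for this session; it is not part of FastAPI or LangChain.
    history = get_session_history(req.session_id)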
    return StreamingResponse(
        stream_response(req.message, history),
        media_type="text/event-stream"
    )
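
The endpoint calls get_session_history without defining it, and it writes answer text straight into a "data:" line. The sketch below shows one possible shape for both pieces, under stated assumptions: history is kept in process memory as (role, content) tuples, a message format LangChain prompt templates accept, and record_turn, sse_event, and MAX_TURNS are hypothetical names introduced here. An in-memory store is lost on restart and is not shared between workers, so a production deployment would likely need an external store instead.

```python
from collections import defaultdict, deque

MAX_TURNS = 10  # assumed cap: keep only the most recent exchanges per session

# session_id -> recent (role, content) pairs; in-memory only, lost on restart
_sessions: dict[str, deque] = defaultdict(lambda: deque(maxlen=2 * MAX_TURNS))


def get_session_history(session_id: str) -> list[tuple[str, str]]:
    """Return a copy of the stored messages for this session."""
    return list(_sessions[session_id])


def record_turn(session_id: str, user_msg: str, ai_msg: str) -> None:
    """Append one completed exchange; the deque drops the oldest entries."""
    history = _sessions[session_id]
    history.append(("human", user_msg))
    history.append(("ai", ai_msg))


def sse_event(data: str) -> str:
    """Frame a payload as one Server-Sent Event.

    Every line of the payload needs its own 'data:' prefix, and a blank
    line terminates the event.
    """
    lines = data.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"
```

Capping the history keeps the prompt from growing without bound over a long conversation. The sse_event helper matters once answer chunks contain newlines, which is common for code or lists: a raw newline inside a single "data:" line would otherwise end the event early. Calling record_turn after the stream finishes is what makes the next request see the exchange.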

Production Hardening

A demo chatbot becomes a production system when you address these concerns:

  • Query routing — classify queries before retrieval; route off-topic questions to a fallback without burning LLM tokens
  • Hallucination mitigation — require source citations; implement a faithfulness checker that verifies the answer against retrieved context
  • Latency — cache embeddings for frequent queries; use streaming so users see the first token within 500 ms
  • Rate limiting — protect your LLM API budget with per-session and per-IP limits via FastAPI middleware
  • Evaluation — use RAGAS (Retrieval Augmented Generation Assessment) to score context relevance, faithfulness, and answer correctness on a test set
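
Two of the items above, rate limiting and caching query embeddings, are simple enough to sketch without any framework. The classes below are hypothetical illustrations rather than FastAPI or LangChain APIs: SlidingWindowLimiter keeps per-key request timestamps in memory, and QueryEmbeddingCache wraps any callable that maps a string to a vector. Normalising case and whitespace before the cache lookup is an assumption that near-identical queries can share an embedding, which may not hold for case-sensitive models.

```python
import time
from collections import OrderedDict, defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Discard timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


class QueryEmbeddingCache:
    """Small LRU cache in front of an embedding function."""

    def __init__(self, embed_fn, max_size: int = 1024):
        self.embed_fn = embed_fn
        self.max_size = max_size
        self._cache: OrderedDict = OrderedDict()

    def embed(self, query: str):
        key = " ".join(query.lower().split())  # normalise case and whitespace
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        vector = self.embed_fn(query)
        self._cache[key] = vector
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return vector
```

In a FastAPI application, allow could be called from middleware or a dependency with the session ID or client IP as the key, responding with HTTP 429 when it returns False. Both classes hold state in a single process, so a deployment with several workers would need a shared store to enforce limits globally.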

Conclusion

RAG with LangChain and FastAPI gives you a principled architecture for production chatbots that are grounded in your organisation's knowledge, auditable through source citations, and extensible as your document corpus grows. The patterns described here are the same ones we apply at Worksprout for client-facing AI assistants — start with them and customise for your domain rather than building from scratch.

Worksprout Team

The Worksprout engineering team specialises in embedded Linux, RDK-B broadband platforms, edge AI, and robotics systems. Based in Rajshahi, Bangladesh, we design and deploy production embedded intelligence for clients across South Asia and beyond.
