Building a RAG Chatbot from Scratch with Python
Most people meet RAG systems through polished products: a chat window, a clean UI, answers that reference internal docs. Behind that is a simple idea -- give an LLM the right context at the right time. In this guide, we will build a minimal but real RAG chatbot in Python, step by step.
What We Are Building
At a high level, our chatbot will:
- Ingest a small set of documents
- Split them into chunks
- Convert chunks to embeddings
- Store embeddings in a simple vector store
- At query time, retrieve the most relevant chunks
- Feed those chunks into an LLM prompt to answer the user
We will use Python, open-source tools, and a simple in-memory vector store so you can run everything locally.
Tech Stack
- Python 3.10+
- sentence-transformers for embeddings
- faiss-cpu as the vector index
- transformers or an API client for the LLM
Step 1 -- Setting Up the Environment
Install the dependencies:
pip install "sentence-transformers>=3.0.0" faiss-cpu "transformers>=4.38.0" torch
If you use an external LLM API (OpenAI, Anthropic, etc.), install the relevant SDK instead of transformers or in addition to it.
A basic project layout:
rag_chatbot/
data/
docs/
doc1.txt
doc2.txt
rag/
__init__.py
loader.py
chunker.py
embeddings.py
store.py
retriever.py
llm.py
chatbot.py
main.py
We will not fill every file exhaustively, but this shows how you might organize a growing RAG codebase.
Step 2 -- Loading and Chunking Documents
Retrieval performance lives and dies by chunking. Good chunks are:
- small enough to fit in the context window comfortably
- large enough to preserve meaning
- aligned with natural document structure when possible
For a beginner system, a simple character-based chunking with overlap works fine.
Create rag/loader.py:
from pathlib import Path
from typing import List

def load_documents(folder_path: str) -> List[str]:
    folder = Path(folder_path)
    docs = []
    for path in folder.glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        docs.append(text)
    return docs
Create rag/chunker.py:
from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunk = text[start:end]
        chunks.append(chunk)
        if end == text_length:
            break
        start = end - overlap
    return chunks

def chunk_documents(docs: List[str], chunk_size: int = 500, overlap: int = 100) -> List[str]:
    all_chunks = []
    for doc in docs:
        all_chunks.extend(chunk_text(doc, chunk_size, overlap))
    return all_chunks
This is intentionally simple. In production, you will likely want token-based chunking and structure-aware splitting.
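As a bridge toward token-based chunking, here is a minimal sketch. The function name `chunk_by_tokens` and its `tokenize`/`detokenize` parameters are our own illustrative names, not from any library; in the example below, `str.split` stands in for a real tokenizer such as one from transformers.

```python
from typing import Callable, List

def chunk_by_tokens(
    text: str,
    tokenize: Callable[[str], List[str]],
    detokenize: Callable[[List[str]], str],
    chunk_size: int = 128,
    overlap: int = 20,
) -> List[str]:
    # Split on token boundaries instead of raw characters, so chunk
    # sizes line up with the model's token budget and no token is cut.
    tokens = tokenize(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(detokenize(tokens[start:end]))
        if end == len(tokens):
            break
        start = end - overlap
    return chunks

# Whitespace tokenization as a stand-in for a real tokenizer:
chunks = chunk_by_tokens("one two three four five", str.split, " ".join, chunk_size=3, overlap=1)
# chunks == ["one two three", "three four five"]
```

The sliding-window logic is the same as in chunk_text; only the unit changes from characters to tokens.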
Step 3 -- Embedding Text Chunks
Embeddings convert text into numerical vectors that capture semantic meaning. There is a tradeoff between quality, latency, and privacy when choosing an embedding model.
For a local beginner setup, we will use sentence-transformers with an open model.
Create rag/embeddings.py:
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

class EmbeddingModel:
    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def encode(self, texts: List[str]) -> np.ndarray:
        # Returns a 2D array of shape (len(texts), dim)
        embeddings = self.model.encode(texts, convert_to_numpy=True, show_progress_bar=True)
        return embeddings
This gives us a reusable embedding component. If you later move to an API-based embedding model, you can keep the same interface.
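As a hedged sketch of that swap, here is what an API-backed class with the same `encode` interface could look like. `APIEmbeddingModel` and the `client.embed` call are placeholder names, not a real SDK; substitute your provider's actual client.

```python
from typing import List

class APIEmbeddingModel:
    """Drop-in replacement for EmbeddingModel backed by a hosted API.

    The client object and its embed() method are placeholders; wire in
    your provider's SDK call inside encode().
    """

    def __init__(self, client):
        self.client = client  # e.g. a provider SDK client

    def encode(self, texts: List[str]) -> List[list]:
        # Same signature as EmbeddingModel.encode, so the Retriever
        # built later in this guide works unchanged.
        return [self.client.embed(t) for t in texts]
```

Because the rest of the pipeline only depends on `encode`, nothing else has to change.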
Step 4 -- Building a Simple Vector Store with FAISS
Vector databases such as Pinecone, Weaviate or Milvus add scalability and durability, but for a small chatbot you can start with faiss in memory.
Create rag/store.py:
from typing import List, Tuple

import faiss
import numpy as np

class VectorStore:
    def __init__(self, dim: int):
        self.index = faiss.IndexFlatL2(dim)
        self.texts: List[str] = []

    def add(self, embeddings: np.ndarray, texts: List[str]):
        if embeddings.ndim == 1:
            embeddings = embeddings.reshape(1, -1)
        self.index.add(embeddings.astype("float32"))
        self.texts.extend(texts)

    def search(self, query_embedding: np.ndarray, k: int = 5) -> List[Tuple[str, float]]:
        if query_embedding.ndim == 1:
            query_embedding = query_embedding.reshape(1, -1)
        distances, indices = self.index.search(query_embedding.astype("float32"), k)
        results = []
        for idx, dist in zip(indices[0], distances[0]):
            if idx == -1:
                continue
            text = self.texts[idx]
            results.append((text, float(dist)))
        return results
In this configuration, FAISS's IndexFlatL2 ranks by squared Euclidean distance. Many sentence-embedding models are trained with cosine similarity in mind, but for small experiments L2 is usually fine, and if you normalize all vectors to unit length, L2 ranking matches cosine ranking exactly.
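As a minimal sketch of that normalization step (`normalize` is our own helper name, not a FAISS API; it uses plain NumPy so you can apply it before both `add` and `search`):

```python
import numpy as np

def normalize(embeddings: np.ndarray) -> np.ndarray:
    # Scale each row to unit length. On unit vectors, squared L2
    # distance equals 2 - 2*cosine, so L2 ranking matches cosine ranking.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)

# Example: a (1, 2) embedding normalized to unit length
unit = normalize(np.array([[3.0, 4.0]]))
# unit == [[0.6, 0.8]]
```

Call this on every batch before it reaches the vector store, and on each query embedding before searching.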
Step 5 -- Wiring Up a Retriever
The retriever is the component that, given a query, finds the most relevant chunks. It combines the embedding model and the vector store.
Create rag/retriever.py:
from typing import List, Tuple

import numpy as np

from .embeddings import EmbeddingModel
from .store import VectorStore

class Retriever:
    def __init__(self, embedding_model: EmbeddingModel, vector_store: VectorStore):
        self.embedding_model = embedding_model
        self.vector_store = vector_store

    def add_documents(self, chunks: List[str]):
        embeddings = self.embedding_model.encode(chunks)
        embeddings = np.array(embeddings)
        self.vector_store.add(embeddings, chunks)

    def retrieve(self, query: str, k: int = 5) -> List[Tuple[str, float]]:
        query_emb = self.embedding_model.encode([query])
        query_emb = np.array(query_emb)
        results = self.vector_store.search(query_emb, k=k)
        return results
Notice how each piece is small and composable. This composability becomes important as your RAG system grows in complexity.
Step 6 -- Adding the LLM Layer
For the LLM, you have two main choices:
- A local open-source model via transformers and PyTorch
- A hosted API such as OpenAI, Anthropic, etc.
For beginners and laptops, an API is often easier. For privacy-sensitive applications, the choice matters more since you may be sending user data to a third party.
Below is an example using a generic transformers model, but you can replace the LLMClient with your own API client.
Create rag/llm.py:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

class LLMClient:
    def __init__(self, model_name: str = "microsoft/Phi-3-mini-4k-instruct"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
        self.model.eval()

    @torch.no_grad()
    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt")
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.2,
        )
        # Decode only the newly generated tokens, skipping the prompt
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
If you prefer an API model, keep the interface:
class LLMClient:
    def __init__(self, api_key: str):
        self.api_key = api_key
        # init SDK client

    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        # call your provider here
        return "mock answer"
This separation lets you switch models easily, which is useful when you start evaluating retrieval quality.
Step 7 -- Building the Chatbot Orchestrator
Now we tie everything together into a simple chatbot class that:
- Takes a user question
- Retrieves top-k relevant chunks
- Builds a prompt with those chunks as context
- Calls the LLM
- Returns the answer
Create rag/chatbot.py:
from typing import List, Tuple

from .retriever import Retriever
from .llm import LLMClient

SYSTEM_PROMPT = """You are a helpful assistant.
Use only the provided context to answer the question.
If the answer is not in the context, say you do not know.
Provide concise, clear answers.
"""

def build_prompt(question: str, context_chunks: List[str]) -> str:
    context_text = "\n\n".join(context_chunks)
    prompt = f"{SYSTEM_PROMPT}\n\nContext:\n{context_text}\n\nQuestion: {question}\nAnswer:"
    return prompt

class RAGChatbot:
    def __init__(self, retriever: Retriever, llm: LLMClient, k: int = 5):
        self.retriever = retriever
        self.llm = llm
        self.k = k

    def answer(self, question: str) -> Tuple[str, List[str]]:
        results = self.retriever.retrieve(question, k=self.k)
        context_chunks = [text for text, _dist in results]
        prompt = build_prompt(question, context_chunks)
        answer = self.llm.generate(prompt)
        return answer, context_chunks
Step 8 -- Wiring Everything Up in main.py
Now we instantiate the components, index our documents, and run a simple chat loop.
Create main.py:
from rag.loader import load_documents
from rag.chunker import chunk_documents
from rag.embeddings import EmbeddingModel
from rag.store import VectorStore
from rag.retriever import Retriever
from rag.llm import LLMClient
from rag.chatbot import RAGChatbot
import numpy as np

def build_rag_chatbot() -> RAGChatbot:
    # 1. Load documents
    docs = load_documents("data/docs")

    # 2. Chunk documents
    chunks = chunk_documents(docs, chunk_size=500, overlap=100)
    print(f"Loaded {len(docs)} docs and created {len(chunks)} chunks")

    # 3. Embedding model
    embedding_model = EmbeddingModel()

    # 4. Vector store (dimension taken from a sample embedding)
    sample_emb = embedding_model.encode(["test"])
    dim = np.array(sample_emb).shape[1]
    vector_store = VectorStore(dim=dim)

    # 5. Retriever
    retriever = Retriever(embedding_model, vector_store)
    retriever.add_documents(chunks)

    # 6. LLM client
    llm = LLMClient()

    # 7. Chatbot
    chatbot = RAGChatbot(retriever, llm, k=5)
    return chatbot

def main():
    chatbot = build_rag_chatbot()
    print("RAG Chatbot ready. Type 'exit' to quit.")
    while True:
        question = input("You: ")
        if question.lower() in {"exit", "quit"}:
            break
        answer, _ctx = chatbot.answer(question)
        print("Bot:", answer)

if __name__ == "__main__":
    main()
At this point, you can copy a few .txt files into data/docs, run python main.py, and start asking questions. The answers will be grounded in your local documents instead of the model's pretraining alone.
Where To Go Next
Once you have this minimal chatbot working, you can gradually improve it:
- Add a simple API layer using FastAPI and Docker to serve the chatbot over HTTP.
- Introduce retrieval-specific evaluation criteria so you can measure improvements instead of guessing.
- Explore multimodal inputs if your documents contain images or mixed media.
- Consider adding tool use and planning on top of retrieval for more autonomous behavior.
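As a starting point for the evaluation bullet above, here is a minimal sketch of one retrieval metric. `hit_rate_at_k` is our own name, and it assumes a small test set where each query has exactly one known-relevant chunk.

```python
from typing import List

def hit_rate_at_k(retrieved: List[List[str]], relevant: List[str], k: int = 5) -> float:
    # Fraction of queries whose expected chunk appears in the top-k results.
    # retrieved[i] is the ranked list of chunks returned for query i;
    # relevant[i] is the single chunk that should have been found.
    hits = sum(
        1 for results, gold in zip(retrieved, relevant) if gold in results[:k]
    )
    return hits / len(relevant)

score = hit_rate_at_k([["a", "b"], ["c", "d"]], ["b", "x"], k=2)
# score == 0.5: the first query's gold chunk was retrieved, the second's was not
```

Even a crude metric like this lets you compare chunk sizes or embedding models with numbers instead of impressions.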
RAG systems are powerful precisely because they are composable. Once you understand each block -- loading, chunking, embedding, storing, retrieving, prompting -- you can iterate quickly and adapt the system to real-world constraints.
Key Takeaways
- A RAG chatbot is mostly plumbing: wiring retrieval and generation together cleanly.
- Start small with local files, simple chunking, sentence-transformers, and FAISS before jumping to complex infrastructure.
- Good chunking has a huge impact on answer quality; experiment with sizes and overlaps.
- Keep components modular: loader, chunker, embeddings, store, retriever, LLM, and chatbot should be separable.
- The LLM is often the easiest part to swap, so design an abstraction like LLMClient from the start.
- Evaluation and monitoring matter once you move beyond experiments.
- Privacy considerations should be addressed early if you ingest sensitive data.