Hélain Zimmermann

Building a RAG Chatbot from Scratch with Python

Most people meet RAG systems through polished products: a chat window, a clean UI, answers that reference internal docs. Behind that is a simple idea -- give an LLM the right context at the right time. In this guide, we will build a minimal but real RAG chatbot in Python, step by step.

What We Are Building

At a high level, our chatbot will:

  1. Ingest a small set of documents
  2. Split them into chunks
  3. Convert chunks to embeddings
  4. Store embeddings in a simple vector store
  5. At query time, retrieve the most relevant chunks
  6. Feed those chunks into an LLM prompt to answer the user

We will use Python, open-source tools, and a simple in-memory vector store so you can run everything locally.

Tech Stack

  • Python 3.10+
  • sentence-transformers for embeddings
  • faiss-cpu as vector index
  • transformers or an API client for the LLM

Step 1 -- Setting Up the Environment

Install the dependencies:

pip install "sentence-transformers>=3.0.0" faiss-cpu "transformers>=4.38.0" torch

If you use an external LLM API (OpenAI, Anthropic, etc.), install the relevant SDK instead of, or in addition to, transformers.

A basic project layout:

rag_chatbot/
  data/
    docs/
      doc1.txt
      doc2.txt
  rag/
    __init__.py
    loader.py
    chunker.py
    embeddings.py
    store.py
    retriever.py
    llm.py
    chatbot.py
  main.py

We will not fill every file exhaustively, but this shows how you might organize a growing RAG codebase.

Step 2 -- Loading and Chunking Documents

Retrieval performance lives and dies by chunking. Good chunks are:

  • small enough to fit in the context window comfortably
  • large enough to preserve meaning
  • aligned with natural document structure when possible

For a beginner system, a simple character-based chunking with overlap works fine.

Create rag/loader.py:

from pathlib import Path
from typing import List


def load_documents(folder_path: str) -> List[str]:
    folder = Path(folder_path)
    docs = []
    for path in folder.glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        docs.append(text)
    return docs

Create rag/chunker.py:

from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    text_length = len(text)

    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunks.append(text[start:end])
        if end == text_length:
            break
        start = end - overlap  # step back by `overlap` so adjacent chunks share context

    return chunks


def chunk_documents(docs: List[str], chunk_size: int = 500, overlap: int = 100) -> List[str]:
    all_chunks = []
    for doc in docs:
        all_chunks.extend(chunk_text(doc, chunk_size, overlap))
    return all_chunks

This is intentionally simple. In production, you will likely want token-based chunking and structure-aware splitting.
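
As a sketch of what token-based chunking might look like, here is a variant that counts whitespace-separated tokens instead of characters. The whitespace split is a stand-in: in a real system you would count tokens with your embedding model's tokenizer, and `chunk_text_by_tokens` is an illustrative name, not part of the files above.

```python
from typing import List


def chunk_text_by_tokens(text: str, chunk_size: int = 120, overlap: int = 20) -> List[str]:
    """Split text into chunks of roughly `chunk_size` whitespace tokens.

    Whitespace splitting is a stand-in for a real tokenizer; swap in the
    tokenizer of your embedding model for production use.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(" ".join(tokens[start:end]))
        if end == len(tokens):
            break
        start = end - overlap
    return chunks
```

The guard against overlap >= chunk_size matters: without it the loop could step backwards and never terminate.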

Step 3 -- Embedding Text Chunks

Embeddings convert text into numerical vectors that capture semantic meaning. There is a tradeoff between quality, latency, and privacy when choosing an embedding model.

For a local beginner setup, we will use sentence-transformers with an open model.

Create rag/embeddings.py:

from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer


class EmbeddingModel:
    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def encode(self, texts: List[str]) -> np.ndarray:
        # Returns an array of shape (num_texts, embedding_dim)
        embeddings = self.model.encode(texts, convert_to_numpy=True, show_progress_bar=True)
        return embeddings

This gives us a reusable embedding component. If you later move to an API-based embedding model, you can keep the same interface.
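
To show what keeping the interface stable buys you, here is a toy drop-in with the same `encode` signature that derives deterministic pseudo-random vectors from a hash of the text. It is useless for real retrieval quality, but it lets you exercise the store and retriever in tests without downloading a model; the class name and dimension are illustrative choices, not part of the codebase above.

```python
import hashlib
from typing import List

import numpy as np


class HashEmbeddingModel:
    """Deterministic toy embedder with the same interface as EmbeddingModel.

    Not semantically meaningful -- useful only for wiring and unit tests.
    """

    def __init__(self, dim: int = 32):
        self.dim = dim

    def encode(self, texts: List[str]) -> np.ndarray:
        vectors = []
        for text in texts:
            # Derive a reproducible vector from a hash of the text
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            seed = int.from_bytes(digest[:8], "big")
            rng = np.random.default_rng(seed)
            vectors.append(rng.standard_normal(self.dim))
        return np.array(vectors, dtype="float32")
```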

Step 4 -- Building a Simple Vector Store with FAISS

Vector databases such as Pinecone, Weaviate or Milvus add scalability and durability, but for a small chatbot you can start with faiss in memory.

Create rag/store.py:

from typing import List, Tuple
import faiss
import numpy as np


class VectorStore:
    def __init__(self, dim: int):
        self.index = faiss.IndexFlatL2(dim)
        self.texts: List[str] = []

    def add(self, embeddings: np.ndarray, texts: List[str]):
        if embeddings.ndim == 1:
            embeddings = embeddings.reshape(1, -1)
        self.index.add(embeddings.astype("float32"))
        self.texts.extend(texts)

    def search(self, query_embedding: np.ndarray, k: int = 5) -> List[Tuple[str, float]]:
        if query_embedding.ndim == 1:
            query_embedding = query_embedding.reshape(1, -1)
        distances, indices = self.index.search(query_embedding.astype("float32"), k)

        results = []
        for idx, dist in zip(indices[0], distances[0]):
            if idx == -1:
                continue
            text = self.texts[idx]
            results.append((text, float(dist)))
        return results

FAISS's IndexFlatL2 ranks results by Euclidean (L2) distance. Many sentence-embedding models are trained with cosine similarity in mind; if you L2-normalize vectors before adding them to the index, ranking by L2 distance becomes exactly equivalent to ranking by cosine similarity, so this is a cheap upgrade rather than an approximation. For small experiments, unnormalized L2 is usually fine too.
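
A quick numpy check of the normalization trick: for unit vectors, ||a - b||^2 = 2 - 2*cos(a, b), so ranking by smallest L2 distance is the same as ranking by largest cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(384)
b = rng.standard_normal(384)

# Normalize to unit length, as you would before adding vectors to the index
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = float(a @ b)
l2_squared = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
assert abs(l2_squared - (2 - 2 * cosine)) < 1e-6
```

This is why calling faiss.normalize_L2 on your vectors before indexing effectively turns an IndexFlatL2 search into a cosine-similarity search.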

Step 5 -- Wiring Up a Retriever

The retriever is the component that, given a query, finds the most relevant chunks. It combines the embedding model and the vector store.

Create rag/retriever.py:

from typing import List, Tuple
import numpy as np

from .embeddings import EmbeddingModel
from .store import VectorStore


class Retriever:
    def __init__(self, embedding_model: EmbeddingModel, vector_store: VectorStore):
        self.embedding_model = embedding_model
        self.vector_store = vector_store

    def add_documents(self, chunks: List[str]):
        embeddings = self.embedding_model.encode(chunks)
        embeddings = np.array(embeddings)
        self.vector_store.add(embeddings, chunks)

    def retrieve(self, query: str, k: int = 5) -> List[Tuple[str, float]]:
        query_emb = self.embedding_model.encode([query])
        query_emb = np.array(query_emb)
        results = self.vector_store.search(query_emb, k=k)
        return results

Notice how each piece is small and composable. This composability becomes important as your RAG system grows in complexity.
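
To see that composability without any heavy dependencies, here is the same retrieve-by-distance flow with toy stand-ins: a bag-of-words "embedder" over a five-word vocabulary and a brute-force numpy search in place of faiss. Every name here is illustrative; it mirrors the shape of Retriever.retrieve, not its implementation.

```python
from typing import List, Tuple

import numpy as np

VOCAB = ["python", "faiss", "embedding", "retrieval", "chatbot"]


def toy_encode(texts: List[str]) -> np.ndarray:
    """Bag-of-words over a tiny fixed vocabulary -- illustration only."""
    return np.array(
        [[text.lower().count(word) for word in VOCAB] for text in texts],
        dtype="float32",
    )


def brute_force_search(
    query: np.ndarray, corpus: np.ndarray, texts: List[str], k: int = 2
) -> List[Tuple[str, float]]:
    """Rank corpus rows by squared L2 distance to the query, smallest first."""
    distances = np.sum((corpus - query) ** 2, axis=1)
    order = np.argsort(distances)[:k]
    return [(texts[i], float(distances[i])) for i in order]


texts = ["faiss is a vector index", "python chatbot with retrieval", "embedding models"]
corpus = toy_encode(texts)
query = toy_encode(["how does retrieval work in a chatbot"])[0]
results = brute_force_search(query, corpus, texts, k=1)
print(results[0][0])  # the chunk mentioning both retrieval and chatbot ranks first
```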

Step 6 -- Adding the LLM Layer

For the LLM, you have two main choices:

  • Local open-source model via transformers and PyTorch
  • Hosted API such as OpenAI, Anthropic, etc.

For beginners and laptops, an API is often easier. For privacy-sensitive applications, the choice matters more since you may be sending user data to a third party.

Below is an example using a generic transformers model, but you can replace the LLMClient with your own API client.

Create rag/llm.py:

from typing import List
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


class LLMClient:
    def __init__(self, model_name: str = "microsoft/Phi-3-mini-4k-instruct"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # float16 assumes a GPU; on CPU, use torch.float32 instead
        self.model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
        self.model.eval()

    @torch.no_grad()
    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt")
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.2,
        )
        # Decode only the newly generated tokens; slicing the decoded string
        # with len(prompt) is fragile because tokenization round-trips are not exact
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

If you prefer an API model, keep the interface:

class LLMClient:
    def __init__(self, api_key: str):
        self.api_key = api_key
        # init SDK client

    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        # call your provider here
        return "mock answer"

This separation lets you switch models easily, which is useful when you start evaluating retrieval quality.

Step 7 -- Building the Chatbot Orchestrator

Now we tie everything together into a simple chatbot class that:

  1. Takes a user question
  2. Retrieves top-k relevant chunks
  3. Builds a prompt with those chunks as context
  4. Calls the LLM
  5. Returns the answer

Create rag/chatbot.py:

from typing import List, Tuple

from .retriever import Retriever
from .llm import LLMClient


SYSTEM_PROMPT = """You are a helpful assistant.
Use only the provided context to answer the question.
If the answer is not in the context, say you do not know.
Provide concise, clear answers.
"""


def build_prompt(question: str, context_chunks: List[str]) -> str:
    context_text = "\n\n".join(context_chunks)
    prompt = f"{SYSTEM_PROMPT}\n\nContext:\n{context_text}\n\nQuestion: {question}\nAnswer:"
    return prompt


class RAGChatbot:
    def __init__(self, retriever: Retriever, llm: LLMClient, k: int = 5):
        self.retriever = retriever
        self.llm = llm
        self.k = k

    def answer(self, question: str) -> Tuple[str, List[str]]:
        results = self.retriever.retrieve(question, k=self.k)
        context_chunks = [text for text, _dist in results]
        prompt = build_prompt(question, context_chunks)
        answer = self.llm.generate(prompt)
        return answer, context_chunks

Step 8 -- Wiring Everything Up in main.py

Now we instantiate the components, index our documents, and run a simple chat loop.

Create main.py:

from rag.loader import load_documents
from rag.chunker import chunk_documents
from rag.embeddings import EmbeddingModel
from rag.store import VectorStore
from rag.retriever import Retriever
from rag.llm import LLMClient
from rag.chatbot import RAGChatbot

import numpy as np


def build_rag_chatbot() -> RAGChatbot:
    # 1. Load documents
    docs = load_documents("data/docs")

    # 2. Chunk documents
    chunks = chunk_documents(docs, chunk_size=500, overlap=100)
    print(f"Loaded {len(docs)} docs and created {len(chunks)} chunks")

    # 3. Embedding model
    embedding_model = EmbeddingModel()

    # 4. Vector store
    sample_emb = embedding_model.encode(["test"])
    dim = np.array(sample_emb).shape[1]
    vector_store = VectorStore(dim=dim)

    # 5. Retriever
    retriever = Retriever(embedding_model, vector_store)
    retriever.add_documents(chunks)

    # 6. LLM client
    llm = LLMClient()

    # 7. Chatbot
    chatbot = RAGChatbot(retriever, llm, k=5)
    return chatbot


def main():
    chatbot = build_rag_chatbot()

    print("RAG Chatbot ready. Type 'exit' to quit.")
    while True:
        question = input("You: ")
        if question.lower() in {"exit", "quit"}:
            break
        answer, _ctx = chatbot.answer(question)
        print("Bot:", answer)


if __name__ == "__main__":
    main()

At this point, you can copy a few .txt files into data/docs, run python main.py, and start asking questions. The answers will be grounded in your local documents instead of the model's pretraining alone.

Where To Go Next

Once you have this minimal chatbot working, you can gradually improve it:

  • Add a simple API layer using FastAPI and Docker to serve the chatbot over HTTP.
  • Introduce retrieval-specific evaluation criteria so you can measure improvements instead of guessing.
  • Explore multimodal inputs if your documents contain images or mixed media.
  • Consider adding tool use and planning on top of retrieval for more autonomous behavior.

RAG systems are powerful precisely because they are composable. Once you understand each block -- loading, chunking, embedding, storing, retrieving, prompting -- you can iterate quickly and adapt the system to real-world constraints.

Key Takeaways

  • A RAG chatbot is mostly plumbing: wiring retrieval and generation together cleanly.
  • Start small with local files, simple chunking, sentence-transformers, and FAISS before jumping to complex infrastructure.
  • Good chunking has a huge impact on answer quality; experiment with sizes and overlaps.
  • Keep components modular: loader, chunker, embeddings, store, retriever, LLM, and chatbot should be separable.
  • The LLM is often the easiest part to swap, so design an abstraction like LLMClient from the start.
  • Evaluation and monitoring matter once you move beyond experiments.
  • Privacy considerations should be addressed early if you ingest sensitive data.
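
On the abstraction point: one lightweight way to pin down the LLMClient interface is a typing.Protocol, so any backend with a matching generate method type-checks as a valid client. The names below (SupportsGenerate, EchoLLM) are illustrative suggestions, not part of the code above.

```python
from typing import Protocol


class SupportsGenerate(Protocol):
    """Structural type: anything with this generate signature is an LLM client."""

    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        ...


class EchoLLM:
    """Trivial stand-in backend, handy for tests."""

    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        return prompt[-max_new_tokens:]


def answer_with(llm: SupportsGenerate, prompt: str) -> str:
    # Accepts any object satisfying the protocol: local model, API client, or stub
    return llm.generate(prompt)
```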
