Hélain Zimmermann

Knowledge Graphs Meet LLMs: Structured RAG Architectures

LLMs are good at pattern matching and synthesis, but they still struggle with one thing traditional knowledge systems excel at: structure. Real-world domains have entities, relations, constraints, and provenance. Plain vector-based RAG often flattens all of that into anonymous chunks.

Knowledge graphs provide explicit structure, typed relations, and queryable context. Combining them with LLMs yields structured RAG architectures: retrieval pipelines that know what they are retrieving and why.

In this post I will walk through practical patterns for integrating knowledge graphs with LLM-centric systems, with a focus on production-grade designs.

Why vector-only RAG hits a ceiling

If you have built vanilla vector-based RAG before, you have probably seen some of these issues in practice:

  • Hallucinated relationships, even when the raw facts are present
  • Inconsistent use of domain vocabulary and ontology
  • Poor handling of multi-hop questions ("What depends on X that is used by Y?")
  • Difficulty enforcing business constraints (e.g. legal rules, pricing logic)
  • Limited explainability, because you only retrieve chunks, not structured concepts

Better chunking and hybrid retrieval improve recall and ranking, but what you retrieve is still untyped text.

Knowledge graphs address this by:

  • Explicitly modeling entities (Product, Customer, Contract)
  • Encoding relations (uses, belongs_to, depends_on, contradicts)
  • Capturing provenance (source document, timestamp, confidence)
  • Supporting graph queries (paths, neighborhoods, constraints)

Structured RAG is what you get when you:

  1. Use LLMs to build and maintain a knowledge graph from raw data
  2. Use that graph to drive retrieval and reasoning for LLM responses
  3. Keep vectors in the loop for fuzzy matching and semantic recall

Architectural overview: Structured RAG

At a high level, a structured RAG pipeline augments the usual "query - retrieve - generate" loop with explicit graph steps.

High-level flow

  1. Ingestion

    • Raw documents are chunked and embedded as usual
    • An LLM (or extraction model) identifies entities and relations
    • A knowledge graph store is updated with nodes, edges, and metadata
  2. Query understanding

    • User query is parsed into an intent and graph-centric representation
    • Optionally: a symbolic query is generated, like Cypher or SPARQL
  3. Graph retrieval

    • Graph query fetches relevant entities, relations, and neighborhoods
    • Optional vector search enriches the candidate set with semantically similar nodes or chunks
  4. Context assembly

    • Retrieved graph fragments are turned into structured context (tables, triples, JSON)
    • Relevant text snippets from documents are attached as evidence
  5. LLM generation

    • LLM answers using both the graph context and raw snippets
    • Optionally: LLM outputs an updated graph diff, which is validated and applied

You can implement this with:

  • A graph database (Neo4j, ArangoDB, JanusGraph, or RDF store)
  • A vector database for embedding storage and similarity search
  • An LLM for extraction and generation

The key design choice is how tight the coupling between the LLM and graph is: do you keep the LLM "graph-aware" through structured prompts, or do you use it mostly as a text backend behind a symbolic reasoning layer?

From text to knowledge graph

The ingestion pipeline is where most of the engineering work happens. Here is a practical pattern that has worked well for me.

1. Chunk and embed as usual

Reuse your existing RAG ingestion setup.

from uuid import uuid4
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    doc_id: str
    text: str
    metadata: dict


def chunk_document(doc_id: str, text: str, max_tokens: int = 512) -> list[Chunk]:
    # Placeholder: swap in a real tokenizer; whitespace-split words stand in for tokens here
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_tokens):
        chunk_text = " ".join(words[i:i+max_tokens])
        chunks.append(Chunk(id=str(uuid4()), doc_id=doc_id, text=chunk_text, metadata={}))
    return chunks

Then embed chunks and store them in your vector database.
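As a minimal sketch of that step, assuming a hypothetical `vector_db` client with a batched `upsert` method and an `embed_fn` callable (both stand-ins for your embedding model and vector database of choice):

```python
# Sketch: embed each chunk and upsert it with its provenance payload.
# `vector_db` and `embed_fn` are illustrative, not a specific library API.

def index_chunks(vector_db, embed_fn, chunks) -> int:
    records = [
        {
            "id": c.id,
            "vector": embed_fn(c.text),
            # Keep doc_id and raw text in the payload so retrieval results
            # can be linked back to sources and graph edges later
            "payload": {"doc_id": c.doc_id, "text": c.text, **c.metadata},
        }
        for c in chunks
    ]
    vector_db.upsert(records)  # most vector DBs support batched upserts
    return len(records)
```

Carrying `doc_id` in the payload here is what lets the provenance patterns later in this post work without extra lookups.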

2. LLM-based entity and relation extraction

We use an LLM in structured output mode to turn text into candidate nodes and edges.

import json
from typing import Any

# Note: '''-delimited so the inner """ around the text does not terminate the
# string, and JSON braces doubled so str.format leaves them alone
EXTRACTION_PROMPT = '''
You extract a knowledge graph from text.

Return JSON with:
- entities: [{{"id": string, "type": string, "name": string, "attributes": object}}]
- relations: [{{"source_id": string, "target_id": string, "type": string, "attributes": object}}]

Use stable local IDs within the chunk. Do not invent facts.
Text:
"""{text}"""
'''


def extract_kg_from_chunk(llm_client, chunk: Chunk) -> dict[str, Any]:
    prompt = EXTRACTION_PROMPT.format(text=chunk.text)
    response = llm_client(prompt)  # replace with your LLM call
    # Ideally use function calling or JSON mode
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        # Fallback, or log for review
        data = {"entities": [], "relations": []}
    return data

I strongly recommend using function calling or tools if your provider supports it.
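Whichever output mode you use, validate the structure defensively before it touches the graph store. A stdlib-only sketch (swap in jsonschema or Pydantic if they are already in your stack):

```python
# Structural validation of extraction output: drop malformed entities and
# any relation that references an entity the model never declared.

def validate_kg_data(data: dict) -> dict:
    clean = {"entities": [], "relations": []}
    seen_ids = set()

    for ent in data.get("entities", []):
        if not isinstance(ent, dict):
            continue
        # id, type, and name must all be non-empty strings
        if not all(isinstance(ent.get(k), str) and ent.get(k) for k in ("id", "type", "name")):
            continue
        seen_ids.add(ent["id"])
        ent.setdefault("attributes", {})
        clean["entities"].append(ent)

    for rel in data.get("relations", []):
        if not isinstance(rel, dict):
            continue
        # Reject dangling references to undeclared entities
        if rel.get("source_id") not in seen_ids or rel.get("target_id") not in seen_ids:
            continue
        if not isinstance(rel.get("type"), str):
            continue
        rel.setdefault("attributes", {})
        clean["relations"].append(rel)

    return clean
```

Dropping dangling relations here is cheaper than repairing them downstream, and it keeps one bad extraction from corrupting the graph.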

3. Canonicalization and deduplication

If every chunk creates fresh entities, your graph will explode. You need entity resolution.

Simple pattern:

  1. For each extracted entity, do a vector search over existing entity embeddings
  2. If similarity > threshold (e.g. 0.85) and types match, link to that existing node
  3. Otherwise create a new node

def resolve_or_create_entity(db, embed_fn, entity: dict) -> str:
    # entity: {"type": "Product", "name": "AlphaWidget", ...}
    query_vec = embed_fn(entity["name"])

    # Search in entity index, filtered by type
    candidates = db.search_entities(embedding=query_vec, type=entity["type"], k=5)

    for cand in candidates:
        if cand["score"] > 0.85:
            return cand["id"]

    # Create new graph node
    new_id = db.create_entity(
        type=entity["type"],
        name=entity["name"],
        attributes=entity.get("attributes", {}),
    )

    # Store embedding for future resolution
    db.store_entity_embedding(new_id, query_vec)
    return new_id

This fusion of graph and vector search follows the same principles as hybrid retrieval: combine symbolic precision with semantic flexibility.
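To make the thresholding concrete, here is a toy, self-contained version of the resolve-or-create loop: an in-memory index using a character-bigram embedding and cosine similarity. The embedding and class names are illustrative, not a production recipe; a real system would use a proper embedding model behind a vector index.

```python
# Toy entity resolution: character-bigram vectors + cosine similarity,
# with the same resolve-or-create control flow as the pattern above.
import math
from collections import Counter
from uuid import uuid4

def bigram_embed(name: str) -> Counter:
    s = name.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class InMemoryEntityIndex:
    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self.entities: dict[str, dict] = {}

    def resolve_or_create(self, entity: dict) -> str:
        vec = bigram_embed(entity["name"])
        for eid, stored in self.entities.items():
            # Types must match before a similarity link is allowed
            if stored["type"] == entity["type"] and cosine(vec, stored["vec"]) > self.threshold:
                return eid  # link to the existing canonical node
        new_id = str(uuid4())
        self.entities[new_id] = {"type": entity["type"], "name": entity["name"], "vec": vec}
        return new_id
```

With this, "AlphaWidget" and "alphawidget" resolve to the same canonical node, while a Customer named "AlphaWidget" stays distinct because types gate the match.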

4. Writing to the graph store

Once entities are resolved, you translate chunk-local IDs into global graph IDs and create edges.

def ingest_chunk_to_graph(db, embed_fn, chunk: Chunk, kg_data: dict):
    id_map: dict[str, str] = {}

    # Resolve entities
    for ent in kg_data.get("entities", []):
        global_id = resolve_or_create_entity(db, embed_fn, ent)
        id_map[ent["id"]] = global_id

    # Create relations
    for rel in kg_data.get("relations", []):
        src = id_map.get(rel["source_id"])
        tgt = id_map.get(rel["target_id"])
        if not src or not tgt:
            continue
        db.create_relation(
            source=src,
            target=tgt,
            type=rel["type"],
            attributes={
                **rel.get("attributes", {}),
                "source_chunk_id": chunk.id,
                "doc_id": chunk.doc_id,
            },
        )

At this point, your ingestion pipeline is producing both:

  • Chunk-level embeddings for classic RAG
  • A structured, growing knowledge graph

Query-time: Orchestrating graph and vector retrieval

Once you have a graph, the main challenge is coordination. You need to avoid building a system that is impossible to debug.

A practical design is a small orchestrator function that decides:

  • When to use graph queries vs plain vector search
  • How to construct graph queries from natural language
  • How to merge results into a single, well-structured context

1. Intent and query type detection

Use a lightweight classifier or LLM prompt to tag the query as:

  • factoid (single hop)
  • multi-hop reasoning
  • aggregation / summarization
  • exploratory / open ended

For the first two, the graph is especially valuable.

QUERY_ROUTING_PROMPT = """
Classify the user query into one of: [factoid, multi_hop, aggregation, open_ended].
Return only the label.

Query: "{query}"
"""


def classify_query(llm_client, query: str) -> str:
    resp = llm_client(QUERY_ROUTING_PROMPT.format(query=query)).strip().lower()
    if resp not in {"factoid", "multi_hop", "aggregation", "open_ended"}:
        return "open_ended"
    return resp

2. From natural language to graph query

For graph-friendly queries, you want a symbolic query. You can either:

  • Expose a fixed set of graph query templates
  • Let the LLM generate a Cypher or SPARQL query then validate it

For systems with strong safety or privacy requirements, I usually prefer a template-based or constrained approach.
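The template-based approach can be as simple as a whitelist of parameterized queries keyed by intent. The templates and intent names below are illustrative; the point is that parameters go to the driver, never into string interpolation, so the query shape stays fixed and auditable:

```python
# Whitelisted, parameterized Cypher templates. The LLM (or router) only
# picks an intent and fills parameters; it never writes raw Cypher.
QUERY_TEMPLATES = {
    "downstream_dependents": (
        "MATCH (p)-[:DEPENDS_ON*1..3]->(lib {name: $name}) RETURN p.name"
    ),
    "owner_of": (
        "MATCH (e {name: $name})-[:OWNED_BY]->(o) RETURN o.name"
    ),
}

def build_graph_query(intent: str, params: dict) -> tuple[str, dict]:
    if intent not in QUERY_TEMPLATES:
        raise ValueError(f"no template for intent {intent!r}")
    # Parameters are passed separately for the graph driver to bind
    return QUERY_TEMPLATES[intent], params
```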

Example: generate a Cypher query for multi-hop dependency questions.

CYPHER_PROMPT = """
You translate natural language questions into Cypher, querying a graph with nodes
(type, name) and relationships (type) such as USES, DEPENDS_ON, OWNED_BY.

Return only the Cypher query.

Question: "{query}"
"""


def nl_to_cypher(llm_client, query: str) -> str:
    cypher = llm_client(CYPHER_PROMPT.format(query=query)).strip()
    # Add minimal validation or sandboxing here
    return cypher
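One minimal validation step is a read-only guard that rejects anything capable of mutating the graph. This is defense in depth, not a substitute for running LLM-generated queries under a read-only database user:

```python
# Read-only guard for LLM-generated Cypher. Conservatively blocks CALL
# as well, even though some procedures are read-only.
import re

WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV|CALL)\b",
    re.IGNORECASE,
)

def is_read_only_cypher(cypher: str) -> bool:
    return not WRITE_CLAUSES.search(cypher)
```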

Example orchestrator:

def retrieve_context(db_graph, db_vector, llm_client, embed_fn, query: str) -> dict:
    q_type = classify_query(llm_client, query)

    # 1. Vector search over documents or entity names
    query_vec = embed_fn(query)
    vector_results = db_vector.search(query_vec, k=10)

    graph_results = []

    if q_type in {"factoid", "multi_hop", "aggregation"}:
        cypher = nl_to_cypher(llm_client, query)
        graph_results = db_graph.run_cypher(cypher)

    return {
        "query_type": q_type,
        "vector_results": vector_results,
        "graph_results": graph_results,
    }

3. Building a structured prompt context

You do not want to just paste Cypher rows into the prompt. Instead, translate graph results into a compact, structured representation.

For example:

  • A list of entities with their attributes
  • A list of relations between those entities
  • A short explanation of which paths were used

import textwrap


def format_graph_context(graph_rows: list[dict]) -> str:
    entities = {}
    relations = []

    for row in graph_rows:
        # Assuming Neo4j-driver-style rows: {"n": Node, "m": Node, "r": Relationship},
        # with "type" and "name" stored as node properties
        n = row.get("n")
        m = row.get("m")
        r = row.get("r")
        if n:
            entities[n.element_id] = {"type": n["type"], "name": n["name"], **n}
        if m:
            entities[m.element_id] = {"type": m["type"], "name": m["name"], **m}
        if r:
            relations.append({
                "source": r.start_node.element_id,
                "target": r.end_node.element_id,
                # In the Neo4j driver, the relationship type is an attribute
                # of the Relationship object, not a stored property
                "type": r.type,
            })

    entity_lines = []
    for eid, e in entities.items():
        entity_lines.append(f"- {eid} [{e['type']}]: {e['name']}")

    relation_lines = []
    for rel in relations:
        relation_lines.append(
            f"- {rel['source']} -({rel['type']})-> {rel['target']}"
        )

    # Note: textwrap.dedent would not fix the indentation here, because the
    # joined lines start at column zero while the literal lines would not
    return (
        "Entities:\n" + "\n".join(entity_lines)
        + "\n\nRelations:\n" + "\n".join(relation_lines)
    )

Then combine text snippets from vector retrieval with the structured graph summary.

def build_llm_prompt(query: str, context: dict) -> str:
    vector_snippets = [r["text"] for r in context["vector_results"]]
    text_ctx = "\n\n".join(vector_snippets[:5])

    graph_ctx = format_graph_context(context["graph_results"])

    system_instructions = """
You are an assistant that must answer using only the provided knowledge.
Prefer structured relations from the knowledge graph when reasoning.
If information is missing or inconsistent, say you do not know.
""".strip()

    return f"""{system_instructions}

Question: {query}

Knowledge graph context:
{graph_ctx}

Text snippets:
{text_ctx}
"""

This structure gives you better control and debuggability than dumping raw chunks into a prompt.

Structured RAG patterns that work well in practice

1. Constraint-aware answering

If you encode constraints or rules as nodes and edges (or as a separate rule engine that references graph entities), the LLM can check answers against them.

Pattern:

  1. Retrieve candidate answer entities via graph + vector
  2. Retrieve relevant constraints (e.g. legal rules, eligibility conditions)
  3. Provide both as context and ask the LLM to verify compliance

This is especially useful in regulated domains where hallucinated compliance is unacceptable.
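Steps 2-3 can be made deterministic by keeping the rules as data and checking every candidate before anything reaches the LLM; the rule contents below are illustrative:

```python
# Constraint checks as data: each rule names the entity type it applies to,
# a predicate, and human-readable text to pass to the LLM as context.
CONSTRAINTS = [
    {"id": "age_rule", "applies_to": "Customer",
     "check": lambda e: e.get("age", 0) >= 18,
     "text": "Customers must be 18 or older."},
]

def check_constraints(entity: dict) -> list[dict]:
    verdicts = []
    for rule in CONSTRAINTS:
        if rule["applies_to"] == entity.get("type"):
            verdicts.append({
                "rule_id": rule["id"],
                "passed": rule["check"](entity),
                "text": rule["text"],
            })
    return verdicts
```

The verdicts (including failures) go into the prompt as context, so the LLM explains compliance rather than deciding it.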

2. Multi-hop question answering

Plain RAG often struggles when the answer depends on 2-3 hops: "Which projects depend on a library that has a critical vulnerability?"

Here the graph excels:

  • Encode dependencies as edges
  • Attach vulnerability information to packages or components
  • Run a graph query to find all paths that match the pattern
  • Let the LLM summarize and explain the result set

Your LLM is now summarizing over a precise subgraph instead of trying to discover the structure from raw text at query time.
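The same vulnerability query, sketched as a plain reverse-reachability search over an in-memory adjacency dict; in production this collapses to a one-line variable-length MATCH against the graph store:

```python
# Transitive dependents of a node: reverse the edges, then BFS from the
# target. edges maps each node to the nodes it depends on.
from collections import deque

def dependents_of(edges: dict[str, list[str]], target: str) -> set[str]:
    reverse: dict[str, list[str]] = {}
    for src, dsts in edges.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)

    seen: set[str] = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for dep in reverse.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen
```

The resulting subgraph (not the raw documents) is what the LLM summarizes and explains.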

3. Provenance and explainability

Store provenance metadata on nodes and edges:

  • Source document ID
  • Paragraph or chunk ID
  • Timestamp
  • Confidence score from extraction

At answer time you can:

  • Show which paths contributed to the answer
  • Link back to the original documents and sections
  • Ask the LLM to generate a step-by-step reasoning trace using only those paths

This is crucial when you need human trust in the system.
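Because ingestion stamped every edge with `source_chunk_id` and `doc_id`, turning the relations behind an answer into a citation list is a small deduplication pass:

```python
# Collect unique (doc, chunk) provenance pairs from the relation
# attributes written at ingestion time.
def citations_from_relations(relations: list[dict]) -> list[dict]:
    seen = set()
    cites = []
    for rel in relations:
        attrs = rel.get("attributes", {})
        key = (attrs.get("doc_id"), attrs.get("source_chunk_id"))
        if key != (None, None) and key not in seen:
            seen.add(key)
            cites.append({"doc_id": key[0], "chunk_id": key[1]})
    return cites
```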

Privacy, security, and governance considerations

If you are dealing with sensitive or personal data, knowledge graphs amplify both value and risk.

Some patterns from differential privacy and data minimization transfer directly:

  1. Data minimization: do you really need to store PII as graph nodes, or can you store pseudonymized IDs and keep the mapping in a separate secure service?
  2. Attribute-level access control: graph stores often support fine-grained ACLs. Use them. For example, allow engineers to see service dependency edges but not customer-related nodes.
  3. Differential privacy at query time: for aggregated statistics over the graph (counts, averages), add noise to protect individual nodes, similar to DP for NLP corpora.
  4. Auditability: log which graph queries were generated by the LLM, and which nodes or edges were touched.

In a structured RAG stack, your LLM is no longer the only sensitive component. The graph layer becomes just as important from a security standpoint.

Implementation notes and pitfalls

Do not over-model too early

It is tempting to design a beautiful ontology up front. In practice, start with a minimal schema:

  • 5-10 key entity types
  • 5-15 relation types
  • Basic attributes

Iterate as you observe real queries and failure modes.

Keep graph and text in sync

You now maintain two views of reality:

  • Raw text in your document store + vector DB
  • Structured graph in your graph store

You need:

  • Idempotent ingestion jobs
  • Clear versioning of documents and their extracted graph segments
  • Periodic consistency checks (e.g. sample documents and re-extract to compare)

Evaluate both retrieval and reasoning

Extend your RAG evaluation metrics to structured RAG:

  • Graph coverage: how often are relevant entities and relations present?
  • Path correctness: are the graph paths used by the LLM factually correct?
  • Answer quality: does using the graph improve factual accuracy on multi-hop questions compared to text-only RAG?

A/B test a text-only pipeline vs the structured RAG variant.
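Graph coverage, for instance, reduces to set overlap against a hand-labeled gold set of entity and relation IDs per evaluation question:

```python
# Coverage: fraction of gold entities/relations that retrieval surfaced.
def graph_coverage(retrieved_ids: set[str], gold_ids: set[str]) -> float:
    if not gold_ids:
        return 1.0  # nothing required, trivially covered
    return len(retrieved_ids & gold_ids) / len(gold_ids)
```

Averaging this over an evaluation set gives a retrieval-side metric you can track independently of answer quality.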

Tooling and libraries

Some building blocks that work well in Python setups:

  • networkx for offline prototyping of graph logic
  • Official drivers for Neo4j / other graph DBs
  • LangChain or LangGraph for tool-based or agentic orchestration, where graph querying is a tool and the LLM plans which tools to call

Where this fits in the broader LLM stack

Structured RAG is complementary to other techniques in the LLM stack:

  • Fine-tuning: improves the base model's ability to follow your ontology and relation vocabulary
  • Agentic systems: agents can use the graph as a world model, planning over entities and relations instead of raw text
  • Hybrid search: the graph itself can be indexed with vectors for fuzzy linking of nodes and documents

The core idea does not change: use LLMs for high-level reasoning and generation, but give them explicit, queryable structure to think with.

Key Takeaways

  • Vector-only RAG flattens structure and struggles with multi-hop reasoning, constraints, and explainability, especially at production scale.
  • Knowledge graphs model entities, relations, and provenance explicitly, which pairs naturally with LLM reasoning in structured RAG architectures.
  • A practical structured RAG pipeline has two main loops: LLM-driven graph construction at ingestion time and graph-driven retrieval at query time.
  • Use LLMs for entity and relation extraction, then resolve entities with vector search to avoid graph explosion and maintain canonical nodes.
  • At query time, classify query type, translate natural language into graph queries (e.g. Cypher), then merge graph and vector results into a structured prompt.
  • Graph-aware prompts that expose entities, relations, and provenance lead to more reliable answers and better debuggability than raw chunk dumps.
  • Start with a minimal ontology, keep graph and text views in sync, and evaluate both retrieval and reasoning improvements over text-only RAG.
  • Treat the graph layer as a first-class security and privacy surface, with minimization, access control, and auditing similar to your LLM layer.
  • Structured RAG complements fine-tuning and agentic systems, providing a stable world model that LLMs can plan and reason over effectively.
