SimGRAG: Enhancing Knowledge‑Graph‑Driven Retrieval‑Augmented Generation with Similar Subgraphs


In the era of large language models (LLMs), ensuring that generated text is factual, precise, and contextually rich remains a challenge. Retrieval‑Augmented Generation (RAG) combines the strengths of pretrained LLMs with external knowledge sources to curb hallucination and improve answer quality. SimGRAG introduces a novel twist on RAG: it leverages similar subgraphs from a knowledge graph to guide generation. This post walks through installing, configuring, and using SimGRAG, explains its core ideas in plain language, and highlights its practical benefits.


Table of Contents

  1. Why SimGRAG? – The Motivation

  2. Core Components Overview

  3. Getting Started – Prerequisites

  4. Preparing Datasets

  5. Setting Up Directory Structure

  6. Configuring SimGRAG

  7. Running SimGRAG Pipelines

  8. How SimGRAG Works – High‑Level Flow

  9. Use Cases and Advantages

  10. Tips for Smooth Deployment

  11. Citation and Further Reading

  12. Conclusion


Why SimGRAG? – The Motivation

“What if an AI system could peek into a structured map of knowledge, find patterns that look like your question, and then write answers grounded in that map?”

That is precisely the idea behind SimGRAG. Traditional RAG systems retrieve relevant documents or text passages. SimGRAG goes one step further: it retrieves subgraphs—small clusters of connected entities and relations—from a knowledge graph. By focusing on subgraphs that are structurally similar to the user’s query, the model gains:

  • Better Context: Structural clues guide the model toward the right entities.
  • Reduced Hallucination: Facts come directly from the graph.
  • Improved Multi‑Hop Reasoning: Complex questions spanning multiple relations become easier to answer.

Core Components Overview

SimGRAG is built from three plug‑and‑play modules. You can swap any of them for your preferred alternative, as long as the interfaces stay consistent.

Large Language Model

The generation engine for SimGRAG is a large language model. By default, SimGRAG uses Llama 3 70B, a 70‑billion‑parameter model known for strong generative capabilities. To run it locally:

# Install Ollama (see https://ollama.com/)
# Then start the Llama 3 70B model:
ollama run llama3:70b

# Launch the local server used by SimGRAG:
bash ollama_server.sh

You may replace this with another model—just update the configuration accordingly.
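
As a quick sanity check that the server is reachable, you can send a test prompt to Ollama's REST API, which listens on port 11434 by default. This is a minimal sketch; the prompt is illustrative, and SimGRAG builds its own prompts from the prompts/ directory.

import requests

# Minimal reachability check against Ollama's default local endpoint.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:70b", "prompt": "Say hello.", "stream": False},
)
print(resp.json()["response"])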

Embedding Model

To compare user queries with graph elements, we convert symbols (nodes and relations) into numerical vectors. SimGRAG uses the Nomic-Embed-Text-V1 model from Hugging Face.

# In the project root:
mkdir -p data/raw
cd data/raw

# Clone the embedding model (git-lfs is needed to fetch the weight files):
git clone https://huggingface.co/nomic-ai/nomic-embed-text-v1

This model captures semantic similarity between text descriptions of entities or relations.
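
As an illustration, here is how those vectors can be produced with the sentence-transformers library. This is a sketch, not SimGRAG's internal code: the local path assumes the clone above, trust_remote_code=True is required by this model, and Nomic's embedding models expect task prefixes such as search_query: and search_document: on their inputs.

from sentence_transformers import SentenceTransformer

# Load the locally cloned model; trust_remote_code is required here.
model = SentenceTransformer("data/raw/nomic-embed-text-v1", trust_remote_code=True)

# Nomic embedding models expect a task prefix on each input text.
query_vec = model.encode("search_query: Which actors starred in Movie A?")
fact_vecs = model.encode([
    "search_document: Movie A | starred_actors | Actor X",
    "search_document: Movie A | directed_by | Director Y",
])
print(query_vec.shape)  # (768,) for this model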

Vector Database

Vectors need fast similarity search. SimGRAG relies on Milvus, an open‑source vector database optimized for this task. After installing Milvus (see https://milvus.io/), start its service so SimGRAG can read and write embeddings.
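
To confirm Milvus is up before running the pipelines, a toy round trip with the pymilvus client looks like this. The collection name and fields are illustrative; SimGRAG's indexing scripts create their own collections.

from pymilvus import MilvusClient

# Connect to a local Milvus instance (default port 19530).
client = MilvusClient(uri="http://localhost:19530")

# Quick-setup collection; 768 matches nomic-embed-text-v1's output size.
if not client.has_collection("demo_entities"):
    client.create_collection(collection_name="demo_entities", dimension=768)

client.insert(
    collection_name="demo_entities",
    data=[{"id": 0, "vector": [0.0] * 768, "text": "Movie A"}],
)

hits = client.search(collection_name="demo_entities", data=[[0.0] * 768], limit=3)
print(hits)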


Getting Started – Prerequisites

Before diving in, ensure you have:

  • A machine with a modern Linux distribution or macOS.
  • Sufficient GPU memory if you run large models locally.
  • Python 3.8+ and pip installed.
  • Installed dependencies (run pip install -r requirements.txt in the project root after cloning).
  • Ollama for LLM serving.
  • Milvus for vector search.
  • Git access to Hugging Face for the embedding model.

Preparing Datasets

SimGRAG supports two public knowledge‑graph benchmarks: MetaQA (for multi‑hop question answering) and FactKG (for fact verification). You must download each dataset manually and place it in the raw data folder.

MetaQA Dataset

  1. Visit the MetaQA repository:
    https://github.com/yuyuz/MetaQA
  2. Download or clone the data into data/raw/MetaQA.

FactKG Dataset

  1. Visit the FactKG repository:
    https://github.com/jiho283/FactKG
  2. Place the downloaded files in data/raw/FactKG.

Having both folders under data/raw ensures SimGRAG’s scripts can locate and index the data.


Setting Up Directory Structure

After completing downloads, your project structure should look like:

SimGraphRAG
├── data
│   └── raw
│       ├── nomic-embed-text-v1
│       ├── MetaQA
│       └── FactKG
├── configs
├── pipeline
├── prompts
└── src

  • data/raw: Contains the embedding model and datasets.
  • configs: Holds JSON or YAML files to set parameters.
  • pipeline: Scripts for indexing and querying each dataset.
  • prompts: Templates used by the LLM during generation.
  • src: Core code modules (vector indexing, retrieval logic, utility functions).
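
Before moving on, a short script can confirm the layout matches the tree above (paths are relative to the project root):

from pathlib import Path

# Check that the folders from the tree above exist before indexing.
expected = ["data/raw/nomic-embed-text-v1", "data/raw/MetaQA", "data/raw/FactKG",
            "configs", "pipeline", "prompts", "src"]
for rel in expected:
    print(f"{rel}: {'ok' if Path(rel).is_dir() else 'MISSING'}")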

Configuring SimGRAG

Before you run any pipeline, edit the files in configs/:

  • model_name: Name or path of your LLM server.
  • embedding_path: File path to the cloned embedding model.
  • vector_db: Connection settings for Milvus (host, port, collection name).
  • output_filename: Where final results will be saved (e.g., results/FactKG_query.txt).

Proper configuration ensures all components talk to each other smoothly.
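
For illustration only, here is what such a config might look like, written from Python. The file name and exact key names are hypothetical; match them to the schema of the files actually shipped in configs/.

import json

# Hypothetical config covering the fields listed above.
config = {
    "model_name": "llama3:70b",
    "embedding_path": "data/raw/nomic-embed-text-v1",
    "vector_db": {"host": "localhost", "port": 19530, "collection": "factkg"},
    "output_filename": "results/FactKG_query.txt",
}

with open("configs/factkg_example.json", "w") as f:
    json.dump(config, f, indent=2)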


Running SimGRAG Pipelines

Once setup is complete, move into the pipeline directory to start indexing and querying.

MetaQA Workflow

MetaQA evaluates performance on one‑hop, two‑hop, and three‑hop questions.

cd pipeline

# Build vector index from MetaQA graph
python metaQA_index.py

# Run one‑hop question answering
python metaQA_query1hop.py

# Run two‑hop question answering
python metaQA_query2hop.py

# Run three‑hop question answering
python metaQA_query3hop.py

Each script:

  1. Loads graph facts.
  2. Embeds nodes and relations.
  3. Builds or loads a Milvus index.
  4. Retrieves similar subgraphs for each query.
  5. Feeds subgraph context + query to the LLM via prompts/.
  6. Writes answers and correctness flags to output files.

FactKG Workflow

FactKG focuses on factual verification. Its pipeline is simpler:

cd pipeline

# Index the FactKG graph
python factKG_index.py

# Run fact verification queries
python factKG_query.py

After completion, check the file defined by output_filename in your config. Each line is a JSON‑style dictionary with these fields (a short scoring sketch follows the list):

  • question: The input question or statement.
  • retrieved_subgraphs: The subgraphs Milvus returned.
  • generated_answer: The LLM’s output.
  • correct: A Boolean flag indicating ground‑truth match.
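
Because every line is a self-contained record, scoring a run takes only a few lines. This sketch assumes the lines parse as JSON and that the file path matches your output_filename setting.

import json

# Tally the correctness flags from the per-line result records.
total = correct = 0
with open("results/FactKG_query.txt") as f:
    for line in f:
        record = json.loads(line)  # if lines are Python-style dicts, use ast.literal_eval
        total += 1
        correct += bool(record["correct"])

print(f"accuracy: {correct / total:.3f} ({correct}/{total})")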

How SimGRAG Works – High‑Level Flow

  1. User Query
    A question like “Which actors starred in both Movie A and Movie B?”

  2. Subgraph Retrieval

    • Embed the query into vector space (via Nomic‑Embed).
    • Search Milvus for the top‑K similar entity/relation nodes.
    • Assemble connected nodes into candidate subgraphs.
  3. Prompt Construction

    • Turn subgraph facts into a text template.
    • Append the user query.
    • Ensure prompts stay within token limits.
  4. Generation

    • Send prompt to LLM (Llama 3 70B via Ollama).
    • Receive human‑readable answer grounded in graph facts.
  5. Evaluation & Logging

    • Compare against ground‑truth (for benchmarks).
    • Save detailed records for analysis.

This flow tightly couples structured retrieval with freeform generation.
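
To make steps 2–4 concrete, here is a compressed, illustrative sketch that wires the earlier components together. The function, collection name, and prompt template are assumptions for illustration, not SimGRAG's actual API, and the subgraph-assembly step is simplified to treating retrieved matches directly as facts.

import requests
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("data/raw/nomic-embed-text-v1", trust_remote_code=True)
milvus = MilvusClient(uri="http://localhost:19530")

def answer(question: str, collection: str = "kg_entities", top_k: int = 5) -> str:
    # Step 2: embed the query and retrieve the top-K similar graph elements.
    qvec = embedder.encode(f"search_query: {question}").tolist()
    hits = milvus.search(collection_name=collection, data=[qvec],
                         limit=top_k, output_fields=["text"])
    facts = [h["entity"]["text"] for h in hits[0]]

    # Step 3: turn the retrieved facts into a grounded prompt.
    prompt = "Facts:\n" + "\n".join(facts) + f"\n\nQuestion: {question}\nAnswer:"

    # Step 4: send the prompt to the local Llama 3 server via Ollama.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3:70b", "prompt": prompt, "stream": False},
    )
    return resp.json()["response"]

print(answer("Which actors starred in both Movie A and Movie B?"))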


Use Cases and Advantages

  • Multi-Hop QA: Structural matching narrows search to relevant hops, boosting answer quality.
  • Fact Verification: Grounded subgraphs reduce model hallucination in true/false assessments.
  • Information Completion: Missing links in a graph can be inferred from similar patterns.
  • Explainable AI: Subgraphs offer clear, interpretable rationales behind answers.

By tapping into similar subgraphs, SimGRAG helps LLMs maintain both factual accuracy and reasoning depth.


Tips for Smooth Deployment

  • GPU Resources: Large models and embedding computations benefit from a dedicated GPU.
  • Milvus Scaling: Monitor memory usage; consider sharding large graphs.
  • Prompt Design: Keep templates concise to avoid token overflow.
  • Logging: Enable verbose mode during initial tests to troubleshoot misalignments.
  • Component Swaps: If you prefer another embedding model or vector store, adjust config files—no code rewrite needed.

Citation and Further Reading

If you adopt SimGRAG in your research or projects, please cite:

@inproceedings{simgrag2025,
    title = "{SimGRAG}: Leveraging Similar Subgraphs for Knowledge Graphs Driven Retrieval-Augmented Generation",
    author = "Cai, Yuzheng and Guo, Zhenyue and Pei, Yiwen and Bian, Wanrui and Zheng, Weiguo",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.163/",
    pages = "3139--3158",
    ISBN = "979-8-89176-256-5"
}

Conclusion

SimGRAG offers a structured, extensible way to strengthen retrieval‑augmented generation by tapping into subgraph similarity. Its modular design—spanning LLMs, embedding models, and vector databases—lets you tailor each piece to your needs. Whether you’re solving benchmark QA tasks or building real‑world fact‑grounded assistants, SimGRAG provides a clear, reproducible path from graph data to high‑quality answers.

Dive in, experiment with different components, and unlock more reliable, explainable AI with SimGRAG.


Happy experimenting!