SimGRAG: Enhancing Knowledge‑Graph‑Driven Retrieval‑Augmented Generation with Similar Subgraphs
In the era of large language models (LLMs), ensuring that generated text is factual, precise, and contextually rich remains a challenge. Retrieval‑Augmented Generation (RAG) combines the strengths of pretrained LLMs with external knowledge sources to overcome hallucination and improve answer quality. SimGRAG introduces a novel twist on RAG: it leverages similar subgraphs from a knowledge graph to guide generation. This post walks through every step of installing, configuring, and using SimGRAG, explains its core ideas in clear, non‑technical language, and highlights its practical benefits.
Table of Contents
1. Why SimGRAG? – The Motivation
2. Core Components Overview
   2.1 Large Language Model
   2.2 Embedding Model
   2.3 Vector Database
3. Getting Started – Prerequisites
4. Preparing Datasets
   4.1 MetaQA Dataset
   4.2 FactKG Dataset
5. Setting Up Directory Structure
6. Configuring SimGRAG
7. Running SimGRAG Pipelines
   7.1 MetaQA Workflow
   7.2 FactKG Workflow
8. How SimGRAG Works – High‑Level Flow
9. Use Cases and Advantages
10. Tips for Smooth Deployment
11. Citation and Further Reading
12. Conclusion
Why SimGRAG? – The Motivation
“What if an AI system could peek into a structured map of knowledge, find patterns that look like your question, and then write answers grounded in that map?”
That is precisely the idea behind SimGRAG. Traditional RAG systems retrieve relevant documents or text passages. SimGRAG goes one step further: it retrieves subgraphs—small clusters of connected entities and relations—from a knowledge graph. By focusing on subgraphs that are structurally similar to the user’s query, the model gains:
- Better Context: Structural clues guide the model toward the right entities.
- Reduced Hallucination: Facts come directly from the graph.
- Improved Multi‑Hop Reasoning: Complex questions spanning multiple relations become easier to answer.
Core Components Overview
SimGRAG is built on three interchangeable, plug‑and‑play modules. You can swap each component for your preferred alternatives, as long as interfaces remain consistent.
Large Language Model
The generation engine for SimGRAG is a large language model. By default, SimGRAG uses Llama 3 70B, a 70‑billion‑parameter model known for strong generative capabilities. To run it locally:
# Install Ollama (see https://ollama.com/)
# Then start the Llama 3 70B model:
ollama run llama3:70b
# Launch the local server used by SimGRAG:
bash ollama_server.sh
You may replace this with another model—just update the configuration accordingly.
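Once the server script is running, you can sanity‑check the model with a direct request. Below is a minimal sketch against Ollama’s standard REST API (the default port 11434 and the /api/generate route are Ollama conventions, not part of SimGRAG):
# Minimal sketch: confirm the local Ollama server answers before wiring it
# into SimGRAG. Uses Ollama's /api/generate endpoint on its default port.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:70b", "prompt": "Reply with one word: ready?", "stream": False},
    timeout=300,
)
print(resp.json()["response"])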
Embedding Model
To compare user queries with graph elements, we convert symbols (nodes and relations) into numerical vectors. SimGRAG uses the Nomic-Embed-Text-V1 model from Hugging Face.
# In the project root:
mkdir -p data/raw
cd data/raw
# Clone the embedding model (the weights are stored via Git LFS, so install git-lfs first):
git lfs install
git clone https://huggingface.co/nomic-ai/nomic-embed-text-v1
This model captures semantic similarity between text descriptions of entities or relations.
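As a quick illustration, here is a minimal sketch of embedding text with the cloned model through the sentence-transformers library. The trust_remote_code flag and the search_query:/search_document: prefixes follow the model’s Hugging Face card; install sentence-transformers first if you want to try it:
# Minimal sketch: embed a query and a graph fact with nomic-embed-text-v1.
# The model expects task prefixes such as "search_query:" and "search_document:".
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("data/raw/nomic-embed-text-v1", trust_remote_code=True)
query_vec = model.encode("search_query: which actors starred in Movie A?")
fact_vec = model.encode("search_document: Movie A | starred_actors | Actor X")
print(query_vec.shape)  # a 768-dimensional vector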
Vector Database
Vectors need fast similarity search. SimGRAG relies on Milvus, an open‑source vector database optimized for this task. After installing Milvus (see https://milvus.io/), start its service so SimGRAG can read and write embeddings.
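To verify Milvus is reachable, you can create a throwaway collection and run a search. This is a generic pymilvus sketch; the collection name and 768‑dimensional vectors are illustrative, and SimGRAG’s own scripts manage their collections for you:
# Minimal sketch: connect to a local Milvus instance, insert one vector,
# and confirm similarity search works via the pymilvus MilvusClient API.
import random
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")
client.create_collection(collection_name="smoke_test", dimension=768)

vec = [random.random() for _ in range(768)]
client.insert(collection_name="smoke_test", data=[{"id": 0, "vector": vec}])

hits = client.search(collection_name="smoke_test", data=[vec], limit=1)
print(hits)  # the inserted vector should be its own nearest neighbor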
Getting Started – Prerequisites
Before diving in, ensure you have:
- A machine with a modern Linux distribution or macOS.
- Sufficient GPU memory if you run large models locally.
- Python 3.8+ and pip installed.
- Installed dependencies (run pip install -r requirements.txt in the project root after cloning).
- Ollama for LLM serving.
- Milvus for vector search.
- Git access to Hugging Face for the embedding model.
Preparing Datasets
SimGRAG supports two public knowledge‑graph benchmarks: MetaQA (for multi‑hop question answering) and FactKG (for fact verification). You must download each dataset manually and place it in the raw data folder.
MetaQA Dataset
- Visit the MetaQA repository: https://github.com/yuyuz/MetaQA
- Download or clone the data into data/raw/MetaQA.
FactKG Dataset
- Visit the FactKG repository: https://github.com/jiho283/FactKG
- Place the downloaded files in data/raw/FactKG.
Having both folders under data/raw ensures SimGRAG’s scripts can locate and index the data.
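A quick way to confirm everything landed in the right place is to check the paths from Python; the paths below are exactly the ones described in this section:
# Minimal sketch: verify the raw-data layout SimGRAG expects.
from pathlib import Path

for path in ["data/raw/nomic-embed-text-v1", "data/raw/MetaQA", "data/raw/FactKG"]:
    print(path, "OK" if Path(path).is_dir() else "MISSING")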
Setting Up Directory Structure
After completing downloads, your project structure should look like:
SimGraphRAG
├── data
│ └── raw
│ ├── nomic-embed-text-v1
│ ├── MetaQA
│ └── FactKG
├── configs
├── pipeline
├── prompts
└── src
- data/raw: Contains the embedding model and datasets.
- configs: Holds JSON or YAML files to set parameters.
- pipeline: Scripts for indexing and querying each dataset.
- prompts: Templates used by the LLM during generation.
- src: Core code modules (vector indexing, retrieval logic, utility functions).
Configuring SimGRAG
Before you run any pipeline, edit the files in configs/:
- model_name: Name or path of your LLM server.
- embedding_path: File path to the cloned embedding model.
- vector_db: Connection settings for Milvus (host, port, collection name).
- output_filename: Where final results will be saved (e.g., results/FactKG_query.txt).
Proper configuration ensures all components talk to each other smoothly.
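As an illustration, a JSON config might look like the sketch below. Only the field names come from the list above; every value is a placeholder invented for this example, so adapt them to your setup:
{
  "model_name": "llama3:70b",
  "embedding_path": "data/raw/nomic-embed-text-v1",
  "vector_db": {
    "host": "localhost",
    "port": 19530,
    "collection": "simgrag_facts"
  },
  "output_filename": "results/FactKG_query.txt"
}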
Running SimGRAG Pipelines
Once setup is complete, move into the pipeline directory to start indexing and querying.
MetaQA Workflow
MetaQA evaluates performance on one‑hop, two‑hop, and three‑hop questions.
cd pipeline
# Build vector index from MetaQA graph
python metaQA_index.py
# Run one‑hop question answering
python metaQA_query1hop.py
# Run two‑hop question answering
python metaQA_query2hop.py
# Run three‑hop question answering
python metaQA_query3hop.py
Each script:
- Loads graph facts.
- Embeds nodes and relations.
- Builds or loads a Milvus index.
- Retrieves similar subgraphs for each query.
- Feeds subgraph context + query to the LLM via prompts/.
- Writes answers and correctness flags to output files.
FactKG Workflow
FactKG focuses on factual verification. Its pipeline is simpler:
cd pipeline
# Index the FactKG graph
python factKG_index.py
# Run fact verification queries
python factKG_query.py
After completion, check the file defined by output_filename in your config. Each line is a JSON‑style dictionary:
- question: The input question or statement.
- retrieved_subgraphs: The subgraphs Milvus returned.
- generated_answer: The LLM’s output.
- correct: A Boolean flag indicating ground‑truth match.
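To aggregate results, you can parse this file and compute accuracy. A minimal sketch, assuming each line parses as JSON (if the lines turn out to be Python‑style dicts, swap json.loads for ast.literal_eval):
# Minimal sketch: read the per-query records and report overall accuracy.
import json

with open("results/FactKG_query.txt") as f:  # the path set in output_filename
    records = [json.loads(line) for line in f if line.strip()]

accuracy = sum(r["correct"] for r in records) / len(records)
print(f"{len(records)} queries, accuracy: {accuracy:.2%}")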
How SimGRAG Works – High‑Level Flow
1. User Query – A question like “Which actors starred in both Movie A and Movie B?”
2. Subgraph Retrieval
   - Embed the query into vector space (via Nomic‑Embed).
   - Search Milvus for the top‑K similar entity/relation nodes.
   - Assemble connected nodes into candidate subgraphs.
3. Prompt Construction (a toy sketch follows this list)
   - Turn subgraph facts into a text template.
   - Append the user query.
   - Ensure prompts stay within token limits.
4. Generation
   - Send the prompt to the LLM (Llama 3 70B via Ollama).
   - Receive a human‑readable answer grounded in graph facts.
5. Evaluation & Logging
   - Compare against ground truth (for benchmarks).
   - Save detailed records for analysis.
This flow tightly couples structured retrieval with freeform generation.
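To make step 3 concrete, here is a toy sketch of turning retrieved triples into a prompt. The template wording and the build_prompt function are invented for this post, not taken from SimGRAG’s prompts/ directory:
# Toy sketch: render retrieved subgraph triples plus the user query into a
# single prompt string (step 3 above). Real templates live in prompts/.
def build_prompt(subgraphs, question):
    facts = "\n".join(
        f"- {head} --{relation}--> {tail}"
        for subgraph in subgraphs
        for head, relation, tail in subgraph
    )
    return (
        "Answer the question using only these knowledge-graph facts:\n"
        f"{facts}\n\nQuestion: {question}\nAnswer:"
    )

print(build_prompt([[("Movie A", "starred_actors", "Actor X")]],
                   "Which actors starred in Movie A?"))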
Use Cases and Advantages
By tapping into similar subgraphs, SimGRAG helps LLMs maintain both factual accuracy and reasoning depth.
Tips for Smooth Deployment
- GPU Resources: Large models and embedding computations benefit from a dedicated GPU.
- Milvus Scaling: Monitor memory usage; consider sharding large graphs.
- Prompt Design: Keep templates concise to avoid token overflow (a simple truncation sketch follows below).
- Logging: Enable verbose mode during initial tests to troubleshoot misalignments.
- Component Swaps: If you prefer another embedding model or vector store, adjust config files—no code rewrite needed.
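For the prompt‑design tip, one simple safeguard is to cap how many facts reach the template. The truncate_facts helper below is hypothetical and uses a naive word count as a stand‑in for tokens; a real deployment would count tokens with the model’s tokenizer:
# Naive sketch: keep adding facts until a rough word budget is exhausted.
# Word counts only approximate tokens; use a real tokenizer in production.
def truncate_facts(facts, max_words=1500):
    kept, used = [], 0
    for fact in facts:
        words = len(fact.split())
        if used + words > max_words:
            break
        kept.append(fact)
        used += words
    return kept

print(len(truncate_facts(["Movie A stars Actor X."] * 1000, max_words=50)))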
Citation and Further Reading
If you adopt SimGRAG in your research or projects, please cite:
@inproceedings{simgrag2025,
title = "{SimGRAG}: Leveraging Similar Subgraphs for Knowledge Graphs Driven Retrieval-Augmented Generation",
author = "Cai, Yuzheng and Guo, Zhenyue and Pei, Yiwen and Bian, Wanrui and Zheng, Weiguo",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.163/",
pages = "3139--3158",
ISBN = "979-8-89176-256-5"
}
Conclusion
SimGRAG offers a structured, extensible way to strengthen retrieval‑augmented generation by tapping into subgraph similarity. Its modular design—spanning LLMs, embedding models, and vector databases—lets you tailor each piece to your needs. Whether you’re solving benchmark QA tasks or building real‑world fact‑grounded assistants, SimGRAG provides a clear, reproducible path from graph data to high‑quality answers.
Dive in, experiment with different components, and unlock more reliable, explainable AI with SimGRAG.
Happy experimenting!