Building a Smart Q&A System from Scratch: A Practical Guide to Agentic RAG with LangGraph

Agentic RAG for Dummies Logo

Have you ever wished for a document Q&A assistant that understands conversation context, asks for clarification when things are ambiguous, and can handle complex questions in parallel, much like a human would? Today, we will dive deep into how to build a production-ready intelligent Q&A system using 「Agentic RAG (Agent-driven Retrieval-Augmented Generation)」 and the 「LangGraph」 framework. This article is not just a tutorial; it’s a blueprint for the next generation of human-computer interaction.

Why Are Existing RAG Systems Not Enough?

Before we begin, let’s examine the current state of most RAG (Retrieval-Augmented Generation) tutorials. They typically demonstrate the basics: upload a document, split the text, build an index, and then answer questions. That sounds good, but in practice, you quickly hit limitations:

  • 「Context Amnesia」: Each question-answer interaction is isolated. The system cannot remember what you just asked, leading to disjointed conversations.
  • 「“Garbage In, Garbage Out”」: If a question is vague or ambiguous, the system attempts an answer anyway, often with poor results.
  • 「Black-and-White Retrieval」: You retrieve either small, precise snippets that lack surrounding context, or large text blocks padded with irrelevant information.
  • 「Inflexible Architecture」: The entire pipeline is often hard-coded, making it difficult to customize for your specific needs, such as legal or medical domains.
  • 「Single-Threaded Thinking」: When faced with a complex query containing multiple sub-questions, the system can only process them linearly, which is inefficient.

So, is there a better solution? Absolutely. This article introduces a modern RAG system that integrates 「conversational memory, intelligent query clarification, hierarchical retrieval, and parallel multi-agent processing」. It’s like equipping your document library with a patient, thoughtful research assistant who can multitask.

Understanding Our Core Tools: Agentic RAG and LangGraph

First, let’s clarify some key concepts:

  • 「What is RAG?」 Think of it as an AI taking an “open-book exam.” When asked a question, it doesn’t answer solely from memory (its pre-trained knowledge). Instead, it first “consults” the documents you provide (retrieval), then combines the found information (augmentation) to formulate an answer (generation).
  • 「What does “Agentic” mean?」 This means the system is empowered with autonomous decision-making and action capabilities. It is no longer a passive Q&A machine but an intelligent agent that can actively judge: “Is this question clear?”, “Do I need to look up more context?”, “How many parts does this question contain?”
  • 「The Role of LangGraph」: A library from the LangChain ecosystem, designed specifically for building 「stateful, multi-step agent workflows」. Think of it as a way to lay out an agent’s thinking, decision-making, and action loops explicitly as a graph of nodes and edges.

The following diagram summarizes the workflow of the system we are about to build:
Agentic RAG Workflow Diagram

The System Core: Demystifying the Four Intelligent Stages

Our intelligent Q&A system decomposes the processing of a user query into four tightly coupled intelligent stages. Let’s examine each one.

Stage One: Conversation Understanding – Giving AI a “Memory”

Imagine you ask an assistant: “How do I install SQL?” After getting the answer, you immediately follow up with: “How do I update it?” A good assistant should understand that “it” refers to SQL. This is the function of 「conversational memory」.

Our system continuously analyzes the conversation history, extracting key topics, mentioned entities, and unresolved questions, condensing them into a one or two-sentence summary. This summary doesn’t mechanically repeat all chat logs but focuses on the core context. It is then used to disambiguate subsequent questions. For example, “How do I update it?” combined with context is rewritten into the clear query: “How do I update the SQL software?”
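
To make this concrete, here is a minimal sketch of such a summarization step, assuming an llm chat model has already been initialized (as shown later in the article); the prompt wording and the summarize_history name are illustrative, not the project’s exact code:

from langchain_core.messages import SystemMessage

def summarize_history(messages) -> str:
    """Condense the chat history into a one- or two-sentence context summary."""
    prompt = (
        "Summarize the conversation so far in at most two sentences. "
        "Keep key topics, named entities, and any unresolved questions."
    )
    # The summary is later fed into query rewriting to resolve references like "it"
    response = llm.invoke([SystemMessage(content=prompt)] + list(messages))
    return response.content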

Stage Two: Query Clarification – Starting by “Understanding the Question”

Not all questions are born clear. In this stage, the system acts like a meticulous interviewer, performing a deep analysis and optimization of the user query (a code sketch follows this list):

  1. 「Resolves References」: Automatically links pronouns like “it” or “that thing” to explicitly mentioned entities in the context.
  2. 「Splits Complex Questions」: Automatically decomposes a question like “What are JavaScript and Python?” into two independent queries: “What is JavaScript?” and “What is Python?”, preparing them for parallel processing.
  3. 「Identifies Ambiguous Queries」: Detects nonsensical, abusive, or overly broad questions.
  4. 「Requests Human Clarification (Key Feature)」: When the system determines the query intent is unclear, it 「actively pauses the workflow」 and asks the user for more details via the interface. This embodies the “Human-in-the-Loop” concept, effectively preventing futile retrievals based on misunderstood queries.
  5. 「Optimizes for Retrieval」: Rewrites colloquial questions (e.g., “Tell me about blockchain”) into keyword-rich statements better suited for retrieval (e.g., “basic principles and definition of blockchain technology”).
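
One common way to implement this stage is to ask the LLM for a structured analysis and to pause the graph with LangGraph’s interrupt mechanism when clarification is needed. The sketch below is an illustration under assumptions, not the project’s exact schema: the field names are made up, and it presumes the chosen chat model supports structured output.

from pydantic import BaseModel, Field
from langgraph.types import interrupt

class QueryAnalysis(BaseModel):
    """Structured result of the clarification stage (illustrative field names)."""
    needs_clarification: bool = Field(description="True if the query is too vague to answer")
    clarifying_question: str = Field(default="", description="Question to ask the user if needed")
    sub_queries: list[str] = Field(default_factory=list, description="Rewritten, retrieval-friendly sub-queries")

def analyze_query(state):
    analysis = llm.with_structured_output(QueryAnalysis).invoke(state["messages"])
    if analysis.needs_clarification:
        # Pause the workflow and wait for a human reply (Human-in-the-Loop)
        user_reply = interrupt({"question": analysis.clarifying_question})
        # Feed the clarification back in so the analysis can run again
        return {"messages": [("user", user_reply)]}
    return {"sub_queries": analysis.sub_queries}

Resuming after an interrupt requires compiling the graph with a checkpointer, which LangGraph also uses to persist the message history between turns.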

Stage Three: Intelligent Retrieval – A Duet of Precision and Breadth

This is the system’s “look up information” phase, employing a clever 「hierarchical indexing」 strategy and a powerful 「multi-agent parallel processing」 architecture.

「Hierarchical Indexing: The Right Tool for the Job」
During document preprocessing, we split each document twice:

  • 「Parent Chunks」: Large sections divided based on Markdown headers (H1, H2, H3). They retain complete logic and context.
  • 「Child Chunks」: Smaller, fixed-length fragments further split from each parent chunk. They are compact and precise, ideal for search pinpointing.

During retrieval, the system first searches the 「child chunks」 to obtain high-precision matching snippets. If the information seems insufficient, it then uses the child snippet’s associated parent_id to fetch the complete 「parent chunk」, thereby gaining ample background information. It’s like using a search engine to find a specific paragraph in a relevant article, then opening the entire article for an in-depth read.
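
Expressed as a rough sketch, the retrieval-time flow looks like this, assuming two helper functions named after the tools defined later in the article (the sufficiency check is an illustrative heuristic; in the real system the agent’s own reasoning loop decides when to fetch the parent):

def retrieve_with_context(query: str, k: int = 5) -> list[dict]:
    # 1. High-precision search over the small child chunks
    children = search_child_chunks(query, k=k)
    # 2. If the snippets look too thin, fetch the full parent sections for richer context
    if sum(len(c["content"]) for c in children) < 1000:  # illustrative "is this enough?" threshold
        parent_ids = [c["parent_id"] for c in children if c.get("parent_id")]
        return retrieve_parent_chunks(parent_ids)
    return children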

「Multi-Agent Map-Reduce: Parallelizing Thought」
When Stage Two identifies multiple sub-questions (either explicitly asked by the user or decomposed from one complex question), the system activates the “multi-agent parallel processing” mode.

Using LangGraph’s Send API, the system 「simultaneously spawns an independent agent sub-graph for each sub-question」. Each agent independently executes the complete retrieval-evaluation-answering workflow. Finally, all agent answers are aggregated into one unified response.

「For example」: When processing “Explain the similarities and differences between JavaScript and Python,” it might activate two agents in parallel—one specializing in JavaScript, the other in Python—and finally merge their findings for comparison. This significantly speeds up processing compound questions.
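
Here is a minimal sketch of how that fan-out can be expressed with the Send API; the state keys and node names are illustrative:

from langgraph.types import Send

def fan_out_to_agents(state):
    """Conditional edge: spawn one agent sub-graph run per sub-question."""
    # Each Send routes an independent copy of the work to the "agent" node (the sub-graph)
    return [Send("agent", {"question": q}) for q in state["sub_queries"]]

# In the main graph: builder.add_conditional_edges("analyze_query", fan_out_to_agents, ["agent"])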

Stage Four: Response Generation – From Information to Answer

In this final stage, the system synthesizes all retrieved information (which may come from a single agent or from the aggregation of multiple parallel agents) to generate a coherent, accurate, and targeted final answer to the user’s original question. For parallel processing results, a dedicated “aggregation” node is responsible for merging sub-answers, deduplicating, and organizing them into a well-structured response.
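
In LangGraph, this aggregation is typically backed by a reducer on the shared state: every parallel agent appends its answer to a list, and the aggregation node merges that list into the final reply. A sketch with illustrative field names, assuming the llm object from the setup step:

import operator
from typing import Annotated, TypedDict

class OrchestratorState(TypedDict):
    sub_queries: list[str]
    # operator.add tells LangGraph to concatenate the answers returned by parallel agents
    sub_answers: Annotated[list[str], operator.add]
    final_answer: str

def aggregate_answers(state: OrchestratorState):
    # Merge the per-question answers and let the LLM write one coherent, deduplicated response
    combined = "\n\n".join(state["sub_answers"])
    final = llm.invoke(f"Combine these partial answers into one coherent, non-redundant response:\n\n{combined}")
    return {"final_answer": final.content}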

Hands-On Practice: Step-by-Step Construction of Your Intelligent Q&A System

Enough theory. Let’s see how to implement this with code. The following is a simplified step-by-step overview. Complete code can be found in the project linked at the end.

Step One: Environment and Model Configuration – Choosing Your “Brain”

The system is designed to be 「model provider agnostic」. You can freely choose the “brain” based on your needs:

  • 「Local Deployment (Recommended for Development)」: Use 「Ollama」 to run open-source models (like qwen3:4b). It’s completely free but requires local computing resources.

    ollama pull qwen3:4b-instruct-2507-q4_K_M
    
    from langchain_ollama import ChatOllama
    llm = ChatOllama(model="qwen3:4b-instruct-2507-q4_K_M", temperature=0)
    
  • 「Cloud API (Recommended for Production)」: Use 「Google Gemini」, 「OpenAI GPT」, or 「Anthropic Claude」. They are powerful and charge based on usage.

    from langchain_google_genai import ChatGoogleGenerativeAI
    import os
    os.environ["GOOGLE_API_KEY"] = "your-api-key"
    llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash-exp", temperature=0)
    

「Only this single line of LLM initialization code needs to be changed. All other workflow code remains exactly the same.」

Step Two: Document Processing – From PDF to Knowledge Network

The foundation of any RAG system is high-quality document processing. Our pipeline begins by converting PDFs into structured Markdown text.

import pymupdf
import pymupdf4llm
from pathlib import Path

def pdf_to_markdown(pdf_path, output_dir):
    # Use the PyMuPDF4LLM library to convert the PDF into structured Markdown
    md = pymupdf4llm.to_markdown(pymupdf.open(pdf_path))
    # Save the Markdown file for the chunking step that follows
    out_path = Path(output_dir) / (Path(pdf_path).stem + ".md")
    out_path.write_text(md, encoding="utf-8")

After conversion, we proceed to the aforementioned 「hierarchical indexing」 stage (a splitting sketch follows the list):

  1. Use MarkdownHeaderTextSplitter to split by headers, generating 「parent chunks」.
  2. Use RecursiveCharacterTextSplitter to split each parent chunk into smaller 「child chunks」.
  3. Store the vectors of all child chunks (including both 「dense vectors」 for semantic understanding and 「sparse vectors」 for keyword matching) into the 「Qdrant」 vector database.
  4. Save the parent chunks as JSON files to the local filesystem, linked to child chunks via parent_id.
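
A minimal sketch of steps 1 and 2 above, using LangChain’s text splitters; the chunk sizes are illustrative and worth tuning for your documents:

import uuid
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

parent_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)

parent_chunks, child_chunks = [], []
for parent in parent_splitter.split_text(markdown_text):  # markdown_text comes from pdf_to_markdown
    parent_id = str(uuid.uuid4())
    parent_chunks.append({"id": parent_id, "content": parent.page_content, "metadata": parent.metadata})
    for child in child_splitter.split_text(parent.page_content):
        # Each child remembers its parent so the full section can be fetched at answer time
        child_chunks.append({"content": child, "parent_id": parent_id})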

Step Three: Defining the Agent’s “Skills” – Retrieval Tools

We equip the agent with two core tool functions:

  • search_child_chunks(query): Performs hybrid search in the vector database, returning the most relevant child chunks.
  • retrieve_parent_chunks(parent_ids): Extracts complete parent chunks from the filesystem based on the parent_id carried by the child chunks.

from typing import List
from langchain_core.tools import tool

@tool
def search_child_chunks(query: str, k: int = 5) -> List[dict]:
    """Search for the top K most relevant child chunks (hybrid dense + sparse search)."""
    # vector_store is assumed to be a Qdrant-backed store configured for hybrid retrieval
    return [{"content": d.page_content, "parent_id": d.metadata.get("parent_id")}
            for d in vector_store.similarity_search(query, k=k)]
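
A companion sketch of the parent-fetch tool, assuming the parent chunks were saved as JSON files named by their parent_id (the directory path is hypothetical):

import json
from pathlib import Path
from typing import List
from langchain_core.tools import tool

PARENT_DIR = Path("data/parent_chunks")  # hypothetical location of the saved parent chunks

@tool
def retrieve_parent_chunks(parent_ids: List[str]) -> List[dict]:
    """Load the full parent chunks that the given child chunks point to."""
    parents = []
    for pid in set(parent_ids):  # deduplicate so each parent is loaded only once
        path = PARENT_DIR / f"{pid}.json"
        if path.exists():
            parents.append(json.loads(path.read_text(encoding="utf-8")))
    return parents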

Step Four: Building the Intelligent Workflow – LangGraph in Action

This is the most crucial part. We use LangGraph to define two levels of graphs:

「1. Agent Sub-graph (Processes a Single Question)」
This is a simple “think-act” loop:

Start -> Think (call LLM to judge if search is needed) -> [If tools are needed] -> Execute Tool (search/fetch parent) -> Return results to Think node -> [If info is sufficient] -> Extract Final Answer -> End
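
This loop maps naturally onto LangGraph’s prebuilt ReAct-style pattern. The sketch below assumes the two retrieval tools from Step Three and the llm from Step One; node names are illustrative:

from langgraph.graph import StateGraph, MessagesState, START
from langgraph.prebuilt import ToolNode, tools_condition

tools = [search_child_chunks, retrieve_parent_chunks]
llm_with_tools = llm.bind_tools(tools)

def think(state: MessagesState):
    # The LLM either answers directly or emits tool calls to search / fetch parent chunks
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

builder = StateGraph(MessagesState)
builder.add_node("think", think)
builder.add_node("tools", ToolNode(tools))
builder.add_edge(START, "think")
builder.add_conditional_edges("think", tools_condition)  # routes to "tools" or ends the loop
builder.add_edge("tools", "think")
agent_subgraph = builder.compile()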

「2. Main Orchestrator Graph (Coordinates the Entire Session)」
This is the system’s central brain. The flow is as follows:

New Message -> Summarize Conversation History -> Analyze and Rewrite Query -> {Is query clear?}
         |
         |-- Unclear -> Pause, wait for human clarification -> Return to analysis node
         |
         |-- Clear -> Split into N sub-questions -> [Send in parallel to N Agent Sub-graphs] -> Wait for all sub-graphs to finish -> Aggregate all answers -> Generate Final Reply -> End

The beauty of this architecture is that 「parallelism kicks in only when needed」. A single question follows the single-agent path, while complex multi-part questions automatically fan out to parallel agents, making full use of available resources.

Step Five: Creating the Interactive Interface – Conversing with Your Assistant

Finally, we use 「Gradio」 to quickly build a web interface with conversation history.

import gradio as gr
from langchain_core.messages import HumanMessage

# A fixed thread_id lets the LangGraph checkpointer persist conversation memory across turns
config = {"configurable": {"thread_id": "gradio-session"}}

def chat_with_agent(message, history):
    # Feed the user message into the LangGraph workflow; prior turns are kept by the graph's checkpointer
    result = agent_graph.invoke({"messages": [HumanMessage(content=message)]}, config)
    # Return the assistant's final response
    return result["messages"][-1].content

# Create the chat interface
demo = gr.ChatInterface(fn=chat_with_agent)
demo.launch()

Once running, you’ll see a clean chat window in your browser where you can have intelligent conversations with your document library.

How to Get Started?

We provide three pathways to suit different needs, from quick experimentation to production deployment.

Option 1: Quick Experience (Google Colab Notebook)

Best for beginners and those wanting a quick taste. Simply click the “Open in Colab” badge on the project’s main page, upload your PDF documents as instructed in the notebook, and run the code cells sequentially. You’ll see results within minutes.

Option 2: Local Development (Complete Python Project)

This is the best choice if you want to delve deeper, customize features, or integrate it into your own application.

# 1. Clone the project
git clone https://github.com/your-repo/agentic-rag-for-dummies.git
cd agentic-rag-for-dummies

# 2. Create a virtual environment and install dependencies
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate  # Windows
pip install -r requirements.txt

# 3. Place your PDF documents in the `docs/` folder

# 4. Run the application
python app.py
# Then open http://localhost:7860 in your browser

Option 3: Docker One-Click Deployment

Suitable for users who want quick deployment or a unified environment. Ensure Docker has at least 8GB of memory allocated.

# Build the image
docker build -f project/Dockerfile -t agentic-rag .

# Run the container
docker run --name rag-assistant -p 7860:7860 agentic-rag
# Access http://localhost:7860

Common Issues and Tuning Guide

You might encounter some typical situations while building and using the system. Here is a quick troubleshooting table:

  • 「Answers ignore documents or fabricate information」
    Possible causes: the LLM is not capable enough, or the system prompt does not enforce retrieval.
    Suggested fixes: switch to a more capable model (e.g., >7B parameters); check and strengthen the rag_agent_prompt so it explicitly requires searching before answering.
  • 「Relevant documents are not retrieved」
    Possible causes: query rewriting drifts from the original intent; chunk size is inappropriate; the embedding model does not match the domain.
    Suggested fixes: adjust the query_analysis_prompt to rein in excessive “creativity”; increase the child chunk size (chunk_size) to improve recall; try a domain-specific embedding model.
  • 「Answers are fragmented and lack context」
    Possible cause: parent chunks are too small to provide sufficient background.
    Suggested fix: increase max_parent_size in document_chunker.py.
  • 「Slow processing speed」
    Possible causes: local model inference is slow; too many chunks make retrieval slow.
    Suggested fixes: consider a cloud API model; increase chunk size to reduce the total number of chunks.
  • 「Parallel processing not triggered」
    Possible cause: the query analysis node failed to split the question.
    Suggested fix: check the splitting rules in get_query_analysis_prompt().

Conclusion: What Have You Gained?

Through the journey of this article, you have not merely learned about a RAG project; you have acquired a modular blueprint for building a 「next-generation interactive Q&A system」. The core value of this system lies in:

  • 「More Human-like Interaction」: Possesses memory, knows when to clarify, and conducts natural, fluid conversations.
  • 「Smarter Retrieval」: Hierarchical indexing balances precision and breadth; parallel processing improves efficiency for complex problem-solving.
  • 「Powerful Customizability」: Every component—from the LLM and embedding models to workflow nodes—can be swapped as needed, easily adapting to different vertical domains like law, medicine, or technology.
  • 「Production-Ready Architecture」: Provides a complete path from an experimental notebook to a modular project and Docker deployment.

The ultimate purpose of technology is to serve people. What this Agentic RAG system does is bridge the gap between information and people by helping machines better understand human intent and context. Now it’s your turn to get hands-on, assemble these modules, and inject intelligent life into your own knowledge base.


The project “Agentic RAG for Dummies” described in this article is released under the MIT open-source license. All code can be freely used for learning and commercial projects. Questions and code contributions are welcome in the project repository.