Datacapsule: A Multi-Path Retrieval Solution Based on Knowledge Graphs
In the era of information explosion, finding useful information from a vast amount of data has become a challenge for everyone. Datacapsule, a multi-path retrieval solution based on knowledge graphs, offers a new approach to this problem.
What is Datacapsule?
Datacapsule is a solution that uses multi-path retrieval technology to achieve precise knowledge retrieval. It covers various functional modules such as retrieval systems, entity relation extraction, entity attribute extraction, entity linking, structured database construction, and question-answering systems.
Core Advantages of Datacapsule
Compared to traditional knowledge graph construction and retrieval methods, Datacapsule has the following advantages:
- Efficient graph construction: With optimized algorithms and models, Datacapsule can quickly build knowledge graphs.
- Accurate retrieval: Based on multi-path retrieval technology, Datacapsule can flexibly select retrieval strategies according to the type of user question.
- Multi-round dialogue support: By tracking context across conversation turns, Datacapsule can better grasp user needs and provide more coherent and accurate answers.
Main Functions of Datacapsule
Knowledge Graph and Structured Database Construction
Datacapsule uses DSPy, which also drives intent recognition, to extract entities and relationships and build graph information. It then converts that graph information into structured records and stores them in a database.
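As a rough illustration of the "graph information to structured information" step, the sketch below pivots extracted triples into one record per entity. The triple layout, the function name triples_to_rows, and the placeholder values are assumptions made for this example; they are not taken from the Datacapsule code.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed layout: (subject entity, relation or attribute name, object or value).
Triple = Tuple[str, str, str]


def triples_to_rows(triples: Iterable[Triple]) -> List[Dict[str, str]]:
    """Pivot triples into one dict per subject entity, one key per predicate."""
    rows: Dict[str, Dict[str, str]] = {}
    for subject, predicate, value in triples:
        # Create the record the first time a subject is seen.
        record = rows.setdefault(subject, {"entity": subject})
        # A repeated predicate overwrites the earlier value in this simple sketch.
        record[predicate] = value
    return list(rows.values())


# Placeholder triples; real ones would come from the extraction step.
triples = [
    ("entity_a", "type", "species"),
    ("entity_a", "habitat", "placeholder habitat"),
    ("entity_b", "type", "species"),
]
rows = triples_to_rows(triples)
```

Each resulting dict could then be written to a relational table, with the predicates acting as column names.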
Knowledge Graph Storage and Management
Using NetworkX, Datacapsule implements knowledge graph storage and management, supporting dynamic construction and querying of entity relationships.
Vector Database Retrieval
Datacapsule integrates a lightweight vector database based on NanoVector, enabling efficient semantic similarity retrieval.
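To make "semantic similarity retrieval" concrete, here is a minimal pure-Python sketch of cosine-similarity top-k search. It deliberately does not use NanoVector's API; the in-memory dict store, the function names, and the toy three-dimensional vectors are all assumptions for illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # A zero vector has no direction; treat it as dissimilar.
    return dot / (norm_a * norm_b)


def top_k(query: List[float], store: Dict[str, List[float]], k: int = 1) -> List[Tuple[str, float]]:
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; a real system would obtain these from an embedding model.
store = {
    "doc_hagfish": [0.9, 0.1, 0.0],
    "doc_eel": [0.7, 0.3, 0.1],
    "doc_unrelated": [0.0, 0.1, 0.9],
}
best = top_k([1.0, 0.0, 0.0], store, k=1)
```

With k=1 only the single closest document is returned; raising k widens the recall at the cost of more context passed to the model.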
Multi-Path Retrieval Method Based on Graphs
This is the core function of Datacapsule. It pairs graph-based retrieval with a Chain-of-Thought reasoning system and supports multi-round dialogue context understanding, forming a complete reasoning and querying system.
When a user initiates a query, the system first determines whether the entity in the question exists in the knowledge graph. If not, it directly uses vector retrieval to obtain answers. If it does, the system further determines the question type (entity query, relationship query, attribute query, or statistical query) and adopts corresponding retrieval strategies.
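The routing decision described above can be sketched as follows. This is a simplified stand-in under stated assumptions: entity matching is a plain substring check, the question type is passed in rather than inferred by the model, and the strategy names (including routing statistical queries to Text2SQL) are hypothetical labels, not identifiers from the project.

```python
from enum import Enum
from typing import Optional, Set


class QuestionType(Enum):
    ENTITY = "entity"
    RELATIONSHIP = "relationship"
    ATTRIBUTE = "attribute"
    STATISTICAL = "statistical"


# Hypothetical strategy labels, one per question type.
STRATEGY_BY_TYPE = {
    QuestionType.ENTITY: "graph_entity_lookup",
    QuestionType.RELATIONSHIP: "graph_relation_lookup",
    QuestionType.ATTRIBUTE: "graph_attribute_lookup",
    QuestionType.STATISTICAL: "text2sql",
}


def find_known_entity(question: str, graph_entities: Set[str]) -> Optional[str]:
    """Return the longest graph entity mentioned in the question, if any."""
    lowered = question.lower()
    for entity in sorted(graph_entities, key=len, reverse=True):
        if entity.lower() in lowered:
            return entity
    return None


def choose_strategy(question: str, graph_entities: Set[str], question_type: QuestionType) -> str:
    """Fall back to vector retrieval when no graph entity is mentioned."""
    if find_known_entity(question, graph_entities) is None:
        return "vector_retrieval"
    return STRATEGY_BY_TYPE[question_type]


entities = {"Taiwan hagfish", "Slime eel", "hagfish family"}
```

For example, choose_strategy("What is the Taiwan hagfish?", entities, QuestionType.ENTITY) selects the graph entity lookup, while a question about something outside the graph falls through to vector retrieval regardless of its type.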
Real-Time Communication and Status Synchronization
Datacapsule uses WebSocket for real-time message pushing, supporting stream-based dialogue responses and real-time feedback of optimizer status.
Model Optimizer
Datacapsule supports model optimization based on user feedback, with version management and rollback capabilities, and provides a visualization of the optimization process.
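Version management with rollback can be pictured as an append-only history plus a pointer to the active entry. The sketch below is one possible shape, assuming each optimized program is summarized by a prompt template and a score; the class and field names are hypothetical and do not reflect how Datacapsule stores its versions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProgramVersion:
    version: int
    prompt_template: str
    score: float


class VersionRegistry:
    """Append-only history of optimized programs with an 'active' pointer."""

    def __init__(self) -> None:
        self._history: List[ProgramVersion] = []
        self._active_index = -1  # -1 means nothing has been committed yet.

    def commit(self, prompt_template: str, score: float) -> ProgramVersion:
        """Store a new version and make it the active one."""
        entry = ProgramVersion(len(self._history) + 1, prompt_template, score)
        self._history.append(entry)
        self._active_index = len(self._history) - 1
        return entry

    @property
    def active(self) -> ProgramVersion:
        if self._active_index < 0:
            raise LookupError("no version has been committed")
        return self._history[self._active_index]

    def rollback(self, version: int) -> ProgramVersion:
        """Re-activate an earlier version without deleting later ones."""
        for index, entry in enumerate(self._history):
            if entry.version == version:
                self._active_index = index
                return entry
        raise KeyError(f"unknown version: {version}")
```

Because rollback only moves the pointer, a later version that turned out worse stays in the history and can still be inspected or re-activated.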
Database Management System
Datacapsule uses SQLite to store user interaction records, supports batch processing of vector data, and has data version control capabilities.
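A minimal sketch of storing interaction records with Python's built-in sqlite3 module is shown below. The table name, columns, and helper functions are assumptions for illustration; the actual Datacapsule schema may differ.

```python
import sqlite3
from typing import List, Optional, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS interactions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            question TEXT NOT NULL,
            answer TEXT NOT NULL,
            feedback TEXT,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def record_interaction(conn: sqlite3.Connection, question: str, answer: str,
                       feedback: Optional[str] = None) -> int:
    """Insert one question/answer pair and return its row id."""
    with conn:  # Commits on success, rolls back on error.
        cursor = conn.execute(
            "INSERT INTO interactions (question, answer, feedback) VALUES (?, ?, ?)",
            (question, answer, feedback),
        )
    return cursor.lastrowid


def interactions_with_feedback(conn: sqlite3.Connection) -> List[Tuple[str, str, str]]:
    """Return only the records a user has rated, oldest first."""
    return conn.execute(
        "SELECT question, answer, feedback FROM interactions "
        "WHERE feedback IS NOT NULL ORDER BY id"
    ).fetchall()
```

Records that carry feedback are the natural input for the optimizer, which is why the sketch exposes them through a separate query.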
Front-End Interaction Interface
The front-end interface of Datacapsule is built with React 18 + Vite, offering features like real-time dialogue windows, user question collection, reasoning process display, and optimization progress exhibition.
System Monitoring and Logging
Based on loguru, Datacapsule provides a hierarchical logging system for performance monitoring, error tracking, and API call statistics.
Environment Configuration Management
Datacapsule supports multiple LLM model configurations, flexible environment variable management, and multi-environment deployment.
Technical Framework and System Architecture
Front-End Technology Stack
- Development Languages: JavaScript + TypeScript
- Front-End Framework: React 18 + Vite
- UI Framework: TailwindCSS
- Build Tool: Vite
- Real-Time Communication: WebSocket Client
Back-End Technology Stack
- Development Language: Python (Version 3.8+ recommended)
- Web Framework: FastAPI
- Databases:
  - Structured Data: SQLite
  - Vector Database: NanoVector
  - Graph Structure Storage: NetworkX
- Knowledge Extraction:
  - Entity & Relationship Extraction: DSPy + CoT
- AI Models:
  - Embedding Models: Various configurations supported
  - Large Language Models: Supports OpenAI/DeepSeek, etc.
- Development Tools:
  - Dependency Management: pip
  - Environment Management: python-dotenv
  - Logging System: loguru
System Architecture
- Front-Back End Separation Architecture
- WebSocket Real-Time Communication
- Hybrid Recall of Vector Retrieval + Graph Retrieval + Text2SQL
- DSPy Intent Understanding and Reasoning
Getting Started with Datacapsule
Clone the Repository
First, clone the Datacapsule backend service:
git clone https://github.com/loukie7/Datacapsule.git
Then clone the frontend service from the Datacapsule-webui repository.
Install Dependencies
Navigate to the Datacapsule directory and install the required dependencies:
cd Datacapsule
pip install -r requirements.txt
Configure Environment Variables
Create an .env file in the directory and configure it based on the .env.example template. Key configurations include LLM configurations, system environment settings, vector retrieval parameters, and embedding model settings.
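Once the .env file has been loaded into the process environment (python-dotenv's load_dotenv() does this), the settings can be validated in one place. The sketch below shows one way to do that with the standard library only; the variable names LLM_MODEL, LLM_API_KEY, EMBEDDING_MODEL, and RETRIEVAL_TOP_K are hypothetical, and the real keys are the ones listed in .env.example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    llm_model: str
    llm_api_key: str
    embedding_model: str
    retrieval_top_k: int


# Hypothetical variable names; check .env.example for the real ones.
REQUIRED_KEYS = ("LLM_MODEL", "LLM_API_KEY", "EMBEDDING_MODEL")


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping and fail early on missing values."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError("missing configuration: " + ", ".join(missing))
    top_k = int(env.get("RETRIEVAL_TOP_K", "1"))
    if top_k < 1:
        raise ValueError("RETRIEVAL_TOP_K must be at least 1")
    return Settings(
        llm_model=env["LLM_MODEL"],
        llm_api_key=env["LLM_API_KEY"],
        embedding_model=env["EMBEDDING_MODEL"],
        retrieval_top_k=top_k,
    )
```

Accepting any mapping rather than reading os.environ directly makes the function easy to exercise with a plain dict.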
Run the Service
Start the backend service with:
cd Datacapsule
python app.py
For detailed steps on starting the frontend service, visit the Datacapsule-webui repository.
Data Processing
Datacapsule supports two data processing methods: using the built-in sample data or supplying custom data. For custom data, use tools/entity_extraction.py to extract graph data and tools/entity_extraction_db.py to store the structured data.
Example Queries
[Screenshot: the Datacapsule interface after a successful launch]
When the entity is not in the graph, the system automatically switches to vector retrieval. For example, when top_k is set to 1, only the most similar result is returned.
For entities within the graph, Datacapsule can handle various queries:
- Entity Query: “What is the Taiwan hagfish?”
- Relationship Query: “What is the relationship between Taiwan hagfish and Slime eel?”
- Attribute Query: “What are the living habits of Slime eel?”
- Statistical Query: “How many species are there in the hagfish family?”
You can click the link on the homepage to access knowledge graph information.
DSPy Intent Understanding Mechanism
Zero-Shot Understanding Capability
The DSPy framework uses the ReAct mode, which lets large models understand user intent without any task-specific training.
Tool Selection Mechanism
In dspy_inference.py, the ReAct module automatically parses the signature and docstring of each tool function and turns them into implicit prompts that guide the model in selecting the appropriate tool.
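The general idea of turning signatures and docstrings into prompt text can be illustrated with Python's inspect module. This is not DSPy's internal implementation, only a sketch of the principle; the two tool functions and the prompt wording are hypothetical.

```python
import inspect
from typing import Callable, List


def vector_search(query: str, top_k: int = 1) -> list:
    """Return the top_k passages most similar to the query."""
    return []


def graph_lookup(entity: str) -> dict:
    """Return the graph neighbours of a known entity."""
    return {}


def describe_tool(fn: Callable) -> str:
    """Build a one-line description from a function's signature and docstring."""
    signature = inspect.signature(fn)
    doc = inspect.getdoc(fn) or "No description."
    return f"{fn.__name__}{signature}: {doc}"


def build_tool_prompt(tools: List[Callable]) -> str:
    """Assemble the tool list that would be shown to the model."""
    lines = ["You can call the following tools:"]
    lines.extend(f"- {describe_tool(tool)}" for tool in tools)
    return "\n".join(lines)


prompt = build_tool_prompt([vector_search, graph_lookup])
```

The practical consequence is that clear parameter names and precise docstrings directly affect how well the model picks a tool, since they are the only description it sees.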
DSPy Optimization Principles and Effects
Optimization Technique Essence
DSPy optimization is, in essence, automated prompt engineering. The system collects user feedback data through the evaluator in dspy_evaluation.py, and the results of the optimization are stored as program files in the dspy_program directory.
Optimization Process
The optimization logic in app.py collects user questions and feedback as optimization samples, uses BiologicalRetrievalEvaluation to assess reasoning quality, and applies multiple iterations to generate more precise thinking templates.
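The collect, evaluate, and iterate loop can be sketched in plain Python as follows. This is a toy stand-in: it is neither BiologicalRetrievalEvaluation nor a DSPy optimizer, the token-overlap metric is a deliberately simple assumption, and each "candidate" is modeled as a function from question to answer.

```python
from typing import Callable, Dict, List, Tuple

# Assumed sample shape: {"question": ..., "approved_answer": ...}
Sample = Dict[str, str]


def token_overlap(reference: str, prediction: str) -> float:
    """Fraction of reference tokens that also appear in the prediction."""
    ref_tokens = set(reference.lower().split())
    pred_tokens = set(prediction.lower().split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & pred_tokens) / len(ref_tokens)


def average_score(answer_fn: Callable[[str], str], samples: List[Sample]) -> float:
    """Mean metric value of one candidate over all feedback samples."""
    if not samples:
        return 0.0
    total = sum(token_overlap(s["approved_answer"], answer_fn(s["question"])) for s in samples)
    return total / len(samples)


def pick_best(candidates: Dict[str, Callable[[str], str]], samples: List[Sample]) -> Tuple[str, float]:
    """Score every candidate and return the name and score of the best one."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    scored = {name: average_score(fn, samples) for name, fn in candidates.items()}
    best_name = max(scored, key=scored.get)
    return best_name, scored[best_name]
```

A real metric would judge reasoning quality rather than word overlap, but the selection logic keeps the same shape: score each candidate on user-approved samples and keep the winner.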
Optimization Effects
After optimization, improvements are seen in intent understanding, tool selection, reasoning patterns, and domain-specific understanding.
Data Source Replacement and Scenario Adaptability
Built-in Data Source Replacement
Datacapsule includes two built-in sample datasets, demo18.json and demo130.json. You can overwrite them with your own files:
# Replace the small test dataset
cp your_small_dataset.json docs/demo18.json
# Replace the full dataset
cp your_full_dataset.json docs/demo130.json
Custom Data Introduction
To introduce your own domain data, you need to make comprehensive adjustments:
- Prepare JSON-formatted data.
- Extract entities and build graphs with tools/entity_extraction.py.
- Create a relational database with tools/entity_extraction_db.py.
- Adjust various DSPy components.
Data Scenario Adaptability
Datacapsule is best suited for scenarios with clear answers, highly structured data, and professional vertical domains. For scenarios requiring non-quantitative evaluation, reasoning, and multi-source heterogeneous data, custom evaluation metrics are needed.
System Limitations and Improvement Directions
Limitations of the Current Intent Recognition Module
The intent recognition module of Datacapsule has certain limitations, such as limited streaming output support, challenges in quantifying optimization effects, and insufficient architectural flexibility.
Complex Query Processing Capability
Datacapsule supports multi-condition filtered statistical queries, but the precision depends on the granularity of structured data fields.
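The dependence on field granularity is easy to see in a small example. The sketch below runs a multi-condition count against an in-memory SQLite table; the species table, its columns, and the placeholder rows are assumptions for illustration, not the project's schema or data.

```python
import sqlite3
from typing import Dict

# Only columns that actually exist in the table can be filtered on.
ALLOWED_COLUMNS = {"family", "habitat"}


def count_species(conn: sqlite3.Connection, filters: Dict[str, str]) -> int:
    """Count rows matching every column = value condition in filters."""
    unknown = set(filters) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError("no structured field for: " + ", ".join(sorted(unknown)))
    sql = "SELECT COUNT(*) FROM species"
    if filters:
        # Column names come from the allow-list; values are bound as parameters.
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in filters)
    return conn.execute(sql, list(filters.values())).fetchone()[0]


# Placeholder rows; real data would come from the structured extraction step.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE species (name TEXT, family TEXT, habitat TEXT)")
conn.executemany(
    "INSERT INTO species VALUES (?, ?, ?)",
    [
        ("species_a", "family_x", "deep_sea"),
        ("species_b", "family_x", "coastal"),
        ("species_c", "family_y", "deep_sea"),
    ],
)
two_filter_count = count_species(conn, {"family": "family_x", "habitat": "deep_sea"})
```

A question that filters on something never extracted as a column, such as body length in this toy schema, cannot be answered precisely until that field is added to the structured data.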
Response Efficiency Improvement Strategies
To improve response efficiency, consider deploying a high-performance inference framework locally, or benchmark several service providers for performance and cost before deployment.
Knowledge Graph Management and Display
Graph Database and Visualization Optimization
Datacapsule currently uses a lightweight graph database implementation. Future plans include integrating professional graph databases and developing an admin console to optimize storage structures for large-scale graph processing.
Knowledge Graph Display Optimization
Currently, Datacapsule offers basic HTML display. Future updates will incorporate professional graph visualization libraries to enable adaptive layouts and interactive features.
Reasoning Process Display Explanation
Datacapsule intentionally displays detailed reasoning processes to help developers and users understand system decision paths. In production environments, detailed reasoning processes can be hidden, while development environments can retain them for debugging and optimization.
Next Steps: From Solution to End-to-End Product
Currently, Datacapsule is essentially a technical solution. However, future plans aim to transform it into an end-to-end product.
Product Development Path
The core shift will be from code modification to configuration-driven operations. Planned features include a visual configuration interface, modular design, low-code/no-code interfaces, and automated workflows.
Datacapsule Product Vision
Datacapsule aims to lower the barrier to building enterprise knowledge bases, help enterprises form a closed-loop knowledge moat, and unleash the potential of large models in vertical domains. It is suitable for enterprise-specific knowledge management, professional domain Q&A, and industry knowledge graph construction.
Conclusion
Datacapsule combines knowledge graph construction, multi-path retrieval, and feedback-driven DSPy optimization in a single, flexible architecture. As it evolves from a technical solution into a configuration-driven product, it should become easier for both enterprises and individuals to adopt.