Datacapsule: Revolutionizing Knowledge Graph Retrieval with Multi-Path Technology

高效码农

9 months ago

Datacapsule: A Multi-Path Retrieval Solution Based on Knowledge Graphs

In an era of information overload, finding useful information in vast amounts of data is a common challenge. Datacapsule, a multi-path retrieval solution built on knowledge graphs, offers one approach to this problem.

What is Datacapsule?

Datacapsule is a solution that uses multi-path retrieval technology to achieve precise knowledge retrieval. It covers various functional modules such as retrieval systems, entity relation extraction, entity attribute extraction, entity linking, structured database construction, and question-answering systems.

Core Advantages of Datacapsule

Compared to traditional knowledge graph construction and retrieval methods, Datacapsule has the following advantages:

Efficient graph construction: With optimized algorithms and models, Datacapsule can quickly build knowledge graphs.
Accurate retrieval: Based on multi-path retrieval technology, Datacapsule can flexibly select retrieval strategies according to the type of user question.
Multi-round dialogue support: By carrying context across dialogue turns, Datacapsule can better grasp user needs and give more coherent, accurate answers.

Main Functions of Datacapsule

Knowledge Graph and Structured Database Construction

Datacapsule uses DSPy for intent recognition and entity extraction, and builds the graph from the extracted entities and relations. It then converts the graph into structured records and stores them in a database.
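
To make the graph-to-database step concrete, here is a minimal sketch using only Python's standard sqlite3 module. The triple format, the table name relations, and the sample values are illustrative assumptions, not Datacapsule's actual schema or data.

```python
import sqlite3

# Hypothetical extraction output: (head entity, relation, tail entity) triples.
# The values are illustrative only.
triples = [
    ("Taiwan hagfish", "belongs_to", "hagfish family"),
    ("Slime eel", "belongs_to", "hagfish family"),
    ("Taiwan hagfish", "lives_in", "deep sea"),
]

def store_triples(conn, rows):
    """Write graph edges into a flat table so they can be queried with SQL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS relations (head TEXT, relation TEXT, tail TEXT)"
    )
    conn.executemany("INSERT INTO relations VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database for the example
store_triples(conn, triples)

# Everything the table knows about one entity, as (relation, tail) pairs.
facts = conn.execute(
    "SELECT relation, tail FROM relations WHERE head = ? ORDER BY relation",
    ("Taiwan hagfish",),
).fetchall()
```

Once the edges are rows, counting and filtering questions can be answered with plain SQL rather than graph traversal.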

Knowledge Graph Storage and Management

Using NetworkX, Datacapsule implements knowledge graph storage and management, supporting dynamic construction and querying of entity relationships.

Vector Database Retrieval

Datacapsule integrates a lightweight vector database based on NanoVector, enabling efficient semantic similarity retrieval.
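
The idea behind semantic similarity retrieval can be sketched in a few lines of pure Python. This is not NanoVector's API; the function names, the toy three-dimensional vectors, and the in-memory dictionary store are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, store, k=1):
    """Return the k document ids most similar to query_vec, best first."""
    ranked = sorted(
        store, key=lambda doc_id: cosine(query_vec, store[doc_id]), reverse=True
    )
    return ranked[:k]

# Toy embeddings; real ones come from an embedding model.
store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
best = top_k([0.9, 0.1, 0.0], store, k=1)
```

With k=1 only the single closest document is returned; raising k widens the recall at the cost of more context passed to the model.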

Multi-Path Retrieval Method Based on Graphs

This is the core function of Datacapsule. It combines a reasoning system based on Chain of Thought and supports multi-round dialogue context understanding, forming a complete reasoning and querying system.

When a user initiates a query, the system first determines whether the entity in the question exists in the knowledge graph. If not, it directly uses vector retrieval to obtain answers. If it does, the system further determines the question type (entity query, relationship query, attribute query, or statistical query) and adopts corresponding retrieval strategies.
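
The routing described above can be sketched as follows. The strategy names and the mapping from question type to strategy are hypothetical; in particular, sending statistical queries to Text2SQL is an assumption based on the architecture overview, not confirmed project code.

```python
from enum import Enum

class QuestionType(Enum):
    ENTITY = "entity"
    RELATIONSHIP = "relationship"
    ATTRIBUTE = "attribute"
    STATISTICAL = "statistical"

# Hypothetical mapping from question type to retrieval strategy name.
STRATEGIES = {
    QuestionType.ENTITY: "graph_entity_lookup",
    QuestionType.RELATIONSHIP: "graph_path_search",
    QuestionType.ATTRIBUTE: "graph_attribute_lookup",
    QuestionType.STATISTICAL: "text2sql",
}

def route(entities, question_type, graph_nodes):
    """Pick a retrieval strategy for a question.

    entities: entity names found in the question
    question_type: a QuestionType, classified upstream (for example by the LLM)
    graph_nodes: set of entity names present in the knowledge graph
    """
    if not any(e in graph_nodes for e in entities):
        return "vector_retrieval"  # fallback when no entity is in the graph
    return STRATEGIES[question_type]

graph_nodes = {"Taiwan hagfish", "Slime eel"}
choice = route(["Taiwan hagfish", "Slime eel"], QuestionType.RELATIONSHIP, graph_nodes)
```

Keeping the mapping in a dictionary means a new question type only needs a new entry, not a new branch.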

Real-Time Communication and Status Synchronization

Datacapsule uses WebSocket to push messages in real time, supporting streaming dialogue responses and live feedback on optimizer status.

Model Optimizer

Datacapsule supports model optimization based on user feedback, with version management and rollback capabilities, and provides a visualization of the optimization process.
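
Version management with rollback can be pictured as a small history of saved states. The class below is a hypothetical sketch; VersionStore and its methods are not names from the Datacapsule codebase, and the stored state is assumed to be a plain dictionary.

```python
class VersionStore:
    """Keeps every saved program state and lets the caller step back."""

    def __init__(self):
        self._versions = []   # saved states, oldest first
        self._current = -1    # index of the active version

    def save(self, state):
        # Saving after a rollback discards the versions that were rolled back.
        self._versions = self._versions[: self._current + 1]
        self._versions.append(dict(state))
        self._current = len(self._versions) - 1
        return self._current

    def rollback(self):
        if self._current <= 0:
            raise ValueError("no earlier version to roll back to")
        self._current -= 1
        return self.active()

    def active(self):
        # Return a copy so callers cannot mutate the stored history.
        return dict(self._versions[self._current])

store = VersionStore()
store.save({"prompt": "v1"})
store.save({"prompt": "v2"})
restored = store.rollback()
```

If an optimization round makes answers worse, rollback() restores the previous state without retraining.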

Database Management System

Datacapsule uses SQLite to store user interaction records, supports batch processing of vector data, and has data version control capabilities.
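
As a rough sketch of what storing interaction records might look like, the following uses Python's built-in sqlite3 module. The table name, the columns, and the feedback encoding are assumptions for illustration, not the project's real schema.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS interactions (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               question TEXT NOT NULL,
               answer TEXT NOT NULL,
               feedback INTEGER,          -- e.g. 1 = helpful, 0 = not helpful
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn

def log_interaction(conn, question, answer, feedback=None):
    """Insert one question/answer pair and return its row id."""
    cur = conn.execute(
        "INSERT INTO interactions (question, answer, feedback) VALUES (?, ?, ?)",
        (question, answer, feedback),
    )
    conn.commit()
    return cur.lastrowid

conn = open_store()
row_id = log_interaction(conn, "What is the Taiwan hagfish?", "(answer text)", feedback=1)
```

Rows with a feedback value can later be selected as samples for the optimizer.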

Front-End Interaction Interface

The front-end interface of Datacapsule is built with React 18 + Vite and offers a real-time dialogue window, user question collection, a display of the reasoning process, and a view of optimization progress.

System Monitoring and Logging

Based on loguru, Datacapsule provides a hierarchical logging system for performance monitoring, error tracking, and API call statistics.

Environment Configuration Management

Datacapsule supports multiple LLM model configurations, flexible environment variable management, and multi-environment deployment.

Technical Framework and System Architecture

Front-End Technology Stack

Development Languages: JavaScript + TypeScript
Front-End Framework: React 18
UI Framework: TailwindCSS
Build Tool: Vite
Real-Time Communication: WebSocket Client

Back-End Technology Stack

Development Language: Python (Version 3.8+ recommended)
Web Framework: FastAPI
Databases:
- Structured Data: SQLite
- Vector Database: NanoVector
- Graph Structure Storage: NetworkX
Knowledge Extraction:
- Entity & Relationship Extraction: DSPy + CoT
AI Models:
- Embedding Models: Various configurations supported
- Large Language Models: Supports OpenAI/DeepSeek, etc.
Development Tools:
- Dependency Management: pip
- Environment Management: python-dotenv
- Logging System: loguru

System Architecture

Front-Back End Separation Architecture
WebSocket Real-Time Communication
Hybrid Recall of Vector Retrieval + Graph Retrieval + Text2SQL
DSPy Intent Understanding and Reasoning
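
Hybrid recall, as listed above, means asking several retrieval paths the same question and merging what they return. The sketch below shows one simple way to do that; the function name, the stub retrievers, and the first-seen merge order are assumptions, and the real system may rank or weight results differently.

```python
def hybrid_recall(question, retrievers, limit=5):
    """Query each retriever in turn and merge results, dropping duplicates.

    retrievers: mapping of path name -> callable(question) returning text snippets
    Returns (snippet, path_name) pairs in first-seen order.
    """
    seen = set()
    merged = []
    for name, retrieve in retrievers.items():
        for snippet in retrieve(question):
            if snippet not in seen:
                seen.add(snippet)
                merged.append((snippet, name))
    return merged[:limit]

# Stub retrievers standing in for the vector, graph, and Text2SQL paths.
retrievers = {
    "vector": lambda q: ["passage 1", "passage 2"],
    "graph": lambda q: ["passage 2", "edge fact"],
    "text2sql": lambda q: ["count = 3"],
}
results = hybrid_recall("example question", retrievers)
```

Tagging each snippet with the path that produced it keeps the reasoning display traceable.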

Getting Started with Datacapsule

Clone the Repository

First, clone the Datacapsule backend service:

git clone https://github.com/loukie7/Datacapsule.git

Then clone the frontend service from the Datacapsule-webui repository.

Install Dependencies

Navigate to the Datacapsule directory and install the required dependencies:

cd Datacapsule
pip install -r requirements.txt

Configure Environment Variables

Create a .env file in the project directory and fill it in based on the .env.example template. Key settings include the LLM configuration, system environment settings, vector retrieval parameters, and embedding model settings.
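
For readers unfamiliar with the format, the sketch below shows how KEY=VALUE settings are typically read. The project itself lists python-dotenv for this; the stand-in parser here uses only the standard library, and the key names (LLM_MODEL, LLM_API_KEY, TOP_K) are hypothetical. The real names are in .env.example.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings

# Hypothetical keys for illustration only.
sample = """
# LLM configuration
LLM_MODEL=example-model
LLM_API_KEY="replace-me"
# Vector retrieval
TOP_K=1
"""
config = parse_env(sample)
top_k_value = int(config["TOP_K"])  # values arrive as strings, so convert
```

Every value is read as a string, so numeric settings such as a top-k limit need an explicit conversion.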

Run the Service

Start the backend service with:

cd Datacapsule
python app.py

For detailed steps on starting the frontend service, visit the Datacapsule-webui repository.

Data Processing

Datacapsule can work with two kinds of data: the built-in sample data and your own custom data. For custom data, use tools/entity_extraction.py to extract graph data and tools/entity_extraction_db.py to store the structured data.

Example Queries

Once the service has started successfully, you can try queries in the interface.

When the entity is not in the graph, the system automatically switches to vector retrieval. For example, when top_k is set to 1, only the most similar result is returned.

For entities within the graph, Datacapsule can handle various queries:

Entity Query: “What is the Taiwan hagfish?”
Relationship Query: “What is the relationship between Taiwan hagfish and Slime eel?”
Attribute Query: “What are the living habits of Slime eel?”
Statistical Query: “How many species are there in the hagfish family?”

You can click the link on the homepage to access knowledge graph information.

DSPy Intent Understanding Mechanism

Zero-Shot Understanding Capability

The DSPy framework uses the ReAct pattern, which lets a large language model infer user intent zero-shot, that is, without task-specific training or labeled examples.

Tool Selection Mechanism

In dspy_inference.py, the ReAct module automatically parses the signatures and docstrings of each tool function to generate implicit prompts, guiding the model in selecting the appropriate tools.
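
To illustrate the general idea of turning function signatures and docstrings into prompt text, here is a sketch using Python's standard inspect module. It is not DSPy's internal implementation, and the two tool functions are hypothetical stand-ins for the tools defined in dspy_inference.py.

```python
import inspect

def vector_search(query: str, top_k: int = 1) -> list:
    """Find passages semantically similar to the query."""
    return []

def graph_lookup(entity: str) -> dict:
    """Return the graph neighbours of a named entity."""
    return {}

def describe_tools(tools):
    """Render each tool's name, signature, and docstring as one prompt line."""
    lines = []
    for fn in tools:
        doc = inspect.getdoc(fn) or "No description."
        lines.append(f"{fn.__name__}{inspect.signature(fn)}: {doc}")
    return "\n".join(lines)

prompt_fragment = describe_tools([vector_search, graph_lookup])
```

Because the model only sees this text, a clear docstring and well-named parameters directly affect which tool it picks.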

DSPy Optimization Principles and Effects

Optimization Technique Essence

DSPy optimization is, in essence, automated prompt engineering. The system collects user feedback data through the evaluator in dspy_evaluation.py, and the results of optimization are stored in program files within the dspy_program directory.

Optimization Process

The optimization logic in app.py collects user questions and feedback as optimization samples, uses BiologicalRetrievalEvaluation to assess reasoning quality, and applies multiple iterations to generate more precise thinking templates.

Optimization Effects

After optimization, improvements are seen in intent understanding, tool selection, reasoning patterns, and domain-specific understanding.

Data Source Replacement and Scenario Adaptability

Built-in Data Source Replacement

Datacapsule includes two built-in sample datasets, demo18.json and demo130.json. You can replace them with:

# Replace the small test dataset
cp your_small_dataset.json docs/demo18.json

# Replace the full dataset
cp your_full_dataset.json docs/demo130.json

Custom Data Introduction

To introduce your own domain data, you need to make comprehensive adjustments:

Prepare JSON-formatted data.
Extract entities and build graphs with tools/entity_extraction.py.
Create a relational database with tools/entity_extraction_db.py.
Adjust various DSPy components.

Data Scenario Adaptability

Datacapsule is best suited for scenarios with clear answers, highly structured data, and professional vertical domains. For scenarios requiring non-quantitative evaluation, reasoning, and multi-source heterogeneous data, custom evaluation metrics are needed.

System Limitations and Improvement Directions

Limitations of the Current Intent Recognition Module

The intent recognition module of Datacapsule has certain limitations, such as limited streaming output support, challenges in quantifying optimization effects, and insufficient architectural flexibility.

Complex Query Processing Capability

Datacapsule supports multi-condition filtered statistical queries, but the precision depends on the granularity of structured data fields.
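
The dependence on field granularity is easiest to see with a small example. The sketch below uses sqlite3 with a made-up species table and placeholder rows; the column names and the count_species helper are assumptions, not part of Datacapsule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE species (name TEXT, family TEXT, habitat TEXT, max_length_cm REAL)"
)
# Placeholder rows; not real biological data.
conn.executemany(
    "INSERT INTO species VALUES (?, ?, ?, ?)",
    [
        ("species_a", "family_x", "deep sea", 40.0),
        ("species_b", "family_x", "deep sea", 75.0),
        ("species_c", "family_x", "shallow water", 60.0),
        ("species_d", "family_y", "deep sea", 90.0),
    ],
)

def count_species(conn, family=None, habitat=None, min_length_cm=None):
    """Count rows matching every condition that was supplied."""
    clauses, params = [], []
    if family is not None:
        clauses.append("family = ?")
        params.append(family)
    if habitat is not None:
        clauses.append("habitat = ?")
        params.append(habitat)
    if min_length_cm is not None:
        clauses.append("max_length_cm >= ?")
        params.append(min_length_cm)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    return conn.execute("SELECT COUNT(*) FROM species" + where, params).fetchone()[0]

matches = count_species(conn, family="family_x", habitat="deep sea", min_length_cm=50)
```

Each filter works only because the value lives in its own column. If habitat were buried in a free-text description, the same question could not be answered precisely with SQL.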

Response Efficiency Improvement Strategies

To improve response efficiency, consider deploying a high-performance inference framework locally, or test several service providers for performance and cost before deployment.

Knowledge Graph Management and Display

Graph Database and Visualization Optimization

Datacapsule currently uses a lightweight graph database implementation. Future plans include integrating professional graph databases and developing an admin console to optimize storage structures for large-scale graph processing.

Knowledge Graph Display Optimization

Currently, Datacapsule offers basic HTML display. Future updates will incorporate professional graph visualization libraries to enable adaptive layouts and interactive features.

Reasoning Process Display Explanation

Datacapsule intentionally displays detailed reasoning processes to help developers and users understand system decision paths. In production environments, detailed reasoning processes can be hidden, while development environments can retain them for debugging and optimization.

Next Steps: From Solution to End-to-End Product

Datacapsule is currently a technical solution rather than a finished product. The plan is to turn it into an end-to-end product.

Product Development Path

The core shift will be from code modification to configuration-driven operations. Planned features include a visual configuration interface, modular design, low-code/no-code interfaces, and automated workflows.

Datacapsule Product Vision

Datacapsule aims to lower the barrier to building enterprise knowledge bases, give each enterprise a closed-loop knowledge moat of its own, and unlock the potential of large models in vertical domains. It is suited to enterprise-specific knowledge management, professional domain Q&A, and industry knowledge graph construction.

Conclusion

Datacapsule combines knowledge graph construction with multi-path retrieval in a flexible architecture, which makes it useful to both enterprises and individuals. As it moves from a technical solution toward a configurable product, it should become easier to adopt.