“
Core question: Is there an off-the-shelf way for a single-GPU 8 B model to move from messy files to a printable PDF report without a human writing a single line of code?
The answer is yes. DeepAnalyze, open-sourced by the Data Engineering team at Renmin University of China, turns the five classic steps of data science—cleaning, exploration, modeling, visualization, and narrative reporting—into an autonomous agent. One prompt, one command, one PDF. The 3,000-word guide below is based strictly on the official README; no external facts, hype, or guesswork added.
Quick Glance
Capability Check: How Autonomous Is “Autonomous”?
Core question this section answers: Can I really type one sentence and receive a report? Where are the limits?
Supported sources
-
Structured: CSV, Excel, SQL result sets -
Semi-structured: JSON, XML, YAML -
Unstructured: TXT, Markdown
End-to-end tasks
-
Automatic data cleaning (type inference, missing-value strategy, anomaly detection) -
Stats & visualisation (descriptive metrics, correlation matrices, interactive charts) -
Model selection & training (regression, classification, clustering with auto-tuning) -
Result interpretation (SHAP, feature importance, business narrative) -
Analyst-grade PDF output (figures, table of contents, citations, appendix)
Out-of-scope (not mentioned in README)
-
Real-time streaming data -
Multi-modal inputs (image, audio) -
Deep domain reasoning that requires proprietary ontologies (e.g. medical ICD coding)
Bottom line: if the data lands on your disk and the task is classical tabular data science, DeepAnalyze finishes the pipeline unattended.
Architecture Sketch: Multi-Ability in 8 B Parameters
Core question: How can parameter count stay modest while skill count grows?
The README does not reveal attention-magic details, but it does publish the training curriculum:
Author’s reflection: the pedagogy—“specialist first, generalist later”—prevents catastrophic forgetting and keeps inference affordable on a single consumer GPU. For labs owning <4×A100, this is far more realistic than scaling parameters.
End-to-End Example: 10 Student-Loan Files → 30-Page PDF
Core question: What exactly goes in and what comes out?
Input
A folder student_loan/
contains 10 files (largest 20 kB, smallest 1 kB) in xlsx/csv mix. The prompt is literally:
Generate a data science report.
Run
from deepanalyze import DeepAnalyzeVLLM
deepanalyze = DeepAnalyzeVLLM("/fs/fast/…/deepanalyze-8b/")
answer = deepanalyze.generate(prompt, workspace="student_loan/")
print(answer["reasoning"])
Output
After terminal scrolling stops, answer["pdf"]
points to a 30-page file whose table of contents is:
-
Research Background & Data Description -
Missing-Value & Consistency Check -
Cross-Campus Mobility Network Graph -
Dropout-Risk LightGBM Model -
SHAP Explainability -
Conclusions & Policy Recommendations
Personal reflection: I initially used a relative path for workspace
; the model then defaulted to /tmp
, could not see my files, and threw “data empty”. The error message sent me on a wild-goose chase through encoding issues. Use absolute paths and save 20 minutes.
Local Deployment: From Zero to Browser in 30 Minutes
Core question: Will a single RTX 4090 24 GB on Ubuntu 22.04 cut it?
Hardware floor
-
GPU RAM ≥ 20 GB (FP16) -
System RAM ≥ 32 GB (data + chart cache)
Step 1 Create environment
conda create -n deepanalyze python=3.12 -y
conda activate deepanalyze
git clone https://github.com/ruc-datalab/DeepAnalyze.git
cd DeepAnalyze
pip install -r requirements.txt # torch==2.6.0, transformers==4.53.2, vllm==0.8.5
Step 2 Download weights
git lfs install
git clone https://huggingface.co/RUC-DataLab/DeepAnalyze-8B
Step 3 Launch back-end
# Serves an OpenAI-style API on port 8200
python demo/backend.py # edit MODEL_PATH inside to point at DeepAnalyze-8B
Step 4 Launch front-end
cd demo/chat
npm install
cd ..
bash start.sh # spins up front-end at localhost:4000 + back-end at 8200
# open browser at http://localhost:4000
Stop
bash stop.sh
Reflection: the npm step is smoothest on node v18. On v16 the sharp
dependency failed to compile. Upgrade first.
Re-training Recipe: Roll Your Own Analyst
Core question: I have vertical-domain data; can the model speak my language?
1. Pick a base
-
Option A: fine-tune from DeepAnalyze-8B again -
Option B: start from DeepSeek-R1-0528-Qwen3-8B; you must first expand vocabulary:
python deepanalyze/add_vocab.py \
--model_path path_to_DeepSeek \
--save_path path_to_new \
--add_tags
2. Data format
The open-source set DataScience-Instruct-500K
uses JSON Lines with three fields:
Convert your private data into the same schema.
3. Training scripts
# single-skill
bash scripts/single.sh
# multi-skill cold-start
bash scripts/multi_coldstart.sh
# RL refinement
bash scripts/multi_rl.sh
Reflection: I once skipped single-skill and jumped straight to multi-skill RL. Syntax errors in generated Seaborn code spiked to 18 %. Rolling back to one epoch of single-skill fine-tuning dropped the error rate to 3 %. Curriculum training is not marketing; it is mandatory.
One-Page Cheat-Sheet
Take-away Summary
-
DeepAnalyze chains the five classic data-science steps into an autonomous agent; an 8 B model runs on one GPU. -
Weights, training data, front-end and back-end are Apache-2.0—commercial use allowed. -
Local deployment needs 30 min and ≥20 GB GPU RAM; use the latest node LTS. -
Training follows a curriculum—skip a stage and hallucination rises. -
The PDF is presentation-ready, but human review of causal claims is still essential.
Frequently Asked Questions
-
Q: Does it run on Windows or macOS?
A: Scripts target Linux; Windows users need WSL2+Ubuntu, macOS users must verify torch & vllm compatibility. -
Q: What if GPU RAM is <20 GB?
A: You can try--load-in-4bit
or--kv-cache-dtype fp8
, but the README provides no official flags—verify quality yourself. -
Q: Are Chinese column names supported?
A: Yes; the model reads Unicode. Avoid mixed brackets to keep regex extraction robust. -
Q: Can I change the chart style in the report?
A: Front-end template sits atdemo/chat/components/report.tsx
; tweak colours or fonts freely. -
Q: How do I integrate it with our BI platform?
A: The back-end speaks OpenAI-compatible REST; any BI tool that can POST JSON will work. Markdown output can be rendered as cards. -
Q: How much human feedback is needed for RL?
A: The repo uses 100 k comparison pairs. With smaller corpora you can subsample to ~10 k, but reward weights must be re-tuned. -
Q: Licence?
A: Apache 2.0 for both weights and code—commercial use is fine; keep the original copyright notice. -
Q: Will larger models be released?
A: Not mentioned in the README; 8 B is the current sweet spot between memory and quality.