AI Coding Assistant Training Data Extraction Toolkit: A Complete Collection Solution from Conversations to Code
In machine learning model training, high-quality conversational data and code interaction records are the cornerstones of improving model performance. Whether you’re training a custom code assistant or analyzing how AI coding tools are used, you need complete, structured raw data. The toolkit we’re covering today is designed to solve this exact need—it automatically extracts all conversation, agent operation, and code context data from mainstream AI coding assistants, providing a solid data foundation for model training.
I. What Can This Toolkit Do for You?
Simply put, this is a comprehensive extraction suite that automatically discovers and collects full interaction data from popular AI coding assistants. Whether you use Claude Code, Cursor, or another programming aid, it can “dig out” conversation histories hidden in the software’s storage and organize them into formats suitable for machine learning training.
Specifically, it can extract the following content:
- Complete conversation records: Every user query and AI response—nothing is missed
- Code context information: Involved file paths, specific line numbers, and code snippets
- Code diffs and modification suggestions: AI-proposed code changes, including detailed additions, deletions, and edits
- Multi-file association data: Preserves relationships when conversations involve multiple files
- Tool usage records: Various tools called by the AI and their execution results
- Timestamps and metadata: Creation time of each piece of information, the model used, and other supplementary details
If you’re preparing to train a code-related machine learning model or need to analyze AI coding assistant interaction patterns, this data will prove invaluable.
II. What Extraction Scripts Are Included in the Toolkit?
The toolkit provides specialized extraction scripts for different AI coding assistants. Each script is optimized to accurately identify the target tool’s data storage locations and formats.
1. extract_claude_code.py: Extract Data from Claude Code
If you use Claude Code or Claude Desktop, this script will locate and extract your data. It automatically searches these locations:
- ~/.claude
- ~/.claude-code
- ~/.claude-local
- ~/.claude-m2
- ~/.claude-zai
Claude stores data primarily in JSONL session files. This script extracts message content, tool usage records, file context, and code differences from these files.
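To give a feel for the format, here is a minimal sketch of reading one such session file. The project and session names in the path are hypothetical placeholders, and the type/message field names are illustrative, so inspect your own files first:

import json
from pathlib import Path

# Hypothetical session file path; substitute a real [project]/[session].jsonl
session = Path.home() / ".claude" / "projects" / "my-project" / "session.jsonl"

with open(session) as f:
    for line in f:
        event = json.loads(line)
        # Event structure is illustrative; real files use per-line type markers
        if event.get("type") in ("user", "assistant"):
            print(event["type"], str(event.get("message", ""))[:80])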
2. extract_codex.py: Extraction Tool for Codex
If Codex is installed on your device, running this script will extract its data. It mainly searches:
- ~/.codex
- ~/.codex-local
Codex uses Rollout JSONL files for data storage. The script extracts user-agent messages, tool execution results, and code differences from these files.
3. extract_cursor.py: Full Support for All Cursor Versions
Cursor is a favorite among many developers, and this script offers the most comprehensive support for it, handling data formats from older versions to the latest releases. It searches these locations:
- macOS: ~/Library/Application Support/Cursor
- Other systems: the corresponding application support directories
Cursor’s data is stored primarily in SQLite databases (state.vscdb files), with chat and Composer/Agent records kept in the ItemTable and cursorDiskKV tables. The script works with all Cursor modes:
- Legacy Chat mode (workspace storage)
- Composer inline storage (v1.x – messages in the composerData array)
- Composer separate storage (v1.x–v2.0 transition – messages in bubbleId keys)
- Latest Composer/Agent (v2.0+)
It extracts code context, selections, diffs, suggested edits, code blocks, and tool execution results/outputs.
4. extract_trae.py: Extract Interaction Data from Trae
For Trae, this script searches:
- ~/.trae
- ~/Library/Application Support/Trae
Trae uses a flexible storage format combining JSONL files and SQLite databases. The script handles both formats uniformly, extracting chat records, agent data, tool usage, and code differences.
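As a rough illustration of what handling both formats uniformly can look like, here is a hedged sketch that dispatches on file extension; the table name ItemTable is the VSCode-style default and may differ in practice:

import json
import sqlite3
from pathlib import Path

def extract_any(path: Path):
    """Dispatch on storage format: JSONL lines vs. SQLite rows."""
    if path.suffix == ".jsonl":
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]
    if path.suffix in (".db", ".vscdb"):
        # Open read-only so a running tool's lock doesn't block us
        conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
        try:
            return conn.execute("SELECT key, value FROM ItemTable").fetchall()
        finally:
            conn.close()
    return []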
5. extract_windsurf.py: Process Windsurf Data
Windsurf’s extraction script searches its application support directory (e.g., ~/Library/Application Support/Windsurf on macOS). Its data is stored in VSCode-like SQLite databases, and the script extracts chat records, agent/flow conversations, and code context.
6. extract_continue.py: For Continue AI Assistant
If you use Continue AI Assistant, this script searches the ~/.continue/sessions/ directory, where JSON-format session files are stored. It extracts user-assistant messages, tool calls and results, reasoning blocks, context items, and workspace information.
III. How to Get Started with These Tools?
This toolkit requires no additional dependencies—it uses only Python 3’s standard library. However, you need to ensure you have Python 3.6 or higher installed.
Step 1: Check Your Python Environment
First, confirm you have a compatible Python version. Open a terminal or command prompt and enter:
python3 --version
If the displayed version is 3.6 or higher, you’re ready to proceed. If not, you’ll need to upgrade Python first.
Step 2: Run the Corresponding Extraction Script
Execute the script matching your AI coding assistant. For example:
- Extract data from Claude Code: python3 extract_claude_code.py
- Extract data from Cursor: python3 extract_cursor.py
- Extract data from Codex: python3 extract_codex.py
- Extract data from Trae: python3 extract_trae.py
- Extract data from Windsurf: python3 extract_windsurf.py
To extract data from all supported tools at once, run:
./extract_all.sh
Step 3: View the Extraction Results
All scripts create an extracted_data/ folder in the current directory. Extracted data is saved as timestamped JSONL files inside this folder. A typical directory structure looks like this:
extracted_data/
├── claude_code_conversations_20250116_143022.jsonl
├── cursor_complete_20250116_143045.jsonl
├── codex_conversations_20250116_143102.jsonl
├── trae_conversations_20250116_143115.jsonl
└── windsurf_conversations_20250116_143130.jsonl
Each filename includes the tool name and extraction time, making it easy to identify and manage.
IV. What Does the Extracted Data Look Like?
Extracted data uses JSONL format (one JSON object per line), which is ideal for handling large datasets while maintaining the independence of each conversation.
A typical conversation entry looks like this:
{
  "messages": [
    {
      "role": "user",
      "content": "How do I fix this TypeScript error?",
      "code_context": [
        {
          "file": "/Users/user/project/src/index.ts",
          "code": "const x: string = 123;",
          "range": {
            "selectionStartLineNumber": 10,
            "positionLineNumber": 10
          }
        }
      ],
      "timestamp": "2025-01-16T14:30:22.123Z"
    },
    {
      "role": "assistant",
      "content": "The error occurs because you're assigning a number to a string type...",
      "suggested_diffs": [...],
      "model": "claude-sonnet-4-5",
      "timestamp": "2025-01-16T14:30:25.456Z"
    }
  ],
  "source": "cursor-composer",
  "name": "TypeScript Type Error Fix",
  "created_at": 1705414222000
}
As you can see, each conversation contains a messages array, and each message has a clear role—either user (the human developer) or assistant (the AI). User messages may include code_context, which records the discussed code snippet, its file location, and line numbers. Assistant messages may include suggested_diffs (AI-proposed code modifications) and model (the AI model used).
The entire conversation also includes metadata like source (indicating which tool it came from), name (conversation title), and created_at (creation timestamp), facilitating subsequent filtering and analysis.
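To make the schema concrete, here is a small sketch that walks one extracted file and prints each conversation's title, source, and message roles, using the field names shown above (the filename comes from the earlier directory listing):

import json

with open("extracted_data/cursor_complete_20250116_143045.jsonl") as f:
    for line in f:
        conv = json.loads(line)
        print(conv.get("name"), "|", conv.get("source"))
        for msg in conv.get("messages", []):
            # code_context and suggested_diffs are optional per message
            print(f"  {msg['role']:<9} code_context={'code_context' in msg}")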
V. How Does the Toolkit Extract Data?
The scripts follow a six-step automated process to locate and extract data:
1. Detect the Operating System
First, the script identifies whether you’re using macOS, Linux, or Windows, as file storage locations vary by system.
2. Search Common Storage Locations
Based on the operating system, the script automatically searches these common application data directories:
- macOS: ~/Library/Application Support, ~/.config, and the user home directory
- Linux: ~/.config, ~/.local/share, and the user home directory
- Windows: %APPDATA%, %LOCALAPPDATA%, and the user home directory
3. Locate All Installations of the Target Tool
Within these directories, the script searches for the target AI coding assistant’s installation folders to ensure no potential installation locations are missed.
4. Scan Storage Files
Once the tool’s installation directory is found, the script scans for storage files, including:
- SQLite databases (e.g., .vscdb and .db files)
- JSONL session files
- Project-specific directories
5. Extract Complete Data
The script uses corresponding parsing methods for different file formats to extract full data, including conversation context and code differences.
6. Save as Structured JSONL
Finally, all extracted data is organized into a unified JSONL format, named with a timestamp, and saved to the extracted_data/ directory.
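For steps 1–3, here is a hedged sketch of what the discovery logic can look like; the search lists in the real scripts may differ:

import os
import platform
from pathlib import Path

def candidate_roots():
    """Steps 1-2: return OS-appropriate directories to search."""
    home = Path.home()
    system = platform.system()
    if system == "Darwin":
        return [home / "Library" / "Application Support", home / ".config", home]
    if system == "Windows":
        return [Path(os.environ.get("APPDATA", home)),
                Path(os.environ.get("LOCALAPPDATA", home)), home]
    return [home / ".config", home / ".local" / "share", home]

def find_installations(tool_dir_name):
    """Step 3: look for the tool's folder under each candidate root."""
    return [root / tool_dir_name for root in candidate_roots()
            if (root / tool_dir_name).is_dir()]

print(find_installations("Cursor"))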
How Do Storage Formats Differ Across Tools?
Different AI coding assistants store data in distinct ways. Understanding these differences helps you better grasp the extraction process:
- Claude Code and Codex: Primarily use JSONL files (one event per line). Files are typically located in paths like ~/.claude/projects/[project]/[session].jsonl and use an event-based structure with type markers.
- Cursor (v0.43 to v2.0+): Uses SQLite databases, with two main locations:
  - Workspace data: ~/Library/Application Support/Cursor/User/workspaceStorage/[hash]/state.vscdb
  - Global data: ~/Library/Application Support/Cursor/User/globalStorage/state.vscdb
  The database's ItemTable stores chat records, while cursorDiskKV stores Composer/Agent data. The storage structure has evolved with versions—from early Chat mode to later Composer inline storage, transitional separate storage, and the latest format in v2.0+. The script fully supports all these variations.
- Trae and Windsurf: Use hybrid formats (JSONL plus SQLite databases) with storage structures similar to VSCode extension data.
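As an illustration of the Cursor layout described above, here is a hedged sketch that queries the global database for Composer entries. The composerData: key prefix and the conversation field reflect layouts observed in some versions and can change between releases:

import json
import sqlite3
from pathlib import Path

db_path = (Path.home() / "Library" / "Application Support" / "Cursor"
           / "User" / "globalStorage" / "state.vscdb")

# Read-only URI connection avoids "database is locked" while Cursor runs
conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
rows = conn.execute(
    "SELECT key, value FROM cursorDiskKV WHERE key LIKE 'composerData:%'"
).fetchall()
conn.close()

for key, value in rows:
    data = json.loads(value)
    # The message array's field name varies by version; "conversation" is one layout
    print(key, "->", len(data.get("conversation", [])), "messages")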
VI. How to Understand the Extracted Data?
Extracted data contains rich information. Understanding what each field means helps you make the most of it.
What Are the Message Roles?
The role field in the data has two main values:
- user: A message sent by the human developer
- assistant: A response from the AI assistant
What Information Does Code Context Include?
Code-related fields help reconstruct the original development scenario:
- code_context: Selected file content, code snippets, and the line-number ranges discussed
- suggested_diffs: AI-proposed code modifications
- tool_use: Tools called by the AI (e.g., code execution, file operations)
- tool_results: Tool execution outputs and applied diffs
- diff_histories: Complete edit history records
What Are Metadata Used For?
Metadata helps filter and categorize data:
- source: Which tool the data came from (e.g., "cursor-composer", "claude-code")
- session_id/composer_id: Unique conversation identifier
- project_path: Working directory at the time of the conversation
- timestamp: Time the message was created
- model: AI model used (if recorded)
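As a quick example of putting these metadata fields to work, the following sketch tallies conversations by source and messages by model (it assumes the merged all_conversations.jsonl file produced in the next section):

import json
from collections import Counter

sources, models = Counter(), Counter()

with open("all_conversations.jsonl") as f:
    for line in f:
        conv = json.loads(line)
        sources[conv.get("source", "unknown")] += 1
        for msg in conv.get("messages", []):
            if "model" in msg:
                models[msg["model"]] += 1

print(sources.most_common())
print(models.most_common())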
VII. What Advanced Uses Are There?
Beyond basic extraction, you can further process the data to meet specific needs.
How to Merge All Extraction Results?
To combine data from multiple tools into one file, use these commands:
# Merge all JSONL files into one
cat extracted_data/*.jsonl > all_conversations.jsonl

# Count total conversations
wc -l all_conversations.jsonl

# Count conversations by source tool (tolerates an optional space after the colon)
grep -o '"source": *"[^"]*"' all_conversations.jsonl | sort | uniq -c
How to Filter Conversations by Date?
To extract only conversations from a specific time period, use this Python code:
import json

with open('extracted_data/cursor_complete_20250116.jsonl') as f:
    for line in f:
        conv = json.loads(line)
        # Get the creation timestamp (default to 0 if missing)
        created = conv.get('created_at', 0)
        # Keep conversations after January 1, 2024 (ms timestamp: 1704067200000)
        if created > 1704067200000:
            print(json.dumps(conv))
How to Extract Only Conversations with Code Diffs?
If your training focus is on code modifications, use this code to filter relevant conversations:
import json

with open('extracted_data/cursor_complete.jsonl') as f:
    for line in f:
        conv = json.loads(line)
        # Check whether any message contains suggested diffs or diff histories
        if any('suggested_diffs' in m or 'diff_histories' in m
               for m in conv['messages']):
            print(json.dumps(conv))
VIII. What Is the Quality of the Extracted Data?
Understanding data quality helps you assess whether it meets your needs.
What Content Can Be Fully Extracted?
- Complete conversations: User queries and AI responses, with full context preserved for multi-turn dialogues
- Code context: File paths and names, selected code snippets, line-number ranges, and multi-file relationships
- Diffs and edits: AI-proposed code changes, applied diffs, complete edit histories, and file modifications
- Metadata: Timestamps, project paths, model information, and conversation titles
What Content Might Be Missing?
- Partial data: User messages without AI responses, deleted or archived sessions, and corrupted database entries may not be extractable
- Privacy-sensitive content: Data may contain proprietary code, API keys, and personal file paths, so it requires post-processing
IX. What Privacy and Security Considerations Are There When Using Extracted Data?
Privacy and security are critical when handling extracted data, especially if it contains sensitive information.
1. Scan for Secret Information
Before using the data, scan for sensitive information like API keys and passwords using this tool:
# Install the detection tool first
pip install detect-secrets
# Scan all extracted files
detect-secrets scan extracted_data/*.jsonl
2. Review Sensitive Data
After scanning, conduct a manual review:
- Check for API keys, passwords, tokens, and other credentials
- Ensure no proprietary code is exposed
- Anonymize personal information in file paths if needed (a sketch follows this list)
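A minimal sketch of that anonymization step, assuming Unix-style home paths (the regex only covers /Users/... and /home/... prefixes; extend it for Windows paths):

import json
import re

# Replace /Users/<name> or /home/<name> with an anonymous placeholder
HOME_RE = re.compile(r"/(Users|home)/[^/\s\"]+")

def anonymize(conv):
    text = json.dumps(conv)
    return json.loads(HOME_RE.sub(r"/\1/anonymous", text))

with open("all_conversations.jsonl") as fin, \
     open("all_conversations_anon.jsonl", "w") as fout:
    for line in fin:
        fout.write(json.dumps(anonymize(json.loads(line))) + "\n")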
3. Store Data Securely
- Save data on encrypted drives
- Do not commit data to public code repositories
- Encrypt backups for added security
X. What Training Scenarios Can the Extracted Data Be Used For?
The primary use of this data is training machine learning models—especially code-related AI assistants.
Direct Use for Fine-Tuning
You can load the data for fine-tuning using Hugging Face’s datasets library:
from datasets import load_dataset

# Load all extracted data
dataset = load_dataset(
    'json',
    data_files='extracted_data/*.jsonl',
    split='train'
)

# Keep only complete conversations that include an AI response
dataset = dataset.filter(
    lambda x: any(m['role'] == 'assistant' for m in x['messages'])
)
Use with Unsloth
Unsloth is an efficient model training library. Combine it with extracted data for fast model training:
from unsloth import FastLanguageModel

# Load the base model
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/qwen2.5-coder-7b-instruct",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Format conversation data to match the model's chat template
def format_chat(example):
    return {
        'text': tokenizer.apply_chat_template(
            example['messages'],
            tokenize=False
        )
    }

# Apply the formatting function to the dataset
dataset = dataset.map(format_chat)
XI. What Issues Might Arise During Use, and How to Fix Them?
You may encounter common issues while using these tools. Here are their solutions:
Issue 1: Script Says “No Installations Found”
If the script can’t locate the tool’s installation:
- Confirm the AI coding assistant is actually installed
- Manually check whether the tool is installed in a non-default path
- Add a custom path to the script's find_XXX_installations() function:
# Add this line in the function to specify your tool's installation path
locations.append(Path("/custom/path/to/tool"))
Issue 2: Extraction Completes But extracted_data Is Empty
This usually means there’s no historical data to extract:
- Confirm you've used the tool and have chat history
- Check whether data is stored in a non-standard location
- Manually search for database files:
# Search for all possible database files on macOS/Linux
find ~ -name "*.vscdb" -o -name "*.db" 2>/dev/null
Issue 3: “Database Locked” Error
SQLite databases lock when the tool is in use:
- Close the AI tool before extraction
- Connect to the database in read-only mode:
# Modify the database connection code in the script
conn = sqlite3.connect(f'file:{db_path}?mode=ro', uri=True)
Issue 4: “Permission Denied” Error
This occurs when the script lacks file read permissions:
- Run the script with appropriate permissions (e.g., add sudo before the command if needed)
- Check file ownership to ensure the current user has read access
- Copy database files to an accessible directory before extraction (see the sketch below)
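A hedged sketch of that copy-first workaround (the Cursor path is just an example, and table names vary by tool):

import shutil
import sqlite3
import tempfile
from pathlib import Path

src = (Path.home() / "Library" / "Application Support" / "Cursor"
       / "User" / "globalStorage" / "state.vscdb")

# Copy the locked/protected database to a temp dir, then open the copy
tmp = Path(tempfile.mkdtemp()) / src.name
shutil.copy2(src, tmp)

conn = sqlite3.connect(tmp)
print(conn.execute("SELECT count(*) FROM ItemTable").fetchone())
conn.close()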
XII. What Notes Apply to Different Operating Systems?
Usage varies slightly across operating systems. Being aware of these differences prevents common issues.
macOS
- Most tools store data in ~/Library/Application Support
- Access to certain system directories may require "Full Disk Access" permission
- SQLite databases are typically located in ~/Library/Application Support/[Tool Name]/User/
Linux
- Data is mainly stored in ~/.config and ~/.local/share
- Some tools may use ~/.local/state
- Tools may respect $XDG_CONFIG_HOME if it is set
Windows
- Common paths are %APPDATA% and %LOCALAPPDATA%, corresponding to C:\Users\[Username]\AppData\Roaming\[Tool Name]
- Accessing files in "Program Files" may require administrator privileges
XIII. Which Versions of AI Coding Assistants Are Supported?
AI tools may change their data storage formats between versions. Here’s the toolkit’s compatibility:
Cursor
- ✅ v2 (0.43+): Supports Composer/Agent data stored in cursorDiskKV
- ✅ v1: Supports chat records in the workspace ItemTable
- ⚠️ Versions older than v0.43: Different format with limited support
Claude Code
- ✅ All versions using JSONL session files
- ✅ Project-based structure support
Codex
- ✅ Rollout JSONL format support
- ✅ Time-organized session structure support
XIV. What Tips Help Handle Large Datasets?
When working with large volumes of extracted data, these tips improve efficiency:
Split Large Files
Divide large JSONL files into smaller chunks for easier processing:
# Split the file into chunks of 1000 lines each, with "chunk_" as the prefix
split -l 1000 all_conversations.jsonl chunk_
Compress for Storage
Compress data files to save space:
# Compress all extracted JSONL files
gzip extracted_data/*.jsonl
Speed Optimization
Use multiprocessing to accelerate processing of large numbers of database files:
from multiprocessing import Pool

# extract_from_db processes a single database file; db_files lists the files
with Pool() as pool:
    results = pool.map(extract_from_db, db_files)
XV. How to Contribute to This Toolkit?
If you discover a new AI coding assistant or an updated storage format, contributions are welcome:
- Follow the existing script structure
- Add auto-discovery logic for the new tool
- Ensure complete data extraction (messages + context + diffs)
- Output data in organized JSONL format
- Update this documentation
XVI. License and Disclaimer
This toolkit is available under the MIT License—you can freely use it for training machine learning models.
However, note the following: This toolkit extracts YOUR OWN data from locally installed AI tools. You are responsible for:
- Ensuring you have the right to extract and use the data
- Properly handling sensitive or proprietary information
- Complying with each tool's Terms of Service
- Scanning for secrets before sharing the data or using it for training
Frequently Asked Questions (FAQ)
Does this toolkit require additional dependencies?
No, it uses only Python 3’s standard library. You just need Python 3.6 or higher installed.
Can the extracted data be used for commercial model training?
This depends on the content of the extracted data and the Terms of Service of the relevant AI tool. You must ensure you have the right to use the data and do not infringe on any third-party rights.
Why are there no code diffs in the extracted conversations?
There are two possible reasons: either the conversation didn’t involve code modifications, or the tool version doesn’t record code diffs. Check if the tool version is in the supported list or manually verify the original storage files.
Can I extract data from multiple tools at the same time?
Yes—run the ./extract_all.sh script to extract data from all supported tools in one go.
Are the file paths in the extracted data real?
Yes, the data retains the original file path information. If privacy protection is needed, anonymize these paths before use.
Will the toolkit modify the original data?
No—all extraction operations are read-only. The toolkit will not modify the original storage files of the AI tool, so you can use it with confidence.
This toolkit enables you to systematically collect and organize interaction data from AI coding assistants, providing high-quality materials for model training. Whether for research or practical applications, this structured data helps you achieve your goals more efficiently.

