AI Coding Assistant Training Data Extraction Toolkit: A Complete Collection Solution from Conversations to Code
In machine learning model training, high-quality conversational data and code interaction records are the cornerstones of improving model performance. Whether you’re training a custom code assistant or analyzing how AI coding tools are used, you need complete, structured raw data. The toolkit we’re covering today is designed to solve this exact need—it automatically extracts all conversation, agent operation, and code context data from mainstream AI coding assistants, providing a solid data foundation for model training.
I. What Can This Toolkit Do for You?
Simply put, this is a comprehensive extraction suite that automatically discovers and collects full interaction data from popular AI coding assistants. Whether you use Claude Code, Cursor, or another programming aid, it can “dig out” conversation histories hidden in the software’s storage and organize them into formats suitable for machine learning training.
Specifically, it can extract the following content:
- Complete conversation records: Every user query and AI response—nothing is missed
- Code context information: Involved file paths, specific line numbers, and code snippets
- Code diffs and modification suggestions: AI-proposed code changes, including detailed additions, deletions, and edits
- Multi-file association data: Preserves relationships when conversations involve multiple files
- Tool usage records: Various tools called by the AI and their execution results
- Timestamps and metadata: Creation time of each piece of information, the model used, and other supplementary details
If you’re preparing to train a code-related machine learning model or need to analyze AI coding assistant interaction patterns, this data will prove invaluable.
II. What Extraction Scripts Are Included in the Toolkit?
The toolkit provides specialized extraction scripts for different AI coding assistants. Each script is optimized to accurately identify the target tool’s data storage locations and formats.
1. extract_claude_code.py: Extract Data from Claude Code
If you use Claude Code or Claude Desktop, this script will locate and extract your data. It automatically searches these locations:
- ~/.claude
- ~/.claude-code
- ~/.claude-local
- ~/.claude-m2
- ~/.claude-zai
Claude stores data primarily in JSONL session files. This script extracts message content, tool usage records, file context, and code differences from these files.
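To give a feel for the format, here is a minimal sketch of reading one such session file. The project and session names in the path are hypothetical placeholders, and the type/message field names are illustrative, so inspect your own files first:

import json
from pathlib import Path

# Hypothetical session file path; substitute a real [project]/[session].jsonl
session = Path.home() / ".claude" / "projects" / "my-project" / "session.jsonl"

with open(session) as f:
    for line in f:
        event = json.loads(line)
        # Event structure is illustrative; real files use per-line type markers
        if event.get("type") in ("user", "assistant"):
            print(event["type"], str(event.get("message", ""))[:80])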
2. extract_codex.py: Extraction Tool for Codex
If Codex is installed on your device, running this script will extract its data. It mainly searches:
- ~/.codex
- ~/.codex-local
Codex uses Rollout JSONL files for data storage. The script extracts user-agent messages, tool execution results, and code differences from these files.
3. extract_cursor.py: Full Support for All Cursor Versions
Cursor is a favorite among many developers, and this script offers the most comprehensive support for it, handling data formats from older versions to the latest releases. It searches these locations:
- macOS: ~/Library/Application Support/Cursor
- Other systems: the corresponding application support directories
Cursor’s data is stored primarily in SQLite databases (state.vscdb files), with chat and Composer/Agent records kept in the ItemTable and cursorDiskKV tables. The script works with all Cursor modes:
- Legacy Chat mode (workspace storage)
- Composer inline storage (v1.x – messages in the composerData array)
- Composer separate storage (v1.x–v2.0 transition – messages in bubbleId keys)
- Latest Composer/Agent (v2.0+)
It extracts code context, selections, diffs, suggested edits, code blocks, and tool execution results/outputs.
4. extract_trae.py: Extract Interaction Data from Trae
For Trae, this script searches:
- ~/.trae
- ~/Library/Application Support/Trae
Trae uses a flexible storage format combining JSONL files and SQLite databases. The script handles both formats uniformly, extracting chat records, agent data, tool usage, and code differences.
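As a rough illustration of what handling both formats uniformly can look like, here is a hedged sketch that dispatches on file extension; the table name ItemTable is the VSCode-style default and may differ in practice:

import json
import sqlite3
from pathlib import Path

def extract_any(path: Path):
    """Dispatch on storage format: JSONL lines vs. SQLite rows."""
    if path.suffix == ".jsonl":
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]
    if path.suffix in (".db", ".vscdb"):
        # Open read-only so a running tool's lock doesn't block us
        conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
        try:
            return conn.execute("SELECT key, value FROM ItemTable").fetchall()
        finally:
            conn.close()
    return []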
5. extract_windsurf.py: Process Windsurf Data
Windsurf’s extraction script searches its application support directory (e.g., ~/Library/Application Support/Windsurf on macOS). Its data is stored in VSCode-like SQLite databases, and the script extracts chat records, agent/flow conversations, and code context.
6. extract_continue.py: For Continue AI Assistant
If you use Continue AI Assistant, this script searches the ~/.continue/sessions/ directory, where JSON-format session files are stored. It extracts user-assistant messages, tool calls and results, reasoning blocks, context items, and workspace information.
III. How to Get Started with These Tools?
This toolkit requires no additional dependencies—it uses only Python 3’s standard library. However, you need to ensure you have Python 3.6 or higher installed.
Step 1: Check Your Python Environment
First, confirm you have a compatible Python version. Open a terminal or command prompt and enter:
python3 --version
If the displayed version is 3.6 or higher, you’re ready to proceed. If not, you’ll need to upgrade Python first.
Step 2: Run the Corresponding Extraction Script
Execute the script matching your AI coding assistant. For example:
- Extract data from Claude Code: python3 extract_claude_code.py
- Extract data from Cursor: python3 extract_cursor.py
- Extract data from Codex: python3 extract_codex.py
- Extract data from Trae: python3 extract_trae.py
- Extract data from Windsurf: python3 extract_windsurf.py
To extract data from all supported tools at once, run:
./extract_all.sh
Step 3: View the Extraction Results
All scripts create an extracted_data/ folder in the current directory. Extracted data is saved as timestamped JSONL files inside this folder. A typical directory structure looks like this:
extracted_data/
├── claude_code_conversations_20250116_143022.jsonl
├── cursor_complete_20250116_143045.jsonl
├── codex_conversations_20250116_143102.jsonl
├── trae_conversations_20250116_143115.jsonl
└── windsurf_conversations_20250116_143130.jsonl
Each filename includes the tool name and extraction time, making it easy to identify and manage.
IV. What Does the Extracted Data Look Like?
Extracted data uses JSONL format (one JSON object per line), which is ideal for handling large datasets while maintaining the independence of each conversation.
A typical conversation entry looks like this:
{
  "messages": [
    {
      "role": "user",
      "content": "How do I fix this TypeScript error?",
      "code_context": [
        {
          "file": "/Users/user/project/src/index.ts",
          "code": "const x: string = 123;",
          "range": {
            "selectionStartLineNumber": 10,
            "positionLineNumber": 10
          }
        }
      ],
      "timestamp": "2025-01-16T14:30:22.123Z"
    },
    {
      "role": "assistant",
      "content": "The error occurs because you're assigning a number to a string type...",
      "suggested_diffs": [...],
      "model": "claude-sonnet-4-5",
      "timestamp": "2025-01-16T14:30:25.456Z"
    }
  ],
  "source": "cursor-composer",
  "name": "TypeScript Type Error Fix",
  "created_at": 1705414222000
}
As you can see, each conversation contains a messages array, and each message has a clear role—either user (the human developer) or assistant (the AI). User messages may include code_context, which records the discussed code snippet, its file location, and line numbers. Assistant messages may include suggested_diffs (AI-proposed code modifications) and model (the AI model used).
The entire conversation also includes metadata like source (indicating which tool it came from), name (conversation title), and created_at (creation timestamp), facilitating subsequent filtering and analysis.
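To make the schema concrete, here is a small sketch that walks one extracted file and prints each conversation's title, source, and message roles, using the field names shown above (the filename comes from the earlier directory listing):

import json

with open("extracted_data/cursor_complete_20250116_143045.jsonl") as f:
    for line in f:
        conv = json.loads(line)
        print(conv.get("name"), "|", conv.get("source"))
        for msg in conv.get("messages", []):
            # code_context and suggested_diffs are optional per message
            print(f"  {msg['role']:<9} code_context={'code_context' in msg}")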
V. How Does the Toolkit Extract Data?
The scripts follow a six-step automated process to locate and extract data:
1. Detect the Operating System
First, the script identifies whether you’re using macOS, Linux, or Windows, as file storage locations vary by system.
2. Search Common Storage Locations
Based on the operating system, the script automatically searches these common application data directories:
- macOS: ~/Library/Application Support, ~/.config, and the user home directory
- Linux: ~/.config, ~/.local/share, and the user home directory
- Windows: %APPDATA%, %LOCALAPPDATA%, and the user home directory
3. Locate All Installations of the Target Tool
Within these directories, the script searches for the target AI coding assistant’s installation folders to ensure no potential installation locations are missed.
4. Scan Storage Files
Once the tool’s installation directory is found, the script scans for storage files, including:
- SQLite databases (e.g., .vscdb and .db files)
- JSONL session files
- Project-specific directories
5. Extract Complete Data
The script uses corresponding parsing methods for different file formats to extract full data, including conversation context and code differences.
6. Save as Structured JSONL
Finally, all extracted data is organized into a unified JSONL format, named with a timestamp, and saved to the extracted_data/ directory.
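For steps 1–3, here is a hedged sketch of what the discovery logic can look like; the search lists in the real scripts may differ:

import os
import platform
from pathlib import Path

def candidate_roots():
    """Steps 1-2: return OS-appropriate directories to search."""
    home = Path.home()
    system = platform.system()
    if system == "Darwin":
        return [home / "Library" / "Application Support", home / ".config", home]
    if system == "Windows":
        return [Path(os.environ.get("APPDATA", home)),
                Path(os.environ.get("LOCALAPPDATA", home)), home]
    return [home / ".config", home / ".local" / "share", home]

def find_installations(tool_dir_name):
    """Step 3: look for the tool's folder under each candidate root."""
    return [root / tool_dir_name for root in candidate_roots()
            if (root / tool_dir_name).is_dir()]

print(find_installations("Cursor"))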
How Do Storage Formats Differ Across Tools?
Different AI coding assistants store data in distinct ways. Understanding these differences helps you better grasp the extraction process:
- Claude Code and Codex: Primarily use JSONL files (one event per line). Files are typically located in paths like ~/.claude/projects/[project]/[session].jsonl and use an event-based structure with type markers.
- Cursor (v0.43 to v2.0+): Uses SQLite databases, with two main locations:
  - Workspace data: ~/Library/Application Support/Cursor/User/workspaceStorage/[hash]/state.vscdb
  - Global data: ~/Library/Application Support/Cursor/User/globalStorage/state.vscdb
  The database's ItemTable stores chat records, while cursorDiskKV stores Composer/Agent data. The storage structure has evolved with versions—from early Chat mode to later Composer inline storage, transitional separate storage, and the latest format in v2.0+. The script fully supports all these variations.
- Trae and Windsurf: Use hybrid formats (JSONL plus SQLite databases) with storage structures similar to VSCode extension data.
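As an illustration of the Cursor layout described above, here is a hedged sketch that queries the global database for Composer entries. The composerData: key prefix and the conversation field reflect layouts observed in some versions and can change between releases:

import json
import sqlite3
from pathlib import Path

db_path = (Path.home() / "Library" / "Application Support" / "Cursor"
           / "User" / "globalStorage" / "state.vscdb")

# Read-only URI connection avoids "database is locked" while Cursor runs
conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
rows = conn.execute(
    "SELECT key, value FROM cursorDiskKV WHERE key LIKE 'composerData:%'"
).fetchall()
conn.close()

for key, value in rows:
    data = json.loads(value)
    # The message array's field name varies by version; "conversation" is one layout
    print(key, "->", len(data.get("conversation", [])), "messages")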
VI. How to Understand the Extracted Data?
Extracted data contains rich information. Understanding what each field means helps you make the most of it.
What Are the Message Roles?
The role field in the data has two main values:
- user: A message sent by the human developer
- assistant: A response from the AI assistant
What Information Does Code Context Include?
Code-related fields help reconstruct the original development scenario:
- code_context: Selected file content, code snippets, and the line-number ranges discussed
- suggested_diffs: AI-proposed code modifications
- tool_use: Tools called by the AI (e.g., code execution, file operations)
- tool_results: Tool execution outputs and applied diffs
- diff_histories: Complete edit history records
What Are Metadata Used For?
Metadata helps filter and categorize data:
- source: Which tool the data came from (e.g., "cursor-composer", "claude-code")
- session_id/composer_id: Unique conversation identifier
- project_path: Working directory at the time of the conversation
- timestamp: Time the message was created
- model: AI model used (if recorded)
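As a quick example of putting these metadata fields to work, the following sketch tallies conversations by source and messages by model (it assumes the merged all_conversations.jsonl file produced in the next section):

import json
from collections import Counter

sources, models = Counter(), Counter()

with open("all_conversations.jsonl") as f:
    for line in f:
        conv = json.loads(line)
        sources[conv.get("source", "unknown")] += 1
        for msg in conv.get("messages", []):
            if "model" in msg:
                models[msg["model"]] += 1

print(sources.most_common())
print(models.most_common())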
VII. What Advanced Uses Are There?
Beyond basic extraction, you can further process the data to meet specific needs.
How to Merge All Extraction Results?
To combine data from multiple tools into one file, use these commands:
# Merge all JSONL files into one
cat extracted_data/*.jsonl > all_conversations.jsonl

# Count total conversations
wc -l all_conversations.jsonl

# Count conversations by source tool (tolerates an optional space after the colon)
grep -o '"source": *"[^"]*"' all_conversations.jsonl | sort | uniq -c
How to Filter Conversations by Date?
To extract only conversations from a specific time period, use this Python code:
import json

with open('extracted_data/cursor_complete_20250116.jsonl') as f:
    for line in f:
        conv = json.loads(line)
        # Get the creation timestamp (default to 0 if missing)
        created = conv.get('created_at', 0)
        # Keep conversations after January 1, 2024 (ms timestamp: 1704067200000)
        if created > 1704067200000:
            print(json.dumps(conv))
How to Extract Only Conversations with Code Diffs?
If your training focus is on code modifications, use this code to filter relevant conversations:
import json

with open('extracted_data/cursor_complete.jsonl') as f:
    for line in f:
        conv = json.loads(line)
        # Check whether any message contains suggested diffs or diff histories
        if any('suggested_diffs' in m or 'diff_histories' in m
               for m in conv['messages']):
            print(json.dumps(conv))
VIII. What Is the Quality of the Extracted Data?
Understanding data quality helps you assess whether it meets your needs.
What Content Can Be Fully Extracted?
- Complete conversations: User queries and AI responses, with full context preserved for multi-turn dialogues
- Code context: File paths and names, selected code snippets, line-number ranges, and multi-file relationships
- Diffs and edits: AI-proposed code changes, applied diffs, complete edit histories, and file modifications
- Metadata: Timestamps, project paths, model information, and conversation titles
What Content Might Be Missing?
- Partial data: User messages without AI responses, deleted or archived sessions, and corrupted database entries may not be extractable
- Privacy-sensitive content: Data may contain proprietary code, API keys, and personal file paths, so it requires post-processing
IX. What Privacy and Security Considerations Are There When Using Extracted Data?
Privacy and security are critical when handling extracted data, especially if it contains sensitive information.
1. Scan for Secret Information
Before using the data, scan for sensitive information like API keys and passwords using this tool:
# Install the detection tool first
pip install detect-secrets
# Scan all extracted files
detect-secrets scan extracted_data/*.jsonl
2. Review Sensitive Data
After scanning, conduct a manual review:
- Check for API keys, passwords, tokens, and other credentials
- Ensure no proprietary code is exposed
- Anonymize personal information in file paths if needed (a sketch follows this list)
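A minimal sketch of that anonymization step, assuming Unix-style home paths (the regex only covers /Users/... and /home/... prefixes; extend it for Windows paths):

import json
import re

# Replace /Users/<name> or /home/<name> with an anonymous placeholder
HOME_RE = re.compile(r"/(Users|home)/[^/\s\"]+")

def anonymize(conv):
    text = json.dumps(conv)
    return json.loads(HOME_RE.sub(r"/\1/anonymous", text))

with open("all_conversations.jsonl") as fin, \
     open("all_conversations_anon.jsonl", "w") as fout:
    for line in fin:
        fout.write(json.dumps(anonymize(json.loads(line))) + "\n")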
3. Store Data Securely
- Save data on encrypted drives
- Do not commit data to public code repositories
- Encrypt backups for added security
X. What Training Scenarios Can the Extracted Data Be Used For?
The primary use of this data is training machine learning models—especially code-related AI assistants.
Direct Use for Fine-Tuning
You can load the data for fine-tuning using Hugging Face’s datasets library:
from datasets import load_dataset

# Load all extracted data
dataset = load_dataset(
    'json',
    data_files='extracted_data/*.jsonl',
    split='train'
)

# Keep only complete conversations that include an AI response
dataset = dataset.filter(
    lambda x: any(m['role'] == 'assistant' for m in x['messages'])
)
Use with Unsloth
Unsloth is an efficient model training library. Combine it with extracted data for fast model training:
from unsloth import FastLanguageModel

# Load the base model
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/qwen2.5-coder-7b-instruct",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Format conversation data to match the model's chat template
def format_chat(example):
    return {
        'text': tokenizer.apply_chat_template(
            example['messages'],
            tokenize=False
        )
    }

# Apply the formatting function to the dataset
dataset = dataset.map(format_chat)
XI. What Issues Might Arise During Use, and How to Fix Them?
You may encounter common issues while using these tools. Here are their solutions:
Issue 1: Script Says “No Installations Found”
If the script can’t locate the tool’s installation:
- Confirm the AI coding assistant is actually installed
- Manually check whether the tool is installed in a non-default path
- Add a custom path to the script's find_XXX_installations() function:
# Add this line in the function to specify your tool's installation path
locations.append(Path("/custom/path/to/tool"))
Issue 2: Extraction Completes But extracted_data Is Empty
This usually means there’s no historical data to extract:
- Confirm you've used the tool and have chat history
- Check whether data is stored in a non-standard location
- Manually search for database files:
# Search for all possible database files on macOS/Linux
find ~ -name "*.vscdb" -o -name "*.db" 2>/dev/null
Issue 3: “Database Locked” Error
SQLite databases lock when the tool is in use:
- Close the AI tool before extraction
- Connect to the database in read-only mode:
# Modify the database connection code in the script
conn = sqlite3.connect(f'file:{db_path}?mode=ro', uri=True)
Issue 4: “Permission Denied” Error
This occurs when the script lacks file read permissions:
- Run the script with appropriate permissions (e.g., add sudo before the command if needed)
- Check file ownership to ensure the current user has read access
- Copy database files to an accessible directory before extraction (see the sketch below)
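A hedged sketch of that copy-first workaround (the Cursor path is just an example, and table names vary by tool):

import shutil
import sqlite3
import tempfile
from pathlib import Path

src = (Path.home() / "Library" / "Application Support" / "Cursor"
       / "User" / "globalStorage" / "state.vscdb")

# Copy the locked/protected database to a temp dir, then open the copy
tmp = Path(tempfile.mkdtemp()) / src.name
shutil.copy2(src, tmp)

conn = sqlite3.connect(tmp)
print(conn.execute("SELECT count(*) FROM ItemTable").fetchone())
conn.close()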
XII. What Notes Apply to Different Operating Systems?
Usage varies slightly across operating systems. Being aware of these differences prevents common issues.
macOS
- Most tools store data in ~/Library/Application Support
- Access to certain system directories may require "Full Disk Access" permission
- SQLite databases are typically located in ~/Library/Application Support/[Tool Name]/User/
Linux
- Data is mainly stored in ~/.config and ~/.local/share
- Some tools may use ~/.local/state
- Tools may respect $XDG_CONFIG_HOME if it is set
Windows
- Common paths are %APPDATA% and %LOCALAPPDATA%, corresponding to C:\Users\[Username]\AppData\Roaming\[Tool Name]
- Accessing files in "Program Files" may require administrator privileges
XIII. Which Versions of AI Coding Assistants Are Supported?
AI tools may change their data storage formats between versions. Here’s the toolkit’s compatibility:
Cursor
- ✅ v2 (0.43+): Supports Composer/Agent data stored in cursorDiskKV
- ✅ v1: Supports chat records in the workspace ItemTable
- ⚠️ Versions older than v0.43: Different format with limited support
Claude Code
- ✅ All versions using JSONL session files
- ✅ Project-based structure support
Codex
- ✅ Rollout JSONL format support
- ✅ Time-organized session structure support
XIV. What Tips Help Handle Large Datasets?
When working with large volumes of extracted data, these tips improve efficiency:
Split Large Files
Divide large JSONL files into smaller chunks for easier processing:
# Split the file into chunks of 1000 lines each, with "chunk_" as the prefix
split -l 1000 all_conversations.jsonl chunk_
Compress for Storage
Compress data files to save space:
# Compress all extracted JSONL files
gzip extracted_data/*.jsonl
Speed Optimization
Use multiprocessing to accelerate processing of large numbers of database files:
from multiprocessing import Pool

# extract_from_db processes a single database file; db_files lists the files
with Pool() as pool:
    results = pool.map(extract_from_db, db_files)
XV. How to Contribute to This Toolkit?
If you discover a new AI coding assistant or an updated storage format, contributions are welcome:
- Follow the existing script structure
- Add auto-discovery logic for the new tool
- Ensure complete data extraction (messages + context + diffs)
- Output data in organized JSONL format
- Update this documentation
XVI. License and Disclaimer
This toolkit is available under the MIT License—you can freely use it for training machine learning models.
However, note the following: This toolkit extracts YOUR OWN data from locally installed AI tools. You are responsible for:
- Ensuring you have the right to extract and use the data
- Properly handling sensitive or proprietary information
- Complying with each tool's Terms of Service
- Scanning for secrets before sharing the data or using it for training
Frequently Asked Questions (FAQ)
Does this toolkit require additional dependencies?
No, it uses only Python 3’s standard library. You just need Python 3.6 or higher installed.
Can the extracted data be used for commercial model training?
This depends on the content of the extracted data and the Terms of Service of the relevant AI tool. You must ensure you have the right to use the data and do not infringe on any third-party rights.
Why are there no code diffs in the extracted conversations?
There are two possible reasons: either the conversation didn’t involve code modifications, or the tool version doesn’t record code diffs. Check if the tool version is in the supported list or manually verify the original storage files.
Can I extract data from multiple tools at the same time?
Yes—run the ./extract_all.sh script to extract data from all supported tools in one go.
Are the file paths in the extracted data real?
Yes, the data retains the original file path information. If privacy protection is needed, anonymize these paths before use.
Will the toolkit modify the original data?
No—all extraction operations are read-only. The toolkit will not modify the original storage files of the AI tool, so you can use it with confidence.
This toolkit enables you to systematically collect and organize interaction data from AI coding assistants, providing high-quality materials for model training. Whether for research or practical applications, this structured data helps you achieve your goals more efficiently.

