Learning to Edit Interactive Machine Learning Notebooks: A Practical Guide

An in-depth exploration of how interactive notebooks evolve and how language models can learn to edit them efficiently.

[Image: Jupyter Notebook]

In the machine learning world, Jupyter Notebooks have become essential tools. They allow developers and researchers to document experiments, analyze data, and visualize results all in one place. But as notebooks grow in size and complexity, editing them becomes more time-consuming and error-prone. What if models could automatically learn how to edit notebooks as developers do?

This blog post explores the research behind “Learning to Edit Interactive Machine Learning Notebooks.” It introduces a unique dataset of notebook edits, explains how models were trained to understand and edit notebooks, and walks through replicating the study with open-source tools. The content is based entirely on the original research documentation, with no external additions, and aims to be informative, practical, and accessible to readers with a college-level technical background.


Introduction: Why Editing Jupyter Notebooks Matters

Notebooks are not just passive records of code—they evolve as ideas evolve. Developers insert, delete, or modify code cells, adjust markdown explanations, and update visualizations. Over time, this editing behavior becomes an important signal of how machine learning workflows develop.

However, there’s a problem: current code generation models are typically trained on static code. They don’t understand how code evolves. This research bridges that gap by collecting real notebook edits and teaching language models to predict those changes. The outcome is a new direction in intelligent code editing and software maintenance.


The Dataset: Real Edits from Thousands of ML Notebooks

The research team collected notebook edit histories from 792 public GitHub repositories focused on machine learning. Using Git version histories, they captured how notebooks were edited over time. The result is a dataset containing:

  • 48,398 notebook edit samples
  • 20,095 commit-based revisions
  • File-level and cell-level representations
  • Filtered commit messages for quality training data

These edits cover a wide range of actions, including code additions, markdown changes, bug fixes, and improvements to machine learning pipelines.

Dataset Files

  1. commits.jsonl: Contains raw commit information, including:

    • Repository name
    • Commit hash
    • Commit message
    • Notebook path
    • Pre-edit and post-edit code

  2. commits-filtered.jsonl: A cleaned version excluding commits with messages under three characters (to remove noise).

  3. Train/test splits:

    • train_index.txt
    • val_index.txt
    • test_index.txt

The dataset is openly available on Zenodo (DOI: 10.5281/zenodo.15716537).
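
To get a feel for the records, you can stream commits.jsonl line by line, since each line is one JSON object. The sketch below is illustrative only: the field names (repo, message, old, new) are assumptions, so check the dataset's actual schema before relying on them.

import json

# Minimal sketch: iterate over commits.jsonl, one JSON record per line.
# Field names here (repo, message, old, new) are assumed for illustration;
# the dataset's real keys may differ.
with open("commits.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Each record pairs a commit message with the notebook content
        # before ("old") and after ("new") the edit.
        print(record.get("repo"), record.get("message", "")[:60])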


Step-by-Step: How the Dataset Was Built

1. Fetching ML Repositories

The script python_fetch.py uses the GitHub API to identify the top 1,000 Python repositories tagged for machine learning. It outputs:


top_1000_python_repos.json

This file lists repositories in descending order of popularity.
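
The script itself is part of the project, but the core idea can be sketched with the public GitHub search API. The query, pagination, and output format below are assumptions for illustration, not the actual contents of python_fetch.py:

import json
import requests

# Sketch: page through GitHub's repository search, most-starred first.
# The search API returns at most 1,000 results (10 pages of 100).
repos = []
for page in range(1, 11):
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={
            "q": "language:python topic:machine-learning",
            "sort": "stars",
            "order": "desc",
            "per_page": 100,
            "page": page,
        },
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    repos.extend(item["full_name"] for item in resp.json()["items"])

with open("top_1000_python_repos.json", "w", encoding="utf-8") as f:
    json.dump(repos, f, indent=2)

An authenticated request (an Authorization header with a personal access token) raises the API rate limit considerably, which matters when you immediately move on to cloning hundreds of repositories.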

2. Cloning and Extracting Edits

Using change_stat.py, the system clones each repository and extracts notebook commit history. For every commit that includes a notebook file, the script computes diffs between versions.

Each record includes:

  • Repo name
  • Commit ID
  • Commit message
  • Notebook file path
  • Content before and after the commit

Output:


commits.jsonl
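
The exact extraction logic lives in change_stat.py; the sketch below only illustrates the idea using plain git commands, and the record keys it emits are assumptions rather than the dataset's real schema:

import subprocess

def git(repo_dir, *args):
    # Thin wrapper around the git CLI; returns stdout as text.
    return subprocess.run(["git", "-C", repo_dir, *args],
                          capture_output=True, text=True).stdout

def notebook_edits(repo_dir):
    # For every commit that touches a notebook, capture the file's content
    # before and after the commit (simplified: ignores renames and filenames
    # containing whitespace).
    for commit in git(repo_dir, "log", "--pretty=%H", "--", "*.ipynb").split():
        message = git(repo_dir, "log", "-1", "--pretty=%s", commit).strip()
        changed = git(repo_dir, "show", "--name-only", "--pretty=format:", commit).split()
        for path in (p for p in changed if p.endswith(".ipynb")):
            yield {
                "repo": repo_dir,
                "commit": commit,
                "message": message,
                "path": path,
                "old": git(repo_dir, "show", f"{commit}~1:{path}"),  # empty for the first commit
                "new": git(repo_dir, "show", f"{commit}:{path}"),
            }

Writing each yielded record as one JSON line reproduces the commits.jsonl layout described above.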

3. Filtering and Splitting

To ensure data quality, the split.py script removes low-quality edits (e.g., very short messages) and splits the dataset into training, validation, and testing subsets.


train_index.txt
val_index.txt
test_index.txt

You can then regenerate the cleaned dataset using:


python split.py

This produces a de-duplicated, balanced dataset ready for model training.
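
Conceptually, the filtering and splitting step looks like the sketch below. The 80/10/10 ratios and the random seed are illustrative assumptions; split.py defines the actual procedure:

import json
import random

# Drop commits whose messages are shorter than three characters, then write
# one index file per split (indices refer to lines in commits.jsonl).
with open("commits.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

kept = [i for i, r in enumerate(records) if len(r.get("message", "").strip()) >= 3]

random.seed(0)
random.shuffle(kept)
n = len(kept)
splits = {
    "train_index.txt": kept[: int(0.8 * n)],
    "val_index.txt": kept[int(0.8 * n): int(0.9 * n)],
    "test_index.txt": kept[int(0.9 * n):],
}
for name, indices in splits.items():
    with open(name, "w") as out:
        out.write("\n".join(map(str, indices)) + "\n")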


Training Models to Edit Notebooks

The core goal is to teach language models how to perform edits on Jupyter Notebooks given a description of the change (commit message) and the original notebook.

Finetuning with LoRA

The training script train_baseline.py allows developers to finetune large language models using LoRA (Low-Rank Adaptation), a memory-efficient technique for adapting large models to specific tasks. A configuration file such as lora_config.json (used below) specifies the base model, the LoRA rank and alpha, and standard training hyperparameters:

{
  "tokenizer.path": "deepseek-ai/deepseek-coder-6.7b-instruct",
  "model.path": "deepseek-ai/deepseek-coder-6.7b-instruct",
  "model.rank": 8,
  "alpha": 16,
  "learning_rate": 3e-4,
  "training.epochs": 3,
  "per_device_train_batch_size": 8,
  "gradient_accumulation_steps": 2
}

Run training with:

python train_baseline.py --config lora_config.json

After training, the model is saved in a structured directory (e.g., 1.3b_file/final) and ready for evaluation.
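
For readers who want to see what the LoRA setup amounts to in code, here is a condensed sketch using the Hugging Face transformers and peft libraries. It mirrors the hyperparameters from the config above, but the dropout value and the omitted data pipeline are assumptions and may differ from what train_baseline.py actually does:

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = "deepseek-ai/deepseek-coder-6.7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=8,                # "model.rank" in the config above
    lora_alpha=16,      # "alpha"
    lora_dropout=0.05,  # assumed; not part of the config shown above
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the low-rank adapters are trainable

args = TrainingArguments(
    output_dir="lora_out",
    learning_rate=3e-4,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
)
# A standard transformers Trainer over the tokenized edit samples completes the
# loop; the saved adapters are what end up in a directory like 1.3b_file/final.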


Inference Modes: How to Generate Notebook Edits

To test the model, you can run inference in multiple modes:

  • Zero-shot: The model receives only the commit message and the original notebook.
  • One-shot: The model also sees one example of a similar edit.
  • Five-shot: The model sees five examples for context.

Example: File-level Edit Generation

from baseline import generate_code_change_whole_file

# tokenizer, model, and sample are assumed to be loaded beforehand
# (e.g., the finetuned checkpoint from 1.3b_file/final and one record
# from the test split).
result = generate_code_change_whole_file(
    tokenizer=tokenizer,
    model=model,
    commit_message=sample.message,   # the commit message describing the edit
    original_code=sample.old,        # notebook content before the edit
    cell_diff=sample.cell_diff,
)

You can switch between file-level and cell-level modes by changing parameters in the script.
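
For the one-shot and five-shot modes described above, the prompt interleaves example edits with the target request. A minimal prompt-assembly sketch follows; the actual template used by baseline.py may be formatted differently:

def build_prompt(examples, target_message, target_code):
    # Each in-context example pairs a commit message and the pre-edit
    # notebook with its post-edit version; the target edit is left open
    # for the model to complete. Pass an empty list for zero-shot.
    parts = []
    for ex in examples:
        parts.append(
            f"Commit message: {ex['message']}\n"
            f"Original notebook:\n{ex['old']}\n"
            f"Edited notebook:\n{ex['new']}\n"
        )
    parts.append(
        f"Commit message: {target_message}\n"
        f"Original notebook:\n{target_code}\n"
        f"Edited notebook:\n"
    )
    return "\n".join(parts)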


Evaluation Metrics: How Good Are the Edits?

The following metrics are used to evaluate model performance:

  • BLEU & CodeBLEU: Measure token-level overlap between the generated and reference code.
  • EditSim: Quantifies how closely the generated edit matches the actual developer changes.
  • ROUGE-L: Measures longest common subsequences between output and reference.

Run evaluations with:

python accuracy.py \
  --output_folder model/results/whole_file_zero_shot_1.3b \
  --expected_folder model/results/expected_whole_file

python finetune_score.py --tpe whole

The results are saved in .pkl files for easy analysis.
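
To make the metrics concrete, here is a toy computation on a single generated/reference pair. BLEU comes from the sacrebleu package, and edit similarity is approximated with difflib's ratio; the study's EditSim and CodeBLEU implementations may be computed differently:

import difflib
import sacrebleu

def bleu(pred: str, ref: str) -> float:
    # Sentence-level BLEU between the generated and reference code.
    return sacrebleu.sentence_bleu(pred, [ref]).score

def edit_sim(pred: str, ref: str) -> float:
    # Character-level similarity as a rough stand-in for EditSim.
    return difflib.SequenceMatcher(None, pred, ref).ratio()

generated = "df = df.dropna()\nmodel.fit(X, y)"
reference = "df = df.dropna()\nmodel.fit(X_train, y_train)"
print(f"BLEU: {bleu(generated, reference):.1f}")
print(f"EditSim: {edit_sim(generated, reference):.2f}")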


Dataset Statistics

You can analyze the dataset and model using the following scripts:

  • model_stat.py: Reports model input/output lengths and token distributions.
  • dataset_size_stat.py: Shows the number of samples per dataset split.

These reports help verify that training and evaluation are running as expected.
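
A quick sanity check along the same lines, assuming the index files produced by split.py are in the working directory, is to count the samples per split:

# Count how many sample indices each split file contains.
for split in ("train_index.txt", "val_index.txt", "test_index.txt"):
    with open(split) as f:
        n = sum(1 for line in f if line.strip())
    print(f"{split}: {n} samples")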


Lessons Learned

Here are some of the key insights from the study:

1. Commit Messages Matter

Commit messages provide critical context. Edits accompanied by clear messages allow models to learn more precise patterns. Vague or uninformative messages lower edit accuracy.

2. Global Context is Key

Notebook edits often depend on cells outside the immediate scope of the change. Including the full notebook or related cells improves model performance.

3. Training Quality Over Quantity

Although the dataset is large, filtering low-quality edits improves model outcomes more than just increasing data size.

4. One-shot and Few-shot Learning are Promising

Providing examples of edits (1 or 5 shots) significantly improves model performance, especially when the examples are semantically similar to the target edit.


Practical Applications

This research opens the door to multiple real-world use cases:

  • Automated notebook cleanup: Remove obsolete cells or deprecated code.
  • ML pipeline refactoring: Reorganize code for readability and efficiency.
  • Error correction: Detect and fix common mistakes in experimental workflows.
  • Notebook summarization: Generate changelogs from commit messages and code diffs.

As notebooks become more central to collaborative ML workflows, these tools will play a major role in improving productivity and maintainability.


Reproducing the Study

To replicate this work, follow these steps:

  1. Clone the project repository.
  2. Fetch top ML repositories using python_fetch.py.
  3. Generate the dataset with change_stat.py.
  4. Split and clean the data using split.py.
  5. Finetune the model with train_baseline.py.
  6. Run inference and evaluation using baseline.py and accuracy.py.

The process is fully automated, with each step generating output logs and reports.
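
Put together as commands, the pipeline looks roughly like the sequence below; some scripts may expect additional arguments (for example, a GitHub API token) that are not shown in this post:

python python_fetch.py
python change_stat.py
python split.py
python train_baseline.py --config lora_config.json
python accuracy.py \
  --output_folder model/results/whole_file_zero_shot_1.3b \
  --expected_folder model/results/expected_whole_file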


Conclusion

“Learning to Edit Interactive Machine Learning Notebooks” introduces a novel approach to modeling notebook evolution. By collecting real-world edit data and training specialized models, it sets a foundation for the future of intelligent notebook assistants.

If you’re working with machine learning notebooks, this dataset and toolkit provide a compelling opportunity to automate mundane editing tasks and learn more about how your workflows evolve.

Dataset available at: https://doi.org/10.5281/zenodo.15716537