SketchGraphs: A Large-Scale Dataset for Relational Geometry in CAD

Central Question: What is SketchGraphs and why does it matter for CAD and machine learning research?

SketchGraphs is a dataset of 15 million CAD sketches extracted from real-world models. Each sketch is represented as a geometric constraint graph, where nodes are geometric primitives and edges represent designer-imposed constraints such as parallelism, tangency, or perpendicularity. The dataset is designed to support machine learning for design automation and geometric program induction, and it provides both raw and processed data formats for different use cases.

[Figure: SketchGraphs illustration]

This article explains what SketchGraphs contains, how to use it, the models it supports, and the research and application scenarios it enables. It also reflects on the importance of modeling not just geometry, but the logic of design constraints.


What is the structure of SketchGraphs?

Answer in brief: SketchGraphs is distributed in three data formats: raw JSON sketches, construction sequences, and filtered construction sequences with splits for machine learning training and evaluation.

Summary

The dataset is structured to accommodate different research needs: full fidelity raw data for advanced processing, compact construction sequences for modeling, and filtered subsets for efficient experimentation.

Data Formats

  1. Raw JSON Data (43GB)

    • Extracted from Onshape, delivered as 128 tar archives compressed with zstandard.
    • Contains complete entity identifiers and design details.
    • Best suited for advanced use cases that require maximum fidelity.
    • Processing scripts: sketchgraphs.pipeline.make_sketch_dataset and sketchgraphs.pipeline.make_sequence_dataset.

    Example scenario: A researcher building a custom parser for CAD data would choose the raw JSON to reconstruct every detail.

  2. Construction Sequence Dataset (15GB)

    • Stored as a single binary file (sg_all.npy).
    • More concise, eliminating redundant identifiers.
    • Directly supported by the SketchGraphs Python libraries.
    • Forms the baseline for machine learning models.

    Example scenario: A machine learning practitioner training a graph neural network to predict missing constraints would use this dataset.

  3. Filtered Construction Sequences

    • Simplified by removing sketches that are too large or too small.
    • Retains key entities and constraints.
    • Divided into train, validation, and test splits.
    • Used as the standard benchmark for training models.

    Example scenario: A developer prototyping new generative models for CAD would rely on the filtered dataset to ensure consistent evaluation.
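The filtering and splitting described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the SketchGraphs pipeline: the AddEntity and AddConstraint records, the size bounds, and the split fractions are all hypothetical stand-ins chosen for the example.

```python
import random
from collections import namedtuple

# Hypothetical records (not the SketchGraphs classes): one op adds a primitive,
# the other adds a constraint that refers to earlier primitives by index.
AddEntity = namedtuple("AddEntity", ["label"])
AddConstraint = namedtuple("AddConstraint", ["label", "references"])


def num_entities(sequence):
    """Count the primitives that a construction sequence creates."""
    return sum(isinstance(op, AddEntity) for op in sequence)


def filter_by_size(sequences, min_entities=2, max_entities=16):
    """Keep sequences whose entity count lies within the assumed bounds."""
    return [s for s in sequences if min_entities <= num_entities(s) <= max_entities]


def split_sequences(sequences, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle with a fixed seed, then cut into validation, test, and train lists."""
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    return {
        "validation": shuffled[:n_val],
        "test": shuffled[n_val:n_val + n_test],
        "train": shuffled[n_val + n_test:],
    }


def toy_sequence(n_lines):
    """Build a toy sequence with n_lines line primitives and one constraint."""
    ops = [AddEntity("Line") for _ in range(n_lines)]
    if n_lines >= 2:
        ops.append(AddConstraint("Parallel", (0, 1)))
    return ops


sequences = [toy_sequence(n) for n in range(1, 21)]
kept = filter_by_size(sequences)
splits = split_sequences(kept, val_frac=0.2, test_frac=0.2)
print({name: len(part) for name, part in splits.items()})
```

Fixing the shuffle seed is what makes the split reproducible, which is the property a shared benchmark needs so that different models are compared on the same test sketches.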

[Figure: A sketch with its constraint graph]

How can SketchGraphs be installed and used?

Answer in brief: SketchGraphs can be installed via pip, with additional dependencies for training models.

Summary

Installation is straightforward for exploring data, but training requires PyTorch and torch-scatter.

Installation Steps

  • Install the package from a local copy of the SketchGraphs repository:
    pip install -e SketchGraphs

  • This installs, in editable mode, the dependencies needed to load and explore the dataset.

  • For training, PyTorch and torch-scatter must also be installed.
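Before starting a training run, it can help to confirm that the extra libraries are visible to the interpreter. The following is a minimal sketch using only the standard library; missing_training_deps is a hypothetical helper name, and the module names torch and torch_scatter are the import names of PyTorch and torch-scatter.

```python
import importlib.util


def missing_training_deps(modules=("torch", "torch_scatter")):
    """Return the names in `modules` that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_training_deps()
if missing:
    print("Missing training dependencies:", ", ".join(missing))
else:
    print("PyTorch and torch-scatter are importable.")
```

The check only looks up each module without importing it, so it runs quickly and does not fail on machines where only the data-exploration dependencies are installed.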

Example Usage

  • Explore data representations using the demo notebook:
    demos/sketchgraphs_demo.ipynb

  • This notebook shows:

    • Loading sketches.
    • Visualizing geometric constraint graphs.
    • Solving constraints using Onshape’s API.

Author’s reflection:
When I first tried the installation, the simplicity of the pip-based setup was striking. But the real power comes after adding PyTorch and torch-scatter: only then does the dataset turn from static data into a training ground for models.


What models are included with SketchGraphs?

Answer in brief: SketchGraphs includes baseline models for generative sketch modeling and automatic constraint inference, built on graph neural networks.

Summary

The models demonstrate how sketches can be represented as graphs and processed with GNNs for two key tasks.

Model Types

  1. Generative Modeling

    • Goal: Produce new CAD sketches.
    • Method: Learn the probability distribution over constraint graphs.
    • Application scenario: A system that generates plausible new sketch designs based on patterns in the dataset.
  2. Autoconstrain

    • Goal: Automatically add missing constraints to sketches.
    • Method: Use graph neural networks to predict which constraints should exist between entities.
    • Application scenario: When a designer creates a rough sketch, the model proposes constraints to stabilize it.
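To make the autoconstrain task concrete, here is a small sketch of its input and output: a set of roughly drawn segments goes in, and a list of proposed constraints comes out. Note that this uses a simple angle-tolerance heuristic rather than a graph neural network, so it is not the baseline model; propose_constraints, the segment format, and the 2-degree tolerance are all assumptions made for the example.

```python
import math
from itertools import combinations


def direction_angle(segment):
    """Angle of a segment in radians, folded by pi so orientation is unsigned."""
    (x0, y0), (x1, y1) = segment
    return math.atan2(y1 - y0, x1 - x0) % math.pi


def propose_constraints(segments, tol_deg=2.0):
    """Propose parallel or perpendicular constraints for nearly aligned segments.

    `segments` maps a name to ((x0, y0), (x1, y1)).
    """
    tol = math.radians(tol_deg)
    proposals = []
    for a, b in combinations(sorted(segments), 2):
        diff = abs(direction_angle(segments[a]) - direction_angle(segments[b]))
        # Smallest angle between the two undirected lines, in [0, pi/2].
        diff = min(diff, math.pi - diff)
        if diff <= tol:
            proposals.append(("parallel", a, b))
        elif abs(diff - math.pi / 2) <= tol:
            proposals.append(("perpendicular", a, b))
    return proposals


# A rough, hand-drawn-style sketch: "a" and "b" are almost horizontal,
# "c" is almost vertical.
rough = {
    "a": ((0.0, 0.0), (10.0, 0.1)),
    "b": ((0.0, 5.0), (10.0, 5.0)),
    "c": ((10.0, 0.0), (10.05, 5.0)),
}
for proposal in propose_constraints(rough):
    print(proposal)
```

A learned model replaces the fixed tolerance with a prediction conditioned on the whole graph, which lets it propose constraints that a purely local geometric rule would miss.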

Author’s reflection:
These baseline models are not just reference implementations. They embody a philosophy: CAD design is not about static shapes, but about relationships that make designs adaptive and resilient.


Why does SketchGraphs focus on constraints instead of just shapes?

Answer in brief: Constraints encode the design logic that makes sketches adaptable, whereas raw shapes lack this relational information.

Summary

SketchGraphs emphasizes constraint graphs to capture design intent, which is critical for modification and reuse.

Practical Example

  • A rectangle drawn as four lines could be represented as just geometry.
  • But if constraints define it as a closed loop with perpendicular and parallel relationships, the design becomes editable and consistent.
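The rectangle example can be written down as a tiny constraint graph. This is a minimal sketch in plain Python and does not use the SketchGraphs library; the Line and ConstraintGraph classes and the constraint names are hypothetical, chosen only to mirror the nodes-and-edges description.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Line:
    """A line-segment primitive (a node of the constraint graph)."""
    name: str
    start: tuple
    end: tuple


@dataclass
class ConstraintGraph:
    nodes: dict = field(default_factory=dict)  # name -> Line
    edges: list = field(default_factory=list)  # (kind, name_a, name_b)

    def add_node(self, line):
        self.nodes[line.name] = line

    def add_constraint(self, kind, a, b):
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("constraint references an unknown primitive")
        self.edges.append((kind, a, b))

    def constraints_of(self, name):
        """Sorted constraint kinds that touch the primitive `name`."""
        return sorted(kind for kind, a, b in self.edges if name in (a, b))


def rectangle_graph(width, height):
    """Four lines plus the constraints that make them behave as a rectangle."""
    g = ConstraintGraph()
    corners = [(0.0, 0.0), (width, 0.0), (width, height), (0.0, height)]
    names = ["bottom", "right", "top", "left"]
    for i, name in enumerate(names):
        g.add_node(Line(name, corners[i], corners[(i + 1) % 4]))
    # Closed loop: each side shares an endpoint with the next one.
    for i in range(4):
        g.add_constraint("coincident", names[i], names[(i + 1) % 4])
    g.add_constraint("parallel", "bottom", "top")
    g.add_constraint("parallel", "right", "left")
    g.add_constraint("perpendicular", "bottom", "right")
    return g


rect = rectangle_graph(4.0, 2.0)
print(len(rect.nodes), "primitives,", len(rect.edges), "constraints")
print(rect.constraints_of("bottom"))
```

Without the edges, the four Line objects are just coordinates; with them, a solver that moves one corner has enough information to keep the shape a rectangle.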

Author’s reflection:
In collaborative design, I’ve seen how losing constraints makes a CAD file brittle. SketchGraphs’ focus on constraints resonates with real-world needs: preserving intent so designs remain functional over time.


What are the licensing considerations?

Answer in brief: The dataset uses CAD sketches from Onshape, and the copyright remains with the original creators.

Summary

Use of the dataset is subject to Onshape’s Terms of Use regarding user content.

Implication

Researchers should be aware that while the dataset is open for study, commercial use may require additional legal review.


Action Checklist / Implementation Steps

  • [ ] Install SketchGraphs with pip.

  • [ ] Install PyTorch and torch-scatter for training models.

  • [ ] Choose the dataset format:

    • Raw JSON for full fidelity.
    • Construction sequence for ML training.
    • Filtered sequence for efficient experimentation.
  • [ ] Explore the demo notebook to load, visualize, and solve constraints.

  • [ ] Train baseline models or develop new ones based on GNNs.

  • [ ] Consider licensing terms when applying the dataset beyond research.


One-page Overview

  • What is SketchGraphs?
    A dataset of 15 million CAD sketches represented as geometric constraint graphs.

  • Why is it important?
    It enables machine learning research in design automation and program induction by providing relational geometry data.

  • What formats are available?

    • Raw JSON (43GB).
    • Construction sequence (15GB).
    • Filtered construction sequence with train/validation/test splits.
  • How to use it?
    Install via pip, explore with the demo notebook, train models with PyTorch and torch-scatter.

  • What models are included?
    Baseline GNNs for generative sketch modeling and automatic constraint inference.

  • What should you keep in mind?
    Constraints capture design logic; copyright remains with the original sketch authors.


Frequently Asked Questions (FAQ)

Q1: What is the main content of the SketchGraphs dataset?
A1: It contains 15 million CAD sketches, each represented as a geometric constraint graph.

Q2: Which dataset format should I use for machine learning?
A2: The construction sequence or the filtered sequence datasets are most suitable.

Q3: How do I install SketchGraphs?
A3: Use pip install -e SketchGraphs, and install PyTorch and torch-scatter for model training.

Q4: What tasks can the baseline models solve?
A4: Generative modeling of sketches and automatic constraint inference.

Q5: How big is the dataset?
A5: Raw JSON data is about 43GB; the construction sequence is 15GB.

Q6: Can I use SketchGraphs data for commercial projects?
A6: The sketches remain under the copyright of their original creators, so usage must follow Onshape’s Terms of Use.

Q7: What is the difference between raw and filtered datasets?
A7: Raw data is complete but large (about 43GB); filtered data drops overly large or small sketches and comes with predefined train, validation, and test splits.

Q8: Why focus on constraints instead of shapes?
A8: Constraints capture design intent, making sketches more adaptable and meaningful than geometry alone.