
Sweep Next-Edit: The 1.5B Parameter Code Autocomplete Model That Outperforms Giants Like Qwen3-8B

Open-Source 1.5B Parameter Next-Edit Code Autocomplete Model: Performance, Design, and Practice

For programmers, code autocomplete tools have long been an indispensable part of daily development. An efficient and accurate autocomplete model can significantly reduce repetitive coding work and boost productivity. Conversely, slow, low-accuracy tools or those requiring code to be uploaded to the cloud not only disrupt development workflows but also pose potential data privacy risks. Today, we introduce Sweep Next-Edit—an open-source 1.5B parameter code autocomplete model designed to address these pain points. It runs locally on laptops in under 500ms and outperforms models over 4x its size on core next-edit benchmarks. We are open-sourcing the model weights to empower the community to build fast, privacy-preserving autocomplete experiences for all IDEs, including VSCode, Neovim, Emacs, and beyond.

I. Next-Edit Models: Balancing Speed and Quality

The core metrics for evaluating a code autocomplete model are straightforward: response speed and prediction accuracy. Many models sacrifice accuracy for speed or vice versa, but Sweep Next-Edit achieves a balanced performance through optimized model design and inference logic.

1. Intuitive Comparison of Speed vs. Quality

We conducted comprehensive speed and quality evaluations of mainstream next-edit models, covering various parameter sizes of the Qwen3 series, Zed (Zeta), Instinct (Continue), Mercury Coder (Inception), and Sweep series (0.5B, 1.5B, 3B, 7B).

(Figure: speed in tokens/second vs. overall accuracy for each model.)

Model                  Overall Accuracy
Qwen3-0.6B             17.54%
Qwen3-1.7B             30.19%
Qwen3-4B               48.27%
Qwen3-8B               55.62%
Zed (Zeta)             43.27%
Instinct (Continue)    25.30%
Mercury Coder          54.09%
Sweep 0.5B             61.30%
Sweep 1.5B             67.82%
Sweep 3B               72.12%
Sweep 7B               81.28%

The data clearly shows that even the 1.5B parameter version of Sweep Next-Edit outperforms models of similar or larger sizes—surpassing the 8B parameter Qwen3-8B by 12 percentage points and Mercury Coder by 13.73 percentage points in overall accuracy.

2. Accuracy Breakdown by Task Category

To evaluate model capabilities more precisely, we split the tests into five core task categories: edits below the cursor, edits above the cursor, distant jump edits (Tab to Jump), fill-in-the-middle (FIM), and no redundant suggestions (Noise). Below is the performance of each model across these tasks:

Model                Noise      Tab to Jump   Below     Above     FIM       Overall
Sweep 7B             100.00%    82.03%        89.84%    75.51%    74.22%    81.28%
Sweep 3B             100.00%    82.03%        82.03%    57.14%    60.16%    72.12%
Sweep 1.5B           96.88%     74.22%        71.88%    46.94%    56.25%    67.82%
Sweep 0.5B           96.88%     66.41%        63.28%    53.06%    43.75%    61.30%
Mercury Coder        62.50%     70.31%        71.88%    53.06%    64.06%    54.09%
Zed Zeta             87.50%     62.50%        67.19%    69.39%    35.94%    43.27%
Continue Instinct    34.38%     23.44%        10.94%    8.16%     37.50%    25.30%
Qwen2.5-Coder-7B     100.00%    56.25%        54.69%    61.22%    41.41%    55.62%
Qwen3-8B             96.88%     54.69%        34.38%    22.45%    42.19%    48.27%
Qwen3-4B             90.62%     33.59%        24.22%    16.33%    25.00%    30.19%
Qwen3-1.7B           68.75%     14.06%        14.06%    14.29%    15.62%    17.54%

The “Noise” metric measures how often the model avoids redundant suggestions when no changes are needed (higher accuracy = fewer redundancies). “Tab to Jump” evaluates the model’s ability to predict changes far from the cursor. “Below/Above” correspond to predictions for code near the cursor, and “FIM” assesses filling in code gaps at the cursor position. Notably, the Sweep series leads across all categories, with the 1.5B+ versions meeting the demands of most daily development scenarios.

II. Why Do Other Next-Edit Models Underperform?

During evaluations, we found that models like Zeta (Zed) and Instinct (Continue) fell short of expectations, with core issues concentrated in format design, tokenization strategies, training data, and technical choices:

1. Fatal Flaws in Formatting and Tokenization

Both Instinct and Zeta use markers like <|editable_region_start|> and <|editable_region_end|> to define code editing boundaries. However, these markers tokenize poorly—each splits into 7 separate tokens. For models, this often leads to “marker misalignment”: placing the end marker incorrectly, resulting in extra or missing lines of code that break the structure.

In contrast, dedicated special tokens (optimized during model pre-training) are far more suitable for this use case, as they tokenize consistently and are easier for models to recognize.
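
To see the difference concretely, here is a minimal sketch (assuming the Qwen2.5-Coder tokenizer published on Hugging Face; exact token counts may vary by tokenizer version) that compares how a plain-text marker and a pre-trained special token are encoded:

# Sketch: compare token counts for a plain-text marker vs. a dedicated
# special token. Assumes the Qwen/Qwen2.5-Coder-1.5B tokenizer on Hugging Face.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-1.5B")

# A marker written as ordinary text is split into several sub-word tokens.
plain_marker = "<|editable_region_start|>"
plain_ids = tokenizer.encode(plain_marker, add_special_tokens=False)

# A token registered during pre-training encodes as a single ID.
special_marker = "<|fim_prefix|>"
special_ids = tokenizer.encode(special_marker, add_special_tokens=False)

print(f"{plain_marker!r} -> {len(plain_ids)} tokens")
print(f"{special_marker!r} -> {len(special_ids)} token(s)")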

2. Poor Handling of Multi-File Context

Code development often involves cross-file interactions, yet Zeta lacks multi-file context support entirely. While Instinct supports it, it uses custom tokens instead of the dedicated special tokens used in Qwen2.5-Coder’s pre-training. This is analogous to asking someone familiar with standard grammar to read text written in a dialect—comprehension and output suffer significantly.

3. Redundant Instructions Add Model Noise

Instinct’s prompt includes irrelevant information, such as a system message stating: “You are Instinct, an intelligent next-edit predictor developed by Continue.dev.” This information provides no value for code completion but consumes token resources, increasing inference noise—especially harmful for small-parameter models, which suffer reduced efficiency, accuracy, and increased latency.

4. Issues with Training Data and Technical Choices

  • Insufficient Data: Zeta was trained on only a few hundred samples, and Instinct on just a few thousand—most from a single code repository, leading to severe data sparsity.
  • Poor Technical Selection: Zeta used LoRA for training, but our tests show LoRA struggles to teach models basic pattern matching, let alone precise next-edit completion.

III. Core Design Principles of Sweep Next-Edit

To address these limitations, we redesigned the model around two core dimensions—training format and editing window—to better align with real-world code editing workflows.

1. Optimized Training Format: Adapting from FIM to Next-Edit

Qwen2.5-Coder’s pre-training format is designed primarily for FIM (fill-in-the-middle) tasks, not next-edit scenarios. Its format is as follows:

<|repo_name|>{repo_name}
<|file_sep|>{file_path_1}
{file_content_1}
<|file_sep|>{file_path_2}
{file_content_2}
<|file_sep|>{current_file}
<|fim_prefix|>
{prefix}
<|fim_suffix|>
{suffix}
<|fim_middle|>{completion}

In contrast, our training format for Sweep Next-Edit prioritizes “recent code changes” and focuses the model on rewriting code around the cursor:

<|file_sep|>{file_path_1}
{file_content_1}
<|file_sep|>{file_path_2}
{file_content_2}
<|file_sep|>{changed_file_1}.diff
original:
{before_changes_of_diff}
updated:
{after_changes_of_diff}
<|file_sep|>{changed_file_2}.diff
original:
{before_changes_of_diff}
updated:
{after_changes_of_diff}
<|file_sep|>original/{file_path}
{contents_prior_to_most_recent_change}
<|file_sep|>current/{file_path}
{current_state_of_contents}
<|file_sep|>updated/{file_path}
{updated_state_of_contents}
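
As a rough illustration only, a prompt in this layout could be assembled as follows; the helper and data structures below are hypothetical and simply mirror the format above:

# Sketch: assemble a next-edit prompt in the layout shown above.
# The dataclass and function are illustrative, not Sweep's actual code.
from dataclasses import dataclass

@dataclass
class RecentDiff:
    path: str
    before: str  # region contents prior to the change
    after: str   # region contents after the change

def build_next_edit_prompt(
    context_files: dict[str, str],   # {path: contents} for cross-file context
    recent_diffs: list[RecentDiff],  # most recent user edits, oldest first
    file_path: str,
    original_window: str,            # window contents before the most recent change
    current_window: str,             # window contents as the user sees them now
) -> str:
    parts = []
    for path, contents in context_files.items():
        parts.append(f"<|file_sep|>{path}\n{contents}")
    for diff in recent_diffs:
        parts.append(
            f"<|file_sep|>{diff.path}.diff\n"
            f"original:\n{diff.before}\nupdated:\n{diff.after}"
        )
    parts.append(f"<|file_sep|>original/{file_path}\n{original_window}")
    parts.append(f"<|file_sep|>current/{file_path}\n{current_window}")
    # The model continues from here with the updated window contents.
    parts.append(f"<|file_sep|>updated/{file_path}\n")
    return "\n".join(parts)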

(1) Choosing the Diff Format: Finding the Optimal Solution with Genetic Algorithms

There are numerous ways to represent code changes (diffs): unified diff format, side-by-side comparison, original/updated blocks, old/new labels, and with or without .diff markers. To find the optimal format, we conducted comprehensive testing using a genetic algorithm:

  1. Initialize Population: Start with 10 diff format variants, covering various markers, labels, and presentation styles.
  2. Evaluate Fitness: Run inference on a held-out validation set for each format and calculate whitespace-agnostic exact-match accuracy.
  3. Selection & Breeding: Retain top-performing formats and create new variants by combining elements from successful designs (e.g., using diff markers from one format and label styles from another).
  4. Mutation: Randomly tweak some formats to explore new variations.
  5. Iterate: Repeat for 3 generations, evaluating ~30 prompt formats total.
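
A condensed sketch of that search loop, assuming a format is just a bundle of styling choices and that evaluate() returns validation accuracy for a candidate format:

# Sketch of the prompt-format search. Format fields and evaluate() are
# placeholders for whatever renders a candidate prompt and measures
# whitespace-agnostic exact-match accuracy on the held-out set.
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Format:
    diff_marker: str    # e.g. ".diff" suffix vs. none
    change_labels: str  # e.g. "original/updated" vs. "old/new"
    layout: str         # e.g. "blocks" vs. "unified"

def crossover(a: Format, b: Format) -> Format:
    # Combine elements from two successful designs.
    return Format(
        diff_marker=random.choice([a.diff_marker, b.diff_marker]),
        change_labels=random.choice([a.change_labels, b.change_labels]),
        layout=random.choice([a.layout, b.layout]),
    )

def mutate(f: Format, options: dict[str, list[str]]) -> Format:
    # Randomly tweak one field to explore new variations.
    field = random.choice(list(options))
    return Format(**{**f.__dict__, field: random.choice(options[field])})

def search(population: list[Format], evaluate, options: dict[str, list[str]], generations: int = 3) -> Format:
    for _ in range(generations):
        ranked = sorted(population, key=evaluate, reverse=True)
        survivors = ranked[: max(2, len(ranked) // 2)]
        children = [crossover(*random.sample(survivors, 2)) for _ in range(len(population) - len(survivors))]
        # Mutate only some children so good combinations are not always disturbed.
        population = survivors + [mutate(c, options) if random.random() < 0.3 else c for c in children]
    return max(population, key=evaluate)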

The winning format—original/updated blocks without unified diff syntax—was initially surprising, but further analysis revealed key reasons:

Unified diff formats are user-friendly (e.g., lines marked with +/−), but for models, they become continuous text streams like this:

             if (params.length >= 8 && params[7] != null) {\n-                linkState = LinkState.valueOf(params[7].toUpperCase());\n+                linkState = LinkState.valueOf(params[7].toUpperCase(Locale.ENGLISH));\n             }

Models struggle to “visually” identify line correspondences as humans do. In contrast, the original/updated block format lets models clearly compare pre- and post-change differences:

original:
if (params.length >= 8 && params[7] != null) {
    linkState = LinkState.valueOf(params[7].toUpperCase());
}
updated:
if (params.length >= 8 && params[7] != null) {
    linkState = LinkState.valueOf(params[7].toUpperCase(Locale.ENGLISH));
}

This aligns with Claude Sonnet’s preference for string replacement over unified diffs—such formats are closer to models’ pre-training data, leading to more accurate generation.

2. Editing Window: Fixed Windows Replace AST Dynamic Boundaries

The design of the code editing window is another critical factor affecting model accuracy. We initially tested AST (Abstract Syntax Tree)-based dynamic boundaries but found fixed windows deliver superior results.

(1) Problems with Instinct’s Dynamic Window

Instinct’s editing window design is:

20 lines above changes
<|editable_region_start|>
1 line above cursor
line containing cursor
5 lines below cursor
<|editable_region_end|>
20 lines below changes

The model is tasked with rewriting the 7 lines inside the markers, but evaluations showed it often continues generating content after <|editable_region_end|>, adding irrelevant code. This occurs because the model can see the 20 lines outside the markers and mistakenly believes it needs to complete them. Additionally, the “editable region” concept is outside the model’s pre-training distribution, increasing comprehension difficulty.

(2) Sweep’s Fixed Window Design

We simplified the editing window to a fixed 21-line span: 10 lines above the cursor, the cursor line, and 10 lines below. The model’s task is to rewrite these 21 lines:

10 lines above cursor
line containing cursor
10 lines below cursor

While generating 21 lines theoretically slows performance, our custom n-gram implementation and early termination logic in TensorRT-LLM reduce average warm-start autocomplete latency to under 100ms.
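
For reference, a minimal sketch of slicing that fixed window out of a buffer and merging the rewrite back in; the clamping behavior at file edges is an assumption rather than a detail taken from Sweep's implementation:

# Sketch: extract the fixed 21-line editing window around the cursor and
# splice the model's rewrite back into the file.
def editing_window(lines: list[str], cursor_line: int, radius: int = 10) -> tuple[int, int, list[str]]:
    """Return (start, end, window_lines) covering up to `radius` lines on
    each side of the zero-indexed cursor line, clamped to the file."""
    start = max(0, cursor_line - radius)
    end = min(len(lines), cursor_line + radius + 1)
    return start, end, lines[start:end]

def apply_rewrite(lines: list[str], start: int, end: int, rewritten: list[str]) -> list[str]:
    # Replace the original window with the rewritten lines.
    return lines[:start] + rewritten + lines[end:]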

(3) Why Abandon AST Dynamic Boundaries?

AST-based dynamic boundaries seem elegant in theory: if a user edits code inside a function, the window expands to include the entire function body. However, practical implementation revealed two critical issues:

  • Inconsistent Window Sizes: Function lengths vary drastically (from 5 to 30+ lines), leading to unstable training as the model encounters wildly different output lengths.
  • Boundary Prediction Errors: The model must learn both “what to generate” and “where boundaries end.” A single boundary error invalidates the entire output.

Fixed windows eliminate these issues by letting the model focus solely on generating correct content, knowing exactly how many lines to produce—dramatically improving accuracy.

IV. The Training Process of Sweep Next-Edit

A high-performance model requires both solid Supervised Fine-Tuning (SFT) for foundational skills and Reinforcement Learning (RL) for refinement. Our training process consists of two core phases:

1. Supervised Fine-Tuning (SFT): Building the Foundation

(1) Training Data and Hardware

We completed SFT in 4 hours using 8 H100 GPUs and ~100k training samples. The data was scraped from the most popular permissively licensed GitHub repositories (filtered by commit count) over the past year. While slightly biased toward recent trends (e.g., LLM inference, MCP servers), this aligns with the daily development tasks of most Sweep users. We also upsampled data to match the language distribution of JetBrains users, including Java, Kotlin, C#, PHP, and Ruby.

(2) Limitations of the SFT Phase

Despite using high-quality data, the model exhibited two key issues post-SFT:

  • Unparsable Generated Code: The model only sees a 21-line slice of the file, with no view of the surrounding code. Its rewrite can therefore look fine in isolation but fail to parse once merged back into the full file (e.g., unbalanced parentheses).
  • Excessively Large Diffs: The fully written code seen in training data is naturally more verbose than the partial, in-progress state a user is actively editing, so the model tends to propose oversized changes. Excessive insertions and deletions distract users and reduce suggestion acceptance rates.

2. On-Policy Reinforcement Learning (RL): Refining Details

SFT teaches models to mimic “correct” outputs, while RL lets models explore freely and learn from reward signals. This addresses SFT’s limitations: SFT penalizes functionally equivalent but differently formatted outputs, while RL recognizes multiple valid solutions, enabling the model to discover outputs that are both correct and natural.

We ran 2000 steps of RL to eliminate undesirable behaviors, designing two core reward functions:

  1. Syntax Parsing Reward: Use tree-sitter to verify if generated code parses correctly in mainstream languages (Java, Python, etc.) when merged into the file.
  2. Size Regularization Reward: Restrict the size of code changes to avoid overly large diffs.
(Figure: parse reward increasing over the course of reinforcement learning training.)
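
A hedged sketch of what these two rewards might look like; the weights, thresholds, and the tree_sitter_languages helper are assumptions, not Sweep's published reward code:

# Sketch of the two reward signals: syntax parsing via tree-sitter and a
# penalty on oversized diffs. Weights and thresholds are illustrative.
import difflib
from tree_sitter_languages import get_parser  # assumed convenience package

def parse_reward(merged_file: str, language: str = "java") -> float:
    """+1 if the file still parses after merging the rewrite, -1 otherwise."""
    tree = get_parser(language).parse(merged_file.encode("utf-8"))
    return 1.0 if not tree.root_node.has_error else -1.0

def size_reward(original_window: str, rewritten_window: str, max_changed_lines: int = 8) -> float:
    """Penalize rewrites that touch many more lines than a typical next edit."""
    diff = difflib.unified_diff(original_window.splitlines(), rewritten_window.splitlines(), lineterm="")
    changed = sum(
        1 for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )
    return 1.0 if changed <= max_changed_lines else -1.0

def total_reward(merged_file: str, original_window: str, rewritten_window: str) -> float:
    return parse_reward(merged_file) + 0.5 * size_reward(original_window, rewritten_window)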

RL significantly improved the model’s code parsing success rate and reduced redundant changes, making outputs more aligned with real developer editing habits.

V. How to Try Sweep Next-Edit?

Sweep Next-Edit’s value lies in practical application. We offer two convenient ways to try it, catering to different developer needs:

1. JetBrains IDE Plugin

If you use JetBrains IDEs (e.g., IntelliJ IDEA, PyCharm, Android Studio), install the Sweep AI Plugin directly. The plugin integrates our custom inference engine and Sweep Next-Edit, delivering lightning-fast next-edit suggestions that run entirely locally—no code uploads to the cloud.

2. Self-Host the Model (For Custom Development)

If you want to build custom tools (e.g., integrate with VSCode or Neovim), follow the example code on Hugging Face:

(1) Environment Setup

First, download the model file and run script:

  • Model File: Sweep Next-Edit 1.5B (GGUF format, Q8_0 quantization)
  • Run Script: run_model.py

Install dependencies:

uv pip install llama-cpp-python huggingface_hub

(2) Run the Model

Execute the script to experience next-edit capabilities:

python run_model.py

Key model parameters (compatible with most local devices):

  • Format: GGUF (Q8_0 quantization)
  • Parameters: 1.5B
  • Context Length: 8192 tokens
  • Base Model: Qwen2.5-Coder
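
If you would rather skip the provided script, a minimal sketch along the same lines is shown below; the repository ID and filename are placeholders, so take the real values from the Hugging Face model page:

# Sketch: download the GGUF weights and run one completion locally.
# REPO_ID and FILENAME are placeholders for the values on the model page.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

REPO_ID = "..."    # placeholder: the Sweep Next-Edit repo ID on Hugging Face
FILENAME = "..."   # placeholder: the Q8_0 GGUF filename

model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
llm = Llama(model_path=model_path, n_ctx=8192)  # 8192-token context per the model card

prompt = "..."  # build the next-edit prompt in the format described earlier
output = llm(prompt, max_tokens=512, temperature=0.0)  # greedy decoding
print(output["choices"][0]["text"])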

VI. Benchmark Details: Ensuring Objective Evaluations

To validate the credibility of our performance data, we detail the benchmark setup and evaluation logic:

1. Latency Testing (Speed)

(1) Test Environment

  • All requests are cold (no KV cache), simulating first-time completion in real development.
  • Task: Rewrite a 21-line code snippet per our next-edit format.
  • Optimization: Inference on our TensorRT-LLM fork, using FP8 quantization + n-gram speculative decoding with greedy decoding (temperature=0).
  • Third-Party Model Handling: Mercury-Coder latency is measured via its official API endpoint; 500 errors are treated as “no changes” (aligning with production logic).

(2) Input/Output Scale

  • Input: 6000±2500 tokens per request
  • Output: 70±20 tokens per request

(3) Tokens Per Second (TPS) Calculation

TPS = Average output tokens / Average end-to-end latency (tokenization via Qwen’s tokenizer).
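
As a small sketch of that calculation (latency values come from your own measurements; the token counting follows the description above):

# Sketch: tokens-per-second from measured requests, counting output tokens
# with a Qwen tokenizer. Latencies are end-to-end times in seconds.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-1.5B")

def tokens_per_second(outputs: list[str], latencies_s: list[float]) -> float:
    avg_tokens = sum(len(tokenizer.encode(o, add_special_tokens=False)) for o in outputs) / len(outputs)
    avg_latency = sum(latencies_s) / len(latencies_s)
    return avg_tokens / avg_latency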

2. Quality Testing (Accuracy)

(1) Evaluation Dimensions

The overall score is the average of five core benchmarks:

  1. Next-edit suggestions below the cursor: Predict code changes near the cursor’s lower area.
  2. Next-edit suggestions above the cursor: Predict code changes near the cursor’s upper area.
  3. Tab-to-jump next-edit suggestions: Predict code changes far from the cursor.
  4. Fill-in-the-middle (FIM) completions: Fill code gaps at the cursor position.
  5. Noise evaluation: Avoid redundant suggestions when no changes are needed.

(2) Evaluation Metric: Why “Exact Match”?

We chose “whitespace-agnostic exact-match accuracy” over soft metrics like CodeBLEU or LLM-as-a-judge (used by Instinct’s evaluation) for two key reasons:

  • Next-edit suggestions are highly predictable: Most daily development completions are low-entropy changes (e.g., repeating the last edit, fixing typos, completing patterns), often with only one optimal solution.
  • “Almost correct” suggestions are worse than no suggestions: 90% correct code still requires manual fixes, disrupting workflows and causing more frustration than no suggestion at all. Soft metrics’ “partial credit” fails to reflect real user experiences.
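
As a rough sketch of what whitespace-agnostic exact match can look like in code (the precise normalization used in Sweep's benchmark is not spelled out here, so this is an assumption):

# Sketch: whitespace-agnostic exact match. Normalization details are assumed.
def normalize(code: str) -> str:
    # Collapse all whitespace runs so only the token sequence matters.
    return " ".join(code.split())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

def accuracy(predictions: list[str], references: list[str]) -> float:
    matches = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return matches / len(references)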

Consider this real test case:
Test Scenario: The model must predict edits based on a recent pattern (adding Locale.ENGLISH to toUpperCase()).

  • Instinct: No changes—failed to recognize the pattern.
  • Zeta: Added irrelevant changes that broke code structure.
  • Sweep 1.5B: Only made the precise pattern-matched change with no redundancies.

This example illustrates why exact match best reflects a model’s real value to developers.

VII. Frequently Asked Questions (FAQ)

Q1: Which IDEs can Sweep Next-Edit be used with?

A: We open-sourced the model weights, so it can theoretically be adapted to all major IDEs (VSCode, Neovim, Emacs, JetBrains, etc.). A ready-to-use JetBrains plugin is available now, and integration examples for other IDEs can be found in the run_model.py script on Hugging Face.

Q2: Why choose 1.5B parameters for the open-source version?

A: 1.5B strikes the optimal balance of performance, speed, and device compatibility. It delivers higher accuracy than models more than 4x its size while running locally on ordinary laptops (response time <500ms), balancing privacy and efficiency.

Q3: Is the training data legally compliant?

A: All training data comes from popular GitHub repositories with permissive licenses (e.g., MIT, Apache 2.0), filtered by commits from the past year. This ensures both data quality and compliance with open-source agreements.

Q4: How are the RL reward functions calculated?

A: Core dimensions include: ① Using tree-sitter to parse generated code—adding points for successful parsing, deducting for failures; ② Measuring the line/character count of code changes—adding points for staying within preset reasonable limits, deducting for excess. Reward function weights were validated through multiple rounds to align with real user experiences.

Q5: Why are fixed windows better than AST dynamic boundaries?

A: AST dynamic boundaries suffer from instability—function/class lengths vary drastically, making it hard for the model to learn consistent output sizes during training. Additionally, the model must predict both “what to change” and “where boundaries end,” doubling the error risk. Fixed windows let the model focus solely on generating correct content, improving accuracy and stability.

Q6: What hardware is required to run Sweep Next-Edit locally?

A: The 1.5B parameter GGUF quantized version runs smoothly on most laptops (8GB+ RAM, Intel Core i5/AMD Ryzen 5 or higher). For faster response times, an entry-level GPU (e.g., RTX 3050) is recommended.

Conclusion

The open-sourcing of Sweep Next-Edit aims to provide developers with a “useful, fast, and private” code autocomplete tool while sharing our insights on model design, training, and evaluation to help the community avoid common pitfalls. This 1.5B parameter model outperforms most larger models while running locally on ordinary devices, balancing efficiency and data security.
