Vibe Coding in Practice: Alibaba’s Guide to Scaling AI Coding Tools and Avoiding Pitfalls
With the rapid advancement of Large Language Models (LLMs), the concept of “Vibe Coding” has entered the developer’s toolkit. Simply put, it refers to a programming style that relies heavily on intuition, natural language interaction, and AI assistance. But as an emerging paradigm, how does it perform in enterprise environments? Is it a genuine efficiency booster or a source of new technical debt?
Based on the insights shared by Xiang Bangyu, a Senior Technical Expert at Alibaba, at the QCon Global Software Development Conference, this article dives deep into the current state of Vibe Coding tools at Alibaba. We will explore the challenges encountered and the solutions developed. The core question we aim to answer is: When AI coding tools evolve from Copilots to Agents, what actually happens to development efficiency and code quality, and how should enterprises manage the associated costs and security risks?
The Four Forms of Vibe Coding and Alibaba’s Current Status
A Panoramic View of Mainstream Tool Types
Current Vibe Coding tools are not a monolith; they can be broadly categorized into four types based on deployment environment and interaction style. Each has specific use cases and limitations.
Native IDEs like Cursor and Alibaba’s internal QCoder offer the most complete development experience and are particularly favored by frontend developers. IDE Plugins remain the most mainstream form within enterprises because they don’t require changing developers’ existing workflows. Surprisingly, CLI Tools have seen a rise in popularity. Initially, the team thought command-line tools wouldn’t be accepted by mainstream R&D, but they have proven highly usable in CI pipelines and asynchronous task execution.
Real Data from Alibaba: The Efficiency Secrets of Power Users
At Alibaba, Vibe Coding is more than just a concept; it is deeply integrated into the daily R&D process. Data from internal tools like Aone Copilot and Aone Agent shows a clear trend: The depth of tool usage correlates positively with code output efficiency.
Data indicates that code submission volume for “power users” (those who frequently use Agent mode) has increased significantly. In September, power users submitted an average of about 560 lines of code per day, versus about 400 lines for average users, a gap of roughly 40%. Lines of code do not equal value, and much of the work is collaborative, so this does not prove a corresponding rise in individual output; it does, however, demonstrate the effectiveness of Agent mode in improving coding efficiency.
Even more interesting is the diversification of the user base. Tools like Aone Agent, a Web Agent form, are no longer limited to backend engineers. QA engineers use it to generate unit tests, product managers and operators use it for data research, and even designers are getting involved. This suggests that Vibe Coding is lowering the barrier to R&D, allowing non-professional developers to complete tasks that previously required coding knowledge.
Unavoidable Pitfalls: Real User Pain Points in Vibe Coding
Despite the impressive data, user frustration during actual usage cannot be ignored. Backend logs are filled with complaints like “the computer is too stupid.” This frustration isn’t groundless; it stems from challenges in code quality, debugging experience, and tool stability.
The Hidden Trap of Code Quality: Self-Consistency and Security Vulnerabilities
The biggest problem with AI-generated code is that it “looks correct” but hides fatal flaws.
First is the issue of consistency. In existing code repositories, AI tends to generate code matching its training data style, ignoring the project’s established conventions, leading to stylistic fragmentation. Second is the insufficient handling of edge cases, where low-level errors like null pointer exceptions and array out-of-bounds occur frequently. The most critical issue is security vulnerabilities. A Stanford University study points out that the proportion of injection-type vulnerabilities (such as SQL injection) in AI-generated code is as high as 45%. In practice, we also frequently observe XSS attacks and SQL injection risks.
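As a concrete illustration (our own sketch, not code from the talk), the pattern behind many of these injection findings is untrusted input interpolated straight into markup or queries; the standard fix is escaping or parameterization. Function names below are assumptions for the example:

```javascript
// Illustrative sketch of the XSS pattern often seen in AI-generated code.
// Function names are assumptions for this example, not Alibaba internals.

// Vulnerable: untrusted input is interpolated straight into HTML,
// so a `<script>` payload in `text` would execute as-is.
function renderCommentUnsafe(text) {
  return `<div class="comment">${text}</div>`;
}

// Safer: escape HTML metacharacters before interpolation.
function escapeHtml(text) {
  const map = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return String(text).replace(/[&<>"']/g, (ch) => map[ch]);
}

function renderComment(text) {
  return `<div class="comment">${escapeHtml(text)}</div>`;
}
```

The unsafe and safe versions look almost identical in a diff, which is exactly why such flaws slip through casual review of AI output.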
Here is an instructive case: The AI “Self-Consistency” Trap.
To solve quality issues, we tried letting AI generate corresponding unit tests alongside the code. Theoretically, mutual verification between code and tests should guarantee quality. However, the results were shocking.
// Example: a logically flawed array deduplication function
function unique(arr) {
  // Logic error: simply returns the original array without deduplicating
  return arr;
}

// AI-generated unit test, quietly adapted to fit the flawed code
test('unique function should work', () => {
  expect(unique([1, 2, 3])).toEqual([1, 2, 3]); // the input has no duplicates, so the test passes
});
In reality, the AI might generate a logically incorrect deduplication function and then generate a test case that “pretends” to pass or adapts the test logic to fit the incorrect code. The AI fits itself logically, forming a perfect closed-loop error. This means if you rely entirely on AI to close the “code + test” loop, developers lose the last line of defense.
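The failure mode is easy to reproduce. In the sketch below (illustrative, not from the talk), a test input chosen without duplicates lets the broken function validate itself, while a human-chosen input exposes it immediately:

```javascript
// The flawed implementation from the example above
function unique(arr) {
  return arr; // bug: no deduplication
}

// The AI's self-consistent test input contains no duplicates, so it passes:
const aiTestPasses =
  JSON.stringify(unique([1, 2, 3])) === JSON.stringify([1, 2, 3]); // true

// A human checkpoint picks an input with real duplicates and catches the bug:
const humanTestPasses =
  JSON.stringify(unique([1, 2, 2, 3])) === JSON.stringify([1, 2, 3]); // false
```

This is why the human checkpoint matters: not reviewing the code line-by-line, but at minimum reviewing whether the test inputs can actually falsify the implementation.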
Debugging and Maintenance: Technical Debt from Black Boxes
Vibe Coding is, to some extent, creating a new type of “black box” technical debt.
We observed that after adopting Vibe Coding, debugging time actually increased by 30% to 50%. Why? Because users often skip checking the details of generated code and accept the DIFF directly. Once a problem arises, facing the massive amount of generated code, developers are often at a loss, not knowing which step of the logic went wrong.
Traditional debugging methods fail in the face of AI. Human developers are accustomed to using breakpoints and viewing stack traces, but current Vibe Coding tools support this poorly. They tend to prefer a primitive method—massive log printing.
// Typical debugging method of an AI Agent
console.log("Step 1: entering function");
console.log("Step 2: data is", data);
console.log("Step 3: error found");
This approach is inefficient and requires human intervention to copy and paste error messages. Additionally, the limitations of context understanding are a pain point. Faced with legacy code accumulated over years, AI lacks “global thinking.” It is difficult for the AI to understand the business background of historical code, leading to potentially destructive changes during modifications.
Tool Experience: Instability and Interaction Barriers
Beyond the code itself, tool instability is a major reason users quit. Vibe Coding tasks usually have long execution times (30 seconds to 5 minutes). If the model returns an error or a tool call fails, the user’s time cost is high.
The homogenization of interaction interfaces is also an issue. Current tools uniformly use chat boxes, making it hard for users to distinguish between Chat, Deep Research, and Agent modes. Facing a universal input box, users often don’t know what prompt to enter to trigger the correct tool flow. This directly leads to low retention rates for Web Agent tools like Devin—users come with high expectations but leave disappointed because they don’t know how to drive the tool effectively.
Product Architecture Evolution: From All-in-One to Vertical Specialization
Faced with these challenges, we underwent a profound architectural reflection while building Vibe Coding tools.
Architectural Reflection: Why Did All-in-One Fail?
Initially, we tried to build a “universal” Agent. Its core architecture was an input box, bundling all MCP tools, knowledge bases, and Playbooks on the periphery. We hoped it could handle all scenarios like data processing, frontend/backend development, and code review.
The result was disastrous.
- Cost Explosion: To account for all possibilities, massive amounts of information were stuffed into the context. The token consumption for a single task reached tens of millions, costing hundreds of yuan per execution.
- Success Rate Plummet: The excessive context length distracted the model, degrading instruction adherence. Tasks easily fell into infinite loops or went off track.
- Poor Scenario Adaptation: Trying to solve everything with one logic resulted in performance in specific vertical domains (like frontend development) being worse than tools optimized specifically for them.
This made us realize that bigger is not better for Agents; specialization is.
Embracing Domestic Models: Challenges and Engineering Solutions
To control costs and address data compliance issues, we replaced all foreign SOTA models with domestic open-source models. This wasn’t just a simple “API endpoint replacement”; it was a battle of engineering.
Domestic models perform well in short-chain tasks but have obvious shortcomings in long-chain, complex logic Agent tasks:
- Infinite Loops: The Agent tends to jump back and forth at a certain step, unable to exit.
- Poor Format Adherence: It frequently generates unclosed XML tags, leading to parsing failures.
- Instruction Forgetting: As the context expands, the model tends to forget its initial instructions.
Addressing these model-level limitations, we didn’t sit and wait for model iterations; instead, we “patched” them through engineering means:
- Primary-Backup Switching and Retry Mechanisms: Designed real-time circuit breakers and switching logic for stability issues.
- Streaming Continuation: Implemented breakpoint continuation for output truncation issues to ensure the integrity of long-text generation.
- Infinite Loop Detection: Added logical judgment at the Agent execution engine layer. Once repeated execution of the same instruction exceeds a threshold, immediate forced intervention occurs.
- Format Fixer: A post-processing module automatically completes missing closing tags to fix format errors in model output.
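As an example of this engine-layer patching, infinite-loop detection can be as simple as counting repeated identical actions. The sketch below is our own illustration; the threshold and interface names are assumptions, not the Aone implementation:

```javascript
// Minimal sketch of infinite-loop detection at the Agent execution engine.
// The threshold value and method names are assumptions for illustration.
class LoopDetector {
  constructor(threshold = 3) {
    this.threshold = threshold;
    this.counts = new Map();
  }

  // Record one executed action; returns true when the same tool call with
  // the same arguments has repeated enough times to force intervention.
  record(toolName, args) {
    const key = `${toolName}:${JSON.stringify(args)}`;
    const n = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, n);
    return n >= this.threshold;
  }
}
```

When `record()` returns true, the engine can inject a corrective instruction or abort the step instead of letting the Agent spin until the context budget is exhausted.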
Solving User Experience Hurdles: Templates and “Agent as Tool”
Users are often at a loss facing a blank input box. To solve this, we introduced a Template Mechanism.
We abstracted high-frequency, successful task paths into “Templates,” solidifying Prompts, toolsets, and knowledge bases. For example, a “JDK Upgrade Template” automatically loads tools and documents related to upgrades. Data proves that after using templates, task completion rates rose to over 95%, and currently, 50% of user tasks are initiated through templates.
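A template can be as simple as a bundle of prompt, toolset, and knowledge references plus placeholder filling. The structure below is a hypothetical sketch; the field names are our assumptions, not the Aone schema:

```javascript
// Hypothetical template definition; field names are illustrative assumptions.
const jdkUpgradeTemplate = {
  name: "JDK Upgrade",
  prompt: "Upgrade this repository from JDK {from} to JDK {to}, then run the build.",
  tools: ["dependency-scanner", "build-runner"],
  knowledge: ["jdk-migration-guide"],
};

// Fill template placeholders from user-supplied parameters,
// leaving any unknown placeholder intact for the user to see.
function instantiate(template, params) {
  return template.prompt.replace(/\{(\w+)\}/g, (_, key) => params[key] ?? `{${key}}`);
}
```

The point of the mechanism is that the user only supplies the parameters; the prompt, tools, and knowledge base are pre-validated paths that are known to succeed.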
Going further, we adopted the concept proposed in Manus 1.5: The Agent itself is a tool.
We encapsulated the Agent responsible for deep research into a “Tool.” The main Agent only needs to call this tool to get research results. This “nesting doll” architecture significantly reduces the context pressure on the main Agent and makes the system more modular.
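In code terms, “Agent as Tool” just means exposing a sub-agent behind the same interface the main Agent already uses for ordinary tools. A minimal sketch, where the interfaces are our assumptions rather than the actual Aone API:

```javascript
// Wrap a research sub-agent as an ordinary tool.
// Interface and field names are assumptions for illustration.
function makeResearchTool(researchAgent) {
  return {
    name: "deep_research",
    description: "Runs the research sub-agent and returns only its final summary",
    run(query) {
      // The sub-agent consumes its own context internally; only the
      // distilled summary flows back into the main Agent's context window.
      return researchAgent.run(query).summary;
    },
  };
}
```

The main Agent sees one tool call and one short result, rather than the sub-agent’s full reasoning trace, which is where the context savings come from.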
Knowledge and Data Construction: Bridging the AI Cognitive Gap
Even the best Agent cannot function without good data.
Deep Structuring of Code Data
We not only built an Embedding database of code to support semantic retrieval but also introduced Repo Wiki to try to understand the structure of the entire codebase. More importantly, we included R&D behavior data into the knowledge system. Code is not just text; it is linked to CI builds, Code Review comments, and release monitoring. Associating code changes with requirements and bug tickets provides richer context for the Agent.
Redefining the Knowledge Base
Traditional document knowledge bases often suffer from outdated information, mixed text and images, and even contradictions. Feeding this directly to an Agent via RAG is akin to feeding it poison.
We established a Data Protocol Middle Layer for Agents. This layer cleans, verifies, and structures document information to ensure the knowledge fed to the AI is clean and accurate. At the same time, our product design encourages users to actively record their knowledge: much tacit knowledge exists only in developers’ minds, and only through active entry can the Agent get smarter with use.
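The kind of gate such a middle layer enforces can be sketched in a few lines. The staleness window and field names below are assumptions for illustration, not the actual protocol:

```javascript
// Hypothetical data-protocol check: reject stale or incomplete knowledge
// entries before they reach the Agent's RAG pipeline.
const MAX_AGE_MS = 180 * 24 * 60 * 60 * 1000; // assume ~6 months counts as stale

function validateEntry(entry, now = Date.now()) {
  if (!entry.title || !entry.body) return { ok: false, reason: "missing fields" };
  if (now - entry.updatedAt > MAX_AGE_MS) return { ok: false, reason: "stale" };
  return { ok: true };
}
```

Real cleaning also has to handle contradictions between documents, which needs either human arbitration or cross-checking against the code itself; a freshness-and-completeness gate is only the first filter.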
Practical Summary and Action Checklist
The future of Vibe Coding lies not in the infinite stacking of model parameters, but in how to build more reliable engineering architectures and more intuitive human-machine collaboration models.
One-Page Summary
Checklist for Building Vibe Coding Tools
- Avoid All-in-One Architecture: Don’t try to stuff all tools and knowledge into one Agent. Adopt vertical, template-based designs.
- Beware of the “Self-Consistency” Trap: Do not let AI handle both “writing code” and “writing tests” in a fully automated loop; human checkpoints are mandatory.
- Prioritize Debugging Experience: Current AI debugging capabilities are extremely weak. Tool designs must include features like log callback and state snapshots to assist in problem localization.
- Engineer Domestic Model Adaptation: For stability issues in long-chain tasks with domestic models, middleware like loop detection and automatic format repair is essential.
- Establish a Data Middle Layer: Do not directly RAG raw documents. Establish a cleaning protocol for Agents to remove outdated and contradictory information.
Frequently Asked Questions (FAQ)
Q1: What is the fundamental difference between Vibe Coding and traditional Copilots?
A1: Traditional Copilots mainly assist with completion; the human leads, and the AI is a sidekick. In Vibe Coding (especially Agent mode), users give commands, and the AI executes specific steps (like reading/writing files, executing commands), giving the AI much higher autonomy.
Q2: Why is it dangerous for AI to generate both code and unit tests simultaneously?
A2: Because LLMs have a tendency towards “logical self-consistency.” It might generate a piece of logically incorrect code and generate a test case specifically adapted to that incorrect logic, leading you to believe the code passed the test and is of good quality, masking the actual bug.
Q3: Why is Alibaba shifting away from some SOTA closed-source models to domestic models?
A3: Mainly due to cost, privacy compliance, and stability. Closed-source models are extremely expensive for long-chain complex tasks and pose data compliance risks. Domestic models perform well in short-chain tasks, and after engineering patches for long-chain shortcomings, the overall cost-performance ratio is better.
Q4: Why does debugging time increase when using Vibe Coding tools?
A4: Because AI-generated code is often a “black box”; users don’t check the logic line-by-line. Once a bug occurs, due to a lack of understanding of the internal logic and the AI’s difficulty in using traditional tools like breakpoints, developers often spend more time understanding the AI’s “intent” and tracing the error source.
Q5: How do you solve the problem of users “not knowing what to say” when facing an Agent input box?
A5: Introduce “Template” designs. Encapsulate high-frequency, successful task paths into templates. Users only need to select templates like “JDK Upgrade” or “Unit Test Generation,” and the system automatically loads the corresponding prompts and tools, guiding the user to input key parameters.
Q6: What are the main security risks of Vibe Coding tools?
A6: The main risks include AI-generated code containing security vulnerabilities like SQL injection or XSS, and Agents being hijacked by malicious instructions (e.g., planting backdoors in code or leaking sensitive information via network probing). It is recommended to run Agents in sandbox environments and perform mandatory security scans on generated code.
Q7: What is the “Agent as Tool” architectural concept?
A7: This is a modular design concept. Encapsulate an Agent responsible for a specific complex task (like deep research) as a “Tool” for the main Agent to call. This simplifies the main Agent’s logic, reduces context complexity, and improves overall stability.
Q8: What does Vibe Coding mean for non-professional developers?
A8: It means a lowered barrier to R&D. Roles like product managers, QA, and operations can complete simple R&D tasks like code analysis and data organization through natural language driving Web Agents, without needing to master complex programming syntax or development environment configurations.