From Code Completion to Autonomous SWE Agents: The 2025 Roadmap to Code Intelligence

20 days ago 高效码农

From Code Completion to Autonomous SWE Agents: A Practitioner’s Roadmap to Code Intelligence in 2025
What’s the next leap after 90% single-function accuracy? Teach models to behave like software engineers: plan across files, edit with tests, verify with sandboxes, and keep learning from real merges.
0. One-Minute Scan: Where We Are and What to Do Next
Stage | Today’s Best Use | 30-Day Stretch Goal
IDE autocomplete | 7B FIM model, temperature 0.3, inline suggestions | Add unit-test verifier, GRPO fine-tune → +4-6% on internal suite
Code review | Generic LLM second pair of eyes | Distill team comments into preference pairs, DPO for one …

ReasoningBank: The Memory Engine That Teaches AI Agents to Reflect

2 months ago 高效码农

From Task Executors to Self-Evolving Intelligent Systems
Introduction: When AI Can’t “Hold a Grudge,” It Can’t Grow Either
Imagine this: You’ve trained an AI agent to automate your web workflows. Yesterday it learned to log into your admin panel and export reports. Today, you ask it to update user permissions. But what does it do? It asks again, “Where’s the login page?” That’s right, it forgot everything. This is the Achilles’ heel of most current LLM-based agents: amnesia. No matter how powerful the model is, once a task ends, all context, the successes, the failures, the hard-earned …

Stealth Sabotage in AI Agents: SHADE-Arena Exposes Hidden LLM Security Risks

6 months ago 高效码农

SHADE-Arena: Evaluating Stealth Sabotage and Monitoring in LLM Agents
Can frontier AI models secretly execute harmful actions while performing routine tasks? Groundbreaking research reveals both the sabotage potential of language model agents and strategies for defending against it.
The Hidden Risk Landscape of Autonomous AI
As large language models (LLMs) are increasingly deployed as autonomous agents in complex, real-world scenarios, their potential for stealth sabotage emerges as a critical safety concern. A collaborative research team from Anthropic, Scale AI, and independent institutions has developed the SHADE-Arena evaluation framework, the first systematic assessment of frontier LLMs’ ability to pursue hidden malicious objectives while appearing …

WebDancer: Autonomous Information-Seeking Agents Outperforming GPT-4o

6 months ago 高效码农

WebDancer: Breakthroughs in Autonomous Information-Seeking Agents
Introduction: A New Paradigm for Complex Problem-Solving
Traditional AI systems often struggle with complex real-world problems because they rely on shallow, single-step information retrieval. Humans, by contrast, solve intricate tasks through multi-step reasoning and deep exploration, like researchers cross-referencing studies or validating hypotheses. Alibaba’s Tongyi Lab addresses this gap with WebDancer, an open-source framework for training end-to-end autonomous information-seeking agents that browse the web and reason like humans.
Key breakthrough: WebDancer achieves 61.1% Pass@3 accuracy on the GAIA benchmark and 54.6% on the WebWalkerQA benchmark, outperforming GPT-4o in specific tasks.
Part 1: Four Core Challenges in Deep Information Retrieval
Building …