SWE-bench Verified performance reaches 50.8% without agentic scaffolding

Recent advances in LM agents show promising potential for automating real-world tasks. These agents typically operate by proposing and executing actions through APIs, supporting applications such as software engineering, robotics, and scientific experimentation. As these tasks grow more complex, LM agent frameworks have evolved to include multiple agents, multi-step retrieval, and tailored scaffolding to optimize performance. A core challenge lies in effectively exploring and understanding the environment, which has driven the engineering of scaffolding built on tools, memory mechanisms, and custom pipelines. However, most existing methods assume partial observability, requiring agents to gather observations incrementally. While this assumption holds in dynamic or unfamiliar environments, it does not apply in fully observable settings such as SWE-bench, where all relevant information is accessible from the start.
In software engineering, research on LM agents has focused on two main strategies: agent-based frameworks and structured pipelines. Agent-based systems such as SWE-agent and OpenHands CodeAct allow LMs to interact with codebases autonomously, typically through custom interfaces and retrieval tools. Other systems, such as Moatless and AutoCodeRover, enhance localization through search techniques, while SpecRover refines scaffolding design. Alternatively, structured pipelines such as Agentless and CodeMonkey decompose tasks into sequential phases of localization, repair, and verification. While these approaches depend on engineered components for performance, the current study proposes using long-context LMs (LCLMs) to interpret the entire task environment directly. Advances in LCLM architecture and infrastructure now allow these models to outperform retrieval-augmented systems in many cases, reducing dependence on complex external scaffolding.
Researchers from Stanford, IBM, and the University of Toronto investigate whether LM agents solving tasks such as SWE-bench actually need complex scaffolding. They show that competitive performance can be achieved with properly prompted LCLMs alone, with no scaffolding: Gemini-1.5-Pro reaches 38% on SWE-bench Verified, and Gemini-2.5-Pro reaches 50.8% using the same simple setup. Their work suggests that many complex agentic designs can be replaced with a single powerful LCLM, simplifying both architecture and training. In addition, a hybrid two-stage method combining Gemini-1.5-Pro and Claude-3.7-Sonnet achieves a 48.6% solve rate, further supporting this simplified direction.
Traditional LM agents rely on interactive exploration because of partial observability, but many tasks, such as software debugging, allow full observability. This study proposes state-in-context agents that leverage LCLMs to process the full or compressed environment state directly, bypassing the need for complex agentic scaffolding. For large codebases, relevant files are selected via ranking-based compression to fit within context limits. Two methods are introduced: DirectSolve, in which an LCLM solves the task using the full context, and SelectSolve, in which an LCLM localizes the relevant files and a short-context LM (SCLM) solves the task. Both use targeted patch formats and validation to ensure accuracy and reduce hallucinations.
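To make the two recipes concrete, here is a minimal Python sketch under stated assumptions: `lclm_complete` and `sclm_complete` are hypothetical stand-ins for long- and short-context model calls, and the word-overlap ranking is a toy substitute for the study's ranking-based compression, not the authors' implementation.

```python
def rank_files(issue: str, files: dict[str, str]) -> list[str]:
    """Toy relevance ranking: prefer files sharing tokens with the issue text."""
    issue_tokens = set(issue.lower().split())

    def score(item):
        _path, text = item
        return sum(tok in issue_tokens for tok in set(text.lower().split()))

    return [path for path, _ in sorted(files.items(), key=score, reverse=True)]

def build_context(issue, files, ordered_paths, budget_chars):
    """Pack the most relevant files first, truncating at the context budget."""
    parts = [f"# Issue\n{issue}"]
    used = len(parts[0])
    for path in ordered_paths:
        blob = f"\n# File: {path}\n{files[path]}"
        if used + len(blob) > budget_chars:
            break
        parts.append(blob)
        used += len(blob)
    return "".join(parts)

def direct_solve(issue, files, lclm_complete, budget_chars=2_000_000):
    """DirectSolve: one LCLM sees the (compressed) repo state and emits a patch."""
    ordered = rank_files(issue, files)
    prompt = build_context(issue, files, ordered, budget_chars)
    prompt += "\n\nThink step by step, then output the fix as a unified diff."
    return lclm_complete(prompt)

def select_solve(issue, files, lclm_complete, sclm_complete, k=5):
    """SelectSolve: the LCLM localizes relevant files; an SCLM writes the patch."""
    ordered = rank_files(issue, files)
    loc_prompt = build_context(issue, files, ordered, budget_chars=2_000_000)
    loc_prompt += f"\n\nList the {k} file paths most relevant to this issue, one per line."
    chosen = [ln.strip() for ln in lclm_complete(loc_prompt).splitlines()
              if ln.strip() in files][:k]
    fix_prompt = build_context(issue, {p: files[p] for p in chosen},
                               chosen, budget_chars=200_000)
    fix_prompt += "\n\nThink step by step, then output the fix as a unified diff."
    return sclm_complete(fix_prompt)
```

The key design difference is where the expensive long-context call happens: DirectSolve spends it on solving directly, while SelectSolve spends it only on localization and hands a much smaller context to a stronger short-context patcher.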
The experiments evaluate this simplified agent framework on SWE-bench Verified, a benchmark of 500 real-world software engineering tasks. The proposed methods, DirectSolve and SelectSolve, use LCLMs such as Gemini-1.5-Pro and Gemini-2.5-Pro; in SelectSolve, an additional SCLM (Claude-3.7-Sonnet) generates the patch. Results show that DirectSolve outperforms complex agentic methods such as Agentless and CodeAct with minimal engineering. SelectSolve further improves accuracy by delegating patch generation to a stronger model. Ablation studies highlight the importance of chain-of-thought (CoT) prompting, code restatement, and token-efficient context design. In addition, placing relevant files at the start of the prompt improves performance, underscoring limitations in long-context processing.
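As a rough illustration of those prompt-design findings, the skeleton below puts the most relevant files first and asks for code restatement before a step-by-step fix; the wording is an assumption for illustration, not the authors' verbatim template.

```python
# Hypothetical prompt skeleton reflecting the three ablation findings:
# relevant files first, code restatement, and chain-of-thought reasoning.
def make_prompt(issue: str, files_most_relevant_first: list[tuple[str, str]]) -> str:
    body = "\n\n".join(f"# File: {path}\n{src}"
                       for path, src in files_most_relevant_first)
    return (
        f"Issue:\n{issue}\n\n"
        f"Repository snapshot (most relevant files first):\n{body}\n\n"
        "Before editing, restate the exact lines you plan to change.\n"
        "Then reason step by step about the root cause.\n"
        "Finally, output the fix as a unified diff."
    )
```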
In terms of cost, LCLM-based methods are currently more expensive than existing approaches such as Agentless and CodeAct, averaging $2.60 per instance compared with $0.25 and $0.87, respectively. However, rapidly falling inference costs and growing context lengths are making LCLMs more practical. After the initial run, techniques such as KV caching significantly reduce the cost, bringing it down to about $0.725 per instance. Although small codebase changes still limit the benefits of caching, further improvements could help. The study also shows that LCLMs can handle long interaction histories, reducing the need for complex memory and retrieval mechanisms. Notably, even LCLMs that have not been specialized for agentic software engineering can achieve competitive performance on SWE-bench tasks.
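The arithmetic behind the caching savings can be sketched as follows; the token prices and context size are purely illustrative assumptions, while the $2.60 and $0.725 figures come from the study itself.

```python
# Back-of-the-envelope model of how KV caching cuts per-instance cost.
# All prices and token counts below are illustrative assumptions, not the
# paper's accounting; the paper reports $2.60 uncached vs ~$0.725 cached.
PRICE_PER_M_FRESH = 1.25    # assumed $ per 1M freshly processed input tokens
PRICE_PER_M_CACHED = 0.31   # assumed $ per 1M cache-hit tokens (~4x cheaper)

def instance_cost(context_tokens: int, cached_fraction: float) -> float:
    fresh = context_tokens * (1.0 - cached_fraction)
    cached = context_tokens * cached_fraction
    return (fresh * PRICE_PER_M_FRESH + cached * PRICE_PER_M_CACHED) / 1e6

CONTEXT = 2_000_000  # assumed ~2M-token repository context
print(f"no cache:   ${instance_cost(CONTEXT, 0.0):.2f}")  # -> $2.50
print(f"90% cached: ${instance_cost(CONTEXT, 0.9):.2f}")  # -> $0.81
```

The point the numbers make is structural: because most of the per-instance cost is re-reading an almost unchanged repository, any mechanism that lets cached context be billed at a discount collapses the cost gap with pipeline methods.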
Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.