AWS introduces SWE-PolyBench: a new open-source multilingual benchmark for evaluating AI coding agents

Recent advances in large language models (LLMs) have enabled the development of AI-based coding agents that can generate, modify, and understand software code. However, evaluation of these systems remains limited, often confined to synthetic or narrow benchmarks, mostly in Python. These benchmarks rarely reflect the structural and semantic diversity of real-world codebases, so many agents overfit to benchmark-specific patterns rather than demonstrating transferable capability.
AWS introduces SWE-PolyBench: A more comprehensive evaluation framework
To address these challenges, AWS AI Labs has introduced SWE-PolyBench, a multilingual, repository-level benchmark designed for execution-based evaluation of AI coding agents. The benchmark spans 21 GitHub repositories across four widely used programming languages (Java, JavaScript, TypeScript, and Python) and comprises 2,110 tasks, including bug fixes, feature implementations, and code refactorings.
Unlike previous benchmarks, SWE-PolyBench is built from real-world pull requests (PRs) that close actual issues and include the relevant test cases, enabling verifiable evaluation. A smaller, stratified subset, SWE-PolyBench500, has also been released to support faster experimentation while preserving task and language diversity.
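For readers who want to inspect the tasks directly, the benchmark data is published on Hugging Face. Below is a minimal sketch of browsing it with the `datasets` library; the dataset ID, split name, and field names are assumptions based on the release pages, not verified details.

```python
# Sketch: browsing SWE-PolyBench tasks from Hugging Face.
# The dataset ID, split, and field names below are assumptions,
# not verified against the actual release.
from datasets import load_dataset

ds = load_dataset("AmazonScience/SWE-PolyBench", split="test")

for task in ds.select(range(3)):
    # Each task pairs a repository snapshot with a GitHub issue.
    print(task["instance_id"], task["repo"])
```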
Technical structure and evaluation metrics
SWE-PolyBench adopts an execution-based evaluation pipeline. Each task includes a repository snapshot and a problem statement derived from a GitHub issue. The system applies the associated ground-truth patch inside a containerized test environment configured for the respective language ecosystem (e.g., Maven for Java, npm for JavaScript/TypeScript). Outcomes are then measured with two types of unit tests: fail-to-pass (F2P) and pass-to-pass (P2P).
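To make the F2P/P2P distinction concrete, here is a rough sketch of the resolution check, not the benchmark's actual harness: a task counts as resolved only when the candidate patch makes the issue's failing tests pass without breaking anything that already worked.

```python
# Illustrative sketch of an execution-based check (not the actual
# SWE-PolyBench harness). F2P tests fail before the fix and encode
# the issue; P2P tests pass before the fix and guard against regressions.
def is_resolved(run_test, f2p_tests, p2p_tests) -> bool:
    """`run_test(test_id)` returns True if that test passes in the
    patched, containerized environment."""
    fixed = all(run_test(t) for t in f2p_tests)           # issue is fixed
    no_regression = all(run_test(t) for t in p2p_tests)   # nothing broke
    return fixed and no_regression
```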
To provide a finer-grained assessment, SWE-PolyBench introduces Concrete Syntax Tree (CST)-based metrics. These include file-level and node-level retrieval scores, which evaluate an agent's ability to locate and modify the relevant parts of the codebase. They offer insight beyond binary pass/fail outcomes, especially for complex, multi-file modifications.
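As a rough illustration of what such a retrieval score measures, the snippet below compares the files (or CST nodes) an agent touched against those changed by the ground-truth patch, using the standard precision/recall shape. SWE-PolyBench's exact formulas may differ; this is only a sketch.

```python
# Sketch of a retrieval score over files or CST nodes (e.g., the
# classes and functions an agent modified). Illustrative only.
def retrieval_scores(predicted: set[str], ground_truth: set[str]):
    hits = predicted & ground_truth
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Hypothetical example: agent edited two files, ground truth changed two.
p, r = retrieval_scores({"src/app.ts", "src/util.ts"},
                        {"src/app.ts", "src/model.ts"})
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.50, recall=0.50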
Empirical evaluation and observations
Three open-source coding agents, Aider, SWE-Agent, and Agentless, were adapted to SWE-PolyBench. All use Anthropic's Claude 3.5 as the base model and were modified to handle the benchmark's multilingual, repository-level requirements.
The evaluation reveals large performance differences across languages and task types. For example, agents perform best on Python tasks (pass rates up to 24.1%) but struggle on TypeScript (as low as 4.7%). Java, despite higher complexity in terms of average node changes, sees higher success rates than TypeScript, suggesting that pretraining exposure and syntax familiarity play a crucial role in model performance.

Performance also varies with task complexity. Tasks limited to single-function or single-class changes yield higher success rates (up to 40%), while those requiring mixed or multi-file changes drop significantly. Interestingly, high retrieval precision and recall (especially for file and CST node identification) do not always translate into higher pass rates, suggesting that code localization is necessary but not sufficient for problem resolution.

Conclusion: A robust evaluation framework for AI coding agents
SWE-PolyBench provides a robust and nuanced evaluation framework for coding agents, addressing key limitations of existing benchmarks. By supporting multiple programming languages, covering a wider range of task types, and incorporating syntax-aware metrics, it offers a more representative assessment of agents' real-world applicability.
The benchmark shows that, despite promising capabilities, AI agents still perform inconsistently across languages and tasks. SWE-PolyBench lays a foundation for future research aimed at improving the generality, robustness, and reasoning capabilities of AI coding assistants.
Check out the AWS DevOps Blog, the Hugging Face page for SWE-PolyBench, and the SWE-PolyBench GitHub repository.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform known for in-depth coverage of machine learning and deep learning news that is both technically sound and accessible to a wide audience. The platform draws over 2 million views per month, reflecting its popularity among readers.
