SWE-bench

SWE-bench is a benchmark that tests AI coding agents on real bugs and feature requests pulled from popular open-source GitHub projects.

Each task hands the agent a repository and an issue, and the agent must produce a code patch that passes tests from the project's own suite, which the agent never sees. It is one of the most widely cited scoreboards for measuring how well AI can fix and write software on its own.

Also known as: SWE-bench Verified, SWE-bench Lite

What this topic covers

Foundations — Start here to understand what SWE-bench actually measures: not whether a model writes plausible code, but whether its patch makes a real project's own test suite pass.
Implementation — These guides cover running the SWE-bench harness yourself: setting up the task environment, scoring an agent's patches against the test suite, and reading resolution rates without fooling yourself about what a high number really proves.
What's changing — SWE-bench scores climb fast, and the leaderboard reshuffles with every major model release.
Risks & limits — Before you trust a SWE-bench ranking, consider how it can mislead: training data can leak the test answers, some tasks are underspecified or broken, and optimizing models to win the benchmark can quietly diverge from real-world usefulness.
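
To make the scoring idea from the Implementation guides concrete, here is a minimal sketch of how a resolution rate can be computed from per-task test outcomes. It assumes the commonly described SWE-bench criterion: a task counts as resolved only when every test that was expected to go from failing to passing now passes, and every test that passed before the patch still passes. The `TaskResult` class, its field names, and the instance IDs are hypothetical and exist only for illustration; this is not the official harness's API.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Per-task test outcomes after applying the agent's patch (hypothetical shape)."""
    instance_id: str
    # Tests that failed before the fix and should pass after it: name -> passed?
    fail_to_pass: dict[str, bool] = field(default_factory=dict)
    # Tests that passed before the fix and must not regress: name -> passed?
    pass_to_pass: dict[str, bool] = field(default_factory=dict)


def is_resolved(result: TaskResult) -> bool:
    """A task is resolved only if the target tests pass and nothing regresses."""
    if not result.fail_to_pass:
        # With no target tests there is no evidence the issue was fixed.
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolution_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Illustrative data only: these instance IDs and test names are made up.
example_results = [
    TaskResult("demo__repo-1", {"test_fix": True}, {"test_old": True}),
    TaskResult("demo__repo-2", {"test_fix": True}, {"test_old": False}),  # regression
    TaskResult("demo__repo-3", {"test_fix": False}, {"test_old": True}),  # bug not fixed
    TaskResult("demo__repo-4", {"test_a": True, "test_b": True}, {}),
]

rate = resolution_rate(example_results)
print(f"resolved {rate:.1%} of {len(example_results)} tasks")
```

Note that the second example task fixes the reported bug but breaks an existing test, so it does not count as resolved. That all-or-nothing rule is one reason a single headline percentage can hide a lot of near misses.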

This topic is curated by our AI council.