Anthropic’s Claude Mythos outperformed OpenAI’s GPT‑5.5 at exploiting real‑world Google Chrome vulnerabilities, according to a new benchmark designed to test how well frontier AI models can exploit real‑world flaws.
During Infosecurity Europe 2026, Bugcrowd presented the first findings of ExploitBench, an independent, graded benchmark launched in May 2026 by the cybersecurity firm in collaboration with experts at Carnegie Mellon University and top Chrome vulnerability researchers.
David Brumley, chief AI & science officer at Bugcrowd, described the benchmark as “the first independent benchmark that measures what AI models can actually do with a vulnerability, not just identify it but exploit it step by step.” Anthropic was among the first to engage with it.
He said the first test resulted in Mythos achieving a markedly higher exploitation performance than GPT‑5.5 in head‑to‑head runs, underlining how AI models are closing the gap with elite human researchers.
Unlike earlier binary tests, ExploitBench scores progress through staged exploitation outcomes rather than merely recording a crash. The benchmark evaluates five tiers of capability up to arbitrary code execution against a vulnerable V8 build, the JavaScript/WebAssembly engine that powers Google Chrome, Microsoft Edge, Node.js and Cloudflare Workers.
In the runs discussed at the show, Anthropic’s Mythos, with occasional human hints or “nudges,” posted an average score of 9.90 out of 16 and reached the highest tier on 21 of 41 vulnerabilities. OpenAI’s GPT‑5.5 scored 5.51 on average and reached the top tier on just two cases.
“For example, Mythos is able to exploit a one-day vulnerability in Chrome about 50% of the time. This is lead-tier activity. If we were to put money on it, Google could reward up to $10,000 for such a vulnerability that has no previously known exploit,” Brumley said.
“Anthropic’s model is churning these out and actually found solutions for exploiting the flaws that even top-tier hackers missed – that’s kind of impressive.”
Brumley added that, while GPT‑5.5’s performance is currently somewhat lower than that of its counterpart, the broader availability of OpenAI’s model opens opportunities for more people to use it to develop exploits.
AI Models Edge Closer to Reliable Exploitation, But Experts Urge Caution
Frontier large language models (LLMs) have already shown they can accelerate vulnerability discovery at scale, but whether those discoveries could be chained into reliable, actionable exploits had remained an open question until ExploitBench.
“We measure not just crash or no crash but stages of exploitation,” Brumley told Infosecurity, explaining why the new benchmark matters for assessing real exploitation capability rather than superficial signals.
That distinction is critical because models that can reliably exploit zero‑day flaws lower the barrier for threat actors to weaponize vulnerabilities.
Bugcrowd CEO Dave Gerry further warned that automation and AI are already being integrated into attacker workflows, increasing the pace at which discovered flaws can be turned into active exploits.
Nonetheless, while ExploitBench is one of the first experiments to show how AI can be used to exploit vulnerabilities, Brumley cautioned that his team’s first findings reflect only one specific type of vulnerability and that the results should not be extrapolated.
“I don’t want to oversell anything here. We measured a very sophisticated target application. Chrome is made of hundreds of thousands of lines of codes, it’s been audited for years. We know how valuable finding an exploit there is. It doesn’t necessarily mean we would get the same results trying to exploit a vulnerability in a web application.”
Speaking to Infosecurity, Michael Price, VP of product engineering at VulnCheck, said that while AI models are improving, they are not yet fully capable of reliably carrying out exploitation at scale.
Citing a recent report on the capabilities of Mythos by the UK AI Security Institute, Price explained that the most significant advance has been in the models' planning ability – their capacity to produce step‑by‑step plans, replan as needed, and execute multi‑stage actions – which by definition makes them more useful for offensive campaigns.
He noted that this improvement increases offensive potential but tempered that with caution. “They’re getting better, but they still are not actually like that great,” he said.
“I would expect them like every month or every quarter to get 1% better and probably over the course of two or four years they get really good,” Price added.
Developing AI-Driven Remediation At Scale
Both Brumley and Gerry emphasized that ExploitBench was released alongside Bugcrowd’s reinforcement learning (RL) environments to both measure and improve model capability.
“We put out ExploitBench to motivate the state of where models are at on actual exploitation tasks,” Brumley explained.
Gerry added that the benchmark and the training environments are complementary: one drives measurement and the other drives improvement through targeted RL training with industry model partners.
Finally, the company leaders urged defenders to match offensive speed with automated remediation and prioritization.
Gerry told Infosecurity that the shrinking “zero‑day clock” and the surge in AI‑assisted discovery mean organizations must develop AI‑driven remediation at scale.
He said remediation pipelines must be rethought so fixes move from ticket queues into near‑real‑time workflows, and that “finding more bugs faster only amplifies the noise unless you can automatically prioritize and act on the ones that actually enable exploits.”
Brumley echoed that urgency, saying defenders need contextual intelligence to prioritize and remediate the vulnerabilities that matter most before adversaries can exploit them.
This, he added, requires models trained not just to find flaws but to recommend and, where safe, initiate fixes at scale so human developers can focus on the highest‑risk work.
“Over the coming months, we will have announcements on that, with tools focusing on helping give people intelligence about how certain vulnerabilities are affecting them,” he said.
