Mythos Outperforms GPT5.5 on Google Chrome Vulnerability Exploits

Anthropic’s Claude Mythos outperformed OpenAI’s GPT‑5.5 at exploiting real‑world Google Chrome vulnerabilities, according to a new benchmark designed to test how well frontier AI models can exploit real‑world vulnerabilities.

During Infosecurity Europe 2026, Bugcrowd presented the first findings of ExploitBench, an independent, graded benchmark launched in May 2026 by the cybersecurity firm in collaboration with experts at Carnegie Mellon University and top Chrome vulnerability researchers.

David Brumley, chief AI & science officer at Bugcrowd, described the benchmark as “the first independent benchmark that measures what AI models can actually do with a vulnerability, not just identify it but exploit it step by step.” Anthropic was among the first to engage with it.

He said that in the first tests Mythos achieved markedly higher exploitation performance than GPT‑5.5 in head‑to‑head runs, underlining how AI models are closing the gap with elite human researchers.

Unlike earlier binary tests, ExploitBench scores progress through staged exploitation outcomes rather than merely recording a crash. The benchmark evaluates five tiers of capability, up to arbitrary code execution, against a vulnerable build of V8, the JavaScript/WebAssembly engine that powers Google Chrome, Microsoft Edge, Node.js and Cloudflare Workers.

In the runs discussed at the show, Anthropic’s Mythos, with occasional human hints or “nudges,” posted an average score of 9.90 out of 16 and reached the highest tier on 21 of 41 vulnerabilities. OpenAI’s GPT‑5.5 scored 5.51 on average and reached the top tier on just two cases.

“For example, Mythos is able to exploit a one-day vulnerability in Chrome about 50% of the time. This is lead-tier activity. If we were to put money on it, Google could reward up to $10,000 for such a vulnerability that has no previously known exploit,” Brumley said.

“Anthropic’s model is churning these out and actually found solutions for exploiting the flaws that even top-tier hackers missed – that’s kind of impressive.”
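The reported figures can be put on a common scale with some simple arithmetic. The short Python sketch below uses only the numbers quoted in the article; the names `reported` and `summarize` are illustrative, and it assumes GPT‑5.5 was run on the same 41 vulnerabilities as Mythos, which the article does not state explicitly.

```python
# Figures as reported in the article. Applying the 41-case total to
# GPT-5.5 is an assumption; the article only gives it for Mythos.
MAX_SCORE = 16
TOTAL_CASES = 41

reported = {
    "Mythos": {"avg_score": 9.90, "top_tier_cases": 21},
    "GPT-5.5": {"avg_score": 5.51, "top_tier_cases": 2},
}


def summarize(avg_score, top_tier_cases):
    """Return (share of the maximum score, share of cases reaching the top tier)."""
    return avg_score / MAX_SCORE, top_tier_cases / TOTAL_CASES


for model, figures in reported.items():
    score_share, top_rate = summarize(**figures)
    print(f"{model}: {score_share:.1%} of max score, top tier on {top_rate:.1%} of cases")
```

On these numbers, Mythos reaches the top tier on roughly 51% of cases, which appears to line up with the “about 50% of the time” figure Brumley cited, against roughly 5% for GPT‑5.5.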

Brumley added that, while GPT‑5.5’s performance is currently a little lower than its counterpart’s, the broader availability of OpenAI’s model means more people could use it to develop exploits.

AI Models Edge Closer to Reliable Exploitation, But Experts Urge Caution

Frontier large language models (LLMs) have already shown they can accelerate vulnerability discovery at scale, but whether those discoveries could be chained into reliable, actionable exploits had remained an open question until ExploitBench.

“We measure not just crash or no crash but stages of exploitation,” Brumley told Infosecurity, explaining why the new benchmark matters for assessing real exploitation capability rather than superficial signals.

That distinction is critical because models that can reliably exploit zero‑day flaws lower the barrier for threat actors to weaponize vulnerabilities.

Bugcrowd CEO Dave Gerry further warned that automation and AI are already being integrated into attacker workflows, increasing the pace at which discovered flaws can be turned into active exploits.

Nonetheless, while ExploitBench is one of the first experiments to show how AI can be used to exploit vulnerabilities, Brumley cautioned that his team’s first findings reflect only one specific type of vulnerability and that the results should not be extrapolated.

“I don’t want to oversell anything here. We measured a very sophisticated target application. Chrome is made of hundreds of thousands of lines of codes, it’s been audited for years. We know how valuable finding an exploit there is. It doesn’t necessarily mean we would get the same results trying to exploit a vulnerability in a web application.”

Speaking to Infosecurity, Michael Price, VP of product engineering at VulnCheck, said that while AI models are improving, they are not yet fully capable of reliably carrying out exploitation at scale.

Citing a recent report on the capabilities of Mythos by the UK AI Security Institute, Price explained that the most significant advance has been in the models' planning ability – their capacity to produce step‑by‑step plans, replan as needed, and execute multi‑stage actions – which by definition makes them more useful for offensive campaigns.

He noted that this improvement increases offensive potential but tempered that with caution. “They’re getting better, but they still are not actually like that great,” he said.

“I would expect them like every month or every quarter to get 1% better and probably over the course of two or four years they get really good,” Price added.

Developing AI-Driven Remediation At Scale

Both Brumley and Gerry emphasized that ExploitBench was released alongside Bugcrowd’s reinforcement learning (RL) environments to both measure and improve model capability.

“We put out ExploitBench to motivate the state of where models are at on actual exploitation tasks,” Brumley explained.

Gerry added that the benchmark and the training environments are complementary: one drives measurement and the other drives improvement through targeted RL training with industry model partners.

Finally, the company leaders urged defenders to match offensive speed with automated remediation and prioritization.

Gerry told Infosecurity that the shrinking “zero‑day clock” and the surge in AI‑assisted discovery mean organizations must develop AI‑driven remediation at scale.

He said remediation pipelines must be rethought so fixes move from ticket queues into near‑real‑time workflows, and that “finding more bugs faster only amplifies the noise unless you can automatically prioritize and act on the ones that actually enable exploits.”

Brumley echoed that urgency, saying defenders need contextual intelligence to prioritize and remediate the vulnerabilities that matter most before adversaries can exploit them.

This, he added, requires models trained not just to find flaws but to recommend and, where safe, initiate fixes at scale so human developers can focus on the highest‑risk work.

“Over the coming months, we will have announcements on that, with tools focusing on helping give people intelligence about how certain vulnerabilities are affecting them,” he said.

Kevin Poireault
