The UK government has discovered and patched hundreds of vulnerabilities after running a series of internal hackathons using frontier AI models.
The weekly, in-person events were organized by the Government Cyber Coordination Centre (GC3) – an initiative from the National Cyber Security Centre (NCSC) and the Department for Science, Innovation and Technology (DSIT).
The idea was to use the models to scan public code repositories across nine government departments.
“Rather than mandate a single approach, we gave teams model access and let them build their own tooling, noticing what worked each week and building on the best approaches,” the GC3 said.
Participants logged 407 findings, including critical flaws such as authentication bypass, data exposure and remote code execution. Some were already known and mitigated by compensating controls, but others were zero days, according to the report, which was published on June 21.
All critical and high-risk weaknesses assessed as exploitable have been remediated, with no evidence of exploitation identified.
“AI models traced vulnerabilities across service boundaries, which traditional scanners can’t do, and linked business logic with technical detail. Departments prioritized validation and remediation through existing frameworks,” the report noted.
The various teams took different approaches. One created five new domain-specific Claude Skills to build a “reusable, scoped and consistent approach” across every open source repository and operator selected.
Another used traditional scanning tools like Gitleaks, Trivy, Semgrep and Hadolint to generate initial findings. The team then applied models to these findings to check them against OWASP and CWE frameworks, compose individual findings into attack paths and confirm viability through a triage stage.
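The scanner-first approach can be pictured as two steps: bring raw tool output into one record shape, then pass a deduplicated list to a model-backed triage step. What follows is a minimal sketch under stated assumptions, not the team's actual tooling. The `Finding` fields, the `normalize` and `dedupe` helpers and the sample inputs are all hypothetical, and each real scanner has its own report format that would need a dedicated parser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """Hypothetical common record for a finding from any scanner."""
    tool: str
    rule: str
    path: str
    line: int
    cwe: Optional[str] = None  # left empty here; a later stage could fill it in


def normalize(tool: str, raw: dict) -> Finding:
    # Assumes each raw finding has already been parsed into a dict with
    # these three keys; real tool reports are shaped differently.
    return Finding(tool=tool, rule=raw["rule"], path=raw["path"], line=raw["line"])


def dedupe(findings: list[Finding]) -> list[Finding]:
    # Collapse reports of the same file and line from different tools so a
    # later triage stage sees each location once. This is deliberately
    # crude: two distinct issues on one line would also be merged.
    seen: set[tuple[str, int]] = set()
    unique: list[Finding] = []
    for finding in findings:
        key = (finding.path, finding.line)
        if key not in seen:
            seen.add(key)
            unique.append(finding)
    return unique


# Illustrative inputs only, not taken from the report.
raw_results = [
    ("semgrep", {"rule": "hardcoded-secret", "path": "app/config.py", "line": 12}),
    ("gitleaks", {"rule": "generic-api-key", "path": "app/config.py", "line": 12}),
    ("hadolint", {"rule": "DL3007", "path": "Dockerfile", "line": 1}),
]
candidates = dedupe([normalize(tool, raw) for tool, raw in raw_results])
print(len(candidates))  # 2
```

The deduplicated `candidates` list is what a model would then check against OWASP and CWE categories and try to chain into attack paths.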
Another group built a six-stage agentic pipeline with each stage reading and challenging the last.
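The report does not name the six stages, so the sketch below uses just two hypothetical ones to show the general shape: each stage receives the candidate as the previous stage left it and can annotate or reject it. The `Candidate` record, the stage functions and the string check standing in for a model call are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Candidate:
    """Hypothetical record passed from stage to stage."""
    summary: str
    notes: list[str] = field(default_factory=list)
    rejected: bool = False


# A stage reads the candidate as the previous stage left it and either
# annotates it or rejects it.
Stage = Callable[[Candidate], Candidate]


def run_pipeline(candidate: Candidate, stages: list[Stage]) -> Candidate:
    for stage in stages:
        candidate = stage(candidate)
        if candidate.rejected:
            break  # later stages never see a finding that was refuted
    return candidate


def propose(c: Candidate) -> Candidate:
    c.notes.append("propose: possible authentication bypass")
    return c


def challenge(c: Candidate) -> Candidate:
    # Placeholder for a model call that tries to refute the prior claim.
    if "test fixture" in c.summary:
        c.rejected = True
        c.notes.append("challenge: only reachable from test code")
    else:
        c.notes.append("challenge: claim survives")
    return c


result = run_pipeline(Candidate("login handler skips token check"), [propose, challenge])
print(result.rejected, result.notes)
```

Stopping at the first rejection keeps later, more expensive stages from spending tokens on candidates that have already been refuted.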
Frontier Models Deliver Strong Performance
The GC3 said it learned some important lessons through the hackathon initiative:
- The strongest results came from using frontier models as “tightly scoped components inside a structured pipeline” – with traditional vulnerability management workflows broken down into discrete, task-specific harnesses
- With the right architecture and task design, many near-frontier and frontier models are similarly good at scanning code. Human expertise remains the differentiator: it is needed to break problems down and identify wider context
- Triage is vital because agents generate candidate findings faster than humans can validate them. Careful upfront scoping and “structured internal filtering” improve focus and reduce costs. The whole project cost the government just £13,000 ($17,467) in tokens
- The next big job will be to integrate prioritization, review and patch-generation without “overwhelming human-centred processes”
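The triage lesson above can be sketched as a filter that discards weak candidates, ranks the rest and hands reviewers only as many as they can validate. This is an illustrative sketch, not the GC3's method: the severity scale, the confidence score, the thresholds and the sample findings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    title: str
    severity: int      # hypothetical scale: 1 (low) to 4 (critical)
    confidence: float  # hypothetical model-reported score between 0 and 1


def triage(candidates, min_confidence=0.6, review_capacity=2):
    # Drop low-confidence candidates, rank the rest by severity and then
    # confidence, and split them into what reviewers take now and a backlog.
    kept = [c for c in candidates if c.confidence >= min_confidence]
    kept.sort(key=lambda c: (c.severity, c.confidence), reverse=True)
    return kept[:review_capacity], kept[review_capacity:]


# Illustrative inputs only, not taken from the report.
items = [
    Candidate("Auth bypass in admin API", 4, 0.9),
    Candidate("Verbose error page", 1, 0.8),
    Candidate("Possible SSRF", 3, 0.4),
    Candidate("Exposed storage listing", 3, 0.7),
]
queue, backlog = triage(items)
print([c.title for c in queue])
```

Capping the review queue is one way to keep agent output from outpacing the people who have to validate it.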
However, it’s unclear what impact a new US government export ban on Anthropic’s Mythos and Fable models will have on the government’s hackathon initiatives.
The ban, which was brought in late on Friday, locks out all non-American users from the firm’s most powerful models.
