The Beginning of the End of Human Penetration Testing

Over the years, I’ve avidly followed automated penetration testing and its evolution, trying to determine whether it had the maturity to replace human pen testers outright. Progress was glacial and iterative, with many flaws and downsides, but also significant potential. 

In the past year, with a flurry of releases of AI-based pen testing tools, both open-source and commercial, I’ve had the opportunity to test many of them in depth, in side-by-side comparisons with human pen testers. It’s safe to say that we are at the dawn of a new beginning for pen testing, or rather, the beginning of the end of human pen testing.

Let’s first take a step back. There’s nothing inherently wrong with human pen testers (quite the opposite, in fact), but human pen testing has many drawbacks.

The Drawbacks of Human Pen Testing

First, pen testing is slow and expensive. Many pen testing companies still charge consultancy-level day rates, and an engagement can run anywhere from 6 to 10 days, plus additional charges for report writing and any eventual re-test.

Secondly, the skill set is highly specialized, and there simply aren’t that many pen testers available today. If you commission a pen testing company to perform a penetration test, you can expect a lead time of up to six weeks. This scarcity is reflected in the price. 

Human pen testing is also non-deterministic and prone to human bias. If two pen testers target the same application at the same time, they may find some of the same flaws but will often end up with substantially different sets of findings.

This can also lead to “pen tester syndrome,” whereby an application with a good security posture doesn’t yield many findings, so low-level issues that would typically be ignored are “talked up” in the final report to give the client something to fix. 

Lastly, it’s a “snapshot-in-time” report that becomes out of date almost as soon as it’s completed, due to the time it takes to produce. If you’re operating in a continuous delivery model and updating your environment every day, the pen test is effectively out of date immediately. 

Automated pen testing remedied some of these flaws, but not all of them. Traditional tools struggled with web applications, pivoting, exploitation, and business logic. However, they were deterministic, repeatable, produced concise reports the same day, and could be scaled to huge environments - if you could afford them.

It’s also worth mentioning that crowdsourced security was designed to remedy many of the issues found in traditional pen testing, but it too has its own drawbacks. 

AI Pen Testing: Faster and More Accurate

AI pen testing takes this one step further. Today, AI pen testing can be defined as an agent-based solution running on a Large Language Model (LLM) within a prompt-based framework. The Cybersecurity AI Framework (CAI) is a solid open-source example of this. So how do they work? 

Sticking with CAI as an example (the commercial models I tested worked in an almost identical fashion), you essentially plug in your LLM, either an open-source, self-hosted version or a foundation model API such as OpenAI GPT-Cyber or Claude Mythos. You then set a target, provide the required credentials, and watch it run. Some tools also ingest source code for additional context and to help them zero in on trouble spots within the application.
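As a rough illustration of that workflow, the Python sketch below wires a model, a target, and credentials into a simple agent loop. It is not CAI's API or that of any commercial tool: every name in it (PenTestConfig, run_agent, fake_probe, the scripted stand-in for the LLM) is hypothetical, and the "tool" returns a canned response rather than touching a network.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PenTestConfig:
    """Hypothetical run settings; field names are illustrative only."""
    model: str                          # self-hosted model or hosted API model ID
    target: str                         # in-scope URL or host
    credentials: dict = field(default_factory=dict)
    source_path: Optional[str] = None   # optional source code for extra context


@dataclass
class Finding:
    title: str
    evidence: str


def run_agent(config, llm, tools, max_steps=10):
    """Ask the model for the next action, run it, and feed the result back."""
    history = [f"target={config.target}"]
    if config.source_path:
        history.append(f"source={config.source_path}")
    findings = []
    for _ in range(max_steps):
        action = llm(history)
        if action["tool"] == "stop":
            break
        if action["tool"] == "report":
            findings.append(Finding(action["title"], action["evidence"]))
            continue
        result = tools[action["tool"]](config, action["arg"])
        history.append(f"{action['tool']}({action['arg']}) -> {result}")
    return findings


def make_scripted_llm():
    """Stand-in for a real model: replays a fixed three-step plan."""
    step = 0

    def llm(history):
        nonlocal step
        step += 1
        if step == 1:
            return {"tool": "probe", "arg": "/login"}
        if step == 2:
            return {"tool": "report", "title": "Verbose error page",
                    "evidence": history[-1]}
        return {"tool": "stop"}

    return llm


def fake_probe(config, path):
    """Stand-in tool: returns a canned response, no network access."""
    return f"500 stack trace from {config.target}{path}"


config = PenTestConfig(model="local-model",
                       target="https://staging.example.test",
                       credentials={"user": "demo"})
findings = run_agent(config, make_scripted_llm(), {"probe": fake_probe})
for f in findings:
    print(f.title, "|", f.evidence)
```

In a real tool the scripted function would be replaced by calls to the chosen LLM, and the tool table would hold actual scanners and exploit helpers. The loop shape (plan, act, observe, report) is the part this sketch is meant to convey.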

The findings are astounding. When compared side by side with a human pen tester, these tools perform the pen test much faster and with identical or better accuracy.

With open-source models, results vary significantly depending on the LLM used: larger models with trillions of parameters perform better than smaller models with only a few billion, and models specialized in coding outperform more general-purpose ones. 

The issue of weaknesses in web applications also largely disappears. These tools are miles ahead of any deterministic web application scanner on the market today. The reports they generate are concise, readable, and include fully documented exploit chains with proof-of-concept, akin to a high-quality bug bounty report.

As a result, issues such as human bias and “pen tester syndrome” disappear entirely. These tools can be run almost continuously, on a daily basis, removing the “point-in-time” weakness and delays of traditional pen tests. It’s no coincidence that recent academic research backs this up, but are there any downsides?

The Downsides of AI Pen Testing

Yes, it’s not all roses. First, cost. AI is horrendously expensive to run, even when using open-source models that you self-host. You’ll need to invest in serious infrastructure to complete a pen test within a day. Commercial models are eye-wateringly expensive, and for many of them you could afford an entire in-house pen testing team for the price they charge. 

Secondly, these tools still struggle with mobile applications, largely because app emulation and realistic user interaction haven’t yet been perfected—but this will come. They also occasionally hallucinate, which is a well-known AI issue across all use cases, not just pen testing. 

Lastly, and unfortunately, it will take time to convince pen test clients that an AI-generated report is equivalent to one produced by a human pen tester. Compliance frameworks will take years to absorb this reality, so expect humans to remain part of the process for some time yet. 

That said, the writing is on the wall—not just for human pen testing, but for the entire DevSecOps pipeline: SAST, DAST, RAST, and traditional web application scanning tools will all disappear.

In their stead will be a single AI agent capable of contextually navigating all permutations of an application (both static and dynamic), running continuously, and producing perfectly readable reports. A leaner, faster pipeline with fewer siloed tools is something we should all welcome, if we can afford it.
