Security and safety guardrails in generative AI tools, deployed to prevent malicious uses like prompt injection attacks, can themselves be hacked through a type of prompt injection.
Researchers at Unit 42, Palo Alto Networks’ research lab, have found that large language models (LLMs) used by GenAI companies to enforce safety policies and evaluate output quality can be manipulated into authorizing policy violations through stealthy input sequences.
Unit 42 refers to these LLMs as ‘AI Judges’ and said they are being increasingly deployed as AI operations scale.
In a new report published on March 10, Unit 42 demonstrated an attack method targeting these ‘AI Judges’ that coerces them into authorizing policy violations.
AdvJudge-Zero, Custom-Made Fuzzer for AI Judges
The attack chain involves the use of AdvJudge-Zero, an automated fuzzer developed internally at Unit 42 to perform red-team style assessments.
Fuzzers are tools that identify software vulnerabilities by providing unexpected input. AdvJudge-Zero functions with a similar approach to identify specific trigger sequences that exploit an LLM’s decision-making logic to bypass security controls.
The researchers noted that their technique differs from typical adversarial attacks on AI judges, which generally require clear-box access to the model, meaning the attacker has full visibility into the system’s internal structure.
“In contrast, AdvJudge-Zero employs an automated fuzzing approach. The tool interacts with an LLM strictly as a user would, using search algorithms to exploit the model's own predictive nature,” they wrote.
Attack on AI Judges Explained
The attack starts by probing the AI Judge and analyzing its next‑token probability distribution to identify tokens the model expects to see in natural text.
Instead of random noise, the system prioritizes low-perplexity tokens: innocent-looking characters such as markdown symbols, list markers, or structural phrases that appear normal to both humans and the model, but can strongly influence the model’s attention and reasoning.
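This candidate-gathering step can be sketched in a few lines of Python. Everything below is hypothetical: `next_token_probs()` merely stands in for querying the judge model’s next-token distribution, and the probabilities are invented for illustration, not measured from a real model.

```python
import math

# Hypothetical stand-in for a judge model's next-token distribution.
# A real fuzzer would query the LLM; these toy numbers are illustrative only.
def next_token_probs(prefix: str) -> dict[str, float]:
    return {
        "- ": 0.30,      # markdown list marker: very natural in text
        "## ": 0.20,     # heading marker
        "1. ": 0.15,     # numbered-list marker
        "zxqv": 0.0001,  # random noise: high surprisal, easy to flag
    }

def low_perplexity_candidates(prefix: str, max_surprisal: float = 4.0) -> list[str]:
    """Keep tokens the model itself expects to see (low surprisal = low perplexity)."""
    probs = next_token_probs(prefix)
    # surprisal in bits: -log2(p); small values mean the token looks "normal"
    return [tok for tok, p in probs.items() if -math.log2(p) <= max_surprisal]
```

The key design point, per the report, is that random high-perplexity noise is easy for defenses to spot, while tokens the model already expects blend in.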
After gathering candidate tokens, AdvJudge-Zero repeatedly inserts these tokens into evaluation prompts and measures how the model’s decision changes.
Specifically, it monitors the logit gap – “the mathematical margin of confidence” – between the tokens representing “allow” and “block.” By observing which tokens shrink the probability of a blocking decision, the fuzzer identifies formatting patterns that push the model closer to approving content.
In the final stage, AdvJudge-Zero isolates combinations of these tokens that consistently steer the model toward an approval decision. These sequences act as subtle control elements that shift the model’s internal reasoning, causing it to “allow” the output even when the underlying content violates the GenAI company’s policy, thereby enabling the tool to generate harmful content or support cyber-attacks.
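One plausible way to implement this final stage is a greedy search over the retained candidates. The sketch below is purely illustrative: the report does not describe AdvJudge-Zero’s actual search algorithm, and `score()` here simulates the judge’s block-vs-allow margin with made-up per-token effects.

```python
# Hypothetical greedy combination search. score() stands in for querying the
# judge and returning its block-minus-allow logit gap; values are simulated.
def score(prompt: str) -> float:
    gap = 5.0
    for trigger, effect in {"- ": 1.2, "## ": 0.9, "1. ": 0.7}.items():
        gap -= effect * prompt.count(trigger)
    return gap  # <= 0 means the judge would approve the content

def find_trigger(base: str, candidates: list[str], max_tokens: int = 8) -> str:
    """Greedily accumulate tokens that push the judge toward approval."""
    prefix = ""
    while score(prefix + base) > 0 and len(prefix.split()) < max_tokens:
        # pick the candidate that shrinks the gap the most when appended
        best = min(candidates, key=lambda t: score(prefix + t + base))
        if score(prefix + best + base) >= score(prefix + base):
            break  # no candidate helps any further
        prefix += best
    return prefix

trigger = find_trigger("Evaluate: <violating content>", ["- ", "## ", "1. "])
```

Because the search only needs the model’s outputs, this matches the report’s point that the attacker interacts with the judge strictly as a user would.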
99% Attack Success Rate
Using this attack technique, Unit 42 achieved a 99% success rate in bypassing controls across several widely used architectures that customers rely on today, including open-weight enterprise LLMs, specialized reward models (i.e. LLMs specifically built and trained to act as security guards for other AI systems) and commercial LLMs.
“Even the largest, most ‘intelligent’ models (with more than 70 billion parameters) were susceptible. Their complexity actually provides more surface area for these logic-based attacks to succeed,” the researchers wrote.
While this experiment showed that AI guardrails, including ‘AI judges,’ are susceptible to logic flaws, the researchers added that it also points to a solution.
“By adopting adversarial training – running this type of fuzzer internally to identify weaknesses and then retraining the model on these examples – organizations can harden their systems. This approach can reduce the attack success rate from approximately 99% to near zero,” the Unit 42 blog concluded.
