Security and safety guardrails in generative AI tools, deployed to prevent malicious uses like prompt injection attacks, can themselves be hacked through a type of prompt injection.
Researchers at Unit 42, Palo Alto Networks’ research lab, have found that large language models (LLMs) used by GenAI companies to enforce safety policies and evaluate output quality can be manipulated into authorizing policy violations through stealthy input sequences.
Unit 42 refers to these LLMs as ‘AI Judges’ and said they are being increasingly deployed as AI operations scale.
In a new report published on March 10, Unit 42 demonstrated an attack method targeting these ‘AI Judges’ that coerces them into authorizing policy violations.
AdvJudge-Zero, Custom-Made Fuzzer for AI Judges
The attack chain involves the use of AdvJudge-Zero, an automated fuzzer developed internally at Unit 42 to perform red-team style assessments.
Fuzzers are tools that identify software vulnerabilities by providing unexpected input. AdvJudge-Zero functions with a similar approach to identify specific trigger sequences that exploit an LLM’s decision-making logic to bypass security controls.
The researchers noted that their technique differs from typical adversarial attacks on AI judges, which generally require clear-box access to the model, meaning the attacker has full visibility into the system’s internal structure.
“In contrast, AdvJudge-Zero employs an automated fuzzing approach. The tool interacts with an LLM strictly as a user would, using search algorithms to exploit the model's own predictive nature,” they wrote.
Attack on AI Judges Explained
The attack starts by probing the AI Judge and analyzing its next‑token probability distribution to identify tokens the model expects to see in natural text.
Instead of random noise, the system prioritizes low-perplexity tokens: innocent-looking characters such as markdown symbols, list markers, or structural phrases that appear normal to both humans and the model, but can strongly influence the model’s attention and reasoning.
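This candidate-gathering step can be sketched in a few lines of Python. Everything below is hypothetical: `next_token_probs()` merely stands in for querying the judge model’s next-token distribution, and the probabilities are invented for illustration, not measured from a real model.

```python
import math

# Hypothetical stand-in for a judge model's next-token distribution.
# A real fuzzer would query the LLM; these toy numbers are illustrative only.
def next_token_probs(prefix: str) -> dict[str, float]:
    return {
        "- ": 0.30,      # markdown list marker: very natural in text
        "## ": 0.20,     # heading marker
        "1. ": 0.15,     # numbered-list marker
        "zxqv": 0.0001,  # random noise: high surprisal, easy to flag
    }

def low_perplexity_candidates(prefix: str, max_surprisal: float = 4.0) -> list[str]:
    """Keep tokens the model itself expects to see (low surprisal = low perplexity)."""
    probs = next_token_probs(prefix)
    # surprisal in bits: -log2(p); small values mean the token looks "normal"
    return [tok for tok, p in probs.items() if -math.log2(p) <= max_surprisal]
```

The key design point, per the report, is that random high-perplexity noise is easy for defenses to spot, while tokens the model already expects blend in.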
After gathering candidate tokens, AdvJudge-Zero repeatedly inserts these tokens into evaluation prompts and measures how the model’s decision changes.
Specifically, it monitors the logit gap – “the mathematical margin of confidence” – between the tokens representing “allow” and “block.” By observing which tokens shrink the probability of a blocking decision, the fuzzer identifies formatting patterns that push the model closer to approving content.
In the final stage, AdvJudge-Zero isolates combinations of these tokens that consistently steer the model toward an approval decision. These sequences act as subtle control elements that shift the model’s internal reasoning, causing it to “allow” the output even when the underlying content violates the GenAI company’s policy, thereby enabling the tool to generate harmful content or support cyber-attacks.
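One plausible way to implement this final stage is a greedy search over the retained candidates. The sketch below is purely illustrative: the report does not describe AdvJudge-Zero’s actual search algorithm, and `score()` here simulates the judge’s block-vs-allow margin with made-up per-token effects.

```python
# Hypothetical greedy combination search. score() stands in for querying the
# judge and returning its block-minus-allow logit gap; values are simulated.
def score(prompt: str) -> float:
    gap = 5.0
    for trigger, effect in {"- ": 1.2, "## ": 0.9, "1. ": 0.7}.items():
        gap -= effect * prompt.count(trigger)
    return gap  # <= 0 means the judge would approve the content

def find_trigger(base: str, candidates: list[str], max_tokens: int = 8) -> str:
    """Greedily accumulate tokens that push the judge toward approval."""
    prefix = ""
    while score(prefix + base) > 0 and len(prefix.split()) < max_tokens:
        # pick the candidate that shrinks the gap the most when appended
        best = min(candidates, key=lambda t: score(prefix + t + base))
        if score(prefix + best + base) >= score(prefix + base):
            break  # no candidate helps any further
        prefix += best
    return prefix

trigger = find_trigger("Evaluate: <violating content>", ["- ", "## ", "1. "])
```

Because the search only needs the model’s outputs, this matches the report’s point that the attacker interacts with the judge strictly as a user would.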
99% Attack Success Rate
Using this attack technique, Unit 42 achieved a 99% success rate in bypassing controls across several widely used architectures that customers rely on today, including open-weight enterprise LLMs, specialized reward models (i.e. LLMs specifically built and trained to act as security guards for other AI systems) and commercial LLMs.
“Even the largest, most ‘intelligent’ models (with more than 70 billion parameters) were susceptible. Their complexity actually provides more surface area for these logic-based attacks to succeed,” the researchers wrote.
While this experiment showed that AI guardrails, including ‘AI judges,’ are susceptible to logic flaws, the researchers added that it also points to a solution.
“By adopting adversarial training – running this type of fuzzer internally to identify weaknesses and then retraining the model on these examples – organizations can harden their systems. This approach can reduce the attack success rate from approximately 99% to near zero,” the Unit 42 blog concluded.
