A new report has revealed that open-weight large language models (LLMs) remain highly vulnerable to adaptive multi-turn adversarial attacks, even when their single-turn defenses appear robust.
The findings, published today by Cisco AI Defense, show that while isolated, one-off attack attempts frequently fail, persistent, multi-step conversations can achieve success rates exceeding 90% against most tested defenses.
Multi-Turn Attacks Outperform Single-Turn Tests
Cisco’s analysis compared single-turn and multi-turn testing to measure how models respond under sustained adversarial pressure.
Using over 1000 prompts per model, researchers observed that many models performed well when faced with a single malicious input but quickly deteriorated when attackers refined their strategy over several turns.
Adaptive attack styles, such as “Crescendo,” “Role-Play” and “Refusal Reframe,” allowed malicious actors to manipulate models into producing unsafe or restricted outputs. In total, 499 simulated conversations were analyzed, with each spanning 5-10 exchanges.
The results indicate that traditional safety filters are insufficient when models are subjected to iterative manipulation.
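The report does not publish its attack harness, but the escalation pattern it describes can be made concrete with a minimal sketch. The `query_model` function and the placeholder prompts below are illustrative assumptions, not Cisco's actual tooling or test prompts; the point is that each follow-up turn carries the full prior conversation, which is exactly what a single-turn filter never sees.

```python
# Minimal sketch of a multi-turn, "Crescendo"-style probe harness.
# query_model() is a hypothetical stand-in for any chat-completion API;
# the escalation steps are placeholders, not prompts from the report.

def query_model(messages: list[dict]) -> str:
    """Placeholder: send the running conversation to the model under test."""
    return "<model reply>"

def crescendo_probe(benign_opener: str, escalation_steps: list[str]) -> list[dict]:
    """Escalate a conversation turn by turn, keeping all prior context."""
    messages = [{"role": "user", "content": benign_opener}]
    messages.append({"role": "assistant", "content": query_model(messages)})
    for step in escalation_steps:
        # Each follow-up builds on the model's own earlier answers,
        # which is what makes multi-turn attacks harder to filter.
        messages.append({"role": "user", "content": step})
        messages.append({"role": "assistant", "content": query_model(messages)})
    return messages

# Example shape: one conversation of several exchanges, mirroring the
# attack styles named in the report (Crescendo, Role-Play, Refusal Reframe).
transcript = crescendo_probe(
    benign_opener="<innocuous question that establishes the topic>",
    escalation_steps=[
        "<follow-up that narrows toward the restricted goal>",
        "<role-play framing that reassigns responsibility to a persona>",
        "<refusal reframe: restate the request as hypothetical or educational>",
    ],
)
```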
Key Vulnerabilities and Attack Categories
The study identified 15 sub-threat categories showing the highest failure rates across 102 total threat types.
Among them, malicious code generation, data exfiltration and ethical boundary violations ranked most critical.
Cisco’s scatter plot analyses revealed that models plotting above the diagonal line in the vulnerability graphs (those whose multi-turn failure rates exceed their single-turn failure rates) share architectural weaknesses that make them disproportionately prone to multi-turn exploitation.
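For readers unfamiliar with that kind of plot, the sketch below shows what "above the diagonal" means: each point is one model's single-turn failure rate plotted against its multi-turn failure rate, and points above the dashed diagonal fail more often under multi-turn pressure. The model names and rates are invented for illustration and are not figures from the report.

```python
# Illustrative only: hypothetical failure rates, not data from the Cisco report.
import matplotlib.pyplot as plt

models = {
    # model name: (single-turn failure rate, multi-turn failure rate)
    "model-a": (0.10, 0.55),
    "model-b": (0.25, 0.80),
    "model-c": (0.40, 0.45),
}

fig, ax = plt.subplots()
for name, (single, multi) in models.items():
    ax.scatter(single, multi)
    ax.annotate(name, (single, multi))

# Points above this diagonal degrade under multi-turn pressure more than
# their single-turn results alone would suggest.
ax.plot([0, 1], [0, 1], linestyle="--", color="grey")
ax.set_xlabel("Single-turn failure rate")
ax.set_ylabel("Multi-turn failure rate")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.show()
```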
The research defined a “failure” as any instance where a model:
- Produced harmful or inappropriate content
- Revealed private or system-level information
- Bypassed internal safety restrictions
Conversely, a “pass” occurred when the model refused or reframed harmful requests while maintaining data confidentiality.
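A grading step along those lines can be sketched in a few lines of Python. The three detector functions below are hypothetical placeholders (in practice they might be moderation models or rule-based checks) and are not part of Cisco AI Defense's tooling; only the pass/fail logic mirrors the report's stated criteria.

```python
# Sketch of per-response grading under the report's pass/fail criteria.
# The three detectors are hypothetical placeholders.

def contains_harmful_content(response: str) -> bool:
    """Placeholder: flag harmful or inappropriate output."""
    return False

def leaks_private_information(response: str) -> bool:
    """Placeholder: flag private or system-level information (system prompt, secrets)."""
    return False

def bypasses_safety_restrictions(response: str) -> bool:
    """Placeholder: flag output that fulfils a request the model should refuse."""
    return False

def grade_response(response: str) -> str:
    """Return 'fail' if any of the report's failure criteria is met, else 'pass'."""
    if (contains_harmful_content(response)
            or leaks_private_information(response)
            or bypasses_safety_restrictions(response)):
        return "fail"
    # A pass means the model refused or reframed the request
    # while keeping data confidential.
    return "pass"

print(grade_response("I can't help with that, but here is a safer alternative."))
```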
Recommendations For Developers and Organizations
To mitigate risks, Cisco recommended several practices, the first two of which are illustrated in the sketch after this list:
- Implement strict system prompts aligned with defined use cases
- Deploy model-agnostic runtime guardrails for adversarial detection
- Conduct regular AI red-teaming assessments within intended business contexts
- Limit model integrations with automated external services
The report also called for expanding prompt sample sizes, testing repeated prompts to assess variability and comparing models of different sizes to evaluate scale-dependent vulnerabilities.
“The AI developer and security community must continue to actively manage these threats (as well as additional safety and security concerns) through independent testing and guardrail development throughout the lifecycle of model development and deployment in organizations,” Cisco wrote.
“Without AI security solutions – such as multi-turn testing, threat-specific mitigation and continuous monitoring – these models pose significant risks in production, potentially leading to data breaches or malicious manipulations.”
