New “Lies-in-the-Loop” Attack Undermines AI Safety Dialogs

A novel attack technique that undermines a common safety mechanism in agentic AI systems has been detailed by security researchers, showing how human approval prompts can be manipulated to execute malicious code.

The issue, observed by Checkmarx researchers, centers on Human-in-the-Loop (HITL) dialogs, which are designed to ask users for confirmation before an AI agent performs potentially risky actions such as running operating system commands.

The research, published on Tuesday, describes how attackers can forge or manipulate these dialogs so they appear harmless, even though approving them triggers arbitrary code execution.

The technique, dubbed Lies-in-the-Loop (LITL), exploits the trust users place in confirmation prompts, turning a safeguard into an attack vector.

A New Attack Vector

The analysis expands on earlier work by showing that attackers are not limited to hiding malicious commands out of view. They can also prepend benign-looking text, tamper with metadata that summarizes the action being taken and exploit Markdown rendering flaws in user interfaces.

In some cases, injected content can alter how a dialog is displayed, making dangerous commands appear safe or replacing them with innocuous ones.

The problem is particularly acute for privileged AI agents such as code assistants, which often rely heavily on HITL dialogs and lack other defensive layers recommended by OWASP.

HITL prompts are cited by OWASP as mitigations for prompt injection and excessive agency, making their compromise especially concerning.

“Once the HITL dialog itself is compromised, the human safeguard becomes trivially easy to bypass,” the researchers wrote.

The attack can originate from indirect prompt injections that poison the agent’s context long before the dialog is shown.

Affected Tools and Mitigation Strategies

The research references demonstrations involving Claude Code and Microsoft Copilot Chat in VS Code.

In Claude Code, attackers were shown to tamper with dialog content and metadata. In Copilot Chat, improper Markdown sanitization allowed injected elements to render in ways that could mislead users after approval.

The disclosure timeline shows that Anthropic acknowledged reports in August 2025 but classified them as informational. Microsoft acknowledged a report in October 2025 and later marked it as completed without a fix, stating the behavior did not meet its criteria for a security vulnerability.

The researchers stress that no single fix can eliminate LITL attacks, but they recommend a defense-in-depth approach, including:

Improving user awareness and training
Strengthening visual clarity of approval dialogs
Validating and sanitizing inputs, including Markdown
Using safe OS APIs that separate commands from arguments
Applying guardrails and reasonable length limits to dialogs

“Developers adopting a defense-in-depth strategy with multiple protective layers [...] can significantly reduce the risks for their users,” Checkmarx wrote.

“At the same time, users can strengthen resilience through greater awareness, attentiveness and a healthy degree of skepticism.”

New “Lies-in-the-Loop” Attack Undermines AI Safety Dialogs

Alessandro Mascellino

A New Attack Vector

Affected Tools and Mitigation Strategies

You may also like

GitHub Deputy CSO: AI Steps Up Security Game in Software Development

AI Assistants Used as Covert Command-and-Control Relays

Banana Squad’s Stealthy GitHub Malware Campaign Targets Devs

Threat Actors Game GitHub Search to Spread Malware

Npm Packages Used to Distribute Phishing Links

What’s Hot on Infosecurity Magazine?

44% Surge in App Exploits as AI Speeds Up Cyber-Attacks, IBM Finds

Shai-Hulud-Like Worm Targets Developers via npm and AI Tools

Russian Cyber Threat Actor Uses GenAI to Compromise Fortinet Firewalls

Jackpotting Surge Costs Banks Over $20m, Warns FBI

Exploitable Vulnerabilities Present in 87% of Organizations

UK's Data Watchdog Gets a Makeover to Match Growing Demands

AI-powered Cyber-Attacks Up Significantly in the Last Year, Warns CrowdStrike

New Zero-Click Flaw in Claude Desktop Extensions, Anthropic Declines Fix

Shai-Hulud-Like Worm Targets Developers via npm and AI Tools

Starkiller: New ‘Commercial-Grade’ Phishing Kit Bypasses MFA

Future-Proofing Critical Infrastructure: National Gas CTO Darren Curley on IT/OT Security Integration

Why Ransomware Remains One of Cybersecurity’s Most Persistent and Costly Threats

How To Enhance Security Operations with AI-Powered Defenses

What’s Next for AI Identity in 2026

Why Your Organisation Needs Trusted Time Synchronisation

Securing M365 Data and Identity Systems Against Modern Adversaries

Risk-Based IT Compliance: The Case for Business-Driven Cyber Risk Quantification

Revisiting CIA: Developing Your Security Strategy in the SaaS Shared Reality

The Intelligence Edge: Clarity, Context and the Human Advantage in Modern CTI

Future-Proofing Critical Infrastructure: National Gas CTO Darren Curley on IT/OT Security Integration

Hundreds of Malicious Crypto Trading Add-Ons Found in Moltbot/OpenClaw

Russian Cyber Threat Actor Uses GenAI to Compromise Fortinet Firewalls

Why Ransomware Remains One of Cybersecurity’s Most Persistent and Costly Threats

Psychology, AI and the Modern Security Program: A CISO’s Guide to Human Centric Defence

New “Lies-in-the-Loop” Attack Undermines AI Safety Dialogs

Written by

A New Attack Vector

Affected Tools and Mitigation Strategies

You may also like

What’s Hot on Infosecurity Magazine?