New “Lies-in-the-Loop” Attack Undermines AI Safety Dialogs

Written by

A novel attack technique that undermines a common safety mechanism in agentic AI systems has been detailed by security researchers, showing how human approval prompts can be manipulated to execute malicious code.

The issue, observed by Checkmarx researchers, centers on Human-in-the-Loop (HITL) dialogs, which are designed to ask users for confirmation before an AI agent performs potentially risky actions such as running operating system commands.

The research, published on Tuesday, describes how attackers can forge or manipulate these dialogs so they appear harmless, even though approving them triggers arbitrary code execution.

The technique, dubbed Lies-in-the-Loop (LITL), exploits the trust users place in confirmation prompts, turning a safeguard into an attack vector.

A New Attack Vector

The analysis expands on earlier work by showing that attackers are not limited to hiding malicious commands out of view. They can also prepend benign-looking text, tamper with metadata that summarizes the action being taken and exploit Markdown rendering flaws in user interfaces.

In some cases, injected content can alter how a dialog is displayed, making dangerous commands appear safe or replacing them with innocuous ones.

The problem is particularly acute for privileged AI agents such as code assistants, which often rely heavily on HITL dialogs and lack other defensive layers recommended by OWASP.

HITL prompts are cited by OWASP as mitigations for prompt injection and excessive agency, making their compromise especially concerning.

“Once the HITL dialog itself is compromised, the human safeguard becomes trivially easy to bypass,” the researchers wrote.

The attack can originate from indirect prompt injections that poison the agent’s context long before the dialog is shown.

Read more on AI agent security: AI Agents Need Security Training – Just Like Your Employees

Affected Tools and Mitigation Strategies

The research references demonstrations involving Claude Code and Microsoft Copilot Chat in VS Code.

In Claude Code, attackers were shown to tamper with dialog content and metadata. In Copilot Chat, improper Markdown sanitization allowed injected elements to render in ways that could mislead users after approval.

The disclosure timeline shows that Anthropic acknowledged reports in August 2025 but classified them as informational. Microsoft acknowledged a report in October 2025 and later marked it as completed without a fix, stating the behavior did not meet its criteria for a security vulnerability.

The researchers stress that no single fix can eliminate LITL attacks, but they recommend a defense-in-depth approach, including:

  • Improving user awareness and training

  • Strengthening visual clarity of approval dialogs

  • Validating and sanitizing inputs, including Markdown

  • Using safe OS APIs that separate commands from arguments

  • Applying guardrails and reasonable length limits to dialogs

“Developers adopting a defense-in-depth strategy with multiple protective layers [...] can significantly reduce the risks for their users,” Checkmarx wrote.

“At the same time, users can strengthen resilience through greater awareness, attentiveness and a healthy degree of skepticism.”

What’s Hot on Infosecurity Magazine?