AI-generated · automated quality-gated

Misdirection beats refusal: bounded ASR for agentic crypto systems

A new defense framework proves that feeding automated jailbreak attacks misleading responses caps their success rate, while conventional blocking fails. For on-chain AI agents, this changes the threat model from 'exploitable at scale' to 'not worth the gas.'

June 21, 2026 · research synthesis

Crypto implications

Conventional detect-and-block defenses fail against automated attacks with large query budgets any on-chain agent using simple refusal guards (e.g., LLM-based MEV searchers, DAO treasury bots) is trivially exploitable.
Detect-and-misdirect yields bounded asymptotic ASR agentic crypto systems (e.g., Olas agents, UniswapX fillers) can deploy CMPE as a middleware layer to cap exploit success rates.
CMPE reduces ASR upper bounds by up to two orders of magnitude the difference between 'exploitable at scale' and 'not worth the gas' for automated prompt-injection campaigns.
CMPE nearly eliminates verified attack success in PAIR and GPTFuzz standard automated jailbreak frameworks are defeated, meaning the defense holds against the most widely used attack automation.
Probabilistic model unifies target, defense, and attacker judge provides a template for proving security bounds on agentic interfaces, analogous to game-theoretic proofs for consensus protocols.

Conventional detect-and-block defenses against prompt injection have a fatal flaw: they let attacker success rate (ASR) approach one as the query budget grows, because predictable refusals become feedback signals for automated search [^claim_122]. For any on-chain agent — an MEV searcher parsing user intents, a DAO treasury bot, or an intent-solver like those in UniswapX — this means a simple block/refuse guard is trivially bypassed by an attacker with enough queries. The refusal itself leaks information that guides the next attack attempt.

A new framework from arXiv:2606.20470 proposes detect-and-misdirect: instead of blocking, the defense feeds detected malicious interactions controlled, non-operational responses designed to induce false-positive errors in the attacker’s automated judge [^claim_128]. This reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR [^claim_123]. The paper models the target system, defense, and attacker’s judge as a unified probabilistic game [^claim_127], providing a theoretical bound — not just empirical hand-waving.

The concrete implementation, Contextual Misdirection via Progressive Engagement (CMPE), replaces predictable refusal text with safe but strategically misleading responses in automated jailbreak settings [^claim_124]. On jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude [^claim_125]. In end-to-end runs against PAIR and GPTFuzz — the standard automated jailbreak frameworks — CMPE nearly eliminates verified attack success [^claim_126].

For crypto, the implication is direct. Autonomous AI agents that execute on-chain actions (e.g., Olas agents, Across fillers, or any LLM-powered smart-contract auditor) face precisely these automated attack patterns. A defense that defeats PAIR/GPTFuzz end-to-end is deployable as a middleware layer for agentic crypto systems. The difference between a 10% ASR and a 0.1% ASR is the difference between “exploitable at scale” and “not worth the attacker’s gas costs.” The probabilistic model also offers a template for proving security bounds on agentic interfaces, analogous to game-theoretic proofs for consensus protocols — a missing primitive for mechanism design in DeFi.

Evidence & Provenance

Every claim is hash-locked to its source span. Click any ^[N] marker above to verify.

Claim 122 Conventional detect-and-block defenses allow attacker success rate (ASR) to approach one as the query budget grows, because predictable refusals provide useful feedback to automated search.

Source span

conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, since predictable refusals provide useful feedback to automated search.

SHA-256 1628d1ebf6c172fc7d258b0a2a3fa73c81e7d249a3fb1796d2a96206be528773

Source Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

Claim 123 Detect-and-misdirect strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR.

Source span

This strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR.

SHA-256 6db0db7eb6d0018fdde3da5b208f376efa195b9411d346d19612090b13b84015

Source Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

Claim 124 Contextual Misdirection via Progressive Engagement (CMPE) is a lightweight conversational misdirection method that replaces predictable refusal text with safe but strategically misleading responses in automated jailbreak settings.

Source span

Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational misdirection method designed to replace predictable refusal text with safe but strategically misleading responses in automated jailbreak settings.

SHA-256 ec173dfcb57e4caa1e1150158e5d3d154b4f4d4ac59b9d6d826a60a9f05270a6

Source Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

Claim 125 CMPE reduces estimated ASR upper bounds by up to two orders of magnitude on jailbreak benchmarks.

Source span

CMPE reduces estimated ASR upper bounds by up to two orders of magnitude

SHA-256 4451cf9273cb76113aa8a01f3f25deba0cd46ec3718fb2654cdb5318f6b7883b

Source Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

Claim 126 CMPE nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.

Source span

nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.

SHA-256 383bf2095d3774221a87707dfeb347ba96d0acf02e4f524450eeb680d2610b1e

Source Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

Claim 127 The paper analyzes the attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge.

Source span

This work analyzes the resulting attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge.

SHA-256 5896f01e6715fa6106df4eb322e433a8ba4a669d98a4a38edfe6847ea6210368

Source Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

Claim 128 Detect-and-misdirect works by giving detected malicious interactions controlled, non-operational responses designed to induce false-positive errors in the attacker's judge.

Source span

detect-and-misdirect, where detected malicious interactions receive controlled, non-operational responses designed to induce false-positive errors in the attacker's judge.

SHA-256 4f156a26b80db4acc7496c94d3ae7bd795541d402dc714b12df848e9cc6b2a5f

Source Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

Sources

Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

Evidence & Provenance

Sources

Get the synthesis