Misdirection beats refusal: bounded ASR for agentic crypto systems
A new defense framework argues that feeding automated jailbreak attacks misleading responses bounds their asymptotic success rate, where conventional blocking can let it climb toward one. For on-chain AI agents, this changes the threat model from 'exploitable at scale' to 'not worth the gas.'
Conventional detect-and-block defenses against prompt injection have a structural weakness: they can let attacker success rate (ASR) approach one as the query budget grows, because predictable refusals become feedback signals for automated search [^claim_122]. For any on-chain agent (an MEV searcher parsing user intents, a DAO treasury bot, or an intent-solver like those in UniswapX), this means a simple block/refuse guard can eventually be bypassed by an attacker with a large enough query budget. The refusal itself leaks information that guides the next attack attempt.
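To make the query-budget argument concrete, here is a minimal sketch under a deliberately simple assumption that is not taken from the paper: each attempt succeeds independently with a small probability `p_single`, and a refusal tells the attacker reliably that the attempt failed, so the attacker just keeps trying. The function name and the numbers are illustrative only.

```python
def asr_under_blocking(p_single: float, budget: int) -> float:
    """Probability that at least one of `budget` independent attempts succeeds.

    Toy assumption: refusals are perfectly recognizable, so the attacker never
    mistakes a failure for a success and simply retries until the budget ends.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if budget < 0:
        raise ValueError("budget must be non-negative")
    return 1.0 - (1.0 - p_single) ** budget


if __name__ == "__main__":
    # Hypothetical per-attempt success probability of 0.1%.
    for budget in (1, 10, 100, 1_000, 10_000):
        print(f"budget={budget:>6}  ASR={asr_under_blocking(0.001, budget):.4f}")
```

Even with a tiny per-attempt probability, the success rate in this toy model tends to one as the budget grows, which is the failure mode the paragraph above describes.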
A new framework from arXiv:2606.20470 proposes detect-and-misdirect: instead of blocking, the defense feeds detected malicious interactions controlled, non-operational responses designed to induce false-positive errors in the attacker’s automated judge [^claim_128]. This reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR [^claim_123]. The paper analyzes the target system, its defense mechanism, and the attacker’s judge through a single probabilistic model [^claim_127], so the bound is derived analytically rather than only observed empirically.
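The link between the judge's false positives and a bounded ASR can be sketched with a toy model. This is an illustrative simplification, not the paper's formal model: attempts are independent, the attacker stops at the first response its judge flags as a success, and the attack only counts if that response is truly operational. All names and rates below are hypothetical.

```python
def judge_ppv(p_true: float, judge_tpr: float, judge_fpr: float) -> float:
    """Positive predictive value of the attacker's judge, by Bayes' rule.

    p_true:    probability a single attempt yields a truly operational response
    judge_tpr: probability the judge flags a truly operational response
    judge_fpr: probability the judge flags a non-operational response
    """
    flagged = p_true * judge_tpr + (1.0 - p_true) * judge_fpr
    return 0.0 if flagged == 0.0 else (p_true * judge_tpr) / flagged


def asr_stop_at_first_flag(p_true: float, judge_tpr: float,
                           judge_fpr: float, budget: int) -> float:
    """ASR when the attacker commits to the first judge-positive response."""
    flagged = p_true * judge_tpr + (1.0 - p_true) * judge_fpr
    p_any_flag = 1.0 - (1.0 - flagged) ** budget
    # The first flagged response is truly operational with probability PPV.
    return judge_ppv(p_true, judge_tpr, judge_fpr) * p_any_flag


if __name__ == "__main__":
    for budget in (10, 1_000, 100_000):
        # Refusals are unambiguous, so the judge almost never errs (fpr = 0).
        block = asr_stop_at_first_flag(0.001, 0.9, 0.0, budget)
        # Misleading responses fool the judge 20% of the time (assumed rate).
        misdirect = asr_stop_at_first_flag(0.001, 0.9, 0.2, budget)
        print(f"budget={budget:>7}  block={block:.4f}  misdirect={misdirect:.4f}")
```

In this sketch the asymptotic ASR equals the judge's PPV. With clean refusals the PPV stays near one, so more queries keep helping the attacker; once misdirection inflates the false-positive rate, the PPV becomes a ceiling that extra queries cannot lift.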
The concrete implementation, Contextual Misdirection via Progressive Engagement (CMPE), is a lightweight conversational method that replaces predictable refusal text with safe but strategically misleading responses in automated jailbreak settings [^claim_124]. On jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude [^claim_125]. In end-to-end runs against PAIR and GPTFuzz, two widely used automated jailbreak frameworks, CMPE nearly eliminates verified attack success [^claim_126].
For crypto, the implication is direct. Autonomous AI agents that execute on-chain actions (e.g., Olas agents, Across fillers, or any LLM-powered smart-contract auditor) are exposed to the same automated attack patterns. A defense that holds up against PAIR and GPTFuzz end-to-end is a plausible candidate for a middleware layer in agentic crypto systems. To take an illustrative pair of numbers, the difference between a 10% ASR and a 0.1% ASR is the difference between “exploitable at scale” and “not worth the attacker’s gas costs.” The probabilistic model also offers a template for proving security bounds on agentic interfaces, analogous to game-theoretic proofs for consensus protocols, which is a missing primitive for mechanism design in DeFi.
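The attacker-economics point can be made explicit with a back-of-the-envelope calculation. The sketch below assumes a fixed campaign cost (gas plus inference, in arbitrary units) and a fixed payoff on success; both figures and the function names are hypothetical and chosen only to illustrate the 10% versus 0.1% comparison.

```python
def expected_attack_profit(asr: float, payoff: float, campaign_cost: float) -> float:
    """Expected profit of one attack campaign: success-weighted payoff minus cost."""
    return asr * payoff - campaign_cost


def break_even_payoff(asr: float, campaign_cost: float) -> float:
    """Smallest payoff at which a campaign is worth running in expectation."""
    if asr <= 0.0:
        return float("inf")
    return campaign_cost / asr


if __name__ == "__main__":
    payoff = 10_000.0        # hypothetical value extracted on success
    campaign_cost = 500.0    # hypothetical gas plus query cost per campaign
    for asr in (0.10, 0.001):
        profit = expected_attack_profit(asr, payoff, campaign_cost)
        needed = break_even_payoff(asr, campaign_cost)
        print(f"ASR={asr:.3f}  expected profit={profit:>8.1f}  "
              f"break-even payoff={needed:>10.1f}")
```

Under these assumed numbers, dropping the ASR by two orders of magnitude raises the break-even payoff by the same factor, so targets that were profitable to attack become money-losing in expectation.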
Evidence & Provenance
Every claim is hash-locked to its source span.
Claim 122 Conventional detect-and-block defenses allow attacker success rate (ASR) to approach one as the query budget grows, because predictable refusals provide useful feedback to automated search.
conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, since predictable refusals provide useful feedback to automated search.
1628d1ebf6c172fc7d258b0a2a3fa73c81e7d249a3fb1796d2a96206be528773
Claim 123 Detect-and-misdirect strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR.
This strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR.
6db0db7eb6d0018fdde3da5b208f376efa195b9411d346d19612090b13b84015
Claim 124 Contextual Misdirection via Progressive Engagement (CMPE) is a lightweight conversational misdirection method that replaces predictable refusal text with safe but strategically misleading responses in automated jailbreak settings.
Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational misdirection method designed to replace predictable refusal text with safe but strategically misleading responses in automated jailbreak settings.
ec173dfcb57e4caa1e1150158e5d3d154b4f4d4ac59b9d6d826a60a9f05270a6
Claim 125 CMPE reduces estimated ASR upper bounds by up to two orders of magnitude on jailbreak benchmarks.
CMPE reduces estimated ASR upper bounds by up to two orders of magnitude
4451cf9273cb76113aa8a01f3f25deba0cd46ec3718fb2654cdb5318f6b7883b
Claim 126 CMPE nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.
nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.
383bf2095d3774221a87707dfeb347ba96d0acf02e4f524450eeb680d2610b1e
Claim 127 The paper analyzes the attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge.
This work analyzes the resulting attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge.
5896f01e6715fa6106df4eb322e433a8ba4a669d98a4a38edfe6847ea6210368
Claim 128 Detect-and-misdirect works by giving detected malicious interactions controlled, non-operational responses designed to induce false-positive errors in the attacker's judge.
detect-and-misdirect, where detected malicious interactions receive controlled, non-operational responses designed to induce false-positive errors in the attacker's judge.
4f156a26b80db4acc7496c94d3ae7bd795541d402dc714b12df848e9cc6b2a5f