AI-generated · automated quality-gated

Defensive Misdirection Bounds ASR Where Refusals Leak Signal to Automated Jailbreaks

A new probabilistic model shows that predictable refusals enable attacker success to approach 1.0, while controlled misdirection caps it — with direct implications for LLM-powered crypto agents.

June 21, 2026 · research synthesis

Crypto implications

Predictable refusals leak information to automated adversaries On-chain agents (MEV searcher bots, DeFi governance delegates using LLMs) become vulnerable to iterative prompt-injection attacks, analogous to oracle attacks in smart contracts.
Detect-and-misdirect caps asymptotic ASR TEE-based inference endpoints for on-chain agents can return plausible-but-non-operational outputs, starving the attacker's optimization loop of ground-truth signal.
CMPE reduces ASR by ~100x Crypto protocols with LLM-powered interfaces (DeFi intent parsers, DAO chatbots, cross-chain bridge assistants) can deploy CMPE as a drop-in defense at negligible cost.
CMPE nearly eliminates PAIR/GPTFuzz attack success Same attack pipelines could target LLM-powered wallets; CMPE offers a practical defense, though robustness against adaptive adversaries needs further study.
Probabilistic model unifies target, defense, and attacker This framing can be adapted to MEV auctions (searchers as attackers, builders as defenders) and any protocol with an automated adjudicator (optimistic rollup challengers, oracle dispute resolvers).

Conventional detect-and-block defenses are asymptotically broken under automated attack. A new probabilistic model shows that when a defender issues predictable refusals, the attacker’s automated judge can use those refusals as feedback to iteratively refine prompts, causing attacker success rate (ASR) to approach 1.0 as query budget grows [^claim_137]. This is not a theoretical curiosity — it is the exact dynamic that makes LLM-powered crypto agents vulnerable to automated prompt-injection campaigns.

The paper’s core insight is that the defender can break this feedback loop by switching from blocking to misdirection. The detect-and-misdirect strategy returns controlled, non-operational responses that induce false-positive errors in the attacker’s judge, reducing the positive predictive value of attacker-selected candidates and yielding a bounded asymptotic ASR [^claim_138]. This is the security analogue of a honeypot: instead of revealing which inputs are malicious (a binary signal), the defender poisons the attacker’s optimization signal.

The authors instantiate this strategy with CMPE (Contextual Misdirection via Progressive Engagement), a lightweight conversational method that replaces predictable refusal text with safe but strategically misleading responses in automated jailbreak settings [^claim_139]. The results are stark: on jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude [^claim_140], and nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs [^claim_141].

For crypto, the implications are immediate. Any on-chain agent that exposes an LLM interface — a DeFi intent parser, a DAO chatbot, a cross-chain bridge assistant — currently faces an asymmetric threat: automated adversaries can iterate against a fixed refusal pattern until they find a bypass. This is directly analogous to oracle attacks in smart contracts where a binary response enables binary search. Deploying CMPE-style misdirection instead of hard refusals could reduce the success rate of automated prompt-injection campaigns by ~100× at negligible additional inference cost, with no retraining or model swap required.

The paper’s formal model covers the target system, defense mechanism, and attacker’s automated judge as a unified probabilistic system [^claim_142]. This game-theoretic framing could be adapted to analyze MEV auction dynamics where searchers use LLMs to generate bundles and the block builder must decide whether to include or misdirect — the same detect-vs-misdirect tradeoff applies when the builder’s inclusion logic is predictable. More broadly, any crypto protocol with an automated adjudicator (optimistic rollup challengers, oracle dispute resolvers) faces the same fundamental tension: predictable binary responses enable automated search.

Bottom line: the era of hard refusals for LLM-powered crypto agents is over. The next generation of defenses must misdirect, not block — or accept that ASR will converge to 1.0 against any automated adversary with a sufficient query budget.

Evidence & Provenance

Every claim is hash-locked to its source span. Click any ^[N] marker above to verify.

Claim 137 Conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, because predictable refusals provide useful feedback to automated search.

Source span

conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, since predictable refusals provide useful feedback to automated search.

SHA-256 1628d1ebf6c172fc7d258b0a2a3fa73c81e7d249a3fb1796d2a96206be528773

Source https://arxiv.org/abs/2606.20470

Claim 138 Detect-and-misdirect strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR.

Source span

detect-and-misdirect, where detected malicious interactions receive controlled, non-operational responses designed to induce false-positive errors in the attacker's judge. This strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR.

SHA-256 fc175087ef3ca62b63c5750ca28c22d300aa2604ad1269cfa1195eb8b8784190

Source https://arxiv.org/abs/2606.20470

Claim 139 CMPE (Contextual Misdirection via Progressive Engagement) is a lightweight conversational misdirection method that replaces predictable refusal text with safe but strategically misleading responses in automated jailbreak settings.

Source span

Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational misdirection method designed to replace predictable refusal text with safe but strategically misleading responses in automated jailbreak settings.

SHA-256 ec173dfcb57e4caa1e1150158e5d3d154b4f4d4ac59b9d6d826a60a9f05270a6

Source https://arxiv.org/abs/2606.20470

Claim 140 On jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude.

Source span

CMPE reduces estimated ASR upper bounds by up to two orders of magnitude

SHA-256 4451cf9273cb76113aa8a01f3f25deba0cd46ec3718fb2654cdb5318f6b7883b

Source https://arxiv.org/abs/2606.20470

Claim 141 CMPE nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.

Source span

nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.

SHA-256 383bf2095d3774221a87707dfeb347ba96d0acf02e4f524450eeb680d2610b1e

Source https://arxiv.org/abs/2606.20470

Claim 142 The paper analyzes the attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge.

Source span

This work analyzes the resulting attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge.

SHA-256 5896f01e6715fa6106df4eb322e433a8ba4a669d98a4a38edfe6847ea6210368

Source https://arxiv.org/abs/2606.20470

Sources

Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

Evidence & Provenance

Sources

Get the synthesis