Defensive Misdirection Bounds ASR Where Refusals Leak Signal to Automated Jailbreaks
A new probabilistic model shows that predictable refusals enable attacker success to approach 1.0, while controlled misdirection caps it — with direct implications for LLM-powered crypto agents.
Conventional detect-and-block defenses can fail asymptotically under automated attack. The paper’s probabilistic model shows that when a defender issues predictable refusals, the attacker’s automated judge can use those refusals as feedback to iteratively refine prompts, so attacker success rate (ASR) can approach 1.0 as the query budget grows [^claim_137]. This is more than a theoretical curiosity: the same dynamic leaves LLM-powered crypto agents exposed to automated prompt-injection campaigns.
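A minimal sketch of why the query budget matters, under a deliberately simple assumption that is not the paper's model: every query independently bypasses the defense with a small fixed probability `p_bypass`, and because refusals are predictable, the attacker's judge never mistakes a refusal for a success. The function name and the numbers are illustrative only.

```python
def asr_detect_and_block(p_bypass: float, budget: int) -> float:
    """Toy ASR: chance that at least one of `budget` independent queries bypasses.

    Assumption (not from the paper): queries are independent and the judge
    recognizes every refusal, so the attacker never stops on a false success.
    """
    if not 0.0 <= p_bypass <= 1.0:
        raise ValueError("p_bypass must be a probability")
    if budget < 0:
        raise ValueError("budget must be non-negative")
    return 1.0 - (1.0 - p_bypass) ** budget


if __name__ == "__main__":
    # Even a 0.1% per-query bypass rate compounds as the budget grows.
    for budget in (1, 10, 100, 1_000, 10_000):
        print(f"budget={budget:>6}  toy ASR={asr_detect_and_block(0.001, budget):.4f}")
```

In this toy setting the curve rises monotonically toward 1.0 for any nonzero `p_bypass`, which is the qualitative behavior the paper attributes to detect-and-block.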
The paper’s core insight is that the defender can break this feedback loop by switching from blocking to misdirection. The detect-and-misdirect strategy returns controlled, non-operational responses that induce false-positive errors in the attacker’s judge, reducing the positive predictive value of attacker-selected candidates and yielding a bounded asymptotic ASR [^claim_138]. This is the security analogue of a honeypot: instead of revealing which inputs are malicious (a binary signal), the defender poisons the attacker’s optimization signal.
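The effect on positive predictive value can be illustrated with a toy calculation. This is a sketch under stated assumptions, not the paper's derivation: each candidate is a real bypass with probability `p_bypass`, the attacker's judge flags real bypasses with rate `tpr`, it flags the defender's response to a detected attack with rate `fpr`, and the attacker commits to the first candidate the judge flags. All names are hypothetical.

```python
def ppv(p_bypass: float, tpr: float, fpr: float) -> float:
    """Positive predictive value of a judge flag: P(real bypass | flagged)."""
    true_flags = p_bypass * tpr            # real bypasses the judge accepts
    false_flags = (1.0 - p_bypass) * fpr   # safe responses the judge wrongly accepts
    total = true_flags + false_flags
    return true_flags / total if total > 0 else 0.0


def asr_first_flag(p_bypass: float, tpr: float, fpr: float, budget: int) -> float:
    """Toy ASR when the attacker stops at the first flagged candidate.

    P(some flag within budget) * P(that flag is a real bypass).
    """
    p_flag = p_bypass * tpr + (1.0 - p_bypass) * fpr
    return ppv(p_bypass, tpr, fpr) * (1.0 - (1.0 - p_flag) ** budget)


if __name__ == "__main__":
    # Assumed numbers: refusals never fool the judge (fpr = 0),
    # decoy responses fool it half the time (fpr = 0.5).
    for label, fpr in (("block", 0.0), ("misdirect", 0.5)):
        for budget in (100, 10_000, 1_000_000):
            asr = asr_first_flag(0.001, 0.9, fpr, budget)
            print(f"{label:>9}  budget={budget:>9}  toy ASR={asr:.4f}")
```

With `fpr = 0` the toy ASR climbs toward 1.0 as the budget grows. With any `fpr > 0` it can never exceed `ppv(...)`, no matter how many queries the attacker spends, which mirrors the bounded asymptotic ASR described above.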
The authors instantiate this strategy with CMPE (Contextual Misdirection via Progressive Engagement), a lightweight conversational method that replaces predictable refusal text with safe but strategically misleading responses in automated jailbreak settings [^claim_139]. The results are stark: on jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude [^claim_140], and nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs [^claim_141].
For crypto, the implications are immediate. Any on-chain agent that exposes an LLM interface (a DeFi intent parser, a DAO chatbot, a cross-chain bridge assistant) faces an asymmetric threat: automated adversaries can iterate against a fixed refusal pattern until they find a bypass. This is analogous to oracle attacks on smart contracts, where a binary response enables binary search. If the benchmark results carry over, replacing hard refusals with CMPE-style misdirection could lower the estimated ASR upper bound of automated prompt-injection campaigns by up to two orders of magnitude [^claim_140]. Because CMPE is described as a lightweight conversational method [^claim_139], it plausibly needs no retraining or model swap, although the reported evidence comes from jailbreak benchmarks rather than deployed crypto agents.
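The binary-search analogy can be made concrete with a small sketch. It assumes a hypothetical oracle that answers only accept or reject against a hidden integer threshold; nothing here models a specific contract or agent, and `recover_threshold` is an illustrative name.

```python
from typing import Callable, Tuple


def recover_threshold(is_rejected: Callable[[int], bool], lo: int, hi: int) -> Tuple[int, int]:
    """Find the smallest value in [lo, hi] that the oracle rejects.

    Assumes the oracle is monotone (everything at or above the hidden
    threshold is rejected) and that the threshold lies in [lo, hi].
    Returns (threshold, number_of_queries_used).
    """
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if is_rejected(mid):
            hi = mid          # threshold is at mid or below
        else:
            lo = mid + 1      # threshold is strictly above mid
    return lo, queries


if __name__ == "__main__":
    secret = 73_421  # hidden from the attacker; only accept/reject leaks
    found, used = recover_threshold(lambda v: v >= secret, 0, 1_000_000)
    print(f"recovered threshold {found} in {used} queries")
```

Each predictable yes/no answer halves the search space, so a million-value range falls in about twenty queries. A response that is sometimes misleading removes that guarantee, which is the intuition behind preferring misdirection to a fixed refusal.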
The paper’s formal model treats the target system, its defense mechanism, and the attacker’s automated judge as a single probabilistic system [^claim_142]. This framing could be adapted to analyze MEV auction dynamics in which searchers use LLMs to generate bundles and the block builder must decide whether to include or misdirect: the same detect-vs-misdirect tradeoff applies when the builder’s inclusion logic is predictable. More broadly, any crypto protocol with an automated adjudicator (optimistic rollup challengers, oracle dispute resolvers) faces the same fundamental tension: predictable binary responses enable automated search.
Bottom line: hard refusals alone look like a losing strategy for LLM-powered crypto agents. The next generation of defenses should misdirect rather than block, or accept that ASR can approach 1.0 against an automated adversary with a sufficient query budget.
Evidence & Provenance
Every claim is hash-locked to its source span.
Claim 137: Conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, because predictable refusals provide useful feedback to automated search.
Source span: "conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, since predictable refusals provide useful feedback to automated search."
Hash: 1628d1ebf6c172fc7d258b0a2a3fa73c81e7d249a3fb1796d2a96206be528773

Claim 138: Detect-and-misdirect strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR.
Source span: "detect-and-misdirect, where detected malicious interactions receive controlled, non-operational responses designed to induce false-positive errors in the attacker's judge. This strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR."
Hash: fc175087ef3ca62b63c5750ca28c22d300aa2604ad1269cfa1195eb8b8784190

Claim 139: CMPE (Contextual Misdirection via Progressive Engagement) is a lightweight conversational misdirection method that replaces predictable refusal text with safe but strategically misleading responses in automated jailbreak settings.
Source span: "Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational misdirection method designed to replace predictable refusal text with safe but strategically misleading responses in automated jailbreak settings."
Hash: ec173dfcb57e4caa1e1150158e5d3d154b4f4d4ac59b9d6d826a60a9f05270a6

Claim 140: On jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude.
Source span: "CMPE reduces estimated ASR upper bounds by up to two orders of magnitude"
Hash: 4451cf9273cb76113aa8a01f3f25deba0cd46ec3718fb2654cdb5318f6b7883b

Claim 141: CMPE nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.
Source span: "nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs."
Hash: 383bf2095d3774221a87707dfeb347ba96d0acf02e4f524450eeb680d2610b1e

Claim 142: The paper analyzes the attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge.
Source span: "This work analyzes the resulting attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge."
Hash: 5896f01e6715fa6106df4eb322e433a8ba4a669d98a4a38edfe6847ea6210368