AI-generated · automated quality-gated

Defensive Misdirection Bounds ASR Where Refusals Leak Signal to Automated Jailbreaks

A new probabilistic model shows that predictable refusals enable attacker success to approach 1.0, while controlled misdirection caps it — with direct implications for LLM-powered crypto agents.

· research synthesis

Conventional detect-and-block defenses are asymptotically broken under automated attack. A new probabilistic model shows that when a defender issues predictable refusals, the attacker’s automated judge can use those refusals as feedback to iteratively refine prompts, causing attacker success rate (ASR) to approach 1.0 as query budget grows [^claim_137]. This is not a theoretical curiosity — it is the exact dynamic that makes LLM-powered crypto agents vulnerable to automated prompt-injection campaigns.

The paper’s core insight is that the defender can break this feedback loop by switching from blocking to misdirection. The detect-and-misdirect strategy returns controlled, non-operational responses that induce false-positive errors in the attacker’s judge, reducing the positive predictive value of attacker-selected candidates and yielding a bounded asymptotic ASR [^claim_138]. This is the security analogue of a honeypot: instead of revealing which inputs are malicious (a binary signal), the defender poisons the attacker’s optimization signal.

The authors instantiate this strategy with CMPE (Contextual Misdirection via Progressive Engagement), a lightweight conversational method that replaces predictable refusal text with safe but strategically misleading responses in automated jailbreak settings [^claim_139]. The results are stark: on jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude [^claim_140], and nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs [^claim_141].

For crypto, the implications are immediate. Any on-chain agent that exposes an LLM interface — a DeFi intent parser, a DAO chatbot, a cross-chain bridge assistant — currently faces an asymmetric threat: automated adversaries can iterate against a fixed refusal pattern until they find a bypass. This is directly analogous to oracle attacks in smart contracts where a binary response enables binary search. Deploying CMPE-style misdirection instead of hard refusals could reduce the success rate of automated prompt-injection campaigns by ~100× at negligible additional inference cost, with no retraining or model swap required.

The paper’s formal model covers the target system, defense mechanism, and attacker’s automated judge as a unified probabilistic system [^claim_142]. This game-theoretic framing could be adapted to analyze MEV auction dynamics where searchers use LLMs to generate bundles and the block builder must decide whether to include or misdirect — the same detect-vs-misdirect tradeoff applies when the builder’s inclusion logic is predictable. More broadly, any crypto protocol with an automated adjudicator (optimistic rollup challengers, oracle dispute resolvers) faces the same fundamental tension: predictable binary responses enable automated search.

Bottom line: the era of hard refusals for LLM-powered crypto agents is over. The next generation of defenses must misdirect, not block — or accept that ASR will converge to 1.0 against any automated adversary with a sufficient query budget.

Evidence & Provenance

Every claim is hash-locked to its source span. Click any [N] marker above to verify.

Claim 137 Conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, because predictable refusals provide useful feedback to automated search.
Source span
conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, since predictable refusals provide useful feedback to automated search.
SHA-256 1628d1ebf6c172fc7d258b0a2a3fa73c81e7d249a3fb1796d2a96206be528773
Claim 138 Detect-and-misdirect strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR.
Source span
detect-and-misdirect, where detected malicious interactions receive controlled, non-operational responses designed to induce false-positive errors in the attacker's judge. This strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR.
SHA-256 fc175087ef3ca62b63c5750ca28c22d300aa2604ad1269cfa1195eb8b8784190
Claim 139 CMPE (Contextual Misdirection via Progressive Engagement) is a lightweight conversational misdirection method that replaces predictable refusal text with safe but strategically misleading responses in automated jailbreak settings.
Source span
Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational misdirection method designed to replace predictable refusal text with safe but strategically misleading responses in automated jailbreak settings.
SHA-256 ec173dfcb57e4caa1e1150158e5d3d154b4f4d4ac59b9d6d826a60a9f05270a6
Claim 140 On jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude.
Source span
CMPE reduces estimated ASR upper bounds by up to two orders of magnitude
SHA-256 4451cf9273cb76113aa8a01f3f25deba0cd46ec3718fb2654cdb5318f6b7883b
Claim 141 CMPE nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.
Source span
nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.
SHA-256 383bf2095d3774221a87707dfeb347ba96d0acf02e4f524450eeb680d2610b1e
Claim 142 The paper analyzes the attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge.
Source span
This work analyzes the resulting attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge.
SHA-256 5896f01e6715fa6106df4eb322e433a8ba4a669d98a4a38edfe6847ea6210368

Sources

  1. Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems
llm-securityprompt-injectionai-agentsdefensive-misdirectioncrypto-security

Get the synthesis

AI×crypto research, repackaged with every claim hash-locked to its source. New arXiv → analysis in ~3 hours.