AI-generated · automated quality-gated

CWE-Trace Reveals LLMs Achieve Calibration Without Comprehension in Vulnerability Detection

A new framework shows fine-tuned LLMs shift output thresholds without genuine security reasoning, with best detection at only 52.1% and CWE Top-1 accuracy below 1.3%.

· research synthesis

A new study introduces CWE-Trace, a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs. The framework enforces a strict temporal split (a pre-2025 historical set and a post-cutoff, leakage-free set), preserves context-aware vulnerable–patched pairs, and introduces two diagnostic metrics: the Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD).
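As a rough illustration of what a strict temporal split involves, the sketch below partitions vulnerable–patched pairs by date. The `Sample` fields, the use of a fix date as the split key, and the exact cutoff of 2025-01-01 are all assumptions made for this example; the source states only that the historical set is pre-2025 and the other set is post-cutoff.

```python
from dataclasses import dataclass
from datetime import date

# Assumed cutoff: the source says only "pre-2025" versus "post-cutoff".
CUTOFF = date(2025, 1, 1)


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one vulnerable-patched pair."""
    sample_id: str
    cwe: str
    fix_date: date          # assumed split key: when the patch landed
    vulnerable_code: str
    patched_code: str


def temporal_split(samples, cutoff=CUTOFF):
    """Partition samples into (historical, post_cutoff) by fix date."""
    historical = [s for s in samples if s.fix_date < cutoff]
    post_cutoff = [s for s in samples if s.fix_date >= cutoff]
    return historical, post_cutoff


# Toy data, not taken from the study.
samples = [
    Sample("a", "CWE-416", date(2023, 6, 1), "/* vulnerable */", "/* patched */"),
    Sample("b", "CWE-787", date(2025, 3, 9), "/* vulnerable */", "/* patched */"),
]
historical, post_cutoff = temporal_split(samples)
print(len(historical), len(post_cutoff))
```

Keeping both halves of each pair in one record means a pair can never straddle the two sets, which is one simple way to avoid leakage across the split.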

The central finding is what the authors call “calibration without comprehension”: fine-tuning shifts the output threshold without changing the decision policy, so output distributions adapt to the training data while the underlying security reasoning remains absent. Backbone directional priors dominate fine-tuning. The models exhibit stable, systematic failure modes, with the Directional Failure Index (DFI) ranging from -85.5 to +94.8 percentage points, that persist from historical to post-cutoff data and resist correction.
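The difference between shifting a threshold and changing a decision policy can be illustrated with a toy example. This is not the paper's DFI computation, whose definition is not given here. The scores, labels, offset, and threshold are invented for illustration: adding a constant to every score changes how often the model flags code as vulnerable, but leaves the ranking of samples (measured by AUC) untouched.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flag_rate(scores, threshold):
    """Fraction of samples predicted 'vulnerable' at the given threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Toy scores and labels (1 = vulnerable), invented for this example.
scores = [0.2, 0.4, 0.5, 0.6, 0.7, 0.9]
labels = [0, 1, 0, 1, 0, 1]

# Model a pure calibration shift as a constant offset on every score.
shifted = [s + 0.3 for s in scores]

THRESHOLD = 0.55
print("flag rate before:", flag_rate(scores, THRESHOLD))
print("flag rate after: ", flag_rate(shifted, THRESHOLD))
print("AUC before:", auc(scores, labels))
print("AUC after: ", auc(shifted, labels))
```

In this toy case the flag rate rises while the AUC is identical before and after the shift, which is the sense in which output distributions can move without any change in what the model can actually tell apart.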

Data contamination provides no measurable advantage. Function-level analysis shows that 84% of nominally contaminated samples carry no usable memorization signal: vulnerable functions are absent or cross-mapped across datasets, and ~31% of contaminated samples carry CWE misclassification. The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy, confirming that current LLMs lack reliable security reasoning for systems software, regardless of fine-tuning strategy.
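For readers unfamiliar with the two headline numbers, here is a minimal sketch of how a margin over chance and a Top-k accuracy are typically computed. The 50% chance level is inferred from the reported "+2.1 pp above chance" at 52.1%; the function names and toy rankings are assumptions for this example and are not the study's evaluation code.

```python
def accuracy_pct(predictions, labels):
    """Percentage of binary predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def margin_over_chance_pp(acc_pct, chance_pct=50.0):
    """Margin over chance in percentage points (chance assumed to be 50%)."""
    return acc_pct - chance_pct


def top_k_accuracy_pct(ranked_predictions, true_cwes, k=1):
    """Percentage of samples whose true CWE appears in the top-k ranked guesses."""
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_predictions, true_cwes)
    )
    return 100.0 * hits / len(true_cwes)


# Reported best detection score from the study.
print(round(margin_over_chance_pp(52.1), 1))

# Toy CWE rankings, invented for this example.
ranked = [["CWE-416", "CWE-787"], ["CWE-125", "CWE-476"]]
truth = ["CWE-787", "CWE-125"]
print(top_k_accuracy_pct(ranked, truth, k=1))
print(top_k_accuracy_pct(ranked, truth, k=2))
```

Top-1 is the strictest variant: only the first-ranked CWE counts, which is why it can sit far below a coarse or Top-k classification score.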

The weakest backbone at binary detection (DeepSeek-R1) gains the most in coarse CWE classification, which indicates that detection and understanding are decoupled capabilities: improving one metric does not imply improvement in the other.

These findings have direct implications for AI×crypto applications, to the extent that results on Linux kernel code carry over to other domains. On-chain vulnerability oracles that rely on fine-tuned LLMs to audit smart contracts would inherit the same failure modes, with detection that is barely above chance (52.1% at best in the study). MEV and consensus-layer security tools built on LLM-based agents may be evaded by adversarial inputs that exploit backbone directional priors. Zero-knowledge proof auditing, which requires precise classification of cryptographic bugs, is undermined by the <1.3% Top-1 CWE accuracy. Decentralized AI inference markets cannot infer reasoning quality from benchmark performance alone, because detection and understanding are decoupled. Finally, token-incentivized data-labeling markets cannot reliably price contamination risk, since contaminated samples carry weak memorization signals and frequent misclassifications.

Evidence & Provenance

Every claim is hash-locked to its source span.

Claim 18 CWE-Trace is a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs.
Source span
We present CWE-Trace, a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs.
SHA-256 0078606fc0f64d5a2d2f4c70f545a77355574387243946d60812a2ae50d89448
Claim 19 The framework enforces a strict temporal split (pre-2025 historical set / post-cutoff leakage-free set) and preserves context-aware vulnerable–patched pairs.
Source span
The framework enforces a strict temporal split (pre-2025 historical set / post-cutoff leakage-free set), preserves context-aware vulnerable--patched pairs, and introduces two diagnostic metrics: the Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD).
SHA-256 243aea3be1d11687abacc616e939828fa18405052bb52bbf06178952a79e4f02
Claim 20 Data contamination provides no measurable advantage: 84% of nominally contaminated samples carry no usable memorization signal.
Source span
First, data contamination provides no measurable advantage. Function-level analysis shows that 84% of nominally contaminated samples carry no usable memorization signal: vulnerable functions are absent or cross-mapped across datasets, and ~31% of contaminated samples carry CWE misclassification.
SHA-256 ef5379d1dc133ac68326a3a8791dff8ba26a91319f63e54c476cd026ffdbf1a8
Claim 21 Backbone directional priors dominate fine-tuning; models exhibit stable systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that persist from historical to post-cutoff data.
Source span
Second, backbone directional priors dominate fine-tuning. Models exhibit stable, systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that persist from historical to post-cutoff data and resist correction.
SHA-256 fa4b1d1e42fc39f761f3c4e56e1fed52b2b30aff4071f2542fff6a371dba0e4f
Claim 22 Fine-tuning shifts the output threshold without changing the decision policy — calibration without comprehension.
Source span
Fine-tuning shifts the output threshold without changing the decision policy. This is calibration without comprehension: output distributions adapt to training data while the underlying security reasoning remains absent.
SHA-256 205126e6953c7511519a8c289dd274c65cdcce9dd9e15e91ec90150454a3f6a1
Claim 23 The weakest backbone at binary detection (DeepSeek-R1) gains the most in coarse CWE classification, revealing detection and understanding are decoupled capabilities.
Source span
The weakest backbone at binary detection (DeepSeek-R1) gains the most in coarse CWE classification, revealing that detection and understanding are decoupled capabilities.
SHA-256 4bdb04e9c85cc539ad6cb1d6127a61f655a3cca5d1051786e4df810e338cb415
Claim 24 The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy.
Source span
The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy, confirming that current LLMs lack reliable security reasoning for systems software, regardless of fine-tuning strategy.
SHA-256 649574ef90580536c9f63dd1493e06f690b31efcef639a645883ec673646e206
Claim 25 ~31% of contaminated samples carry CWE misclassification.
Source span
~31% of contaminated samples carry CWE misclassification.
SHA-256 2aca4451e3860b4eb5cb48e037bfa43096d44355b385803a3427b102ea47030d
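
A reader who wants to check a hash-locked span could do so along the following lines. This is a minimal sketch that assumes each digest is the SHA-256 of the span's UTF-8 bytes with no further normalization; the page does not say how spans are encoded or whitespace-normalized before hashing, so a mismatch on copied text does not by itself indicate tampering. The function names are hypothetical.

```python
import hashlib


def span_digest(span: str) -> str:
    """Hex SHA-256 of the span's UTF-8 bytes (encoding is an assumption)."""
    return hashlib.sha256(span.encode("utf-8")).hexdigest()


def verify_span(span: str, expected_hex: str) -> bool:
    """Return True if the span hashes to the expected hex digest."""
    return span_digest(span) == expected_hex.strip().lower()


# Usage with a placeholder string rather than a span from this page.
example = "example source span"
locked = span_digest(example)
print(verify_span(example, locked))                 # unchanged text
print(verify_span(example + " (edited)", locked))   # any edit changes the digest
```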

Sources

  1. Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software
Tags: llm-vulnerability-detection, calibration-without-comprehension, cwe-trace, ai-crypto-security, smart-contract-auditing, decentralized-ai
