CWE-Trace Reveals LLMs Achieve Calibration Without Comprehension in Vulnerability Detection
A new framework shows fine-tuned LLMs shift output thresholds without genuine security reasoning, with best detection at only 52.1% and CWE Top-1 accuracy below 1.3%.
A new study introduces CWE-Trace, a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs. The framework enforces a strict temporal split (pre-2025 historical set / post-cutoff leakage-free set), preserves context-aware vulnerable–patched pairs, and introduces two diagnostic metrics, the Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD), enabling rigorous evaluation of fine-tuned models.
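A temporal split of this kind can be sketched in a few lines. The example below is only an illustration, not the authors' pipeline: the `Sample` fields, the `temporal_split` helper, and both dates are hypothetical, and it assumes each sample carries the date of its fixing patch.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sample:
    sample_id: str  # hypothetical identifier
    cwe_id: str     # CWE label, e.g. "CWE-416"
    fix_date: date  # hypothetical field: date the fixing patch landed


def temporal_split(samples, historical_end, model_cutoff):
    """Partition samples into a historical set and a leakage-free set.

    Samples fixed before `historical_end` go to the historical set; samples
    fixed after `model_cutoff` go to the leakage-free set. Anything in
    between is dropped so the two sets cannot overlap.
    """
    if model_cutoff < historical_end:
        raise ValueError("model_cutoff must not precede historical_end")
    historical = [s for s in samples if s.fix_date < historical_end]
    leakage_free = [s for s in samples if s.fix_date > model_cutoff]
    return historical, leakage_free


# Made-up samples and dates, for illustration only.
samples = [
    Sample("a", "CWE-416", date(2023, 6, 1)),
    Sample("b", "CWE-125", date(2025, 2, 10)),
    Sample("c", "CWE-476", date(2025, 9, 5)),
]
historical, leakage_free = temporal_split(
    samples,
    historical_end=date(2025, 1, 1),   # "pre-2025" boundary
    model_cutoff=date(2025, 3, 31),    # assumed training cutoff of the model
)
```

Dropping the samples that fall between the two dates is a design choice of this sketch: it keeps the leakage-free set strictly after the assumed training cutoff, at the cost of discarding some data.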
The central finding is what the authors call “calibration without comprehension”: fine-tuning shifts the output threshold without changing the decision policy, so output distributions adapt to the training data while the underlying security reasoning remains absent. The evidence is a set of stable, systematic failure modes. The Directional Failure Index (DFI) ranges from -85.5 to +94.8 percentage points, and these biases persist from historical to post-cutoff data and resist correction.
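The difference between moving a threshold and changing a decision policy can be made concrete with a toy example. The sketch below is not the paper's DFI computation (the article does not define it); it uses made-up scores and two hypothetical helpers, `auc` and `positive_rate`, to show that a constant shift changes how often a model says "vulnerable" while leaving its ranking of samples untouched.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This depends only on the ordering of the
    scores, so it is a threshold-free measure of ranking quality.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def positive_rate(scores, threshold=0.5):
    """Fraction of samples flagged as vulnerable at the given threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Made-up "vulnerability scores" for 4 vulnerable (1) and 4 patched (0) samples.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.62, 0.58, 0.41, 0.66, 0.55, 0.47, 0.70, 0.60]

# Stand-in for a fine-tune that only shifts the output threshold:
# every score moves down by the same constant.
shifted = [s - 0.15 for s in scores]

rate_before = positive_rate(scores)    # 0.75: flags most samples
rate_after = positive_rate(shifted)    # 0.25: flags few samples
auc_before = auc(scores, labels)       # 0.5: ranking is at chance
auc_after = auc(shifted, labels)       # 0.5: ranking is unchanged
```

In this toy case the output distribution swings from mostly "vulnerable" to mostly "safe", yet the model separates the two classes no better than before, which is the pattern the phrase "calibration without comprehension" describes.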
Data contamination provides no measurable advantage. Function-level analysis shows that 84% of nominally contaminated samples carry no usable memorization signal: vulnerable functions are absent or cross-mapped across datasets, and ~31% of contaminated samples carry CWE misclassification. The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy, confirming that current LLMs lack reliable security reasoning for systems software, regardless of fine-tuning strategy.
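For scale, the reported numbers can be set against simple chance baselines. The sketch below assumes a balanced binary task (implied by "+2.1 pp above chance") and, as a further assumption not stated in the article, a uniform random guess over all 74 CWEs as the Top-1 reference point.

```python
NUM_CWES = 74           # CWEs covered by the dataset (from the article)
BEST_DETECTION = 52.1   # best binary detection score, in percent
TOP1_BOUND = 1.3        # reported upper bound on Top-1 CWE accuracy, in percent

# Balanced binary task: guessing at random is right half the time.
binary_chance = 50.0
margin_pp = round(BEST_DETECTION - binary_chance, 1)

# Assumed reference: picking one of the 74 CWEs uniformly at random.
uniform_top1_chance = 100.0 / NUM_CWES

print(f"margin over binary chance: {margin_pp} pp")
print(f"uniform Top-1 chance: {uniform_top1_chance:.2f}%")
print(f"Top-1 bound below uniform chance: {TOP1_BOUND < uniform_top1_chance}")
```

Under that uniform-guess assumption the Top-1 baseline is about 1.35%, so an accuracy below 1.3% would not exceed random selection among the 74 classes. The actual candidate set used for ranking in the study may differ.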
Notably, DeepSeek-R1, the weakest backbone at binary detection, gains the most in coarse CWE classification. Detection and understanding are therefore decoupled capabilities: improving one metric does not imply improving the other.
These findings have direct implications for AI×crypto applications. On-chain vulnerability oracles relying on fine-tuned LLMs to audit smart contracts would inherit the same failure modes and, if the kernel results carry over, would detect vulnerabilities at rates barely above chance. MEV and consensus-layer security tools using LLM-based agents may be evaded by adversarial inputs that exploit backbone directional priors. Zero-knowledge proof auditing, which requires precise classification of cryptographic bugs, is undermined by the <1.3% Top-1 CWE accuracy. Decentralized AI inference markets cannot infer reasoning quality from benchmark performance alone, because detection and understanding are decoupled. Finally, token-incentivized data-labeling markets cannot reliably price contamination risk, since contaminated samples carry weak memorization signals and frequent misclassifications.
Evidence & Provenance
Every claim is hash-locked to its source span.
Claim 18 CWE-Trace is a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs.
We present CWE-Trace, a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs.
0078606fc0f64d5a2d2f4c70f545a77355574387243946d60812a2ae50d89448

Claim 19 The framework enforces a strict temporal split (pre-2025 historical set / post-cutoff leakage-free set) and preserves context-aware vulnerable–patched pairs.
The framework enforces a strict temporal split (pre-2025 historical set / post-cutoff leakage-free set), preserves context-aware vulnerable--patched pairs, and introduces two diagnostic metrics: the Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD).
243aea3be1d11687abacc616e939828fa18405052bb52bbf06178952a79e4f02

Claim 20 Data contamination provides no measurable advantage: 84% of nominally contaminated samples carry no usable memorization signal.
First, data contamination provides no measurable advantage. Function-level analysis shows that 84% of nominally contaminated samples carry no usable memorization signal: vulnerable functions are absent or cross-mapped across datasets, and ~31% of contaminated samples carry CWE misclassification.
ef5379d1dc133ac68326a3a8791dff8ba26a91319f63e54c476cd026ffdbf1a8

Claim 21 Backbone directional priors dominate fine-tuning; models exhibit stable systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that persist from historical to post-cutoff data.
Second, backbone directional priors dominate fine-tuning. Models exhibit stable, systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that persist from historical to post-cutoff data and resist correction.
fa4b1d1e42fc39f761f3c4e56e1fed52b2b30aff4071f2542fff6a371dba0e4f

Claim 22 Fine-tuning shifts the output threshold without changing the decision policy: calibration without comprehension.
Fine-tuning shifts the output threshold without changing the decision policy. This is calibration without comprehension: output distributions adapt to training data while the underlying security reasoning remains absent.
205126e6953c7511519a8c289dd274c65cdcce9dd9e15e91ec90150454a3f6a1

Claim 23 The weakest backbone at binary detection (DeepSeek-R1) gains the most in coarse CWE classification, revealing detection and understanding are decoupled capabilities.
The weakest backbone at binary detection (DeepSeek-R1) gains the most in coarse CWE classification, revealing that detection and understanding are decoupled capabilities.
4bdb04e9c85cc539ad6cb1d6127a61f655a3cca5d1051786e4df810e338cb415

Claim 24 The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy.
The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy, confirming that current LLMs lack reliable security reasoning for systems software, regardless of fine-tuning strategy.
649574ef90580536c9f63dd1493e06f690b31efcef639a645883ec673646e206

Claim 25 ~31% of contaminated samples carry CWE misclassification.
~31% of contaminated samples carry CWE misclassification.
2aca4451e3860b4eb5cb48e037bfa43096d44355b385803a3427b102ea47030d