Computer Science > Cryptography and Security
[Submitted on 5 Jun 2025]
Title:Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets
Abstract: Recent advancements in large language models (LLMs) have underscored their vulnerability to safety-alignment jailbreaks, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrails have been compromised, removing harmful gradients during fine-tuning, or continuously reinforcing safety alignment throughout fine-tuning. As such, they tend to overlook a critical upstream factor: the role of the original safety-alignment data. This paper therefore investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. Our experiments demonstrate that high similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Conversely, low similarity between these two types of datasets yields substantially more robust models and reduces the harmfulness score by up to 10.33%. By highlighting the importance of upstream dataset design in building durable safety guardrails and reducing real-world vulnerability to jailbreak attacks, these findings offer actionable insights for fine-tuning service providers.
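The abstract does not specify how representation similarity between the upstream alignment data and the downstream fine-tuning data is measured. Below is a minimal sketch, not the paper's method, assuming a simple proxy: mean pairwise cosine similarity of sentence embeddings. The model name, the `dataset_similarity` helper, and the averaging into a single score are all illustrative assumptions.

```python
# Sketch (illustrative, not the paper's measure): estimate how similar an
# upstream alignment dataset is to a downstream fine-tuning dataset by
# averaging pairwise cosine similarities of sentence embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


def dataset_similarity(alignment_samples, finetune_samples,
                       model_name="all-MiniLM-L6-v2"):
    """Return a scalar similarity score between two lists of text samples."""
    model = SentenceTransformer(model_name)  # embedding model is an assumption
    # Encode each dataset into an (n_samples, dim) matrix of unit-norm embeddings.
    align_emb = model.encode(alignment_samples, normalize_embeddings=True)
    ft_emb = model.encode(finetune_samples, normalize_embeddings=True)
    # Aggregate all pairwise cosine similarities into one dataset-level score.
    return float(cosine_similarity(align_emb, ft_emb).mean())


if __name__ == "__main__":
    alignment = ["Placeholder safety-alignment prompt A",
                 "Placeholder safety-alignment prompt B"]
    finetune = ["Placeholder downstream fine-tuning prompt A",
                "Placeholder downstream fine-tuning prompt B"]
    print(f"similarity ≈ {dataset_similarity(alignment, finetune):.3f}")
```

Under the abstract's finding, a score like this could serve as an upstream screening signal: a fine-tuning provider could prefer alignment data whose similarity to the expected downstream tasks is low, since low similarity is reported to yield more robust guardrails.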
Submission history
From: Lei Hsiung [view email]
[v1] Thu, 5 Jun 2025 17:59:55 UTC (424 KB)