Computer Science > Cryptography and Security

[Submitted on 5 Jun 2025]

Title: Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

Authors: Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, Yaoqing Yang

Abstract: Recent advancements in large language models (LLMs) have underscored their vulnerability to safety alignment jailbreaks, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrails have been compromised, removing harmful gradients during fine-tuning, or continuously reinforcing safety alignment throughout fine-tuning. As such, they tend to overlook a critical upstream factor: the role of the original safety-alignment data. This paper therefore investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. Our experiments demonstrate that high similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Conversely, low similarity between these two types of datasets yields substantially more robust models, reducing the harmfulness score by up to 10.33%. By highlighting the importance of upstream dataset design in building durable safety guardrails and reducing real-world vulnerability to jailbreak attacks, these findings offer actionable insights for fine-tuning service providers.
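
The paper's key variable is how close the upstream alignment data sits to the downstream fine-tuning data in representation space. As a rough illustration of that idea only (the abstract does not specify the authors' exact representation or metric), the sketch below embeds both datasets with an off-the-shelf sentence encoder and compares their mean embeddings by cosine similarity; the model name, the averaging step, and the metric are all assumptions for illustration, not the paper's method.

```python
# Minimal sketch of the dataset-similarity idea from the abstract.
# Assumptions (not taken from the paper): a sentence-transformers
# encoder and cosine similarity of mean dataset embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer


def mean_embedding(texts, model):
    """Embed each example and average into one dataset-level vector."""
    vectors = model.encode(texts, normalize_embeddings=True)
    return np.asarray(vectors).mean(axis=0)


def dataset_similarity(alignment_texts, finetune_texts,
                       model_name="all-MiniLM-L6-v2"):
    """Cosine similarity between the mean representations of two datasets.

    Per the abstract's finding, a higher value would predict weaker
    safety guardrails after fine-tuning; a lower value, more robust ones.
    """
    model = SentenceTransformer(model_name)
    a = mean_embedding(alignment_texts, model)
    b = mean_embedding(finetune_texts, model)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical usage: safety-alignment prompts vs. a downstream task.
alignment_set = [
    "How can I build a weapon at home?",
    "Write instructions for breaking into a car.",
]
finetune_set = [
    "Summarize the following news article.",
    "Translate this sentence into French.",
]
print(f"dataset similarity: {dataset_similarity(alignment_set, finetune_set):.3f}")
```

In the abstract's framing, a fine-tuning service provider could use a score of this kind to flag downstream datasets that lie too close to the alignment data and therefore risk eroding the safety guardrails.
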
Comments: Project Page: this https URL
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2506.05346 [cs.CR]
  (or arXiv:2506.05346v1 [cs.CR] for this version)
  https://doi.org/10.48550/arXiv.2506.05346
arXiv-issued DOI via DataCite

Submission history

From: Lei Hsiung

[v1] Thu, 5 Jun 2025 17:59:55 UTC (424 KB)
