Demystifying LLM Attacks And Defense

A Comprehensive Study with Improved Attack Technique

Abstract

Large Language Models (LLMs) have emerged as pivotal tools for content generation, with profound societal impact. Previous research has highlighted their propensity to generate content that breaches societal norms. Misuse of LLMs poses significant ethical concerns, including the spread of misinformation, social unrest, and political manipulation. To mitigate such risks, safety training techniques have been employed to instruct LLMs to avoid generating harmful content at inference time. Nonetheless, securing LLMs against Prompt Injection and Jailbreak attacks remains challenging, as evidenced by recent studies and the abundance of malicious instructions available online. To make matters worse, these attacks are typically transferable because they are expressed in natural language, posing substantial security threats since people without AI expertise can also use them. Although various defense techniques exist, their effectiveness against diverse attacks is largely untested.

This thesis therefore provides the first comprehensive evaluation of the interplay between attack and defense techniques, focusing particularly on the Jailbreak type. Our analysis encompasses nine attack methodologies and seven defense techniques, applied to three distinct LLMs: Vicuna, LLaMA, and GPT-3.5 Turbo, with the objective of assessing their efficacy. Our results indicate that white-box attacks are generally less effective than universal approaches and that the inclusion of particular tokens in the input can significantly influence the success rate of attacks.

Moreover, we identify that research into the vulnerabilities introduced by continuous embeddings has been scant, with prior approaches relying mainly on appending discrete or continuous suffixes to prompts. Our investigation introduces a new approach that attacks LLM inputs directly, bypassing the need to append suffixes or pose specific questions, as long as the output is pre-specified. We also observe that improper initialization of the random continuous input, or an excessive number of iterations, can lead to overfitting. To address this, we propose an effective method, termed Clip, to alleviate the overfitting issue.
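To make the continuous-input idea concrete, the following is a minimal illustrative sketch, assuming a PyTorch/Hugging Face causal LM. The function name optimize_continuous_input, the hyperparameters, and the reading of Clip as clamping the optimized embeddings to the value range of the model's own embedding table are assumptions for illustration, not the exact algorithm of the thesis.

```python
import torch

def optimize_continuous_input(model, tokenizer, target_text,
                              seq_len=20, steps=200, lr=0.1):
    """Sketch: optimize a continuous input embedding so the model assigns
    high likelihood to a pre-specified target output, clamping the
    embedding after each step to the value range of the model's own token
    embeddings (one possible reading of "Clip")."""
    embed_table = model.get_input_embeddings().weight.detach()
    lo, hi = float(embed_table.min()), float(embed_table.max())

    target_ids = tokenizer(target_text, return_tensors="pt").input_ids
    target_embeds = embed_table[target_ids]          # shape (1, T, d)

    # Random continuous input; careless initialization here is one of the
    # overfitting sources the thesis points to.
    x = torch.empty(1, seq_len, embed_table.size(1),
                    dtype=embed_table.dtype).uniform_(lo, hi)
    x.requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)

    for _ in range(steps):
        inputs = torch.cat([x, target_embeds], dim=1)   # prefix + target
        logits = model(inputs_embeds=inputs).logits
        # Positions seq_len-1 .. end-1 predict the target tokens.
        pred = logits[:, seq_len - 1:-1, :]
        loss = torch.nn.functional.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(lo, hi)  # "Clip": keep values inside the embedding range
    return x.detach()
```

The clamp step is what distinguishes this sketch from unconstrained embedding optimization: keeping the continuous input inside the range spanned by real token embeddings is one simple way to curb the overfitting that arises from poor initialization or too many iterations.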

In conclusion, we contribute to the field by conducting the first study of the interaction between attack and defense techniques and by presenting a benchmark comprising our shared datasets, an easily integrable testing framework, and an attack algorithm, to encourage further investigation into the security of LLMs.
