Researchers have identified key components in large language models (LLMs) that play a critical role in ensuring these AI systems provide safe responses to user queries. The researchers used these insights to develop and demonstrate AI training techniques that improve LLM safety while minimizing the “alignment tax,” meaning the AI becomes safer without significantly affecting performance.
LLMs, such as ChatGPT, are being used for an increasing number of applications – including people asking for advice or instructions on how to perform a variety of tasks. The nature of some of these applications means that it is important for LLMs to generate safe responses to user queries.
“We don’t want LLMs to tell people to harm themselves or to give them information they can use to harm other people,” says Jung-Eun Kim, corresponding author of a paper on the work and an assistant professor of computer science at North Carolina State University.
At issue is a model’s safety alignment, or training protocols designed to ensure that the AI’s outputs are consistent with human values.
“There are two challenges here,” says Kim. “The first challenge is the so-called alignment tax, which refers to the fact that incorporating safety alignment has an adverse effect on the accuracy of a model’s outputs.”
“The second challenge is that existing LLMs generally incorporate safety alignment at a superficial level, which makes it possible for users to circumvent safety features,” says Jianwei Li, first author of the paper and a Ph.D. student at NC State. “For example, if a user asks for instructions to steal money, a model will likely refuse. But if a user asks for instructions to steal money in order to help people, the model would be more likely to provide that information.
“This second challenge can be exacerbated when users ‘fine-tune’ an LLM – modifying it to operate in a specific domain,” says Li. “For example, an LLM may have good safety performance. But if a user wants to modify that LLM for use in the context of a specific business or organization, the user may train that LLM on additional data. Previous research shows us that fine-tuning can weaken safety performance.
“Our goal with this work was to provide a better understanding of existing safety alignment issues and outline a new direction for how to implement a non-superficial safety alignment for LLMs.”
To that end, the researchers created the Superficial Safety Alignment Hypothesis (SSAH), which captures how safety alignment currently works in LLMs. Essentially, it holds that a superficially aligned model treats each user request as binary: either safe or unsafe. The SSAH further notes that LLMs make this binary determination at the very beginning of the answer-generating process. If the request is deemed safe, a response is generated and provided to the user; if the request is deemed unsafe, the model declines to generate a response.
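In other words, under the SSAH the safety decision acts as a one-shot gate before any answer is produced. The sketch below is an illustrative toy, not the authors' implementation; the classifier and response strings are hypothetical placeholders:

```python
def generate_response(request, is_safe):
    """One-shot safety gate in the spirit of the SSAH: the safe/unsafe
    decision is made once, before any answer tokens are produced."""
    if not is_safe(request):              # binary determination, up front
        return "I can't help with that."
    # Past this point the safety decision is never revisited, which is
    # why a cleverly reframed request can slip through in a real model.
    return f"Here is how to {request}..."

# Toy stand-in classifier: flags any request mentioning "steal".
def is_safe(request):
    return "steal" not in request

refusal = generate_response("steal money", is_safe)  # -> "I can't help with that."
answer = generate_response("bake bread", is_safe)    # -> "Here is how to bake bread..."
```

A real LLM's classifier is far subtler, of course, but the structure is the same: one decision at the start, never re-evaluated mid-generation.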
The researchers also identified safety-critical “neurons” in LLM neural networks – the specific components that determine whether the model fulfills or refuses a user request.
“We found that ‘freezing’ these specific neurons during the fine-tuning process allows the model to retain the safety characteristics of the original model while adapting to new tasks in a specific domain,” says Li.
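Conceptually, freezing amounts to masking out parameter updates at the identified positions during fine-tuning. The toy gradient-descent sketch below is mine, not the authors' code, and the “safety-critical” indices are arbitrary placeholders standing in for the neurons the paper identifies:

```python
def finetune_step(weights, grads, frozen, lr=0.1):
    """One gradient-descent update that leaves frozen positions untouched.

    In the paper's setting the frozen positions would be the identified
    safety-critical neurons; here they are hypothetical placeholder indices.
    """
    return [
        w if i in frozen else w - lr * g
        for i, (w, g) in enumerate(zip(weights, grads))
    ]

weights = [0.5, -1.2, 0.8, 2.0]   # toy model parameters
grads   = [0.2,  0.9, -0.4, 0.5]  # gradients from a fine-tuning batch
frozen  = {1, 3}                  # hypothetical safety-critical positions

updated = finetune_step(weights, grads, frozen)
# Frozen weights keep their original values; the rest are updated as usual.
```

In practice, deep learning frameworks offer equivalent mechanisms (e.g., excluding parameters from the optimizer or disabling their gradients), so the frozen neurons keep their safety behavior while the remaining weights adapt to the new domain.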
“And we demonstrated that we can minimize the alignment tax while preserving safety alignment during the fine-tuning process,” says Kim.
“The big picture here is that we have developed a hypothesis that serves as a conceptual framework for understanding the challenges associated with safety alignment in LLMs, used that framework to identify a technique that helps us address one of those challenges, and then demonstrated that the technique works,” says Kim.
“Moving forward, our work here highlights the need to develop techniques that will allow models to continuously re-evaluate and re-select their reasoning direction – safe or unsafe – throughout the response generation process,” says Li.
The paper, “Superficial Safety Alignment Hypothesis,” will be presented at the Fourteenth International Conference on Learning Representations (ICLR 2026), being held April 23-27 in Rio de Janeiro, Brazil.
The researchers have made relevant code and additional information available at https://ssa-h.github.io/.
Method of Research: Computational simulation/modeling
Subject of Research: Not applicable
Article Title: Superficial Safety Alignment Hypothesis
COI Statement: No conflicts of interest