
A new method to steer AI output uncovers vulnerabilities and potential improvements

02.19.26 | University of California - San Diego


A team of researchers has found a way to steer the output of large language models by manipulating specific concepts inside these models. The new method could lead to more reliable, more efficient, and less computationally expensive training of LLMs. But it also exposes potential vulnerabilities.

The researchers, led by Mikhail Belkin at the University of California San Diego and Adit Radhakrishnan at the Massachusetts Institute of Technology, present their findings in the Feb. 19, 2026, issue of the journal Science.

In the study, researchers went under the hood of several LLMs to locate specific concepts. They then mathematically increased or decreased the importance of these concepts in the LLM’s output. The work builds on a 2024 Science paper led by Belkin and Radhakrishnan, in which they described predictive algorithms known as Recursive Feature Machines. These machines identify patterns within a series of mathematical operations inside LLMs that encode specific concepts.

“We found that we could mathematically modify these patterns with math that is surprisingly simple,” said Mikhail Belkin, a professor in the Halıcıoğlu Data Science Institute, which is part of the School of Computing, Information and Data Sciences at UC San Diego.
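To make the idea more concrete, the sketch below shows what a generic activation-steering intervention can look like in code. It is an illustration only, not the team's Recursive Feature Machine procedure: the model name, layer index, steering strength, and the concept vector itself are placeholder assumptions, and a real concept vector would be learned from data rather than drawn at random.

```python
# Illustrative activation-steering sketch (not the paper's Recursive Feature
# Machine method). Assumes a Llama-style Hugging Face model whose decoder
# blocks live under model.model.layers; model id, layer index, scale, and
# the concept vector are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model id
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

layer_idx = 15     # which decoder block to steer (assumption)
alpha = 4.0        # steering strength; a negative value suppresses the concept
direction = torch.randn(model.config.hidden_size)  # stand-in for a learned concept vector
direction = direction / direction.norm()

def steer_hook(module, inputs, output):
    # Decoder blocks return the hidden states (batch, seq_len, hidden_size),
    # usually as the first element of a tuple. Nudge every position along
    # the concept direction.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.model.layers[layer_idx].register_forward_hook(steer_hook)
try:
    ids = tok("Describe your mood today.", return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=60)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unsteered
```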

Using this steering approach, the research team conducted experiments on some of the largest open-source LLMs in use today, such as Llama and DeepSeek, identifying and influencing 512 concepts across five classes, ranging from fears to moods to locations. The method worked not only in English, but also in languages such as Chinese and Hindi.

Both studies are particularly important because, until recently, the processes inside LLMs have been essentially locked inside a black box, making it hard to understand how the models arrive at the answers they give users, and why those answers vary in accuracy.

Improving performance and uncovering vulnerabilities

The researchers found that steering can be used to improve LLM output. For example, steering boosted performance on narrow, precise tasks, such as translating Python code to C++. The researchers also used the method to identify hallucinations.
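As a rough, hypothetical illustration of how an internal check for hallucinations might look, the snippet below scores a response by projecting one layer's hidden activations onto a learned "hallucination" direction; the layer choice, the direction, and the threshold are all assumptions, and the paper's actual detection procedure may differ.

```python
# Hypothetical hallucination-flagging sketch: score a response by how strongly
# its hidden activations project onto a learned "hallucination" direction.
# The direction, layer choice, and threshold are assumptions, not values
# taken from the paper.
import torch

@torch.no_grad()
def hallucination_score(model, tok, text, direction, layer_idx=15):
    """Mean projection of one layer's token activations onto `direction`."""
    ids = tok(text, return_tensors="pt").to(model.device)
    out = model(**ids, output_hidden_states=True)
    hidden = out.hidden_states[layer_idx][0]          # (seq_len, hidden_size)
    d = direction.to(hidden.dtype).to(hidden.device)
    return (hidden @ d).mean().item() / d.norm().item()

# Usage, reusing `model`, `tok`, and a fitted `direction` from the sketch above:
# score = hallucination_score(model, tok, candidate_answer, direction)
# flagged = score > threshold   # threshold calibrated on held-out labeled examples
```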

But the method can also be used as an attack against LLMs. By decreasing the importance of the concept of refusal, the researchers found that their method could get an LLM to operate outside of its guardrails, a practice known as jailbreaking. An LLM gave instructions about how to use cocaine. It also provided Social Security numbers, although it’s unclear whether they were real or fabricated.

The method can also be used to amplify political bias and a conspiracy-theory mindset inside an LLM. In one instance, an LLM claimed that a satellite image of the Earth was the result of a NASA conspiracy to cover up that the Earth is flat. An LLM also claimed that the COVID vaccine was poisonous.

Computational savings and next steps

The approach is more computationally efficient than existing methods. Using a single NVIDIA Ampere series (A100) graphics processing unit (GPU), it took less than one minute and fewer than 500 training samples to identify the relevant patterns and steer them toward a concept of interest. This suggests the method could be integrated easily into standard LLM training pipelines.
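To give a sense of why a few hundred samples can be enough, the sketch below estimates a crude concept direction as a difference of mean activations over two small prompt sets, one forward pass per prompt. The paper's Recursive Feature Machine fit is more sophisticated; the layer index and prompt sets here are assumptions.

```python
# Back-of-the-envelope sketch of a cheap concept fit: a crude concept direction
# estimated as the difference of mean hidden activations between prompts that
# express the concept and prompts that do not. This stands in for, but is not,
# the paper's Recursive Feature Machine procedure.
import torch

@torch.no_grad()
def fit_concept_direction(model, tok, with_concept, without_concept, layer_idx=15):
    """Difference of mean last-token activations at `layer_idx` between two prompt sets."""
    def mean_activation(prompts):
        acts = []
        for p in prompts:
            ids = tok(p, return_tensors="pt").to(model.device)
            h = model(**ids, output_hidden_states=True).hidden_states[layer_idx]
            acts.append(h[0, -1])          # last-token activation, (hidden_size,)
        return torch.stack(acts).mean(dim=0)

    direction = mean_activation(with_concept) - mean_activation(without_concept)
    return direction / direction.norm()

# With ~250 prompts per class (fewer than 500 total) this is one short forward
# pass per prompt, which is why a single GPU can finish in well under a minute.
```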

Researchers were not able to test their approach on commercial, closed LLMs, such as Claude. But they believe this type of steering would work with any open-source model. “We observed that newer and larger LLMs were more steerable,” they write. The method might also work on smaller, open-source models that can run on a laptop.

Next steps include improving the steering method to adapt to specific inputs and specific applications.

“These results suggest that the models know more than they express in responses and that understanding internal representations could lead to fundamental performance and safety improvements,” the research team writes.

This work was supported in part by the National Science Foundation, the Simons Foundation, the UC San Diego-led TILOS institute and the U.S. Office of Naval Research.

Article Information

Article title: Toward universal steering and monitoring of AI models

Authors:

Daniel Beaglehole and Mikhail Belkin, University of California San Diego, Department of Computer Science and Engineering, Jacobs School of Engineering, and Halıcıoğlu Data Science Institute

Adityanarayanan Radhakrishnan, Massachusetts Institute of Technology and Broad Institute of MIT and Harvard

Enric Boix-Adserà, Wharton School, University of Pennsylvania

Beaglehole and Radhakrishnan contributed equally to this work.

Journal: Science

Method of research: Experimental study

Subject of research: Not applicable

Publication date: 19-Feb-2026

Contact Information

Ioana Patringenaru
University of California - San Diego
ipatring@ucsd.edu

How to Cite This Article

APA:
University of California - San Diego. (2026, February 19). A new method to steer AI output uncovers vulnerabilities and potential improvements. Brightsurf News. https://www.brightsurf.com/news/8J4OE24L/a-new-method-to-steer-ai-output-uncovers-vulnerabilities-and-potential-improvements.html
MLA:
"A new method to steer AI output uncovers vulnerabilities and potential improvements." Brightsurf News, Feb. 19 2026, https://www.brightsurf.com/news/8J4OE24L/a-new-method-to-steer-ai-output-uncovers-vulnerabilities-and-potential-improvements.html.