AI models have their own internal representations of knowledge or concepts that are often difficult to discern, even though they are critical to the models’ output. For instance, knowing more about a model’s representation of a concept could help explain why an AI model “hallucinates” information, or why certain prompts can trick it into responses that dodge its built-in safeguards. Daniel Beaglehole and colleagues now introduce a robust method to extract these concept representations, which works across several large-scale language, reasoning, and vision AI models. Their technique uses a feature extraction algorithm called the Recursive Feature Machine. By extracting concept representations this way, Beaglehole et al. were able to monitor the models, exposing some of their vulnerabilities to behaviors such as hallucination, and to steer them toward improved responses. Surprisingly, the researchers found, the concept representations were transferable across languages and could be combined for multi-concept steering. “Together, these results suggest that the models know more than they express in responses and that understanding internal representations could lead to fundamental performance and safety improvements,” the authors write.
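To make the ideas of “concept representations,” monitoring, and steering concrete, here is a minimal, hedged sketch of generic activation steering in PyTorch. It is an illustration only, not the authors’ Recursive Feature Machine pipeline: the toy model, the contrastively estimated `concept_vec`, and the steering strength `alpha` are all hypothetical stand-ins.

```python
# Hedged sketch: generic concept-direction monitoring and steering on a toy module.
# This is NOT the paper's Recursive Feature Machine; all names here are hypothetical.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 64

# Toy stand-in for one hidden layer of a larger model.
model = nn.Sequential(
    nn.Linear(hidden_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, hidden_dim),
)

# Hypothetical concept direction: difference of mean activations on inputs that do
# vs. do not express the concept. (The paper instead extracts such representations
# with a Recursive Feature Machine, which is not reproduced here.)
pos_acts = torch.randn(32, hidden_dim) + 0.5   # placeholder "concept present" activations
neg_acts = torch.randn(32, hidden_dim)         # placeholder "concept absent" activations
concept_vec = pos_acts.mean(0) - neg_acts.mean(0)
concept_vec = concept_vec / concept_vec.norm()

# Monitoring: project a hidden state onto the concept direction to get a scalar score.
def concept_score(hidden_state: torch.Tensor) -> torch.Tensor:
    return hidden_state @ concept_vec

# Steering: nudge the intermediate activation along the concept direction via a hook.
alpha = 4.0  # steering strength (hypothetical value)

def steering_hook(module, inputs, output):
    return output + alpha * concept_vec

handle = model[0].register_forward_hook(steering_hook)
x = torch.randn(1, hidden_dim)
steered_out = model(x)      # forward pass with the concept direction added
handle.remove()
plain_out = model(x)        # same input, no steering

print("concept score (unsteered hidden state):", concept_score(model[0](x)).item())
print("output shift from steering:", (steered_out - plain_out).norm().item())
```

In this toy picture, multi-concept steering would amount to adding several such directions to the hidden state at once, and monitoring amounts to tracking the projection scores during generation.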
Science
Toward universal steering and monitoring of AI models
19-Feb-2026