Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms
Abstract
A novel method called Steering Target Atoms isolates and manipulates disentangled knowledge components in language models to improve safety, robustness, and flexibility, especially in adversarial scenarios.
Precise control over language model generation is vital for ensuring both safety and reliability. Although prompt engineering and steering are commonly used to intervene in model behaviors, the vast number of parameters in models often results in highly intertwined internal representations. This interdependency can limit control precision and sometimes lead to unintended side effects. Recent research has explored the use of sparse autoencoders (SAEs) to disentangle knowledge in high-dimensional spaces for steering. However, these applications have been limited to toy tasks because locating atomic knowledge components remains nontrivial. In this paper, we propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety. Comprehensive experiments demonstrate the effectiveness of our approach. Further analysis reveals that steering exhibits superior robustness and flexibility, particularly in adversarial scenarios. We also apply the steering strategy to a large reasoning model, confirming its effectiveness in precise reasoning control.
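To make the mechanism concrete, below is a minimal sketch of SAE-based activation steering in PyTorch. It is not the paper's actual STA implementation: the weight names (`W_enc`, `W_dec`, `b_enc`, `b_dec`), the ReLU encoder, the simple amplification rule, and the `steer_with_sae` helper are all illustrative assumptions.

```python
import torch

def steer_with_sae(hidden, W_enc, b_enc, W_dec, b_dec, target_atoms, alpha=5.0):
    """Amplify selected SAE latents ("atoms") in one layer's activation.

    hidden:       [d_model] residual-stream activation from the language model
    W_enc/b_enc:  SAE encoder weights [d_model, d_sae] and bias [d_sae]
    W_dec/b_dec:  SAE decoder weights [d_sae, d_model] and bias [d_model]
    target_atoms: indices of latents associated with the target behavior
    alpha:        steering strength (>1 amplifies, 0 suppresses)
    """
    # Encode the activation into the sparse, more disentangled latent space.
    latents = torch.relu((hidden - b_dec) @ W_enc + b_enc)   # [d_sae]

    # Intervene only on the target atoms; other components stay untouched,
    # which is what makes this finer-grained than a dense steering vector.
    latents[target_atoms] *= alpha

    # Decode back to the model's activation space and resume the forward pass.
    return latents @ W_dec + b_dec

# Toy usage with random weights (illustration only; a real run would load a
# trained SAE and hook this function into the model's forward pass):
d_model, d_sae = 16, 64
W_enc, W_dec = torch.randn(d_model, d_sae), torch.randn(d_sae, d_model)
b_enc, b_dec = torch.zeros(d_sae), torch.zeros(d_model)
steered = steer_with_sae(torch.randn(d_model), W_enc, b_enc, W_dec, b_dec,
                         target_atoms=[3, 42])
```

In practice, target atoms are typically identified by contrasting SAE activations on paired prompts (e.g., harmful vs. harmless), though the paper's exact selection criterion may differ from this sketch.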
Community
Librarian Bot found the following similar papers, recommended by the Semantic Scholar API:
- SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models (2025)
- Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders (2025)
- Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models (2025)
- Representation Bending for Large Language Model Safety (2025)
- AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender (2025)
- Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering (2025)
- Steering off Course: Reliability Challenges in Steering Language Models (2025)