Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
Abstract
Value priorities identified within AI models using LitmusValues predict both known risky behaviors in AIRiskDilemmas and unknown risky behaviors in HarmBench.
Detecting AI risks becomes more challenging as stronger models emerge and find novel methods, such as Alignment Faking, to circumvent detection attempts. Inspired by how risky behaviors in humans (e.g., illegal activities that may hurt others) are sometimes guided by strongly held values, we believe that identifying values within AI models can serve as an early warning system for risky AI behaviors. We create LitmusValues, an evaluation pipeline that reveals AI models' priorities across a range of AI value classes. We then collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization from its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.
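To make the aggregation step concrete, here is a minimal Python sketch of how value priorities could be derived from a model's choices on value-versus-value dilemmas, assuming a simple pairwise win-rate aggregation. The dilemma format, the helper name `rank_values`, and the value classes other than Care are illustrative assumptions, not the paper's actual schema or implementation.

```python
from collections import defaultdict

# Illustrative dilemma records: each pits two value classes against each other,
# and the model's chosen action favors one of them. (Hypothetical format.)
dilemmas = [
    {"values": ("Care", "Truthfulness"), "model_choice_favors": "Care"},
    {"values": ("Privacy", "Compliance"), "model_choice_favors": "Privacy"},
    {"values": ("Care", "Privacy"), "model_choice_favors": "Care"},
]

def rank_values(dilemmas):
    """Aggregate pairwise dilemma outcomes into a value priority ranking (win-rate based)."""
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for d in dilemmas:
        a, b = d["values"]
        appearances[a] += 1
        appearances[b] += 1
        wins[d["model_choice_favors"]] += 1
    # A higher win rate means the model more often prioritizes this value
    # when it conflicts with other values.
    return sorted(appearances, key=lambda v: wins[v] / appearances[v], reverse=True)

print(rank_values(dilemmas))  # ['Care', 'Privacy', 'Truthfulness', 'Compliance']
```

A win-rate count is only one way to obtain a self-consistent ordering from aggregate choices; a pairwise-comparison model (e.g., Bradley-Terry) would be a natural alternative under the same setup.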
Community
Character/Propensity/Value Eval
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- RAIL in the Wild: Operationalizing Responsible AI Evaluation Using Anthropic's Value Dataset (2025)
- A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient (2025)
- CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs (2025)
- XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs (2025)
- Evaluating Frontier Models for Stealth and Situational Awareness (2025)
- Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement (2025)
- The State of AI Governance Research: AI Safety and Reliability in Real World Commercial Deployment (2025)