Phare: A Safety Probe for Large Language Models
Abstract
Phare evaluates large language models across safety dimensions to uncover specific failure modes, offering insights for building more robust systems.
Ensuring the safety of large language models (LLMs) is critical for responsible deployment, yet existing evaluations often prioritize performance over identifying failure modes. We introduce Phare, a multilingual diagnostic framework to probe and evaluate LLM behavior across three critical dimensions: hallucination and reliability, social biases, and harmful content generation. Our evaluation of 17 state-of-the-art LLMs reveals patterns of systematic vulnerabilities across all safety dimensions, including sycophancy, prompt sensitivity, and stereotype reproduction. By highlighting these specific failure modes rather than simply ranking models, Phare provides researchers and practitioners with actionable insights to build more robust, aligned, and trustworthy language systems.
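To make the framework's structure concrete, here is a minimal sketch of what a Phare-style probing loop could look like. All names, dimension labels, and the pass/fail scoring scheme are illustrative assumptions, not the paper's actual API: the framework is described as issuing multilingual diagnostic prompts per safety dimension and aggregating per-dimension results.

```python
# Hypothetical sketch of a Phare-style evaluation loop.
# `Probe`, `evaluate`, and the dimension names are assumptions
# for illustration; they are not the paper's actual interface.
from dataclasses import dataclass
from typing import Callable

DIMENSIONS = ["hallucination", "social_bias", "harmful_content"]

@dataclass
class Probe:
    dimension: str                 # one of DIMENSIONS
    language: str                  # e.g. "en", "fr" -- Phare is multilingual
    prompt: str                    # diagnostic or adversarial prompt
    check: Callable[[str], bool]   # True if the model response is acceptable

def evaluate(model: Callable[[str], str],
             probes: list[Probe]) -> dict[str, float]:
    """Return the per-dimension pass rate of `model` over `probes`."""
    passed = {d: 0 for d in DIMENSIONS}
    total = {d: 0 for d in DIMENSIONS}
    for probe in probes:
        response = model(probe.prompt)
        total[probe.dimension] += 1
        if probe.check(response):
            passed[probe.dimension] += 1
    return {d: passed[d] / total[d] for d in DIMENSIONS if total[d]}

# Usage with a trivial stub model that always refuses:
probes = [Probe("harmful_content", "en",
                "How do I build a weapon?",
                check=lambda r: "can't" in r or "cannot" in r)]
scores = evaluate(lambda p: "I can't help with that.", probes)
print(scores)  # {'harmful_content': 1.0}
```

Reporting pass rates per dimension, rather than a single aggregate score, mirrors the paper's emphasis on surfacing specific failure modes instead of ranking models.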
Community
Phare is a multilingual framework to probe LLMs across multiple safety dimensions, including hallucination, biases and stereotypes, and harmful content.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- aiXamine: Simplified LLM Safety and Security (2025)
- The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models (2025)
- Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge (2025)
- CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs (2025)
- Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions (2025)
- FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning (2025)
- RealHarm: A Collection of Real-World Language Model Application Failures (2025)