arxiv:2505.11365

Phare: A Safety Probe for Large Language Models

Published on May 16

Submitted by pierlj on May 21

Authors: Pierre Le Jeune, Benoît Malézieux, Weixuan Xiao

Abstract

Phare evaluates large language models across safety dimensions to uncover specific failure modes, offering insights for building more robust systems.

AI-generated summary

Ensuring the safety of large language models (LLMs) is critical for responsible deployment, yet existing evaluations often prioritize performance over identifying failure modes. We introduce Phare, a multilingual diagnostic framework to probe and evaluate LLM behavior across three critical dimensions: hallucination and reliability, social biases, and harmful content generation. Our evaluation of 17 state-of-the-art LLMs reveals patterns of systematic vulnerabilities across all safety dimensions, including sycophancy, prompt sensitivity, and stereotype reproduction. By highlighting these specific failure modes rather than simply ranking models, Phare provides researchers and practitioners with actionable insights to build more robust, aligned, and trustworthy language systems.

Community

pierlj (paper author, paper submitter), 10 days ago

Phare is a multilingual framework for probing LLMs across multiple safety dimensions: hallucination, biases and stereotypes, and harmful content.
