Abstract
A new automated method using concept-based vectors and a Hierarchical Multi-Domain Regression model improves preference explanations and predictions for large language models.
Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet, the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated method for generating local and global concept-based explanations of preferences across multiple domains. Our method utilizes an LLM to identify concepts that distinguish between chosen and rejected responses, and to represent them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve mechanisms. Our method achieves strong preference prediction performance, outperforming baselines while also being explainable. Additionally, we assess explanations in two application-driven settings. First, guiding LLM outputs with concepts from LaaJ explanations yields responses that those judges consistently prefer. Second, prompting LaaJs with concepts explaining humans improves their preference predictions. Together, our work establishes a new paradigm for explainability in the era of LLMs.
Community
(based on a thread on Twitter)
Preferences drive modern LLM research and development: from model alignment to evaluation.
But how well do we understand them?
Excited to share our new preprint:
Multi-domain Explainability of Preferences
We propose a fully automated method for explaining the preferences of three mechanism types:
- Human preferences (used to train reward models and for evaluation)
- LLM-as-a-Judge (the de facto standard for automatic evaluation)
- Reward models (used in RLHF/RLAIF for alignment)
Our four-stage method:
- Use LLM to discover concepts that distinguish between chosen and rejected responses.
- Represent responses as concept vectors.
- Train a logistic regression model to predict preferences.
- Extract concept importance from the model weights (see the sketch below).
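A minimal sketch of stages 2-4, assuming stage 1 has already produced concept scores and that the preference model is fed the difference of the two responses' concept vectors (one possible formulation; the concept names, scores, and labels below are illustrative, not from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative concept-difference vectors (chosen minus rejected response),
# one row per preference pair, one column per discovered concept.
concepts = ["clarity", "accuracy", "confidence", "helpfulness"]
X = np.array([
    [ 0.7,  0.2, -0.1,  0.5],
    [-0.3,  0.6,  0.4,  0.1],
    [ 0.1, -0.4, -0.8, -0.3],
    [ 0.5,  0.1,  0.2, -0.2],
])
y = np.array([1, 1, 0, 1])  # 1 = the "chosen" response was indeed preferred

# Stage 3: a white-box logistic regression preference model (L1 for sparsity).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

# Stage 4: concept importance can be read off the learned weights.
for name, w in zip(concepts, clf.coef_[0]):
    print(f"{name}: {w:+.3f}")
```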
Our special focus is on multi-domain learning:
Concepts affect preference decisions differently across domains.
A concept that is important in one domain may be irrelevant in another.
To address this, we introduce a white-box Hierarchical Multi-Domain Regression (HMDR) model:
The HMDR model is optimized to:
• Make shared weights strongly predictive → improves OOD generalization.
• Encourage sparsity (L1 regularization) → simpler explanations.
Finally, concept importance is the lift in probability: the % change in the predicted preference probability when a concept is increased by one unit.
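A toy sketch of the hierarchical scoring and of the lift computation; the weights, domains, and concept values are made up (in the real model both weight groups are learned jointly under the L1-regularized objective above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights: w_shared applies in every domain; w_domain[d] is a sparse
# domain-specific correction learned on top of it.
concepts = ["clarity", "accuracy", "confidence", "helpfulness"]
w_shared = np.array([0.8, 1.1, 0.2, 0.6])
w_domain = {
    "coding": np.array([0.0, 0.9, 0.0, -0.3]),
    "travel": np.array([0.4, 0.0, 0.0, 0.0]),
}

def preference_prob(x, domain):
    # x: concept-difference vector for a (chosen, rejected) response pair
    return sigmoid(x @ (w_shared + w_domain[domain]))

def lift(x, domain, j):
    # Lift in probability: change in the predicted preference probability
    # when concept j is increased by one unit (reported here in % points).
    bumped = x.copy()
    bumped[j] += 1.0
    return 100 * (preference_prob(bumped, domain) - preference_prob(x, domain))

x = np.array([0.3, -0.2, 0.1, 0.4])
for j, name in enumerate(concepts):
    print(f"{name}: {lift(x, 'coding', j):+.1f}% lift in the coding domain")
```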
The resulting explanations are quite interesting!
Below is an example of human preferences across five domains.
How to read it?
Light bars show the shared contribution to the score, while dark bars and arrows indicate domain-specific contributions.
How do we know our explanations are good?
✅ Human Evaluation: LLM concept annotations closely match human annotations.
✅ Preference Prediction: Our method is comparable to human preference models.
The HMDR model outperforms other white-box models both in-domain & OOD.
We assess explanations in two application-driven settings:
Can we "hack" the judge?
Using LLM-as-a-judge explanations, we guide another LLM's responses (by asking it to follow the top concepts).
Result: Judges consistently prefer the explanation-guided outputs over those from regular prompts.
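A hypothetical illustration of how this guidance could look; the concept phrasings and the prompt template are placeholders, not the paper's actual prompts:

```python
# Top concepts taken from a judge's explanation (illustrative phrasings).
top_concepts = [
    "be clear and well-structured",
    "state claims with appropriate confidence",
    "focus on accuracy and helpfulness",
]

def guided_prompt(question: str) -> str:
    # Ask the responding LLM to follow the judge's most important concepts.
    bullets = "\n".join(f"- {c}" for c in top_concepts)
    return (
        "Answer the question below. While writing, emphasize these qualities:\n"
        f"{bullets}\n\nQuestion: {question}"
    )

print(guided_prompt("How should I structure a study plan for learning Rust?"))
```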
Breaking Ties in LLM-as-Judges
LLMs often produce inconsistent preferences when the order of responses is flipped (10–30% of the time!).
We guide LLM judges using top human-derived concepts to break ties.
Result: Clear gains in human preference alignment on tied cases.
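A hedged sketch of this tie-breaking protocol: detect order-flip inconsistency, then re-prompt the judge with concepts that explain human preferences (the mock judge and the hint wording are placeholders, not the paper's actual prompts):

```python
import random

def mock_judge(prompt, resp_a, resp_b, hints=None):
    # Stand-in for an LLM-as-a-Judge API call; returns "A" or "B".
    return random.choice(["A", "B"])

def verdict_with_tie_breaking(judge, prompt, resp_a, resp_b, human_concepts):
    forward = judge(prompt, resp_a, resp_b)
    backward = judge(prompt, resp_b, resp_a)  # same pair, order flipped
    # Consistent verdicts: the same underlying response wins in both orders.
    if (forward, backward) in {("A", "B"), ("B", "A")}:
        return forward
    # Tied (order-dependent) case: guide the judge with human-derived concepts.
    hints = "When deciding, prioritize: " + ", ".join(human_concepts)
    return judge(prompt, resp_a, resp_b, hints=hints)

print(verdict_with_tie_breaking(
    mock_judge, "Explain recursion.", "response A ...", "response B ...",
    human_concepts=["clarity", "authority", "confidence"],
))
```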
Finally, we analyze our explanations by comparing our auto-discovered concepts with findings from prior studies based on manually curated concepts.
We reproduced many of them!
Humans prioritize clarity, authority, and confidence, while LLMs emphasize accuracy and helpfulness.
Importantly, we found that domain-specific concepts dominate many preference mechanisms.
Our two key contributions:
1. Automatic concept discovery
2. Multi-domain modeling
Together, they provide a scalable and generalizable approach to modeling NLP preferences.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Flex-Judge: Think Once, Judge Anywhere (2025)
- R3: Robust Rubric-Agnostic Reward Models (2025)
- Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge (2025)
- Do LLM Evaluators Prefer Themselves for a Reason? (2025)
- Synergistic Weak-Strong Collaboration by Aligning Preferences (2025)
- Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment (2025)
- CHARM: Calibrating Reward Models With Chatbot Arena Scores (2025)