Abstract
A new automated method using concept-based vectors and a Hierarchical Multi-Domain Regression model improves preference explanations and predictions for large language models.
Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet, the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated method for generating local and global concept-based explanations of preferences across multiple domains. Our method utilizes an LLM to identify concepts that distinguish between chosen and rejected responses, and to represent them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve mechanisms. Our method achieves strong preference prediction performance, outperforming baselines while also being explainable. Additionally, we assess explanations in two application-driven settings. First, guiding LLM outputs with concepts from LaaJ explanations yields responses that those judges consistently prefer. Second, prompting LaaJs with concepts explaining humans improves their preference predictions. Together, our work establishes a new paradigm for explainability in the era of LLMs.
Community
(based on a thread on Twitter)
Preferences drive modern LLM research and development: from model alignment to evaluation.
But how well do we understand them?
Excited to share our new preprint:
Multi-domain Explainability of Preferences
We propose a fully automated method for explaining the preferences of three mechanism types:
- Human preferences (used to train reward models and for evaluation)
- LLM-as-a-Judge (the de facto standard for automatic evaluation)
- Reward models (used in RLHF/RLAIF for alignment)
Our four-stage method:
- Use LLM to discover concepts that distinguish between chosen and rejected responses.
- Represent responses as concept vectors.
- Train a logistic regression model to predict preferences.
- Extract concept importance from the model weights (see the sketch below).
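A minimal sketch of stages 2-4, assuming stage 1 has already produced concept scores and that the preference model is fed the difference of the two responses' concept vectors (one possible formulation; the concept names, scores, and labels below are illustrative, not from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative concept-difference vectors (chosen minus rejected response),
# one row per preference pair, one column per discovered concept.
concepts = ["clarity", "accuracy", "confidence", "helpfulness"]
X = np.array([
    [ 0.7,  0.2, -0.1,  0.5],
    [-0.3,  0.6,  0.4,  0.1],
    [ 0.1, -0.4, -0.8, -0.3],
    [ 0.5,  0.1,  0.2, -0.2],
])
y = np.array([1, 1, 0, 1])  # 1 = the "chosen" response was indeed preferred

# Stage 3: a white-box logistic regression preference model (L1 for sparsity).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

# Stage 4: concept importance can be read off the learned weights.
for name, w in zip(concepts, clf.coef_[0]):
    print(f"{name}: {w:+.3f}")
```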
Our special focus is on multi-domain learning:
Concepts affect preference decisions differently across domains.
A concept that is important in one domain may be irrelevant in another.
To address this, we introduce a white-box Hierarchical Multi-Domain Regression (HMDR) model:
The HMDR model is optimized to:
• Make shared weights strongly predictive → improves OOD generalization.
• Encourage sparsity (L1 regularization) → simpler explanations.
Finally, concept importance is the lift in probability: the % change in the predicted preference probability when a concept is increased by one unit.
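A toy sketch of the hierarchical scoring and of the lift computation; the weights, domains, and concept values are made up (in the real model both weight groups are learned jointly under the L1-regularized objective above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights: w_shared applies in every domain; w_domain[d] is a sparse
# domain-specific correction learned on top of it.
concepts = ["clarity", "accuracy", "confidence", "helpfulness"]
w_shared = np.array([0.8, 1.1, 0.2, 0.6])
w_domain = {
    "coding": np.array([0.0, 0.9, 0.0, -0.3]),
    "travel": np.array([0.4, 0.0, 0.0, 0.0]),
}

def preference_prob(x, domain):
    # x: concept-difference vector for a (chosen, rejected) response pair
    return sigmoid(x @ (w_shared + w_domain[domain]))

def lift(x, domain, j):
    # Lift in probability: change in the predicted preference probability
    # when concept j is increased by one unit (reported here in % points).
    bumped = x.copy()
    bumped[j] += 1.0
    return 100 * (preference_prob(bumped, domain) - preference_prob(x, domain))

x = np.array([0.3, -0.2, 0.1, 0.4])
for j, name in enumerate(concepts):
    print(f"{name}: {lift(x, 'coding', j):+.1f}% lift in the coding domain")
```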
The resulting explanations are quite interesting!
Below is an example of human preferences across five domains.
How to read it?
Light bars show the shared contribution to the score, while dark bars and arrows indicate domain-specific contributions.
How do we know our explanations are good?
✅ Human Evaluation: LLM concept annotations closely match human annotations.
✅ Preference Prediction: Our method is comparable to human preference models.
The HMDR model outperforms other white-box models both in-domain & OOD.
We assess explanations in two application-driven settings:
Can we "hack" the judge?
Using LLM-as-a-judge explanations, we guide another LLM's responses (by asking it to follow the top concepts).
Result: Judges consistently prefer the explanation-guided outputs over those from regular prompts.
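A hypothetical illustration of how this guidance could look; the concept phrasings and the prompt template are placeholders, not the paper's actual prompts:

```python
# Top concepts taken from a judge's explanation (illustrative phrasings).
top_concepts = [
    "be clear and well-structured",
    "state claims with appropriate confidence",
    "focus on accuracy and helpfulness",
]

def guided_prompt(question: str) -> str:
    # Ask the responding LLM to follow the judge's most important concepts.
    bullets = "\n".join(f"- {c}" for c in top_concepts)
    return (
        "Answer the question below. While writing, emphasize these qualities:\n"
        f"{bullets}\n\nQuestion: {question}"
    )

print(guided_prompt("How should I structure a study plan for learning Rust?"))
```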
Breaking Ties in LLM-as-Judges
LLMs often produce inconsistent preferences when the order of responses is flipped (10–30% of the time!).
We guide LLM judges using top human-derived concepts to break ties.
Result: Clear gains in human preference alignment on tied cases.
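A hedged sketch of this tie-breaking protocol: detect order-flip inconsistency, then re-prompt the judge with concepts that explain human preferences (the mock judge and the hint wording are placeholders, not the paper's actual prompts):

```python
import random

def mock_judge(prompt, resp_a, resp_b, hints=None):
    # Stand-in for an LLM-as-a-Judge API call; returns "A" or "B".
    return random.choice(["A", "B"])

def verdict_with_tie_breaking(judge, prompt, resp_a, resp_b, human_concepts):
    forward = judge(prompt, resp_a, resp_b)
    backward = judge(prompt, resp_b, resp_a)  # same pair, order flipped
    # Consistent verdicts: the same underlying response wins in both orders.
    if (forward, backward) in {("A", "B"), ("B", "A")}:
        return forward
    # Tied (order-dependent) case: guide the judge with human-derived concepts.
    hints = "When deciding, prioritize: " + ", ".join(human_concepts)
    return judge(prompt, resp_a, resp_b, hints=hints)

print(verdict_with_tie_breaking(
    mock_judge, "Explain recursion.", "response A ...", "response B ...",
    human_concepts=["clarity", "authority", "confidence"],
))
```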
Finally, we analyze our explanations by comparing our auto-discovered concepts with findings from prior studies based on manually curated concepts.
We reproduced many of them!
Humans prioritize clarity, authority, and confidence, while LLMs emphasize accuracy and helpfulness.
Importantly, we found that domain-specific concepts dominate many preference mechanisms.
Our two key contributions:
1. Automatic concept discovery
2. Multi-domain modeling
Together, they provide a scalable and generalizable approach to modeling NLP preferences.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Flex-Judge: Think Once, Judge Anywhere (2025)
- R3: Robust Rubric-Agnostic Reward Models (2025)
- Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge (2025)
- Do LLM Evaluators Prefer Themselves for a Reason? (2025)
- Synergistic Weak-Strong Collaboration by Aligning Preferences (2025)
- Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment (2025)
- CHARM: Calibrating Reward Models With Chatbot Arena Scores (2025)