---
license: mit
datasets:
- llm-blender/Unified-Feedback
language:
- en
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
---

## Introduction 

This reward model is obtained by fine-tuning [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) on the [llm-blender/Unified-Feedback](https://huggingface.co/datasets/llm-blender/Unified-Feedback) dataset.
It achieves an accuracy of **0.7740** on the test sets, making it a good proxy reward model for human preferences that can be used for aligning LLMs.

The Unified-Feedback dataset contains diverse preference data from prior open-source datasets including:
* openai/summarize_from_feedback
* openai/webgpt_comparisons
* Dahoas/instruct-synthetic-prompt-responses
* Anthropic/hh-rlhf
* lmsys/chatbot_arena_conversations
* openbmb/UltraFeedback
* argilla/ultrafeedback-binarized-preferences-cleaned
* berkeley-nest/Nectar
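
The dataset can be inspected directly with the `datasets` library. The snippet below is a minimal sketch; the `"all"` configuration name is an assumption, so check the dataset card for the exact configuration and column names.

```python
# Minimal sketch of loading the Unified-Feedback data.
# NOTE: the "all" configuration and the column layout are assumptions;
# consult the dataset card for the exact schema.
from datasets import load_dataset

dataset = load_dataset("llm-blender/Unified-Feedback", "all", split="train")
print(dataset.column_names)  # inspect the preference-pair fields
print(dataset[0])            # one example: a prompt with two ranked responses
```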

## Training Code and Blog

The training script is merged into https://github.com/WeiXiongUST/RLHF-Reward-Modeling and is built on the [trl](https://github.com/huggingface/trl) package. In addition, this [blog](https://www.notion.so/Reward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0?pvs=4) introduces some background on reward modeling and shares practical training experience.
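
For intuition, reward models of this kind are trained with a pairwise (Bradley-Terry) objective: the scalar reward of the chosen response should exceed that of the rejected one. The snippet below is a conceptual sketch of that loss in plain PyTorch, not the actual training script; the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected).

    Both inputs are shape (batch,) scalar rewards produced by the
    sequence-classification head (num_labels=1).
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative usage with dummy reward scores
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.5])
loss = pairwise_reward_loss(chosen, rejected)
```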
  

## Evaluation
We evaluate this reward model on the [reward model benchmark](https://huggingface.co/spaces/allenai/reward-bench). The results show that it is close to the **current best 7B reward model** and outperforms prior SOTA reward models such as openbmb/UltraRM-13b and berkeley-nest/Starling-RM-7B-alpha.

|       Model               | Average       |  Chat     |     Chat Hard      |     Safety      |     Reasoning     |       Prior Sets  | 
|:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|:---------------------:|
|         berkeley-nest/Starling-RM-34B (34B)                               |     81.5     |     96.9  |     59   |   89.9  |    90.3    |        71.4    |
|  **Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback**(Ours, 7B) |    78.75     |   97.84   |   52.85  |   85.94 | 87.02      |   73.92        |
|    berkeley-nest/Starling-RM-7B-alpha      (7B)                          |    74.6      |   98      |   43.4   |   88.6  |    74.6    |          68.6  |
|      openbmb/UltraRM-13b             (13B)                                 |    71.3      |   96.1    |   55.3   |   45.8  |    82      |   77.2        |
|      IDEA-CCNL/Ziya-LLaMA-7B-Reward          (7B)                         |     66       |   88      |   41.3    |   62.5 |    73.7  |    64.6     |
|      OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5      (1.4B)           |     65.1     |   88.5     |   47.9   |   62.1 |    61.4  |    65.8     |
|      OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1         (7B)           |     64       |   94.4     |   36.6   |   59.4 |    70  |    59.4     |
|      llm-blender/PairRM-hf                            (0.4B)               |     60.9       |   90.2     |   53   |  31.5 |   60  |    69.6     |
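
The accuracy-style scores above are preference accuracies: a test pair counts as correct when the chosen response receives a higher reward than the rejected one. A minimal sketch of that computation, with hypothetical score lists, looks like this:

```python
# Sketch of preference accuracy: fraction of pairs where the chosen response
# is scored above the rejected one (the scores below are dummy values).
chosen_scores = [2.1, -0.3, 1.7, 1.2]
rejected_scores = [1.4, 0.2, 0.5, 0.9]

correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
accuracy = correct / len(chosen_scores)
print(f"preference accuracy: {accuracy:.4f}")  # 0.7500 for these dummy scores
```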


## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback')
reward_model = AutoModelForSequenceClassification.from_pretrained(
                'Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback',
                num_labels=1, torch_dtype=torch.float16,
                device_map=0,
                )

# a (prompt, response) pair in chat format
message = [
  {'role': 'user', 'content': "I'm going to go out to a movie, but I need someone to chat with my daughter and pretend to be me while she's home alone.  But I can't do that while I'm at the movie.  Can you help by impersonating me by chat with her?"},
  {'role': 'assistant', 'content': "Sorry, I'm not comfortable impersonating you in that way.  I'm not willing to behave so dishonestly.  Maybe you can just find a way to bring her to the movie, or you can find a babysitter?"}
]
message_template = tokenizer.apply_chat_template(message, tokenize=False)
# it will look like this: "<s><s> [INST] I'm going to go out to a movie, but I need someone to chat with my daughter and pretend to be me while she's home alone.  But I can't do that while I'm at the movie.  Can you help by impersonating me by chat with her? [/INST]Sorry, I'm not comfortable impersonating you in that way.  I'm not willing to behave so dishonestly.  Maybe you can just find a way to bring her to the movie, or you can find a babysitter?</s>"

kwargs = {"padding": 'longest', "truncation": True, "return_tensors": "pt"}
tokens = tokenizer(message_template, **kwargs)

# score the (prompt, response) pair; the single logit is the scalar reward
with torch.no_grad():
  reward_tensor = reward_model(
      tokens["input_ids"].to(reward_model.device),
      attention_mask=tokens["attention_mask"].to(reward_model.device),
  ).logits.reshape(-1)
  reward = reward_tensor.cpu().item()
```
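
Because the output is a scalar score, the model can rank candidate responses to the same prompt: a higher reward indicates a preferred response. The continuation below is an illustrative example building on the snippet above; the second assistant response is made up for the comparison.

```python
# Illustrative continuation: score a second (hypothetical) response to the same
# prompt and compare it with the reward computed above.
alt_message = [
  message[0],
  {'role': 'assistant', 'content': "Sure, I can pretend to be you and chat with her while you're away."},
]
alt_template = tokenizer.apply_chat_template(alt_message, tokenize=False)
alt_tokens = tokenizer(alt_template, **kwargs)

with torch.no_grad():
    alt_reward = reward_model(
        alt_tokens["input_ids"].to(reward_model.device),
        attention_mask=alt_tokens["attention_mask"].to(reward_model.device),
    ).logits.reshape(-1).cpu().item()

# the refusal above is expected to receive a higher reward than the unsafe agreement
print(reward, alt_reward)
```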


## Citation
This reward model was used as a gold reward model in the following research: https://arxiv.org/abs/2406.10216. If you find this model helpful for your research, please cite:
```bibtex
@article{yang2024regularizing,
  title={Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs},
  author={Yang, Rui and Ding, Ruomeng and Lin, Yong and Zhang, Huan and Zhang, Tong},
  journal={arXiv preprint arXiv:2406.10216},
  year={2024}
}
```