IDEA-CCNL/Randeng-DELLA-226M-Chinese

@@ -28,9 +28,6 @@ A deep VAE model pretrained on Wudao dataset. Both encoder and decoder are based
 Reference paper: [Fuse It More Deeply! A Variational Transformer with Layer-Wise Latent Variable Inference for Text Generation](https://arxiv.org/abs/2207.06130)
-基于[Randeng-Pegasus-523M-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-Pegasus-523M-Chinese)，我们在收集的7个中文领域的文本摘要数据集（约4M个样本）上微调了它，得到了summary版本。这7个数据集为：education, new2016zh, nlpcc, shence, sohu, thucnews和weibo。
-Based on [Randeng-Pegasus-523M-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-Pegasus-523M-Chinese), we fine-tuned a text summarization version (summary) on 7 Chinese text summarization datasets totaling around 4M samples. The datasets are: education, new2016zh, nlpcc, shence, sohu, thucnews and weibo.
 ### 下游效果 Performance
@@ -41,27 +38,65 @@ Based on [Randeng-Pegasus-523M-Chinese](https://huggingface.co/IDEA-CCNL/Randeng
 ## 使用 Usage
 ```python
-from transformers import PegasusForConditionalGeneration
-# Download tokenizers_pegasus.py and the other required Python scripts from the Fengshenbang-LM GitHub repo in advance,
-# or download tokenizers_pegasus.py and data_utils.py from https://huggingface.co/IDEA-CCNL/Randeng_Pegasus_523M/tree/main
-# We strongly recommend cloning the Fengshenbang-LM repo:
-# 1. git clone https://github.com/IDEA-CCNL/Fengshenbang-LM
-# 2. cd Fengshenbang-LM/fengshen/examples/pegasus/
-# There you will find tokenizers_pegasus.py and data_utils.py, which the Pegasus model needs
-from tokenizers_pegasus import PegasusTokenizer
-model = PegasusForConditionalGeneration.from_pretrained("IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese")
-tokenizer = PegasusTokenizer.from_pretrained("IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese")
-text = "据微信公众号“界面”报道，4日上午10点左右，中国发改委反垄断调查小组突击查访奔驰上海办事处，调取数据材料，并对多名奔驰高管进行了约谈。截止昨日晚9点，包括北京梅赛德斯-奔驰销售服务有限公司东区总经理在内的多名管理人员仍留在上海办公室内"
-inputs = tokenizer(text, max_length=1024, return_tensors="pt")
-# Generate Summary
-summary_ids = model.generate(inputs["input_ids"])
-tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
-# model Output: 反垄断调查小组突击查访奔驰上海办事处，对多名奔驰高管进行约谈
 ```
 ## 引用 Citation

 Reference paper: [Fuse It More Deeply! A Variational Transformer with Layer-Wise Latent Variable Inference for Text Generation](https://arxiv.org/abs/2207.06130)
 ### 下游效果 Performance
 ## 使用 Usage
 ```python
+# Check out the latest Fengshenbang-LM repository and run the following script from the Fengshenbang-LM root directory
+import sys
+import torch
+import argparse
+from torch.nn.utils.rnn import pad_sequence
+from fengshen.models.deepVAE.vae_pl_module import DeepVAEModule
+if __name__ == "__main__":
+    # TODO: Update both paths to point to the downloaded model directory
+    checkpoint_path = '..../Randeng-DELLA-226M-Chinese'
+    gpt2_model_path = '..../Randeng-DELLA-226M-Chinese'
+    args_parser = argparse.ArgumentParser()
+    args_parser.add_argument("--checkpoint_path", type=str, default=checkpoint_path)
+    args_parser.add_argument("--gpt2_model_path", type=str, default=gpt2_model_path)
+    args_parser.add_argument("--latent_dim", type=int, default=256)
+    args_parser.add_argument("--beta_kl_constraints_start", type=float, default=1e-5)
+    args_parser.add_argument("--beta_kl_constraints_stop", type=float, default=1.)
+    args_parser.add_argument("--beta_n_cycles", type=int, default=10)
+    args_parser.add_argument("--latent_lmf_rank", type=int, default=4)
+    args_parser.add_argument("--CVAE", action='store_true')
+    args_parser.add_argument("--share_param", action='store_false',
+        help="decoder and encoder parameters are shared by default; pass this flag to disable sharing")
+    args, unknown_args = args_parser.parse_known_args()
+    # load model
+    model, tokenizer = DeepVAEModule.load_model(args, labels_dict=None)
+    # VAE generation
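+    # Input (Chinese): "This model is a VAE pretrained on a general-domain dataset; fine-tune it on a specific domain for best results."
+    sentence = "本模型是在通用数据集下预训练的VAE模型，如要获得最佳效果请在特定领域微调后使用。"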
+    tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentence))
+    decoder_target = [tokenizer.bos_token_id] + tokenized_text + [tokenizer.eos_token_id]
+    inputs = []
+    inputs.append(torch.tensor(decoder_target, dtype=torch.long))
+    inputs = pad_sequence(inputs, batch_first=True, padding_value=0)
+    max_length = 256
+    top_p = 0.5
+    top_k = 0
+    temperature = .7
+    repetition_penalty = 1.0
+    sample = False
+    device = 0
+    model = model.eval()
+    model = model.to(device)
+    outputs = model.inference(inputs.to(device), top_p=top_p, top_k=top_k, max_length=max_length, sample=sample,
+        temperature=temperature, repetition_penalty=repetition_penalty)
+    for gen_sent, orig_sent in zip(outputs, inputs):
+        print('orig_sent:', tokenizer.decode(orig_sent).replace(' ', ''))
+        print('gen_sent:', tokenizer.decode(gen_sent).replace(' ', ''))
+        print("-"*20)
 ```
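
The usage script passes `top_p`, `top_k` and `temperature` to `model.inference`. As a rough illustration of what a nucleus (top-p) cutoff with temperature scaling usually means, here is a minimal pure-Python sketch. It is not the `DeepVAEModule` implementation: `top_p_filter` is a hypothetical helper written for this example, and how `model.inference` actually applies these parameters (in particular when `sample = False`) is not confirmed here.

```python
import math


def top_p_filter(logits, top_p=0.5, temperature=0.7):
    """Hypothetical helper: return {token_index: renormalized probability}
    for the smallest set of tokens whose cumulative probability >= top_p."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least probable until the mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the kept tokens so they sum to 1.
    return {i: probs[i] / mass for i in kept}


if __name__ == "__main__":
    # Toy logits for a 3-token vocabulary (made-up values for illustration).
    print(top_p_filter([2.0, 1.0, 0.0], top_p=0.9, temperature=1.0))
```

A lower `top_p` or a lower `temperature` both narrow the candidate set: with the toy logits above, `top_p=0.5` keeps only the single most probable token, while `top_p=0.9` keeps two.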
 ## 引用 Citation