Update README.md

## Model description

This model is t5-base fine-tuned on 300K Vietnamese news articles from sites such as vnexpress.net, dantri.com.vn, and laodong.vn, to tag articles using their textual content as input. For Vietnamese, there are several state-of-the-art models: ViELECTRA [1], PhoNLP [2], ViT5 [3], and ViDeBERTa [4]. These models are widely applied to Part-of-Speech (POS) tagging, dependency parsing, Named Entity Recognition (NER), and summarization. However, to our knowledge, no existing Vietnamese model addresses the article-tagging problem.

We briefly introduce these SOTA Vietnamese models as follows:

- ViDeBERTa [4]: the authors present three versions, ViDeBERTa_xsmall, ViDeBERTa_base, and ViDeBERTa_large, with 22M, 86M, and 304M backbone parameters, respectively.
- PhoNLP [2]: two versions are provided, with 135M parameters for PhoBERT_base and 370M parameters for PhoBERT_large.
- ViT5 [3]: the authors adapt t5-base (310M parameters) and t5-large (866M parameters), giving two models, ViT5-base and ViT5-large.

For the tagging problem, building on ViT5, we fine-tuned ViT5-base (and ViT5-large) and compared their efficiency in terms of quality and running time.

## Dataset

The dataset consists of Vietnamese news articles and their tags, where the tags were assigned by humans. We crawled 300K Vietnamese news articles from sites such as vnexpress.net, dantri.com.vn, and laodong.vn, and split them into two parts: 250K articles for training and 50K for testing. From each article we extract two components, the title and the tags, which form the input of the training phase. All data is preprocessed beforehand.
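
A minimal sketch of what this title-to-tags pairing could look like is shown below; the column names and the comma-joined target format are assumptions for illustration, not the exact preprocessing pipeline.

```python
import pandas as pd

# Hypothetical schema of the crawled articles; the real column names may differ.
articles = pd.DataFrame({
    "title": ["Giá vàng hôm nay tăng mạnh"],
    "tags": [["giá vàng", "kinh tế"]],
})

def to_example(row):
    # Source text: the article title; target text: the human-assigned tags
    # joined into a single string (an assumed target format).
    return {"source": row["title"], "target": ", ".join(row["tags"])}

examples = [to_example(row) for _, row in articles.iterrows()]
print(examples[0])  # {'source': 'Giá vàng hôm nay tăng mạnh', 'target': 'giá vàng, kinh tế'}
```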

## Evaluation results

We evaluate our models with the following ROUGE metrics:

```
[{'rouge1': 0.4159717966204846},
 {'rouge2': 0.25983482833746485},
 {'rougeL': 0.3770318612006469},
 {'rougeLsum': 0.37699834479994276}]
```
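
The keys match those of the Hugging Face `evaluate` ROUGE metric. A minimal sketch of how such scores can be computed is given below; the prediction and reference strings are placeholders, not data from our test set.

```python
import evaluate  # pip install evaluate rouge_score

rouge = evaluate.load("rouge")

# Placeholder strings; in our setting these would be the generated tags and
# the human-assigned tags for each test article.
predictions = ["gold price, economy, finance"]
references = ["gold price, economy"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```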

## Training hyperparameters

The following hyperparameters were used during training:

```
do_train=True,
do_eval=False,
num_train_epochs=3,
learning_rate=1e-5,
warmup_ratio=0.05,
weight_decay=0.01,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
group_by_length=True,
save_strategy="epoch",
save_total_limit=10,
```
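
These flags correspond to fields of Hugging Face `Seq2SeqTrainingArguments`. A minimal sketch of how they could be assembled is given below; the `output_dir` value is a placeholder, and the resulting object would still be passed to a `Seq2SeqTrainer` together with the model, tokenizer, and tokenized training set.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: maps the listed hyperparameters onto Seq2SeqTrainingArguments.
training_args = Seq2SeqTrainingArguments(
    output_dir="vit5_tagging",  # placeholder output directory
    do_train=True,
    do_eval=False,
    num_train_epochs=3,
    learning_rate=1e-5,
    warmup_ratio=0.05,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    group_by_length=True,
    save_strategy="epoch",
    save_total_limit=10,
)
```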

We also evaluated the models on a dataset of 20K Vietnamese videos crawled from YouTube. For each video we extract the title and, if available, the tags; the title is the input to our models. For videos that already have tags, we directly compare the tags generated by the models with the existing ones; otherwise, the generated tags are evaluated by humans. The results are available at: https://drive.google.com/drive/folders/1RvywNl41QYNa2lthp-O8hakVCMsfX456
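
As a hedged illustration of the "compare with existing tags" step, one simple option is a set-overlap score between the generated and the existing tag lists; the exact protocol used for the 20K videos may differ.

```python
def tag_overlap(predicted: str, reference: str) -> float:
    """Jaccard overlap between two comma-separated tag strings (illustrative only)."""
    pred = {t.strip().lower() for t in predicted.split(",") if t.strip()}
    ref = {t.strip().lower() for t in reference.split(",") if t.strip()}
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)

print(tag_overlap("giá vàng, kinh tế", "kinh tế, tài chính"))  # 1 shared tag out of 3 -> 0.33
```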

## References

[1] T. V. Bui, O. T. Tran, and P. Le-Hong. 2020. Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models. In Proceedings of PACLIC 2020. Code: https://github.com/fpt-corp/vELECTRA

[2] Dat Quoc Nguyen and Anh-Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1037–1042.

[3] Long Phan, Hieu Tran, Hieu Nguyen, and Trieu H. Trinh. 2022. ViT5: Pretrained text-to-text transformer for Vietnamese language generation. arXiv preprint arXiv:2205.06457. Code: https://github.com/vietai/ViT5

## How to use the model

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("banhabang/t5_model_75000")
model = AutoModelForSeq2SeqLM.from_pretrained("banhabang/t5_model_75000")
model.to("cuda")  # requires a CUDA GPU; use "cpu" otherwise

# `ytb` is a table of crawled YouTube videos; the i-th title is the model input.
encoding = tokenizer(ytb["Title"][i], return_tensors="pt")
input_ids = encoding["input_ids"].to("cuda")
attention_mask = encoding["attention_mask"].to("cuda")

outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=30,
    early_stopping=True,
)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(decoded)
```
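
The decoded string holds the predicted tags for the given title. Assuming the model emits them as one comma-separated string (an assumption about the output format, not something the card states), they can be split into a list as follows:

```python
# Split the generated string into individual tags (comma-separated format is assumed).
predicted_tags = [t.strip() for t in decoded.split(",") if t.strip()]
print(predicted_tags)
```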