gonzalez-agirre committed on
Commit 9bb0c8c · 1 Parent(s): ff189ac

Update README.md

Files changed (1)
  1. README.md +2 -15
README.md CHANGED
@@ -116,22 +116,9 @@ The training corpus consists of several corpora gathered from web crawling and p
 ### Training procedure
 
 The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2)
-used in the original [RoBERTA](https://github.com/p
-
-### Author
-Text Mining Unit (TeMU) at the Barcelona Supercomputing Center ([email protected])
-
-### Contact information
-For further information, send an email to <[email protected]>
-
-### Copyright
-Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
-
-### Licensing informationytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 50,262 tokens.
+used in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 50,262 tokens.
 The RoBERTa-ca-v2 pretraining consists of a masked language model training that follows the approach employed for the RoBERTa base model
-with the same hyperparameters as in the original work.
-The training lasted a total of 96 hours with 16 NVIDIA V100 GPUs of 16GB DDRAM.
-
+with the same hyperparameters as in the original work. The training lasted a total of 96 hours with 16 NVIDIA V100 GPUs of 16GB DDRAM.
 
 ## Evaluation
 
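For readers who want to see what the corrected "Training procedure" text corresponds to in code, here is a minimal sketch of the byte-level BPE step using the Hugging Face `tokenizers` library. It is not the authors' training script: the corpus path, the `min_frequency` cutoff, and the output directory are placeholder assumptions; only the 50,262-token vocabulary size and the RoBERTa-style special tokens come from the README text above.

```python
# Minimal sketch of the byte-level BPE tokenization step described in the README.
# Placeholders/assumptions: corpus path, min_frequency, output directory.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["catalan_corpus.txt"],   # placeholder path to the cleaned training corpus
    vocab_size=50262,               # vocabulary size stated in the README
    min_frequency=2,                # assumed cutoff, not stated in the README
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # RoBERTa special tokens
)
tokenizer.save_model("bpe-tokenizer")  # writes vocab.json and merges.txt
```

The resulting vocab.json/merges.txt pair is what a `RobertaTokenizerFast` would load before the masked-language-model pretraining the README describes (RoBERTa base hyperparameters, 96 hours on 16 NVIDIA V100 16GB GPUs).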