library_name: transformers
tags:
- medical
---

<p align="center">
  <img src="https://github.com/qanastek/DrBERT/blob/main/assets/logo.png?raw=true" alt="drawing" width="250"/>
</p>

# DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains

In recent years, pre-trained language models (PLMs) have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general-domain data, specialized models have since emerged to handle specific domains more effectively.

In this paper, we propose an original study of PLMs in the medical domain for the French language. We compare, for the first time, the performance of PLMs trained on both public data from the web and private data from healthcare establishments. We also evaluate different learning strategies on a set of biomedical tasks.

Finally, we release the first specialized PLMs for the biomedical field in French, called DrBERT, as well as the largest corpus of medical data under a free license on which these models are trained.
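
The released checkpoints can be used with the Hugging Face `transformers` library. Below is a minimal loading sketch; `Dr-BERT/DrBERT-7GB` is given only as an example identifier of a publicly released DrBERT checkpoint, so adjust it to the checkpoint you actually want to use.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Example Hub identifier (assumption): replace with the DrBERT checkpoint you need.
model_name = "Dr-BERT/DrBERT-7GB"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# DrBERT is a masked language model, so it can be probed with the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"Le patient souffre d'une {tokenizer.mask_token} chronique."))
```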

# CAS: French Corpus with Clinical Cases

|           | Train | Dev   | Test  |
|:---------:|:-----:|:-----:|:-----:|
| Documents | 5,306 | 1,137 | 1,137 |

The ESSAIS (Dalloux et al., 2021) and CAS (Grabar et al., 2018) corpora contain 13,848 and 7,580 French clinical cases, respectively. Some clinical cases are associated with discussions. A subset of the whole set of cases is enriched with morpho-syntactic (part-of-speech (POS) tagging, lemmatization) and semantic (UMLS concepts, negation, uncertainty) annotations. Here, we focus only on the POS tagging task.
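
This card reports POS tagging results, so the checkpoint can be queried as a token-classification (POS) tagger. The sketch below uses the `transformers` token-classification pipeline; the model identifier is a hypothetical placeholder standing in for this repository's Hub path.

```python
from transformers import pipeline

# Hypothetical placeholder: replace with this repository's actual Hub identifier.
model_name = "Dr-BERT/DrBERT-CAS-POS"

pos_tagger = pipeline(
    "token-classification",
    model=model_name,
    aggregation_strategy="simple",  # merge sub-word pieces back into words
)

for token in pos_tagger("Le patient présente une douleur abdominale aiguë."):
    print(token["word"], token["entity_group"], round(token["score"], 4))
```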

# Model Metrics

```plain
              precision    recall  f1-score   support

         ABR     0.8683    0.8480    0.8580       171
         ADJ     0.9634    0.9751    0.9692      4018
         ADV     0.9935    0.9849    0.9892       926
     DET:ART     0.9982    0.9997    0.9989      3308
     DET:POS     1.0000    1.0000    1.0000       133
         INT     1.0000    0.7000    0.8235        10
         KON     0.9883    0.9976    0.9929       845
         NAM     0.9144    0.9353    0.9247       834
         NOM     0.9827    0.9803    0.9815      7980
         NUM     0.9825    0.9845    0.9835      1422
     PRO:DEM     0.9924    1.0000    0.9962       131
     PRO:IND     0.9630    1.0000    0.9811        78
     PRO:PER     0.9948    0.9931    0.9939       579
     PRO:REL     1.0000    0.9908    0.9954       109
         PRP     0.9989    0.9982    0.9985      3785
     PRP:det     1.0000    0.9985    0.9993       681
         PUN     0.9996    0.9958    0.9977      2376
     PUN:cit     0.9756    0.9524    0.9639        84
        SENT     1.0000    0.9974    0.9987      1174
         SYM     0.9495    1.0000    0.9741        94
    VER:cond     1.0000    1.0000    1.0000        11
    VER:futu     1.0000    0.9444    0.9714        18
    VER:impf     1.0000    0.9963    0.9981       804
    VER:infi     1.0000    0.9585    0.9788       193
    VER:pper     0.9742    0.9564    0.9652      1261
    VER:ppre     0.9617    0.9901    0.9757       203
    VER:pres     0.9833    0.9904    0.9868       830
    VER:simp     0.9123    0.7761    0.8387        67
    VER:subi     1.0000    0.7000    0.8235        10
    VER:subp     1.0000    0.8333    0.9091        18

    accuracy                         0.9842     32153
   macro avg     0.9799    0.9492    0.9623     32153
weighted avg     0.9843    0.9842    0.9842     32153
```
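
The exact evaluation script is not given here; the report above follows the format of scikit-learn's `classification_report`. As an illustration only, a per-tag report in this format can be produced from flattened gold and predicted tag sequences like this:

```python
from sklearn.metrics import classification_report

# Illustrative toy data: in practice, y_true and y_pred are the flattened gold
# and predicted POS tags over the whole test set (one entry per token).
y_true = ["NOM", "ADJ", "PUN", "VER:pres", "NOM"]
y_pred = ["NOM", "ADJ", "PUN", "VER:pres", "NAM"]

print(classification_report(y_true, y_pred, digits=4))
```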

# Citation BibTeX

```bibtex
@misc{labrak2023drbert,
      title={DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains},
      author={Yanis Labrak and Adrien Bazoge and Richard Dufour and Mickael Rouvier and Emmanuel Morin and Béatrice Daille and Pierre-Antoine Gourraud},
      year={2023},
      eprint={2304.00958},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```