library_name: transformers
tags:
- medical
---

<p align="center">
  <img src="https://github.com/qanastek/DrBERT/blob/main/assets/logo.png?raw=true" alt="drawing" width="250"/>
</p>

# DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains

In recent years, pre-trained language models (PLMs) have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general-domain data, specialized models have since emerged to handle specific domains more effectively.

In this paper, we propose an original study of PLMs in the medical domain for the French language. We compare, for the first time, the performance of PLMs trained on both public data from the web and private data from healthcare establishments. We also evaluate different learning strategies on a set of biomedical tasks.

Finally, we release the first specialized PLMs for the biomedical field in French, called DrBERT, as well as the largest corpus of medical data under a free license on which these models are trained.
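
The released checkpoints can be used with the Hugging Face `transformers` library. Below is a minimal loading sketch; `Dr-BERT/DrBERT-7GB` is given only as an example identifier of a publicly released DrBERT checkpoint, so adjust it to the checkpoint you actually want to use.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Example Hub identifier (assumption): replace with the DrBERT checkpoint you need.
model_name = "Dr-BERT/DrBERT-7GB"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# DrBERT is a masked language model, so it can be probed with the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"Le patient souffre d'une {tokenizer.mask_token} chronique."))
```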

# CAS: French Corpus with Clinical Cases

|           | Train | Dev   | Test  |
|:---------:|:-----:|:-----:|:-----:|
| Documents | 5,306 | 1,137 | 1,137 |

The ESSAIS (Dalloux et al., 2021) and CAS (Grabar et al., 2018) corpora contain 13,848 and 7,580 French clinical cases, respectively. Some clinical cases are associated with discussions. A subset of the whole set of cases is enriched with morpho-syntactic (part-of-speech (POS) tagging, lemmatization) and semantic (UMLS concepts, negation, uncertainty) annotations. Here, we focus only on the POS tagging task.
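
This card reports POS tagging results, so the checkpoint can be queried as a token-classification (POS) tagger. The sketch below uses the `transformers` token-classification pipeline; the model identifier is a hypothetical placeholder standing in for this repository's Hub path.

```python
from transformers import pipeline

# Hypothetical placeholder: replace with this repository's actual Hub identifier.
model_name = "Dr-BERT/DrBERT-CAS-POS"

pos_tagger = pipeline(
    "token-classification",
    model=model_name,
    aggregation_strategy="simple",  # merge sub-word pieces back into words
)

for token in pos_tagger("Le patient présente une douleur abdominale aiguë."):
    print(token["word"], token["entity_group"], round(token["score"], 4))
```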

# Model Metrics

```plain
              precision    recall  f1-score   support

         ABR     0.8683    0.8480    0.8580       171
         ADJ     0.9634    0.9751    0.9692      4018
         ADV     0.9935    0.9849    0.9892       926
     DET:ART     0.9982    0.9997    0.9989      3308
     DET:POS     1.0000    1.0000    1.0000       133
         INT     1.0000    0.7000    0.8235        10
         KON     0.9883    0.9976    0.9929       845
         NAM     0.9144    0.9353    0.9247       834
         NOM     0.9827    0.9803    0.9815      7980
         NUM     0.9825    0.9845    0.9835      1422
     PRO:DEM     0.9924    1.0000    0.9962       131
     PRO:IND     0.9630    1.0000    0.9811        78
     PRO:PER     0.9948    0.9931    0.9939       579
     PRO:REL     1.0000    0.9908    0.9954       109
         PRP     0.9989    0.9982    0.9985      3785
     PRP:det     1.0000    0.9985    0.9993       681
         PUN     0.9996    0.9958    0.9977      2376
     PUN:cit     0.9756    0.9524    0.9639        84
        SENT     1.0000    0.9974    0.9987      1174
         SYM     0.9495    1.0000    0.9741        94
    VER:cond     1.0000    1.0000    1.0000        11
    VER:futu     1.0000    0.9444    0.9714        18
    VER:impf     1.0000    0.9963    0.9981       804
    VER:infi     1.0000    0.9585    0.9788       193
    VER:pper     0.9742    0.9564    0.9652      1261
    VER:ppre     0.9617    0.9901    0.9757       203
    VER:pres     0.9833    0.9904    0.9868       830
    VER:simp     0.9123    0.7761    0.8387        67
    VER:subi     1.0000    0.7000    0.8235        10
    VER:subp     1.0000    0.8333    0.9091        18

    accuracy                         0.9842     32153
   macro avg     0.9799    0.9492    0.9623     32153
weighted avg     0.9843    0.9842    0.9842     32153
```
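
The exact evaluation script is not given here; the report above follows the format of scikit-learn's `classification_report`. As an illustration only, a per-tag report in this format can be produced from flattened gold and predicted tag sequences like this:

```python
from sklearn.metrics import classification_report

# Illustrative toy data: in practice, y_true and y_pred are the flattened gold
# and predicted POS tags over the whole test set (one entry per token).
y_true = ["NOM", "ADJ", "PUN", "VER:pres", "NOM"]
y_pred = ["NOM", "ADJ", "PUN", "VER:pres", "NAM"]

print(classification_report(y_true, y_pred, digits=4))
```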

# Citation BibTeX

```bibtex
@misc{labrak2023drbert,
      title={DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains},
      author={Yanis Labrak and Adrien Bazoge and Richard Dufour and Mickael Rouvier and Emmanuel Morin and Béatrice Daille and Pierre-Antoine Gourraud},
      year={2023},
      eprint={2304.00958},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```