---
library_name: transformers
tags:
- medical
---

<p align="center">
  <img src="https://github.com/qanastek/DrBERT/blob/main/assets/logo.png?raw=true" alt="drawing" width="250"/>
</p>

# DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains

In recent years, pre-trained language models (PLMs) have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general-domain data, specialized ones have emerged to handle specific domains more effectively.
In this paper, we propose an original study of PLMs in the medical domain for the French language. We compare, for the first time, the performance of PLMs trained on both public data from the web and private data from healthcare establishments. We also evaluate different learning strategies on a set of biomedical tasks.
Finally, we release the first specialized PLMs for the biomedical field in French, called DrBERT, as well as the largest corpus of medical data under a free license on which these models are trained.

# CAS: French Corpus with Clinical Cases

|           | Train | Dev   | Test  |
|:---------:|:-----:|:-----:|:-----:|
| Documents | 5,306 | 1,137 | 1,137 |

The ESSAIS (Dalloux et al., 2021) and CAS (Grabar et al., 2018) corpora contain 13,848 and 7,580 French clinical cases, respectively. Some clinical cases are associated with discussions. A subset of the cases is enriched with morpho-syntactic (part-of-speech (POS) tagging, lemmatization) and semantic (UMLS concepts, negation, uncertainty) annotations. In this work, we focus only on the POS tagging task.
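
To illustrate how a POS-tagging checkpoint like this one can be queried, here is a minimal usage sketch with the `transformers` token-classification pipeline. The model id `qanastek/DrBERT-CASM2` and the example sentence are assumptions for illustration only; substitute this repository's actual Hub id.

```python
# Minimal usage sketch (not from the original card): POS tagging with the
# transformers pipeline. "qanastek/DrBERT-CASM2" is a PLACEHOLDER model id.
from transformers import pipeline

pos_tagger = pipeline(
    "token-classification",
    model="qanastek/DrBERT-CASM2",   # placeholder: use this repository's id
    aggregation_strategy="simple",   # merge sub-word pieces back into words
)

sentence = "Le patient présente une douleur abdominale aiguë."
for token in pos_tagger(sentence):
    print(token["word"], token["entity_group"], round(token["score"], 3))
```

With `aggregation_strategy="simple"`, sub-word pieces produced by the tokenizer are merged, so each printed row corresponds to one word and its predicted tag (e.g. `NOM`, `ADJ`).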

# Model Metrics

```plain
              precision    recall  f1-score   support

         ABR     0.8683    0.8480    0.8580       171
         ADJ     0.9634    0.9751    0.9692      4018
         ADV     0.9935    0.9849    0.9892       926
     DET:ART     0.9982    0.9997    0.9989      3308
     DET:POS     1.0000    1.0000    1.0000       133
         INT     1.0000    0.7000    0.8235        10
         KON     0.9883    0.9976    0.9929       845
         NAM     0.9144    0.9353    0.9247       834
         NOM     0.9827    0.9803    0.9815      7980
         NUM     0.9825    0.9845    0.9835      1422
     PRO:DEM     0.9924    1.0000    0.9962       131
     PRO:IND     0.9630    1.0000    0.9811        78
     PRO:PER     0.9948    0.9931    0.9939       579
     PRO:REL     1.0000    0.9908    0.9954       109
         PRP     0.9989    0.9982    0.9985      3785
     PRP:det     1.0000    0.9985    0.9993       681
         PUN     0.9996    0.9958    0.9977      2376
     PUN:cit     0.9756    0.9524    0.9639        84
        SENT     1.0000    0.9974    0.9987      1174
         SYM     0.9495    1.0000    0.9741        94
    VER:cond     1.0000    1.0000    1.0000        11
    VER:futu     1.0000    0.9444    0.9714        18
    VER:impf     1.0000    0.9963    0.9981       804
    VER:infi     1.0000    0.9585    0.9788       193
    VER:pper     0.9742    0.9564    0.9652      1261
    VER:ppre     0.9617    0.9901    0.9757       203
    VER:pres     0.9833    0.9904    0.9868       830
    VER:simp     0.9123    0.7761    0.8387        67
    VER:subi     1.0000    0.7000    0.8235        10
    VER:subp     1.0000    0.8333    0.9091        18

    accuracy                         0.9842     32153
   macro avg     0.9799    0.9492    0.9623     32153
weighted avg     0.9843    0.9842    0.9842     32153
```
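
The card does not state which tooling produced this report, but its layout matches scikit-learn's `classification_report`. As a hedged sketch, a report in this format can be generated from flat lists of gold and predicted tags:

```python
# Minimal sketch (assumption: the original evaluation script is not given
# in the card). classification_report prints per-tag precision/recall/F1
# plus the accuracy / macro avg / weighted avg summary rows seen above.
from sklearn.metrics import classification_report

y_true = ["NOM", "ADJ", "PRP", "NOM", "VER:pres"]  # gold tags, flattened over all tokens
y_pred = ["NOM", "ADJ", "PRP", "ADJ", "VER:pres"]  # model predictions, same order

print(classification_report(y_true, y_pred, digits=4))
```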

# Citation BibTeX

```bibtex
@misc{labrak2023drbert,
      title={DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains},
      author={Yanis Labrak and Adrien Bazoge and Richard Dufour and Mickael Rouvier and Emmanuel Morin and Béatrice Daille and Pierre-Antoine Gourraud},
      year={2023},
      eprint={2304.00958},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```