Question Answering · Transformers · Safetensors · French · modernbert
bourdoiscatie committed · verified · Commit f37ca12 · 1 Parent(s): a638b4d

Update README.md

Files changed (1): README.md +210 -55
README.md CHANGED
@@ -1,62 +1,217 @@
  ---
+ language: fr
+ datasets:
+ - etalab-ia/piaf
+ - fquad
+ - lincoln/newsquadfr
+ - pragnakalp/squad_v2_french_translated
+ - CATIE-AQ/frenchQA
  library_name: transformers
  license: mit
  base_model: almanach/moderncamembert-cv2-base
- tags:
- - generated_from_trainer
- datasets:
- - french_qa
- model-index:
- - name: moderncamembert-cv2-base-QA
-   results: []
+ metrics:
+ - f1
+ - exact_match
+ widget:
+ - text: Combien de personnes utilisent le français tous les jours ?
+   context: >-
+     Le français est une langue indo-européenne de la famille des langues romanes
+     dont les locuteurs sont appelés francophones. Elle est parfois surnommée la
+     langue de Molière. Le français est parlé, en 2023, sur tous les continents
+     par environ 321 millions de personnes : 235 millions l'emploient
+     quotidiennement et 90 millions en sont des locuteurs natifs. En 2018, 80
+     millions d'élèves et étudiants s'instruisent en français dans le monde.
+     Selon l'Organisation internationale de la francophonie (OIF), il pourrait y
+     avoir 700 millions de francophones sur Terre en 2050.
+ co2_eq_emissions: 46
  ---
 
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # moderncamembert-cv2-base-QA
-
- This model is a fine-tuned version of [almanach/moderncamembert-cv2-base](https://huggingface.co/almanach/moderncamembert-cv2-base) on the french_qa dataset.
- It achieves the following results on the evaluation set:
- - Loss: 1.0697
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 3e-05
- - train_batch_size: 8
- - eval_batch_size: 8
- - seed: 42
- - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - num_epochs: 3
-
- ### Training results
-
- | Training Loss | Epoch | Step  | Validation Loss |
- |:-------------:|:-----:|:-----:|:---------------:|
- | 0.4741        | 1.0   | 27790 | 0.6610          |
- | 0.2795        | 2.0   | 55580 | 0.7165          |
- | 0.1419        | 3.0   | 83370 | 1.0697          |
-
-
- ### Framework versions
-
- - Transformers 4.51.3
- - Pytorch 2.6.0+cu124
- - Datasets 2.16.0
- - Tokenizers 0.21.0
+
+ # ModernQAmembert
+
+ ## Model Description
+ We present **ModernQAmembert**, a [moderncamembert-cv2-base](https://huggingface.co/almanach/moderncamembert-cv2-base) model fine-tuned for French question answering on four French Q&A datasets. These combine contexts and questions whose answers lie inside the context (SQuAD 1.0 format) with contexts and questions whose answers are absent from the context (SQuAD 2.0 format); both record shapes are illustrated below.
+ All these datasets were concatenated into a single dataset that we called [frenchQA](https://huggingface.co/datasets/CATIE-AQ/frenchQA).
+ In total, **221,348 context/question/answer triplets were used to fine-tune this model and 6,376 to test it**.
+ Our methodology is described in a blog post available in [English](https://blog.vaniila.ai/en/QA_en/) or [French](https://blog.vaniila.ai/QA/).
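+
+ For illustration, here is a minimal sketch of the two record shapes, following the SQuAD field convention (the exact frenchQA column names may differ):
+
+ ```python
+ # SQuAD 1.0-style record: the answer span exists inside the context.
+ answerable = {
+     "question": "Quelle est la capitale de la France ?",
+     "context": "Paris est la capitale de la France.",
+     "answers": {"text": ["Paris"], "answer_start": [0]},
+ }
+
+ # SQuAD 2.0-style record: no answer in the context, encoded as empty lists.
+ unanswerable = {
+     "question": "Quelle est la capitale de l'Italie ?",
+     "context": "Paris est la capitale de la France.",
+     "answers": {"text": [], "answer_start": []},
+ }
+ ```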
+
+
+ ## Results (frenchQA test split)
+
+ | Model | Parameters | Context | Exact_match | F1 | Answer_F1 | NoAnswer_F1 |
+ | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
+ | [etalab/camembert-base-squadFR-fquad-piaf](https://huggingface.co/AgentPublic/camembert-base-squadFR-fquad-piaf) | 110M | 512 tokens | 39.30 | 51.55 | 79.54 | 23.58 |
+ | [QAmembert](https://huggingface.co/CATIE-AQ/QAmembert) | 110M | 512 tokens | 77.14 | 86.88 | 75.66 | 98.11 |
+ | [QAmembert2](https://huggingface.co/CATIE-AQ/QAmembert2) | 112M | 1024 tokens | 76.47 | 88.25 | 78.66 | 97.84 |
+ | [QAmemberta](https://huggingface.co/CATIE-AQ/QAmemberta) | 111M | 1024 tokens | **78.18** | **89.53** | **81.40** | 97.64 |
+ | ModernQAmembert (this model) | 136M | 8,192 tokens | 76.73 | 88.85 | 79.45 | 98.24 |
+ | [QAmembert-large](https://huggingface.co/CATIE-AQ/QAmembert-large) | 336M | 512 tokens | 77.14 | 88.74 | 78.83 | **98.65** |
+
+ Looking at the "Answer_F1" column, Etalab's model remains competitive on texts where the answer to the question is actually in the provided text (it does better than QAmembert-large there, for example). However, its inability to handle texts where the answer is not in the provided text is a drawback.
+ In all other respects, whether in terms of metrics, number of parameters or context size, QAmemberta achieves the best results.
+ We therefore invite the reader to choose this model.
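+
+ For reference, scores of this kind can be computed with the `evaluate` library's `squad_v2` metric, whose `HasAns_f1`/`NoAns_f1` outputs roughly correspond to the Answer_F1/NoAnswer_F1 columns above. A minimal sketch, not the exact evaluation script used here (that is described in the blog post):
+
+ ```python
+ import evaluate
+
+ # squad_v2 handles both answerable and unanswerable questions.
+ metric = evaluate.load("squad_v2")
+
+ predictions = [
+     {"id": "0", "prediction_text": "235 millions", "no_answer_probability": 0.0},
+ ]
+ references = [
+     {"id": "0", "answers": {"text": ["235 millions"], "answer_start": [264]}},
+ ]
+
+ scores = metric.compute(predictions=predictions, references=references)
+ print(scores["exact"], scores["f1"])  # 100.0 100.0 on this toy pair
+ ```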
+
+ ### Usage
+
+ ```python
+ from transformers import pipeline
+
+ qa = pipeline('question-answering', model='CATIE-AQ/ModernQAmembert', tokenizer='CATIE-AQ/ModernQAmembert')
+
+ result = qa({
+     'question': "Combien de personnes utilisent le français tous les jours ?",
+     'context': "Le français est une langue indo-européenne de la famille des langues romanes dont les locuteurs sont appelés francophones. Elle est parfois surnommée la langue de Molière. Le français est parlé, en 2023, sur tous les continents par environ 321 millions de personnes : 235 millions l'emploient quotidiennement et 90 millions en sont des locuteurs natifs. En 2018, 80 millions d'élèves et étudiants s'instruisent en français dans le monde. Selon l'Organisation internationale de la francophonie (OIF), il pourrait y avoir 700 millions de francophones sur Terre en 2050."
+ })
+
+ # A low score suggests the answer is not in the provided context
+ # (the model was trained on SQuAD 2.0-style unanswerable examples).
+ if result['score'] < 0.01:
+     print("La réponse n'est pas dans le contexte fourni.")
+ else:
+     print(result['answer'])
+ ```
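+
+ The pipeline can also exploit the model's 8,192-token context on longer documents. A minimal sketch, where `long_document` is a placeholder and the parameter values are illustrative rather than tuned:
+
+ ```python
+ from transformers import pipeline
+
+ qa = pipeline("question-answering", model="CATIE-AQ/ModernQAmembert")
+
+ long_document = "..."  # any long French text (placeholder)
+
+ result = qa(
+     question="Combien de personnes utilisent le français tous les jours ?",
+     context=long_document,
+     max_seq_len=8192,               # the model's full context window
+     doc_stride=128,                 # overlap between chunks beyond that length
+     handle_impossible_answer=True,  # return "" when no answer is found
+ )
+ print(result["answer"] or "Pas de réponse dans le contexte.")
+ ```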
+
+
+ ## Environmental Impact
+
+ *Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.*
+ - **Hardware Type:** A100 PCIe 40/80GB
+ - **Hours used:** 5 h 38 min
+ - **Cloud Provider:** Private infrastructure
+ - **Carbon Efficiency:** 0.032 kg CO2eq/kWh (estimated from [electricitymaps](https://app.electricitymaps.com/zone/FR); we take the carbon intensity in France for November 20, 2024)
+ - **Carbon Emitted** *(power consumption × time × carbon intensity of the power grid)*: **0.046 kg CO2eq**
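+
+ As a sanity check, the formula above can be reproduced assuming the 250 W TDP of an A100 PCIe 40GB (an assumption; the actual power draw was not logged here):
+
+ ```python
+ # carbon emitted = power (kW) x time (h) x carbon intensity (kg CO2eq/kWh)
+ power_kw = 0.250       # A100 PCIe 40GB TDP (assumed; actual draw varies)
+ hours = 5 + 38 / 60    # 5 h 38 min
+ intensity = 0.032      # kg CO2eq/kWh, France, 2024-11-20
+ print(round(power_kw * hours * intensity, 3))  # 0.045, consistent with the reported 0.046 kg given rounding
+ ```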
+
+
+ ## Citations
+
+ ### ModernQAmembert
+ ```
+ @misc {modernqamembert2025,
+     author       = { {BOURDOIS, Loïck} },
+     organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
+     title        = { ModernQAmembert },
+     year         = 2025,
+     url          = { https://huggingface.co/CATIE-AQ/ModernQAmembert },
+     doi          = { 10.57967/hf/3640 },
+     publisher    = { Hugging Face }
+ }
+ ```
+
+ ### Moderncamembert-cv2-base
+ ```
+ @misc{antoun2025modernbertdebertav3examiningarchitecture,
+     title={ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance},
+     author={Wissam Antoun and Benoît Sagot and Djamé Seddah},
+     year={2025},
+     eprint={2504.08716},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL},
+     url={https://arxiv.org/abs/2504.08716},
+ }
+ ```
+
+ ### QAmemBERT2 & QAmemBERTa
+ ```
+ @misc {qamemberta2024,
+     author       = { {BOURDOIS, Loïck} },
+     organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
+     title        = { QAmemberta (Revision 976a70b) },
+     year         = 2024,
+     url          = { https://huggingface.co/CATIE-AQ/QAmemberta },
+     doi          = { 10.57967/hf/3639 },
+     publisher    = { Hugging Face }
+ }
+ ```
+
+ ### CamemBERT 2.0
+ ```
+ @misc{antoun2024camembert20smarterfrench,
+     title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
+     author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
+     year={2024},
+     eprint={2411.08868},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL},
+     url={https://arxiv.org/abs/2411.08868},
+ }
+ ```
+
+ ### QAmemBERT
+ ```
+ @misc {qamembert2023,
+     author       = { {ALBAR, Boris and BEDU, Pierre and BOURDOIS, Loïck} },
+     organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
+     title        = { QAmembert (Revision 9685bc3) },
+     year         = 2023,
+     url          = { https://huggingface.co/CATIE-AQ/QAmembert },
+     doi          = { 10.57967/hf/0821 },
+     publisher    = { Hugging Face }
+ }
+ ```
+
+ ### CamemBERT
+ ```
+ @inproceedings{martin2020camembert,
+     title={CamemBERT: a Tasty French Language Model},
+     author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
+     booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
+     year={2020}
+ }
+ ```
+
+ ### frenchQA
+ ```
+ @misc {frenchQA2023,
+     author       = { {ALBAR, Boris and BEDU, Pierre and BOURDOIS, Loïck} },
+     organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
+     title        = { frenchQA (Revision 6249cd5) },
+     year         = 2023,
+     url          = { https://huggingface.co/CATIE-AQ/frenchQA },
+     doi          = { 10.57967/hf/0862 },
+     publisher    = { Hugging Face }
+ }
+ ```
+
+ ### PIAF
+ ```
+ @inproceedings{KeraronLBAMSSS20,
+     author    = {Rachel Keraron and
+                  Guillaume Lancrenon and
+                  Mathilde Bras and
+                  Fr{\'{e}}d{\'{e}}ric Allary and
+                  Gilles Moyse and
+                  Thomas Scialom and
+                  Edmundo{-}Pavel Soriano{-}Morales and
+                  Jacopo Staiano},
+     title     = {Project {PIAF:} Building a Native French Question-Answering Dataset},
+     booktitle = {{LREC}},
+     pages     = {5481--5490},
+     publisher = {European Language Resources Association},
+     year      = {2020}
+ }
+ ```
+
+ ### FQuAD
+ ```
+ @article{dHoffschmidt2020FQuADFQ,
+     title={FQuAD: French Question Answering Dataset},
+     author={Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl{\'e} and Quentin Heinrich},
+     journal={ArXiv},
+     year={2020},
+     volume={abs/2002.06071}
+ }
+ ```
+
+ ### lincoln/newsquadfr
+ ```
+ Hugging Face repository: https://hf.co/datasets/lincoln/newsquadfr
+ ```
+
+ ### pragnakalp/squad_v2_french_translated
+ ```
+ Hugging Face repository: https://hf.co/datasets/pragnakalp/squad_v2_french_translated
+ ```
+
+ ## License
+ MIT