Muhammad Farrukh Mehmood committed on
Commit b890846 · verified · 1 Parent(s): 67c0d2e

Update README.md

Files changed (1)
  1. README.md +104 -93
README.md CHANGED
@@ -1,93 +1,104 @@
- ---
- library_name: transformers
- license: apache-2.0
- base_model: google-bert/bert-base-uncased
- tags:
- - generated_from_trainer
- datasets:
- - conll2003
- metrics:
- - precision
- - recall
- - f1
- - accuracy
- model-index:
- - name: modernbert-conll-ner
-   results:
-   - task:
-       name: Token Classification
-       type: token-classification
-     dataset:
-       name: conll2003
-       type: conll2003
-       config: conll2003
-       split: None
-       args: conll2003
-     metrics:
-     - name: Precision
-       type: precision
-       value: 0.9358846918489065
-     - name: Recall
-       type: recall
-       value: 0.9506900033658701
-     - name: F1
-       type: f1
-       value: 0.943229253631658
-     - name: Accuracy
-       type: accuracy
-       value: 0.9879263507395111
- ---
-
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # modernbert-conll-ner
-
- This model is a fine-tuned version of [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) on the conll2003 dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.0649
- - Precision: 0.9359
- - Recall: 0.9507
- - F1: 0.9432
- - Accuracy: 0.9879
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 2e-05
- - train_batch_size: 8
- - eval_batch_size: 8
- - seed: 42
- - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - num_epochs: 3
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
- |:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
- | 0.023 | 1.0 | 1756 | 0.0683 | 0.9201 | 0.9416 | 0.9307 | 0.9859 |
- | 0.0222 | 2.0 | 3512 | 0.0614 | 0.9345 | 0.9514 | 0.9429 | 0.9874 |
- | 0.0097 | 3.0 | 5268 | 0.0649 | 0.9359 | 0.9507 | 0.9432 | 0.9879 |
-
-
- ### Framework versions
-
- - Transformers 4.47.1
- - Pytorch 2.5.1+cu121
- - Datasets 3.2.0
- - Tokenizers 0.21.0
+ # Model Card: BERT for Named Entity Recognition (NER)
+
+ ## Model Overview
+
+ This model, **modernbert-conll-ner**, is a fine-tuned version of `bert-base-uncased` trained for Named Entity Recognition (NER) on the CoNLL-2003 dataset. It identifies and classifies entities in text, such as **person names (PER)**, **organizations (ORG)**, **locations (LOC)**, and **miscellaneous (MISC)** entities.
+
+ ### Model Architecture
+ - **Base Model**: BERT (Bidirectional Encoder Representations from Transformers) with the `bert-base-uncased` architecture.
+ - **Task**: Token Classification (NER).
+
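+ A minimal sketch of what this setup looks like in code (illustrative only; the released checkpoint already contains the fine-tuned classification head):
+
+ ```python
+ from transformers import AutoModelForTokenClassification
+
+ # Token classification = the BERT encoder plus a linear head that maps each
+ # token's 768-dimensional hidden state to one score per NER tag.
+ model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)
+ print(model.classifier)  # Linear(in_features=768, out_features=9, bias=True)
+ ```
+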
+ ## Training Dataset
+
+ - **Dataset**: CoNLL-2003, a standard dataset for NER tasks containing sentences annotated with named entity spans.
+ - **Classes**:
+   - `PER` (Person)
+   - `ORG` (Organization)
+   - `LOC` (Location)
+   - `MISC` (Miscellaneous)
+   - `O` (Outside of any entity span)
+
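+ The annotations use the BIO scheme, so the model actually predicts nine tags. A minimal sketch for inspecting them with the Hugging Face `datasets` library (recent `datasets` releases may require `trust_remote_code=True` for the CoNLL-2003 loading script):
+
+ ```python
+ from datasets import load_dataset
+
+ # Load CoNLL-2003 and list the token-level tag names used for training.
+ dataset = load_dataset("conll2003")
+ label_names = dataset["train"].features["ner_tags"].feature.names
+ print(label_names)
+ # ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
+ ```
+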
+ ## Performance Metrics
+
+ The model achieves the following results on the CoNLL-2003 evaluation set:
+
+ | Metric        | Value  |
+ |---------------|--------|
+ | **Loss**      | 0.0649 |
+ | **Precision** | 93.59% |
+ | **Recall**    | 95.07% |
+ | **F1 Score**  | 94.32% |
+ | **Accuracy**  | 98.79% |
+
+ These results indicate that the model identifies and classifies entities accurately and consistently on this benchmark.
+
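+ For reference, entity-level precision, recall, and F1 for NER are conventionally computed with the `seqeval` package. The sketch below is illustrative and not the exact evaluation code used for this run (`pip install seqeval`; the label sequences are made up):
+
+ ```python
+ from seqeval.metrics import f1_score, precision_score, recall_score
+
+ # One list of BIO tags per sentence; an entity counts as correct only if its
+ # full span and type match the reference.
+ y_true = [["B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O"]]
+ y_pred = [["B-PER", "O", "O", "B-LOC", "I-LOC", "O", "O"]]
+
+ print(precision_score(y_true, y_pred))  # 0.5 -> only 1 of 2 predicted entities is exact
+ print(recall_score(y_true, y_pred))     # 0.5
+ print(f1_score(y_true, y_pred))         # 0.5
+ ```
+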
+ ## Training Details
+
+ - **Optimizer**: AdamW (Adam with weight decay)
+ - **Learning Rate**: 2e-5
+ - **Batch Size**: 8
+ - **Number of Epochs**: 3
+ - **Seed**: 42
+ - **Scheduler**: Linear learning-rate schedule
+ - **Loss Function**: Cross-entropy loss with ignored index (`-100`) for padding tokens
+
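+ The training script itself is not reproduced here; the sketch below shows how the hyperparameters above map onto the Hugging Face `Trainer` API. It is a reconstruction under standard assumptions (labels aligned to the first sub-word of each word, default AdamW settings, `output_dir` chosen for illustration):
+
+ ```python
+ from datasets import load_dataset
+ from transformers import (AutoModelForTokenClassification, AutoTokenizer,
+                           DataCollatorForTokenClassification, Trainer, TrainingArguments)
+
+ raw = load_dataset("conll2003")
+ num_labels = raw["train"].features["ner_tags"].feature.num_classes  # 9 BIO tags
+
+ tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+ model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)
+
+ def tokenize_and_align(batch):
+     # Tokenize pre-split words; copy each word's tag to its first sub-word and
+     # mark special tokens / remaining sub-words with -100 so the loss ignores them.
+     enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
+     all_labels = []
+     for i, tags in enumerate(batch["ner_tags"]):
+         previous = None
+         labels = []
+         for word_id in enc.word_ids(batch_index=i):
+             labels.append(-100 if word_id is None or word_id == previous else tags[word_id])
+             previous = word_id
+         all_labels.append(labels)
+     enc["labels"] = all_labels
+     return enc
+
+ tokenized = raw.map(tokenize_and_align, batched=True, remove_columns=raw["train"].column_names)
+
+ args = TrainingArguments(
+     output_dir="modernbert-conll-ner",
+     learning_rate=2e-5,
+     per_device_train_batch_size=8,
+     per_device_eval_batch_size=8,
+     num_train_epochs=3,
+     lr_scheduler_type="linear",
+     seed=42,
+ )
+
+ trainer = Trainer(
+     model=model,
+     args=args,
+     train_dataset=tokenized["train"],
+     eval_dataset=tokenized["validation"],
+     data_collator=DataCollatorForTokenClassification(tokenizer),
+ )
+ trainer.train()
+ ```
+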
+ ## Model Input/Output
+
+ - **Input Format**: Tokenized text with special tokens `[CLS]` and `[SEP]`.
+ - **Output Format**: Token-level predictions with corresponding labels from the NER tag set (`B-PER`, `I-PER`, etc.).
+
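+ To make the token-level output concrete, here is a minimal sketch of a raw forward pass without the `pipeline` helper (the tag names come from the checkpoint's `id2label` mapping):
+
+ ```python
+ import torch
+ from transformers import AutoModelForTokenClassification, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("sfarrukh/modernbert-conll-ner")
+ model = AutoModelForTokenClassification.from_pretrained("sfarrukh/modernbert-conll-ner")
+
+ inputs = tokenizer("John lives in New York City.", return_tensors="pt")
+ with torch.no_grad():
+     logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)
+
+ predicted_ids = logits.argmax(dim=-1)[0]
+ tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+ for token, label_id in zip(tokens, predicted_ids):
+     print(token, model.config.id2label[label_id.item()])  # one BIO tag per sub-word token
+ ```
+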
+ ## How to Use the Model
+
+ ### Installation
+ ```bash
+ pip install transformers torch
+ ```
+
+ ### Loading the Model
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+
+ tokenizer = AutoTokenizer.from_pretrained("sfarrukh/modernbert-conll-ner")
+ model = AutoModelForTokenClassification.from_pretrained("sfarrukh/modernbert-conll-ner")
+ ```
+
+ ### Running Inference
+ ```python
+ from transformers import pipeline
+
+ # aggregation_strategy="simple" merges B-/I- sub-word predictions into whole entity spans
+ nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
+ text = "John lives in New York City."
+ result = nlp(text)
+ print(result)
+ ```
+
+ Example output (a Python list of entity dictionaries):
+ ```python
+ [{'entity_group': 'PER',
+   'score': 0.99912304,
+   'word': 'john',
+   'start': 0,
+   'end': 4},
+  {'entity_group': 'LOC',
+   'score': 0.9993351,
+   'word': 'new york city',
+   'start': 14,
+   'end': 27}]
+ ```
+
+ Note that the returned words are lower-cased because the model builds on the `bert-base-uncased` tokenizer.
+
+ ## Limitations
+
+ 1. **Domain-Specific Adaptability**: Performance might drop on domain-specific texts (e.g., legal or medical) not covered by the CoNLL-2003 dataset.
+ 2. **Ambiguity**: Ambiguous entities or overlapping spans are not explicitly handled.
+
+ ## Recommendations
+
+ - For domain-specific tasks, consider fine-tuning this model further on a relevant dataset.
+ - Use a pre-processing step that splits long texts into smaller segments before inference (a minimal sketch follows below).
+
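+ A minimal sketch of such a splitting step (illustrative only; `nlp` is the pipeline created above, and the fixed window size is an arbitrary choice):
+
+ ```python
+ def ner_long_text(nlp, text, max_chars=1000):
+     """Run the NER pipeline over consecutive character windows of a long text,
+     shifting the predicted offsets back into document coordinates."""
+     entities = []
+     for offset in range(0, len(text), max_chars):
+         for entity in nlp(text[offset:offset + max_chars]):
+             entity["start"] += offset
+             entity["end"] += offset
+             entities.append(entity)
+     return entities
+ ```
+
+ A fixed-size split can cut an entity at a window boundary; splitting on sentence or paragraph boundaries instead avoids this.
+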
+ ## Acknowledgements
+
+ - **Transformers Library**: Hugging Face
+ - **Dataset**: CoNLL-2003
+ - **Base Model**: `bert-base-uncased` by Google