---
license: mit
datasets:
- eriktks/conll2003
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: token-classification
library_name: transformers
tags:
- ner
---
# Model Card: BERT for Named Entity Recognition (NER)

## Model Overview

This model, **bert-conll-ner**, is a fine-tuned version of `bert-base-uncased` trained for the task of Named Entity Recognition (NER) using the [CoNLL-2003](https://huggingface.co/datasets/eriktks/conll2003) dataset. It is designed to identify and classify entities in text, such as **person names (PER)**, **organizations (ORG)**, **locations (LOC)**, and **miscellaneous (MISC)** entities.

### Model Architecture
- **Base Model**: BERT (Bidirectional Encoder Representations from Transformers) with the `bert-base-uncased` architecture.
- **Task**: Token Classification (NER).

## Training Dataset

- **Dataset**: CoNLL-2003, a standard dataset for NER tasks containing sentences annotated with named entity spans.
- **Classes**:
  - `PER` (Person)
  - `ORG` (Organization)
  - `LOC` (Location)
  - `MISC` (Miscellaneous)
  - `O` (Outside of any entity span)
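
At the token level these classes appear as `B-`/`I-` prefixed tags (see Model Input/Output below). A minimal sketch for inspecting the full label set with the `datasets` library, using the dataset ID from the metadata above:

```python
from datasets import load_dataset

# Load CoNLL-2003 and print the token-level NER label names.
dataset = load_dataset("eriktks/conll2003")
label_names = dataset["train"].features["ner_tags"].feature.names
print(label_names)  # e.g. 'O', 'B-PER', 'I-PER', 'B-ORG', ...
```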

## Performance Metrics

The model achieves the following results on the CoNLL-2003 evaluation set:

| Metric      | Value      |
|-------------|------------|
| **Loss**    | 0.0649     |
| **Precision** | 93.59%    |
| **Recall**  | 95.07%     |
| **F1 Score** | 94.32%    |
| **Accuracy** | 98.79%    |

Together, these scores indicate that the model reliably identifies and classifies the four CoNLL entity types on the kind of newswire text found in CoNLL-2003.
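
Entity-level precision, recall, and F1 for NER are commonly computed with the `seqeval` package; the card does not state the exact evaluation script, so the following is only an illustrative sketch (requires `pip install seqeval`):

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# Entity-level scoring over IOB2 tag sequences (one list of tags per sentence).
y_true = [["B-PER", "O", "O", "B-LOC", "I-LOC", "O"]]
y_pred = [["B-PER", "O", "O", "B-LOC", "O", "O"]]

print(precision_score(y_true, y_pred))  # fraction of predicted entities that are correct
print(recall_score(y_true, y_pred))     # fraction of gold entities that were found
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```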

## Training Details

- **Optimizer**: AdamW (Adam with weight decay)
- **Learning Rate**: 2e-5
- **Batch Size**: 8
- **Number of Epochs**: 3
- **Scheduler**: Linear scheduler with warm-up steps
- **Loss Function**: Cross-entropy loss with ignored index (`-100`) for padding tokens
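
The card does not ship a training script, but a comparable run could be set up with the `Trainer` API roughly as sketched below. The warm-up step count, weight decay, output directory, and first-sub-token label alignment follow common convention and are assumptions, not values taken from this card:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

dataset = load_dataset("eriktks/conll2003")
label_names = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(label_names)
)

def tokenize_and_align(examples):
    # Tokenize pre-split words and align word-level NER tags to sub-word tokens.
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, prev = [], None
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)           # special/padding positions are ignored by the loss
            elif word_id != prev:
                labels.append(tags[word_id])  # label only the first sub-token of each word
            else:
                labels.append(-100)
            prev = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align, batched=True)

args = TrainingArguments(
    output_dir="bert-conll-ner",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    lr_scheduler_type="linear",  # linear decay after warm-up
    warmup_steps=500,            # assumed value; not stated in the card
    weight_decay=0.01,           # assumed value; not stated in the card
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```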

## Model Input/Output

- **Input Format**: Tokenized text with special tokens `[CLS]` and `[SEP]`.
- **Output Format**: Token-level predictions with corresponding labels from the NER tag set (`B-PER`, `I-PER`, etc.).
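
For illustration, the raw token-level predictions can be read off directly from the logits (as opposed to the aggregated entity spans returned by the pipeline shown below); a minimal sketch assuming PyTorch as the backend:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("sfarrukh/modernbert-conll-ner")
model = AutoModelForTokenClassification.from_pretrained("sfarrukh/modernbert-conll-ner")

inputs = tokenizer("John lives in New York City.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, num_labels)

pred_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, pred_ids):
    # Special tokens such as [CLS] and [SEP] also receive predictions;
    # they are typically ignored when reading off entities.
    print(f"{token:>12}  {model.config.id2label[pred_id.item()]}")
```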



## How to Use the Model

### Installation
```bash
pip install transformers torch
```

### Loading the Model
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("sfarrukh/modernbert-conll-ner")
model = AutoModelForTokenClassification.from_pretrained("sfarrukh/modernbert-conll-ner")
```

### Running Inference
```python
from transformers import pipeline

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "John lives in New York City."
result = nlp(text)
print(result)
```

Example output:

```python
[{'entity_group': 'PER',
  'score': 0.99912304,
  'word': 'john',
  'start': 0,
  'end': 4},
 {'entity_group': 'LOC',
  'score': 0.9993351,
  'word': 'new york city',
  'start': 14,
  'end': 27}]
```

## Limitations

1. **Domain-Specific Adaptability**: Performance might drop on domain-specific texts (e.g., legal or medical) not covered by the CoNLL-2003 dataset.
2. **Ambiguity**: Ambiguous entities or overlapping spans are not explicitly handled.

## Recommendations

- For domain-specific tasks, consider fine-tuning this model further on a relevant dataset.
- Use a pre-processing pipeline to handle long texts by splitting them into smaller, overlapping segments; a simple sliding-window sketch is shown below.
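
A naive sliding-window sketch for the second recommendation, reusing the `tokenizer` and `nlp` objects from the usage example above; the window and stride sizes are illustrative, and sentence-aware splitting (e.g. with `nltk`) is usually preferable:

```python
def chunk_text(text, tokenizer, max_tokens=256, stride=32):
    """Split a long text into overlapping windows of sub-word tokens."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = max_tokens - stride
    chunks = []
    for start in range(0, max(len(ids), 1), step):
        window = ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
    return chunks

# Each chunk can then be passed to the NER pipeline; note that entity character
# offsets are relative to the chunk and must be mapped back if needed.
for chunk in chunk_text("A very long document ...", tokenizer):
    print(nlp(chunk))
```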

## Acknowledgements

- **Transformers Library**: Hugging Face
- **Dataset**: CoNLL-2003
- **Base Model**: `bert-base-uncased` by Google