---
license: mit
datasets:
- eriktks/conll2003
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: token-classification
library_name: transformers
tags:
- ner
---
# Model Card: BERT for Named Entity Recognition (NER)
## Model Overview
This model, **bert-conll-ner**, is a fine-tuned version of `bert-base-uncased` for Named Entity Recognition (NER), trained on the [CoNLL-2003](https://huggingface.co/datasets/eriktks/conll2003) dataset. It identifies and classifies entities in text: **person names (PER)**, **organizations (ORG)**, **locations (LOC)**, and **miscellaneous (MISC)** entities.
### Model Architecture
- **Base Model**: BERT (Bidirectional Encoder Representations from Transformers) with the `bert-base-uncased` architecture.
- **Task**: Token Classification (NER).
## Training Dataset
- **Dataset**: CoNLL-2003, a standard dataset for NER tasks containing sentences annotated with named entity spans.
- **Classes** (the full BIO tag set is sketched after this list):
- `PER` (Person)
- `ORG` (Organization)
- `LOC` (Location)
- `MISC` (Miscellaneous)
- `O` (Outside of any entity span)
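At the token level these four classes are expanded into BIO tags. A minimal sketch of the resulting label mapping is shown below; the index order follows the CoNLL-2003 dataset's `ner_tags` feature and is an assumption here, so the authoritative mapping should be read from the checkpoint's `config.json` (`model.config.id2label`).

```python
# Illustrative BIO label mapping for CoNLL-2003 NER.
# NOTE: the index order is an assumption; read the real mapping from
# AutoModelForTokenClassification.from_pretrained(...).config.id2label.
id2label = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
    7: "B-MISC", 8: "I-MISC",
}
label2id = {label: idx for idx, label in id2label.items()}
```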
## Performance Metrics
The model achieves the following results on the CoNLL-2003 evaluation set:
| Metric | Value |
|-------------|------------|
| **Loss** | 0.0649 |
| **Precision** | 93.59% |
| **Recall** | 95.07% |
| **F1 Score** | 94.32% |
| **Accuracy** | 98.79% |
These scores indicate strong entity identification and classification performance on text similar to the CoNLL-2003 newswire data.
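Entity-level precision, recall, and F1 for NER are commonly computed with the `seqeval` library; the card does not state which tool produced the numbers above, so the snippet below is only a sketch of how such scores can be reproduced (the tag sequences shown are placeholders, not the actual evaluation data).

```python
# pip install seqeval
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

# Placeholder sequences; in practice these come from running the model
# over the CoNLL-2003 validation or test split.
y_true = [["B-PER", "O", "O", "B-LOC", "I-LOC", "O"]]
y_pred = [["B-PER", "O", "O", "B-LOC", "I-LOC", "O"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
```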
## Training Details
- **Optimizer**: AdamW (Adam with weight decay)
- **Learning Rate**: 2e-5
- **Batch Size**: 8
- **Number of Epochs**: 3
- **Scheduler**: Linear scheduler with warm-up steps
- **Loss Function**: Cross-entropy loss with ignore index (`-100`) for padding and other unlabeled tokens (a training sketch follows after this list)
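The hyperparameters above map naturally onto the `transformers` `Trainer` API. The sketch below is a hedged reconstruction, not the exact training script: the warm-up fraction is an assumption (the card only says "warm-up steps"), and argument names may differ slightly across `transformers` versions. `Trainer` uses AdamW by default, and the token-classification head applies cross-entropy with ignore index `-100` internally.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("eriktks/conll2003")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)

def tokenize_and_align_labels(batch):
    """Tokenize pre-split words and assign -100 to tokens the loss should ignore."""
    tokenized = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, word_labels in enumerate(batch["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)                  # [CLS], [SEP], padding
            elif word_id != previous_word:
                label_ids.append(word_labels[word_id])  # first sub-word keeps the label
            else:
                label_ids.append(-100)                  # sub-word continuations ignored
            previous_word = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

tokenized = dataset.map(tokenize_and_align_labels, batched=True)

args = TrainingArguments(
    output_dir="bert-conll-ner",
    learning_rate=2e-5,              # as reported above
    per_device_train_batch_size=8,   # as reported above
    num_train_epochs=3,              # as reported above
    lr_scheduler_type="linear",      # linear decay after warm-up
    warmup_ratio=0.1,                # assumption: warm-up size not given in the card
)

trainer = Trainer(
    model=model,                     # Trainer defaults to the AdamW optimizer
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```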
## Model Input/Output
- **Input Format**: Tokenized text with special tokens `[CLS]` and `[SEP]`.
- **Output Format**: Token-level predictions with labels from the NER tag set (`B-PER`, `I-PER`, etc.); the sketch below shows how raw logits map to these tags.
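To make the input/output contract concrete, the sketch below runs the model without the `pipeline` helper and maps per-token logits back to BIO tags. The repository id is the one used in the usage section below; `model.config.id2label` provides the tag names.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

repo = "sfarrukh/modernbert-conll-ner"  # repository id from the usage section below
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForTokenClassification.from_pretrained(repo)

text = "John lives in New York City."
inputs = tokenizer(text, return_tensors="pt")  # adds [CLS] and [SEP] automatically

with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, sequence_length, num_labels)

predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predictions):
    print(token, model.config.id2label[label_id])  # token-level B-/I-/O tags
```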
## How to Use the Model
### Installation
```bash
pip install transformers
```
### Loading the Model
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("sfarrukh/modernbert-conll-ner")
model = AutoModelForTokenClassification.from_pretrained("sfarrukh/modernbert-conll-ner")
```
### Running Inference
```python
from transformers import pipeline
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "John lives in New York City."
result = nlp(text)
print(result)
```
Example output:
```python
[{'entity_group': 'PER',
  'score': 0.99912304,
  'word': 'john',
  'start': 0,
  'end': 4},
 {'entity_group': 'LOC',
  'score': 0.9993351,
  'word': 'new york city',
  'start': 14,
  'end': 27}]
```
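The `aggregation_strategy="simple"` option merges the word-piece predictions of each entity into a single span, which is why the output contains whole entity groups with character offsets (`start`/`end`) rather than individual `B-`/`I-` token tags. Because the base model is uncased, the `word` field comes back lower-cased.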
## Limitations
1. **Domain-Specific Adaptability**: Performance might drop on domain-specific texts (e.g., legal or medical) not covered by the CoNLL-2003 dataset.
2. **Ambiguity**: Ambiguous entities or overlapping spans are not explicitly handled.
## Recommendations
- For domain-specific tasks, consider fine-tuning this model further on a relevant dataset.
- Use a pre-processing step to handle long texts by splitting them into smaller segments before inference (see the chunking sketch below).
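One simple way to stay under BERT's 512-token limit is to split long texts into word-level chunks and run the pipeline on each chunk. The sketch below is one possible approach, not part of this model: the chunk size is an illustrative choice, offsets refer to the whitespace-normalized text, and entities that straddle a chunk boundary may be missed (sentence-aware splitting avoids this when a sentence splitter is available).

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

repo = "sfarrukh/modernbert-conll-ner"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForTokenClassification.from_pretrained(repo)
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")

def ner_long_text(text, max_words=200):
    """Run NER chunk-by-chunk over a long text.

    max_words is an illustrative chunk size well below BERT's 512-token limit.
    Offsets are adjusted to index into the whitespace-normalized text.
    """
    words = text.split()
    entities = []
    for start in range(0, len(words), max_words):
        chunk = " ".join(words[start:start + max_words])
        offset = len(" ".join(words[:start])) + (1 if start else 0)
        for entity in nlp(chunk):
            entity["start"] += offset
            entity["end"] += offset
            entities.append(entity)
    return entities

print(ner_long_text("John lives in New York City. " * 100))
```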
## Acknowledgements
- **Transformers Library**: Hugging Face
- **Dataset**: CoNLL-2003
- **Base Model**: `bert-base-uncased` by Google