---
license: mit
datasets:
- eriktks/conll2003
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: token-classification
library_name: transformers
tags:
- ner
---

# Model Card: BERT for Named Entity Recognition (NER)

## Model Overview

This model, **bert-conll-ner**, is a fine-tuned version of `bert-base-uncased` trained for Named Entity Recognition (NER) on the [CoNLL-2003](https://huggingface.co/datasets/eriktks/conll2003) dataset. It identifies and classifies entities in text, such as **person names (PER)**, **organizations (ORG)**, **locations (LOC)**, and **miscellaneous (MISC)** entities.

### Model Architecture

- **Base Model**: BERT (Bidirectional Encoder Representations from Transformers) with the `bert-base-uncased` architecture.
- **Task**: Token Classification (NER).

## Training Dataset

- **Dataset**: CoNLL-2003, a standard dataset for NER tasks containing sentences annotated with named entity spans.
- **Classes**:
  - `PER` (Person)
  - `ORG` (Organization)
  - `LOC` (Location)
  - `MISC` (Miscellaneous)
  - `O` (Outside of any entity span)

## Performance Metrics

The model achieves the following results on the CoNLL-2003 evaluation set:

| Metric        | Value  |
|---------------|--------|
| **Loss**      | 0.0649 |
| **Precision** | 93.59% |
| **Recall**    | 95.07% |
| **F1 Score**  | 94.32% |
| **Accuracy**  | 98.79% |

These metrics indicate that the model is accurate and robust at identifying and classifying entities.

## Training Details

- **Optimizer**: AdamW (Adam with weight decay)
- **Learning Rate**: 2e-5
- **Batch Size**: 8
- **Number of Epochs**: 3
- **Scheduler**: Linear scheduler with warm-up steps
- **Loss Function**: Cross-entropy loss, with label `-100` ignored for padding and special tokens

## Model Input/Output

- **Input Format**: Tokenized text with special tokens `[CLS]` and `[SEP]`.
- **Output Format**: Token-level predictions with corresponding labels from the NER tag set (`B-PER`, `I-PER`, etc.).

## How to Use the Model

### Installation

```bash
pip install transformers
```

### Loading the Model

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("sfarrukh/modernbert-conll-ner")
model = AutoModelForTokenClassification.from_pretrained("sfarrukh/modernbert-conll-ner")
```

### Running Inference

```python
from transformers import pipeline

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "John lives in New York City."
result = nlp(text)
print(result)
```

Expected output:

```python
[{'entity_group': 'PER', 'score': 0.99912304, 'word': 'john', 'start': 0, 'end': 4},
 {'entity_group': 'LOC', 'score': 0.9993351, 'word': 'new york city', 'start': 14, 'end': 27}]
```

## Limitations

1. **Domain-Specific Adaptability**: Performance may drop on domain-specific texts (e.g., legal or medical) that are not covered by the CoNLL-2003 dataset.
2. **Ambiguity**: Ambiguous entities and overlapping spans are not explicitly handled.

## Recommendations

- For domain-specific tasks, consider fine-tuning this model further on a relevant dataset (see the fine-tuning sketch at the end of this card).
- Use a pre-processing pipeline to handle long texts by splitting them into smaller segments (see the chunking sketch at the end of this card).

## Acknowledgements

- **Transformers Library**: Hugging Face
- **Dataset**: CoNLL-2003
- **Base Model**: `bert-base-uncased` by Google
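
## Example Sketches

The two items under Recommendations can be implemented in a few lines each. Both sketches below are illustrative rather than part of this repository: the chunk size, the `ner_on_long_text` helper, and the `your-domain-ner-dataset` identifier are assumptions chosen for the example.

### Handling Long Texts

BERT accepts at most 512 sub-word tokens per input, so longer documents should be split before they reach the pipeline. A minimal sketch, assuming whitespace-separated text and reusing the `nlp` pipeline from the inference example above:

```python
def ner_on_long_text(text, max_words=200):
    """Run NER on a long text by splitting it into word-based chunks
    and shifting entity offsets back to positions in the original text."""
    words = text.split()
    entities = []
    search_from = 0  # character position where the next chunk is expected to start
    for i in range(0, len(words), max_words):
        chunk = " ".join(words[i:i + max_words])
        # Locate the chunk in the original text (assumes single-space separation)
        offset = text.find(chunk, search_from)
        for ent in nlp(chunk):
            ent["start"] += offset
            ent["end"] += offset
            entities.append(ent)
        search_from = offset + len(chunk)
    return entities

long_text = " ".join(["John lives in New York City."] * 100)
print(ner_on_long_text(long_text)[:2])
```

A threshold of 200 words keeps each chunk comfortably under the 512-token limit for typical English text; splitting on sentence boundaries instead would avoid cutting entities in half at chunk edges.

### Further Fine-Tuning on a Domain Dataset

To adapt the model to legal, medical, or other specialised text, it can be fine-tuned again with the same hyperparameters listed under Training Details. The dataset name below is a placeholder; any dataset with CoNLL-style `tokens` and `ner_tags` columns should work, and the `datasets` library is assumed to be installed alongside `transformers`:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

dataset = load_dataset("your-domain-ner-dataset")  # placeholder dataset id
label_list = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("sfarrukh/modernbert-conll-ner")
model = AutoModelForTokenClassification.from_pretrained(
    "sfarrukh/modernbert-conll-ner",
    num_labels=len(label_list),
    ignore_mismatched_sizes=True,  # the domain label set may differ from CoNLL-2003
)

def tokenize_and_align_labels(examples):
    """Tokenize pre-split words and align word-level NER tags to sub-tokens."""
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous, ids = None, []
        for word_id in word_ids:
            if word_id is None:
                ids.append(-100)           # special tokens: ignored by the loss
            elif word_id != previous:
                ids.append(tags[word_id])  # label only the first sub-token of each word
            else:
                ids.append(-100)           # remaining sub-tokens: ignored by the loss
            previous = word_id
        labels.append(ids)
    tokenized["labels"] = labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

args = TrainingArguments(
    output_dir="bert-conll-ner-domain",
    learning_rate=2e-5,             # same settings as the original fine-tuning run
    per_device_train_batch_size=8,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],  # assumes a validation split exists
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```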