---
library_name: transformers
tags:
- chunking
- RAG
license: mit
datasets:
- bookcorpus/bookcorpus
- JeanKaddour/minipile
language:
- en
base_model:
- answerdotai/ModernBERT-large
---
# Chonky modernbert large v1
__Chonky__ is a transformer model that intelligently segments text into meaningful semantic chunks. This model can be used in RAG systems.
## Model Description
The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
⚠️ This model was fine-tuned on sequences of length 1024 (by default, ModernBERT supports sequence lengths up to 8192).
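If your input is longer than that, one simple workaround (an illustration, not a built-in feature of the model or the chonky library) is to pre-split the document into windows with the tokenizer and chunk each window separately:

```python
# Sketch: pre-split a long document into <=1024-token windows before chunking.
# This is an assumption-based workaround, not a tuned recommendation; decoding
# and re-tokenizing can slightly alter whitespace.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mirth/chonky_modernbert_large_1")
long_text = "..."  # your document, possibly longer than 1024 tokens

enc = tokenizer(
    long_text,
    max_length=1024,
    truncation=True,
    return_overflowing_tokens=True,  # keep the text left over after each window
)
windows = [tokenizer.decode(ids, skip_special_tokens=True) for ids in enc["input_ids"]]
# Each window can now be passed to the splitter on its own.
```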
## How to use
I've made a small Python library for this model: [chonky](https://github.com/mirth/chonky).
Here is how to use it:
```python
from chonky import ParagraphSplitter

# On the first run, this will download the transformer model.
splitter = ParagraphSplitter(
    model_id="mirth/chonky_modernbert_large_1",
    device="cpu",
)

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

for chunk in splitter(text):
    print(chunk)
    print("--")
```
### Sample Output
```
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories.
--
My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing."
--
This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it.
--
It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
--
```
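As mentioned above, the resulting chunks can be fed into an embedding-based retrieval system. A minimal sketch (the `sentence-transformers` library and the embedding model name are illustrative choices, not recommendations tied to Chonky):

```python
# Sketch: embed the produced chunks for a retrieval index.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary embedding model
chunks = list(splitter(text))          # chunks from the example above
embeddings = embedder.encode(chunks)   # one vector per chunk
print(embeddings.shape)                # (num_chunks, embedding_dim)
```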
Alternatively, you can use this model with the standard NER pipeline:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "mirth/chonky_modernbert_large_1"
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)

id2label = {
    0: "O",
    1: "separator",
}
label2id = {
    "O": 0,
    "separator": 1,
}

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

pipe(text)
```
### Sample Output
```
[
{'entity_group': 'separator', 'score': np.float32(0.91590524), 'word': ' stories.', 'start': 209, 'end': 218},
{'entity_group': 'separator', 'score': np.float32(0.6210419), 'word': ' processing."', 'start': 455, 'end': 468},
{'entity_group': 'separator', 'score': np.float32(0.7071036), 'word': '.', 'start': 652, 'end': 653}
]
```
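The pipeline only returns the separator spans. To recover the actual chunks, you can cut the text at each span's character `end` offset (a small helper sketch, not part of `transformers`):

```python
# Turn the predicted separator spans into text chunks by cutting the input
# at each span's character `end` offset.
def spans_to_chunks(text, spans):
    chunks, prev = [], 0
    for span in spans:
        chunks.append(text[prev:span["end"]].strip())
        prev = span["end"]
    if prev < len(text):
        chunks.append(text[prev:].strip())
    return chunks

for chunk in spans_to_chunks(text, pipe(text)):
    print(chunk)
    print("--")
```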
## Training Data
The model was trained to split paragraphs from the minipile and bookcorpus datasets.
## Metrics
Token-based metrics for minipile:

| Metric | Value |
| -------- | ------|
| F1 | 0.85 |
| Precision| 0.87 |
| Recall | 0.82 |
| Accuracy | 0.99 |

Token-based metrics for bookcorpus:

| Metric | Value |
| -------- | ------|
| F1 | 0.79 |
| Precision| 0.85 |
| Recall | 0.74 |
| Accuracy | 0.99 |
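
"Token-based" presumably means the task is scored as binary token classification, with `separator` (label 1, per the label set above) as the positive class. A toy sketch of how such scores are computed, with made-up labels:

```python
# Toy illustration only: every token is labeled 0 ("O") or 1 ("separator"),
# and standard binary-classification metrics are computed over those labels.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1]  # gold per-token labels (made up)
y_pred = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]  # model predictions (made up)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)
print(f"F1={f1:.2f} P={precision:.2f} R={recall:.2f} "
      f"Acc={accuracy_score(y_true, y_pred):.2f}")
```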
## Hardware
The model was fine-tuned on a single H100 GPU for several hours.