---
language:
- en
tags:
- charboundary
- sentence-boundary-detection
- paragraph-detection
- legal-text
- legal-nlp
- text-segmentation
- onnx
- cpu
- document-processing
- rag
- optimized-inference
license: mit
library_name: charboundary
pipeline_tag: text-classification
datasets:
- alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries
- alea-institute/kl3m-data-snapshot-20250324
metrics:
- accuracy
- f1
- precision
- recall
- throughput
papers:
- https://arxiv.org/abs/2504.04131
---

# CharBoundary large ONNX Model

This is the large ONNX model for the [CharBoundary](https://github.com/alea-institute/charboundary) library (v0.5.0), a fast character-based sentence and paragraph boundary detection system optimized for legal text.

## Model Details

- **Size**: large
- **Model Size**: 12.0 MB (ONNX compressed)
- **Memory Usage**: 5734 MB at runtime (non-ONNX version)
- **Training Data**: Legal text with ~5,000,000 samples from [KL3M dataset](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324)
- **Model Type**: Random Forest (100 trees, max depth 24) converted to ONNX
- **Format**: ONNX optimized for inference
- **Task**: Character-level boundary detection for text segmentation
- **License**: MIT
- **Throughput**: ~518K characters/second (base model; ONNX is typically 2-4x faster)

## Usage

> **Security Advantage:** This ONNX model format provides enhanced security compared to SKOPS models, as it doesn't require bypassing security measures with `trust_model=True`. ONNX models are the recommended option for security-sensitive environments.

```python
# Make sure to install with the onnx extra to get ONNX runtime support
# pip install charboundary[onnx]
from charboundary import get_large_onnx_segmenter

# First load can be slow
segmenter = get_large_onnx_segmenter()

# Use the model
text = "This is a test sentence. Here's another one!"
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ['This is a test sentence.', " Here's another one!"]

# Segment to spans
sentence_spans = segmenter.get_sentence_spans(text)
print(sentence_spans)
# Output: [(0, 24), (24, 44)]
```

## Performance

ONNX models provide significantly faster inference compared to the standard scikit-learn models while maintaining the same accuracy metrics. The performance differences between model sizes are shown below.

### Base Model Performance

| Dataset | Precision | F1 | Recall |
|---------|-----------|-------|--------|
| ALEA SBD Benchmark | 0.637 | 0.727 | 0.847 |
| SCOTUS | 0.950 | 0.778 | 0.658 |
| Cyber Crime | 0.968 | 0.853 | 0.762 |
| BVA | 0.963 | 0.881 | 0.813 |
| Intellectual Property | 0.954 | 0.890 | 0.834 |

### Size and Speed Comparison

| Model | Format | Size (MB) | Memory Usage | Throughput (chars/sec) | F1 Score |
|-------|--------|-----------|--------------|------------------------|----------|
| Small | [SKOPS](https://huggingface.co/alea-institute/charboundary-small) / [ONNX](https://huggingface.co/alea-institute/charboundary-small-onnx) | 3.0 / 0.5 | 1,026 MB | ~748K | 0.773 |
| Medium | [SKOPS](https://huggingface.co/alea-institute/charboundary-medium) / [ONNX](https://huggingface.co/alea-institute/charboundary-medium-onnx) | 13.0 / 2.6 | 1,897 MB | ~587K | 0.779 |
| Large | [SKOPS](https://huggingface.co/alea-institute/charboundary-large) / [ONNX](https://huggingface.co/alea-institute/charboundary-large-onnx) | 60.0 / 13.0 | 5,734 MB | ~518K | 0.782 |

## Paper and Citation

This model is part of the research presented in the following paper:

```
@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}
```

For more details on the model architecture, training, and evaluation, please see:

- [Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"](https://arxiv.org/abs/2504.04131)
- [CharBoundary GitHub repository](https://github.com/alea-institute/charboundary)
- [Annotated training data](https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries)

## Contact

This model is developed and maintained by the [ALEA Institute](https://aleainstitute.ai).

For technical support, collaboration opportunities, or general inquiries:

- GitHub: https://github.com/alea-institute/kl3m-model-research
- Email: hello@aleainstitute.ai
- Website: https://aleainstitute.ai

For any questions, please contact [ALEA Institute](https://aleainstitute.ai) at [hello@aleainstitute.ai](mailto:hello@aleainstitute.ai) or create an issue on this repository or [GitHub](https://github.com/alea-institute/kl3m-model-research).

![https://aleainstitute.ai](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)
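
## Example: Working with Sentence Spans

The usage example in this card reports sentence spans as `(start, end)` character offsets into the input text. The sketch below shows one way such spans might be consumed downstream, for instance to build retrieval chunks. It hard-codes the spans from the documented output so it runs without the library installed. The helper `chunk_by_spans` and its `max_chars` parameter are hypothetical names introduced for illustration and are not part of the CharBoundary API.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]


def chunk_by_spans(text: str, spans: List[Span], max_chars: int) -> List[str]:
    """Greedily merge consecutive sentence spans into chunks.

    Hypothetical helper, not part of charboundary. Assumes spans are sorted
    and non-overlapping. A chunk grows while it stays within max_chars; a
    single sentence longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    start: Optional[int] = None
    end = 0
    for s, e in spans:
        if start is None:
            # Open the first chunk at the first sentence.
            start, end = s, e
        elif e - start <= max_chars:
            # The sentence still fits, so extend the current chunk.
            end = e
        else:
            # Close the current chunk and open a new one at this sentence.
            chunks.append(text[start:end])
            start, end = s, e
    if start is not None:
        chunks.append(text[start:end])
    return chunks


text = "This is a test sentence. Here's another one!"
# Spans copied from the documented output of get_sentence_spans(text).
sentence_spans = [(0, 24), (24, 44)]

# Slicing the text with each span recovers the sentence strings.
sentences = [text[s:e] for s, e in sentence_spans]

chunks_small = chunk_by_spans(text, sentence_spans, max_chars=30)
chunks_large = chunk_by_spans(text, sentence_spans, max_chars=100)
```

With a 30-character budget the two sentences stay in separate chunks, while a 100-character budget merges them into one. Working with offsets rather than sentence strings keeps each chunk traceable to its exact position in the source document.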