---
library_name: transformers
tags:
- page
- classification
base_model:
- google/vit-base-patch16-224
- google/vit-base-patch16-384
- google/vit-large-patch16-384
pipeline_tag: image-classification
license: mit
---
# Image classification using fine-tuned ViT - for historical :bowtie: document sorting
### Goal: solve the task of sorting archive page images (for their further content-based processing)
**Scope:** image processing, training and evaluation of the ViT model,
input file/directory handling, output of the top-N predicted class
(category) results, summarization of predictions into a tabular format,
and HF hub support for the model
## Versions
There are currently 5 versions of the model available for download. All of them share the same set of categories,
but they differ in data annotations, training set size, and model base. The latest approved version, `v2.1`, is the default
and can be found in the `main` branch of the HF hub [^1]. A minimal loading sketch follows the table below.
| Version | Base                    | Pages | PDFs     | Description                                                                  |
|--------:|-------------------------|:-----:|:--------:|:-----------------------------------------------------------------------------|
| `v2.0`  | `vit-base-patch16-224`  | 10073 | **3896** | annotations with mistakes, more heterogeneous data                            |
| `v2.1`  | `vit-base-patch16-224`  | 11940 | **5002** | `main`: more diverse pages in each category, fewer annotation mistakes       |
| `v2.2`  | `vit-base-patch16-224`  | 15855 | **5730** | same data as `v2.1` + some restored pages from `v2.0`                         |
| `v3.2`  | `vit-base-patch16-384`  | 15855 | **5730** | same data as `v2.2`, but with a higher input resolution (384x384)             |
| `v5.2`  | `vit-large-patch16-384` | 15855 | **5730** | same data as `v2.2`, but the largest model base with higher resolution        |
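Each version lives on the HF hub [^1] and can be pinned when loading; a minimal sketch using the `transformers` auto classes and the confirmed `main` branch (which holds `v2.1`) as the `revision` (branch names for the other versions are not confirmed by this card):

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification

MODEL = "k4tel/vit-historical-page"  # fine-tuned model repository [^1]

# Pin the default v2.1 version by loading from the `main` branch.
processor = AutoImageProcessor.from_pretrained(MODEL, revision="main")
model = AutoModelForImageClassification.from_pretrained(MODEL, revision="main")
```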
## Model description
Fine-tuned model repository: **vit-historical-page** [^1]
Base model repositories: Google's **vit-base-patch16-224**, **vit-base-patch16-384**, and **vit-large-patch16-384** [^2] [^6] [^7]
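For quick inference, a minimal sketch using the `transformers` image-classification pipeline to print the top-3 category guesses for one page image (the file name `page.png` is a placeholder):

```python
from transformers import pipeline

# Image-classification pipeline backed by the fine-tuned model [^1].
classifier = pipeline("image-classification", model="k4tel/vit-historical-page")

# Top-3 predictions for a single page image: each entry has a label and a score.
for prediction in classifier("page.png", top_k=3):
    print(f"{prediction['label']}: {prediction['score']:.4f}")
```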
### Data
Training set sizes:
* `v2.0`: **8950** images
* `v2.1`: **10745** images
### Categories
**v2.0 categories**:
| Label     | Ratio  | Description                                                              |
|----------:|:------:|:--------------------------------------------------------------------------|
| `DRAW`    | 11.89% | **drawings, maps, paintings with text**                                    |
| `DRAW_L`  | 8.17%  | **drawings, etc. with a table legend or inside tabular layout / forms**    |
| `LINE_HW` | 5.99%  | **handwritten text lines inside tabular layout / forms**                   |
| `LINE_P`  | 6.06%  | **printed text lines inside tabular layout / forms**                       |
| `LINE_T`  | 13.39% | **machine-typed text lines inside tabular layout / forms**                 |
| `PHOTO`   | 10.21% | **photos with text**                                                       |
| `PHOTO_L` | 7.86%  | **photos inside tabular layout / forms or with a tabular annotation**      |
| `TEXT`    | 8.58%  | **mixed types of printed and handwritten texts**                           |
| `TEXT_HW` | 7.36%  | **only handwritten text**                                                  |
| `TEXT_P`  | 6.95%  | **only printed text**                                                      |
| `TEXT_T`  | 13.53% | **only machine-typed text**                                                |
**v2.1 categories**:
| Label     | Ratio | Description                                                              |
|----------:|:-----:|:---------------------------------------------------------------------------|
| `DRAW`    | 9.12% | **drawings, maps, paintings with text**                                     |
| `DRAW_L`  | 9.14% | **drawings, etc. with a table legend or inside tabular layout / forms**     |
| `LINE_HW` | 8.84% | **handwritten text lines inside tabular layout / forms**                    |
| `LINE_P`  | 9.15% | **printed text lines inside tabular layout / forms**                        |
| `LINE_T`  | 9.20% | **machine-typed text lines inside tabular layout / forms**                  |
| `PHOTO`   | 9.05% | **photos with text**                                                        |
| `PHOTO_L` | 9.10% | **photos inside tabular layout / forms or with a tabular annotation**       |
| `TEXT`    | 9.14% | **mixed types of printed and handwritten texts**                            |
| `TEXT_HW` | 9.14% | **only handwritten text**                                                   |
| `TEXT_P`  | 9.07% | **only printed text**                                                       |
| `TEXT_T`  | 9.05% | **only machine-typed text**                                                 |
Evaluation set sizes (same category proportions):
* `v2.0`: **995** images
* `v2.1`: **1194** images
#### Data preprocessing
During training, each of the following transforms was applied randomly with a 50% chance (see the sketch after this list):
* `transforms.ColorJitter(brightness=0.5)`
* `transforms.ColorJitter(contrast=0.5)`
* `transforms.ColorJitter(saturation=0.5)`
* `transforms.ColorJitter(hue=0.5)`
* `transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))`
* `transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))`
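A sketch of how these transforms can be composed so that each one fires independently with a 50% chance; wrapping each in `torchvision.transforms.RandomApply` is an assumption about how the probability was wired up, not a confirmed detail of the training script:

```python
import random

from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# Each augmentation is wrapped in RandomApply with p=0.5, so every training
# image gets an independent 50% chance of each distortion.
train_augmentations = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```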
### Training Hyperparameters
The key training settings (a sketch assembling them into `TrainingArguments` follows the list):
* `eval_strategy="epoch"`
* `save_strategy="epoch"`
* `learning_rate=5e-5`
* `per_device_train_batch_size=8`
* `per_device_eval_batch_size=8`
* `num_train_epochs=3`
* `warmup_ratio=0.1`
* `logging_steps=10`
* `load_best_model_at_end=True`
* `metric_for_best_model="accuracy"`
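A sketch putting these hyperparameters into a `transformers` `TrainingArguments` object; the `output_dir`, the accuracy-metric wiring, and anything not listed above are assumptions (the argument name `eval_strategy` matches recent `transformers` releases):

```python
import numpy as np
import evaluate
from transformers import TrainingArguments

# Accuracy metric used for best-model selection (load_best_model_at_end).
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return accuracy.compute(predictions=np.argmax(logits, axis=-1), references=labels)

training_args = TrainingArguments(
    output_dir="vit-historical-page",  # placeholder output directory
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```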
### Results
**v2.0** Evaluation set's accuracy (**Top-3**): **99.6%**

**v2.1** Evaluation set's accuracy (**Top-3**): **99.75%**

**v2.0** Evaluation set's accuracy (**Top-1**): **97.3%**

**v2.1** Evaluation set's accuracy (**Top-1**): **96.82%**

#### Result tables
- **v2.0** Manually **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250416-1426_model_1119_3_TOP-3_EVAL.csv)
- **v2.0** Manually **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250416-1431_model_1119_3_TOP-1_EVAL.csv)
- **v2.1** Manually **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250417-1044_model_672_3_TOP-3_EVAL.csv)
- **v2.1** Manually **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250417-1050_model_672_3_TOP-1_EVAL.csv)
#### Table columns
- **FILE** - name of the processed file
- **PAGE** - page number
- **CLASS-N** - label of the TOP-N predicted category
- **SCORE-N** - confidence score of the TOP-N predicted category
- **TRUE** - actual (manually checked) category label

A short snippet for inspecting these tables is sketched below.
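The sketch assumes comma-separated files, the column names described above, and a placeholder file name:

```python
import pandas as pd

df = pd.read_csv("model_TOP-1_EVAL.csv")

# Rows where the top-1 predicted category disagrees with the true label.
errors = df[df["CLASS-1"] != df["TRUE"]]
print(errors[["FILE", "PAGE", "CLASS-1", "SCORE-1", "TRUE"]])
```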
### Contacts
For support, write to [email protected]
Official repository: UFAL [^3]
### Acknowledgements
- **Developed by:** UFAL [^5]
- **Funded by:** ATRIUM [^4]
- **Shared by:** ATRIUM [^4] & UFAL [^5]
- **Model type:** fine-tuned ViT with 224x224 [^2] or 384x384 [^6] [^7] input resolution
**© 2022 UFAL & ATRIUM**
[^1]: https://huggingface.co/k4tel/vit-historical-page
[^2]: https://huggingface.co/google/vit-base-patch16-224
[^3]: https://github.com/ufal/atrium-page-classification
[^4]: https://atrium-research.eu/
[^5]: https://ufal.mff.cuni.cz/home-page
[^6]: https://huggingface.co/google/vit-base-patch16-384
[^7]: https://huggingface.co/google/vit-large-patch16-384