---
library_name: transformers
tags:
- page
- classification
base_model:
- google/vit-base-patch16-224
- google/vit-base-patch16-384
- google/vit-large-patch16-384
pipeline_tag: image-classification
license: mit
---

# Image classification using fine-tuned ViT - for sorting historical :bowtie: documents

### Goal: solve the task of sorting archive page images (for their further content-based processing)

**Scope:** processing of images, training and evaluation of the ViT model,
input file/directory processing, output of the top-N class 🏷️ (category)
predictions, summarizing of predictions into a tabular format, and
HF 😊 hub support for the model

## Versions 🏁

There are currently several versions of the model available for download; all of them share the same set of categories,
but they differ in data annotations and in the model base. The latest approved `v2.1` is considered the default and can be found in the `main` branch
of the HF 😊 hub [^1] 🔗 (a loading sketch follows the version table below).

| Version | Base                    | Pages |   PDFs   | Description                                                                |
|--------:|-------------------------|:-----:|:--------:|:---------------------------------------------------------------------------|
|  `v2.0` | `vit-base-patch16-224`  | 10073 | **3896** | annotations with mistakes, more heterogeneous data                         |
|  `v2.1` | `vit-base-patch16-224`  | 11940 | **5002** | `main`: more diverse pages in each category, fewer annotation mistakes     |
|  `v2.2` | `vit-base-patch16-224`  | 15855 | **5730** | same data as `v2.1` + some restored pages from `v2.0`                      |
|  `v3.2` | `vit-base-patch16-384`  | 15855 | **5730** | same data as `v2.2`, but a bit larger model base with higher resolution    |
|  `v5.2` | `vit-large-patch16-384` | 15855 | **5730** | same data as `v2.2`, but the largest model base with higher resolution     |
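
A minimal loading sketch for a chosen version, assuming each version is published as a revision (branch) of the fine-tuned model repository [^1]; only `main` = `v2.1` is confirmed above, so check the repository's branch list for the others:

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification

repo = "k4tel/vit-historical-page"
revision = "main"   # "main" holds the default v2.1; other branch names are an assumption

processor = AutoImageProcessor.from_pretrained(repo, revision=revision)
model = AutoModelForImageClassification.from_pretrained(repo, revision=revision)
```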


## Model description 📇

🔲 Fine-tuned model repository: vit-historical-page [^1] 🔗

🔳 Base model repository: Google's **vit-base-patch16-224**, **vit-base-patch16-384**, **vit-large-patch16-384** [^2] [^6] [^7] 🔗

### Data 📜

Training set of the model: **8950** images for v2.0

Training set of the model: **10745** images for v2.1

### Categories 🏷️

**v2.0 version Categories 🪧**:

|     Label | Ratio  | Description                                                                    |
|----------:|:------:|:-------------------------------------------------------------------------------|
|    `DRAW` | 11.89% | **📈 - drawings, maps, paintings with text**                                   |
|  `DRAW_L` | 8.17%  | **📈📝 - drawings, etc with a table legend or inside tabular layout / forms**  |
| `LINE_HW` | 5.99%  | **✍️📝 - handwritten text lines inside tabular layout / forms**                |
|  `LINE_P` | 6.06%  | **📝 - printed text lines inside tabular layout / forms**                      |
|  `LINE_T` | 13.39% | **📝 - machine typed text lines inside tabular layout / forms**                |
|   `PHOTO` | 10.21% | **🌄 - photos with text**                                                      |
| `PHOTO_L` | 7.86%  | **🌄📝 - photos inside tabular layout / forms or with a tabular annotation**   |
|    `TEXT` | 8.58%  | **📰 - mixed types of printed and handwritten texts**                          |
| `TEXT_HW` | 7.36%  | **✍️📄 - only handwritten text**                                               |
|  `TEXT_P` | 6.95%  | **📄 - only printed text**                                                     |
|  `TEXT_T` | 13.53% | **📄 - only machine typed text**                                               |

**v2.1 version Categories 🪧**:

|     Label | Ratio | Description                                                                    |
|----------:|:-----:|:-------------------------------------------------------------------------------|
|    `DRAW` | 9.12% | **📈 - drawings, maps, paintings with text**                                   |
|  `DRAW_L` | 9.14% | **📈📝 - drawings, etc with a table legend or inside tabular layout / forms**  |
| `LINE_HW` | 8.84% | **✍️📝 - handwritten text lines inside tabular layout / forms**                |
|  `LINE_P` | 9.15% | **📝 - printed text lines inside tabular layout / forms**                      |
|  `LINE_T` | 9.2%  | **📝 - machine typed text lines inside tabular layout / forms**                |
|   `PHOTO` | 9.05% | **🌄 - photos with text**                                                      |
| `PHOTO_L` | 9.1%  | **🌄📝 - photos inside tabular layout / forms or with a tabular annotation**   |
|    `TEXT` | 9.14% | **📰 - mixed types of printed and handwritten texts**                          |
| `TEXT_HW` | 9.14% | **✍️📄 - only handwritten text**                                               |
|  `TEXT_P` | 9.07% | **📄 - only printed text**                                                     |
|  `TEXT_T` | 9.05% | **📄 - only machine typed text**                                               |

Evaluation set (same proportions): **995** images for v2.0

Evaluation set (same proportions): **1194** images for v2.1


#### Data preprocessing 

During training, the following transforms were each applied randomly with a 50% chance (a combined pipeline sketch follows the list):

* transforms.ColorJitter(brightness=0.5)
* transforms.ColorJitter(contrast=0.5)
* transforms.ColorJitter(saturation=0.5)
* transforms.ColorJitter(hue=0.5)
* transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
* transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
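
A possible way to combine them with torchvision, assuming PIL images as input and expressing the 50% chance via `RandomApply` (resizing and normalization are assumed to be handled separately by the ViT image processor):

```python
import random

from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# Each augmentation fires independently with probability 0.5, as listed above.
train_augmentations = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```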

### Training Hyperparameters

* eval_strategy="epoch"
* save_strategy="epoch"
* learning_rate=5e-5
* per_device_train_batch_size=8
* per_device_eval_batch_size=8
* num_train_epochs=3
* warmup_ratio=0.1
* logging_steps=10
* load_best_model_at_end=True
* metric_for_best_model="accuracy"
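
For illustration only, a minimal 😊 `Trainer` setup using these values could look as follows; the base checkpoint, dataset objects, and metric function are assumptions, not the exact training script:

```python
import numpy as np
from transformers import AutoModelForImageClassification, Trainer, TrainingArguments

# 11 page categories 🏷️; the classification head of the base checkpoint is replaced.
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=11,
    ignore_mismatched_sizes=True,
)

training_args = TrainingArguments(
    output_dir="vit-historical-page",
    eval_strategy="epoch",          # called "evaluation_strategy" on older transformers versions
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

def compute_metrics(eval_pred):
    # Plain top-1 accuracy over the evaluation set.
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,    # assumption: split mapped to pixel_values/labels
    eval_dataset=eval_dataset,      # assumption: split mapped to pixel_values/labels
    compute_metrics=compute_metrics,
)
trainer.train()
```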

### Results 📊

**v2.0** Evaluation set's accuracy (**Top-3**):  **99.6%** 

![TOP-3 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250416-1430_conf_mat_TOP-3.png?raw=true)

**v2.1** Evaluation set's accuracy (**Top-3**):  **99.75%**

![TOP-3 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250417-1049_conf_mat_TOP-3.png?raw=true)

**v2.0** Evaluation set's accuracy (**Top-1**):  **97.3%** 

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250416-1436_conf_mat_TOP-1.png?raw=true)

**v2.1** Evaluation set's accuracy (**Top-1**):  **96.82%** 

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250417-1055_conf_mat_TOP-1.png?raw=true)
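
Top-N accuracy counts a page as correct when its true label 🏷️ appears among the model's N highest-scoring guesses; a small self-contained sketch of the metric (array shapes and values are illustrative):

```python
import numpy as np

def top_n_accuracy(logits: np.ndarray, labels: np.ndarray, n: int = 3) -> float:
    # Indices of the n largest scores per row; their internal order does not matter.
    top_n = np.argpartition(logits, -n, axis=-1)[:, -n:]
    hits = (top_n == labels[:, None]).any(axis=-1)
    return float(hits.mean())

# The true label (index 2) is only the 2nd best guess:
# it counts for Top-3 but not for Top-1.
logits = np.array([[0.2, 0.5, 0.3]])
labels = np.array([2])
print(top_n_accuracy(logits, labels, n=3))  # 1.0
print(top_n_accuracy(logits, labels, n=1))  # 0.0
```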

#### Result tables

- **v2.0** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250416-1426_model_1119_3_TOP-3_EVAL.csv) 🔗

- **v2.0** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250416-1431_model_1119_3_TOP-1_EVAL.csv) 🔗

- **v2.1** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250417-1044_model_672_3_TOP-3_EVAL.csv) 🔗

- **v2.1** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250417-1050_model_672_3_TOP-1_EVAL.csv) 🔗

#### Table columns

- **FILE** - name of the file
- **PAGE** - number of the page
- **CLASS-N** - label of the category 🏷️ for the TOP-N guess
- **SCORE-N** - score of the category 🏷️ for the TOP-N guess
- **TRUE** - actual label of the category 🏷️
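
A minimal sketch of how one such TOP-3 row could be assembled for a single image (the repository id is the fine-tuned model [^1]; the file name, page number, and rounding are placeholders):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

repo = "k4tel/vit-historical-page"
processor = AutoImageProcessor.from_pretrained(repo)
model = AutoModelForImageClassification.from_pretrained(repo)

image = Image.open("page.png").convert("RGB")   # placeholder input file
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    scores = model(**inputs).logits.softmax(dim=-1)[0]

top = torch.topk(scores, k=3)                   # TOP-3 guesses
row = {"FILE": "page.png", "PAGE": 1}
for n, (score, idx) in enumerate(zip(top.values, top.indices), start=1):
    row[f"CLASS-{n}"] = model.config.id2label[idx.item()]   # category 🏷️ label
    row[f"SCORE-{n}"] = round(score.item(), 4)
print(row)
```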

### Contacts 📧

For support write to 📧 [email protected] 📧

Official repository: UFAL [^3]

### Acknowledgements 🙏

- **Developed by** UFAL [^5] 👥
- **Funded by** ATRIUM [^4] 💰
- **Shared by** ATRIUM [^4] & UFAL [^5]
- **Model type:** fine-tuned ViT with 224x224 [^2] 🔗 or 384x384 [^6] [^7] 🔗 input resolution

**©️ 2022 UFAL & ATRIUM**

[^1]: https://huggingface.co/k4tel/vit-historical-page
[^2]: https://huggingface.co/google/vit-base-patch16-224
[^3]: https://github.com/ufal/atrium-page-classification
[^4]: https://atrium-research.eu/
[^5]: https://ufal.mff.cuni.cz/home-page
[^6]: https://huggingface.co/google/vit-base-patch16-384
[^7]: https://huggingface.co/google/vit-large-patch16-384