---
license: mit
---

# Image classification using fine-tuned ViT - for historical :bowtie: documents sorting

### Goal: solve a task of archive page images sorting (for their further content-based processing)

**Scope:** Processing of images, training and evaluation of the ViT model, input file/directory processing, output of the top-N class 🏷️ (category) prediction results, summarization of predictions into a tabular format, and HF 🤗 hub support for the model

## Model description 📇

🔲 Fine-tuned model repository: vit-historical-page [^1] 🔗

🔳 Base model repository: google's vit-base-patch16-224 [^2] 🔗

The model was trained on a manually annotated dataset of historical documents - specifically, images of pages from archival documents with paper sources that were scanned into digital form.

The images contain various combinations of texts 📄, tables 📏, drawings 📈, and photos 🌄 - the categories 🏷️ described below were formed based on those archival documents.

The key use case of the provided model and data processing pipeline is to classify an input PNG image from a PDF-scanned paper source into one of the categories - each responsible for its own content-specific downstream processing pipeline.

In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts) / handwritten ✏️ / plain printed 📄 text or text structured in a tabular 📏 format, as well as to mark the presence of printed 🌄 or drawn 📈 graphic materials yet to be extracted from the page images.
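
For example, a minimal inference sketch using the 🤗 `transformers` API (a hedged illustration rather than the repository's own pipeline; the input file name is hypothetical):

```python
# Minimal sketch: classify one page image with the fine-tuned model [^1].
# Assumes the transformers, torch, and Pillow packages are installed.
from transformers import ViTForImageClassification, ViTImageProcessor
from PIL import Image
import torch

MODEL = "k4tel/vit-historical-page"  # fine-tuned model repository

processor = ViTImageProcessor.from_pretrained(MODEL)
model = ViTForImageClassification.from_pretrained(MODEL)

image = Image.open("page.png").convert("RGB")  # hypothetical input PNG page
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Print the top-3 category 🏷️ guesses with their scores
probs = logits.softmax(dim=-1)
top = torch.topk(probs, k=3)
for score, idx in zip(top.values[0], top.indices[0]):
    print(f"{model.config.id2label[idx.item()]}: {score.item():.3f}")
```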

### Data 📜

Training set of the model: **8950** images for v1.0

Training set of the model: **10745** images for v2.0

Manual ✅ annotation was performed beforehand and took some time ⌛; the categories 🏷️ were formed from different sources of archival documents dated 1920-2020.

The ratio of the categories below is **NOT** intentional, but rather a result of the source data's nature.

In total, several hundred separate PDF files were selected and split into PNG pages; some scanned documents were one page long, and some were much longer (dozens or hundreds of pages).

The specific content and language of the source data are irrelevant considering the model's vision resolution; however, all of the data samples come from **archaeological reports**, which may somewhat affect drawing detection, since the commonly drawn objects are ceramic pieces, arrowheads, and rocks - first drawn by hand and later illustrated with digital tools.

### Categories 🏷️

**v1.0 version Categories 🪧**:

| Label     | Ratio  | Description                                                                    |
|----------:|:------:|:-------------------------------------------------------------------------------|
| `DRAW`    | 11.89% | **📈 - drawings, maps, paintings with text**                                    |
| `DRAW_L`  | 8.17%  | **📈📏 - drawings, etc. with a table legend or inside tabular layout / forms**  |
| `LINE_HW` | 5.99%  | **✏️📏 - handwritten text lines inside tabular layout / forms**                 |
| `LINE_P`  | 6.06%  | **📏 - printed text lines inside tabular layout / forms**                       |
| `LINE_T`  | 13.39% | **📏 - machine-typed text lines inside tabular layout / forms**                 |
| `PHOTO`   | 10.21% | **🌄 - photos with text**                                                       |
| `PHOTO_L` | 7.86%  | **🌄📏 - photos inside tabular layout / forms or with a tabular annotation**    |
| `TEXT`    | 8.58%  | **📰 - mixed types of printed and handwritten texts**                           |
| `TEXT_HW` | 7.36%  | **✏️📄 - only handwritten text**                                                |
| `TEXT_P`  | 6.95%  | **📄 - only printed text**                                                      |
| `TEXT_T`  | 13.53% | **📄 - only machine-typed text**                                                |

**v2.0 version Categories 🪧**:

| Label     | Ratio | Description                                                                     |
|----------:|:-----:|:-------------------------------------------------------------------------------|
| `DRAW`    | 9.12% | **📈 - drawings, maps, paintings with text**                                    |
| `DRAW_L`  | 9.14% | **📈📏 - drawings, etc. with a table legend or inside tabular layout / forms**  |
| `LINE_HW` | 8.84% | **✏️📏 - handwritten text lines inside tabular layout / forms**                 |
| `LINE_P`  | 9.15% | **📏 - printed text lines inside tabular layout / forms**                       |
| `LINE_T`  | 9.2%  | **📏 - machine-typed text lines inside tabular layout / forms**                 |
| `PHOTO`   | 9.05% | **🌄 - photos with text**                                                       |
| `PHOTO_L` | 9.1%  | **🌄📏 - photos inside tabular layout / forms or with a tabular annotation**    |
| `TEXT`    | 9.14% | **📰 - mixed types of printed and handwritten texts**                           |
| `TEXT_HW` | 9.14% | **✏️📄 - only handwritten text**                                                |
| `TEXT_P`  | 9.07% | **📄 - only printed text**                                                      |
| `TEXT_T`  | 9.05% | **📄 - only machine-typed text**                                                |

Evaluation set (10% of all data, same proportions): **995** images for v1.0

Evaluation set (10% of all data, same proportions): **1194** images for v2.0

#### Data preprocessing

During training, the following transforms were each applied randomly with a 50% chance (a torchvision sketch follows this list):

* transforms.ColorJitter(brightness=0.5)
* transforms.ColorJitter(contrast=0.5)
* transforms.ColorJitter(saturation=0.5)
* transforms.ColorJitter(hue=0.5)
* transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
* transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
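
A minimal sketch of this augmentation chain, assuming `torchvision` and Pillow (the repository may wire up the 50% chance differently; here it is expressed with `transforms.RandomApply`):

```python
# Augmentation sketch: each transform fires independently with probability 0.5.
import random

from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])

# Usage: augmented = augment(pil_image) on a PIL.Image before the ViT processor.
```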

### Training Hyperparameters

* eval_strategy "epoch"
* save_strategy "epoch"
* learning_rate 5e-5
* per_device_train_batch_size 8
* per_device_eval_batch_size 8
* num_train_epochs 3
* warmup_ratio 0.1
* logging_steps 10
* load_best_model_at_end True
* metric_for_best_model "accuracy"
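
Expressed as 🤗 `transformers` `TrainingArguments`, these settings look roughly as follows (a sketch; `output_dir` is a placeholder, and older `transformers` versions call the first option `evaluation_strategy`):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vit-historical-page",  # placeholder output directory
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```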

### Results 📊

**v1.0** Evaluation set's accuracy (**Top-3**): **99.6%**

![]()

**v2.0** Evaluation set's accuracy (**Top-3**): **99.92%**

![]()

**v1.0** Evaluation set's accuracy (**Top-1**): **97.3%**

![]()

**v2.0** Evaluation set's accuracy (**Top-1**): **96.9%**

![]()
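
The Top-1 and Top-3 figures can be reproduced along these lines (an illustrative sketch assuming `numpy` and the `evaluate` package, not necessarily the repository's exact evaluation code):

```python
import numpy as np
import evaluate

# Top-1 accuracy, matching metric_for_best_model "accuracy" above
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return accuracy.compute(predictions=np.argmax(logits, axis=-1),
                            references=labels)

def top_k_accuracy(logits, labels, k=3):
    # A hit is counted when the true label is among the k highest-scoring classes
    topk = np.argsort(logits, axis=-1)[:, -k:]
    return float(np.mean([label in row for label, row in zip(labels, topk)]))
```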

#### Result tables

- **v1.0** Manually ✅ **checked** evaluation dataset results (TOP-5): [model_TOP-5_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250314-1602_model_1119_3_TOP-5_EVAL.csv) 📎

- **v1.0** Manually ✅ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250314-1606_model_1119_3_TOP-1_EVAL.csv) 📎

- **v2.0** Manually ✅ **checked** evaluation dataset results (TOP-5): [model_TOP-5_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250416-1218_model_672_5_TOP-5_EVAL.csv) 📎

- **v2.0** Manually ✅ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250416-1148_model_672_5_TOP-1_EVAL.csv) 📎

#### Table columns

- **SCORE-N** - score of the category 🏷️, guess TOP-N
- **TRUE** - actual label of the category 🏷️
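
For instance, a result table can be inspected with `pandas` (a sketch; the CSV name is one of the links above, and `CATEGORY-1` is a hypothetical name for the predicted-label column, since only **SCORE-N** and **TRUE** are documented here):

```python
import pandas as pd

df = pd.read_csv("20250416-1148_model_672_5_TOP-1_EVAL.csv")
print(df[["SCORE-1", "TRUE"]].head())           # documented columns
top1 = (df["CATEGORY-1"] == df["TRUE"]).mean()  # hypothetical predicted-label column
print(f"Top-1 accuracy: {top1:.1%}")
```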

### Contacts 📧

For support write to 📧 [email protected] 📧

Official repository: UFAL [^3]

### Acknowledgements 🙏

- **Developed by** UFAL [^5] 👥
- **Funded by** ATRIUM [^4] 💰

**©️ 2022 UFAL & ATRIUM**

[^1]: https://huggingface.co/k4tel/vit-historical-page
[^2]: https://huggingface.co/google/vit-base-patch16-224
[^3]: https://github.com/ufal/atrium-page-classification
[^4]: https://atrium-research.eu/
[^5]: https://ufal.mff.cuni.cz/