k4tel committed · Commit 089118d · verified · 1 Parent(s): 89ba66d

Update README.md

Files changed (1): README.md (+81 -98)

README.md CHANGED
@@ -11,132 +11,115 @@ license: mit

# Image classification using fine-tuned ViT - for historical :bowtie: documents sorting

- ## Goal: sort archive page images for their further content-based processing

**Scope:** processing of images, training and evaluation of a ViT model,
input file/directory processing, output of class 🏷️ (category) results for the top
N predictions, summarizing of predictions into a tabular format,
- HF 😊 hub support for the model, and multiplatform (Win/Lin) data preparation scripts for PDF-to-PNG conversion

## Model description 📇

- 🔲 Fine-tuned model repository: **UFAL's vit-historical-page** [^1] 🔗

- 🔳 Base model repository: **Google's vit-base-patch16-224** [^2] 🔗

- The model was trained on a manually annotated dataset of historical documents, in particular images of pages
- from archival documents with paper sources that were scanned into digital form.

- The images contain various combinations of texts 📄, tables 📝, drawings 📈, and photos 🌄 -
- the categories 🏷️ described below were formed based on those archival documents.

- The key use case of the provided model and data processing pipeline is to classify an input PNG image, obtained from a scanned
- PDF paper source, into one of the categories - each of which triggers its own content-specific data processing pipeline.

- In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to
- mark the input data as machine-typed (old-style fonts) / handwritten ✏️ / plain printed 📄 text or text structured in a tabular 📝
- format, as well as to mark the presence of printed 🌄 or drawn 📈 graphic materials yet to be extracted from the page images.
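As an illustration of that routing step, here is a minimal classification sketch - assuming the transformers and torch libraries, the hub id from footnote [^1], and a hypothetical page.png input; this is not the repository's own pipeline code:

```python
# Hedged sketch: load the fine-tuned page classifier from the HF hub and
# print the top-3 category guesses for one scanned page image.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

REPO = "ufal/vit-historical-page"  # hub id per footnote [^1]

processor = AutoImageProcessor.from_pretrained(REPO)
model = AutoModelForImageClassification.from_pretrained(REPO)

image = Image.open("page.png").convert("RGB")  # hypothetical PNG page
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)

# Route the page to a content-specific pipeline based on these labels.
top3 = torch.topk(probs, k=3)
for score, idx in zip(top3.values[0], top3.indices[0]):
    print(model.config.id2label[idx.item()], round(score.item(), 3))
```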
 
### Data 📜

- Training set of the model: **8950** images

- Evaluation set (10% of all data, with the same proportions as below) [model_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250209-1534_model_1119_3_EVAL.csv) 📎: **995** images

- Manual ✍ annotation was performed beforehand and took some time ⌛; the categories 🏷️ were formed from
- different sources of archival documents dated 1920-2020.

- The disproportion of the categories 🏷️ is
- **NOT** intentional, but rather a result of the nature of the source data.

- In total, several hundred separate PDF files were selected and split into PNG pages (one possible way to do such a split is sketched below); some scanned documents
- were one page long and some were much longer (dozens or hundreds of pages).

- The specific content and language of the
- source data is irrelevant considering the model's vision resolution; however, all of the data samples were from **archaeological
- reports**, which may somehow affect the detection of drawings, since the commonly drawn objects are ceramic pieces,
- arrowheads, and rocks, first drawn by hand and later illustrated with digital tools.
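For the PDF-to-PNG split mentioned above, a sketch of one possible approach - assuming the pdf2image library (with poppler installed) and a hypothetical report.pdf input; the repository's own Win/Lin scripts may differ:

```python
# Hedged sketch: split a scanned PDF into numbered PNG pages.
from pathlib import Path
from pdf2image import convert_from_path

pdf = Path("report.pdf")  # hypothetical multi-page scanned source
for i, page in enumerate(convert_from_path(pdf, dpi=300), start=1):
    page.save(f"{pdf.stem}_{i:04d}.png", "PNG")  # one PNG per page
```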
 
### Categories 🏷️

- | Label | Ratio | Description |
- |------------:|:-------:|:------------------------------------------------------------------------------|
- | **DRAW** | 11.89% | **📈 - drawings, maps, paintings with text** |
- | **DRAW_L** | 8.17% | **📈📝 - drawings, etc. with a table legend or inside tabular layout / forms** |
- | **LINE_HW** | 5.99% | **✏️📝 - handwritten text lines inside tabular layout / forms** |
- | **LINE_P** | 6.06% | **📝 - printed text lines inside tabular layout / forms** |
- | **LINE_T** | 13.39% | **📝 - machine-typed text lines inside tabular layout / forms** |
- | **PHOTO** | 10.21% | **🌄 - photos with text** |
- | **PHOTO_L** | 7.86% | **🌄📝 - photos inside tabular layout / forms or with a tabular annotation** |
- | **TEXT** | 8.58% | **📰 - mixed types of printed and handwritten texts** |
- | **TEXT_HW** | 7.36% | **✏️📄 - only handwritten text** |
- | **TEXT_P** | 6.95% | **📄 - only printed text** |
- | **TEXT_T** | 13.53% | **📄 - only machine-typed text** |

- The categories were chosen to sort the pages by the following criteria:

- - **presence of graphical elements** (drawings 📈 OR photos 🌄)
- - **type of text** 📄 (handwritten ✏️ OR printed OR typed OR mixed 📰)
- - **presence of tabular layout / forms** 📝

- The reason for such a distinction is that different types of pages require different processing pipelines, which would be
- applied after the classification.

- ### Training

- During training, image transformations were applied sequentially, each with a 50% chance.

- <details>

- <summary>Image preprocessing steps 👀</summary>

- * transforms.ColorJitter(brightness=0.5)
- * transforms.ColorJitter(contrast=0.5)
- * transforms.ColorJitter(saturation=0.5)
- * transforms.ColorJitter(hue=0.5)
- * transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
- * transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))

- </details>

- > [!NOTE]
- > No rotation, reshaping, or flipping was applied to the images; mainly color manipulations were used. The
- > reasons behind this are pages containing specific form types, the general text orientation on the pages, and the default
- > reshaping of the model input to square 224x224 images.

- <details>

- <summary>Training hyperparameters 👀</summary>

* eval_strategy "epoch"
* save_strategy "epoch"
- * learning_rate **5e-5**
* per_device_train_batch_size 8
* per_device_eval_batch_size 8
- * num_train_epochs **3**
- * warmup_ratio **0.1**
- * logging_steps **10**
* load_best_model_at_end True
- * metric_for_best_model "accuracy"

- </details>

- ## Results 📊

- Evaluation set's accuracy (**Top-3**): **99.6%**

- ![TOP-3 confusion matrix - trained ViT](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/plots/20250209-1526_conf_mat.png?raw=true)

- Evaluation set's accuracy (**Top-1**): **97.3%**

- ![TOP-1 confusion matrix - trained ViT](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/plots/20250218-1523_conf_mat.png?raw=true)

 
135
  #### Result tables
136
 
137
- - Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/tables/20250209-1534_model_1119_3_TOP-3_EVAL.csv) πŸ”—
 
 
 
 
138
 
139
- - Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/tables/20250218-1519_model_1119_3_TOP-1_EVAL.csv) πŸ”—
140
 
141
  #### Table columns
142
 
@@ -146,13 +129,13 @@ Evaluation set's accuracy (**Top-1**): **97.3%**
- **SCORE-N** - score of the category 🏷️ for the TOP-N guess
- **TRUE** - actual label of the category 🏷️

- ## Contacts 📧

For support write to 📧 [email protected] 📧

Official repository: UFAL [^3]

- ## Acknowledgements 🙏

- **Developed by** UFAL [^5] 👥
- **Funded by** ATRIUM [^4] 💰
@@ -161,7 +144,7 @@ Official repository: UFAL [^3]

**©️ 2022 UFAL & ATRIUM**

- [^1]: https://huggingface.co/ufal/vit-historical-page
[^2]: https://huggingface.co/google/vit-base-patch16-224
[^3]: https://github.com/ufal/atrium-page-classification
[^4]: https://atrium-research.eu/


# Image classification using fine-tuned ViT - for historical :bowtie: documents sorting

+ ### Goal: sort archive page images for their further content-based processing

**Scope:** processing of images, training and evaluation of a ViT model,
input file/directory processing, output of class 🏷️ (category) results for the top
N predictions, summarizing of predictions into a tabular format,
+ HF 😊 hub support for the model

## Model description 📇

+ 🔲 Fine-tuned model repository: vit-historical-page [^1] 🔗

+ 🔳 Base model repository: Google's vit-base-patch16-224 [^2] 🔗

 
### Data 📜

+ Training set of the model: **8950** images for v1.0

+ Training set of the model: **10745** images for v2.0

### Categories 🏷️

+ **v1.0 version Categories 🪧**:
+
+ | Label | Ratio | Description |
+ |----------:|:------:|:------------------------------------------------------------------------------|
+ | `DRAW` | 11.89% | **📈 - drawings, maps, paintings with text** |
+ | `DRAW_L` | 8.17% | **📈📝 - drawings, etc. with a table legend or inside tabular layout / forms** |
+ | `LINE_HW` | 5.99% | **✏️📝 - handwritten text lines inside tabular layout / forms** |
+ | `LINE_P` | 6.06% | **📝 - printed text lines inside tabular layout / forms** |
+ | `LINE_T` | 13.39% | **📝 - machine-typed text lines inside tabular layout / forms** |
+ | `PHOTO` | 10.21% | **🌄 - photos with text** |
+ | `PHOTO_L` | 7.86% | **🌄📝 - photos inside tabular layout / forms or with a tabular annotation** |
+ | `TEXT` | 8.58% | **📰 - mixed types of printed and handwritten texts** |
+ | `TEXT_HW` | 7.36% | **✏️📄 - only handwritten text** |
+ | `TEXT_P` | 6.95% | **📄 - only printed text** |
+ | `TEXT_T` | 13.53% | **📄 - only machine-typed text** |
+
+ **v2.0 version Categories 🪧**:
+
+ | Label | Ratio | Description |
+ |----------:|:-----:|:------------------------------------------------------------------------------|
+ | `DRAW` | 9.12% | **📈 - drawings, maps, paintings with text** |
+ | `DRAW_L` | 9.14% | **📈📝 - drawings, etc. with a table legend or inside tabular layout / forms** |
+ | `LINE_HW` | 8.84% | **✏️📝 - handwritten text lines inside tabular layout / forms** |
+ | `LINE_P` | 9.15% | **📝 - printed text lines inside tabular layout / forms** |
+ | `LINE_T` | 9.2% | **📝 - machine-typed text lines inside tabular layout / forms** |
+ | `PHOTO` | 9.05% | **🌄 - photos with text** |
+ | `PHOTO_L` | 9.1% | **🌄📝 - photos inside tabular layout / forms or with a tabular annotation** |
+ | `TEXT` | 9.14% | **📰 - mixed types of printed and handwritten texts** |
+ | `TEXT_HW` | 9.14% | **✏️📄 - only handwritten text** |
+ | `TEXT_P` | 9.07% | **📄 - only printed text** |
+ | `TEXT_T` | 9.05% | **📄 - only machine-typed text** |
+
+ Evaluation set (same proportions): **995** images for v1.0
+
+ Evaluation set (same proportions): **1194** images for v2.0
+
+
+ #### Data preprocessing
+
+ During training, the following transforms were applied, each with a 50% chance (one possible composition is sketched below the list):
+
+ * transforms.ColorJitter(brightness=0.5)
+ * transforms.ColorJitter(contrast=0.5)
+ * transforms.ColorJitter(saturation=0.5)
+ * transforms.ColorJitter(hue=0.5)
+ * transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
+ * transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
+
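A sketch of how that composition could be written with torchvision - RandomApply wraps each step so it fires independently with probability 0.5; treat this as an assumption about the training code, not a copy of it:

```python
# Hedged sketch: the color-centric augmentations above, each applied
# independently with a 50% chance, in the listed order.
import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```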
+ ### Training hyperparameters
+
* eval_strategy "epoch"
* save_strategy "epoch"
+ * learning_rate 5e-5
* per_device_train_batch_size 8
* per_device_eval_batch_size 8
+ * num_train_epochs 3
+ * warmup_ratio 0.1
+ * logging_steps 10
* load_best_model_at_end True
+ * metric_for_best_model "accuracy"
+
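The list above maps directly onto transformers.TrainingArguments; a minimal sketch in which output_dir is an assumed placeholder (older transformers versions spell eval_strategy as evaluation_strategy):

```python
# Hedged sketch: the listed hyperparameters as HF Trainer arguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vit-historical-page",  # assumed output path
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```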
+ ### Results 📊

+ **v1.0** Evaluation set's accuracy (**Top-3**): **99.6%**

+ ![TOP-3 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250209-1526_conf_mat.png?raw=true)

+ **v2.0** Evaluation set's accuracy (**Top-3**): **99.92%**

+ ![TOP-3 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250416-1158_conf_mat_TOP-3.png?raw=true)

+ **v1.0** Evaluation set's accuracy (**Top-1**): **97.3%**

+ ![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250218-1523_conf_mat.png?raw=true)

+ **v2.0** Evaluation set's accuracy (**Top-1**): **96.9%**

+ ![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250416-1153_conf_mat_TOP-1.png?raw=true)

#### Result tables

+ - **v1.0** Manually ✍ **checked** evaluation dataset results (TOP-5): [model_TOP-5_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250314-1602_model_1119_3_TOP-5_EVAL.csv) 🔗
+
+ - **v1.0** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250314-1606_model_1119_3_TOP-1_EVAL.csv) 🔗
+
+ - **v2.0** Manually ✍ **checked** evaluation dataset results (TOP-5): [model_TOP-5_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250416-1218_model_672_5_TOP-5_EVAL.csv) 🔗

+ - **v2.0** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250416-1148_model_672_5_TOP-1_EVAL.csv) 🔗

#### Table columns

- **SCORE-N** - score of the category 🏷️ for the TOP-N guess
- **TRUE** - actual label of the category 🏷️

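To sanity-check a downloaded result table, a hedged pandas sketch - it assumes a CLASS-1 column accompanies SCORE-1, since the column list above is only partially shown in this diff:

```python
# Hedged sketch: recompute TOP-1 accuracy from one of the linked CSVs,
# assuming CLASS-1 holds the best guess and TRUE the actual label.
import pandas as pd

df = pd.read_csv("model_TOP-1_EVAL.csv")  # any of the tables linked above
print(f"TOP-1 accuracy: {(df['CLASS-1'] == df['TRUE']).mean():.1%}")
```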
+ ### Contacts 📧

For support write to 📧 [email protected] 📧

Official repository: UFAL [^3]

+ ### Acknowledgements 🙏

- **Developed by** UFAL [^5] 👥
- **Funded by** ATRIUM [^4] 💰

**©️ 2022 UFAL & ATRIUM**

+ [^1]: https://huggingface.co/k4tel/vit-historical-page
[^2]: https://huggingface.co/google/vit-base-patch16-224
[^3]: https://github.com/ufal/atrium-page-classification
[^4]: https://atrium-research.eu/