ufal
k4tel committed · verified · Commit 6fbc142 · 1 parent: 7214f6e

Update README.md

Files changed (1): README.md (+86 -40)

README.md CHANGED

@@ -16,61 +16,107 @@ license: mit
  **Scope:** Processing of images, training and evaluation of the ViT model,
  input file/directory processing, class 🏷️ (category) results of top
  N predictions output, predictions summarizing into a tabular format,
- HF 😊 hub support for the model
+ HF 😊 hub support for the model, multiplatform (Win/Lin) data preparation scripts for PDF-to-PNG conversion

  ## Model description 📇

- 🔲 Fine-tuned model repository: vit-historical-page [^1] 🔗
+ 🔲 Fine-tuned model repository: **UFAL's vit-historical-page** [^1] 🔗

- 🔳 Base model repository: google's vit-base-patch16-224 [^2] 🔗
+ 🔳 Base model repository: **Google's vit-base-patch16-224** [^2] 🔗
+
+ The model was trained on a manually annotated dataset of historical documents, specifically images of pages
+ from archival documents with paper sources that were scanned into digital form. The images contain various
+ combinations of text 📄, tables 📏, drawings 📈, and photos 🌄; the categories 🏷️ described below were formed
+ based on those archival documents.
+
+ The key use case of the provided model and data processing pipeline is to classify an input PNG image from a
+ scanned paper PDF source into one of the categories, each of which feeds a different content-specific processing
+ pipeline. In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier
+ first to mark the input data as machine-typed (old-style fonts) / handwritten ✏️ / plain printed 📄 text, or as
+ text structured in a tabular 📏 format, and to mark the presence of printed 🌄 or drawn 📈 graphic materials yet
+ to be extracted from the page images.
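
As a quick illustration of this use case, here is a minimal inference sketch using the Hugging Face transformers API. The checkpoint id `ufal/vit-historical-page` and the file name `page.png` are assumptions for the example; see the repository links above for the actual model.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

MODEL_ID = "ufal/vit-historical-page"  # assumed hub id; see the fine-tuned model repository [^1]

processor = AutoImageProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageClassification.from_pretrained(MODEL_ID)

# Classify one PNG page extracted from a scanned PDF
image = Image.open("page.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Top-3 category 🏷️ guesses with confidence scores, e.g. for routing to an OCR pipeline
probs = logits.softmax(dim=-1)[0]
scores, ids = probs.topk(3)
for score, idx in zip(scores, ids):
    print(f"{model.config.id2label[idx.item()]}: {score.item():.3f}")
```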

  ### Data 📜

  Training set of the model: **8950** images

- Evaluation set (same proportions): **995** images
+ Evaluation set (10% of the whole set, with the same proportions as below) [model_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250209-1534_model_1119_3_EVAL.csv) 📎: **995** images
+
+ Manual ✍ annotation was performed beforehand and took some time ⌛; the categories 🏷️ were formed from
+ different sources of archival documents dating from 1920 to 2020. The disproportion of the categories 🏷️ is
+ **NOT** intentional but rather a result of the nature of the source data.
+
+ In total, several hundred separate PDF files were selected and split into PNG pages; some scanned documents
+ were one page long, while others were much longer (dozens or even hundreds of pages). The specific content and
+ language of the source data are irrelevant given the model's vision resolution; however, all of the data samples
+ came from **archaeological reports**, which may somewhat affect drawings detection, since the commonly drawn
+ objects are ceramic pieces, arrowheads, and rocks, first drawn by hand and later illustrated with digital tools.
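
The repository ships its own multiplatform (Win/Lin) scripts for this preparation step. Purely as an illustration of what the PDF-to-PNG splitting involves, here is a sketch using the third-party pdf2image library (which requires poppler); the function name and the 300 DPI value are arbitrary choices, not the project's defaults.

```python
from pathlib import Path

from pdf2image import convert_from_path  # third-party; needs poppler installed

def pdf_to_pngs(pdf_path: str, out_dir: str, dpi: int = 300) -> list[Path]:
    """Split one scanned PDF into per-page PNG images."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        path = out / f"{Path(pdf_path).stem}_{i:04d}.png"
        page.save(path, "PNG")  # one PNG per page, numbered in reading order
        paths.append(path)
    return paths
```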

  ### Categories 🏷️

  | Label       | Ratio  | Description                                                                     |
  |------------:|:------:|:--------------------------------------------------------------------------------|
  | **DRAW**    | 11.89% | **📈 - drawings, maps, paintings with text**                                    |
- | **DRAW_L**  | 8.17%  | **📈📏 - drawings ... with a table legend or inside tabular layout / forms**    |
+ | **DRAW_L**  | 8.17%  | **📈📏 - drawings, etc. with a table legend or inside tabular layout / forms**  |
  | **LINE_HW** | 5.99%  | **✏️📏 - handwritten text lines inside tabular layout / forms**                 |
  | **LINE_P**  | 6.06%  | **📏 - printed text lines inside tabular layout / forms**                       |
  | **LINE_T**  | 13.39% | **📏 - machine-typed text lines inside tabular layout / forms**                 |
  | **PHOTO**   | 10.21% | **🌄 - photos with text**                                                       |
  | **PHOTO_L** | 7.86%  | **🌄📏 - photos inside tabular layout / forms or with a tabular annotation**    |
  | **TEXT**    | 8.58%  | **📰 - mixed types of printed and handwritten texts**                           |
  | **TEXT_HW** | 7.36%  | **✏️📄 - only handwritten text**                                                |
  | **TEXT_P**  | 6.95%  | **📄 - only printed text**                                                      |
  | **TEXT_T**  | 13.53% | **📄 - only machine-typed text**                                                |

+ The categories were chosen to sort the pages by the following criteria:
+
+ - **presence of graphical elements** (drawings 📈 OR photos 🌄)
+ - **type of text** 📄 (handwritten ✏️ OR printed OR typed OR mixed 📰)
+ - **presence of tabular layout / forms** 📏
+
+ The reason for this distinction is that a different processing pipeline is applied to each type of page after
+ classification, as sketched below.
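
To make that routing idea concrete, here is a hypothetical mapping from predicted category 🏷️ to a downstream handler; every handler name below is an invented placeholder, not part of this project.

```python
# Hypothetical label-to-pipeline routing table; handler names are placeholders.
ROUTES = {
    "TEXT_HW": "handwriting_ocr",    # ✏️ handwritten text
    "TEXT_P": "printed_text_ocr",    # 📄 printed text
    "TEXT_T": "typewriter_ocr",      # 📄 machine-typed text (old-style fonts)
    "TEXT": "mixed_text_ocr",        # 📰 mixed printed and handwritten
    "LINE_HW": "table_handwriting",  # 📏 handwritten lines in tabular layout
    "LINE_P": "table_printed",
    "LINE_T": "table_typed",
    "DRAW": "figure_extraction",     # 📈 drawings, maps, paintings
    "DRAW_L": "figure_extraction",
    "PHOTO": "photo_extraction",     # 🌄 photos
    "PHOTO_L": "photo_extraction",
}

def route(predicted_label: str) -> str:
    """Pick the downstream pipeline for a classified page."""
    return ROUTES[predicted_label]
```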

- #### Data preprocessing
+ ### Training

- During training the following transforms were applied randomly with a 50% chance:
+ During training, the following image transformations were applied sequentially, each with a 50% chance (a
+ possible composition is sketched after the list).
+
+ <details>
+
+ <summary>Image preprocessing steps 👀</summary>
+
+ * transforms.ColorJitter(**brightness** 0.5)
+ * transforms.ColorJitter(**contrast** 0.5)
+ * transforms.ColorJitter(**saturation** 0.5)
+ * transforms.ColorJitter(**hue** 0.5)
+ * transforms.Lambda(lambda img: ImageEnhance.**Sharpness**(img).enhance(random.uniform(0.5, 1.5)))
+ * transforms.Lambda(lambda img: img.filter(ImageFilter.**GaussianBlur**(radius=random.uniform(0, 2))))
+
+ </details>
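
One plausible way to compose this chain is torchvision's RandomApply, which fires each wrapped transform with the stated 50% probability. The composition below is a sketch; only the individual operations are taken from the list above.

```python
import random

from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

def with_p(t, p=0.5):
    """Wrap a transform so it is applied with probability p (50% here)."""
    return transforms.RandomApply([t], p=p)

# Sequential chain of the listed augmentations, each firing independently
train_transforms = transforms.Compose([
    with_p(transforms.ColorJitter(brightness=0.5)),
    with_p(transforms.ColorJitter(contrast=0.5)),
    with_p(transforms.ColorJitter(saturation=0.5)),
    with_p(transforms.ColorJitter(hue=0.5)),
    with_p(transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))),
    with_p(transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))),
])
```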

+ > [!NOTE]
+ > No rotation, reshaping, or flipping was applied to the images; mainly color manipulations were used. The
+ > reasons behind this are pages containing specific form types, the general text orientation on the pages, and
+ > the default reshaping of the model input to square 224x224 resolution images.

- ### Training Hyperparameters
+ <details>
+
+ <summary>Training hyperparameters 👀</summary>

  * eval_strategy "epoch"
  * save_strategy "epoch"
+ * learning_rate **5e-5**
  * per_device_train_batch_size 8
  * per_device_eval_batch_size 8
+ * num_train_epochs **3**
+ * warmup_ratio **0.1**
+ * logging_steps **10**
  * load_best_model_at_end True
+ * metric_for_best_model "accuracy"
+
+ </details>
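
For reference, the same settings expressed as Hugging Face TrainingArguments; a sketch in which only output_dir is invented, while the remaining values mirror the list above.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vit-historical-page-checkpoints",  # placeholder path
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```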

- ### Results 📊
+ ## Results 📊

  Evaluation set's accuracy (**Top-3**): **99.6%**
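
Top-N accuracy here means that the true label appears among the model's N highest-scoring guesses; a minimal sketch of that metric:

```python
import torch

def top_n_accuracy(logits: torch.Tensor, labels: torch.Tensor, n: int = 3) -> float:
    """Fraction of pages whose true category is among the n best-scored guesses."""
    top_n = logits.topk(n, dim=-1).indices              # (batch, n) predicted class ids
    hits = (top_n == labels.unsqueeze(-1)).any(dim=-1)  # is the true label among them?
    return hits.float().mean().item()
```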
 
@@ -94,13 +140,13 @@ Evaluation set's accuracy (**Top-1**): **97.3%**

  - **SCORE-N** - score of the TOP-N category 🏷️ guess
  - **TRUE** - actual label of the category 🏷️

- ### Contacts 📧
+ ## Contacts 📧

  For support write to 📧 [email protected] 📧

  Official repository: UFAL [^3]

- ### Acknowledgements 🙏
+ ## Acknowledgements 🙏

  - **Developed by** UFAL [^5] 👥
  - **Funded by** ATRIUM [^4] 💰