Update README.md
README.md
CHANGED
@@ -16,61 +16,107 @@ license: mit
**Scope:** Processing of images, training and evaluation of the ViT model,
input file/directory processing, class 🏷️ (category) results of top
N predictions output, predictions summarizing into a tabular format,
HF 🔗 hub support for the model, multiplatform (Win/Lin) data preparation scripts for PDF to PNG conversion

## Model description 📇

🔲 Fine-tuned model repository: **UFAL's vit-historical-page** [^1] 🔗

🔳 Base model repository: **Google's vit-base-patch16-224** [^2] 🔗

The model was trained on a manually annotated dataset of historical documents, in particular images of pages
from archival documents whose paper sources were scanned into digital form. The images contain various
combinations of text 📄, tables 📁, drawings 📈, and photos 🌄 - the categories 🏷️ described below were formed
based on those archival documents.

The key use case of the provided model and data processing pipeline is to classify an input PNG image, obtained from a
scanned paper PDF, into one of the categories, each of which triggers its own content-specific data processing pipeline.
In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the
input data as machine-typed (old-style fonts), handwritten ✏️, or plain printed 📄 text, or as text structured in a
tabular 📀 format, as well as to mark the presence of printed 🌄 or drawn 📈 graphic materials yet to be extracted from
the page images.
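
A minimal sketch of that first classification step is shown below. The hub model id `ufal/vit-historical-page` and the
top-3 output are assumptions inferred from the fine-tuned repository name above and the result tables described later,
not a verbatim excerpt of the project's code:

```python
# Minimal sketch: classify one PNG page and print the top-3 categories.
# "ufal/vit-historical-page" is an assumed hub id, not confirmed here.
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

model_id = "ufal/vit-historical-page"  # assumption based on [^1]
processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id).eval()

image = Image.open("page.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

# Top-N guesses with scores, N=3 here
top = probs.topk(3)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {score:.3f}")
```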

### Data 📜

Training set of the model: **8950** images

Evaluation set (10% of all data, with the same category proportions as below) [model_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250209-1534_model_1119_3_EVAL.csv) 📏: **995** images

Manual ✍ annotation was performed beforehand and took some time ⌛; the categories 🏷️ were formed from
different sources of archival documents dating from 1920 to 2020. The disproportion of the categories 🏷️ is
**NOT** intentional, but rather a result of the nature of the source data.

In total, several hundred separate PDF files were selected and split into PNG pages; some scanned documents
were one page long and some were much longer (dozens or hundreds of pages). The specific content and language of the
source data is irrelevant given the model's vision resolution; however, all of the data samples come from **archaeological
reports**, which may somewhat affect drawing detection, since the commonly depicted objects are ceramic pieces,
arrowheads, and rocks, first drawn by hand and later illustrated with digital tools.
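
The repository's own multiplatform preparation scripts are not shown in this excerpt; as a rough sketch of the
PDF-to-PNG step they cover, something like the following works on both Windows and Linux (assuming the `pdf2image`
package with a Poppler backend; paths and DPI are illustrative):

```python
# Hypothetical helper: split one PDF into per-page PNG files.
# pdf2image + Poppler assumed; the project's real scripts may differ.
from pathlib import Path
from pdf2image import convert_from_path

def pdf_to_pngs(pdf_path: str, out_dir: str, dpi: int = 150) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # One PIL image per page of the source document
    for i, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        page.save(out / f"{Path(pdf_path).stem}_{i:04d}.png", "PNG")

pdf_to_pngs("report.pdf", "pages")
```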

### Categories 🏷️

| Label       | Ratio  | Description                                                                     |
|------------:|:------:|:--------------------------------------------------------------------------------|
| **DRAW**    | 11.89% | **📈 - drawings, maps, paintings with text**                                    |
| **DRAW_L**  | 8.17%  | **📈📏 - drawings, etc. with a table legend or inside tabular layout / forms**  |
| **LINE_HW** | 5.99%  | **✏️📏 - handwritten text lines inside tabular layout / forms**                 |
| **LINE_P**  | 6.06%  | **📏 - printed text lines inside tabular layout / forms**                       |
| **LINE_T**  | 13.39% | **📏 - machine-typed text lines inside tabular layout / forms**                 |
| **PHOTO**   | 10.21% | **🌄 - photos with text**                                                       |
| **PHOTO_L** | 7.86%  | **🌄📏 - photos inside tabular layout / forms or with a tabular annotation**    |
| **TEXT**    | 8.58%  | **📰 - mixed types of printed and handwritten texts**                           |
| **TEXT_HW** | 7.36%  | **✏️📄 - only handwritten text**                                                |
| **TEXT_P**  | 6.95%  | **📄 - only printed text**                                                      |
| **TEXT_T**  | 13.53% | **📄 - only machine-typed text**                                                |

The categories were chosen to sort the pages by the following criteria:

- **presence of graphical elements** (drawings 📈 OR photos 🌄)
- **type of text** 📄 (handwritten ✏️️ OR printed OR typed OR mixed 📰)
- **presence of tabular layout / forms** 📀

The reason for this distinction is that different processing pipelines are applied to the different types of pages
after classification.
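
For downstream code, the table above implies an 11-way label set. A minimal sketch of the mapping (the id order here is
alphabetical by assumption; the released model's config may order the labels differently):

```python
# Category labels from the table above; the id order is an assumption,
# not taken from the released model configuration.
LABELS = ["DRAW", "DRAW_L", "LINE_HW", "LINE_P", "LINE_T",
          "PHOTO", "PHOTO_L", "TEXT", "TEXT_HW", "TEXT_P", "TEXT_T"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}
```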

### Training

During training, image transformations were applied sequentially, each with a 50% chance.

<details>

<summary>Image preprocessing steps 👀</summary>

* transforms.ColorJitter(**brightness** 0.5)
* transforms.ColorJitter(**contrast** 0.5)
* transforms.ColorJitter(**saturation** 0.5)
* transforms.ColorJitter(**hue** 0.5)
* transforms.Lambda(lambda img: ImageEnhance.**Sharpness**(img).enhance(random.uniform(0.5, 1.5)))
* transforms.Lambda(lambda img: img.filter(ImageFilter.**GaussianBlur**(radius=random.uniform(0, 2))))

</details>

> [!NOTE]
> No rotation, reshaping, or flipping was applied to the images; mainly color manipulations were used. The
> reasons behind this are pages containing specific form types, the general text orientation on the pages, and the
> default reshaping of the model input to square 224x224 resolution images.
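
Composed as a torchvision pipeline, the steps above might look like the sketch below; wrapping each step in
`RandomApply` is one reading of "applied sequentially, each with a 50% chance", not a verbatim excerpt of the
training code:

```python
# Sketch of the augmentation list above: each transform is applied
# independently with probability 0.5, in the listed order (PIL images).
import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```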

<details>

<summary>Training hyperparameters 👀</summary>

* eval_strategy "epoch"
* save_strategy "epoch"
* learning_rate **5e-5**
* per_device_train_batch_size 8
* per_device_eval_batch_size 8
* num_train_epochs **3**
* warmup_ratio **0.1**
* logging_steps **10**
* load_best_model_at_end True
* metric_for_best_model "accuracy"

</details>
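
Expressed as Hugging Face `TrainingArguments`, the settings above would look roughly like this (a sketch; `output_dir`
is illustrative and everything not listed is left at its default):

```python
# The listed hyperparameters as TrainingArguments; output_dir is
# illustrative, all other arguments keep their defaults.
# (Older transformers versions name eval_strategy "evaluation_strategy".)
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vit-historical-page",  # illustrative
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```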

## Results 📈

Evaluation set's accuracy (**Top-3**): **99.6%**

@@ -94,13 +140,13 @@ Evaluation set's accuracy (**Top-1**): **97.3%**
- **SCORE-N** - score of the category 🏷️, guess TOP-N
- **TRUE** - actual label of the category 🏷️

## Contacts 📧

For support write to 📧 [email protected] 📧

Official repository: UFAL [^3]

## Acknowledgements 🙏

- **Developed by** UFAL [^5] 👥
- **Funded by** ATRIUM [^4] 💰