ufal / k4tel committed on Commit 6b82e4e · verified · 1 Parent(s): 4eae04a

initial commit

Files changed (4)
  1. README.md +116 -3
  2. config.json +50 -0
  3. model.safetensors +3 -0
  4. preprocessor_config.json +23 -0
README.md CHANGED
@@ -1,3 +1,116 @@
- ---
- license: mit
- ---
+ ---
+ library_name: transformers
+ tags:
+ - page
+ - classification
+ base_model:
+ - google/vit-base-patch16-224
+ pipeline_tag: image-classification
+ license: mit
+ ---
+
+ # Image classification using fine-tuned ViT - for historical :bowtie: document sorting
+
+ ### Goal: sort archive page images for further content-based processing
+
+ **Scope:** image processing, ViT model training and evaluation, input
+ file/directory processing, output of the top-N predicted classes 🏷️
+ (categories), summarizing predictions into a tabular format, and HF 😊
+ hub support for the model
+
+ ## Model description 📇
+
+ 🔲 Fine-tuned model repository: vit-historical-page [^1] 🔗
+
+ 🔳 Base model repository: Google's vit-base-patch16-224 [^2] 🔗
+
+ ### Data 📜
+
+ Training set of the model: **8950** images
+
+ ### Categories 🏷️
+
+ | Label | Ratio | Description |
+ |------------:|:-------:|:-----------------------------------------------------------------------------|
+ | **DRAW** | 11.89% | **📈 - drawings, maps, paintings with text** |
+ | **DRAW_L** | 8.17% | **📈📏 - drawings ... with a table legend or inside tabular layout / forms** |
+ | **LINE_HW** | 5.99% | **✏️📏 - handwritten text lines inside tabular layout / forms** |
+ | **LINE_P** | 6.06% | **📏 - printed text lines inside tabular layout / forms** |
+ | **LINE_T** | 13.39% | **📏 - machine-typed text lines inside tabular layout / forms** |
+ | **PHOTO** | 10.21% | **🌄 - photos with text** |
+ | **PHOTO_L** | 7.86% | **🌄📏 - photos inside tabular layout / forms or with a tabular annotation** |
+ | **TEXT** | 8.58% | **📰 - mixed types of printed and handwritten texts** |
+ | **TEXT_HW** | 7.36% | **✏️📄 - only handwritten text** |
+ | **TEXT_P** | 6.95% | **📄 - only printed text** |
+ | **TEXT_T** | 13.53% | **📄 - only machine-typed text** |
+
+ Evaluation set (same proportions): **995** images
+
+ #### Data preprocessing
+
+ During training, each of the following transforms was applied independently with a 50% chance (see the sketch after this list):
+
+ * transforms.ColorJitter(brightness=0.5)
+ * transforms.ColorJitter(contrast=0.5)
+ * transforms.ColorJitter(saturation=0.5)
+ * transforms.ColorJitter(hue=0.5)
+ * transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
+ * transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
+
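+ A minimal sketch of how these augmentations could be composed with torchvision's `RandomApply` (wrapping each transform at p=0.5 is an assumption; the card lists only the transforms themselves):
+
+ ```python
+ import random
+
+ from PIL import ImageEnhance, ImageFilter
+ from torchvision import transforms
+
+ # Each augmentation fires independently with probability 0.5.
+ train_augmentations = transforms.Compose([
+     transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
+     transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
+     transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
+     transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
+     transforms.RandomApply([transforms.Lambda(
+         lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
+     transforms.RandomApply([transforms.Lambda(
+         lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
+ ])
+ ```
+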
+ ### Training Hyperparameters
+
+ * eval_strategy "epoch"
+ * save_strategy "epoch"
+ * learning_rate 5e-5
+ * per_device_train_batch_size 8
+ * per_device_eval_batch_size 8
+ * num_train_epochs 3
+ * warmup_ratio 0.1
+ * logging_steps 10
+ * load_best_model_at_end True
+ * metric_for_best_model "accuracy"
+
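+ These map directly onto `transformers.TrainingArguments`; a minimal sketch (the output_dir is a placeholder, and the model/dataset wiring is omitted):
+
+ ```python
+ from transformers import TrainingArguments
+
+ # Hyperparameters exactly as listed above.
+ training_args = TrainingArguments(
+     output_dir="vit-historical-page",  # hypothetical output path
+     eval_strategy="epoch",
+     save_strategy="epoch",
+     learning_rate=5e-5,
+     per_device_train_batch_size=8,
+     per_device_eval_batch_size=8,
+     num_train_epochs=3,
+     warmup_ratio=0.1,
+     logging_steps=10,
+     load_best_model_at_end=True,
+     metric_for_best_model="accuracy",
+ )
+ ```
+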
+ ### Results 📊
+
+ Accuracy on the evaluation set (**Top-3**): **99.6%**
+
+ ![TOP-3 confusion matrix - trained ViT](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/plots/20250209-1526_conf_mat.png?raw=true)
+
+ Accuracy on the evaluation set (**Top-1**): **97.3%**
+
+ ![TOP-1 confusion matrix - trained ViT](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/plots/20250218-1523_conf_mat.png?raw=true)
+
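+ Top-N here is standard top-k accuracy: a page counts as correct if its true label appears among the k highest-scoring classes. A minimal sketch of the metric (not the project's own evaluation code):
+
+ ```python
+ import torch
+
+ def top_k_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int) -> float:
+     top_k = logits.topk(k, dim=-1).indices              # (N, k) best class ids
+     hits = (top_k == labels.unsqueeze(-1)).any(dim=-1)  # true label in top-k?
+     return hits.float().mean().item()
+ ```
+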
+ #### Result tables
+
+ - Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/tables/20250209-1534_model_1119_3_TOP-3_EVAL.csv) 🔗
+
+ - Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/tables/20250218-1519_model_1119_3_TOP-1_EVAL.csv) 🔗
+
+ #### Table columns
+
+ - **FILE** - name of the input file
+ - **PAGE** - page number
+ - **CLASS-N** - category 🏷️ label of the TOP-N guess
+ - **SCORE-N** - confidence score of the TOP-N guess
+ - **TRUE** - actual category 🏷️ label
+
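+ A minimal inference sketch that produces such top-N predictions via the HF pipeline (the input filename is hypothetical):
+
+ ```python
+ from transformers import pipeline
+
+ # Note: config.json ships generic LABEL_0..LABEL_10 ids, so mapping them to
+ # the category names above has to be done separately (see under config.json).
+ classifier = pipeline("image-classification", model="k4tel/vit-historical-page")
+ for pred in classifier("page.png", top_k=3):  # hypothetical input image
+     print(f"{pred['label']}: {pred['score']:.3f}")
+ ```
+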
+ ### Contacts 📧
+
+ For support, write to 📧 [email protected] 📧
+
+ Official repository: UFAL [^3]
+
+ ### Acknowledgements 🙏
+
+ - **Developed by:** UFAL [^5] 👥
+ - **Funded by:** ATRIUM [^4] 💰
+ - **Shared by:** ATRIUM [^4] & UFAL [^5]
+ - **Model type:** fine-tuned ViT [^2] with 224x224 input resolution
+
+ **©️ 2022 UFAL & ATRIUM**
+
+ [^1]: https://huggingface.co/k4tel/vit-historical-page
+ [^2]: https://huggingface.co/google/vit-base-patch16-224
+ [^3]: https://github.com/ufal/atrium-page-classification
+ [^4]: https://atrium-research.eu/
+ [^5]: https://ufal.mff.cuni.cz/home-page
config.json ADDED
@@ -0,0 +1,50 @@
+ {
+   "_name_or_path": "k4tel/vit-historical-page",
+   "architectures": [
+     "ViTForImageClassification"
+   ],
+   "attention_probs_dropout_prob": 0.0,
+   "encoder_stride": 16,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.0,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2",
+     "3": "LABEL_3",
+     "4": "LABEL_4",
+     "5": "LABEL_5",
+     "6": "LABEL_6",
+     "7": "LABEL_7",
+     "8": "LABEL_8",
+     "9": "LABEL_9",
+     "10": "LABEL_10"
+   },
+   "image_size": 224,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_10": 10,
+     "LABEL_2": 2,
+     "LABEL_3": 3,
+     "LABEL_4": 4,
+     "LABEL_5": 5,
+     "LABEL_6": 6,
+     "LABEL_7": 7,
+     "LABEL_8": 8,
+     "LABEL_9": 9
+   },
+   "layer_norm_eps": 1e-12,
+   "model_type": "vit",
+   "num_attention_heads": 12,
+   "num_channels": 3,
+   "num_hidden_layers": 12,
+   "patch_size": 16,
+   "problem_type": "multi_label_classification",
+   "qkv_bias": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.3"
+ }
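
The config ships generic `LABEL_0`..`LABEL_10` ids rather than the category names. Assuming the label order matches the alphabetically sorted category table in the README (an assumption, not stated in the card), the mapping could be overridden at load time:

```python
from transformers import ViTForImageClassification

# Assumed order: alphabetical, matching the category table in the README.
categories = ["DRAW", "DRAW_L", "LINE_HW", "LINE_P", "LINE_T", "PHOTO",
              "PHOTO_L", "TEXT", "TEXT_HW", "TEXT_P", "TEXT_T"]
model = ViTForImageClassification.from_pretrained(
    "k4tel/vit-historical-page",
    id2label={i: c for i, c in enumerate(categories)},
    label2id={c: i for i, c in enumerate(categories)},
)
```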
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:83902150a34254d747f405f8472cae4ade65c356e9cae0d1f9caa72554175689
+ size 343251660
preprocessor_config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "do_convert_rgb": null,
+   "do_normalize": true,
+   "do_rescale": true,
+   "do_resize": true,
+   "image_mean": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "image_processor_type": "ViTImageProcessor",
+   "image_std": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "resample": 2,
+   "rescale_factor": 0.00392156862745098,
+   "size": {
+     "height": 224,
+     "width": 224
+   }
+ }
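
This preprocessing (resize to 224x224 with bilinear resampling, `resample: 2`; rescale by 1/255; normalize with mean = std = 0.5 per channel) is applied automatically when the processor is loaded from the hub. A minimal sketch with a hypothetical input file:

```python
from PIL import Image
from transformers import ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("k4tel/vit-historical-page")
inputs = processor(images=Image.open("page.png"), return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])
```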