k4tel committed · Commit 089118d · verified · 1 Parent(s): 89ba66d

Update README.md

Files changed (1): README.md (+81 -98)

README.md CHANGED
@@ -11,132 +11,115 @@ license: mit

# Image classification using fine-tuned ViT - for historical :bowtie: documents sorting

- ## Goal: sort archive page images for their further content-based processing

**Scope:** processing of images, training and evaluation of a ViT model,
input file/directory processing, output of class 🏷️ (category) results for the top
N predictions, summarizing of predictions into a tabular format,
- HF 😊 hub support for the model, and multiplatform (Win/Lin) data preparation scripts for PDF-to-PNG conversion

## Model description 📇

- 🔲 Fine-tuned model repository: **UFAL's vit-historical-page** [^1] 🔗

- 🔳 Base model repository: **Google's vit-base-patch16-224** [^2] 🔗

- The model was trained on a manually annotated dataset of historical documents, in particular images of pages
- from archival documents with paper sources that were scanned into digital form.

- The images contain various combinations of texts 📄, tables 📝, drawings 📈, and photos 🌄 -
- the categories 🏷️ described below were formed based on those archival documents.

- The key use case of the provided model and data processing pipeline is to classify an input PNG image, obtained from a scanned
- PDF paper source, into one of the categories - each of which triggers its own content-specific data processing pipeline.

- In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to
- mark the input data as machine-typed (old-style fonts) / handwritten ✏️ / plain printed 📄 text or text structured in a tabular 📝
- format, as well as to mark the presence of printed 🌄 or drawn 📈 graphic materials yet to be extracted from the page images.
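As an illustration of that routing step, here is a minimal classification sketch - assuming the transformers and torch libraries, the hub id from footnote [^1], and a hypothetical page.png input; this is not the repository's own pipeline code:

```python
# Hedged sketch: load the fine-tuned page classifier from the HF hub and
# print the top-3 category guesses for one scanned page image.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

REPO = "ufal/vit-historical-page"  # hub id per footnote [^1]

processor = AutoImageProcessor.from_pretrained(REPO)
model = AutoModelForImageClassification.from_pretrained(REPO)

image = Image.open("page.png").convert("RGB")  # hypothetical PNG page
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)

# Route the page to a content-specific pipeline based on these labels.
top3 = torch.topk(probs, k=3)
for score, idx in zip(top3.values[0], top3.indices[0]):
    print(model.config.id2label[idx.item()], round(score.item(), 3))
```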
 
### Data 📜

- Training set of the model: **8950** images

- Evaluation set (10% of all data, with the same proportions as below) [model_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250209-1534_model_1119_3_EVAL.csv) 📎: **995** images

- Manual ✍ annotation was performed beforehand and took some time ⌛; the categories 🏷️ were formed from
- different sources of archival documents dated 1920-2020.

- The disproportion of the categories 🏷️ is
- **NOT** intentional, but rather a result of the nature of the source data.

- In total, several hundred separate PDF files were selected and split into PNG pages (one possible way to do such a split is sketched below); some scanned documents
- were one page long and some were much longer (dozens or hundreds of pages).

- The specific content and language of the
- source data is irrelevant considering the model's vision resolution; however, all of the data samples were from **archaeological
- reports**, which may somehow affect the detection of drawings, since the commonly drawn objects are ceramic pieces,
- arrowheads, and rocks, first drawn by hand and later illustrated with digital tools.
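For the PDF-to-PNG split mentioned above, a sketch of one possible approach - assuming the pdf2image library (with poppler installed) and a hypothetical report.pdf input; the repository's own Win/Lin scripts may differ:

```python
# Hedged sketch: split a scanned PDF into numbered PNG pages.
from pathlib import Path
from pdf2image import convert_from_path

pdf = Path("report.pdf")  # hypothetical multi-page scanned source
for i, page in enumerate(convert_from_path(pdf, dpi=300), start=1):
    page.save(f"{pdf.stem}_{i:04d}.png", "PNG")  # one PNG per page
```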
 
### Categories 🏷️

- | Label | Ratio | Description |
- |------------:|:-------:|:------------------------------------------------------------------------------|
- | **DRAW** | 11.89% | **📈 - drawings, maps, paintings with text** |
- | **DRAW_L** | 8.17% | **📈📝 - drawings, etc. with a table legend or inside tabular layout / forms** |
- | **LINE_HW** | 5.99% | **✏️📝 - handwritten text lines inside tabular layout / forms** |
- | **LINE_P** | 6.06% | **📝 - printed text lines inside tabular layout / forms** |
- | **LINE_T** | 13.39% | **📝 - machine-typed text lines inside tabular layout / forms** |
- | **PHOTO** | 10.21% | **🌄 - photos with text** |
- | **PHOTO_L** | 7.86% | **🌄📝 - photos inside tabular layout / forms or with a tabular annotation** |
- | **TEXT** | 8.58% | **📰 - mixed types of printed and handwritten texts** |
- | **TEXT_HW** | 7.36% | **✏️📄 - only handwritten text** |
- | **TEXT_P** | 6.95% | **📄 - only printed text** |
- | **TEXT_T** | 13.53% | **📄 - only machine-typed text** |

- The categories were chosen to sort the pages by the following criteria:

- - **presence of graphical elements** (drawings 📈 OR photos 🌄)
- - **type of text** 📄 (handwritten ✏️ OR printed OR typed OR mixed 📰)
- - **presence of tabular layout / forms** 📝

- The reason for such a distinction is that different types of pages require different processing pipelines, which would be
- applied after the classification.

- ### Training

- During training, image transformations were applied sequentially, each with a 50% chance.

- <details>

- <summary>Image preprocessing steps 👀</summary>

- * transforms.ColorJitter(brightness=0.5)
- * transforms.ColorJitter(contrast=0.5)
- * transforms.ColorJitter(saturation=0.5)
- * transforms.ColorJitter(hue=0.5)
- * transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
- * transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))

- </details>

- > [!NOTE]
- > No rotation, reshaping, or flipping was applied to the images; mainly color manipulations were used. The
- > reasons behind this are pages containing specific form types, the general text orientation on the pages, and the default
- > reshaping of the model input to square 224x224 images.

- <details>

- <summary>Training hyperparameters 👀</summary>

* eval_strategy "epoch"
* save_strategy "epoch"
- * learning_rate **5e-5**
* per_device_train_batch_size 8
* per_device_eval_batch_size 8
- * num_train_epochs **3**
- * warmup_ratio **0.1**
- * logging_steps **10**
* load_best_model_at_end True
- * metric_for_best_model "accuracy"

- </details>

- ## Results 📊

- Evaluation set's accuracy (**Top-3**): **99.6%**

- ![TOP-3 confusion matrix - trained ViT](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/plots/20250209-1526_conf_mat.png?raw=true)

- Evaluation set's accuracy (**Top-1**): **97.3%**

- ![TOP-1 confusion matrix - trained ViT](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/plots/20250218-1523_conf_mat.png?raw=true)

 
135
  #### Result tables
136
 
137
- - Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/tables/20250209-1534_model_1119_3_TOP-3_EVAL.csv) πŸ”—
 
 
 
 
138
 
139
- - Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/K4TEL/ltp-ocr/blob/transformer/result/tables/20250218-1519_model_1119_3_TOP-1_EVAL.csv) πŸ”—
140
 
141
  #### Table columns
142
 
@@ -146,13 +129,13 @@ Evaluation set's accuracy (**Top-1**): **97.3%**
- **SCORE-N** - score of the category 🏷️ for the TOP-N guess
- **TRUE** - actual label of the category 🏷️

- ## Contacts 📧

For support write to 📧 [email protected] 📧

Official repository: UFAL [^3]

- ## Acknowledgements 🙏

- **Developed by** UFAL [^5] 👥
- **Funded by** ATRIUM [^4] 💰
@@ -161,7 +144,7 @@ Official repository: UFAL [^3]

**©️ 2022 UFAL & ATRIUM**

- [^1]: https://huggingface.co/ufal/vit-historical-page
[^2]: https://huggingface.co/google/vit-base-patch16-224
[^3]: https://github.com/ufal/atrium-page-classification
[^4]: https://atrium-research.eu/


# Image classification using fine-tuned ViT - for historical :bowtie: documents sorting

+ ### Goal: sort archive page images for their further content-based processing

**Scope:** processing of images, training and evaluation of a ViT model,
input file/directory processing, output of class 🏷️ (category) results for the top
N predictions, summarizing of predictions into a tabular format,
+ HF 😊 hub support for the model

## Model description 📇

+ 🔲 Fine-tuned model repository: vit-historical-page [^1] 🔗

+ 🔳 Base model repository: Google's vit-base-patch16-224 [^2] 🔗

 
### Data 📜

+ Training set of the model: **8950** images for v1.0

+ Training set of the model: **10745** images for v2.0

### Categories 🏷️

+ **v1.0 version Categories 🪧**:
+
+ | Label | Ratio | Description |
+ |----------:|:------:|:------------------------------------------------------------------------------|
+ | `DRAW` | 11.89% | **📈 - drawings, maps, paintings with text** |
+ | `DRAW_L` | 8.17% | **📈📝 - drawings, etc. with a table legend or inside tabular layout / forms** |
+ | `LINE_HW` | 5.99% | **✏️📝 - handwritten text lines inside tabular layout / forms** |
+ | `LINE_P` | 6.06% | **📝 - printed text lines inside tabular layout / forms** |
+ | `LINE_T` | 13.39% | **📝 - machine-typed text lines inside tabular layout / forms** |
+ | `PHOTO` | 10.21% | **🌄 - photos with text** |
+ | `PHOTO_L` | 7.86% | **🌄📝 - photos inside tabular layout / forms or with a tabular annotation** |
+ | `TEXT` | 8.58% | **📰 - mixed types of printed and handwritten texts** |
+ | `TEXT_HW` | 7.36% | **✏️📄 - only handwritten text** |
+ | `TEXT_P` | 6.95% | **📄 - only printed text** |
+ | `TEXT_T` | 13.53% | **📄 - only machine-typed text** |
+
+ **v2.0 version Categories 🪧**:
+
+ | Label | Ratio | Description |
+ |----------:|:-----:|:------------------------------------------------------------------------------|
+ | `DRAW` | 9.12% | **📈 - drawings, maps, paintings with text** |
+ | `DRAW_L` | 9.14% | **📈📝 - drawings, etc. with a table legend or inside tabular layout / forms** |
+ | `LINE_HW` | 8.84% | **✏️📝 - handwritten text lines inside tabular layout / forms** |
+ | `LINE_P` | 9.15% | **📝 - printed text lines inside tabular layout / forms** |
+ | `LINE_T` | 9.2% | **📝 - machine-typed text lines inside tabular layout / forms** |
+ | `PHOTO` | 9.05% | **🌄 - photos with text** |
+ | `PHOTO_L` | 9.1% | **🌄📝 - photos inside tabular layout / forms or with a tabular annotation** |
+ | `TEXT` | 9.14% | **📰 - mixed types of printed and handwritten texts** |
+ | `TEXT_HW` | 9.14% | **✏️📄 - only handwritten text** |
+ | `TEXT_P` | 9.07% | **📄 - only printed text** |
+ | `TEXT_T` | 9.05% | **📄 - only machine-typed text** |
+
+ Evaluation set (same proportions): **995** images for v1.0
+
+ Evaluation set (same proportions): **1194** images for v2.0
+
+
+ #### Data preprocessing
+
+ During training, the following transforms were applied, each with a 50% chance (one possible composition is sketched below the list):
+
+ * transforms.ColorJitter(brightness=0.5)
+ * transforms.ColorJitter(contrast=0.5)
+ * transforms.ColorJitter(saturation=0.5)
+ * transforms.ColorJitter(hue=0.5)
+ * transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
+ * transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
+
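A sketch of how that composition could be written with torchvision - RandomApply wraps each step so it fires independently with probability 0.5; treat this as an assumption about the training code, not a copy of it:

```python
# Hedged sketch: the color-centric augmentations above, each applied
# independently with a 50% chance, in the listed order.
import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```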
+ ### Training hyperparameters
+
* eval_strategy "epoch"
* save_strategy "epoch"
+ * learning_rate 5e-5
* per_device_train_batch_size 8
* per_device_eval_batch_size 8
+ * num_train_epochs 3
+ * warmup_ratio 0.1
+ * logging_steps 10
* load_best_model_at_end True
+ * metric_for_best_model "accuracy"
+
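The list above maps directly onto transformers.TrainingArguments; a minimal sketch in which output_dir is an assumed placeholder (older transformers versions spell eval_strategy as evaluation_strategy):

```python
# Hedged sketch: the listed hyperparameters as HF Trainer arguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vit-historical-page",  # assumed output path
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```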
+ ### Results 📊

+ **v1.0** Evaluation set's accuracy (**Top-3**): **99.6%**

+ ![TOP-3 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250209-1526_conf_mat.png?raw=true)

+ **v2.0** Evaluation set's accuracy (**Top-3**): **99.92%**

+ ![TOP-3 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250416-1158_conf_mat_TOP-3.png?raw=true)

+ **v1.0** Evaluation set's accuracy (**Top-1**): **97.3%**

+ ![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250218-1523_conf_mat.png?raw=true)

+ **v2.0** Evaluation set's accuracy (**Top-1**): **96.9%**

+ ![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250416-1153_conf_mat_TOP-1.png?raw=true)

#### Result tables

+ - **v1.0** Manually ✍ **checked** evaluation dataset results (TOP-5): [model_TOP-5_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250314-1602_model_1119_3_TOP-5_EVAL.csv) 🔗
+
+ - **v1.0** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250314-1606_model_1119_3_TOP-1_EVAL.csv) 🔗
+
+ - **v2.0** Manually ✍ **checked** evaluation dataset results (TOP-5): [model_TOP-5_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250416-1218_model_672_5_TOP-5_EVAL.csv) 🔗

+ - **v2.0** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250416-1148_model_672_5_TOP-1_EVAL.csv) 🔗

#### Table columns

- **SCORE-N** - score of the category 🏷️ for the TOP-N guess
- **TRUE** - actual label of the category 🏷️

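To sanity-check a downloaded result table, a hedged pandas sketch - it assumes a CLASS-1 column accompanies SCORE-1, since the column list above is only partially shown in this diff:

```python
# Hedged sketch: recompute TOP-1 accuracy from one of the linked CSVs,
# assuming CLASS-1 holds the best guess and TRUE the actual label.
import pandas as pd

df = pd.read_csv("model_TOP-1_EVAL.csv")  # any of the tables linked above
print(f"TOP-1 accuracy: {(df['CLASS-1'] == df['TRUE']).mean():.1%}")
```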
+ ### Contacts 📧

For support write to 📧 [email protected] 📧

Official repository: UFAL [^3]

+ ### Acknowledgements 🙏

- **Developed by** UFAL [^5] 👥
- **Funded by** ATRIUM [^4] 💰

**©️ 2022 UFAL & ATRIUM**

+ [^1]: https://huggingface.co/k4tel/vit-historical-page
[^2]: https://huggingface.co/google/vit-base-patch16-224
[^3]: https://github.com/ufal/atrium-page-classification
[^4]: https://atrium-research.eu/