Update README.md
README.md
CHANGED
@@ -16,61 +16,107 @@ license: mit
**Scope:** Processing of images, training and evaluation of the ViT model,
input file/directory processing, class 🏷️ (category) results of top
N predictions output, predictions summarizing into a tabular format,
HF 🔗 hub support for the model, multiplatform (Win/Lin) data preparation scripts for PDF to PNG conversion

## Model description 📇

🔲 Fine-tuned model repository: **UFAL's vit-historical-page** [^1] 🔗

🔳 Base model repository: **Google's vit-base-patch16-224** [^2] 🔗

The model was trained on a manually annotated dataset of historical documents, in particular images of pages
from archival documents whose paper sources were scanned into digital form. The images contain various
combinations of text 📄, tables 📁, drawings 📈, and photos 🌄 - the categories 🏷️ described below were formed
based on those archival documents.

The key use case of the provided model and data processing pipeline is to classify an input PNG image, obtained from a
scanned paper PDF, into one of the categories, each of which triggers its own content-specific data processing pipeline.
In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the
input data as machine-typed (old-style fonts), handwritten ✏️, or plain printed 📄 text, or as text structured in a
tabular 📀 format, as well as to mark the presence of printed 🌄 or drawn 📈 graphic materials yet to be extracted from
the page images.
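
A minimal sketch of that first classification step is shown below. The hub model id `ufal/vit-historical-page` and the
top-3 output are assumptions inferred from the fine-tuned repository name above and the result tables described later,
not a verbatim excerpt of the project's code:

```python
# Minimal sketch: classify one PNG page and print the top-3 categories.
# "ufal/vit-historical-page" is an assumed hub id, not confirmed here.
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

model_id = "ufal/vit-historical-page"  # assumption based on [^1]
processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id).eval()

image = Image.open("page.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

# Top-N guesses with scores, N=3 here
top = probs.topk(3)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {score:.3f}")
```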

### Data 📜

Training set of the model: **8950** images

Evaluation set (10% of all data, with the same category proportions as below) [model_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250209-1534_model_1119_3_EVAL.csv) 📏: **995** images

Manual ✍ annotation was performed beforehand and took some time ⌛; the categories 🏷️ were formed from
different sources of archival documents dating from 1920 to 2020. The disproportion of the categories 🏷️ is
**NOT** intentional, but rather a result of the nature of the source data.

In total, several hundred separate PDF files were selected and split into PNG pages; some scanned documents
were one page long and some were much longer (dozens or hundreds of pages). The specific content and language of the
source data is irrelevant given the model's vision resolution; however, all of the data samples come from **archaeological
reports**, which may somewhat affect drawing detection, since the commonly depicted objects are ceramic pieces,
arrowheads, and rocks, first drawn by hand and later illustrated with digital tools.
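
The repository's own multiplatform preparation scripts are not shown in this excerpt; as a rough sketch of the
PDF-to-PNG step they cover, something like the following works on both Windows and Linux (assuming the `pdf2image`
package with a Poppler backend; paths and DPI are illustrative):

```python
# Hypothetical helper: split one PDF into per-page PNG files.
# pdf2image + Poppler assumed; the project's real scripts may differ.
from pathlib import Path
from pdf2image import convert_from_path

def pdf_to_pngs(pdf_path: str, out_dir: str, dpi: int = 150) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # One PIL image per page of the source document
    for i, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        page.save(out / f"{Path(pdf_path).stem}_{i:04d}.png", "PNG")

pdf_to_pngs("report.pdf", "pages")
```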

### Categories 🏷️

| Label       | Ratio  | Description                                                                     |
|------------:|:------:|:--------------------------------------------------------------------------------|
| **DRAW**    | 11.89% | **📈 - drawings, maps, paintings with text**                                    |
| **DRAW_L**  | 8.17%  | **📈📏 - drawings, etc. with a table legend or inside tabular layout / forms**  |
| **LINE_HW** | 5.99%  | **✏️📏 - handwritten text lines inside tabular layout / forms**                 |
| **LINE_P**  | 6.06%  | **📏 - printed text lines inside tabular layout / forms**                       |
| **LINE_T**  | 13.39% | **📏 - machine-typed text lines inside tabular layout / forms**                 |
| **PHOTO**   | 10.21% | **🌄 - photos with text**                                                       |
| **PHOTO_L** | 7.86%  | **🌄📏 - photos inside tabular layout / forms or with a tabular annotation**    |
| **TEXT**    | 8.58%  | **📰 - mixed types of printed and handwritten texts**                           |
| **TEXT_HW** | 7.36%  | **✏️📄 - only handwritten text**                                                |
| **TEXT_P**  | 6.95%  | **📄 - only printed text**                                                      |
| **TEXT_T**  | 13.53% | **📄 - only machine-typed text**                                                |

The categories were chosen to sort the pages by the following criteria:

- **presence of graphical elements** (drawings 📈 OR photos 🌄)
- **type of text** 📄 (handwritten ✏️️ OR printed OR typed OR mixed 📰)
- **presence of tabular layout / forms** 📀

The reason for this distinction is that different processing pipelines are applied to the different types of pages
after classification.
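
For downstream code, the table above implies an 11-way label set. A minimal sketch of the mapping (the id order here is
alphabetical by assumption; the released model's config may order the labels differently):

```python
# Category labels from the table above; the id order is an assumption,
# not taken from the released model configuration.
LABELS = ["DRAW", "DRAW_L", "LINE_HW", "LINE_P", "LINE_T",
          "PHOTO", "PHOTO_L", "TEXT", "TEXT_HW", "TEXT_P", "TEXT_T"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}
```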

### Training

During training, image transformations were applied sequentially, each with a 50% chance.

<details>

<summary>Image preprocessing steps 👀</summary>

* transforms.ColorJitter(**brightness** 0.5)
* transforms.ColorJitter(**contrast** 0.5)
* transforms.ColorJitter(**saturation** 0.5)
* transforms.ColorJitter(**hue** 0.5)
* transforms.Lambda(lambda img: ImageEnhance.**Sharpness**(img).enhance(random.uniform(0.5, 1.5)))
* transforms.Lambda(lambda img: img.filter(ImageFilter.**GaussianBlur**(radius=random.uniform(0, 2))))

</details>

> [!NOTE]
> No rotation, reshaping, or flipping was applied to the images; mainly color manipulations were used. The
> reasons behind this are pages containing specific form types, the general text orientation on the pages, and the
> default reshaping of the model input to square 224x224 resolution images.
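
Composed as a torchvision pipeline, the steps above might look like the sketch below; wrapping each step in
`RandomApply` is one reading of "applied sequentially, each with a 50% chance", not a verbatim excerpt of the
training code:

```python
# Sketch of the augmentation list above: each transform is applied
# independently with probability 0.5, in the listed order (PIL images).
import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```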

<details>

<summary>Training hyperparameters 👀</summary>

* eval_strategy "epoch"
* save_strategy "epoch"
* learning_rate **5e-5**
* per_device_train_batch_size 8
* per_device_eval_batch_size 8
* num_train_epochs **3**
* warmup_ratio **0.1**
* logging_steps **10**
* load_best_model_at_end True
* metric_for_best_model "accuracy"

</details>
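
Expressed as Hugging Face `TrainingArguments`, the settings above would look roughly like this (a sketch; `output_dir`
is illustrative and everything not listed is left at its default):

```python
# The listed hyperparameters as TrainingArguments; output_dir is
# illustrative, all other arguments keep their defaults.
# (Older transformers versions name eval_strategy "evaluation_strategy".)
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vit-historical-page",  # illustrative
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```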

## Results 📈

Evaluation set's accuracy (**Top-3**): **99.6%**

@@ -94,13 +140,13 @@ Evaluation set's accuracy (**Top-1**): **97.3%**
- **SCORE-N** - score of the category 🏷️, guess TOP-N
- **TRUE** - actual label of the category 🏷️

## Contacts 📧

For support write to 📧 [email protected] 📧

Official repository: UFAL [^3]

## Acknowledgements 🙏

- **Developed by** UFAL [^5] 👥
- **Funded by** ATRIUM [^4] 💰