---
library_name: transformers
tags:
- page
- classification
base_model:
- google/vit-base-patch16-224
- google/vit-base-patch16-384
- google/vit-large-patch16-384
pipeline_tag: image-classification
license: mit
---

# Image classification using fine-tuned ViT - for historical :bowtie: documents sorting

### Goal: solve a task of archive page images sorting (for their further content-based processing)

**Scope:** processing of images, training and evaluation of the ViT model, input file/directory processing, class 🏷️ (category) output of the top N predictions, summarizing of predictions into a tabular format, and HF 😊 hub support for the model

## Versions 🏁

There are currently 5 versions of the model available for download; all of them share the same set of categories but differ in their data annotations and model base. The latest approved version, `v2.1`, is the default and can be found in the `main` branch of the HF 😊 hub [^1] πŸ”—

| Version | Base                    | Pages | PDFs     | Description                                                              |
|--------:|-------------------------|:-----:|:--------:|:-------------------------------------------------------------------------|
| `v2.0`  | `vit-base-patch16-224`  | 10073 | **3896** | annotations with mistakes, more heterogeneous data                       |
| `v2.1`  | `vit-base-patch16-224`  | 11940 | **5002** | `main`: more diverse pages in each category, fewer annotation mistakes   |
| `v2.2`  | `vit-base-patch16-224`  | 15855 | **5730** | same data as `v2.1` + some restored pages from `v2.0`                    |
| `v3.2`  | `vit-base-patch16-384`  | 15855 | **5730** | same data as `v2.2`, but a bit larger model base with higher resolution  |
| `v5.2`  | `vit-large-patch16-384` | 15855 | **5730** | same data as `v2.2`, but the largest model base with higher resolution   |

## Model description πŸ“‡

πŸ”² Fine-tuned model repository: vit-historical-page [^1] πŸ”—

πŸ”³ Base model repository: Google's **vit-base-patch16-224**, **vit-base-patch16-384**, **vit-large-patch16-384** [^2] [^6] [^7] πŸ”—

### Data πŸ“œ

Training set of the model: **8950** images for v2.0

Training set of the model: **10745** images for v2.1

### Categories 🏷️

**v2.0 version Categories πŸͺ§**:

| Label     | Ratio  | Description                                                                    |
|----------:|:------:|:--------------------------------------------------------------------------------|
| `DRAW`    | 11.89% | **πŸ“ˆ - drawings, maps, paintings with text**                                   |
| `DRAW_L`  | 8.17%  | **πŸ“ˆπŸ“ - drawings, etc. with a table legend or inside tabular layout / forms** |
| `LINE_HW` | 5.99%  | **βœοΈπŸ“ - handwritten text lines inside tabular layout / forms**                |
| `LINE_P`  | 6.06%  | **πŸ“ - printed text lines inside tabular layout / forms**                      |
| `LINE_T`  | 13.39% | **πŸ“ - machine typed text lines inside tabular layout / forms**                |
| `PHOTO`   | 10.21% | **πŸŒ„ - photos with text**                                                      |
| `PHOTO_L` | 7.86%  | **πŸŒ„πŸ“ - photos inside tabular layout / forms or with a tabular annotation**   |
| `TEXT`    | 8.58%  | **πŸ“° - mixed types of printed and handwritten texts**                          |
| `TEXT_HW` | 7.36%  | **βœοΈπŸ“„ - only handwritten text**                                               |
| `TEXT_P`  | 6.95%  | **πŸ“„ - only printed text**                                                     |
| `TEXT_T`  | 13.53% | **πŸ“„ - only machine typed text**                                               |

**v2.1 version Categories πŸͺ§**:

| Label     | Ratio | Description                                                                    |
|----------:|:-----:|:--------------------------------------------------------------------------------|
| `DRAW`    | 9.12% | **πŸ“ˆ - drawings, maps, paintings with text**                                   |
| `DRAW_L`  | 9.14% | **πŸ“ˆπŸ“ - drawings, etc. with a table legend or inside tabular layout / forms** |
| `LINE_HW` | 8.84% | **βœοΈπŸ“ - handwritten text lines inside tabular layout / forms**                |
| `LINE_P`  | 9.15% | **πŸ“ - printed text lines inside tabular layout / forms**                      |
| `LINE_T`  | 9.20% | **πŸ“ - machine typed text lines inside tabular layout / forms**                |
| `PHOTO`   | 9.05% | **πŸŒ„ - photos with text**                                                      |
| `PHOTO_L` | 9.10% | **πŸŒ„πŸ“ - photos inside tabular layout / forms or with a tabular annotation**   |
| `TEXT`    | 9.14% | **πŸ“° - mixed types of printed and handwritten texts**                          |
| `TEXT_HW` | 9.14% | **βœοΈπŸ“„ - only handwritten text**                                               |
| `TEXT_P`  | 9.07% | **πŸ“„ - only printed text**                                                     |
| `TEXT_T`  | 9.05% | **πŸ“„ - only machine typed text**                                               |

Evaluation set (same proportions): **995** images for v2.0

Evaluation set (same proportions): **1194** images for v2.1

#### Data preprocessing

During training, each of the following transforms was applied randomly with a 50% chance (a runnable sketch of the pipeline follows the list):

* transforms.ColorJitter(brightness=0.5)
* transforms.ColorJitter(contrast=0.5)
* transforms.ColorJitter(saturation=0.5)
* transforms.ColorJitter(hue=0.5)
* transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
* transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
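A minimal sketch of this augmentation pipeline, assuming `torchvision` and `Pillow`; wrapping each transform in `RandomApply` is one way to express the independent 50% chance per transform:

```python
import random

from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# Each augmentation fires independently with p=0.5, matching the list above.
train_augmentations = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    # Random sharpness factor in [0.5, 1.5]: below 1 softens, above 1 sharpens.
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    # Gaussian blur with a random radius of up to 2 pixels.
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```

The `Lambda` transforms operate on PIL images, so this pipeline runs before the ViT image processor converts pages to normalized tensors.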
### Training Hyperparameters

The following fine-tuning hyperparameters were used; they map one-to-one onto `transformers.TrainingArguments` fields (see the sketch after this list):

* eval_strategy "epoch"
* save_strategy "epoch"
* learning_rate 5e-5
* per_device_train_batch_size 8
* per_device_eval_batch_size 8
* num_train_epochs 3
* warmup_ratio 0.1
* logging_steps 10
* load_best_model_at_end True
* metric_for_best_model "accuracy"
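A minimal configuration sketch under the assumption that the model and dataset are prepared separately; `output_dir` is an illustrative placeholder, not the project's actual path:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vit-historical-page-checkpoints",  # hypothetical output path
    eval_strategy="epoch",             # evaluate at the end of every epoch
    save_strategy="epoch",             # checkpoint at the end of every epoch
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,                  # linear warmup over the first 10% of steps
    logging_steps=10,
    load_best_model_at_end=True,       # restore the best checkpoint at the end
    metric_for_best_model="accuracy",
)
```

Passing these arguments to a `Trainer` together with the augmented dataset and an accuracy-based `compute_metrics` function reproduces the schedule above (note that `eval_strategy` requires a recent `transformers` release; older versions call it `evaluation_strategy`).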
### Results πŸ“Š

**v2.0** Evaluation set's accuracy (**Top-3**): **99.6%**

![TOP-3 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250416-1430_conf_mat_TOP-3.png?raw=true)

**v2.1** Evaluation set's accuracy (**Top-3**): **99.75%**

![TOP-3 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250417-1049_conf_mat_TOP-3.png?raw=true)

**v2.0** Evaluation set's accuracy (**Top-1**): **97.3%**

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250416-1436_conf_mat_TOP-1.png?raw=true)

**v2.1** Evaluation set's accuracy (**Top-1**): **96.82%**

![TOP-1 confusion matrix - trained ViT](https://github.com/ufal/atrium-page-classification/blob/main/result/plots/20250417-1055_conf_mat_TOP-1.png?raw=true)

#### Result tables

- **v2.0** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250416-1426_model_1119_3_TOP-3_EVAL.csv) πŸ”—
- **v2.0** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250416-1431_model_1119_3_TOP-1_EVAL.csv) πŸ”—
- **v2.1** Manually ✍ **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250417-1044_model_672_3_TOP-3_EVAL.csv) πŸ”—
- **v2.1** Manually ✍ **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250417-1050_model_672_3_TOP-1_EVAL.csv) πŸ”—

#### Table columns

- **FILE** - name of the file
- **PAGE** - number of the page
- **CLASS-N** - label of the category 🏷️, guess TOP-N
- **SCORE-N** - score of the category 🏷️, guess TOP-N
- **TRUE** - actual label of the category 🏷️

A minimal inference sketch that reproduces the **CLASS-N** and **SCORE-N** columns for a single image closes this card.

### Contacts πŸ“§

For support write to πŸ“§ lutsai.k@gmail.com πŸ“§

Official repository: UFAL [^3] πŸ”—

### Acknowledgements πŸ™

- **Developed by** UFAL [^5] πŸ‘₯
- **Funded by** ATRIUM [^4] πŸ’°
- **Shared by** ATRIUM [^4] & UFAL [^5]
- **Model type:** fine-tuned ViT with a 224x224 [^2] πŸ”— or 384x384 [^6] [^7] πŸ”— input resolution

**©️ 2022 UFAL & ATRIUM**

[^1]: https://huggingface.co/k4tel/vit-historical-page
[^2]: https://huggingface.co/google/vit-base-patch16-224
[^3]: https://github.com/ufal/atrium-page-classification
[^4]: https://atrium-research.eu/
[^5]: https://ufal.mff.cuni.cz/home-page
[^6]: https://huggingface.co/google/vit-base-patch16-384
[^7]: https://huggingface.co/google/vit-large-patch16-384
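As a closing usage note, here is a minimal inference sketch with 😊 `transformers` that loads the default `v2.1` checkpoint [^1] and prints its top-3 guesses, analogous to the **CLASS-N** / **SCORE-N** table columns; the file name `page.png` and the top-3 cutoff are illustrative:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

MODEL = "k4tel/vit-historical-page"  # main branch holds the default v2.1

processor = AutoImageProcessor.from_pretrained(MODEL)
model = AutoModelForImageClassification.from_pretrained(MODEL)

image = Image.open("page.png").convert("RGB")  # illustrative input file
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1)[0]

# Top-3 category guesses with their scores.
for score, idx in zip(*probs.topk(3)):
    print(f"{model.config.id2label[idx.item()]}: {score.item():.4f}")
```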