---
library_name: transformers
tags:
- page
- classification
base_model:
- google/vit-base-patch16-224
- google/vit-base-patch16-384
- google/vit-large-patch16-384
pipeline_tag: image-classification
license: mit
---
# Image classification using fine-tuned ViT - for historical :bowtie: document sorting
### Goal: solve the task of sorting archive page images (for their further content-based processing)
**Scope:** image processing, training and evaluation of the ViT model,
input file/directory handling, output of the top-N predicted class
(category) results, summarization of predictions into a tabular format,
and HF hub support for the model
## Versions
There are currently 5 versions of the model available for download. All of them share the same set of categories,
but they differ in data annotations, training set size, and model base. The latest approved version, `v2.1`, is the default
and can be found in the `main` branch of the HF hub [^1]. A minimal loading sketch follows the table below.
| Version | Base                    | Pages | PDFs     | Description                                                                  |
|--------:|-------------------------|:-----:|:--------:|:-----------------------------------------------------------------------------|
| `v2.0`  | `vit-base-patch16-224`  | 10073 | **3896** | annotations with mistakes, more heterogeneous data                            |
| `v2.1`  | `vit-base-patch16-224`  | 11940 | **5002** | `main`: more diverse pages in each category, fewer annotation mistakes       |
| `v2.2`  | `vit-base-patch16-224`  | 15855 | **5730** | same data as `v2.1` + some restored pages from `v2.0`                         |
| `v3.2`  | `vit-base-patch16-384`  | 15855 | **5730** | same data as `v2.2`, but with a higher input resolution (384x384)             |
| `v5.2`  | `vit-large-patch16-384` | 15855 | **5730** | same data as `v2.2`, but the largest model base with higher resolution        |
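Each version lives on the HF hub [^1] and can be pinned when loading; a minimal sketch using the `transformers` auto classes and the confirmed `main` branch (which holds `v2.1`) as the `revision` (branch names for the other versions are not confirmed by this card):

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification

MODEL = "k4tel/vit-historical-page"  # fine-tuned model repository [^1]

# Pin the default v2.1 version by loading from the `main` branch.
processor = AutoImageProcessor.from_pretrained(MODEL, revision="main")
model = AutoModelForImageClassification.from_pretrained(MODEL, revision="main")
```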
## Model description
Fine-tuned model repository: **vit-historical-page** [^1]
Base model repositories: Google's **vit-base-patch16-224**, **vit-base-patch16-384**, and **vit-large-patch16-384** [^2] [^6] [^7]
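For quick inference, a minimal sketch using the `transformers` image-classification pipeline to print the top-3 category guesses for one page image (the file name `page.png` is a placeholder):

```python
from transformers import pipeline

# Image-classification pipeline backed by the fine-tuned model [^1].
classifier = pipeline("image-classification", model="k4tel/vit-historical-page")

# Top-3 predictions for a single page image: each entry has a label and a score.
for prediction in classifier("page.png", top_k=3):
    print(f"{prediction['label']}: {prediction['score']:.4f}")
```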
### Data
Training set sizes:
* `v2.0`: **8950** images
* `v2.1`: **10745** images
### Categories
**v2.0 categories**:
| Label     | Ratio  | Description                                                              |
|----------:|:------:|:--------------------------------------------------------------------------|
| `DRAW`    | 11.89% | **drawings, maps, paintings with text**                                    |
| `DRAW_L`  | 8.17%  | **drawings, etc. with a table legend or inside tabular layout / forms**    |
| `LINE_HW` | 5.99%  | **handwritten text lines inside tabular layout / forms**                   |
| `LINE_P`  | 6.06%  | **printed text lines inside tabular layout / forms**                       |
| `LINE_T`  | 13.39% | **machine-typed text lines inside tabular layout / forms**                 |
| `PHOTO`   | 10.21% | **photos with text**                                                       |
| `PHOTO_L` | 7.86%  | **photos inside tabular layout / forms or with a tabular annotation**      |
| `TEXT`    | 8.58%  | **mixed types of printed and handwritten texts**                           |
| `TEXT_HW` | 7.36%  | **only handwritten text**                                                  |
| `TEXT_P`  | 6.95%  | **only printed text**                                                      |
| `TEXT_T`  | 13.53% | **only machine-typed text**                                                |
**v2.1 categories**:
| Label     | Ratio | Description                                                              |
|----------:|:-----:|:---------------------------------------------------------------------------|
| `DRAW`    | 9.12% | **drawings, maps, paintings with text**                                     |
| `DRAW_L`  | 9.14% | **drawings, etc. with a table legend or inside tabular layout / forms**     |
| `LINE_HW` | 8.84% | **handwritten text lines inside tabular layout / forms**                    |
| `LINE_P`  | 9.15% | **printed text lines inside tabular layout / forms**                        |
| `LINE_T`  | 9.20% | **machine-typed text lines inside tabular layout / forms**                  |
| `PHOTO`   | 9.05% | **photos with text**                                                        |
| `PHOTO_L` | 9.10% | **photos inside tabular layout / forms or with a tabular annotation**       |
| `TEXT`    | 9.14% | **mixed types of printed and handwritten texts**                            |
| `TEXT_HW` | 9.14% | **only handwritten text**                                                   |
| `TEXT_P`  | 9.07% | **only printed text**                                                       |
| `TEXT_T`  | 9.05% | **only machine-typed text**                                                 |
Evaluation set sizes (same category proportions):
* `v2.0`: **995** images
* `v2.1`: **1194** images
#### Data preprocessing
During training, each of the following transforms was applied randomly with a 50% chance (see the sketch after this list):
* `transforms.ColorJitter(brightness=0.5)`
* `transforms.ColorJitter(contrast=0.5)`
* `transforms.ColorJitter(saturation=0.5)`
* `transforms.ColorJitter(hue=0.5)`
* `transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))`
* `transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))`
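A sketch of how these transforms can be composed so that each one fires independently with a 50% chance; wrapping each in `torchvision.transforms.RandomApply` is an assumption about how the probability was wired up, not a confirmed detail of the training script:

```python
import random

from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# Each augmentation is wrapped in RandomApply with p=0.5, so every training
# image gets an independent 50% chance of each distortion.
train_augmentations = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```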
### Training Hyperparameters
The key training settings (a sketch assembling them into `TrainingArguments` follows the list):
* `eval_strategy="epoch"`
* `save_strategy="epoch"`
* `learning_rate=5e-5`
* `per_device_train_batch_size=8`
* `per_device_eval_batch_size=8`
* `num_train_epochs=3`
* `warmup_ratio=0.1`
* `logging_steps=10`
* `load_best_model_at_end=True`
* `metric_for_best_model="accuracy"`
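A sketch putting these hyperparameters into a `transformers` `TrainingArguments` object; the `output_dir`, the accuracy-metric wiring, and anything not listed above are assumptions (the argument name `eval_strategy` matches recent `transformers` releases):

```python
import numpy as np
import evaluate
from transformers import TrainingArguments

# Accuracy metric used for best-model selection (load_best_model_at_end).
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return accuracy.compute(predictions=np.argmax(logits, axis=-1), references=labels)

training_args = TrainingArguments(
    output_dir="vit-historical-page",  # placeholder output directory
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```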
### Results
**v2.0** Evaluation set's accuracy (**Top-3**): **99.6%**

**v2.1** Evaluation set's accuracy (**Top-3**): **99.75%**

**v2.0** Evaluation set's accuracy (**Top-1**): **97.3%**

**v2.1** Evaluation set's accuracy (**Top-1**): **96.82%**

#### Result tables
- **v2.0** Manually **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250416-1426_model_1119_3_TOP-3_EVAL.csv)
- **v2.0** Manually **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250416-1431_model_1119_3_TOP-1_EVAL.csv)
- **v2.1** Manually **checked** evaluation dataset results (TOP-3): [model_TOP-3_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250417-1044_model_672_3_TOP-3_EVAL.csv)
- **v2.1** Manually **checked** evaluation dataset results (TOP-1): [model_TOP-1_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250417-1050_model_672_3_TOP-1_EVAL.csv)
#### Table columns
- **FILE** - name of the processed file
- **PAGE** - page number
- **CLASS-N** - label of the TOP-N predicted category
- **SCORE-N** - confidence score of the TOP-N predicted category
- **TRUE** - actual (manually checked) category label

A short snippet for inspecting these tables is sketched below.
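The sketch assumes comma-separated files, the column names described above, and a placeholder file name:

```python
import pandas as pd

df = pd.read_csv("model_TOP-1_EVAL.csv")

# Rows where the top-1 predicted category disagrees with the true label.
errors = df[df["CLASS-1"] != df["TRUE"]]
print(errors[["FILE", "PAGE", "CLASS-1", "SCORE-1", "TRUE"]])
```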
### Contacts
For support, write to [email protected]
Official repository: UFAL [^3]
### Acknowledgements
- **Developed by:** UFAL [^5]
- **Funded by:** ATRIUM [^4]
- **Shared by:** ATRIUM [^4] & UFAL [^5]
- **Model type:** fine-tuned ViT with 224x224 [^2] or 384x384 [^6] [^7] input resolution
**© 2022 UFAL & ATRIUM**
[^1]: https://huggingface.co/k4tel/vit-historical-page
[^2]: https://huggingface.co/google/vit-base-patch16-224
[^3]: https://github.com/ufal/atrium-page-classification
[^4]: https://atrium-research.eu/
[^5]: https://ufal.mff.cuni.cz/home-page
[^6]: https://huggingface.co/google/vit-base-patch16-384
[^7]: https://huggingface.co/google/vit-large-patch16-384