Image classification using a fine-tuned ViT for historical document sorting

Goal: sort archive page images (for their further content-based processing)

Scope: image processing, training and evaluation of the ViT model, input file/directory processing, output of class 🏷️ (category) results for the top-N predictions, summarizing of predictions into a tabular format, and HF 😊 hub support for the model

Versions 🏁

There are currently 2 versions of the model available for download; both have the same set of categories but different data annotations. The latest approved version, v2.1, is considered the default and can be found in the main branch of the HF 😊 hub ^1 🔗

| Version | Base | Pages | PDFs | Description |
|---------|------|-------|------|-------------|
| v2.0 | vit-base-patch16-224 | 10073 | 3896 | annotations with mistakes, more heterogeneous data |
| v2.1 | vit-base-patch16-224 | 11940 | 5002 | main: more diverse pages in each category, fewer annotation mistakes |
| v2.2 | vit-base-patch16-224 | 15855 | 5730 | same data as v2.1 + some restored pages from v2.0 |
| v3.2 | vit-base-patch16-384 | 15855 | 5730 | same data as v2.2, but a slightly larger model base with a higher resolution |
| v5.2 | vit-large-patch16-384 | 15855 | 5730 | same data as v2.2, but the largest model base with a higher resolution |

Model description 📇

🔲 Fine-tuned model repository: vit-historical-page ^1 🔗

🔳 Base model repository: Google's vit-base-patch16-224, vit-base-patch16-384, vit-large-patch16-384 ^2 [^13] [^14] 🔗
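
The fine-tuned checkpoint can be loaded straight from the HF 😊 hub. A minimal inference sketch (not the project's own scripts), assuming a local page image at a placeholder path and the default v2.1 revision on the main branch:

```python
from transformers import pipeline

# Load the default checkpoint (v2.1, "main" branch) and ask for the top-3
# category predictions of a single page image; the file name is a placeholder.
classifier = pipeline(
    "image-classification",
    model="ufal/vit-historical-page",
    revision="main",
)
predictions = classifier("page_0001.png", top_k=3)
for p in predictions:
    print(p["label"], round(p["score"], 4))
```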

Data 📜

Training set of the model: 8950 images for v2.0

Training set of the model: 10745 images for v2.1

Categories 🏷️

v2.0 version Categories 🪧:

| Label | Ratio | Description |
|-------|-------|-------------|
| DRAW | 11.89% | 📈 - drawings, maps, paintings with text |
| DRAW_L | 8.17% | 📈📏 - drawings, etc. with a table legend or inside tabular layout / forms |
| LINE_HW | 5.99% | ✏️📏 - handwritten text lines inside tabular layout / forms |
| LINE_P | 6.06% | 📏 - printed text lines inside tabular layout / forms |
| LINE_T | 13.39% | 📏 - machine-typed text lines inside tabular layout / forms |
| PHOTO | 10.21% | 🌄 - photos with text |
| PHOTO_L | 7.86% | 🌄📏 - photos inside tabular layout / forms or with a tabular annotation |
| TEXT | 8.58% | 📰 - mixed types of printed and handwritten texts |
| TEXT_HW | 7.36% | ✏️📄 - only handwritten text |
| TEXT_P | 6.95% | 📄 - only printed text |
| TEXT_T | 13.53% | 📄 - only machine-typed text |

v2.1 version Categories 🪧:

| Label | Ratio | Description |
|-------|-------|-------------|
| DRAW | 9.12% | 📈 - drawings, maps, paintings with text |
| DRAW_L | 9.14% | 📈📏 - drawings, etc. with a table legend or inside tabular layout / forms |
| LINE_HW | 8.84% | ✏️📏 - handwritten text lines inside tabular layout / forms |
| LINE_P | 9.15% | 📏 - printed text lines inside tabular layout / forms |
| LINE_T | 9.2% | 📏 - machine-typed text lines inside tabular layout / forms |
| PHOTO | 9.05% | 🌄 - photos with text |
| PHOTO_L | 9.1% | 🌄📏 - photos inside tabular layout / forms or with a tabular annotation |
| TEXT | 9.14% | 📰 - mixed types of printed and handwritten texts |
| TEXT_HW | 9.14% | ✏️📄 - only handwritten text |
| TEXT_P | 9.07% | 📄 - only printed text |
| TEXT_T | 9.05% | 📄 - only machine-typed text |

Evaluation set (same proportions): 995 images for v2.0

Evaluation set (same proportions): 1194 images for v2.1

Data preprocessing

During training, each of the following transforms was applied randomly with a 50% chance (see the sketch after the list):

  • transforms.ColorJitter(brightness=0.5)
  • transforms.ColorJitter(contrast=0.5)
  • transforms.ColorJitter(saturation=0.5)
  • transforms.ColorJitter(hue=0.5)
  • transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
  • transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
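
A minimal sketch of how these augmentations could be composed with torchvision, assuming PIL input images and wrapping each transform in RandomApply for the stated 50% chance (the exact training pipeline may differ):

```python
import random

from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# Composition sketch: each augmentation is wrapped in RandomApply so that it
# fires independently with a 50% chance, as listed above.
train_augmentations = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```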

Training Hyperparameters

  • eval_strategy "epoch"
  • save_strategy "epoch"
  • learning_rate 5e-5
  • per_device_train_batch_size 8
  • per_device_eval_batch_size 8
  • num_train_epochs 3
  • warmup_ratio 0.1
  • logging_steps 10
  • load_best_model_at_end True
  • metric_for_best_model "accuracy"
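
A sketch of how these hyperparameters map onto Hugging Face TrainingArguments; the output directory is a placeholder and the model/dataset wiring is omitted:

```python
from transformers import TrainingArguments

# Mapping of the listed hyperparameters onto TrainingArguments;
# "vit-historical-page-output" is a placeholder output directory.
training_args = TrainingArguments(
    output_dir="vit-historical-page-output",
    eval_strategy="epoch",   # called evaluation_strategy in older transformers releases
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```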

Results 📊

v2.0 Evaluation set's accuracy (Top-3): 99.6%

TOP-3 confusion matrix - trained ViT

v2.1 Evaluation set's accuracy (Top-3): 99.75%

TOP-3 confusion matrix - trained ViT

v2.0 Evaluation set's accuracy (Top-1): 97.3%

TOP-1 confusion matrix - trained ViT

v2.1 Evaluation set's accuracy (Top-1): 96.82%

TOP-1 confusion matrix - trained ViT
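
For reference, a generic sketch of how the Top-1 and Top-3 accuracies above can be computed from model logits and true label ids (not the project's own evaluation code):

```python
import numpy as np

def top_k_accuracy(logits: np.ndarray, labels: np.ndarray, k: int = 3) -> float:
    """Fraction of pages whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(logits, axis=-1)[:, -k:]    # (n_pages, k) indices of best classes
    hits = (top_k == labels[:, None]).any(axis=1)  # is the true label among them?
    return float(hits.mean())
```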

Result tables

Table columns

  • FILE - name of the input file
  • PAGE - page number
  • CLASS-N - predicted category 🏷️ label of the top-N guess
  • SCORE-N - confidence score of the category 🏷️ for the top-N guess
  • TRUE - actual category 🏷️ label
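
A hypothetical sketch of summarizing per-page predictions into this tabular format with pandas; the column layout follows the list above, while the values in the example row are invented placeholders, not real results:

```python
import pandas as pd

# The example record below is a placeholder (file name, labels, and scores are
# invented for illustration); only the column layout follows the list above.
rows = [
    {"FILE": "archive_0001.pdf", "PAGE": 1,
     "CLASS-1": "TEXT_P", "SCORE-1": 0.97,
     "CLASS-2": "TEXT_T", "SCORE-2": 0.02,
     "CLASS-3": "TEXT",   "SCORE-3": 0.01,
     "TRUE": "TEXT_P"},
]
columns = ["FILE", "PAGE",
           "CLASS-1", "SCORE-1", "CLASS-2", "SCORE-2", "CLASS-3", "SCORE-3",
           "TRUE"]
results_table = pd.DataFrame(rows, columns=columns)
results_table.to_csv("results.csv", index=False)  # placeholder output path
```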

Contacts 📧

For support, write to 📧 [email protected] 📧

Official repository: UFAL ^3

Acknowledgements 🙏

  • Developed by UFAL ^5 👥
  • Funded by ATRIUM ^4 💰
  • Shared by ATRIUM ^4 & UFAL ^5
  • Model type: fine-tuned ViT with a 224x224 ^2 🔗 or 384x384 [^13] [^14] 🔗 input resolution

©️ 2022 UFAL & ATRIUM
