Image classification using fine-tuned ViT for sorting historical documents
Goal: sort archive page images so that they can be passed on to further content-based processing
Scope: image processing, training and evaluation of the ViT model, processing of input files and directories, output of the top-N predicted classes (categories), summarizing predictions in a tabular format, HF hub support for the model
Versions
Several versions of the model are available for download. All of them share the same set of categories but differ in data annotations or base model. The latest approved version, v2.1, is the default and can be found in the main branch of the HF hub [^1].
| Version | Base | Pages | PDFs | Description |
|---|---|---|---|---|
| v2.0 | vit-base-patch16-224 | 10073 | 3896 | annotations with mistakes, more heterogeneous data |
| v2.1 | vit-base-patch16-224 | 11940 | 5002 | main: more diverse pages in each category, fewer annotation mistakes |
| v2.2 | vit-base-patch16-224 | 15855 | 5730 | same data as v2.1 + some restored pages from v2.0 |
| v3.2 | vit-base-patch16-384 | 15855 | 5730 | same data as v2.0.2, but a slightly larger model base with higher resolution |
| v5.2 | vit-large-patch16-384 | 15855 | 5730 | same data as v2.0.2, but the largest model base with higher resolution |
Model description
Fine-tuned model repository: ufal/vit-historical-page [^1]
Base model repository: Google's vit-base-patch16-224, vit-base-patch16-384, vit-large-patch16-384 [^2] [^13] [^14]
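A minimal sketch of loading the fine-tuned model from the HF hub and asking for the top-N guesses for one page image. It assumes the `transformers` library is installed and uses its generic `image-classification` pipeline; the helper name `classify_page` is hypothetical, and only the `main` revision is confirmed by this card, so any other branch name would have to be checked on the hub.

```python
def classify_page(image_path, top_n=3, revision="main"):
    """Return the top-N (label, score) guesses for one page image."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    classifier = pipeline(
        "image-classification",
        model="ufal/vit-historical-page",  # repository id from the model page
        revision=revision,                 # hub branch that holds the wanted version
    )
    # The pipeline returns a list of {"label": ..., "score": ...} dicts.
    predictions = classifier(image_path, top_k=top_n)
    return [(p["label"], p["score"]) for p in predictions]
```

Calling `classify_page("page.png")` with a hypothetical local image path would download the model on first use, so it needs network access.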
Data
Training set of the model: 8950 images for v2.0
Training set of the model: 10745 images for v2.1
Categories
v2.0 version Categories:
| Label | Ratio | Description |
|---|---|---|
| DRAW | 11.89% | drawings, maps, paintings with text |
| DRAW_L | 8.17% | drawings, etc. with a table legend or inside tabular layout / forms |
| LINE_HW | 5.99% | handwritten text lines inside tabular layout / forms |
| LINE_P | 6.06% | printed text lines inside tabular layout / forms |
| LINE_T | 13.39% | machine typed text lines inside tabular layout / forms |
| PHOTO | 10.21% | photos with text |
| PHOTO_L | 7.86% | photos inside tabular layout / forms or with a tabular annotation |
| TEXT | 8.58% | mixed types of printed and handwritten texts |
| TEXT_HW | 7.36% | only handwritten text |
| TEXT_P | 6.95% | only printed text |
| TEXT_T | 13.53% | only machine typed text |
v2.1 version Categories:
| Label | Ratio | Description |
|---|---|---|
| DRAW | 9.12% | drawings, maps, paintings with text |
| DRAW_L | 9.14% | drawings, etc. with a table legend or inside tabular layout / forms |
| LINE_HW | 8.84% | handwritten text lines inside tabular layout / forms |
| LINE_P | 9.15% | printed text lines inside tabular layout / forms |
| LINE_T | 9.2% | machine typed text lines inside tabular layout / forms |
| PHOTO | 9.05% | photos with text |
| PHOTO_L | 9.1% | photos inside tabular layout / forms or with a tabular annotation |
| TEXT | 9.14% | mixed types of printed and handwritten texts |
| TEXT_HW | 9.14% | only handwritten text |
| TEXT_P | 9.07% | only printed text |
| TEXT_T | 9.05% | only machine typed text |
Evaluation set (same proportions): 995 images for v2.0
Evaluation set (same proportions): 1194 images for v2.1
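One way to obtain an evaluation set with the same category proportions as the training set is a per-label (stratified) split. The sketch below is an assumption about how such a split could be done, not the project's actual procedure; the function name, the 10% fraction (which roughly matches the counts above) and the toy file names are all illustrative.

```python
import random
from collections import defaultdict

def stratified_split(samples, eval_fraction=0.1, seed=42):
    """Split (path, label) pairs so each label keeps the same share in both parts."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))

    rng = random.Random(seed)
    train, evaluation = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Take the same fraction from every label.
        n_eval = round(len(items) * eval_fraction)
        evaluation.extend(items[:n_eval])
        train.extend(items[n_eval:])
    return train, evaluation

# Toy data with hypothetical file names: 20 pages for each of two labels.
samples = [(f"page_{i}.png", "TEXT_P" if i % 2 else "TEXT_HW") for i in range(40)]
train, evaluation = stratified_split(samples)
```

With this toy input, two pages of each label end up in the evaluation part and the remaining 36 pages form the training part.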
Data preprocessing
During training the following transforms were applied randomly with a 50% chance:
- transforms.ColorJitter(brightness=0.5)
- transforms.ColorJitter(contrast=0.5)
- transforms.ColorJitter(saturation=0.5)
- transforms.ColorJitter(hue=0.5)
- transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
- transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
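The "50% chance" rule can be sketched without any image library. The example below assumes each transform in the list is applied independently with probability 0.5, which is one reading of the description above; `apply_randomly` and the stand-in transforms are hypothetical names. In torchvision, a similar effect is commonly obtained by wrapping each transform in `transforms.RandomApply([...], p=0.5)`.

```python
import random

def apply_randomly(image, transforms, p=0.5, rng=random):
    """Apply each transform independently with probability p, in list order."""
    for transform in transforms:
        if rng.random() < p:
            image = transform(image)
    return image

# Stand-in transforms that only record their name; in real training these would
# be the ColorJitter, sharpness and blur callables listed above.
def tag(name):
    return lambda applied: applied + [name]

stand_ins = [tag("brightness"), tag("contrast"), tag("blur")]
always = apply_randomly([], stand_ins, p=1.0)  # every transform fires
never = apply_randomly([], stand_ins, p=0.0)   # no transform fires
```

With the default `p=0.5`, each training image receives a different random subset of the augmentations on every pass.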
Training Hyperparameters
- eval_strategy "epoch"
- save_strategy "epoch"
- learning_rate 5e-5
- per_device_train_batch_size 8
- per_device_eval_batch_size 8
- num_train_epochs 3
- warmup_ratio 0.1
- logging_steps 10
- load_best_model_at_end True
- metric_for_best_model "accuracy"
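For reference, the values above can be collected into keyword arguments for `transformers.TrainingArguments`. This is a sketch under assumptions: the helper name and the `output_dir` argument are illustrative, and `eval_strategy` is the parameter name used by recent `transformers` releases (older releases call it `evaluation_strategy`).

```python
# Hyperparameters exactly as listed in this card.
training_kwargs = {
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "num_train_epochs": 3,
    "warmup_ratio": 0.1,
    "logging_steps": 10,
    "load_best_model_at_end": True,
    "metric_for_best_model": "accuracy",
}

def build_training_arguments(output_dir):
    """Create TrainingArguments from the values above (needs transformers)."""
    # Imported lazily so the dictionary can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **training_kwargs)
```

Because `load_best_model_at_end` is enabled, the evaluation and save strategies have to match, which is why both are set to `"epoch"`.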
Results
v2.0 Evaluation set's accuracy (Top-3): 99.6%
v2.1 Evaluation set's accuracy (Top-3): 99.75%
v2.0 Evaluation set's accuracy (Top-1): 97.3%
v2.1 Evaluation set's accuracy (Top-1): 96.82%
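Top-1 and Top-3 accuracy differ only in how many ranked guesses are allowed to contain the true label. A minimal sketch of the metric on made-up predictions follows; the function name and the toy data are illustrative and are not taken from the evaluation set.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Share of samples whose true label is among the first k ranked guesses."""
    hits = sum(
        true in ranked[:k]
        for ranked, true in zip(ranked_predictions, true_labels)
    )
    return hits / len(true_labels)

# Four toy pages, each with three guesses ordered from most to least likely.
ranked = [
    ["TEXT_P", "TEXT", "TEXT_T"],
    ["LINE_HW", "TEXT_HW", "LINE_P"],
    ["PHOTO", "DRAW", "PHOTO_L"],
    ["DRAW_L", "DRAW", "LINE_T"],
]
truth = ["TEXT_P", "TEXT_HW", "PHOTO_L", "TEXT"]

top1 = top_k_accuracy(ranked, truth, k=1)  # 1 of 4 pages correct -> 0.25
top3 = top_k_accuracy(ranked, truth, k=3)  # 3 of 4 pages correct -> 0.75
```

Top-3 accuracy can never be lower than Top-1 accuracy on the same predictions, which matches the pattern in the figures above.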
Result tables
v2.0 Manually checked evaluation dataset results (TOP-3): model_TOP-3_EVAL.csv
v2.0 Manually checked evaluation dataset results (TOP-1): model_TOP-1_EVAL.csv
v2.1 Manually checked evaluation dataset results (TOP-3): model_TOP-3_EVAL.csv
v2.1 Manually checked evaluation dataset results (TOP-1): model_TOP-1_EVAL.csv
Table columns
- FILE - name of the file
- PAGE - number of the page
- CLASS-N - label of the category for the TOP-N guess
- SCORE-N - score of the category for the TOP-N guess
- TRUE - actual label of the category
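The column layout above can be illustrated by flattening ranked guesses into one row per page. This is a sketch only: the helper name, the file name, the scores and the column order are assumptions, and the real result tables may be laid out differently.

```python
import csv
import io

def result_row(file_name, page, ranked, true_label=None):
    """Flatten ranked (label, score) guesses into one table row."""
    row = {"FILE": file_name, "PAGE": page}
    for n, (label, score) in enumerate(ranked, start=1):
        row[f"CLASS-{n}"] = label
        row[f"SCORE-{n}"] = round(score, 3)
    if true_label is not None:
        row["TRUE"] = true_label
    return row

# Hypothetical TOP-3 guesses for page 1 of a hypothetical file.
guesses = [("TEXT_T", 0.91), ("TEXT_P", 0.06), ("TEXT", 0.02)]
row = result_row("example.pdf", 1, guesses, true_label="TEXT_T")

# Write the row as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
```

For a TOP-1 table the same helper would be called with a single guess, which yields only the CLASS-1 and SCORE-1 columns.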
Contacts
For support write to [email protected]
Official repository: UFAL [^3]
Acknowledgements
- Developed by UFAL [^5]
- Funded by ATRIUM [^4]
- Shared by ATRIUM [^4] & UFAL [^5]
- Model type: fine-tuned ViT with a 224x224 [^2] or 384x384 [^13] [^14] resolution size
© 2022 UFAL & ATRIUM