Update README.md
README.md
# Image classification using fine-tuned ViT - for historical :bowtie: documents sorting

## Goal: solve a task of sorting archive page images (for their further content-based processing)

**Scope:** Processing of images, training and evaluation of ViT model,
input file/directory processing, class 🏷️ (category) results of top

🌳 Base model repository: **Google's vit-base-patch16-224** [^2] 🔗

The model was trained on a manually annotated dataset of historical documents, in particular, images of pages
from archival documents with paper sources that were scanned into digital form.

The images contain various combinations of texts, tables, drawings, and photos -
the categories 🏷️ described below were formed based on those archival documents.
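
As a rough illustration of how such a fine-tune is typically set up with the Hugging Face `transformers` Trainer, here is a minimal sketch. It is not the project's actual training script: the `pages/` image-folder layout, hyperparameters, and output path are all assumptions.

```python
# Hypothetical sketch only: data layout (one subfolder per category under
# pages/train/), hyperparameters, and output path are assumptions.
import torch
from datasets import load_dataset
from transformers import (AutoImageProcessor, Trainer, TrainingArguments,
                          ViTForImageClassification)

dataset = load_dataset("imagefolder", data_dir="pages")  # labels from subfolder names
labels = dataset["train"].features["label"].names

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=len(labels),
    ignore_mismatched_sizes=True,  # swap the 1000-class ImageNet head for ours
)

def transform(batch):
    # Resize + normalize pages to the 224x224 input the checkpoint expects
    out = processor([img.convert("RGB") for img in batch["image"]], return_tensors="pt")
    out["labels"] = batch["label"]
    return out

def collate(items):
    return {"pixel_values": torch.stack([x["pixel_values"] for x in items]),
            "labels": torch.tensor([x["labels"] for x in items])}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vit-pages", num_train_epochs=3,
                           per_device_train_batch_size=16,
                           remove_unused_columns=False),  # keep the raw "image" column
    train_dataset=dataset["train"].with_transform(transform),
    data_collator=collate,
)
trainer.train()
```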

The key use case of the provided model and data processing pipeline is to classify an input PNG image from a PDF-scanned
paper source into one of the categories, each corresponding to a follow-up content-specific data processing pipeline.

In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to
mark the input data as machine-typed (old-style fonts) / hand-written ✏️ / plain printed text, or as text structured in tabular
format, as well as to mark the presence of printed or drawn graphic materials yet to be extracted from the page images.
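
In code, that routing step could look like the sketch below. The checkpoint path, category label names, and handler functions are placeholders, not the model's real label set.

```python
# Hypothetical routing sketch: label names, checkpoint path, and the handler
# stubs are placeholders for the real category set and OCR services.
from PIL import Image
from transformers import pipeline

classifier = pipeline("image-classification", model="vit-pages")  # fine-tuned checkpoint

def ocr_printed(path): print(f"printed-text OCR  <- {path}")    # stub
def ocr_handwritten(path): print(f"handwriting HTR  <- {path}") # stub
def parse_tables(path): print(f"table extraction  <- {path}")   # stub
def extract_graphics(path): print(f"graphics pipeline <- {path}")  # stub

HANDLERS = {"printed": ocr_printed, "handwritten": ocr_handwritten,
            "table": parse_tables, "drawing": extract_graphics}

def route_page(png_path: str) -> None:
    # Top-1 category decides which content-specific pipeline gets the page
    top = classifier(Image.open(png_path).convert("RGB"), top_k=1)[0]
    HANDLERS.get(top["label"], ocr_printed)(png_path)

route_page("scan_0001.png")
```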

Training set of the model: **8950** images

Evaluation set (10% of all the data, with the same proportions as below) [model_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250209-1534_model_1119_3_EVAL.csv): **995** images
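
Since the hold-out keeps the per-category proportions, this corresponds to a stratified split. A toy sketch with scikit-learn; the file names and label mix below are made up:

```python
# Toy stand-in data; the real annotation table and category names differ.
from sklearn.model_selection import train_test_split

paths = [f"page_{i:04d}.png" for i in range(100)]
labels = ["printed"] * 60 + ["handwritten"] * 25 + ["table"] * 15

train_paths, eval_paths, train_labels, eval_labels = train_test_split(
    paths, labels, test_size=0.10, stratify=labels, random_state=42)
print(len(train_paths), len(eval_paths))  # 90 10, same label mix in both splits
```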

Manual annotation was performed beforehand and took some time; the categories 🏷️ were formed from
different sources of archival documents dated 1920-2020.

The disproportion of the categories 🏷️ is
**NOT** intentional, but rather a result of the nature of the source data.

In total, several hundred separate PDF files were selected and split into PNG pages; some scanned documents
were one page long and some were much longer (dozens or hundreds of pages).
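
One possible way to do such a split, sketched with the `pdf2image` library (which requires the poppler backend installed); the DPI and naming scheme are assumptions, not the project's actual preparation script:

```python
# Hypothetical PDF -> PNG page splitter; dpi and naming are assumptions.
from pathlib import Path
from pdf2image import convert_from_path  # needs poppler installed

def split_pdf(pdf_path: str, out_dir: str = "pages") -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    # One PNG per page, numbered from 1
    for i, page in enumerate(convert_from_path(pdf_path, dpi=300), start=1):
        page.save(Path(out_dir) / f"{stem}_{i:04d}.png", "PNG")

split_pdf("report.pdf")
```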

The specific content and language of the
source data are irrelevant considering the model's vision resolution; however, all of the data samples were from **archaeological
reports**, which may somehow affect the detection of drawings, since the objects most commonly drawn are ceramic pieces,
arrowheads, and rocks, first drawn by hand and later illustrated with digital tools.