
k4tel committed commit 89ba66d (verified), 1 parent: 6fbc142

Update README.md

Files changed (1): README.md (+12 -6)

README.md CHANGED
@@ -11,7 +11,7 @@ license: mit
 
 # Image classification using fine-tuned ViT - for historical :bowtie: documents sorting
 
-### Goal: solve a task of archive page images sorting (for their further content-based processing)
+## Goal: solve a task of archive page images sorting (for their further content-based processing)
 
 **Scope:** Processing of images, training and evaluation of ViT model,
 input file/directory processing, class 🏷️ (category) results of top
@@ -25,12 +25,14 @@ HF 😊 hub support for the model, multiplatform (Win/Lin) data preparation scri
 🔳 Base model repository: **Google's vit-base-patch16-224** [^2] 🔗
 
 The model was trained on the manually annotated dataset of historical documents, in particular, images of pages
-from the archival documents with paper sources that were scanned into digital form. The images contain various
-combinations of texts ️📄, tables 📏, drawings 📈, and photos 🌄 - categories 🏷️ described below were formed based on those
-archival documents.
+from the archival documents with paper sources that were scanned into digital form.
+
+The images contain various combinations of texts ️📄, tables 📏, drawings 📈, and photos 🌄 -
+categories 🏷️ described below were formed based on those archival documents.
 
 The key use case of the provided model and data processing pipeline is to classify an input PNG image from a scanned
 PDF paper source into one of the categories - each responsible for a subsequent content-specific data processing pipeline.
+
 In other words, when several APIs for different OCR subtasks are at your disposal - run this classifier first to
 mark the input data as machine-typed (old-style fonts) / hand-written ✏️ / plain printed ️📄 text or text structured in tabular 📏
 format, as well as to mark the presence of printed 🌄 or drawn 📈 graphic materials yet to be extracted from the page images.
@@ -42,11 +44,15 @@ Training set of the model: **8950** images
 Evaluation set (10% of all data, with the same category proportions as below) [model_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250209-1534_model_1119_3_EVAL.csv) 📎: **995** images
 
 Manual ✍ annotation was performed beforehand and took some time ⌛; the categories 🏷️ were formed from
-different sources of the archival documents from year 1920 to year 2020. Disproportion of the categories 🏷️ is
+different sources of the archival documents dated 1920-2020.
+
+Disproportion of the categories 🏷️ is
 **NOT** intentional, but rather a result of the source data nature.
 
 In total, several hundred separate PDF files were selected and split into PNG pages; some scanned documents
-were one-page long and some were much longer (dozens and hundreds of pages). The specific content and language of the
+were one-page long and some were much longer (dozens and hundreds of pages).
+
+The specific content and language of the
 source data is irrelevant considering the model's vision resolution; however, all of the data samples were from **archaeological
 reports**, which may somewhat affect drawings detection, since commonly drawn objects are ceramic pieces,
 arrowheads, and rocks first drawn by hand and later illustrated with digital tools.
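The scope above mentions reporting the top-N class 🏷️ (category) results. A minimal sketch of that post-processing step, turning raw classifier logits into softmax scores and keeping the N best labels, could look like this; the label names and logit values are invented for illustration and are not the model's actual categories:

```python
import math

# Hypothetical category tags, NOT the model's real label set
LABELS = ["DRAW", "LINE", "TEXT", "PHOTO"]

def top_n(logits, labels, n=3):
    """Convert raw logits to softmax probabilities and return
    the top-N (label, score) pairs, highest score first."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    scored = sorted(zip(labels, (e / total for e in exps)),
                    key=lambda t: t[1], reverse=True)
    return scored[:n]

# One page's fake logits: "TEXT" clearly dominates here
result = top_n([0.2, -1.3, 3.1, 0.5], LABELS, n=2)
```

The same routine would apply unchanged to the logits a fine-tuned ViT head produces, whatever the real category tags are.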
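The diff also describes an evaluation set of 10% of the data "with the same proportions" per category. That kind of stratified split can be sketched as follows; the file names and category tags are hypothetical, and the real pipeline's splitting code may differ:

```python
import random
from collections import defaultdict

def stratified_split(samples, eval_frac=0.1, seed=42):
    """Split (path, category) pairs so each category appears in the
    evaluation set in roughly the same proportion as in the full set."""
    by_cat = defaultdict(list)
    for path, cat in samples:
        by_cat[cat].append(path)
    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, eval_ = [], []
    for cat, paths in by_cat.items():
        rng.shuffle(paths)
        n_eval = max(1, round(len(paths) * eval_frac))
        eval_ += [(p, cat) for p in paths[:n_eval]]
        train += [(p, cat) for p in paths[n_eval:]]
    return train, eval_

# Hypothetical page annotations: 20% "DRAW" pages, 80% "TEXT" pages
pages = [(f"page_{i}.png", "DRAW" if i % 10 < 2 else "TEXT")
         for i in range(100)]
train, eval_ = stratified_split(pages)
```

With this sketch the 100 fake pages split into 90 training and 10 evaluation samples, and the 20/80 category ratio is preserved in both parts.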