Update README.md
README.md
# Image classification using fine-tuned ViT - for historical :bowtie: documents sorting

## Goal: solve a task of sorting archive page images (for their further content-based processing)

**Scope:** Processing of images, training and evaluation of ViT model,
input file/directory processing, class 🏷️ (category) results of top

🌳 Base model repository: **Google's vit-base-patch16-224** [^2] 🔗

The model was trained on a manually annotated dataset of historical documents, in particular, images of pages
from archival documents with paper sources that were scanned into digital form.

The images contain various combinations of texts, tables, drawings, and photos -
the categories 🏷️ described below were formed based on those archival documents.
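
As a rough illustration of how such a fine-tune is typically set up with the Hugging Face `transformers` Trainer, here is a minimal sketch. It is not the project's actual training script: the `pages/` image-folder layout, hyperparameters, and output path are all assumptions.

```python
# Hypothetical sketch only: data layout (one subfolder per category under
# pages/train/), hyperparameters, and output path are assumptions.
import torch
from datasets import load_dataset
from transformers import (AutoImageProcessor, Trainer, TrainingArguments,
                          ViTForImageClassification)

dataset = load_dataset("imagefolder", data_dir="pages")  # labels from subfolder names
labels = dataset["train"].features["label"].names

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=len(labels),
    ignore_mismatched_sizes=True,  # swap the 1000-class ImageNet head for ours
)

def transform(batch):
    # Resize + normalize pages to the 224x224 input the checkpoint expects
    out = processor([img.convert("RGB") for img in batch["image"]], return_tensors="pt")
    out["labels"] = batch["label"]
    return out

def collate(items):
    return {"pixel_values": torch.stack([x["pixel_values"] for x in items]),
            "labels": torch.tensor([x["labels"] for x in items])}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vit-pages", num_train_epochs=3,
                           per_device_train_batch_size=16,
                           remove_unused_columns=False),  # keep the raw "image" column
    train_dataset=dataset["train"].with_transform(transform),
    data_collator=collate,
)
trainer.train()
```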

The key use case of the provided model and data processing pipeline is to classify an input PNG image from a PDF-scanned
paper source into one of the categories, each corresponding to a follow-up content-specific data processing pipeline.

In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to
mark the input data as machine-typed (old-style fonts) / hand-written ✏️ / plain printed text, or as text structured in tabular
format, as well as to mark the presence of printed or drawn graphic materials yet to be extracted from the page images.
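
In code, that routing step could look like the sketch below. The checkpoint path, category label names, and handler functions are placeholders, not the model's real label set.

```python
# Hypothetical routing sketch: label names, checkpoint path, and the handler
# stubs are placeholders for the real category set and OCR services.
from PIL import Image
from transformers import pipeline

classifier = pipeline("image-classification", model="vit-pages")  # fine-tuned checkpoint

def ocr_printed(path): print(f"printed-text OCR  <- {path}")    # stub
def ocr_handwritten(path): print(f"handwriting HTR  <- {path}") # stub
def parse_tables(path): print(f"table extraction  <- {path}")   # stub
def extract_graphics(path): print(f"graphics pipeline <- {path}")  # stub

HANDLERS = {"printed": ocr_printed, "handwritten": ocr_handwritten,
            "table": parse_tables, "drawing": extract_graphics}

def route_page(png_path: str) -> None:
    # Top-1 category decides which content-specific pipeline gets the page
    top = classifier(Image.open(png_path).convert("RGB"), top_k=1)[0]
    HANDLERS.get(top["label"], ocr_printed)(png_path)

route_page("scan_0001.png")
```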

Training set of the model: **8950** images

Evaluation set (10% of all the data, with the same proportions as below) [model_EVAL.csv](https://github.com/ufal/atrium-page-classification/blob/main/result/tables/20250209-1534_model_1119_3_EVAL.csv): **995** images
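
Since the hold-out keeps the per-category proportions, this corresponds to a stratified split. A toy sketch with scikit-learn; the file names and label mix below are made up:

```python
# Toy stand-in data; the real annotation table and category names differ.
from sklearn.model_selection import train_test_split

paths = [f"page_{i:04d}.png" for i in range(100)]
labels = ["printed"] * 60 + ["handwritten"] * 25 + ["table"] * 15

train_paths, eval_paths, train_labels, eval_labels = train_test_split(
    paths, labels, test_size=0.10, stratify=labels, random_state=42)
print(len(train_paths), len(eval_paths))  # 90 10, same label mix in both splits
```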

Manual annotation was performed beforehand and took some time; the categories 🏷️ were formed from
different sources of archival documents dated 1920-2020.

The disproportion of the categories 🏷️ is
**NOT** intentional, but rather a result of the nature of the source data.

In total, several hundred separate PDF files were selected and split into PNG pages; some scanned documents
were one page long and some were much longer (dozens or hundreds of pages).
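
One possible way to do such a split, sketched with the `pdf2image` library (which requires the poppler backend installed); the DPI and naming scheme are assumptions, not the project's actual preparation script:

```python
# Hypothetical PDF -> PNG page splitter; dpi and naming are assumptions.
from pathlib import Path
from pdf2image import convert_from_path  # needs poppler installed

def split_pdf(pdf_path: str, out_dir: str = "pages") -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    # One PNG per page, numbered from 1
    for i, page in enumerate(convert_from_path(pdf_path, dpi=300), start=1):
        page.save(Path(out_dir) / f"{stem}_{i:04d}.png", "PNG")

split_pdf("report.pdf")
```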

The specific content and language of the
source data are irrelevant considering the model's vision resolution; however, all of the data samples were from **archaeological
reports**, which may somehow affect the detection of drawings, since the objects most commonly drawn are ceramic pieces,
arrowheads, and rocks, first drawn by hand and later illustrated with digital tools.