---
title: Historical OCR
emoji: 📜
colorFrom: red
colorTo: green
sdk: streamlit
sdk_version: 1.43.2
app_file: app.py
pinned: false
license: mit
short_description: Employs Mistral OCR for transcribing historical data
---

# Historical Document OCR

This application uses Mistral AI's OCR capabilities to transcribe and extract information from historical documents, with enhanced formatting and presentation.

## Features

- OCR processing for both image and PDF files
- Automatic file type detection and content structuring
- Advanced HTML formatting that preserves document structure
- Specialized formatting for poems and historical texts
- Interactive web interface built with Streamlit
- "With Images" view that preserves document layout and embedded images
- Multi-page document support with pagination
- PDF preview functionality
- Smart handling of large PDFs with automatic page limiting
- Image preprocessing options for enhanced OCR accuracy
- Document export in multiple formats (HTML, JSON)
- Responsive design optimized for historical document presentation
- Enhanced typography with appropriate fonts for historical content

## Project Structure

The project is organized as follows:

```
Historical OCR - Project Structure

┌─ Main Application
│  └─ app.py                   # Streamlit interface for OCR processing
│
├─ Core Functionality
│  ├─ structured_ocr.py        # Main OCR processing engine with Mistral AI integration
│  ├─ ocr_utils.py             # Utility functions for OCR text and image processing
│  ├─ pdf_ocr.py               # PDF-specific document processing functionality
│  ├─ config.py                # Configuration settings and API keys
│  └─ process_file.py          # File processing utilities
│
├─ Testing & Development
│  ├─ simple_test.py           # Basic OCR functionality test
│  ├─ test_pdf.py              # PDF processing test
│  ├─ test_pdf_preview.py      # PDF preview generation test
│  ├─ test_pdf_handling.py     # PDF handling test
│  ├─ test_image_formats.py    # Image format compatibility test
│  └─ prepare_for_hf.py        # Prepare project for Hugging Face deployment
│
├─ Scripts
│  ├─ run_local.sh             # Launch app locally
│  ├─ run_large_files.sh       # Process large documents with optimized settings
│  └─ setup_git.sh             # Configure Git repositories
│
├─ UI Components
│  ├─ ui/layout.py             # UI components and styling
│  └─ ui/custom.css            # Custom styling for the application
│
├─ Data Directories
│  ├─ input/                   # Sample documents for testing/demo
│  └─ output/                  # Output directory for processed files
│
└─ Dependencies
   ├─ requirements.txt         # Python package dependencies
   └─ packages.txt             # System-level dependencies
```

## Setup for Local Development

1. Clone this repository
2. Install system dependencies:
   - For PDF processing, you need poppler:
     - On macOS: `brew install poppler`
     - On Ubuntu/Debian: `apt-get install poppler-utils`
     - On Windows: Download from [poppler releases](https://github.com/oschwartz10612/poppler-windows/releases/) and add to PATH
3. Install Python dependencies:
   ```
   pip install -r requirements.txt
   ```
4. Set up your Mistral API key:
   - Option 1: Create a `.env` file in this directory and add your Mistral API key:
     ```
     MISTRAL_API_KEY=your_api_key_here
     ```
   - Option 2: Set the `MISTRAL_API_KEY` environment variable directly:
     ```
     export MISTRAL_API_KEY=your_api_key_here
     ```
   - With either option, you can test whether your API key is working correctly:
     ```
     python test_api_key.py
     ```
   - Get your API key from [Mistral AI Console](https://console.mistral.ai/api-keys/)

   **Important**: Make sure your API key is correctly formatted with no extra spaces, newlines, or other characters. The application requires a valid Mistral API key with access to the OCR API.
5. Run the Streamlit app using the script:
   ```
   ./run_local.sh
   ```
   Or directly:
   ```
   streamlit run app.py
   ```

## Usage

1. Upload an image or PDF file using the file uploader
2. Select processing options in the sidebar (e.g., use vision model, image preprocessing)
3. Click "Process Document" to analyze the file
4. View the results in one of three formats:
   - **Structured View**: Formatted HTML that preserves the document structure
   - **Raw JSON**: Complete data structure for developers
   - **With Images**: Document with embedded images, preserving the original layout

## Document Output Features

The application provides several specialized features for historical document presentation:

1. **Poetry Formatting**: Special handling for poem structure with proper line spacing and typography
2. **Image Embedding**: Original document images embedded within the text at their correct positions
3. **Multi-page Support**: Pagination controls for navigating multi-page documents
4. **Typography**: Period-appropriate fonts and styling that improve the readability of historical texts
5. **Document Export**: Download options for saving the processed document in HTML format

## Testing

Run the test scripts to verify that everything works:

```
python simple_test.py          # Basic OCR testing
python test_pdf.py             # PDF processing testing
python test_image_formats.py   # Test image format handling
python test_pdf_handling.py    # Test PDF handling
```

## Large File Processing

For processing large files, use the specialized script:

```
./run_large_files.sh --server.maxUploadSize=500 --server.maxMessageSize=500
```

## Deployment on Hugging Face Spaces

This app is designed to be deployed on Hugging Face Spaces. To deploy:

1. Fork this repository to your GitHub account, or create a new Space directly on [Hugging Face](https://huggingface.co/spaces)
2. Connect your GitHub repository to your Hugging Face Space for automatic deployment
3. Add your Mistral API key as a secret in your Hugging Face Space settings:
   - Secret name: `HF_MISTRAL_API_KEY`
   - Secret value: Your Mistral API key

The YAML front matter at the top of this `README.md` contains the configuration metadata that Hugging Face Spaces requires. For details, see the [Hugging Face Spaces configuration reference](https://huggingface.co/docs/hub/spaces-config-reference).
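
Hugging Face Spaces exposes secrets to the running app as environment variables, so the key can be read the same way in a Space and on a local machine. The sketch below illustrates one way to do that. It is not the code in `config.py`: the function name `resolve_api_key` and the lookup order (`HF_MISTRAL_API_KEY` first, then `MISTRAL_API_KEY`) are assumptions made for this example. It also strips surrounding whitespace, which guards against the stray spaces and newlines mentioned in the setup notes.

```python
import os


def resolve_api_key(env=None):
    """Return the first non-empty Mistral API key found in the environment.

    Hypothetical helper for illustration; the lookup order is an assumption.
    """
    env = os.environ if env is None else env
    for name in ("HF_MISTRAL_API_KEY", "MISTRAL_API_KEY"):
        # Strip whitespace and newlines that often sneak in when pasting a key.
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError(
        "No API key found: set HF_MISTRAL_API_KEY or MISTRAL_API_KEY"
    )
```

Calling `resolve_api_key()` with no arguments reads from `os.environ`. Passing a plain dictionary instead makes the function easy to check without touching the real environment.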