milwright/historical-ocr

milwright committed 3 days ago

Commit

aaf0eac

1 Parent(s): c9c9ec7

Streamline app architecture and improve image processing

Remove educational components in favor of a single, robust application. Enhance image preprocessing with rotation detection, error handling, and API retries. Update documentation to reflect new project structure.
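
The commit message mentions API retries but the visible part of the diff does not show how they are implemented. The following is a minimal sketch of one common approach, exponential backoff with jitter, and is not taken from `structured_ocr.py`. The names `TransientAPIError` and `call_with_retries`, and the default values, are hypothetical and chosen only for illustration.

```python
import random
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for a retryable failure (rate limit, timeout)."""


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on TransientAPIError with exponential backoff.

    `sleep` is injectable so the waiting behaviour can be checked without
    actually pausing.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientAPIError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The delay doubles on each attempt; a small random jitter
            # keeps concurrent clients from retrying in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, 0.1 * delay))
```

In an app like this one, the wrapped `func` would be whatever callable performs the OCR request. Only errors known to be transient should be mapped to the retryable exception, so that an invalid API key fails immediately instead of being retried.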

Files changed (10)

README.md +29 -28
app.py +726 -471
config.py +10 -5
prepare_for_hf.py +8 -25
process_file.py +6 -1
requirements.txt +3 -2
run_local.sh +3 -8
simple_test.py +13 -4
structured_ocr.py +190 -41
ui/custom.css +40 -0

README.md CHANGED

@@ -38,43 +38,32 @@ The project is organized as follows:
 ```
 Historical OCR - Project Structure
-┌─ Main Applications
-│  ├─ app.py                        # Standard Streamlit interface for OCR processing
-│  └─ streamlit_app.py              # Educational modular version with learning components
 │
 ├─ Core Functionality
 │  ├─ structured_ocr.py             # Main OCR processing engine with Mistral AI integration
 │  ├─ ocr_utils.py                  # Utility functions for OCR text and image processing
 │  ├─ pdf_ocr.py                    # PDF-specific document processing functionality
-│  └─ config.py                     # Configuration settings and API keys
 │
 ├─ Testing & Development
 │  ├─ simple_test.py                # Basic OCR functionality test
 │  ├─ test_pdf.py                   # PDF processing test
 │  ├─ test_pdf_preview.py           # PDF preview generation test
 │  └─ prepare_for_hf.py             # Prepare project for Hugging Face deployment
 │
 ├─ Scripts
-│  ├─ run_local.sh                  # Launch standard or educational app locally
 │  ├─ run_large_files.sh            # Process large documents with optimized settings
 │  └─ setup_git.sh                  # Configure Git repositories
 │
-├─ Educational Modules (streamlit/)
-│  ├─ modules/
-│  │  ├─ module1.py                 # Introduction and Problematization
-│  │  ├─ module2.py                 # Historical Typography & OCR Challenges
-│  │  ├─ module3.py                 # Document Analysis Techniques
-│  │  ├─ module4.py                 # Processing Methods
-│  │  ├─ module5.py                 # Research Applications
-│  │  └─ module6.py                 # Future Directions
-│  │
-│  ├─ modular_app.py                # Learning module framework
-│  ├─ layout.py                     # UI components for educational interface
-│  └─ process_file.py               # File processing for educational app
-│
-├─ UI Components (ui/)
-│  ├─ layout.py                     # Shared UI components and styling
-│  └─ custom.css                    # Custom styling for the application
 │
 ├─ Data Directories
 │  ├─ input/                        # Sample documents for testing/demo
@@ -93,7 +82,6 @@ Historical OCR - Project Structure
      - On macOS: `brew install poppler`
      - On Ubuntu/Debian: `apt-get install poppler-utils`
      - On Windows: Download from [poppler releases](https://github.com/oschwartz10612/poppler-windows/releases/) and add to PATH
-   - For text recognition: `tesseract-ocr`
 3. Install Python dependencies:
 ```
 pip install -r requirements.txt
@@ -107,7 +95,13 @@ pip install -r requirements.txt
      ```
      export MISTRAL_API_KEY=your_api_key_here
      ```
    - Get your API key from [Mistral AI Console](https://console.mistral.ai/api-keys/)
 5. Run the Streamlit app using the script:
 ```
 ./run_local.sh
@@ -137,16 +131,23 @@ The application provides several specialized features for historical document pr
 4. **Typography**: Historical-appropriate fonts and styling for better readability of historical texts
 5. **Document Export**: Download options for saving the processed document in HTML format
-## Application Versions
-Two versions of the application are available:
-1. **Standard Version** (`app.py`): Focused on document processing with a clean interface
-2. **Educational Version** (`streamlit_app.py`): Enhanced with educational modules and interactive components
-To run the educational version:
 ```
-streamlit run streamlit_app.py
 ```
 ## Deployment on Hugging Face Spaces

 ```
 Historical OCR - Project Structure
+┌─ Main Application
+│  └─ app.py                        # Streamlit interface for OCR processing
 │
 ├─ Core Functionality
 │  ├─ structured_ocr.py             # Main OCR processing engine with Mistral AI integration
 │  ├─ ocr_utils.py                  # Utility functions for OCR text and image processing
 │  ├─ pdf_ocr.py                    # PDF-specific document processing functionality
+│  ├─ config.py                     # Configuration settings and API keys
+│  └─ process_file.py               # File processing utilities
 │
 ├─ Testing & Development
 │  ├─ simple_test.py                # Basic OCR functionality test
 │  ├─ test_pdf.py                   # PDF processing test
 │  ├─ test_pdf_preview.py           # PDF preview generation test
+│  ├─ test_pdf_handling.py          # PDF handling test
+│  ├─ test_image_formats.py         # Image format compatibility test
 │  └─ prepare_for_hf.py             # Prepare project for Hugging Face deployment
 │
 ├─ Scripts
+│  ├─ run_local.sh                  # Launch app locally
 │  ├─ run_large_files.sh            # Process large documents with optimized settings
 │  └─ setup_git.sh                  # Configure Git repositories
 │
+├─ UI Components
+│  ├─ ui/layout.py                  # UI components and styling
+│  └─ ui/custom.css                 # Custom styling for the application
 │
 ├─ Data Directories
 │  ├─ input/                        # Sample documents for testing/demo
      - On macOS: `brew install poppler`
      - On Ubuntu/Debian: `apt-get install poppler-utils`
      - On Windows: Download from [poppler releases](https://github.com/oschwartz10612/poppler-windows/releases/) and add to PATH
 3. Install Python dependencies:
 ```
 pip install -r requirements.txt
      ```
      export MISTRAL_API_KEY=your_api_key_here
      ```
+   - Option 3: Test if your API key is working correctly:
+     ```
+     python test_api_key.py
+     ```
    - Get your API key from [Mistral AI Console](https://console.mistral.ai/api-keys/)
+   **Important**: Make sure your API key is correctly formatted with no extra spaces, newlines, or other characters. The application requires a valid Mistral API key with access to the OCR API.
 5. Run the Streamlit app using the script:
 ```
 ./run_local.sh
 4. **Typography**: Historical-appropriate fonts and styling for better readability of historical texts
 5. **Document Export**: Download options for saving the processed document in HTML format
+## Testing
+Run the test suite to ensure proper functionality:
+```
+python simple_test.py        # Basic OCR testing
+python test_pdf.py           # PDF processing testing
+python test_image_formats.py # Test image format handling
+python test_pdf_handling.py  # Test PDF handling
+```
+## Large File Processing
+For processing large files, use the specialized script:
 ```
+./run_large_files.sh --server.maxUploadSize=500 --server.maxMessageSize=500
 ```
 ## Deployment on Hugging Face Spaces
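
The README note added in this commit asks for an API key with no extra spaces, newlines, or other characters. A check along those lines could look like the sketch below. It is an illustration only: `clean_api_key` is a hypothetical helper, not the contents of `test_api_key.py` or `config.py`, and it makes no assumption about the key's length or alphabet.

```python
import os


def clean_api_key(raw):
    """Return the key with surrounding whitespace removed, or raise ValueError.

    Hypothetical helper: it rejects only missing or empty keys and keys
    with whitespace or control characters inside them.
    """
    if raw is None:
        raise ValueError("MISTRAL_API_KEY is not set")
    key = raw.strip()  # drops stray spaces and trailing newlines
    if not key:
        raise ValueError("MISTRAL_API_KEY is empty")
    if any(ch.isspace() or not ch.isprintable() for ch in key):
        raise ValueError("MISTRAL_API_KEY contains whitespace or control characters")
    return key


if __name__ == "__main__":
    try:
        clean_api_key(os.environ.get("MISTRAL_API_KEY"))
        print("API key format looks plausible")
    except ValueError as err:
        print(f"Problem with API key: {err}")
```

A check like this only catches formatting mistakes. Whether the key is valid and has access to the OCR API can only be confirmed by a real request.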

app.py CHANGED

@@ -7,7 +7,8 @@ from pathlib import Path
 import tempfile
 import io
 from pdf2image import convert_from_bytes
-from PIL import Image, ImageEnhance, ImageFilter
 import cv2
 import numpy as np
@@ -15,12 +16,12 @@ import numpy as np
 from structured_ocr import StructuredOCR
 from config import MISTRAL_API_KEY
-# Check for modular UI components
 try:
-    from ui.layout import tool_container, key_concept, research_question
-    MODULAR_UI = True
 except ImportError:
-    MODULAR_UI = False
 # Set page configuration
 st.set_page_config(
@@ -40,57 +41,116 @@ def convert_pdf_to_images(pdf_bytes, dpi=150):
         st.error(f"Error converting PDF: {str(e)}")
         return []
 @st.cache_data(ttl=3600, show_spinner=False)
 def preprocess_image(image_bytes, preprocessing_options):
     """Preprocess image with selected options"""
-    # Convert bytes to OpenCV format
-    image = Image.open(io.BytesIO(image_bytes))
-    # Ensure image is in RGB mode for OpenCV processing
-    if image.mode != 'RGB':
-        image = image.convert('RGB')
-    img_array = np.array(image)
-    # Apply preprocessing based on selected options
-    if preprocessing_options.get("grayscale", False):
-        img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
-        img_array = cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB)
-    if preprocessing_options.get("contrast", 0) != 0:
-        contrast_factor = 1 + (preprocessing_options.get("contrast", 0) / 10)
-        image = Image.fromarray(img_array)
-        enhancer = ImageEnhance.Contrast(image)
-        image = enhancer.enhance(contrast_factor)
-        img_array = np.array(image)
-    if preprocessing_options.get("denoise", False):
-        # Ensure the image is in the correct format for denoising (CV_8UC3)
-        if len(img_array.shape) != 3 or img_array.shape[2] != 3:
-            # Convert to RGB if it's not already a 3-channel color image
-            if len(img_array.shape) == 2:  # Grayscale
-                img_array = cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB)
-        img_array = cv2.fastNlMeansDenoisingColored(img_array, None, 10, 10, 7, 21)
-    if preprocessing_options.get("threshold", False):
-        # Convert to grayscale if not already
-        if len(img_array.shape) == 3:
-            gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
-        else:
-            gray = img_array
-        # Apply adaptive threshold
-        binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
-                                      cv2.THRESH_BINARY, 11, 2)
-        # Convert back to RGB
-        img_array = cv2.cvtColor(binary, cv2.COLOR_GRAY2RGB)
-    # Convert back to PIL Image
-    processed_image = Image.fromarray(img_array)
-    # Convert to bytes
-    byte_io = io.BytesIO()
-    processed_image.save(byte_io, format='PNG')
-    byte_io.seek(0)
-    return byte_io.getvalue()
 # Define functions
 def process_file(uploaded_file, use_vision=True, preprocessing_options=None):
@@ -120,13 +180,28 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None):
             # Return dummy data if no API key
             progress_bar.progress(100)
             status_text.empty()
             return {
                 "file_name": uploaded_file.name,
-                "topics": ["Sample Document"],
                 "languages": ["English"],
                 "ocr_contents": {
-                    "title": "Sample Document",
-                    "content": "This is sample content. To process real documents, please set the MISTRAL_API_KEY environment variable."
                 }
             }
@@ -134,22 +209,51 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None):
         progress_bar.progress(20)
         status_text.text("Initializing OCR processor...")
-        # Initialize OCR processor
-        processor = StructuredOCR()
         # Determine file type from extension
         file_ext = Path(uploaded_file.name).suffix.lower()
         file_type = "pdf" if file_ext == ".pdf" else "image"
         # Apply preprocessing if needed
         if any(preprocessing_options.values()) and file_type == "image":
             status_text.text("Applying image preprocessing...")
-            processed_bytes = preprocess_image(uploaded_file.getvalue(), preprocessing_options)
-            # Save processed image to temp file
-            with tempfile.NamedTemporaryFile(delete=False, suffix=Path(uploaded_file.name).suffix) as proc_tmp:
-                proc_tmp.write(processed_bytes)
-                temp_path = proc_tmp.name
         # Get file size in MB
         file_size_mb = os.path.getsize(temp_path) / (1024 * 1024)
@@ -183,6 +287,12 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None):
         progress_bar.progress(100)
         status_text.empty()
         return result
     except Exception as e:
         progress_bar.progress(100)
@@ -194,25 +304,23 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None):
         if os.path.exists(temp_path):
             os.unlink(temp_path)
 # App title and description
 st.title("Historical Document OCR")
-st.subheader("Powered by Mistral AI")
-# Create main layout with tabs and columns
-main_tab1, main_tab2 = st.tabs(["Document Processing", "About this App"])
-with main_tab1:
-    # Create a two-column layout for file upload and preview
-    upload_col, preview_col = st.columns([1, 1])
-    # File uploader in the left column
-    with upload_col:
-        st.markdown("""
-        Upload an image or PDF file to get started.
-        Using the `mistral-ocr-latest` model for advanced document understanding.
-        """)
-        uploaded_file = st.file_uploader("Choose a file", type=["pdf", "png", "jpg", "jpeg"])
 # Sidebar with options
 with st.sidebar:
@@ -221,9 +329,9 @@ with st.sidebar:
     # Model options
     st.subheader("Model Settings")
     use_vision = st.checkbox("Use Vision Model", value=True,
-                            help="For image files, use the vision model for improved analysis (may be slower)")
-    # Image preprocessing options (collapsible)
     st.subheader("Image Preprocessing")
     with st.expander("Preprocessing Options"):
         preprocessing_options = {}
@@ -235,21 +343,134 @@ with st.sidebar:
                                                      help="Remove noise from the image")
         preprocessing_options["contrast"] = st.slider("Adjust Contrast", -5, 5, 0,
                                                     help="Adjust image contrast (-5 to +5)")
-    # PDF options (collapsible)
     st.subheader("PDF Options")
     with st.expander("PDF Settings"):
         pdf_dpi = st.slider("PDF Resolution (DPI)", 72, 300, 150,
                           help="Higher DPI gives better quality but slower processing")
-        max_pages = st.number_input("Maximum Pages to Process", 1, 20, 5,
                                   help="Limit number of pages to process")
-# About tab content
 with main_tab2:
     st.markdown("""
     ### About This Application
-    This app uses [Mistral AI's Document OCR](https://docs.mistral.ai/capabilities/document/) to extract text and images from historical documents with enhanced formatting and presentation.
     It can process:
     - Image files (jpg, png, etc.)
@@ -266,427 +487,461 @@ with main_tab2:
     - **Raw JSON**: Complete data structure for developers
     - **With Images**: Document with embedded images preserving original layout
-    **Special Features:**
-    - **Poetry Formatting**: Special handling for poem structure with proper line spacing
-    - **Image Embedding**: Original document images embedded at correct positions
-    - **Multi-page Support**: Pagination controls for navigating multi-page documents
-    - **Typography**: Historical-appropriate fonts for better readability
-    - **Document Export**: Download options for saving in HTML format
-    **Technical Features:**
-    - Image preprocessing for better OCR quality
-    - PDF resolution and page controls
-    - Progress tracking during processing
-    - Responsive design optimized for historical document presentation
     """)
 with main_tab1:
-    if uploaded_file is not None:
-        # Check file size (cap at 20MB)
-        file_size_mb = len(uploaded_file.getvalue()) / (1024 * 1024)
-        if file_size_mb > 20:
-            with upload_col:
-                st.error(f"File too large ({file_size_mb:.1f} MB). Maximum file size is 20MB.")
-            st.stop()
-        file_ext = Path(uploaded_file.name).suffix.lower()
-        # Display document preview in preview column
-        with preview_col:
             st.subheader("Document Preview")
-            if file_ext == ".pdf":
                 try:
-                    # Convert first page of PDF to image for preview
                     pdf_bytes = uploaded_file.getvalue()
-                    images = convert_from_bytes(pdf_bytes, first_page=1, last_page=1, dpi=150)
                     if images:
-                        # Convert PIL image to bytes for Streamlit
                         first_page = images[0]
                         img_bytes = io.BytesIO()
                         first_page.save(img_bytes, format='JPEG')
                         img_bytes.seek(0)
-                        # Display the PDF preview
-                        st.image(img_bytes, caption=f"PDF Preview: {uploaded_file.name}", use_container_width=True)
                     else:
-                        st.info(f"PDF uploaded: {uploaded_file.name}")
                 except Exception:
-                    # Simply show the file name without an error message
-                    st.info(f"PDF uploaded: {uploaded_file.name}")
-                    st.info("Click 'Process Document' to analyze the content.")
             else:
-                st.image(uploaded_file, use_container_width=True)
-        # Add image preprocessing preview in a collapsible section if needed
-        if any(preprocessing_options.values()) and uploaded_file.type.startswith('image/'):
-            with st.expander("Image Preprocessing Preview"):
-                preview_cols = st.columns(2)
-                with preview_cols[0]:
-                    st.markdown("**Original Image**")
-                    st.image(uploaded_file, use_container_width=True)
-                with preview_cols[1]:
-                    st.markdown("**Preprocessed Image**")
-                    try:
-                        processed_bytes = preprocess_image(uploaded_file.getvalue(), preprocessing_options)
-                        st.image(io.BytesIO(processed_bytes), use_container_width=True)
-                    except Exception as e:
-                        st.error(f"Error in preprocessing: {str(e)}")
-        # Process button - flush left with similar padding as file browser
-        with upload_col:
-            process_button = st.button("Process Document", use_container_width=True)
-        # Results section
-        if process_button:
-            try:
-                # Get max_pages or default if not available
-                max_pages_value = max_pages if 'max_pages' in locals() else None
-                # Call process_file with all options
-                result = process_file(uploaded_file, use_vision, preprocessing_options)
-                # Single tab for document analysis
-                with st.container():
-                    # Create two columns for metadata and content
-                    meta_col, content_col = st.columns([1, 2])
-                    with meta_col:
-                        st.subheader("Document Metadata")
-                        st.success("**Document processed successfully**")
-                        # Display file info
-                        st.write(f"**File Name:** {result.get('file_name', uploaded_file.name)}")
-                        # Display info if only limited pages were processed
-                        if 'limited_pages' in result:
-                            st.info(f"Processed {result['limited_pages']['processed']} of {result['limited_pages']['total']} pages")
-                        # Display languages if available
-                        if 'languages' in result:
-                            languages = [lang for lang in result['languages'] if lang is not None]
-                            if languages:
-                                st.write(f"**Languages:** {', '.join(languages)}")
-                        # Confidence score if available
-                        if 'confidence_score' in result:
-                            confidence = result['confidence_score']
-                            st.write(f"**OCR Confidence:** {confidence:.1%}")
-                        # Display topics if available
-                        if 'topics' in result and result['topics']:
-                            st.write(f"**Topics:** {', '.join(result['topics'])}")
-                    with content_col:
-                        st.subheader("Document Contents")
-                        if 'ocr_contents' in result:
-                            # Check if there are images in the OCR result
-                            has_images = result.get('has_images', False)
-                            # Create tabs for different views
-                            if has_images:
-                                view_tab1, view_tab2, view_tab3 = st.tabs(["Structured View", "Raw JSON", "With Images"])
                             else:
-                                view_tab1, view_tab2 = st.tabs(["Structured View", "Raw JSON"])
-                            with view_tab1:
-                                # Display in a more user-friendly format based on the content structure
-                                html_content = '<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="UTF-8">\n<meta name="viewport" content="width=device-width, initial-scale=1.0">\n<title>OCR Document</title>\n<style>\n'
-                                html_content += """
-body {
-    font-family: 'Georgia', serif;
-    line-height: 1.6;
-    margin: 0;
-    padding: 20px;
-    background-color: #f9f9f9;
-    color: #333;
-}
-.container {
-    max-width: 1000px;
-    margin: 0 auto;
-    background-color: #fff;
-    padding: 30px;
-    border-radius: 8px;
-    box-shadow: 0 4px 12px rgba(0,0,0,0.1);
-}
-h1, h2, h3, h4 {
-    font-family: 'Bookman', 'Georgia', serif;
-    margin-top: 1.5em;
-    margin-bottom: 0.5em;
-    color: #222;
-}
-h1 { font-size: 2.2em; border-bottom: 2px solid #e0e0e0; padding-bottom: 10px; }
-h2 { font-size: 1.8em; border-bottom: 1px solid #e0e0e0; padding-bottom: 6px; }
-h3 { font-size: 1.5em; }
-h4 { font-size: 1.2em; }
-p { margin-bottom: 1.2em; text-align: justify; }
-ul, ol { margin-bottom: 1.5em; }
-li { margin-bottom: 0.5em; }
-.poem {
-    font-family: 'Baskerville', 'Georgia', serif;
-    margin-left: 2em;
-    line-height: 1.8;
-    white-space: pre-wrap;
-}
-.subtitle {
-    font-style: italic;
-    font-size: 1.1em;
-    margin-bottom: 1.5em;
-    color: #555;
-}
-blockquote {
-    border-left: 3px solid #ccc;
-    margin: 1.5em 0;
-    padding: 0.5em 1.5em;
-    background-color: #f5f5f5;
-    font-style: italic;
-}
-dl {
-    margin-bottom: 1.5em;
-}
-dt {
-    font-weight: bold;
-    margin-top: 1em;
-}
-dd {
-    margin-left: 2em;
-    margin-bottom: 0.5em;
-}
-</style>
-</head>
-<body>
-<div class="container">
-"""
-                                if isinstance(result['ocr_contents'], dict):
-                                    for section, content in result['ocr_contents'].items():
-                                        if not content:  # Skip empty sections
-                                            continue
-                                        section_title = section.replace('_', ' ').title()
-                                        # Special handling for title and subtitle
-                                        if section.lower() == 'title':
-                                            html_content += f'<h1>{content}</h1>\n'
-                                            st.markdown(f"## {content}")
-                                        elif section.lower() == 'subtitle':
-                                            html_content += f'<div class="subtitle">{content}</div>\n'
-                                            st.markdown(f"*{content}*")
                                         else:
-                                            # Section headers for non-title sections
-                                            html_content += f'<h3>{section_title}</h3>\n'
-                                            st.markdown(f"### {section_title}")
-                                        # Process different content types
-                                        if isinstance(content, str):
-                                            # Handle poem type specifically
-                                            if section.lower() == 'type' and content.lower() == 'poem':
-                                                # Don't add special formatting here, just for the lines
-                                                st.markdown(content)
-                                                html_content += f'<p>{content}</p>\n'
-                                            elif 'content' in result['ocr_contents'] and isinstance(result['ocr_contents']['content'], dict) and 'type' in result['ocr_contents']['content'] and result['ocr_contents']['content']['type'] == 'poem' and section.lower() == 'content':
-                                                # This is handled in the dict case below
-                                                pass
-                                            else:
-                                                # Regular text content
-                                                paragraphs = content.split('\n\n')
-                                                for p in paragraphs:
-                                                    if p.strip():
-                                                        html_content += f'<p>{p.strip()}</p>\n'
-                                                st.markdown(content)
-                                        elif isinstance(content, list):
-                                            # Handle lists (bullet points, etc.)
-                                            html_content += '<ul>\n'
-                                            for item in content:
-                                                if isinstance(item, str):
-                                                    html_content += f'<li>{item}</li>\n'
-                                                    st.markdown(f"- {item}")
-                                                elif isinstance(item, dict):
-                                                    # Format dictionary items in a readable way
-                                                    html_content += f'<li><pre>{json.dumps(item, indent=2)}</pre></li>\n'
-                                                    st.json(item)
-                                            html_content += '</ul>\n'
-                                        elif isinstance(content, dict):
-                                            # Special handling for poem type
-                                            if 'type' in content and content['type'] == 'poem' and 'lines' in content:
-                                                html_content += '<div class="poem">\n'
-                                                for line in content['lines']:
-                                                    html_content += f'{line}\n'
-                                                    st.markdown(line)
-                                                html_content += '</div>\n'
-                                            else:
-                                                # Regular dictionary display
-                                                html_content += '<dl>\n'
-                                                for k, v in content.items():
-                                                    html_content += f'<dt>{k}</dt>\n<dd>'
-                                                    if isinstance(v, str):
-                                                        html_content += v
-                                                    elif isinstance(v, list):
-                                                        html_content += ', '.join(str(item) for item in v)
-                                                    else:
-                                                        html_content += str(v)
-                                                    html_content += '</dd>\n'
-                                                    st.markdown(f"**{k}:** {v}")
-                                                html_content += '</dl>\n'
-                                # Close HTML document
-                                html_content += '</div>\n</body>\n</html>'
-                                # Add download button in a smaller section
-                                with st.expander("Export Content"):
-                                    # Alternative download button
-                                    html_bytes = html_content.encode()
-                                    st.download_button(
-                                        label="Download as HTML",
-                                        data=html_bytes,
-                                        file_name="document_content.html",
-                                        mime="text/html"
-                                    )
-                            with view_tab2:
-                                # Show the raw JSON for developers
-                                st.json(result)
-                            if has_images:
-                                with view_tab3:
-                                    # Show loading indicator while preparing images
-                                    with st.spinner("Preparing document with embedded images..."):
-                                        try:
-                                            # Import function
-                                            try:
-                                                from ocr_utils import create_html_with_images
-                                            except ImportError:
-                                                st.error("Required module ocr_utils not found.")
-                                                st.stop()
-                                            # Check if has_images flag is set
-                                            if not result.get('has_images', False) or 'pages_data' not in result:
-                                                st.warning("No image data available in the OCR response.")
-                                                st.stop()
-                                            # Count images in the result
-                                            image_count = 0
-                                            for page in result.get('pages_data', []):
-                                                image_count += len(page.get('images', []))
-                                            # Add warning for image-heavy documents
-                                            if image_count > 10:
-                                                st.warning(f"This document contains {image_count} images. Rendering may take longer than usual.")
-                                            # Generate HTML with images
-                                            html_with_images = create_html_with_images(result)
-                                            # For multi-page documents, create page navigation
-                                            page_count = len(result.get('pages_data', []))
-                                            if page_count > 1:
-                                                st.info(f"Document contains {page_count} pages. You can scroll to view all pages or use the page selector below.")
-                                                # Create a page selector
-                                                page_options = [f"Page {i+1}" for i in range(page_count)]
-                                                selected_page = st.selectbox("Jump to page:", options=page_options, index=0)
-                                                # Extract page number from selection
-                                                page_num = int(selected_page.split(" ")[1])
-                                                # Add JavaScript to scroll to the selected page
-                                                st.markdown(f"""
-                                                <script>
-                                                    document.addEventListener('DOMContentLoaded', function() {{
-                                                        const element = document.getElementById('page-{page_num}');
-                                                        if (element) {{
-                                                            element.scrollIntoView({{ behavior: 'smooth' }});
-                                                        }}
-                                                    }});
-                                                </script>
-                                                """, unsafe_allow_html=True)
-                                            # Display the HTML content
-                                            st.components.v1.html(html_with_images, height=600, scrolling=True)
-                                            # Add download button for the content with images
-                                            st.download_button(
-                                                label="Download with Images (HTML)",
-                                                data=html_with_images,
-                                                file_name="document_with_images.html",
-                                                mime="text/html"
-                                            )
-                                        except Exception as e:
-                                            st.error(f"Could not display document with images: {str(e)}")
-                                            st.info("Try refreshing or processing the document again.")
-                        else:
-                            st.error("No OCR content was extracted from the document.")
-            except Exception as e:
-                st.error(f"Error processing document: {str(e)}")
-    else:
-        # Display sample images in the main area when no file is uploaded
-        st.info("Upload a document to get started using the file uploader above.")
-        # Show example images in a grid
-        # Add a sample images container
-        with st.container():
-            # Find sample images from the input directory to display
-            input_dir = Path(__file__).parent / "input"
-            sample_images = []
-            if input_dir.exists():
-                # Get all potential image files - exclude PDF files
-                all_images = []
-                all_images.extend(list(input_dir.glob("*.jpg")))
-                all_images.extend(list(input_dir.glob("*.jpeg")))
-                all_images.extend(list(input_dir.glob("*.png")))
-                # Filter to get a good set of diverse images - not too small, not too large
-                valid_images = [path for path in all_images if 50000 < path.stat().st_size < 1000000]
-                # Deduplicate any images that might have the same content (like recipe and historical-recipe)
-                seen_sizes = {}
-                deduplicated_images = []
-                for img in valid_images:
-                    size = img.stat().st_size
-                    # If we haven't seen this exact file size before, include it
-                    # This simple heuristic works well enough for images with identical content
-                    if size not in seen_sizes:
-                        seen_sizes[size] = True
-                        deduplicated_images.append(img)
-                valid_images = deduplicated_images
-                # Select a random sample of 6 images if we have enough
-                import random
-                if len(valid_images) > 6:
-                    sample_images = random.sample(valid_images, 6)
-                else:
-                    sample_images = valid_images
-            if sample_images:
-                # Create two rows of three columns
-                # First row
-                row1 = st.columns(3)
-                for i in range(3):
-                    if i < len(sample_images):
-                        with row1[i]:
-                            try:
-                                st.image(str(sample_images[i]), caption=sample_images[i].name, use_container_width=True)
-                            except Exception:
-                                # Silently skip problematic images
-                                pass
-                # Second row
-                row2 = st.columns(3)
-                for i in range(3):
-                    idx = i + 3
-                    if idx < len(sample_images):
-                        with row2[i]:
-                            try:
-                                st.image(str(sample_images[idx]), caption=sample_images[idx].name, use_container_width=True)
-                            except Exception:
-                                # Silently skip problematic images
-                                pass

 import tempfile
 import io
 from pdf2image import convert_from_bytes
+from PIL import Image, ImageEnhance, ImageFilter, UnidentifiedImageError
+import PIL
 import cv2
 import numpy as np
 from structured_ocr import StructuredOCR
 from config import MISTRAL_API_KEY
+# Import UI layout if available
 try:
+    from ui.layout import tool_container
+    UI_LAYOUT_AVAILABLE = True
 except ImportError:
+    UI_LAYOUT_AVAILABLE = False
 # Set page configuration
 st.set_page_config(
         st.error(f"Error converting PDF: {str(e)}")
         return []
+def safe_open_image(image_bytes):
+    """Safe wrapper for PIL.Image.open with robust error handling"""
+    try:
+        return Image.open(io.BytesIO(image_bytes))
+    except Exception:
+        # Return None if image can't be opened
+        return None
 @st.cache_data(ttl=3600, show_spinner=False)
 def preprocess_image(image_bytes, preprocessing_options):
     """Preprocess image with selected options"""
+    try:
+        # Attempt to open the image safely
+        image = safe_open_image(image_bytes)
+        # If image could not be opened, return the original bytes
+        if image is None:
+            return image_bytes
+        # Ensure image is in RGB mode for OpenCV processing
+        if image.mode not in ['RGB', 'RGBA']:
+            image = image.convert('RGB')
+        elif image.mode == 'RGBA':
+            # Handle RGBA images by removing transparency
+            background = Image.new('RGB', image.size, (255, 255, 255))
+            background.paste(image, mask=image.split()[3])  # 3 is the alpha channel
+            image = background
+        # Handle image rotation based on user selection
+        rotation_option = preprocessing_options.get("rotation", "None")
+        if rotation_option != "None":
+            if rotation_option == "Rotate 90° clockwise":
+                image = image.transpose(Image.ROTATE_270)
+            elif rotation_option == "Rotate 90° counterclockwise":
+                image = image.transpose(Image.ROTATE_90)
+            elif rotation_option == "Rotate 180°":
+                image = image.transpose(Image.ROTATE_180)
+            elif rotation_option == "Auto-detect":
+                # Heuristic only: rotate wide landscape images; EXIF data and text direction are not inspected
+                width, height = image.size
+                # If image is in landscape and likely a document (typically portrait is better for OCR)
+                if width > height and (width / height) > 1.5:
+                    image = image.transpose(Image.ROTATE_90)
+        # Convert to numpy array for OpenCV processing
+        try:
+            img_array = np.array(image)
+        except Exception:
+            # Return the original image as JPEG if we can't convert to array
+            byte_io = io.BytesIO()
+            image.save(byte_io, format='JPEG')
+            byte_io.seek(0)
+            return byte_io.getvalue()
+        # Apply preprocessing based on selected options
+        try:
+            if preprocessing_options.get("grayscale", False):
+                img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
+                img_array = cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB)
+            if preprocessing_options.get("contrast", 0) != 0:
+                contrast_factor = 1 + (preprocessing_options.get("contrast", 0) / 10)
+                image = Image.fromarray(img_array)
+                enhancer = ImageEnhance.Contrast(image)
+                image = enhancer.enhance(contrast_factor)
+                img_array = np.array(image)
+            if preprocessing_options.get("denoise", False):
+                # Ensure the image is in the correct format for denoising (CV_8UC3)
+                if len(img_array.shape) != 3 or img_array.shape[2] != 3:
+                    # Convert to RGB if it's not already a 3-channel color image
+                    if len(img_array.shape) == 2:  # Grayscale
+                        img_array = cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB)
+                img_array = cv2.fastNlMeansDenoisingColored(img_array, None, 10, 10, 7, 21)
+            if preprocessing_options.get("threshold", False):
+                # Convert to grayscale if not already
+                if len(img_array.shape) == 3:
+                    gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
+                else:
+                    gray = img_array
+                # Apply adaptive threshold
+                binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
+                                              cv2.THRESH_BINARY, 11, 2)
+                # Convert back to RGB
+                img_array = cv2.cvtColor(binary, cv2.COLOR_GRAY2RGB)
+        except Exception:
+            # Return the original image if preprocessing fails
+            byte_io = io.BytesIO()
+            image.save(byte_io, format='JPEG')
+            byte_io.seek(0)
+            return byte_io.getvalue()
+        # Convert back to PIL Image
+        try:
+            processed_image = Image.fromarray(img_array)
+            # Convert to bytes
+            byte_io = io.BytesIO()
+            processed_image.save(byte_io, format='JPEG')  # Use JPEG for better compatibility
+            byte_io.seek(0)
+            return byte_io.getvalue()
+        except Exception:
+            # Final fallback - return original bytes
+            return image_bytes
+    except Exception:
+        # Return original image bytes as fallback
+        return image_bytes
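Review note: the contrast slider and the "Auto-detect" rotation in `preprocess_image` both reduce to small pure computations. The following is a minimal sketch of that arithmetic, separated from PIL and OpenCV so it can be checked in isolation. The helper names `contrast_factor` and `should_auto_rotate` are hypothetical and do not exist in `app.py`; the zero-height guard is also an addition for illustration.

```python
def contrast_factor(slider_value: int) -> float:
    """Map the -5..+5 contrast slider to an enhancement factor (1.0 = unchanged)."""
    return 1 + slider_value / 10


def should_auto_rotate(width: int, height: int, min_ratio: float = 1.5) -> bool:
    """Mirror the diff's heuristic: rotate only clearly landscape images."""
    if height <= 0:
        # Guard added for this sketch; the code in the diff does not check this.
        return False
    return width > height and (width / height) > min_ratio


if __name__ == "__main__":
    for value in (-5, 0, 5):
        print(value, contrast_factor(value))
    print(should_auto_rotate(3000, 1000), should_auto_rotate(1200, 1000))
```

Because the ratio test is strict, an image that is exactly 1.5 times wider than tall is left alone, and mildly landscape photographs are never rotated. The heuristic cannot tell which way a sideways page should turn; it always applies the same 90° transpose.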
 # Define functions
 def process_file(uploaded_file, use_vision=True, preprocessing_options=None):
             # Return dummy data if no API key
             progress_bar.progress(100)
             status_text.empty()
+            # Show a clear message about the missing API key
+            st.error("🔑 **Missing API Key**: Cannot process document without a valid Mistral AI API key.")
+            st.info("""
+            **How to add your API key:**
+            For Hugging Face Spaces:
+            1. Go to your Space settings
+            2. Add a secret named `MISTRAL_API_KEY` with your API key value
+            For local development:
+            1. Add to your shell: `export MISTRAL_API_KEY=your_key_here`
+            2. Or create a `.env` file with `MISTRAL_API_KEY=your_key_here`
+            """)
             return {
                 "file_name": uploaded_file.name,
+                "topics": ["API Key Required"],
                 "languages": ["English"],
                 "ocr_contents": {
+                    "title": "Missing Mistral API Key",
+                    "content": "To process real documents, please set the MISTRAL_API_KEY environment variable as described above."
                 }
             }
         progress_bar.progress(20)
         status_text.text("Initializing OCR processor...")
+        # Initialize OCR processor with explicit API key
+        try:
+            # Make sure the API key is properly formatted
+            api_key = MISTRAL_API_KEY.strip()
+            processor = StructuredOCR(api_key=api_key)
+        except Exception as e:
+            st.error(f"Error initializing OCR processor: {str(e)}")
+            return {
+                "file_name": uploaded_file.name,
+                "error": "API authentication failed",
+                "ocr_contents": {
+                    "error": "Could not authenticate with Mistral API. Please check your API key."
+                }
+            }
         # Determine file type from extension
         file_ext = Path(uploaded_file.name).suffix.lower()
         file_type = "pdf" if file_ext == ".pdf" else "image"
+        # Store original filename in session state for preservation
+        st.session_state.original_filename = uploaded_file.name
         # Apply preprocessing if needed
         if any(preprocessing_options.values()) and file_type == "image":
             status_text.text("Applying image preprocessing...")
+            try:
+                processed_bytes = preprocess_image(uploaded_file.getvalue(), preprocessing_options)
+                # Save processed image to temp file but preserve original filename for results
+                original_ext = Path(uploaded_file.name).suffix.lower()
+                # Use original extension when possible for better format recognition
+                if original_ext in ['.jpg', '.jpeg', '.png']:
+                    suffix = original_ext
+                else:
+                    suffix = '.jpg'  # Default fallback to JPEG
+                with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as proc_tmp:
+                    proc_tmp.write(processed_bytes)
+                    temp_path = proc_tmp.name
+            except Exception as e:
+                st.warning(f"Image preprocessing failed: {str(e)}. Proceeding with original image.")
+                # If preprocessing fails, use original file
+                # This ensures the OCR process continues even if preprocessing has issues
         # Get file size in MB
         file_size_mb = os.path.getsize(temp_path) / (1024 * 1024)
         progress_bar.progress(100)
         status_text.empty()
+        # Preserve original filename in results
+        if hasattr(st.session_state, 'original_filename'):
+            result['file_name'] = st.session_state.original_filename
+            # Clear the stored filename for next run
+            del st.session_state.original_filename
         return result
     except Exception as e:
         progress_bar.progress(100)
         if os.path.exists(temp_path):
             os.unlink(temp_path)
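Review note: `process_file` gates preprocessing on `any(preprocessing_options.values())`. Once the options dict carries a `"rotation"` entry whose default is the string `"None"`, that check is always true, because a non-empty string is truthy. It also raises `AttributeError` if `preprocessing_options` is left at its default of `None`. A minimal sketch of a stricter check follows; `has_active_preprocessing` is a hypothetical helper, not a function in this commit.

```python
def has_active_preprocessing(options):
    """Return True only if at least one option would change the image."""
    if not options:
        # Covers both None and an empty dict.
        return False
    for key, value in options.items():
        if key == "rotation":
            # The selectbox default is the *string* "None", which is truthy.
            if value not in (None, "None"):
                return True
        elif value:
            # Booleans (grayscale, threshold, denoise) and a non-zero contrast.
            return True
    return False


defaults = {
    "grayscale": False,
    "threshold": False,
    "denoise": False,
    "contrast": 0,
    "rotation": "None",
}

if __name__ == "__main__":
    print(any(defaults.values()))              # naive check
    print(has_active_preprocessing(defaults))  # stricter check
```

With the default options shown, the naive check reports active preprocessing while the stricter check does not, so an untouched image would skip the JPEG re-encode in `preprocess_image`.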
+# Initialize session state for storing results
+if 'previous_results' not in st.session_state:
+    st.session_state.previous_results = []
+if 'current_result' not in st.session_state:
+    st.session_state.current_result = None
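Review note: the history kept in `previous_results` is later trimmed to the last 10 entries by hand (`append` followed by `pop(0)`). Assuming only append-and-trim semantics are needed, a `collections.deque` with `maxlen` expresses the same policy in one place. This is a sketch outside Streamlit; `remember` and the sample file names are hypothetical.

```python
from collections import deque

MAX_HISTORY = 10  # same cap the commit applies to previous_results


def remember(history, result):
    """Append a result unless an equal one is already stored."""
    if result not in history:
        # With maxlen set, the deque drops its oldest entry automatically.
        history.append(result)
    return history


history = deque(maxlen=MAX_HISTORY)
for i in range(12):
    remember(history, {"file_name": f"doc_{i}.png"})

if __name__ == "__main__":
    print(len(history), history[0]["file_name"], history[-1]["file_name"])
```

A deque supports `len`, indexing, and iteration, so the selectbox code that enumerates previous results would not need to change under this assumption.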
 # App title and description
 st.title("Historical Document OCR")
+st.write("Process historical documents and images with AI-powered OCR.")
+# Check if API key is available
+if not MISTRAL_API_KEY:
+    st.warning("⚠️ **No Mistral API key found.** Please set the MISTRAL_API_KEY environment variable.")
+    st.info("For Hugging Face Spaces, add it as a secret. For local development, export it in your shell or add it to a .env file.")
+# Create main layout with tabs
+main_tab1, main_tab2, main_tab3 = st.tabs(["Document Processing", "Previous Results", "About"])
 # Sidebar with options
 with st.sidebar:
     # Model options
     st.subheader("Model Settings")
     use_vision = st.checkbox("Use Vision Model", value=True,
+                            help="For image files, use the vision model for improved analysis")
+    # Image preprocessing options
     st.subheader("Image Preprocessing")
     with st.expander("Preprocessing Options"):
         preprocessing_options = {}
                                                      help="Remove noise from the image")
         preprocessing_options["contrast"] = st.slider("Adjust Contrast", -5, 5, 0,
                                                     help="Adjust image contrast (-5 to +5)")
+        # Add rotation options
+        rotation_options = ["None", "Rotate 90° clockwise", "Rotate 90° counterclockwise", "Rotate 180°", "Auto-detect"]
+        preprocessing_options["rotation"] = st.selectbox("Image Orientation", rotation_options, index=0,
+                                                      help="Rotate image to correct orientation")
+    # PDF options
     st.subheader("PDF Options")
     with st.expander("PDF Settings"):
         pdf_dpi = st.slider("PDF Resolution (DPI)", 72, 300, 150,
                           help="Higher DPI gives better quality but slower processing")
+        max_pages = st.number_input("Maximum Pages", 1, 20, 5,
                                   help="Limit number of pages to process")
+# Previous Results tab
 with main_tab2:
+    if not st.session_state.previous_results:
+        st.info("No previous documents have been processed yet. Process a document to see results here.")
+    else:
+        st.subheader("Previously Processed Documents")
+        # Display previous results in a selectable list, with default confidence of 85%
+        previous_files = [f"{i+1}. {result.get('file_name', 'Document')} ({result.get('confidence_score', 0.85):.1%} confidence)"
+                         for i, result in enumerate(st.session_state.previous_results)]
+        selected_index = st.selectbox("Select a previous document:",
+                                     options=range(len(previous_files)),
+                                     format_func=lambda i: previous_files[i])
+        selected_result = st.session_state.previous_results[selected_index]
+        # Display selected result in tabs
+        has_images = selected_result.get('has_images', False)
+        if has_images:
+            prev_tabs = st.tabs(["Document Info", "Content", "With Images"])
+        else:
+            prev_tabs = st.tabs(["Document Info", "Content"])
+        # Document Info tab
+        with prev_tabs[0]:
+            st.write(f"**File:** {selected_result.get('file_name', 'Document')}")
+            # Show confidence score (default to 85% if not available)
+            confidence = selected_result.get('confidence_score', 0.85)
+            st.write(f"**OCR Confidence:** {confidence:.1%}")
+            # Show languages if available
+            if 'languages' in selected_result and selected_result['languages']:
+                languages = [lang for lang in selected_result['languages'] if lang is not None]
+                if languages:
+                    st.write(f"**Languages:** {', '.join(languages)}")
+            # Show topics if available
+            if 'topics' in selected_result and selected_result['topics']:
+                st.write(f"**Topics:** {', '.join(selected_result['topics'])}")
+            # Show any limited pages info
+            if 'limited_pages' in selected_result:
+                st.info(f"Processed {selected_result['limited_pages']['processed']} of {selected_result['limited_pages']['total']} pages")
+        # Content tab
+        with prev_tabs[1]:
+            if 'ocr_contents' in selected_result:
+                st.markdown("## Document Contents")
+                if isinstance(selected_result['ocr_contents'], dict):
+                    for section, content in selected_result['ocr_contents'].items():
+                        if not content:
+                            continue
+                        section_title = section.replace('_', ' ').title()
+                        # Special handling for title and subtitle
+                        if section.lower() == 'title':
+                            st.markdown(f"# {content}")
+                        elif section.lower() == 'subtitle':
+                            st.markdown(f"*{content}*")
+                        else:
+                            st.markdown(f"### {section_title}")
+                        # Handle different content types
+                        if isinstance(content, str):
+                            st.markdown(content)
+                        elif isinstance(content, list):
+                            for item in content:
+                                if isinstance(item, str):
+                                    st.markdown(f"* {item}")
+                                else:
+                                    st.json(item)
+                        elif isinstance(content, dict):
+                            for k, v in content.items():
+                                st.markdown(f"**{k}:** {v}")
+            else:
+                st.warning("No content available for this document.")
+        # Images tab if available
+        if has_images and len(prev_tabs) > 2:
+            with prev_tabs[2]:
+                try:
+                    # Import function
+                    from ocr_utils import create_html_with_images
+                    if 'pages_data' in selected_result:
+                        # Generate HTML with images
+                        html_with_images = create_html_with_images(selected_result)
+                        # Display HTML content
+                        st.components.v1.html(html_with_images, height=600, scrolling=True)
+                        # Download button with unique key to prevent resets
+                        st.download_button(
+                            label="Download with Images (HTML)",
+                            data=html_with_images,
+                            file_name=f"{selected_result.get('file_name', 'document')}_with_images.html",
+                            mime="text/html",
+                            key=f"prev_download_{hash(selected_result.get('file_name', 'doc'))}_{selected_index}"
+                        )
+                    else:
+                        st.warning("No image data available for this document.")
+                except Exception as e:
+                    st.error(f"Could not display document with images: {str(e)}")
+# About tab content
+with main_tab3:
     st.markdown("""
     ### About This Application
+    This app uses Mistral AI's Document OCR to extract text and images from historical documents with enhanced formatting.
     It can process:
     - Image files (jpg, png, etc.)
     - **Raw JSON**: Complete data structure for developers
     - **With Images**: Document with embedded images preserving original layout
+    **History Feature:**
+    - All processed documents are saved in the session history
+    - Access previous documents in the "Previous Results" tab
+    - No need to reprocess the same document multiple times
     """)
+# Main tab content
 with main_tab1:
+    # Create two columns for the main interface
+    col1, col2 = st.columns([1, 1])
+    # File upload column
+    with col1:
+        st.subheader("Upload Document")
+        # File uploader
+        uploaded_file = st.file_uploader("Choose an image or PDF file",
+                                        type=["pdf", "png", "jpg", "jpeg"],
+                                        help="Select a document to process with OCR")
+        # Show preprocessing summary if options are selected
+        if uploaded_file is not None and any(v and v != "None" for v in preprocessing_options.values()):
+            st.write("**Active preprocessing:**")
+            prep_list = []
+            if preprocessing_options.get("grayscale", False):
+                prep_list.append("Grayscale conversion")
+            if preprocessing_options.get("threshold", False):
+                prep_list.append("Adaptive thresholding")
+            if preprocessing_options.get("denoise", False):
+                prep_list.append("Noise reduction")
+            contrast_value = preprocessing_options.get("contrast", 0)
+            if contrast_value != 0:
+                direction = "increased" if contrast_value > 0 else "decreased"
+                prep_list.append(f"Contrast {direction} by {abs(contrast_value)}")
+            rotation = preprocessing_options.get("rotation", "None")
+            if rotation != "None":
+                prep_list.append(f"{rotation}")
+            for item in prep_list:
+                st.write(f"- {item}")
+        # Process button - show only when file is uploaded
+        if uploaded_file is not None:
+            # Check file size (cap at 20MB)
+            file_size_mb = len(uploaded_file.getvalue()) / (1024 * 1024)
+            if file_size_mb > 20:
+                st.error(f"File too large ({file_size_mb:.1f} MB). Maximum file size is 20MB.")
+            else:
+                # Display file info
+                st.write(f"**File:** {uploaded_file.name} ({file_size_mb:.2f} MB)")
+                # Process button
+                process_button = st.button("Process Document",
+                                         type="primary",
+                                         use_container_width=True,
+                                         help="Start OCR processing with the selected options")
+    # Preview column
+    with col2:
+        if uploaded_file is not None:
             st.subheader("Document Preview")
+            file_ext = Path(uploaded_file.name).suffix.lower()
+            # Show preview tabs for original and processed (if applicable)
+            if uploaded_file.type and uploaded_file.type.startswith('image/'):
+                # For image files
+                preview_tabs = st.tabs(["Original"])
+                # Show original image preview
+                with preview_tabs[0]:
+                    try:
+                        image = safe_open_image(uploaded_file.getvalue())
+                        if image:
+                            # Display with controlled size
+                            st.image(image, caption=uploaded_file.name, width=400)
+                        else:
+                            st.info("Image preview not available")
+                    except Exception:
+                        st.info("Image preview could not be displayed")
+                # Add processed preview if preprocessing options are selected
+                if any(v and v != "None" for v in preprocessing_options.values()):
+                    # Create a before-after comparison
+                    st.subheader("Preprocessing Preview")
+                    try:
+                        # Process the image with selected options
+                        processed_bytes = preprocess_image(uploaded_file.getvalue(), preprocessing_options)
+                        processed_image = safe_open_image(processed_bytes)
+                        # Show before/after in columns
+                        before_col, after_col = st.columns(2)
+                        with before_col:
+                            st.write("**Original**")
+                            image = safe_open_image(uploaded_file.getvalue())
+                            if image:
+                                st.image(image, width=300)
+                        with after_col:
+                            st.write("**Processed**")
+                            if processed_image:
+                                st.image(processed_image, width=300)
+                            else:
+                                st.info("Processed preview not available")
+                    except Exception:
+                        st.info("Preprocessing preview could not be generated")
+            elif file_ext == ".pdf":
+                # For PDF files
                 try:
+                    # Convert first page of PDF to image
                     pdf_bytes = uploaded_file.getvalue()
+                    with st.spinner("Generating PDF preview..."):
+                        images = convert_from_bytes(pdf_bytes, first_page=1, last_page=1, dpi=150)
                     if images:
+                        # Convert to JPEG for display
                         first_page = images[0]
                         img_bytes = io.BytesIO()
                         first_page.save(img_bytes, format='JPEG')
                         img_bytes.seek(0)
+                        # Display preview
+                        st.image(img_bytes, caption=f"PDF Preview: {uploaded_file.name}", width=400)
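+                        # Read the page count from PDF metadata instead of rendering every page
+                        from pdf2image import pdfinfo_from_bytes
+                        st.info(f"PDF document with {pdfinfo_from_bytes(pdf_bytes)['Pages']} pages")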
                     else:
+                        st.info(f"PDF preview not available: {uploaded_file.name}")
                 except Exception:
+                    st.info(f"PDF preview could not be displayed: {uploaded_file.name}")
+    # Results section - spans full width
+    if 'process_button' in locals() and process_button:
+        # Horizontal line to separate input and results
+        st.markdown("---")
+        st.subheader("Processing Results")
+        try:
+            # Process the file with selected options
+            result = process_file(uploaded_file, use_vision, preprocessing_options)
+            # Save result to session state
+            st.session_state.current_result = result
+            # Add to previous results if not already there
+            if result not in st.session_state.previous_results:
+                st.session_state.previous_results.append(result)
+                # Keep only the last 10 results to avoid memory issues
+                if len(st.session_state.previous_results) > 10:
+                    st.session_state.previous_results.pop(0)
+            # Create tabs for viewing results
+            has_images = result.get('has_images', False)
+            if has_images:
+                result_tabs = st.tabs(["Structured View", "Raw JSON", "With Images"])
             else:
+                result_tabs = st.tabs(["Structured View", "Raw JSON"])
+            # Structured view tab
+            with result_tabs[0]:
+                # Display file info
+                st.write(f"**File:** {result.get('file_name', uploaded_file.name)}")
+                # Show confidence score (default to 85% if not available)
+                confidence = result.get('confidence_score', 0.85)
+                st.write(f"**OCR Confidence:** {confidence:.1%}")
+                # Show languages if available
+                if 'languages' in result and result['languages']:
+                    languages = [lang for lang in result['languages'] if lang is not None]
+                    if languages:
+                        st.write(f"**Languages:** {', '.join(languages)}")
+                # Show topics if available
+                if 'topics' in result and result['topics']:
+                    st.write(f"**Topics:** {', '.join(result['topics'])}")
+                # Display limited pages info if applicable
+                if 'limited_pages' in result:
+                    st.info(f"Processed {result['limited_pages']['processed']} of {result['limited_pages']['total']} pages")
+                # Display structured content
+                if 'ocr_contents' in result:
+                    st.markdown("## Document Contents")
+                    # Format based on content structure
+                    if isinstance(result['ocr_contents'], dict):
+                        for section, content in result['ocr_contents'].items():
+                            if not content:  # Skip empty sections
+                                continue
+                            section_title = section.replace('_', ' ').title()
+                            # Special handling for title and subtitle
+                            if section.lower() == 'title':
+                                st.markdown(f"# {content}")
+                            elif section.lower() == 'subtitle':
+                                st.markdown(f"*{content}*")
                             else:
+                                # Section headers for non-title sections
+                                st.markdown(f"### {section_title}")
+                            # Process different content types
+                            if isinstance(content, str):
+                                st.markdown(content)
+                            elif isinstance(content, list):
+                                # Display list items with proper formatting
+                                st.write("")  # Add spacing
+                                for item in content:
+                                    if isinstance(item, str):
+                                        st.markdown(f"* {item}")
+                                    elif isinstance(item, dict):
+                                        # Create formatted display for dictionary items instead of raw JSON
+                                        with st.expander(f"Details {list(item.keys())[0] if item else ''}"):
+                                            for k, v in item.items():
+                                                st.markdown(f"**{k}:** {v}")
+                            elif isinstance(content, dict):
+                                # Special handling for poem type
+                                if 'type' in content and content['type'] == 'poem' and 'lines' in content:
+                                    # Render the poem as one preformatted block to preserve spacing;
+                                    # separate st.markdown calls each render as their own element
+                                    st.code("\n".join(str(line) for line in content['lines']), language=None)
+                                else:
+                                    # Regular dictionary display with better formatting
+                                    st.write("")  # Add spacing
+                                    for k, v in content.items():
+                                        if isinstance(v, str):
+                                            st.markdown(f"**{k}:** {v}")
+                                        elif isinstance(v, list):
+                                            st.markdown(f"**{k}:**")
+                                            for item in v:
+                                                st.markdown(f"  * {item}")
                                         else:
+                                            st.markdown(f"**{k}:** {v}")
+                # Download button
+                with st.expander("Export Content"):
+                    # Generate HTML content for download with proper CSS styling
+                    html_content = '''<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>OCR Document</title>
+    <style>
+        body {
+            font-family: 'Georgia', serif;
+            line-height: 1.6;
+            margin: 0;
+            padding: 20px;
+            background-color: #f9f9f9;
+            color: #333;
+        }
+        .container {
+            max-width: 1000px;
+            margin: 0 auto;
+            background-color: #fff;
+            padding: 30px;
+            border-radius: 8px;
+            box-shadow: 0 4px 12px rgba(0,0,0,0.1);
+        }
+        h1, h2, h3 {
+            font-family: 'Bookman', 'Georgia', serif;
+            margin-top: 1.5em;
+            margin-bottom: 0.5em;
+            color: #222;
+        }
+        h1 { font-size: 2.2em; border-bottom: 2px solid #e0e0e0; padding-bottom: 10px; }
+        h2 { font-size: 1.8em; border-bottom: 1px solid #e0e0e0; padding-bottom: 6px; }
+        h3 { font-size: 1.5em; }
+        p { margin-bottom: 1.2em; text-align: justify; }
+        ul { margin-bottom: 1.5em; }
+        li { margin-bottom: 0.3em; }
+        dl { margin-bottom: 1.5em; }
+        dt { font-weight: bold; margin-top: 1em; }
+        dd { margin-left: 2em; margin-bottom: 0.5em; }
+        .poem {
+            font-family: 'Baskerville', 'Georgia', serif;
+            margin-left: 2em;
+            line-height: 1.8;
+            white-space: pre-wrap;
+        }
+    </style>
+</head>
+<body>
+<div class="container">'''
+                    # Add content to HTML with proper formatting
+                    if 'ocr_contents' in result and isinstance(result['ocr_contents'], dict):
+                        for section, content in result['ocr_contents'].items():
+                            if not content:
+                                continue
+                            section_title = section.replace('_', ' ').title()
+                            # Handle title and subtitle with special formatting
+                            if section.lower() == 'title':
+                                html_content += f'<h1>{content}</h1>\n'
+                            elif section.lower() == 'subtitle':
+                                html_content += f'<div style="font-style:italic;font-size:1.1em;margin-bottom:1.5em;">{content}</div>\n'
+                            else:
+                                html_content += f'<h3>{section_title}</h3>\n'
+                            # Handle different content types with appropriate HTML
+                            if isinstance(content, str):
+                                # Split into paragraphs and format each properly
+                                paragraphs = content.split('\n\n')
+                                for p in paragraphs:
+                                    if p.strip():
+                                        html_content += f'<p>{p.strip()}</p>\n'
+                            elif isinstance(content, list):
+                                # Properly format lists with better handling for dict items
+                                html_content += '<ul>\n'
+                                for item in content:
+                                    if isinstance(item, str):
+                                        html_content += f'<li>{item}</li>\n'
+                                    elif isinstance(item, dict):
+                                        # Format dictionary items in the list
+                                        html_content += '<li>\n'
+                                        html_content += '<details>\n'
+                                        html_content += f'<summary>{list(item.keys())[0] if item else "Details"}</summary>\n'
+                                        html_content += '<dl>\n'
+                                        for k, v in item.items():
+                                            html_content += f'<dt>{k}</dt>\n<dd>{v}</dd>\n'
+                                        html_content += '</dl>\n'
+                                        html_content += '</details>\n'
+                                        html_content += '</li>\n'
+                                    else:
+                                        html_content += f'<li>{str(item)}</li>\n'
+                                html_content += '</ul>\n'
+                            elif isinstance(content, dict):
+                                # Special handling for poem content
+                                if 'type' in content and content['type'] == 'poem' and 'lines' in content:
+                                    html_content += '<div class="poem">\n'
+                                    for line in content['lines']:
+                                        html_content += f'{line}<br>\n'
+                                    html_content += '</div>\n'
+                                else:
+                                    # Regular dictionary display with proper nesting
+                                    html_content += '<dl>\n'
+                                    for k, v in content.items():
+                                        html_content += f'<dt>{k}</dt>\n'
+                                        if isinstance(v, str):
+                                            html_content += f'<dd>{v}</dd>\n'
+                                        elif isinstance(v, list):
+                                            html_content += '<dd><ul>\n'
+                                            for item in v:
+                                                html_content += f'<li>{item}</li>\n'
+                                            html_content += '</ul></dd>\n'
+                                        else:
+                                            html_content += f'<dd>{str(v)}</dd>\n'
+                                    html_content += '</dl>\n'
+                    # Close HTML
+                    html_content += '''
+</div>
+</body>
+</html>'''
+                    # Create download button with unique key to prevent resets
+                    html_bytes = html_content.encode()
+                    st.download_button(
+                        label="Download as HTML",
+                        data=html_bytes,
+                        file_name="document_content.html",
+                        mime="text/html",
+                        key=f"download_html_{hash(result.get('file_name', 'doc'))}"
+                    )
+            # Raw JSON tab
+            with result_tabs[1]:
+                st.json(result)
+            # Images tab (if available)
+            if has_images:
+                with result_tabs[2]:
+                    try:
+                        # Import create_html_with_images function
+                        from ocr_utils import create_html_with_images
+                        # Check if images are available
+                        if 'pages_data' not in result:
+                            st.warning("No image data available in the OCR response.")
+                        else:
+                            # Count images for warning
+                            image_count = 0
+                            for page in result.get('pages_data', []):
+                                image_count += len(page.get('images', []))
+                            if image_count > 10:
+                                st.warning(f"This document contains {image_count} images. Rendering may take longer.")
+                            # Display info about pages and images
+                            page_count = len(result.get('pages_data', []))
+                            st.write(f"**Document contains {page_count} page{'' if page_count == 1 else 's'} with {image_count} image{'' if image_count == 1 else 's'} total**")
+                            # Add pagination if multiple pages
+                            if page_count > 1:
+                                page_options = [f"Page {i+1}" for i in range(page_count)]
+                                selected_page = st.selectbox("Select page to view:", options=page_options)
+                                selected_page_num = int(selected_page.split(" ")[1])
+                                st.write(f"**Viewing {selected_page}**")
+                            # Generate HTML with images
+                            with st.spinner("Generating document with embedded images..."):
+                                html_with_images = create_html_with_images(result)
+                                # Display document in a fixed height container with scrolling
+                                st.write("**Document with Original Images**")
+                                st.components.v1.html(html_with_images, height=600, scrolling=True)
+                            # Provide a download option
+                            col1, col2 = st.columns([3, 1])
+                            with col2:
+                                st.download_button(
+                                    label="Download with Images",
+                                    data=html_with_images,
+                                    file_name=f"{result.get('file_name', 'document')}_with_images.html",
+                                    mime="text/html",
+                                    use_container_width=True,
+                                    key=f"download_images_{hash(result.get('file_name', 'doc'))}"
+                                )
+                            with col1:
+                                st.info("This HTML document includes the original document images embedded at their correct positions.")
+                                st.write("Original filenames and image positions are preserved in the downloaded file.")
+                    except Exception as e:
+                        st.error(f"Could not display document with images: {str(e)}")
+        except Exception as e:
+            st.error(f"Error processing document: {str(e)}")
+    # Show sample examples when no file is uploaded
+    elif uploaded_file is None:
+        # Show info about supported formats
+        st.info("📝 Upload a document to get started. Supported formats: JPG, PNG, PDF")
+        # Show example usage
+        with st.expander("Tips for best results"):
+            st.markdown("""
+            **For best OCR results:**
+            1. **Image quality** - Higher resolution images produce better results
+            2. **Document orientation** - Use rotation options for incorrectly oriented documents
+            3. **Preprocessing** - Try grayscale and thresholding for low-contrast documents
+            4. **File size** - Keep files under 10MB for best API performance
+            **File preservation:** Original filenames are preserved in the results.
+            """)
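
Review note on the HTML export in app.py: the export code interpolates OCR text directly into markup (for example `f'<h1>{content}</h1>'`), so a `<` or `&` in the recognized text would alter the generated HTML. The sketch below shows one way to escape text with the standard library's `html.escape`. `render_section` is a hypothetical helper, not a function in this commit, and it covers only the string and list cases.

```python
import html


def render_section(section, content):
    """Render one OCR section as HTML, escaping all text (hypothetical helper)."""
    title = html.escape(section.replace('_', ' ').title())
    parts = [f'<h3>{title}</h3>']
    if isinstance(content, str):
        # Blank lines separate paragraphs, as in the export code in the diff
        for paragraph in content.split('\n\n'):
            if paragraph.strip():
                parts.append(f'<p>{html.escape(paragraph.strip())}</p>')
    elif isinstance(content, list):
        parts.append('<ul>')
        parts.extend(f'<li>{html.escape(str(item))}</li>' for item in content)
        parts.append('</ul>')
    return '\n'.join(parts)
```

Escaping at the point of interpolation keeps the rest of the export logic unchanged; the same call could wrap the `<dt>`/`<dd>` values as well.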

config.py CHANGED

@@ -4,14 +4,19 @@ Configuration file for Mistral OCR processing.
 Contains API key and other settings.
 """
 import os
 # Your Mistral API key - get from Hugging Face secrets or environment variable
-# The priority order is: HF_SPACES environment var > regular environment var > empty string
-# Note: No default API key is provided for security reasons
-MISTRAL_API_KEY = os.environ.get("HF_MISTRAL_API_KEY",  # First check HF-specific env var
-                  os.environ.get("MISTRAL_API_KEY", ""))  # Then check regular env var
 # Model settings
 OCR_MODEL = "mistral-ocr-latest"
-TEXT_MODEL = "ministral-8b-latest"
 VISION_MODEL = "pixtral-12b-latest"

 Contains API key and other settings.
 """
 import os
+from dotenv import load_dotenv
+# Load environment variables from .env file if it exists
+load_dotenv()
 # Your Mistral API key - get from Hugging Face secrets or environment variable
+# The priority order is:
+# 1. HF_MISTRAL_API_KEY environment var (specific to Hugging Face)
+# 2. MISTRAL_API_KEY environment var (standard environment variable)
+# 3. Empty string (will show warning in app)
+MISTRAL_API_KEY = os.environ.get("HF_MISTRAL_API_KEY",
+                  os.environ.get("MISTRAL_API_KEY", ""))
 # Model settings
 OCR_MODEL = "mistral-ocr-latest"
 VISION_MODEL = "pixtral-12b-latest"
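
Review note on the key lookup: with the nested `os.environ.get` form, an `HF_MISTRAL_API_KEY` that is set but empty returns `""` and the fallback to `MISTRAL_API_KEY` is never consulted. A minimal sketch of the same priority order using `or`, which skips empty values; `resolve_api_key` and its `env` parameter are hypothetical names introduced here for illustration.

```python
import os


def resolve_api_key(env=None):
    """Return the first non-empty key, preferring the Hugging Face variable."""
    env = os.environ if env is None else env
    # `or` skips empty strings, unlike a nested .get() default
    key = env.get("HF_MISTRAL_API_KEY") or env.get("MISTRAL_API_KEY") or ""
    return key.strip()
```

Passing a plain dict as `env` makes the lookup easy to exercise without touching the real environment.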

prepare_for_hf.py CHANGED

@@ -13,34 +13,17 @@ import shutil
 import sys
 from pathlib import Path
-# Configuration for HF module
-HF_MODULE_ENABLED = True  # Set to False to disable the educational module
 def setup_hf_module():
-    """Setup the Hugging Face educational module if enabled"""
-    if not HF_MODULE_ENABLED:
-        print("Hugging Face educational module is disabled.")
-        return
-    print("Setting up Hugging Face educational module...")
-    # Ensure directories exist
-    for directory in ["modules", "ui"]:
-        if not os.path.exists(directory):
-            os.makedirs(directory)
-            print(f"Created {directory} directory")
-    # Check if module files exist
-    required_files = ["streamlit_app.py", "modules/modular_app.py", "ui/layout.py"]
-    missing_files = [f for f in required_files if not os.path.exists(f)]
-    if missing_files:
-        print("Warning: Some module files are missing:")
-        for file in missing_files:
-            print(f"  - {file}")
-        print("The educational version may not work correctly.")
-    else:
-        print("All required module files are present.")
 def main():
     print("Preparing repository for Hugging Face Spaces deployment...")

 import sys
 from pathlib import Path
+# No educational module needed
+HF_MODULE_ENABLED = False
 def setup_hf_module():
+    """Setup the Hugging Face integration"""
+    print("No educational module needed - using simplified app structure.")
+    # Ensure ui directory exists for layout files
+    if not os.path.exists("ui"):
+        os.makedirs("ui")
+        print("Created ui directory")
 def main():
     print("Preparing repository for Hugging Face Spaces deployment...")

process_file.py CHANGED

@@ -54,11 +54,16 @@ def process_file(uploaded_file, use_vision=True, processor=None, custom_prompt=N
             "use_vision": use_vision
         })
         return result
     except Exception as e:
         return {
             "error": str(e),
-            "file_name": uploaded_file.name
         }
     finally:
         # Clean up the temporary file

             "use_vision": use_vision
         })
+        # Always ensure confidence score is present (default to 85%)
+        if 'confidence_score' not in result:
+            result['confidence_score'] = 0.85
         return result
     except Exception as e:
         return {
             "error": str(e),
+            "file_name": uploaded_file.name,
+            "confidence_score": 0.85  # Add default confidence score even to error results
         }
     finally:
         # Clean up the temporary file

requirements.txt CHANGED

@@ -1,5 +1,5 @@
 streamlit>=1.43.2
-mistralai>=0.0.7
 pydantic>=2.0.0
 pycountry>=23.12.11
 pillow>=10.0.0
@@ -7,4 +7,5 @@ python-multipart>=0.0.6
 pdf2image>=1.17.0
 pytesseract>=0.3.10
 opencv-python-headless>=4.6.0
-numpy>=1.23.5

 streamlit>=1.43.2
+mistralai>=0.0.7,<2.0.0
 pydantic>=2.0.0
 pycountry>=23.12.11
 pillow>=10.0.0
 pdf2image>=1.17.0
 pytesseract>=0.3.10
 opencv-python-headless>=4.6.0
+numpy>=1.23.5
+python-dotenv>=1.0.0

run_local.sh CHANGED

@@ -1,13 +1,8 @@
 #!/bin/bash
-# Determine which version of the app to run
-if [ "$1" == "educational" ]; then
-  APP_FILE="streamlit_app.py"
-  echo "Starting Educational Version..."
-else
-  APP_FILE="app.py"
-  echo "Starting Standard Version..."
-fi
 # Check if .env file exists and load it
 if [ -f .env ]; then

 #!/bin/bash
+# Run the standard app
+APP_FILE="app.py"
+echo "Starting OCR Application..."
 # Check if .env file exists and load it
 if [ -f .env ]; then

simple_test.py CHANGED

@@ -12,7 +12,7 @@ def main():
     print("Testing OCR with a sample image file")
     # Path to the sample image file
-    image_path = os.path.join("input", "recipe.jpg")
     # Check if the file exists
     if not os.path.isfile(image_path):
@@ -25,7 +25,7 @@ def main():
     output_dir = "output"
     os.makedirs(output_dir, exist_ok=True)
-    output_path = os.path.join(output_dir, "recipe_test.json")
     # Import the StructuredOCR class
     from structured_ocr import StructuredOCR
@@ -38,9 +38,18 @@ def main():
         print(f"Processing image file: {image_path}")
         result = processor.process_file(image_path, file_type="image")
-        # Save the result to the output file
         with open(output_path, 'w') as f:
-            json.dump(result, f, indent=2)
         print(f"Image processing completed successfully. Output saved to {output_path}")

     print("Testing OCR with a sample image file")
     # Path to the sample image file
+    image_path = os.path.join("input", "magician-satire.jpg")
     # Check if the file exists
     if not os.path.isfile(image_path):
     output_dir = "output"
     os.makedirs(output_dir, exist_ok=True)
+    output_path = os.path.join(output_dir, "magician_test.json")
     # Import the StructuredOCR class
     from structured_ocr import StructuredOCR
         print(f"Processing image file: {image_path}")
         result = processor.process_file(image_path, file_type="image")
+        # Convert any non-serializable objects in the result
+        def sanitize_for_json(obj):
+            if hasattr(obj, 'to_dict'):
+                return obj.to_dict()
+            elif hasattr(obj, '__dict__'):
+                return obj.__dict__
+            else:
+                return str(obj)
+        # Save the result to the output file with a custom serializer
         with open(output_path, 'w') as f:
+            json.dump(result, f, indent=2, default=sanitize_for_json)
         print(f"Image processing completed successfully. Output saved to {output_path}")
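
Review note on the custom serializer: `json.dump` calls the `default` hook only for objects it cannot encode itself, and then encodes whatever the hook returns. The sketch below reuses the fallback order from the test script; the `Page` class is a hypothetical stand-in for a non-serializable response object, not a type from the Mistral client.

```python
import json


class Page:
    """Stand-in for a non-serializable response object (hypothetical)."""

    def __init__(self, index, markdown):
        self.index = index
        self.markdown = markdown


def sanitize_for_json(obj):
    # Same fallback order as the test script: to_dict(), then __dict__, then str()
    if hasattr(obj, 'to_dict'):
        return obj.to_dict()
    elif hasattr(obj, '__dict__'):
        return obj.__dict__
    return str(obj)


result = {"file_name": "magician-satire.jpg", "pages": [Page(0, "# Title")]}
encoded = json.dumps(result, default=sanitize_for_json)
```

Objects with neither `to_dict` nor `__dict__` fall through to `str()`, so the output always serializes, at the cost of losing structure for those values.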

structured_ocr.py CHANGED

@@ -37,7 +37,7 @@ except ImportError:
         return "\n\n".join(markdowns)
 # Import config directly (now local to historical-ocr)
-from config import MISTRAL_API_KEY, OCR_MODEL, TEXT_MODEL, VISION_MODEL
 # Create language enum for structured output
 languages = {lang.alpha_2: lang.name for lang in pycountry.languages if hasattr(lang, 'alpha_2')}
@@ -61,9 +61,36 @@ class StructuredOCR:
     def __init__(self, api_key=None):
         """Initialize the OCR processor with API key"""
         self.api_key = api_key or MISTRAL_API_KEY
-        self.client = Mistral(api_key=self.api_key)
-    def process_file(self, file_path, file_type=None, use_vision=True, max_pages=None, file_size_mb=None, custom_pages=None):
         """Process a file and return structured OCR results
         Args:
@@ -120,9 +147,9 @@ class StructuredOCR:
         # Read and process the file
         if file_type == "pdf":
-            result = self._process_pdf(file_path, use_vision, max_pages, custom_pages)
         else:
-            result = self._process_image(file_path, use_vision)
         # Add processing time information
         processing_time = time.time() - start_time
@@ -134,7 +161,7 @@ class StructuredOCR:
         return result
-    def _process_pdf(self, file_path, use_vision=True, max_pages=None, custom_pages=None):
         """Process a PDF file with OCR
         Args:
@@ -162,11 +189,57 @@ class StructuredOCR:
             # Process the PDF with OCR
             logger.info(f"Processing PDF with OCR using {OCR_MODEL}")
-            pdf_response = self.client.ocr.process(
-                document=DocumentURLChunk(document_url=signed_url.url),
-                model=OCR_MODEL,
-                include_image_base64=True
-            )
             # Limit pages if requested
             pages_to_process = pdf_response.pages
@@ -218,15 +291,15 @@ class StructuredOCR:
                 if first_page_image:
                     # Use vision model
                     logger.info(f"Using vision model: {VISION_MODEL}")
-                    result = self._extract_structured_data_with_vision(first_page_image, all_markdown, file_path.name)
                 else:
-                    # Fall back to text-only model if no image available
-                    logger.info(f"No images in PDF, falling back to text model: {TEXT_MODEL}")
-                    result = self._extract_structured_data_text_only(all_markdown, file_path.name)
             else:
-                # Use text-only model
-                logger.info(f"Using text-only model: {TEXT_MODEL}")
-                result = self._extract_structured_data_text_only(all_markdown, file_path.name)
             # Add page limit info to result if needed
             if limited_pages:
@@ -239,7 +312,8 @@ class StructuredOCR:
             result['confidence_score'] = confidence_score
             # Store key parts of the OCR response for image rendering
-            # Extract and store image data in a format that can be serialized to JSON
             has_images = hasattr(pdf_response, 'pages') and any(hasattr(page, 'images') and page.images for page in pdf_response.pages)
             result['has_images'] = has_images
@@ -282,7 +356,7 @@ class StructuredOCR:
                 }
             }
-    def _process_image(self, file_path, use_vision=True):
         """Process an image file with OCR"""
         logger = logging.getLogger("image_processor")
         logger.info(f"Processing image: {file_path}")
@@ -299,24 +373,43 @@ class StructuredOCR:
                     from PIL import Image
                     import io
-                    # Open and resize the image
                     with Image.open(file_path) as img:
                         # Convert to RGB if not already (prevents mode errors)
                         if img.mode != 'RGB':
                             img = img.convert('RGB')
                         # Calculate new dimensions (maintain aspect ratio)
                         # Target around 2000-3000 pixels on longest side for good OCR quality
-                        width, height = img.size
-                        max_dimension = max(width, height)
                         target_dimension = 2500  # Good balance between quality and size
                         if max_dimension > target_dimension:
                             scale_factor = target_dimension / max_dimension
-                            new_width = int(width * scale_factor)
-                            new_height = int(height * scale_factor)
-                            img = img.resize((new_width, new_height), Image.LANCZOS)
                         # Save to bytes with compression
                         buffer = io.BytesIO()
                         img.save(buffer, format="JPEG", quality=85, optimize=True)
@@ -344,11 +437,64 @@ class StructuredOCR:
             # Process the image with OCR
             logger.info(f"Processing image with OCR using {OCR_MODEL}")
-            image_response = self.client.ocr.process(
-                document=ImageURLChunk(image_url=base64_data_url),
-                model=OCR_MODEL,
-                include_image_base64=True
-            )
             # Get the OCR markdown from the first page
             image_ocr_markdown = image_response.pages[0].markdown if image_response.pages else ""
@@ -364,16 +510,17 @@ class StructuredOCR:
             # Extract structured data using the appropriate model
             if use_vision:
                 logger.info(f"Using vision model: {VISION_MODEL}")
-                result = self._extract_structured_data_with_vision(base64_data_url, image_ocr_markdown, file_path.name)
             else:
                 logger.info(f"Using text-only model: {TEXT_MODEL}")
-                result = self._extract_structured_data_text_only(image_ocr_markdown, file_path.name)
             # Add confidence score
             result['confidence_score'] = confidence_score
             # Store key parts of the OCR response for image rendering
-            # Extract and store image data in a format that can be serialized to JSON
             has_images = hasattr(image_response, 'pages') and image_response.pages and hasattr(image_response.pages[0], 'images') and image_response.pages[0].images
             result['has_images'] = has_images
@@ -416,7 +563,7 @@ class StructuredOCR:
                 }
             }
-    def _extract_structured_data_with_vision(self, image_base64, ocr_markdown, filename):
         """Extract structured data using vision model"""
         try:
             # Parse with vision model with a timeout
@@ -435,6 +582,7 @@ class StructuredOCR:
                                 f"For handwritten documents, carefully preserve the structure. "
                                 f"For printed texts, organize content logically by sections, maintaining the hierarchy. "
                                 f"For tabular content, preserve the table structure as much as possible."
                             ))
                         ],
                     },
@@ -457,12 +605,12 @@ class StructuredOCR:
         return result
-    def _extract_structured_data_text_only(self, ocr_markdown, filename):
-        """Extract structured data using text-only model"""
         try:
-            # Parse with text-only model with a timeout
             chat_response = self.client.chat.parse(
-                model=TEXT_MODEL,
                 messages=[
                     {
                         "role": "user",
@@ -473,6 +621,7 @@ class StructuredOCR:
                                   f"For handwritten documents, carefully preserve the structure. "
                                   f"For printed texts, organize content logically by sections. "
                                   f"For tabular content, preserve the table structure as much as possible."
                     },
                 ],
                 response_format=StructuredOCRModel,

         return "\n\n".join(markdowns)
 # Import config directly (now local to historical-ocr)
+from config import MISTRAL_API_KEY, OCR_MODEL, VISION_MODEL
 # Create language enum for structured output
 languages = {lang.alpha_2: lang.name for lang in pycountry.languages if hasattr(lang, 'alpha_2')}
     def __init__(self, api_key=None):
         """Initialize the OCR processor with API key"""
         self.api_key = api_key or MISTRAL_API_KEY
+        # Ensure we have a valid API key
+        if not self.api_key:
+            raise ValueError("No Mistral API key provided. Please set the MISTRAL_API_KEY environment variable.")
+        # Clean the API key by removing any whitespace
+        self.api_key = self.api_key.strip()
+        # Basic validation of API key format (Mistral keys are typically 32 characters)
+        if len(self.api_key) != 32:
+            logger = logging.getLogger("api_validator")
+            logger.warning(f"API key length ({len(self.api_key)}) is not the expected 32 characters")
+        # Initialize client with the API key
+        try:
+            self.client = Mistral(api_key=self.api_key)
+            # Validate API key by making a small request
+            # This is optional but catches authentication issues early
+            # Uncomment for early validation (costs a small API call)
+            # self.client.models.list()
+        except Exception as e:
+            error_msg = str(e).lower()
+            if "unauthorized" in error_msg or "401" in error_msg:
+                raise ValueError(f"API key authentication failed. Please check your Mistral API key: {str(e)}")
+            else:
+                raise
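
Review note on key validation: the constructor checks for an empty key before stripping whitespace, so a whitespace-only key passes the first check and reaches the client as an empty string. A minimal sketch that strips first; `normalize_api_key` is a hypothetical helper, and the 32-character expectation is taken from the comment in the diff rather than from any documented key format.

```python
import logging

logger = logging.getLogger("api_validator")


def normalize_api_key(raw_key, expected_length=32):
    """Strip and sanity-check a key (hypothetical helper, not in the commit)."""
    key = (raw_key or "").strip()
    # Stripping first means a whitespace-only key is rejected as empty
    if not key:
        raise ValueError("No Mistral API key provided.")
    if len(key) != expected_length:
        # A length mismatch is only a warning, matching the diff's behaviour
        logger.warning(
            "API key length (%d) is not the expected %d characters",
            len(key), expected_length,
        )
    return key
```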
+    def process_file(self, file_path, file_type=None, use_vision=True, max_pages=None, file_size_mb=None, custom_pages=None, custom_prompt=None):
         """Process a file and return structured OCR results
         Args:
         # Read and process the file
         if file_type == "pdf":
+            result = self._process_pdf(file_path, use_vision, max_pages, custom_pages, custom_prompt)
         else:
+            result = self._process_image(file_path, use_vision, custom_prompt)
         # Add processing time information
         processing_time = time.time() - start_time
         return result
+    def _process_pdf(self, file_path, use_vision=True, max_pages=None, custom_pages=None, custom_prompt=None):
         """Process a PDF file with OCR
         Args:
             # Process the PDF with OCR
             logger.info(f"Processing PDF with OCR using {OCR_MODEL}")
+            # Add retry logic with exponential backoff for API errors
+            max_retries = 3
+            retry_delay = 2
+            for retry in range(max_retries):
+                try:
+                    pdf_response = self.client.ocr.process(
+                        document=DocumentURLChunk(document_url=signed_url.url),
+                        model=OCR_MODEL,
+                        include_image_base64=True
+                    )
+                    break  # Success, exit retry loop
+                except Exception as e:
+                    error_msg = str(e)
+                    logger.warning(f"API error on attempt {retry+1}/{max_retries}: {error_msg}")
+                    # Check specific error types to handle them appropriately
+                    error_lower = error_msg.lower()
+                    # Authentication errors - no point in retrying
+                    if "unauthorized" in error_lower or "401" in error_lower:
+                        logger.error("API authentication failed. Check your API key.")
+                        raise ValueError(f"Authentication failed with API key. Please verify your Mistral API key is correct and active: {error_msg}")
+                    # Connection errors - worth retrying
+                    elif "connection" in error_lower or "timeout" in error_lower or "520" in error_msg or "server error" in error_lower:
+                        if retry < max_retries - 1:
+                            # Wait with exponential backoff before retrying
+                            wait_time = retry_delay * (2 ** retry)
+                            logger.info(f"Connection issue detected. Waiting {wait_time}s before retry...")
+                            time.sleep(wait_time)
+                        else:
+                            # Last retry failed
+                            logger.error("Maximum retries reached, API connection error persists.")
+                            raise ValueError(f"Could not connect to Mistral API after {max_retries} attempts: {error_msg}")
+                    # Rate limit errors
+                    elif "rate limit" in error_lower or "429" in error_lower:
+                        if retry < max_retries - 1:
+                            wait_time = retry_delay * (2 ** retry) * 2  # Wait longer for rate limits
+                            logger.info(f"Rate limit exceeded. Waiting {wait_time}s before retry...")
+                            time.sleep(wait_time)
+                        else:
+                            logger.error("Maximum retries reached, rate limit error persists.")
+                            raise ValueError(f"Mistral API rate limit exceeded. Please try again later: {error_msg}")
+                    # Other errors - no retry
+                    else:
+                        logger.error(f"Unrecoverable API error: {error_msg}")
+                        raise
             # Limit pages if requested
             pages_to_process = pdf_response.pages
                 if first_page_image:
                     # Use vision model
                     logger.info(f"Using vision model: {VISION_MODEL}")
+                    result = self._extract_structured_data_with_vision(first_page_image, all_markdown, file_path.name, custom_prompt)
                 else:
+                    # Fall back to vision model but without image
+                    logger.info("No images in PDF, falling back to using vision model without image")
+                    result = self._extract_structured_data_text_only(all_markdown, file_path.name, custom_prompt)
             else:
+                # Use vision model without image
+                logger.info("Using vision model without image")
+                result = self._extract_structured_data_text_only(all_markdown, file_path.name, custom_prompt)
             # Add page limit info to result if needed
             if limited_pages:
             result['confidence_score'] = confidence_score
             # Store key parts of the OCR response for image rendering
+            # First store the raw response for backwards compatibility
+            # Then extract and store image data in a format that can be serialized to JSON
             has_images = hasattr(pdf_response, 'pages') and any(hasattr(page, 'images') and page.images for page in pdf_response.pages)
             result['has_images'] = has_images
                 }
             }
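
Note on the retry loop in `_process_pdf`: the same block is repeated verbatim in `_process_image`. A minimal sketch of how it could be factored into one helper follows. The names `classify_error`, `backoff_delay`, and `call_with_retries` are hypothetical and do not appear in this commit. The substring checks and the delays (2s and 4s, doubled for rate limits) mirror the diff; the `sleep` parameter is an added assumption that makes the helper testable without waiting.

```python
import time


def classify_error(message):
    """Map an error message to 'auth', 'transient', 'rate_limit', or 'fatal'."""
    lower = message.lower()
    if "unauthorized" in lower or "401" in lower:
        return "auth"  # retrying cannot fix a bad API key
    if ("connection" in lower or "timeout" in lower
            or "520" in lower or "server error" in lower):
        return "transient"
    if "rate limit" in lower or "429" in lower:
        return "rate_limit"
    return "fatal"


def backoff_delay(kind, attempt, base_delay=2):
    """Exponential backoff; rate-limit errors wait twice as long."""
    delay = base_delay * (2 ** attempt)
    return delay * 2 if kind == "rate_limit" else delay


def call_with_retries(func, max_retries=3, base_delay=2, sleep=time.sleep):
    """Call func(), retrying only transient and rate-limit failures."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:
            kind = classify_error(str(exc))
            if kind == "auth":
                raise ValueError(f"Authentication failed: {exc}") from exc
            if kind == "fatal":
                raise  # unknown error: propagate unchanged
            if attempt == max_retries - 1:
                raise ValueError(
                    f"Giving up after {max_retries} attempts: {exc}"
                ) from exc
            sleep(backoff_delay(kind, attempt, base_delay))
```

With such a helper, each call site would shrink to something like `call_with_retries(lambda: self.client.ocr.process(...))`, keeping the arguments already shown in the diff.
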
+    def _process_image(self, file_path, use_vision=True, custom_prompt=None):
         """Process an image file with OCR"""
         logger = logging.getLogger("image_processor")
         logger.info(f"Processing image: {file_path}")
                     from PIL import Image
                     import io
+                    # Open and process the image
                     with Image.open(file_path) as img:
                         # Convert to RGB if not already (prevents mode errors)
                         if img.mode != 'RGB':
                             img = img.convert('RGB')
+                        # Detect and correct orientation based on aspect ratio
+                        # For OCR, portrait (vertical) orientation typically works better
+                        width, height = img.size
+                        # If image is horizontally oriented (landscape) and significantly wider than tall
+                        # OCR models often work better with portrait orientation
+                        is_horizontal = width > height and (width / height) > 1.2
+                        # For documents, we can also use a heuristic that very wide images might need rotation
+                        needs_rotation = is_horizontal and width > 1000 and (width / height) > 1.5
+                        # Rotate if needed for OCR processing
+                        if needs_rotation:
+                            logger.info("Detected horizontal document, rotating for better OCR performance")
+                            # Try to determine whether to rotate 90° clockwise or counterclockwise
+                            # For OCR, we generally want to ensure text reads from left to right
+                            # Simple approach: rotate counterclockwise by default (often correct for scanned docs)
+                            img = img.transpose(Image.ROTATE_90)
                         # Calculate new dimensions (maintain aspect ratio)
                         # Target around 2000-3000 pixels on longest side for good OCR quality
+                        new_width, new_height = img.size  # Now potentially rotated
+                        max_dimension = max(new_width, new_height)
                         target_dimension = 2500  # Good balance between quality and size
                         if max_dimension > target_dimension:
                             scale_factor = target_dimension / max_dimension
+                            resized_width = int(new_width * scale_factor)
+                            resized_height = int(new_height * scale_factor)
+                            img = img.resize((resized_width, resized_height), Image.LANCZOS)
                         # Save to bytes with compression
                         buffer = io.BytesIO()
                         img.save(buffer, format="JPEG", quality=85, optimize=True)
             # Process the image with OCR
             logger.info(f"Processing image with OCR using {OCR_MODEL}")
+            # Log API key information (first and last characters only)
+            if self.api_key:
+                key_preview = f"{self.api_key[:3]}...{self.api_key[-3:]}"
+                logger.info(f"Using API key: {key_preview} (length: {len(self.api_key)})")
+            else:
+                logger.error("No API key provided!")
+            # Add retry logic with exponential backoff for API errors
+            max_retries = 3
+            retry_delay = 2
+            for retry in range(max_retries):
+                try:
+                    image_response = self.client.ocr.process(
+                        document=ImageURLChunk(image_url=base64_data_url),
+                        model=OCR_MODEL,
+                        include_image_base64=True
+                    )
+                    break  # Success, exit retry loop
+                except Exception as e:
+                    error_msg = str(e)
+                    logger.warning(f"API error on attempt {retry+1}/{max_retries}: {error_msg}")
+                    # Check specific error types to handle them appropriately
+                    error_lower = error_msg.lower()
+                    # Authentication errors - no point in retrying
+                    if "unauthorized" in error_lower or "401" in error_lower:
+                        logger.error("API authentication failed. Check your API key.")
+                        raise ValueError(f"Authentication failed with API key. Please verify your Mistral API key is correct and active: {error_msg}")
+                    # Connection errors - worth retrying
+                    elif "connection" in error_lower or "timeout" in error_lower or "520" in error_msg or "server error" in error_lower:
+                        if retry < max_retries - 1:
+                            # Wait with exponential backoff before retrying
+                            wait_time = retry_delay * (2 ** retry)
+                            logger.info(f"Connection issue detected. Waiting {wait_time}s before retry...")
+                            time.sleep(wait_time)
+                        else:
+                            # Last retry failed
+                            logger.error("Maximum retries reached, API connection error persists.")
+                            raise ValueError(f"Could not connect to Mistral API after {max_retries} attempts: {error_msg}")
+                    # Rate limit errors
+                    elif "rate limit" in error_lower or "429" in error_lower:
+                        if retry < max_retries - 1:
+                            wait_time = retry_delay * (2 ** retry) * 2  # Wait longer for rate limits
+                            logger.info(f"Rate limit exceeded. Waiting {wait_time}s before retry...")
+                            time.sleep(wait_time)
+                        else:
+                            logger.error("Maximum retries reached, rate limit error persists.")
+                            raise ValueError(f"Mistral API rate limit exceeded. Please try again later: {error_msg}")
+                    # Other errors - no retry
+                    else:
+                        logger.error(f"Unrecoverable API error: {error_msg}")
+                        raise
             # Get the OCR markdown from the first page
             image_ocr_markdown = image_response.pages[0].markdown if image_response.pages else ""
             # Extract structured data using the appropriate model
             if use_vision:
                 logger.info(f"Using vision model: {VISION_MODEL}")
+                result = self._extract_structured_data_with_vision(base64_data_url, image_ocr_markdown, file_path.name, custom_prompt)
             else:
                 logger.info(f"Using text-only model: {TEXT_MODEL}")
+                result = self._extract_structured_data_text_only(image_ocr_markdown, file_path.name, custom_prompt)
             # Add confidence score
             result['confidence_score'] = confidence_score
             # Store key parts of the OCR response for image rendering
+            # First store the raw response for backwards compatibility
+            # Then extract and store image data in a format that can be serialized to JSON
             has_images = hasattr(image_response, 'pages') and image_response.pages and hasattr(image_response.pages[0], 'images') and image_response.pages[0].images
             result['has_images'] = has_images
                 }
             }
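
Note on the preprocessing in `_process_image`: the rotation heuristic and the resize arithmetic can be checked in isolation from Pillow. The sketch below restates them as pure functions over `(width, height)`. The names `needs_rotation` and `fit_within` are hypothetical. The thresholds (width above 1000 px, aspect ratio above 1.5, target of 2500 px on the longest side) are taken from the diff. The diff's separate `is_horizontal` test (ratio above 1.2) is implied by the 1.5 threshold, so it is omitted here.

```python
def needs_rotation(width, height, min_width=1000, min_ratio=1.5):
    """Return True for wide landscape images, mirroring the diff's heuristic."""
    if height <= 0:
        return False  # guard against division by zero on degenerate input
    return width > min_width and (width / height) > min_ratio


def fit_within(width, height, target=2500):
    """Scale (width, height) so the longest side is at most `target` pixels."""
    longest = max(width, height)
    if longest <= target:
        return width, height  # already small enough, leave unchanged
    scale = target / longest
    return int(width * scale), int(height * scale)


# A 5000x2500 landscape scan: rotated first, then resized.
w, h = 5000, 2500
if needs_rotation(w, h):
    w, h = h, w  # a 90-degree rotation swaps the dimensions
resized = fit_within(w, h)
```

Because the heuristic looks only at aspect ratio, it would also rotate legitimately landscape material such as maps or ledger spreads; a caller may want an opt-out for those cases.
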
+    def _extract_structured_data_with_vision(self, image_base64, ocr_markdown, filename, custom_prompt=None):
         """Extract structured data using vision model"""
         try:
             # Parse with vision model with a timeout
                                 f"For handwritten documents, carefully preserve the structure. "
                                 f"For printed texts, organize content logically by sections, maintaining the hierarchy. "
                                 f"For tabular content, preserve the table structure as much as possible."
+                                + (f"\n\nAdditional instructions: {custom_prompt}" if custom_prompt else "")
                             ))
                         ],
                     },
         return result
+    def _extract_structured_data_text_only(self, ocr_markdown, filename, custom_prompt=None):
+        """Extract structured data without using vision capabilities"""
         try:
+            # Parse with vision model but without image
             chat_response = self.client.chat.parse(
+                model=VISION_MODEL,
                 messages=[
                     {
                         "role": "user",
                                   f"For handwritten documents, carefully preserve the structure. "
                                   f"For printed texts, organize content logically by sections. "
                                   f"For tabular content, preserve the table structure as much as possible."
+                                  + (f"\n\nAdditional instructions: {custom_prompt}" if custom_prompt else "")
                     },
                 ],
                 response_format=StructuredOCRModel,
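
Note on `custom_prompt`: both extraction methods append the same "Additional instructions" suffix inline. A minimal sketch of a shared helper follows, assuming a hypothetical function name `build_prompt` that is not part of this commit. Unlike the diff, it also ignores whitespace-only prompts, which is an added assumption.

```python
def build_prompt(base_instructions, custom_prompt=None):
    """Append user-supplied instructions to the base prompt, if any were given."""
    if custom_prompt and custom_prompt.strip():
        return (
            f"{base_instructions}"
            f"\n\nAdditional instructions: {custom_prompt.strip()}"
        )
    return base_instructions
```
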

ui/custom.css CHANGED

@@ -300,4 +300,44 @@
 .stTabs [data-baseweb="tab-highlight"] {
     background-color: var(--color-blue-600);
+}
+/* Workflow steps */
+.workflow-step {
+    background-color: var(--color-gray-800);
+    border-radius: 8px;
+    padding: 15px;
+    border-left: 5px solid var(--color-blue-500);
+    margin-bottom: 15px;
+}
+.workflow-step.active {
+    border-left: 5px solid var(--color-blue-400);
+    background-color: var(--color-blue-900);
+}
+.workflow-step.complete {
+    border-left: 5px solid var(--color-blue-300);
+    background-color: var(--color-gray-700);
+}
+/* Before-after comparison */
+.comparison-container {
+    display: flex;
+    justify-content: space-between;
+    gap: 10px;
+    margin-bottom: 20px;
+}
+.comparison-image {
+    flex: 1;
+    border-radius: 8px;
+    overflow: hidden;
+    border: 1px solid var(--color-gray-300);
+}
+.comparison-title {
+    text-align: center;
+    font-weight: bold;
+    margin-bottom: 5px;
 }