milwright committed
Commit 8a9c37d · 1 Parent(s): 4ddf559

Update documentation and improve HTML formatting


- Enhanced README with detailed features and documentation
- Improved About section with new special features
- Updated HTML formatting for better historical document presentation
- Added specialized poem formatting and multi-page support
- Removed redundant Technical Details expander

Files changed (5):

1. CLAUDE.md +35 -0
2. README.md +26 -8
3. app.py +285 -167
4. ocr_utils.py +161 -2
5. structured_ocr.py +58 -6
CLAUDE.md ADDED
@@ -0,0 +1,35 @@
+ # Historical OCR Project Guidelines
+
+ ## Commands
+ - Run standard app: `./run_local.sh` or `streamlit run app.py`
+ - Run educational version: `./run_local.sh educational` or `streamlit run streamlit_app.py`
+ - Run simple test: `python simple_test.py`
+ - Run PDF test: `python test_pdf.py`
+ - Process large files: `./run_large_files.sh --server.maxUploadSize=500 --server.maxMessageSize=500`
+ - Prepare for Hugging Face: `python prepare_for_hf.py`
+
+ ## Environment Setup
+ - API key: Set `MISTRAL_API_KEY` in `.env` file or as environment variable
+ - System dependencies: Install poppler for PDF processing (`brew install poppler` on macOS)
+ - Python dependencies: `pip install -r requirements.txt`
+
+ ## Code Style
+ - Imports: Standard library → third-party → local imports
+ - Documentation: Google-style docstrings with Args, Returns sections
+ - Error handling: Specific exceptions with informative messages, logging
+ - Naming: snake_case for variables/functions, PascalCase for classes
+ - Type hints: Pydantic models for structured data, typing module annotations
+
+ ## Project Structure
+ - Core: `structured_ocr.py` - OCR processing with Mistral AI
+ - Utils: `ocr_utils.py` - Text/image processing utilities
+ - PDF: `pdf_ocr.py` - PDF-specific document handling
+ - Config: `config.py` - API settings, model selection
+ - UI: Streamlit interface with modular components
+ - Testing: Simple test scripts in project root
+
+ ## Development Workflow
+ - Use logging for debugging (configured in structured_ocr.py)
+ - Test with sample files in input/ directory
+ - Handle large files with specific options for optimal processing
+ - Check confidence_score in results to evaluate OCR quality
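The last workflow item, checking `confidence_score`, can be sketched as follows. This is a minimal illustration: the result-dict fields match those built in app.py, but the 0.7 threshold is an assumed example, not a project constant.

```python
# Sketch: evaluate OCR quality from a result dict as built in app.py.
# The 0.7 threshold is an illustrative assumption, not a project constant.
def ocr_quality(result: dict) -> str:
    if "error" in result:
        return "failed"
    score = result.get("confidence_score", 0.0)
    return "good" if score >= 0.7 else "low"

sample = {"file_name": "doc.pdf", "confidence_score": 0.85, "ocr_contents": {}}
print(ocr_quality(sample))  # → good
```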
README.md CHANGED

@@ -13,19 +13,23 @@ short_description: Employs Mistral OCR for transcribing historical data
 
  # Historical Document OCR
 
- This application uses Mistral AI's OCR capabilities to transcribe and extract information from historical documents.
+ This application uses Mistral AI's OCR capabilities to transcribe and extract information from historical documents with enhanced formatting and presentation.
 
  ## Features
 
  - OCR processing for both image and PDF files
- - Automatic file type detection
- - Structured output generation using Mistral models
+ - Automatic file type detection and content structuring
+ - Advanced HTML formatting with proper document structure preservation
+ - Specialized formatting for poems and historical texts
  - Interactive web interface with Streamlit
- - Supports historical documents and manuscripts
- - PDF preview functionality for better user experience
+ - "With Images" view that preserves document layout and embedded images
+ - Multi-page document support with pagination
+ - PDF preview functionality
  - Smart handling of large PDFs with automatic page limiting
- - Robust error handling with helpful messages
  - Image preprocessing options for enhanced OCR accuracy
+ - Document export in multiple formats (HTML, JSON)
+ - Responsive design optimized for historical document presentation
+ - Enhanced typography with appropriate fonts for historical content
 
  ## Project Structure
 
@@ -69,7 +73,8 @@ Historical OCR - Project Structure
  │ └─ process_file.py # File processing for educational app
 
  ├─ UI Components (ui/)
- └─ layout.py # Shared UI components and styling
+ ├─ layout.py # Shared UI components and styling
+ │ └─ custom.css # Custom styling for the application
 
  ├─ Data Directories
  │ ├─ input/ # Sample documents for testing/demo
@@ -117,7 +122,20 @@ streamlit run app.py
  1. Upload an image or PDF file using the file uploader
  2. Select processing options in the sidebar (e.g., use vision model, image preprocessing)
  3. Click "Process Document" to analyze the file
- 4. View the structured results and extract information
+ 4. View the results in three available formats:
+ - **Structured View**: Beautifully formatted HTML with proper document structure
+ - **Raw JSON**: Complete data structure for developers
+ - **With Images**: Document with embedded images preserving original layout
+
+ ## Document Output Features
+
+ The application provides several specialized features for historical document presentation:
+
+ 1. **Poetry Formatting**: Special handling for poem structure with proper line spacing and typography
+ 2. **Image Embedding**: Original document images embedded within the text at their correct positions
+ 3. **Multi-page Support**: Pagination controls for navigating multi-page documents
+ 4. **Typography**: Historical-appropriate fonts and styling for better readability of historical texts
+ 5. **Document Export**: Download options for saving the processed document in HTML format
 
  ## Application Versions
app.py CHANGED
@@ -146,17 +146,17 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None):
146
  # Get file size in MB
147
  file_size_mb = os.path.getsize(temp_path) / (1024 * 1024)
148
 
149
- # Check if file exceeds size limits (200 MB)
150
- if file_size_mb > 200:
151
- st.error(f"File too large ({file_size_mb:.1f} MB). Maximum file size is 200MB.")
152
  return {
153
  "file_name": uploaded_file.name,
154
  "topics": ["Document"],
155
  "languages": ["English"],
156
  "confidence_score": 0.0,
157
- "error": f"File size {file_size_mb:.2f} MB exceeds limit of 200 MB",
158
  "ocr_contents": {
159
- "error": f"Failed to process file: File size {file_size_mb:.2f} MB exceeds limit of 200 MB",
160
  "partial_text": "Document could not be processed due to size limitations."
161
  }
162
  }
@@ -190,7 +190,7 @@ st.title("Historical Document OCR")
190
  st.subheader("Powered by Mistral AI")
191
 
192
  # Create main layout with tabs and columns
193
- main_tab1, main_tab2 = st.tabs(["Document Processing", "About"])
194
 
195
  with main_tab1:
196
  # Create a two-column layout for file upload and preview
@@ -203,7 +203,7 @@ with main_tab1:
203
 
204
  Using the `mistral-ocr-latest` model for advanced document understanding.
205
  """)
206
- uploaded_file = st.file_uploader("Choose a file", type=["pdf", "png", "jpg", "jpeg"], help="Limit 200MB per file")
207
 
208
  # Sidebar with options
209
  with st.sidebar:
@@ -240,7 +240,7 @@ with main_tab2:
240
  st.markdown("""
241
  ### About This Application
242
 
243
- This app uses [Mistral AI's Document OCR](https://docs.mistral.ai/capabilities/document/) to extract text and images from historical documents.
244
 
245
  It can process:
246
  - Image files (jpg, png, etc.)
@@ -250,26 +250,71 @@ with main_tab2:
250
  - Text extraction with `mistral-ocr-latest`
251
  - Analysis with language models
252
  - Layout preservation with images
 
253
 
254
  View results in three formats:
255
- - Structured HTML view
256
- - Raw JSON (for developers)
257
- - Markdown with images (preserves document layout)
258
 
259
- **New Features:**
 
 
 
 
 
 
 
260
  - Image preprocessing for better OCR quality
261
  - PDF resolution and page controls
262
  - Progress tracking during processing
 
263
  """)
264
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
265
  with main_tab1:
266
  if uploaded_file is not None:
267
- # Check file size (cap at 200MB)
268
  file_size_mb = len(uploaded_file.getvalue()) / (1024 * 1024)
269
 
270
- if file_size_mb > 200:
271
  with upload_col:
272
- st.error(f"File too large ({file_size_mb:.1f} MB). Maximum file size is 200MB.")
273
  st.stop()
274
 
275
  file_ext = Path(uploaded_file.name).suffix.lower()
@@ -331,10 +376,8 @@ with main_tab1:
331
  # Call process_file with all options
332
  result = process_file(uploaded_file, use_vision, preprocessing_options)
333
 
334
- # Create results tabs for better organization
335
- results_tab1, results_tab2 = st.tabs(["Document Analysis", "Technical Details"])
336
-
337
- with results_tab1:
338
  # Create two columns for metadata and content
339
  meta_col, content_col = st.columns([1, 2])
340
 
@@ -368,12 +411,7 @@ with main_tab1:
368
  st.subheader("Document Contents")
369
  if 'ocr_contents' in result:
370
  # Check if there are images in the OCR result
371
- has_images = False
372
- if 'raw_response' in result:
373
- try:
374
- has_images = any(page.images for page in result['raw_response'].pages)
375
- except Exception:
376
- has_images = False
377
 
378
  # Create tabs for different views
379
  if has_images:
@@ -383,37 +421,148 @@ with main_tab1:
383
 
384
  with view_tab1:
385
  # Display in a more user-friendly format based on the content structure
386
- html_content = ""
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
387
  if isinstance(result['ocr_contents'], dict):
388
  for section, content in result['ocr_contents'].items():
389
- if content: # Only display non-empty sections
390
- section_title = f"<h4>{section.replace('_', ' ').title()}</h4>"
391
- html_content += section_title
392
 
393
- if isinstance(content, str):
394
- html_content += f"<p>{content}</p>"
395
- st.markdown(f"#### {section.replace('_', ' ').title()}")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
396
  st.markdown(content)
397
- elif isinstance(content, list):
398
- html_list = "<ul>"
399
- st.markdown(f"#### {section.replace('_', ' ').title()}")
400
- for item in content:
401
- if isinstance(item, str):
402
- html_list += f"<li>{item}</li>"
403
- st.markdown(f"- {item}")
404
- elif isinstance(item, dict):
405
- html_list += f"<li>{json.dumps(item)}</li>"
406
- st.json(item)
407
- html_list += "</ul>"
408
- html_content += html_list
409
- elif isinstance(content, dict):
410
- html_dict = "<dl>"
411
- st.markdown(f"#### {section.replace('_', ' ').title()}")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
412
  for k, v in content.items():
413
- html_dict += f"<dt><strong>{k}</strong></dt><dd>{v}</dd>"
 
 
 
 
 
 
 
414
  st.markdown(f"**{k}:** {v}")
415
- html_dict += "</dl>"
416
- html_content += html_dict
 
 
417
 
418
  # Add download button in a smaller section
419
  with st.expander("Export Content"):
@@ -437,125 +586,60 @@ with main_tab1:
437
  try:
438
  # Import function
439
  try:
440
- from ocr_utils import get_combined_markdown
441
  except ImportError:
442
  st.error("Required module ocr_utils not found.")
443
  st.stop()
444
 
445
- # Check if raw_response is available
446
- if 'raw_response' not in result:
447
- st.warning("Raw OCR response not available. Cannot display images.")
448
- st.stop()
449
-
450
- # Validate the raw_response structure before processing
451
- if not hasattr(result['raw_response'], 'pages'):
452
- st.warning("Invalid OCR response format. Cannot display images.")
453
- st.stop()
454
-
455
- # Get the combined markdown with images
456
- # Set a flag to compress images if needed
457
- compress_images = True
458
- max_image_width = 800 # Maximum width for images
459
-
460
- try:
461
- # First try to get combined markdown with compressed images
462
- if compress_images and hasattr(result['raw_response'], 'pages'):
463
- from ocr_utils import get_combined_markdown_compressed
464
- combined_markdown = get_combined_markdown_compressed(
465
- result['raw_response'],
466
- max_width=max_image_width,
467
- quality=85
468
- )
469
- else:
470
- # Fall back to regular method if compression not available
471
- combined_markdown = get_combined_markdown(result['raw_response'])
472
- except (ImportError, AttributeError):
473
- # Fall back to regular method
474
- combined_markdown = get_combined_markdown(result['raw_response'])
475
-
476
- if not combined_markdown or combined_markdown.strip() == "":
477
- st.warning("No image content found in the document.")
478
  st.stop()
479
 
480
- # Check if there are many images that might cause loading issues
481
- image_count = sum(len(page.images) for page in result['raw_response'].pages if hasattr(page, 'images'))
 
 
482
 
483
  # Add warning for image-heavy documents
484
  if image_count > 10:
485
  st.warning(f"This document contains {image_count} images. Rendering may take longer than usual.")
486
-
487
- # Add CSS to ensure proper spacing and handling of text and images
488
- st.markdown("""
489
- <style>
490
- .markdown-text-container {
491
- padding: 10px;
492
- background-color: #f9f9f9;
493
- border-radius: 5px;
494
- }
495
- .markdown-text-container img {
496
- margin: 15px 0;
497
- max-width: 100%;
498
- border: 1px solid #ddd;
499
- border-radius: 4px;
500
- display: block;
501
- }
502
- .markdown-text-container p {
503
- margin-bottom: 16px;
504
- line-height: 1.6;
505
- }
506
- /* Add lazy loading for images to improve performance */
507
- .markdown-text-container img {
508
- loading: lazy;
509
- }
510
- </style>
511
- """, unsafe_allow_html=True)
512
 
513
- # For very image-heavy documents, show images in a paginated way
514
- if image_count > 20:
515
- # Show image content in a paginated way
516
- st.write("Document contains many images. Showing in a paginated format:")
517
-
518
- # Split the combined markdown by page separators
519
- pages = combined_markdown.split("---")
 
520
 
521
  # Create a page selector
522
- page_num = st.selectbox("Select page to view:",
523
- options=list(range(1, len(pages)+1)),
524
- index=0)
525
 
526
- # Display only the selected page
527
- st.markdown(f"""
528
- <div class="markdown-text-container">
529
- {pages[page_num-1]}
530
- </div>
531
- """, unsafe_allow_html=True)
532
 
533
- # Add note about pagination
534
- st.info(f"Showing page {page_num} of {len(pages)}. Select a different page from the dropdown above.")
535
- else:
536
- # Wrap the markdown in a div with the class for styling
537
  st.markdown(f"""
538
- <div class="markdown-text-container">
539
- {combined_markdown}
540
- </div>
 
 
 
 
 
541
  """, unsafe_allow_html=True)
542
 
543
- # Add a download button for the combined content
 
 
 
544
  st.download_button(
545
  label="Download with Images (HTML)",
546
- data=f"""
547
- <html>
548
- <head>
549
- <style>
550
- body {{ font-family: Arial, sans-serif; line-height: 1.6; }}
551
- img {{ max-width: 100%; margin: 15px 0; }}
552
- </style>
553
- </head>
554
- <body>
555
- {combined_markdown}
556
- </body>
557
- </html>
558
- """,
559
  file_name="document_with_images.html",
560
  mime="text/html"
561
  )
@@ -565,10 +649,6 @@ with main_tab1:
565
  st.info("Try refreshing or processing the document again.")
566
  else:
567
  st.error("No OCR content was extracted from the document.")
568
-
569
- with results_tab2:
570
- st.subheader("Raw Processing Results")
571
- st.json(result)
572
 
573
  except Exception as e:
574
  st.error(f"Error processing document: {str(e)}")
@@ -577,25 +657,63 @@ with main_tab1:
577
  st.info("Upload a document to get started using the file uploader above.")
578
 
579
  # Show example images in a grid
580
- st.subheader("Example Documents")
581
-
582
  # Add a sample images container
583
  with st.container():
584
  # Find sample images from the input directory to display
585
  input_dir = Path(__file__).parent / "input"
586
  sample_images = []
587
  if input_dir.exists():
588
- # Find valid jpg files (with size > 50KB to avoid placeholders)
589
- sample_images = [
590
- path for path in input_dir.glob("*.jpg")
591
- if path.stat().st_size > 50000
592
- ][:3] # Limit to 3 samples
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
593
 
594
  if sample_images:
595
- columns = st.columns(3)
596
- for i, img_path in enumerate(sample_images):
597
- with columns[i % 3]:
598
- try:
599
- st.image(str(img_path), caption=img_path.name, use_container_width=True)
600
- except Exception as e:
601
- st.error(f"Error loading image {img_path.name}: {str(e)}")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
146
  # Get file size in MB
147
  file_size_mb = os.path.getsize(temp_path) / (1024 * 1024)
148
 
149
+ # Check if file exceeds size limits (20 MB)
150
+ if file_size_mb > 20:
151
+ st.error(f"File too large ({file_size_mb:.1f} MB). Maximum file size is 20MB.")
152
  return {
153
  "file_name": uploaded_file.name,
154
  "topics": ["Document"],
155
  "languages": ["English"],
156
  "confidence_score": 0.0,
157
+ "error": f"File size {file_size_mb:.2f} MB exceeds limit of 20 MB",
158
  "ocr_contents": {
159
+ "error": f"Failed to process file: File size {file_size_mb:.2f} MB exceeds limit of 20 MB",
160
  "partial_text": "Document could not be processed due to size limitations."
161
  }
162
  }
 
190
  st.subheader("Powered by Mistral AI")
191
 
192
  # Create main layout with tabs and columns
193
+ main_tab1, main_tab2, main_tab3 = st.tabs(["Document Processing", "About this App", "Companion Workshop"])
194
 
195
  with main_tab1:
196
  # Create a two-column layout for file upload and preview
 
203
 
204
  Using the `mistral-ocr-latest` model for advanced document understanding.
205
  """)
206
+ uploaded_file = st.file_uploader("Choose a file", type=["pdf", "png", "jpg", "jpeg"], help="Limit 20MB per file")
207
 
208
  # Sidebar with options
209
  with st.sidebar:
 
240
  st.markdown("""
241
  ### About This Application
242
 
243
+ This app uses [Mistral AI's Document OCR](https://docs.mistral.ai/capabilities/document/) to extract text and images from historical documents with enhanced formatting and presentation.
244
 
245
  It can process:
246
  - Image files (jpg, png, etc.)
 
250
  - Text extraction with `mistral-ocr-latest`
251
  - Analysis with language models
252
  - Layout preservation with images
253
+ - Enhanced typography for historical documents
254
 
255
  View results in three formats:
256
+ - **Structured View**: Beautifully formatted HTML with proper document structure
257
+ - **Raw JSON**: Complete data structure for developers
258
+ - **With Images**: Document with embedded images preserving original layout
259
 
260
+ **Special Features:**
261
+ - **Poetry Formatting**: Special handling for poem structure with proper line spacing
262
+ - **Image Embedding**: Original document images embedded at correct positions
263
+ - **Multi-page Support**: Pagination controls for navigating multi-page documents
264
+ - **Typography**: Historical-appropriate fonts for better readability
265
+ - **Document Export**: Download options for saving in HTML format
266
+
267
+ **Technical Features:**
268
  - Image preprocessing for better OCR quality
269
  - PDF resolution and page controls
270
  - Progress tracking during processing
271
+ - Responsive design optimized for historical document presentation
272
  """)
273
 
274
+ # Workshop tab content
275
+ with main_tab3:
276
+ st.markdown("<h3>Hacking AI for Historical Research</h3>", unsafe_allow_html=True)
277
+ st.markdown("<p style='margin-bottom: 20px;'>Interactive workshop resources and materials</p>", unsafe_allow_html=True)
278
+
279
+ # Custom CSS to improve the Padlet embed appearance
280
+ st.markdown("""
281
+ <style>
282
+ .padlet-container {
283
+ border-radius: 8px;
284
+ box-shadow: 0 4px 6px rgba(0,0,0,0.1);
285
+ margin-top: 10px;
286
+ margin-bottom: 20px;
287
+ overflow: hidden;
288
+ }
289
+ </style>
290
+ """, unsafe_allow_html=True)
291
+
292
+ # Padlet embed with additional container
293
+ st.markdown("""
294
+ <div class="padlet-container">
295
+ <div class="padlet-embed" style="border:1px solid rgba(0,0,0,0.1);border-radius:8px;box-sizing:border-box;overflow:hidden;position:relative;width:100%;background:#F4F4F4">
296
+ <p style="padding:0;margin:0">
297
+ <iframe src="https://padlet.com/embed/y9daf9yabqcj93dq" frameborder="0" allow="camera;microphone;geolocation" style="width:100%;height:650px;display:block;padding:0;margin:0"></iframe>
298
+ </p>
299
+ <div style="display:flex;align-items:center;justify-content:end;margin:0;height:28px">
300
+ <a href="https://padlet.com?ref=embed" style="display:block;flex-grow:0;margin:0;border:none;padding:0;text-decoration:none" target="_blank">
301
+ <div style="display:flex;align-items:center;">
302
+ <img src="https://padlet.net/embeds/made_with_padlet_2022.png" width="114" height="28" style="padding:0;margin:0;background:0 0;border:none;box-shadow:none" alt="Made with Padlet">
303
+ </div>
304
+ </a>
305
+ </div>
306
+ </div>
307
+ </div>
308
+ """, unsafe_allow_html=True)
309
+
310
  with main_tab1:
311
  if uploaded_file is not None:
312
+ # Check file size (cap at 20MB)
313
  file_size_mb = len(uploaded_file.getvalue()) / (1024 * 1024)
314
 
315
+ if file_size_mb > 20:
316
  with upload_col:
317
+ st.error(f"File too large ({file_size_mb:.1f} MB). Maximum file size is 20MB.")
318
  st.stop()
319
 
320
  file_ext = Path(uploaded_file.name).suffix.lower()
 
376
  # Call process_file with all options
377
  result = process_file(uploaded_file, use_vision, preprocessing_options)
378
 
379
+ # Single tab for document analysis
380
+ with st.container():
 
 
381
  # Create two columns for metadata and content
382
  meta_col, content_col = st.columns([1, 2])
383
 
 
411
  st.subheader("Document Contents")
412
  if 'ocr_contents' in result:
413
  # Check if there are images in the OCR result
414
+ has_images = result.get('has_images', False)
 
 
 
 
 
415
 
416
  # Create tabs for different views
417
  if has_images:
 
421
 
422
  with view_tab1:
423
  # Display in a more user-friendly format based on the content structure
424
+ html_content = '<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="UTF-8">\n<meta name="viewport" content="width=device-width, initial-scale=1.0">\n<title>OCR Document</title>\n<style>\n'
425
+ html_content += """
426
+ body {
427
+ font-family: 'Georgia', serif;
428
+ line-height: 1.6;
429
+ margin: 0;
430
+ padding: 20px;
431
+ background-color: #f9f9f9;
432
+ color: #333;
433
+ }
434
+ .container {
435
+ max-width: 1000px;
436
+ margin: 0 auto;
437
+ background-color: #fff;
438
+ padding: 30px;
439
+ border-radius: 8px;
440
+ box-shadow: 0 4px 12px rgba(0,0,0,0.1);
441
+ }
442
+ h1, h2, h3, h4 {
443
+ font-family: 'Bookman', 'Georgia', serif;
444
+ margin-top: 1.5em;
445
+ margin-bottom: 0.5em;
446
+ color: #222;
447
+ }
448
+ h1 { font-size: 2.2em; border-bottom: 2px solid #e0e0e0; padding-bottom: 10px; }
449
+ h2 { font-size: 1.8em; border-bottom: 1px solid #e0e0e0; padding-bottom: 6px; }
450
+ h3 { font-size: 1.5em; }
451
+ h4 { font-size: 1.2em; }
452
+ p { margin-bottom: 1.2em; text-align: justify; }
453
+ ul, ol { margin-bottom: 1.5em; }
454
+ li { margin-bottom: 0.5em; }
455
+ .poem {
456
+ font-family: 'Baskerville', 'Georgia', serif;
457
+ margin-left: 2em;
458
+ line-height: 1.8;
459
+ white-space: pre-wrap;
460
+ }
461
+ .subtitle {
462
+ font-style: italic;
463
+ font-size: 1.1em;
464
+ margin-bottom: 1.5em;
465
+ color: #555;
466
+ }
467
+ blockquote {
468
+ border-left: 3px solid #ccc;
469
+ margin: 1.5em 0;
470
+ padding: 0.5em 1.5em;
471
+ background-color: #f5f5f5;
472
+ font-style: italic;
473
+ }
474
+ dl {
475
+ margin-bottom: 1.5em;
476
+ }
477
+ dt {
478
+ font-weight: bold;
479
+ margin-top: 1em;
480
+ }
481
+ dd {
482
+ margin-left: 2em;
483
+ margin-bottom: 0.5em;
484
+ }
485
+ </style>
486
+ </head>
487
+ <body>
488
+ <div class="container">
489
+ """
490
+
491
  if isinstance(result['ocr_contents'], dict):
492
  for section, content in result['ocr_contents'].items():
493
+ if not content: # Skip empty sections
494
+ continue
 
495
 
496
+ section_title = section.replace('_', ' ').title()
497
+
498
+ # Special handling for title and subtitle
499
+ if section.lower() == 'title':
500
+ html_content += f'<h1>{content}</h1>\n'
501
+ st.markdown(f"## {content}")
502
+ elif section.lower() == 'subtitle':
503
+ html_content += f'<div class="subtitle">{content}</div>\n'
504
+ st.markdown(f"*{content}*")
505
+ else:
506
+ # Section headers for non-title sections
507
+ html_content += f'<h3>{section_title}</h3>\n'
508
+ st.markdown(f"### {section_title}")
509
+
510
+ # Process different content types
511
+ if isinstance(content, str):
512
+ # Handle poem type specifically
513
+ if section.lower() == 'type' and content.lower() == 'poem':
514
+ # Don't add special formatting here, just for the lines
515
  st.markdown(content)
516
+ html_content += f'<p>{content}</p>\n'
517
+ elif 'content' in result['ocr_contents'] and isinstance(result['ocr_contents']['content'], dict) and 'type' in result['ocr_contents']['content'] and result['ocr_contents']['content']['type'] == 'poem' and section.lower() == 'content':
518
+ # This is handled in the dict case below
519
+ pass
520
+ else:
521
+ # Regular text content
522
+ paragraphs = content.split('\n\n')
523
+ for p in paragraphs:
524
+ if p.strip():
525
+ html_content += f'<p>{p.strip()}</p>\n'
526
+ st.markdown(content)
527
+
528
+ elif isinstance(content, list):
529
+ # Handle lists (bullet points, etc.)
530
+ html_content += '<ul>\n'
531
+ for item in content:
532
+ if isinstance(item, str):
533
+ html_content += f'<li>{item}</li>\n'
534
+ st.markdown(f"- {item}")
535
+ elif isinstance(item, dict):
536
+ # Format dictionary items in a readable way
537
+ html_content += f'<li><pre>{json.dumps(item, indent=2)}</pre></li>\n'
538
+ st.json(item)
539
+ html_content += '</ul>\n'
540
+
541
+ elif isinstance(content, dict):
542
+ # Special handling for poem type
543
+ if 'type' in content and content['type'] == 'poem' and 'lines' in content:
544
+ html_content += '<div class="poem">\n'
545
+ for line in content['lines']:
546
+ html_content += f'{line}\n'
547
+ st.markdown(line)
548
+ html_content += '</div>\n'
549
+ else:
550
+ # Regular dictionary display
551
+ html_content += '<dl>\n'
552
  for k, v in content.items():
553
+ html_content += f'<dt>{k}</dt>\n<dd>'
554
+ if isinstance(v, str):
555
+ html_content += v
556
+ elif isinstance(v, list):
557
+ html_content += ', '.join(str(item) for item in v)
558
+ else:
559
+ html_content += str(v)
560
+ html_content += '</dd>\n'
561
  st.markdown(f"**{k}:** {v}")
562
+ html_content += '</dl>\n'
563
+
564
+ # Close HTML document
565
+ html_content += '</div>\n</body>\n</html>'
566
 
567
  # Add download button in a smaller section
568
  with st.expander("Export Content"):
 
586
  try:
587
  # Import function
588
  try:
589
+ from ocr_utils import create_html_with_images
590
  except ImportError:
591
  st.error("Required module ocr_utils not found.")
592
  st.stop()
593
 
594
+ # Check if has_images flag is set
595
+ if not result.get('has_images', False) or 'pages_data' not in result:
596
+ st.warning("No image data available in the OCR response.")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
597
  st.stop()
598
 
599
+ # Count images in the result
600
+ image_count = 0
601
+ for page in result.get('pages_data', []):
602
+ image_count += len(page.get('images', []))
603
 
604
  # Add warning for image-heavy documents
605
  if image_count > 10:
606
  st.warning(f"This document contains {image_count} images. Rendering may take longer than usual.")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
607
 
608
+ # Generate HTML with images
609
+ html_with_images = create_html_with_images(result)
610
+
611
+ # For multi-page documents, create page navigation
612
+ page_count = len(result.get('pages_data', []))
613
+
614
+ if page_count > 1:
615
+ st.info(f"Document contains {page_count} pages. You can scroll to view all pages or use the page selector below.")
616
 
617
  # Create a page selector
618
+ page_options = [f"Page {i+1}" for i in range(page_count)]
619
+ selected_page = st.selectbox("Jump to page:", options=page_options, index=0)
 
620
 
621
+ # Extract page number from selection
622
+ page_num = int(selected_page.split(" ")[1])
 
 
 
 
623
 
624
+ # Add JavaScript to scroll to the selected page
 
 
 
625
  st.markdown(f"""
626
+ <script>
627
+ document.addEventListener('DOMContentLoaded', function() {{
628
+ const element = document.getElementById('page-{page_num}');
629
+ if (element) {{
630
+ element.scrollIntoView({{ behavior: 'smooth' }});
631
+ }}
632
+ }});
633
+ </script>
634
  """, unsafe_allow_html=True)
635
 
636
+ # Display the HTML content
637
+ st.components.v1.html(html_with_images, height=600, scrolling=True)
638
+
639
+ # Add download button for the content with images
640
  st.download_button(
641
  label="Download with Images (HTML)",
642
+ data=html_with_images,
 
 
 
 
 
 
 
 
 
 
 
 
643
  file_name="document_with_images.html",
644
                          mime="text/html"
                      )
  
                      st.info("Try refreshing or processing the document again.")
              else:
                  st.error("No OCR content was extracted from the document.")
  
      except Exception as e:
          st.error(f"Error processing document: {str(e)}")
  
      st.info("Upload a document to get started using the file uploader above.")
  
      # Show example images in a grid
      # Add a sample images container
      with st.container():
          # Find sample images from the input directory to display
          input_dir = Path(__file__).parent / "input"
          sample_images = []
          if input_dir.exists():
+             # Get all potential image files - exclude PDF files
+             all_images = []
+             all_images.extend(list(input_dir.glob("*.jpg")))
+             all_images.extend(list(input_dir.glob("*.jpeg")))
+             all_images.extend(list(input_dir.glob("*.png")))
+
+             # Filter to get a good set of diverse images - not too small, not too large
+             valid_images = [path for path in all_images if 50000 < path.stat().st_size < 1000000]
+
+             # Deduplicate any images that might have the same content (like recipe and historical-recipe)
+             seen_sizes = {}
+             deduplicated_images = []
+             for img in valid_images:
+                 size = img.stat().st_size
+                 # If we haven't seen this exact file size before, include it
+                 # This simple heuristic works well enough for images with identical content
+                 if size not in seen_sizes:
+                     seen_sizes[size] = True
+                     deduplicated_images.append(img)
+
+             valid_images = deduplicated_images
+
+             # Select a random sample of 6 images if we have enough
+             import random
+             if len(valid_images) > 6:
+                 sample_images = random.sample(valid_images, 6)
+             else:
+                 sample_images = valid_images
  
          if sample_images:
+             # Create two rows of three columns
+
+             # First row
+             row1 = st.columns(3)
+             for i in range(3):
+                 if i < len(sample_images):
+                     with row1[i]:
+                         try:
+                             st.image(str(sample_images[i]), caption=sample_images[i].name, use_container_width=True)
+                         except Exception:
+                             # Silently skip problematic images
+                             pass
+
+             # Second row
+             row2 = st.columns(3)
+             for i in range(3):
+                 idx = i + 3
+                 if idx < len(sample_images):
+                     with row2[i]:
+                         try:
+                             st.image(str(sample_images[idx]), caption=sample_images[idx].name, use_container_width=True)
+                         except Exception:
+                             # Silently skip problematic images
+                             pass
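The sample-picker above deduplicates by file size alone: two files with identical bytes necessarily have identical sizes, so size works as a cheap (though collision-prone) content fingerprint. A standalone sketch of that heuristic, with a hypothetical `dedupe_by_size` name not used in `app.py`:

```python
from pathlib import Path

def dedupe_by_size(paths):
    """Keep only the first file seen for each distinct byte size.

    Identical copies of an image stored under different names share a
    size, so this drops likely duplicates without hashing file contents.
    Distinct images that happen to share a size are dropped too - the
    trade-off the app accepts for sample thumbnails.
    """
    seen_sizes = set()
    unique = []
    for p in paths:
        size = p.stat().st_size
        if size not in seen_sizes:
            seen_sizes.add(size)
            unique.append(p)
    return unique
```

For a gallery of sample images the occasional false positive is harmless; hashing with `hashlib` would be the stricter alternative if exactness mattered.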
ocr_utils.py CHANGED
@@ -125,7 +125,7 @@ def ocr_response_to_json(ocr_response, indent: int = 4) -> str:
     response_dict = json.loads(ocr_response.model_dump_json())
     return json.dumps(response_dict, indent=indent)
 
-def get_combined_markdown_compressed(ocr_response, max_width: int = 800, quality: int = 85) -> str:
+def get_combined_markdown_compressed(ocr_response, max_width: int = 1200, quality: int = 92) -> str:
     """
     Combine OCR text and images into a single markdown document with compressed images.
     Reduces image sizes to improve performance.
@@ -209,4 +209,163 @@ try:
     display(Markdown(combined_markdown))
 except ImportError:
     # IPython not available
-    pass
+    pass
+
+def create_html_with_images(result_with_pages: dict) -> str:
+    """
+    Create HTML with embedded images from the OCR result.
+
+    Args:
+        result_with_pages: OCR result with pages_data containing markdown and images
+
+    Returns:
+        HTML string with embedded images
+    """
+    if not result_with_pages.get('has_images', False) or 'pages_data' not in result_with_pages:
+        return "<p>No images available in the document.</p>"
+
+    # Create HTML document
+    html = """<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>Document with Images</title>
+    <style>
+        body {
+            font-family: 'Georgia', serif;
+            line-height: 1.6;
+            margin: 0;
+            padding: 20px;
+            background-color: #f9f9f9;
+            color: #333;
+        }
+        .container {
+            max-width: 1000px;
+            margin: 0 auto;
+            background-color: #fff;
+            padding: 30px;
+            border-radius: 8px;
+            box-shadow: 0 4px 12px rgba(0,0,0,0.1);
+        }
+        h1, h2, h3, h4 {
+            font-family: 'Bookman', 'Georgia', serif;
+            margin-top: 1.5em;
+            margin-bottom: 0.5em;
+            color: #222;
+        }
+        h1 { font-size: 2.2em; border-bottom: 2px solid #e0e0e0; padding-bottom: 10px; }
+        h2 { font-size: 1.8em; border-bottom: 1px solid #e0e0e0; padding-bottom: 6px; }
+        h3 { font-size: 1.5em; }
+        h4 { font-size: 1.2em; }
+        p { margin-bottom: 1.2em; text-align: justify; }
+        img {
+            max-width: 100%;
+            height: auto;
+            margin: 20px 0;
+            border: 1px solid #ddd;
+            border-radius: 6px;
+            box-shadow: 0 3px 6px rgba(0,0,0,0.1);
+            display: block;
+        }
+        .page {
+            margin-bottom: 40px;
+            padding-bottom: 30px;
+            border-bottom: 1px dashed #ccc;
+        }
+        .page:last-child {
+            border-bottom: none;
+        }
+        .page-title {
+            text-align: center;
+            color: #555;
+            font-style: italic;
+            margin: 30px 0;
+        }
+        pre {
+            background-color: #f5f5f5;
+            padding: 15px;
+            border-radius: 5px;
+            overflow-x: auto;
+            font-size: 14px;
+            line-height: 1.4;
+        }
+        blockquote {
+            border-left: 3px solid #ccc;
+            margin: 1.5em 0;
+            padding: 0.5em 1.5em;
+            background-color: #f5f5f5;
+            font-style: italic;
+        }
+        .poem {
+            font-family: 'Baskerville', 'Georgia', serif;
+            margin-left: 2em;
+            line-height: 1.8;
+            white-space: pre-wrap;
+        }
+    </style>
+</head>
+<body>
+    <div class="container">
+"""
+
+    # Process each page
+    pages_data = result_with_pages.get('pages_data', [])
+    for page_idx, page in enumerate(pages_data):
+        page_number = page.get('page_number', page_idx + 1)
+        page_markdown = page.get('markdown', '')
+        page_images = page.get('images', [])
+
+        # Add page header
+        html += f'<div class="page" id="page-{page_number}">\n'
+        if len(pages_data) > 1:
+            html += f'<div class="page-title">Page {page_number}</div>\n'
+
+        # Process markdown text and replace image references
+        if page_markdown:
+            # Replace image markers with actual images
+            for img in page_images:
+                img_id = img.get('id', '')
+                img_base64 = img.get('image_base64', '')
+
+                if img_id and img_base64:
+                    # Format image tag
+                    img_tag = f'<img src="{img_base64}" alt="Image {img_id}" loading="lazy">'
+                    # Replace markdown image reference with HTML image
+                    page_markdown = page_markdown.replace(f'![{img_id}]({img_id})', img_tag)
+
+            # Convert line breaks to <p> tags for proper HTML formatting
+            paragraphs = page_markdown.split('\n\n')
+            for paragraph in paragraphs:
+                if paragraph.strip():
+                    # Check if this looks like a header
+                    if paragraph.startswith('# '):
+                        header_text = paragraph[2:].strip()
+                        html += f'<h1>{header_text}</h1>\n'
+                    elif paragraph.startswith('## '):
+                        header_text = paragraph[3:].strip()
+                        html += f'<h2>{header_text}</h2>\n'
+                    elif paragraph.startswith('### '):
+                        header_text = paragraph[4:].strip()
+                        html += f'<h3>{header_text}</h3>\n'
+                    else:
+                        html += f'<p>{paragraph}</p>\n'
+
+        # Add any images that weren't referenced in the markdown
+        referenced_img_ids = [img.get('id') for img in page_images if img.get('id') in page_markdown]
+        for img in page_images:
+            img_id = img.get('id', '')
+            img_base64 = img.get('image_base64', '')
+
+            if img_id and img_base64 and img_id not in referenced_img_ids:
+                html += f'<img src="{img_base64}" alt="Image {img_id}" loading="lazy">\n'
+
+        # Close page div
+        html += '</div>\n'
+
+    # Close main container and document
+    html += """    </div>
+</body>
+</html>"""
+
+    return html
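The core of the new `create_html_with_images` helper is a plain string substitution: the OCR markdown references extracted images as `![img-id](img-id)`, and each reference is swapped for an inline `<img>` tag carrying the base64 data URI. A minimal, self-contained sketch of just that step (the `embed_images` name is illustrative, not part of the module's API):

```python
def embed_images(markdown: str, images: list) -> str:
    """Replace markdown image references with inline HTML <img> tags.

    Each image dict is expected to carry an 'id' matching the markdown
    reference and an 'image_base64' data URI, mirroring the pages_data
    layout produced by structured_ocr.py.
    """
    for img in images:
        img_id = img.get('id', '')
        img_base64 = img.get('image_base64', '')
        if img_id and img_base64:
            # The OCR markdown uses the image id as both alt text and target
            tag = f'<img src="{img_base64}" alt="Image {img_id}" loading="lazy">'
            markdown = markdown.replace(f'![{img_id}]({img_id})', tag)
    return markdown
```

Because the id appears in the replacement tag's `alt` attribute, the later `img.get('id') in page_markdown` check in the helper still recognizes replaced images as "referenced" and does not append them a second time.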
structured_ocr.py CHANGED
@@ -238,8 +238,31 @@ class StructuredOCR:
         # Add confidence score
         result['confidence_score'] = confidence_score
 
-        # Store the raw OCR response for image rendering
-        result['raw_response'] = pdf_response
+        # Store key parts of the OCR response for image rendering
+        # Extract and store image data in a format that can be serialized to JSON
+        has_images = hasattr(pdf_response, 'pages') and any(hasattr(page, 'images') and page.images for page in pdf_response.pages)
+        result['has_images'] = has_images
+
+        if has_images:
+            # Create a structured representation of images that can be serialized
+            result['pages_data'] = []
+            for page_idx, page in enumerate(pdf_response.pages):
+                page_data = {
+                    'page_number': page_idx + 1,
+                    'markdown': page.markdown if hasattr(page, 'markdown') else '',
+                    'images': []
+                }
+
+                # Extract images if present
+                if hasattr(page, 'images') and page.images:
+                    for img_idx, img in enumerate(page.images):
+                        if hasattr(img, 'image_base64') and img.image_base64:
+                            page_data['images'].append({
+                                'id': img.id if hasattr(img, 'id') else f"img_{page_idx}_{img_idx}",
+                                'image_base64': img.image_base64
+                            })
+
+                result['pages_data'].append(page_data)
 
         logger.info(f"PDF processing completed successfully")
         return result
@@ -300,8 +323,31 @@ class StructuredOCR:
         # Add confidence score
         result['confidence_score'] = confidence_score
 
-        # Store the raw OCR response for image rendering
-        result['raw_response'] = image_response
+        # Store key parts of the OCR response for image rendering
+        # Extract and store image data in a format that can be serialized to JSON
+        has_images = hasattr(image_response, 'pages') and image_response.pages and hasattr(image_response.pages[0], 'images') and image_response.pages[0].images
+        result['has_images'] = has_images
+
+        if has_images:
+            # Create a structured representation of images that can be serialized
+            result['pages_data'] = []
+            for page_idx, page in enumerate(image_response.pages):
+                page_data = {
+                    'page_number': page_idx + 1,
+                    'markdown': page.markdown if hasattr(page, 'markdown') else '',
+                    'images': []
+                }
+
+                # Extract images if present
+                if hasattr(page, 'images') and page.images:
+                    for img_idx, img in enumerate(page.images):
+                        if hasattr(img, 'image_base64') and img.image_base64:
+                            page_data['images'].append({
+                                'id': img.id if hasattr(img, 'id') else f"img_{page_idx}_{img_idx}",
+                                'image_base64': img.image_base64
+                            })
+
+                result['pages_data'].append(page_data)
 
         logger.info("Image processing completed successfully")
         return result
@@ -336,7 +382,10 @@ class StructuredOCR:
                         f"This is a historical document's OCR in markdown:\n"
                         f"<BEGIN_IMAGE_OCR>\n{ocr_markdown}\n<END_IMAGE_OCR>.\n"
                         f"Convert this into a structured JSON response with the OCR contents in a sensible dictionary. "
-                        f"Extract topics, languages, and organize the content logically."
+                        f"Extract topics, languages, document type, date if present, and key entities. "
+                        f"For handwritten documents, carefully preserve the structure. "
+                        f"For printed texts, organize content logically by sections, maintaining the hierarchy. "
+                        f"For tabular content, preserve the table structure as much as possible."
                     ))
                 ],
             },
@@ -371,7 +420,10 @@ class StructuredOCR:
                     "content": f"This is a historical document's OCR in markdown:\n"
                                f"<BEGIN_IMAGE_OCR>\n{ocr_markdown}\n<END_IMAGE_OCR>.\n"
                                f"Convert this into a structured JSON response with the OCR contents. "
-                               f"Extract topics, languages, and organize the content logically."
+                               f"Extract topics, languages, document type, date if present, and key entities. "
+                               f"For handwritten documents, carefully preserve the structure. "
+                               f"For printed texts, organize content logically by sections. "
+                               f"For tabular content, preserve the table structure as much as possible."
                 },
            ],
            response_format=StructuredOCRModel,
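The point of replacing `result['raw_response']` with `has_images`/`pages_data` is serializability: an SDK response object cannot be passed to `json.dumps`, while a plain dict of page numbers, markdown strings, and base64 data URIs can. A toy demonstration of the difference (`FakeResponse` is a stand-in, not the real Mistral SDK class):

```python
import json

class FakeResponse:
    """Stand-in for an SDK response object with non-primitive internals."""
    pass

# Storing the raw object makes the whole result dict un-serializable.
try:
    json.dumps({'raw_response': FakeResponse()})
    raw_serializable = True
except TypeError:
    raw_serializable = False

# The extracted pages_data structure is built only from JSON-native types,
# so the result can be cached, downloaded, or re-rendered later.
pages_data = [{'page_number': 1, 'markdown': '# Title', 'images': []}]
encoded = json.dumps({'has_images': False, 'pages_data': pages_data})
```

This is why the diff walks the response with `hasattr` checks and copies only `markdown`, `id`, and `image_base64` into plain dicts instead of keeping the object itself.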