milwright committed
Commit 8a9c37d · 1 Parent(s): 4ddf559

Update documentation and improve HTML formatting


- Enhanced README with detailed features and documentation
- Improved About section with new special features
- Updated HTML formatting for better historical document presentation
- Added specialized poem formatting and multi-page support
- Removed redundant Technical Details expander

Files changed (5):

1. CLAUDE.md +35 -0
2. README.md +26 -8
3. app.py +285 -167
4. ocr_utils.py +161 -2
5. structured_ocr.py +58 -6
CLAUDE.md ADDED
@@ -0,0 +1,35 @@
+ # Historical OCR Project Guidelines
+
+ ## Commands
+ - Run standard app: `./run_local.sh` or `streamlit run app.py`
+ - Run educational version: `./run_local.sh educational` or `streamlit run streamlit_app.py`
+ - Run simple test: `python simple_test.py`
+ - Run PDF test: `python test_pdf.py`
+ - Process large files: `./run_large_files.sh --server.maxUploadSize=500 --server.maxMessageSize=500`
+ - Prepare for Hugging Face: `python prepare_for_hf.py`
+
+ ## Environment Setup
+ - API key: Set `MISTRAL_API_KEY` in `.env` file or as environment variable
+ - System dependencies: Install poppler for PDF processing (`brew install poppler` on macOS)
+ - Python dependencies: `pip install -r requirements.txt`
+
+ ## Code Style
+ - Imports: Standard library → third-party → local imports
+ - Documentation: Google-style docstrings with Args, Returns sections
+ - Error handling: Specific exceptions with informative messages, logging
+ - Naming: snake_case for variables/functions, PascalCase for classes
+ - Type hints: Pydantic models for structured data, typing module annotations
+
+ ## Project Structure
+ - Core: `structured_ocr.py` - OCR processing with Mistral AI
+ - Utils: `ocr_utils.py` - Text/image processing utilities
+ - PDF: `pdf_ocr.py` - PDF-specific document handling
+ - Config: `config.py` - API settings, model selection
+ - UI: Streamlit interface with modular components
+ - Testing: Simple test scripts in project root
+
+ ## Development Workflow
+ - Use logging for debugging (configured in structured_ocr.py)
+ - Test with sample files in input/ directory
+ - Handle large files with specific options for optimal processing
+ - Check confidence_score in results to evaluate OCR quality
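The last workflow item, checking `confidence_score`, can be sketched as follows. This is a minimal illustration: the result-dict fields match those built in app.py, but the 0.7 threshold is an assumed example, not a project constant.

```python
# Sketch: evaluate OCR quality from a result dict as built in app.py.
# The 0.7 threshold is an illustrative assumption, not a project constant.
def ocr_quality(result: dict) -> str:
    if "error" in result:
        return "failed"
    score = result.get("confidence_score", 0.0)
    return "good" if score >= 0.7 else "low"

sample = {"file_name": "doc.pdf", "confidence_score": 0.85, "ocr_contents": {}}
print(ocr_quality(sample))  # → good
```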
README.md CHANGED

@@ -13,19 +13,23 @@ short_description: Employs Mistral OCR for transcribing historical data
 
  # Historical Document OCR
 
- This application uses Mistral AI's OCR capabilities to transcribe and extract information from historical documents.
+ This application uses Mistral AI's OCR capabilities to transcribe and extract information from historical documents with enhanced formatting and presentation.
 
  ## Features
 
  - OCR processing for both image and PDF files
- - Automatic file type detection
- - Structured output generation using Mistral models
+ - Automatic file type detection and content structuring
+ - Advanced HTML formatting with proper document structure preservation
+ - Specialized formatting for poems and historical texts
  - Interactive web interface with Streamlit
- - Supports historical documents and manuscripts
- - PDF preview functionality for better user experience
+ - "With Images" view that preserves document layout and embedded images
+ - Multi-page document support with pagination
+ - PDF preview functionality
  - Smart handling of large PDFs with automatic page limiting
- - Robust error handling with helpful messages
  - Image preprocessing options for enhanced OCR accuracy
+ - Document export in multiple formats (HTML, JSON)
+ - Responsive design optimized for historical document presentation
+ - Enhanced typography with appropriate fonts for historical content
 
  ## Project Structure
 
@@ -69,7 +73,8 @@ Historical OCR - Project Structure
  │ └─ process_file.py # File processing for educational app
 
  ├─ UI Components (ui/)
- └─ layout.py # Shared UI components and styling
+ ├─ layout.py # Shared UI components and styling
+ │ └─ custom.css # Custom styling for the application
 
  ├─ Data Directories
  │ ├─ input/ # Sample documents for testing/demo
@@ -117,7 +122,20 @@ streamlit run app.py
  1. Upload an image or PDF file using the file uploader
  2. Select processing options in the sidebar (e.g., use vision model, image preprocessing)
  3. Click "Process Document" to analyze the file
- 4. View the structured results and extract information
+ 4. View the results in three available formats:
+ - **Structured View**: Beautifully formatted HTML with proper document structure
+ - **Raw JSON**: Complete data structure for developers
+ - **With Images**: Document with embedded images preserving original layout
+
+ ## Document Output Features
+
+ The application provides several specialized features for historical document presentation:
+
+ 1. **Poetry Formatting**: Special handling for poem structure with proper line spacing and typography
+ 2. **Image Embedding**: Original document images embedded within the text at their correct positions
+ 3. **Multi-page Support**: Pagination controls for navigating multi-page documents
+ 4. **Typography**: Historical-appropriate fonts and styling for better readability of historical texts
+ 5. **Document Export**: Download options for saving the processed document in HTML format
 
  ## Application Versions
app.py CHANGED
@@ -146,17 +146,17 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None):
146
  # Get file size in MB
147
  file_size_mb = os.path.getsize(temp_path) / (1024 * 1024)
148
 
149
- # Check if file exceeds size limits (200 MB)
150
- if file_size_mb > 200:
151
- st.error(f"File too large ({file_size_mb:.1f} MB). Maximum file size is 200MB.")
152
  return {
153
  "file_name": uploaded_file.name,
154
  "topics": ["Document"],
155
  "languages": ["English"],
156
  "confidence_score": 0.0,
157
- "error": f"File size {file_size_mb:.2f} MB exceeds limit of 200 MB",
158
  "ocr_contents": {
159
- "error": f"Failed to process file: File size {file_size_mb:.2f} MB exceeds limit of 200 MB",
160
  "partial_text": "Document could not be processed due to size limitations."
161
  }
162
  }
@@ -190,7 +190,7 @@ st.title("Historical Document OCR")
190
  st.subheader("Powered by Mistral AI")
191
 
192
  # Create main layout with tabs and columns
193
- main_tab1, main_tab2 = st.tabs(["Document Processing", "About"])
194
 
195
  with main_tab1:
196
  # Create a two-column layout for file upload and preview
@@ -203,7 +203,7 @@ with main_tab1:
203
 
204
  Using the `mistral-ocr-latest` model for advanced document understanding.
205
  """)
206
- uploaded_file = st.file_uploader("Choose a file", type=["pdf", "png", "jpg", "jpeg"], help="Limit 200MB per file")
207
 
208
  # Sidebar with options
209
  with st.sidebar:
@@ -240,7 +240,7 @@ with main_tab2:
240
  st.markdown("""
241
  ### About This Application
242
 
243
- This app uses [Mistral AI's Document OCR](https://docs.mistral.ai/capabilities/document/) to extract text and images from historical documents.
244
 
245
  It can process:
246
  - Image files (jpg, png, etc.)
@@ -250,26 +250,71 @@ with main_tab2:
250
  - Text extraction with `mistral-ocr-latest`
251
  - Analysis with language models
252
  - Layout preservation with images
 
253
 
254
  View results in three formats:
255
- - Structured HTML view
256
- - Raw JSON (for developers)
257
- - Markdown with images (preserves document layout)
258
 
259
- **New Features:**
 
 
 
 
 
 
 
260
  - Image preprocessing for better OCR quality
261
  - PDF resolution and page controls
262
  - Progress tracking during processing
 
263
  """)
264
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
265
  with main_tab1:
266
  if uploaded_file is not None:
267
- # Check file size (cap at 200MB)
268
  file_size_mb = len(uploaded_file.getvalue()) / (1024 * 1024)
269
 
270
- if file_size_mb > 200:
271
  with upload_col:
272
- st.error(f"File too large ({file_size_mb:.1f} MB). Maximum file size is 200MB.")
273
  st.stop()
274
 
275
  file_ext = Path(uploaded_file.name).suffix.lower()
@@ -331,10 +376,8 @@ with main_tab1:
331
  # Call process_file with all options
332
  result = process_file(uploaded_file, use_vision, preprocessing_options)
333
 
334
- # Create results tabs for better organization
335
- results_tab1, results_tab2 = st.tabs(["Document Analysis", "Technical Details"])
336
-
337
- with results_tab1:
338
  # Create two columns for metadata and content
339
  meta_col, content_col = st.columns([1, 2])
340
 
@@ -368,12 +411,7 @@ with main_tab1:
368
  st.subheader("Document Contents")
369
  if 'ocr_contents' in result:
370
  # Check if there are images in the OCR result
371
- has_images = False
372
- if 'raw_response' in result:
373
- try:
374
- has_images = any(page.images for page in result['raw_response'].pages)
375
- except Exception:
376
- has_images = False
377
 
378
  # Create tabs for different views
379
  if has_images:
@@ -383,37 +421,148 @@ with main_tab1:
383
 
384
  with view_tab1:
385
  # Display in a more user-friendly format based on the content structure
386
- html_content = ""
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
387
  if isinstance(result['ocr_contents'], dict):
388
  for section, content in result['ocr_contents'].items():
389
- if content: # Only display non-empty sections
390
- section_title = f"<h4>{section.replace('_', ' ').title()}</h4>"
391
- html_content += section_title
392
 
393
- if isinstance(content, str):
394
- html_content += f"<p>{content}</p>"
395
- st.markdown(f"#### {section.replace('_', ' ').title()}")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
396
  st.markdown(content)
397
- elif isinstance(content, list):
398
- html_list = "<ul>"
399
- st.markdown(f"#### {section.replace('_', ' ').title()}")
400
- for item in content:
401
- if isinstance(item, str):
402
- html_list += f"<li>{item}</li>"
403
- st.markdown(f"- {item}")
404
- elif isinstance(item, dict):
405
- html_list += f"<li>{json.dumps(item)}</li>"
406
- st.json(item)
407
- html_list += "</ul>"
408
- html_content += html_list
409
- elif isinstance(content, dict):
410
- html_dict = "<dl>"
411
- st.markdown(f"#### {section.replace('_', ' ').title()}")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
412
  for k, v in content.items():
413
- html_dict += f"<dt><strong>{k}</strong></dt><dd>{v}</dd>"
 
 
 
 
 
 
 
414
  st.markdown(f"**{k}:** {v}")
415
- html_dict += "</dl>"
416
- html_content += html_dict
 
 
417
 
418
  # Add download button in a smaller section
419
  with st.expander("Export Content"):
@@ -437,125 +586,60 @@ with main_tab1:
437
  try:
438
  # Import function
439
  try:
440
- from ocr_utils import get_combined_markdown
441
  except ImportError:
442
  st.error("Required module ocr_utils not found.")
443
  st.stop()
444
 
445
- # Check if raw_response is available
446
- if 'raw_response' not in result:
447
- st.warning("Raw OCR response not available. Cannot display images.")
448
- st.stop()
449
-
450
- # Validate the raw_response structure before processing
451
- if not hasattr(result['raw_response'], 'pages'):
452
- st.warning("Invalid OCR response format. Cannot display images.")
453
- st.stop()
454
-
455
- # Get the combined markdown with images
456
- # Set a flag to compress images if needed
457
- compress_images = True
458
- max_image_width = 800 # Maximum width for images
459
-
460
- try:
461
- # First try to get combined markdown with compressed images
462
- if compress_images and hasattr(result['raw_response'], 'pages'):
463
- from ocr_utils import get_combined_markdown_compressed
464
- combined_markdown = get_combined_markdown_compressed(
465
- result['raw_response'],
466
- max_width=max_image_width,
467
- quality=85
468
- )
469
- else:
470
- # Fall back to regular method if compression not available
471
- combined_markdown = get_combined_markdown(result['raw_response'])
472
- except (ImportError, AttributeError):
473
- # Fall back to regular method
474
- combined_markdown = get_combined_markdown(result['raw_response'])
475
-
476
- if not combined_markdown or combined_markdown.strip() == "":
477
- st.warning("No image content found in the document.")
478
  st.stop()
479
 
480
- # Check if there are many images that might cause loading issues
481
- image_count = sum(len(page.images) for page in result['raw_response'].pages if hasattr(page, 'images'))
 
 
482
 
483
  # Add warning for image-heavy documents
484
  if image_count > 10:
485
  st.warning(f"This document contains {image_count} images. Rendering may take longer than usual.")
486
-
487
- # Add CSS to ensure proper spacing and handling of text and images
488
- st.markdown("""
489
- <style>
490
- .markdown-text-container {
491
- padding: 10px;
492
- background-color: #f9f9f9;
493
- border-radius: 5px;
494
- }
495
- .markdown-text-container img {
496
- margin: 15px 0;
497
- max-width: 100%;
498
- border: 1px solid #ddd;
499
- border-radius: 4px;
500
- display: block;
501
- }
502
- .markdown-text-container p {
503
- margin-bottom: 16px;
504
- line-height: 1.6;
505
- }
506
- /* Add lazy loading for images to improve performance */
507
- .markdown-text-container img {
508
- loading: lazy;
509
- }
510
- </style>
511
- """, unsafe_allow_html=True)
512
 
513
- # For very image-heavy documents, show images in a paginated way
514
- if image_count > 20:
515
- # Show image content in a paginated way
516
- st.write("Document contains many images. Showing in a paginated format:")
517
-
518
- # Split the combined markdown by page separators
519
- pages = combined_markdown.split("---")
 
520
 
521
  # Create a page selector
522
- page_num = st.selectbox("Select page to view:",
523
- options=list(range(1, len(pages)+1)),
524
- index=0)
525
 
526
- # Display only the selected page
527
- st.markdown(f"""
528
- <div class="markdown-text-container">
529
- {pages[page_num-1]}
530
- </div>
531
- """, unsafe_allow_html=True)
532
 
533
- # Add note about pagination
534
- st.info(f"Showing page {page_num} of {len(pages)}. Select a different page from the dropdown above.")
535
- else:
536
- # Wrap the markdown in a div with the class for styling
537
  st.markdown(f"""
538
- <div class="markdown-text-container">
539
- {combined_markdown}
540
- </div>
 
 
 
 
 
541
  """, unsafe_allow_html=True)
542
 
543
- # Add a download button for the combined content
 
 
 
544
  st.download_button(
545
  label="Download with Images (HTML)",
546
- data=f"""
547
- <html>
548
- <head>
549
- <style>
550
- body {{ font-family: Arial, sans-serif; line-height: 1.6; }}
551
- img {{ max-width: 100%; margin: 15px 0; }}
552
- </style>
553
- </head>
554
- <body>
555
- {combined_markdown}
556
- </body>
557
- </html>
558
- """,
559
  file_name="document_with_images.html",
560
  mime="text/html"
561
  )
@@ -565,10 +649,6 @@ with main_tab1:
565
  st.info("Try refreshing or processing the document again.")
566
  else:
567
  st.error("No OCR content was extracted from the document.")
568
-
569
- with results_tab2:
570
- st.subheader("Raw Processing Results")
571
- st.json(result)
572
 
573
  except Exception as e:
574
  st.error(f"Error processing document: {str(e)}")
@@ -577,25 +657,63 @@ with main_tab1:
577
  st.info("Upload a document to get started using the file uploader above.")
578
 
579
  # Show example images in a grid
580
- st.subheader("Example Documents")
581
-
582
  # Add a sample images container
583
  with st.container():
584
  # Find sample images from the input directory to display
585
  input_dir = Path(__file__).parent / "input"
586
  sample_images = []
587
  if input_dir.exists():
588
- # Find valid jpg files (with size > 50KB to avoid placeholders)
589
- sample_images = [
590
- path for path in input_dir.glob("*.jpg")
591
- if path.stat().st_size > 50000
592
- ][:3] # Limit to 3 samples
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
593
 
594
  if sample_images:
595
- columns = st.columns(3)
596
- for i, img_path in enumerate(sample_images):
597
- with columns[i % 3]:
598
- try:
599
- st.image(str(img_path), caption=img_path.name, use_container_width=True)
600
- except Exception as e:
601
- st.error(f"Error loading image {img_path.name}: {str(e)}")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
146
  # Get file size in MB
147
  file_size_mb = os.path.getsize(temp_path) / (1024 * 1024)
148
 
149
+ # Check if file exceeds size limits (20 MB)
150
+ if file_size_mb > 20:
151
+ st.error(f"File too large ({file_size_mb:.1f} MB). Maximum file size is 20MB.")
152
  return {
153
  "file_name": uploaded_file.name,
154
  "topics": ["Document"],
155
  "languages": ["English"],
156
  "confidence_score": 0.0,
157
+ "error": f"File size {file_size_mb:.2f} MB exceeds limit of 20 MB",
158
  "ocr_contents": {
159
+ "error": f"Failed to process file: File size {file_size_mb:.2f} MB exceeds limit of 20 MB",
160
  "partial_text": "Document could not be processed due to size limitations."
161
  }
162
  }
 
190
  st.subheader("Powered by Mistral AI")
191
 
192
  # Create main layout with tabs and columns
193
+ main_tab1, main_tab2, main_tab3 = st.tabs(["Document Processing", "About this App", "Companion Workshop"])
194
 
195
  with main_tab1:
196
  # Create a two-column layout for file upload and preview
 
203
 
204
  Using the `mistral-ocr-latest` model for advanced document understanding.
205
  """)
206
+ uploaded_file = st.file_uploader("Choose a file", type=["pdf", "png", "jpg", "jpeg"], help="Limit 20MB per file")
207
 
208
  # Sidebar with options
209
  with st.sidebar:
 
240
  st.markdown("""
241
  ### About This Application
242
 
243
+ This app uses [Mistral AI's Document OCR](https://docs.mistral.ai/capabilities/document/) to extract text and images from historical documents with enhanced formatting and presentation.
244
 
245
  It can process:
246
  - Image files (jpg, png, etc.)
 
250
  - Text extraction with `mistral-ocr-latest`
251
  - Analysis with language models
252
  - Layout preservation with images
253
+ - Enhanced typography for historical documents
254
 
255
  View results in three formats:
256
+ - **Structured View**: Beautifully formatted HTML with proper document structure
257
+ - **Raw JSON**: Complete data structure for developers
258
+ - **With Images**: Document with embedded images preserving original layout
259
 
260
+ **Special Features:**
261
+ - **Poetry Formatting**: Special handling for poem structure with proper line spacing
262
+ - **Image Embedding**: Original document images embedded at correct positions
263
+ - **Multi-page Support**: Pagination controls for navigating multi-page documents
264
+ - **Typography**: Historical-appropriate fonts for better readability
265
+ - **Document Export**: Download options for saving in HTML format
266
+
267
+ **Technical Features:**
268
  - Image preprocessing for better OCR quality
269
  - PDF resolution and page controls
270
  - Progress tracking during processing
271
+ - Responsive design optimized for historical document presentation
272
  """)
273
 
274
+ # Workshop tab content
275
+ with main_tab3:
276
+ st.markdown("<h3>Hacking AI for Historical Research</h3>", unsafe_allow_html=True)
277
+ st.markdown("<p style='margin-bottom: 20px;'>Interactive workshop resources and materials</p>", unsafe_allow_html=True)
278
+
279
+ # Custom CSS to improve the Padlet embed appearance
280
+ st.markdown("""
281
+ <style>
282
+ .padlet-container {
283
+ border-radius: 8px;
284
+ box-shadow: 0 4px 6px rgba(0,0,0,0.1);
285
+ margin-top: 10px;
286
+ margin-bottom: 20px;
287
+ overflow: hidden;
288
+ }
289
+ </style>
290
+ """, unsafe_allow_html=True)
291
+
292
+ # Padlet embed with additional container
293
+ st.markdown("""
294
+ <div class="padlet-container">
295
+ <div class="padlet-embed" style="border:1px solid rgba(0,0,0,0.1);border-radius:8px;box-sizing:border-box;overflow:hidden;position:relative;width:100%;background:#F4F4F4">
296
+ <p style="padding:0;margin:0">
297
+ <iframe src="https://padlet.com/embed/y9daf9yabqcj93dq" frameborder="0" allow="camera;microphone;geolocation" style="width:100%;height:650px;display:block;padding:0;margin:0"></iframe>
298
+ </p>
299
+ <div style="display:flex;align-items:center;justify-content:end;margin:0;height:28px">
300
+ <a href="https://padlet.com?ref=embed" style="display:block;flex-grow:0;margin:0;border:none;padding:0;text-decoration:none" target="_blank">
301
+ <div style="display:flex;align-items:center;">
302
+ <img src="https://padlet.net/embeds/made_with_padlet_2022.png" width="114" height="28" style="padding:0;margin:0;background:0 0;border:none;box-shadow:none" alt="Made with Padlet">
303
+ </div>
304
+ </a>
305
+ </div>
306
+ </div>
307
+ </div>
308
+ """, unsafe_allow_html=True)
309
+
310
  with main_tab1:
311
  if uploaded_file is not None:
312
+ # Check file size (cap at 20MB)
313
  file_size_mb = len(uploaded_file.getvalue()) / (1024 * 1024)
314
 
315
+ if file_size_mb > 20:
316
  with upload_col:
317
+ st.error(f"File too large ({file_size_mb:.1f} MB). Maximum file size is 20MB.")
318
  st.stop()
319
 
320
  file_ext = Path(uploaded_file.name).suffix.lower()
 
376
  # Call process_file with all options
377
  result = process_file(uploaded_file, use_vision, preprocessing_options)
378
 
379
+ # Single tab for document analysis
380
+ with st.container():
 
 
381
  # Create two columns for metadata and content
382
  meta_col, content_col = st.columns([1, 2])
383
 
 
411
  st.subheader("Document Contents")
412
  if 'ocr_contents' in result:
413
  # Check if there are images in the OCR result
414
+ has_images = result.get('has_images', False)
 
 
 
 
 
415
 
416
  # Create tabs for different views
417
  if has_images:
 
421
 
422
  with view_tab1:
423
  # Display in a more user-friendly format based on the content structure
424
+ html_content = '<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="UTF-8">\n<meta name="viewport" content="width=device-width, initial-scale=1.0">\n<title>OCR Document</title>\n<style>\n'
425
+ html_content += """
426
+ body {
427
+ font-family: 'Georgia', serif;
428
+ line-height: 1.6;
429
+ margin: 0;
430
+ padding: 20px;
431
+ background-color: #f9f9f9;
432
+ color: #333;
433
+ }
434
+ .container {
435
+ max-width: 1000px;
436
+ margin: 0 auto;
437
+ background-color: #fff;
438
+ padding: 30px;
439
+ border-radius: 8px;
440
+ box-shadow: 0 4px 12px rgba(0,0,0,0.1);
441
+ }
442
+ h1, h2, h3, h4 {
443
+ font-family: 'Bookman', 'Georgia', serif;
444
+ margin-top: 1.5em;
445
+ margin-bottom: 0.5em;
446
+ color: #222;
447
+ }
448
+ h1 { font-size: 2.2em; border-bottom: 2px solid #e0e0e0; padding-bottom: 10px; }
449
+ h2 { font-size: 1.8em; border-bottom: 1px solid #e0e0e0; padding-bottom: 6px; }
450
+ h3 { font-size: 1.5em; }
451
+ h4 { font-size: 1.2em; }
452
+ p { margin-bottom: 1.2em; text-align: justify; }
453
+ ul, ol { margin-bottom: 1.5em; }
454
+ li { margin-bottom: 0.5em; }
455
+ .poem {
456
+ font-family: 'Baskerville', 'Georgia', serif;
457
+ margin-left: 2em;
458
+ line-height: 1.8;
459
+ white-space: pre-wrap;
460
+ }
461
+ .subtitle {
462
+ font-style: italic;
463
+ font-size: 1.1em;
464
+ margin-bottom: 1.5em;
465
+ color: #555;
466
+ }
467
+ blockquote {
468
+ border-left: 3px solid #ccc;
469
+ margin: 1.5em 0;
470
+ padding: 0.5em 1.5em;
471
+ background-color: #f5f5f5;
472
+ font-style: italic;
473
+ }
474
+ dl {
475
+ margin-bottom: 1.5em;
476
+ }
477
+ dt {
478
+ font-weight: bold;
479
+ margin-top: 1em;
480
+ }
481
+ dd {
482
+ margin-left: 2em;
483
+ margin-bottom: 0.5em;
484
+ }
485
+ </style>
486
+ </head>
487
+ <body>
488
+ <div class="container">
489
+ """
490
+
491
  if isinstance(result['ocr_contents'], dict):
492
  for section, content in result['ocr_contents'].items():
493
+ if not content: # Skip empty sections
494
+ continue
 
495
 
496
+ section_title = section.replace('_', ' ').title()
497
+
498
+ # Special handling for title and subtitle
499
+ if section.lower() == 'title':
500
+ html_content += f'<h1>{content}</h1>\n'
501
+ st.markdown(f"## {content}")
502
+ elif section.lower() == 'subtitle':
503
+ html_content += f'<div class="subtitle">{content}</div>\n'
504
+ st.markdown(f"*{content}*")
505
+ else:
506
+ # Section headers for non-title sections
507
+ html_content += f'<h3>{section_title}</h3>\n'
508
+ st.markdown(f"### {section_title}")
509
+
510
+ # Process different content types
511
+ if isinstance(content, str):
512
+ # Handle poem type specifically
513
+ if section.lower() == 'type' and content.lower() == 'poem':
514
+ # Don't add special formatting here, just for the lines
515
  st.markdown(content)
516
+ html_content += f'<p>{content}</p>\n'
517
+ elif 'content' in result['ocr_contents'] and isinstance(result['ocr_contents']['content'], dict) and 'type' in result['ocr_contents']['content'] and result['ocr_contents']['content']['type'] == 'poem' and section.lower() == 'content':
518
+ # This is handled in the dict case below
519
+ pass
520
+ else:
521
+ # Regular text content
522
+ paragraphs = content.split('\n\n')
523
+ for p in paragraphs:
524
+ if p.strip():
525
+ html_content += f'<p>{p.strip()}</p>\n'
526
+ st.markdown(content)
527
+
528
+ elif isinstance(content, list):
529
+ # Handle lists (bullet points, etc.)
530
+ html_content += '<ul>\n'
531
+ for item in content:
532
+ if isinstance(item, str):
533
+ html_content += f'<li>{item}</li>\n'
534
+ st.markdown(f"- {item}")
535
+ elif isinstance(item, dict):
536
+ # Format dictionary items in a readable way
537
+ html_content += f'<li><pre>{json.dumps(item, indent=2)}</pre></li>\n'
538
+ st.json(item)
539
+ html_content += '</ul>\n'
540
+
541
+ elif isinstance(content, dict):
542
+ # Special handling for poem type
543
+ if 'type' in content and content['type'] == 'poem' and 'lines' in content:
544
+ html_content += '<div class="poem">\n'
545
+ for line in content['lines']:
546
+ html_content += f'{line}\n'
547
+ st.markdown(line)
548
+ html_content += '</div>\n'
549
+ else:
550
+ # Regular dictionary display
551
+ html_content += '<dl>\n'
552
  for k, v in content.items():
553
+ html_content += f'<dt>{k}</dt>\n<dd>'
554
+ if isinstance(v, str):
555
+ html_content += v
556
+ elif isinstance(v, list):
557
+ html_content += ', '.join(str(item) for item in v)
558
+ else:
559
+ html_content += str(v)
560
+ html_content += '</dd>\n'
561
  st.markdown(f"**{k}:** {v}")
562
+ html_content += '</dl>\n'
563
+
564
+ # Close HTML document
565
+ html_content += '</div>\n</body>\n</html>'
566
 
567
  # Add download button in a smaller section
568
  with st.expander("Export Content"):
 
586
  try:
587
  # Import function
588
  try:
589
+ from ocr_utils import create_html_with_images
590
  except ImportError:
591
  st.error("Required module ocr_utils not found.")
592
  st.stop()
593
 
594
+ # Check if has_images flag is set
595
+ if not result.get('has_images', False) or 'pages_data' not in result:
596
+ st.warning("No image data available in the OCR response.")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
597
  st.stop()
598
 
599
+ # Count images in the result
600
+ image_count = 0
601
+ for page in result.get('pages_data', []):
602
+ image_count += len(page.get('images', []))
603
 
604
  # Add warning for image-heavy documents
605
  if image_count > 10:
606
  st.warning(f"This document contains {image_count} images. Rendering may take longer than usual.")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
607
 
608
+ # Generate HTML with images
609
+ html_with_images = create_html_with_images(result)
610
+
611
+ # For multi-page documents, create page navigation
612
+ page_count = len(result.get('pages_data', []))
613
+
614
+ if page_count > 1:
615
+ st.info(f"Document contains {page_count} pages. You can scroll to view all pages or use the page selector below.")
616
 
617
  # Create a page selector
618
+ page_options = [f"Page {i+1}" for i in range(page_count)]
619
+ selected_page = st.selectbox("Jump to page:", options=page_options, index=0)
 
620
 
621
+ # Extract page number from selection
622
+ page_num = int(selected_page.split(" ")[1])
 
 
 
 
623
 
624
+ # Add JavaScript to scroll to the selected page
 
 
 
625
  st.markdown(f"""
626
+ <script>
627
+ document.addEventListener('DOMContentLoaded', function() {{
628
+ const element = document.getElementById('page-{page_num}');
629
+ if (element) {{
630
+ element.scrollIntoView({{ behavior: 'smooth' }});
631
+ }}
632
+ }});
633
+ </script>
634
  """, unsafe_allow_html=True)
635
 
636
+ # Display the HTML content
637
+ st.components.v1.html(html_with_images, height=600, scrolling=True)
638
+
639
+ # Add download button for the content with images
640
  st.download_button(
641
  label="Download with Images (HTML)",
642
+ data=html_with_images,
 
 
 
 
 
 
 
 
 
 
 
 
643
  file_name="document_with_images.html",
644
                          mime="text/html"
                      )
  
                      st.info("Try refreshing or processing the document again.")
              else:
                  st.error("No OCR content was extracted from the document.")
  
      except Exception as e:
          st.error(f"Error processing document: {str(e)}")
  
      st.info("Upload a document to get started using the file uploader above.")
  
      # Show example images in a grid
      # Add a sample images container
      with st.container():
          # Find sample images from the input directory to display
          input_dir = Path(__file__).parent / "input"
          sample_images = []
          if input_dir.exists():
+             # Get all potential image files - exclude PDF files
+             all_images = []
+             all_images.extend(list(input_dir.glob("*.jpg")))
+             all_images.extend(list(input_dir.glob("*.jpeg")))
+             all_images.extend(list(input_dir.glob("*.png")))
+
+             # Filter to get a good set of diverse images - not too small, not too large
+             valid_images = [path for path in all_images if 50000 < path.stat().st_size < 1000000]
+
+             # Deduplicate any images that might have the same content (like recipe and historical-recipe)
+             seen_sizes = {}
+             deduplicated_images = []
+             for img in valid_images:
+                 size = img.stat().st_size
+                 # If we haven't seen this exact file size before, include it
+                 # This simple heuristic works well enough for images with identical content
+                 if size not in seen_sizes:
+                     seen_sizes[size] = True
+                     deduplicated_images.append(img)
+
+             valid_images = deduplicated_images
+
+             # Select a random sample of 6 images if we have enough
+             import random
+             if len(valid_images) > 6:
+                 sample_images = random.sample(valid_images, 6)
+             else:
+                 sample_images = valid_images
  
          if sample_images:
+             # Create two rows of three columns
+
+             # First row
+             row1 = st.columns(3)
+             for i in range(3):
+                 if i < len(sample_images):
+                     with row1[i]:
+                         try:
+                             st.image(str(sample_images[i]), caption=sample_images[i].name, use_container_width=True)
+                         except Exception:
+                             # Silently skip problematic images
+                             pass
+
+             # Second row
+             row2 = st.columns(3)
+             for i in range(3):
+                 idx = i + 3
+                 if idx < len(sample_images):
+                     with row2[i]:
+                         try:
+                             st.image(str(sample_images[idx]), caption=sample_images[idx].name, use_container_width=True)
+                         except Exception:
+                             # Silently skip problematic images
+                             pass
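The sample-picker above deduplicates by file size alone: two files with identical bytes necessarily have identical sizes, so size works as a cheap (though collision-prone) content fingerprint. A standalone sketch of that heuristic, with a hypothetical `dedupe_by_size` name not used in `app.py`:

```python
from pathlib import Path

def dedupe_by_size(paths):
    """Keep only the first file seen for each distinct byte size.

    Identical copies of an image stored under different names share a
    size, so this drops likely duplicates without hashing file contents.
    Distinct images that happen to share a size are dropped too - the
    trade-off the app accepts for sample thumbnails.
    """
    seen_sizes = set()
    unique = []
    for p in paths:
        size = p.stat().st_size
        if size not in seen_sizes:
            seen_sizes.add(size)
            unique.append(p)
    return unique
```

For a gallery of sample images the occasional false positive is harmless; hashing with `hashlib` would be the stricter alternative if exactness mattered.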
ocr_utils.py CHANGED
@@ -125,7 +125,7 @@ def ocr_response_to_json(ocr_response, indent: int = 4) -> str:
     response_dict = json.loads(ocr_response.model_dump_json())
     return json.dumps(response_dict, indent=indent)
 
-def get_combined_markdown_compressed(ocr_response, max_width: int = 800, quality: int = 85) -> str:
+def get_combined_markdown_compressed(ocr_response, max_width: int = 1200, quality: int = 92) -> str:
     """
     Combine OCR text and images into a single markdown document with compressed images.
     Reduces image sizes to improve performance.
@@ -209,4 +209,163 @@ try:
     display(Markdown(combined_markdown))
 except ImportError:
     # IPython not available
-    pass
+    pass
+
+def create_html_with_images(result_with_pages: dict) -> str:
+    """
+    Create HTML with embedded images from the OCR result.
+
+    Args:
+        result_with_pages: OCR result with pages_data containing markdown and images
+
+    Returns:
+        HTML string with embedded images
+    """
+    if not result_with_pages.get('has_images', False) or 'pages_data' not in result_with_pages:
+        return "<p>No images available in the document.</p>"
+
+    # Create HTML document
+    html = """<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>Document with Images</title>
+    <style>
+        body {
+            font-family: 'Georgia', serif;
+            line-height: 1.6;
+            margin: 0;
+            padding: 20px;
+            background-color: #f9f9f9;
+            color: #333;
+        }
+        .container {
+            max-width: 1000px;
+            margin: 0 auto;
+            background-color: #fff;
+            padding: 30px;
+            border-radius: 8px;
+            box-shadow: 0 4px 12px rgba(0,0,0,0.1);
+        }
+        h1, h2, h3, h4 {
+            font-family: 'Bookman', 'Georgia', serif;
+            margin-top: 1.5em;
+            margin-bottom: 0.5em;
+            color: #222;
+        }
+        h1 { font-size: 2.2em; border-bottom: 2px solid #e0e0e0; padding-bottom: 10px; }
+        h2 { font-size: 1.8em; border-bottom: 1px solid #e0e0e0; padding-bottom: 6px; }
+        h3 { font-size: 1.5em; }
+        h4 { font-size: 1.2em; }
+        p { margin-bottom: 1.2em; text-align: justify; }
+        img {
+            max-width: 100%;
+            height: auto;
+            margin: 20px 0;
+            border: 1px solid #ddd;
+            border-radius: 6px;
+            box-shadow: 0 3px 6px rgba(0,0,0,0.1);
+            display: block;
+        }
+        .page {
+            margin-bottom: 40px;
+            padding-bottom: 30px;
+            border-bottom: 1px dashed #ccc;
+        }
+        .page:last-child {
+            border-bottom: none;
+        }
+        .page-title {
+            text-align: center;
+            color: #555;
+            font-style: italic;
+            margin: 30px 0;
+        }
+        pre {
+            background-color: #f5f5f5;
+            padding: 15px;
+            border-radius: 5px;
+            overflow-x: auto;
+            font-size: 14px;
+            line-height: 1.4;
+        }
+        blockquote {
+            border-left: 3px solid #ccc;
+            margin: 1.5em 0;
+            padding: 0.5em 1.5em;
+            background-color: #f5f5f5;
+            font-style: italic;
+        }
+        .poem {
+            font-family: 'Baskerville', 'Georgia', serif;
+            margin-left: 2em;
+            line-height: 1.8;
+            white-space: pre-wrap;
+        }
+    </style>
+</head>
+<body>
+    <div class="container">
+"""
+
+    # Process each page
+    pages_data = result_with_pages.get('pages_data', [])
+    for page_idx, page in enumerate(pages_data):
+        page_number = page.get('page_number', page_idx + 1)
+        page_markdown = page.get('markdown', '')
+        page_images = page.get('images', [])
+
+        # Add page header
+        html += f'<div class="page" id="page-{page_number}">\n'
+        if len(pages_data) > 1:
+            html += f'<div class="page-title">Page {page_number}</div>\n'
+
+        # Process markdown text and replace image references
+        if page_markdown:
+            # Replace image markers with actual images
+            for img in page_images:
+                img_id = img.get('id', '')
+                img_base64 = img.get('image_base64', '')
+
+                if img_id and img_base64:
+                    # Format image tag
+                    img_tag = f'<img src="{img_base64}" alt="Image {img_id}" loading="lazy">'
+                    # Replace markdown image reference with HTML image
+                    page_markdown = page_markdown.replace(f'![{img_id}]({img_id})', img_tag)
+
+            # Convert line breaks to <p> tags for proper HTML formatting
+            paragraphs = page_markdown.split('\n\n')
+            for paragraph in paragraphs:
+                if paragraph.strip():
+                    # Check if this looks like a header
+                    if paragraph.startswith('# '):
+                        header_text = paragraph[2:].strip()
+                        html += f'<h1>{header_text}</h1>\n'
+                    elif paragraph.startswith('## '):
+                        header_text = paragraph[3:].strip()
+                        html += f'<h2>{header_text}</h2>\n'
+                    elif paragraph.startswith('### '):
+                        header_text = paragraph[4:].strip()
+                        html += f'<h3>{header_text}</h3>\n'
+                    else:
+                        html += f'<p>{paragraph}</p>\n'
+
+        # Add any images that weren't referenced in the markdown
+        referenced_img_ids = [img.get('id') for img in page_images if img.get('id') in page_markdown]
+        for img in page_images:
+            img_id = img.get('id', '')
+            img_base64 = img.get('image_base64', '')
+
+            if img_id and img_base64 and img_id not in referenced_img_ids:
+                html += f'<img src="{img_base64}" alt="Image {img_id}" loading="lazy">\n'
+
+        # Close page div
+        html += '</div>\n'
+
+    # Close main container and document
+    html += """    </div>
+</body>
+</html>"""
+
+    return html
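The core of the new `create_html_with_images` helper is a plain string substitution: the OCR markdown references extracted images as `![img-id](img-id)`, and each reference is swapped for an inline `<img>` tag carrying the base64 data URI. A minimal, self-contained sketch of just that step (the `embed_images` name is illustrative, not part of the module's API):

```python
def embed_images(markdown: str, images: list) -> str:
    """Replace markdown image references with inline HTML <img> tags.

    Each image dict is expected to carry an 'id' matching the markdown
    reference and an 'image_base64' data URI, mirroring the pages_data
    layout produced by structured_ocr.py.
    """
    for img in images:
        img_id = img.get('id', '')
        img_base64 = img.get('image_base64', '')
        if img_id and img_base64:
            # The OCR markdown uses the image id as both alt text and target
            tag = f'<img src="{img_base64}" alt="Image {img_id}" loading="lazy">'
            markdown = markdown.replace(f'![{img_id}]({img_id})', tag)
    return markdown
```

Because the id appears in the replacement tag's `alt` attribute, the later `img.get('id') in page_markdown` check in the helper still recognizes replaced images as "referenced" and does not append them a second time.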
structured_ocr.py CHANGED
@@ -238,8 +238,31 @@ class StructuredOCR:
         # Add confidence score
         result['confidence_score'] = confidence_score
 
-        # Store the raw OCR response for image rendering
-        result['raw_response'] = pdf_response
+        # Store key parts of the OCR response for image rendering
+        # Extract and store image data in a format that can be serialized to JSON
+        has_images = hasattr(pdf_response, 'pages') and any(hasattr(page, 'images') and page.images for page in pdf_response.pages)
+        result['has_images'] = has_images
+
+        if has_images:
+            # Create a structured representation of images that can be serialized
+            result['pages_data'] = []
+            for page_idx, page in enumerate(pdf_response.pages):
+                page_data = {
+                    'page_number': page_idx + 1,
+                    'markdown': page.markdown if hasattr(page, 'markdown') else '',
+                    'images': []
+                }
+
+                # Extract images if present
+                if hasattr(page, 'images') and page.images:
+                    for img_idx, img in enumerate(page.images):
+                        if hasattr(img, 'image_base64') and img.image_base64:
+                            page_data['images'].append({
+                                'id': img.id if hasattr(img, 'id') else f"img_{page_idx}_{img_idx}",
+                                'image_base64': img.image_base64
+                            })
+
+                result['pages_data'].append(page_data)
 
         logger.info(f"PDF processing completed successfully")
         return result
@@ -300,8 +323,31 @@ class StructuredOCR:
         # Add confidence score
         result['confidence_score'] = confidence_score
 
-        # Store the raw OCR response for image rendering
-        result['raw_response'] = image_response
+        # Store key parts of the OCR response for image rendering
+        # Extract and store image data in a format that can be serialized to JSON
+        has_images = hasattr(image_response, 'pages') and image_response.pages and hasattr(image_response.pages[0], 'images') and image_response.pages[0].images
+        result['has_images'] = has_images
+
+        if has_images:
+            # Create a structured representation of images that can be serialized
+            result['pages_data'] = []
+            for page_idx, page in enumerate(image_response.pages):
+                page_data = {
+                    'page_number': page_idx + 1,
+                    'markdown': page.markdown if hasattr(page, 'markdown') else '',
+                    'images': []
+                }
+
+                # Extract images if present
+                if hasattr(page, 'images') and page.images:
+                    for img_idx, img in enumerate(page.images):
+                        if hasattr(img, 'image_base64') and img.image_base64:
+                            page_data['images'].append({
+                                'id': img.id if hasattr(img, 'id') else f"img_{page_idx}_{img_idx}",
+                                'image_base64': img.image_base64
+                            })
+
+                result['pages_data'].append(page_data)
 
         logger.info("Image processing completed successfully")
         return result
@@ -336,7 +382,10 @@ class StructuredOCR:
                         f"This is a historical document's OCR in markdown:\n"
                         f"<BEGIN_IMAGE_OCR>\n{ocr_markdown}\n<END_IMAGE_OCR>.\n"
                         f"Convert this into a structured JSON response with the OCR contents in a sensible dictionary. "
-                        f"Extract topics, languages, and organize the content logically."
+                        f"Extract topics, languages, document type, date if present, and key entities. "
+                        f"For handwritten documents, carefully preserve the structure. "
+                        f"For printed texts, organize content logically by sections, maintaining the hierarchy. "
+                        f"For tabular content, preserve the table structure as much as possible."
                     ))
                 ],
             },
@@ -371,7 +420,10 @@ class StructuredOCR:
                     "content": f"This is a historical document's OCR in markdown:\n"
                                f"<BEGIN_IMAGE_OCR>\n{ocr_markdown}\n<END_IMAGE_OCR>.\n"
                                f"Convert this into a structured JSON response with the OCR contents. "
-                               f"Extract topics, languages, and organize the content logically."
+                               f"Extract topics, languages, document type, date if present, and key entities. "
+                               f"For handwritten documents, carefully preserve the structure. "
+                               f"For printed texts, organize content logically by sections. "
+                               f"For tabular content, preserve the table structure as much as possible."
                 },
            ],
            response_format=StructuredOCRModel,
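The point of replacing `result['raw_response']` with `has_images`/`pages_data` is serializability: an SDK response object cannot be passed to `json.dumps`, while a plain dict of page numbers, markdown strings, and base64 data URIs can. A toy demonstration of the difference (`FakeResponse` is a stand-in, not the real Mistral SDK class):

```python
import json

class FakeResponse:
    """Stand-in for an SDK response object with non-primitive internals."""
    pass

# Storing the raw object makes the whole result dict un-serializable.
try:
    json.dumps({'raw_response': FakeResponse()})
    raw_serializable = True
except TypeError:
    raw_serializable = False

# The extracted pages_data structure is built only from JSON-native types,
# so the result can be cached, downloaded, or re-rendered later.
pages_data = [{'page_number': 1, 'markdown': '# Title', 'images': []}]
encoded = json.dumps({'has_images': False, 'pages_data': pages_data})
```

This is why the diff walks the response with `hasattr` checks and copies only `markdown`, `id`, and `image_base64` into plain dicts instead of keeping the object itself.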