milwright committed
Commit aaf0eac · 1 Parent(s): c9c9ec7

Streamline app architecture and improve image processing

Remove educational components in favor of a single, robust application. Enhance image preprocessing with rotation detection, error handling, and API retries. Update documentation to reflect the new project structure.
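The retry behavior mentioned above lands in code not shown in full in this diff, so here is a minimal sketch of the usual exponential-backoff-with-jitter pattern for flaky API calls; the function names are hypothetical, not the app's actual API.

```python
import time
import random

def call_with_retries(fn, max_retries=3, base_delay=1.0):
    """Call fn(), retrying transient failures with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: let the caller see the error
            # 1s, 2s, 4s, ... plus jitter so parallel clients don't retry in lockstep
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# A call that fails twice and then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API error")
    return "ok"

print(call_with_retries(flaky, base_delay=0))
```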

Files changed (10):
  1. README.md          +29 -28
  2. app.py             +726 -471
  3. config.py          +10 -5
  4. prepare_for_hf.py  +8 -25
  5. process_file.py    +6 -1
  6. requirements.txt   +3 -2
  7. run_local.sh       +3 -8
  8. simple_test.py     +13 -4
  9. structured_ocr.py  +190 -41
  10. ui/custom.css     +40 -0
README.md CHANGED
@@ -38,43 +38,32 @@ The project is organized as follows:
  ```
  Historical OCR - Project Structure

- ┌─ Main Applications
- ├─ app.py                 # Standard Streamlit interface for OCR processing
- │  └─ streamlit_app.py    # Educational modular version with learning components
+ ┌─ Main Application
+ └─ app.py                 # Streamlit interface for OCR processing

  ├─ Core Functionality
  │  ├─ structured_ocr.py   # Main OCR processing engine with Mistral AI integration
  │  ├─ ocr_utils.py        # Utility functions for OCR text and image processing
  │  ├─ pdf_ocr.py          # PDF-specific document processing functionality
- └─ config.py              # Configuration settings and API keys
+ ├─ config.py              # Configuration settings and API keys
+ │  └─ process_file.py     # File processing utilities

  ├─ Testing & Development
  │  ├─ simple_test.py         # Basic OCR functionality test
  │  ├─ test_pdf.py            # PDF processing test
  │  ├─ test_pdf_preview.py    # PDF preview generation test
+ │  ├─ test_pdf_handling.py   # PDF handling test
+ │  ├─ test_image_formats.py  # Image format compatibility test
  │  └─ prepare_for_hf.py      # Prepare project for Hugging Face deployment

  ├─ Scripts
- │  ├─ run_local.sh        # Launch standard or educational app locally
+ │  ├─ run_local.sh        # Launch app locally
  │  ├─ run_large_files.sh  # Process large documents with optimized settings
  │  └─ setup_git.sh        # Configure Git repositories

- ├─ Educational Modules (streamlit/)
- │  ├─ modules/
- │  │  ├─ module1.py       # Introduction and Problematization
- │  │  ├─ module2.py       # Historical Typography & OCR Challenges
- │  │  ├─ module3.py       # Document Analysis Techniques
- │  │  ├─ module4.py       # Processing Methods
- │  │  ├─ module5.py       # Research Applications
- │  │  └─ module6.py       # Future Directions
- │  │
- │  ├─ modular_app.py      # Learning module framework
- │  ├─ layout.py           # UI components for educational interface
- │  └─ process_file.py     # File processing for educational app
-
- ├─ UI Components (ui/)
- │  ├─ layout.py           # Shared UI components and styling
- │  └─ custom.css          # Custom styling for the application
+ ├─ UI Components
+ │  ├─ ui/layout.py        # UI components and styling
+ │  └─ ui/custom.css       # Custom styling for the application

  ├─ Data Directories
  │  ├─ input/              # Sample documents for testing/demo
@@ -93,7 +82,6 @@ Historical OCR - Project Structure
     - On macOS: `brew install poppler`
     - On Ubuntu/Debian: `apt-get install poppler-utils`
     - On Windows: Download from [poppler releases](https://github.com/oschwartz10612/poppler-windows/releases/) and add to PATH
-    - For text recognition: `tesseract-ocr`
  3. Install Python dependencies:
  ```
  pip install -r requirements.txt
@@ -107,7 +95,13 @@ pip install -r requirements.txt
  ```
  export MISTRAL_API_KEY=your_api_key_here
  ```
+ - Option 3: Test if your API key is working correctly:
+ ```
+ python test_api_key.py
+ ```
  - Get your API key from [Mistral AI Console](https://console.mistral.ai/api-keys/)
+
+ **Important**: Make sure your API key is correctly formatted, with no extra spaces, newlines, or other characters. The application requires a valid Mistral API key with access to the OCR API.
  5. Run the Streamlit app using the script:
  ```
  ./run_local.sh
@@ -137,16 +131,23 @@ The application provides several specialized features for historical document pr
  4. **Typography**: Historical-appropriate fonts and styling for better readability of historical texts
  5. **Document Export**: Download options for saving the processed document in HTML format

- ## Application Versions
-
- Two versions of the application are available:
-
- 1. **Standard Version** (`app.py`): Focused on document processing with a clean interface
- 2. **Educational Version** (`streamlit_app.py`): Enhanced with educational modules and interactive components
-
- To run the educational version:
+ ## Testing
+
+ Run the test suite to ensure proper functionality:
+
+ ```
+ python simple_test.py        # Basic OCR testing
+ python test_pdf.py           # PDF processing testing
+ python test_image_formats.py # Test image format handling
+ python test_pdf_handling.py  # Test PDF handling
+ ```
+
+ ## Large File Processing
+
+ For processing large files, use the specialized script:
+
  ```
- streamlit run streamlit_app.py
+ ./run_large_files.sh --server.maxUploadSize=500 --server.maxMessageSize=500
  ```

  ## Deployment on Hugging Face Spaces
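The README now warns about stray whitespace or newlines in `MISTRAL_API_KEY`. The actual contents of `test_api_key.py` are not shown in this commit, but a formatting sanity check along these lines is easy to sketch (the helper name and messages here are assumptions, not the project's real code):

```python
import os

def check_api_key(raw):
    """Flag common formatting problems before sending a key to the API."""
    problems = []
    if not raw:
        problems.append("MISTRAL_API_KEY is not set")
        return problems
    if raw != raw.strip():
        problems.append("key has leading/trailing whitespace")
    if "\n" in raw or "\r" in raw:
        problems.append("key contains a newline")
    if " " in raw.strip():
        problems.append("key contains embedded spaces")
    return problems

if __name__ == "__main__":
    issues = check_api_key(os.environ.get("MISTRAL_API_KEY", ""))
    print("OK" if not issues else "; ".join(issues))
```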
app.py CHANGED
@@ -7,7 +7,8 @@ from pathlib import Path
import tempfile
import io
from pdf2image import convert_from_bytes
- from PIL import Image, ImageEnhance, ImageFilter
import cv2
import numpy as np
 
@@ -15,12 +16,12 @@ import numpy as np
from structured_ocr import StructuredOCR
from config import MISTRAL_API_KEY

- # Check for modular UI components
try:
-     from ui.layout import tool_container, key_concept, research_question
-     MODULAR_UI = True
except ImportError:
-     MODULAR_UI = False

# Set page configuration
st.set_page_config(
@@ -40,57 +41,116 @@ def convert_pdf_to_images(pdf_bytes, dpi=150):
        st.error(f"Error converting PDF: {str(e)}")
        return []

@st.cache_data(ttl=3600, show_spinner=False)
def preprocess_image(image_bytes, preprocessing_options):
    """Preprocess image with selected options"""
-     # Convert bytes to OpenCV format
-     image = Image.open(io.BytesIO(image_bytes))
-     # Ensure image is in RGB mode for OpenCV processing
-     if image.mode != 'RGB':
-         image = image.convert('RGB')
-     img_array = np.array(image)
-
-     # Apply preprocessing based on selected options
-     if preprocessing_options.get("grayscale", False):
-         img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
-         img_array = cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB)
-
-     if preprocessing_options.get("contrast", 0) != 0:
-         contrast_factor = 1 + (preprocessing_options.get("contrast", 0) / 10)
-         image = Image.fromarray(img_array)
-         enhancer = ImageEnhance.Contrast(image)
-         image = enhancer.enhance(contrast_factor)
-         img_array = np.array(image)
-
-     if preprocessing_options.get("denoise", False):
-         # Ensure the image is in the correct format for denoising (CV_8UC3)
-         if len(img_array.shape) != 3 or img_array.shape[2] != 3:
-             # Convert to RGB if it's not already a 3-channel color image
-             if len(img_array.shape) == 2:  # Grayscale
-                 img_array = cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB)
-         img_array = cv2.fastNlMeansDenoisingColored(img_array, None, 10, 10, 7, 21)

-     if preprocessing_options.get("threshold", False):
-         # Convert to grayscale if not already
-         if len(img_array.shape) == 3:
-             gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
-         else:
-             gray = img_array
-         # Apply adaptive threshold
-         binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
-                                        cv2.THRESH_BINARY, 11, 2)
-         # Convert back to RGB
-         img_array = cv2.cvtColor(binary, cv2.COLOR_GRAY2RGB)

-     # Convert back to PIL Image
-     processed_image = Image.fromarray(img_array)
-
-     # Convert to bytes
-     byte_io = io.BytesIO()
-     processed_image.save(byte_io, format='PNG')
-     byte_io.seek(0)
-
-     return byte_io.getvalue()

# Define functions
def process_file(uploaded_file, use_vision=True, preprocessing_options=None):
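The contrast option above maps the UI slider's integer (-5 to +5) to a `PIL.ImageEnhance.Contrast` factor via `1 + n/10`. That mapping, isolated for a quick standalone check:

```python
def contrast_factor(slider_value):
    """Map the UI contrast slider (-5..+5) to a PIL ImageEnhance.Contrast factor."""
    return 1 + (slider_value / 10)

# 0 leaves the image unchanged (factor 1.0); -5 reduces contrast to 0.5x; +5 boosts it to 1.5x
print(contrast_factor(0), contrast_factor(-5), contrast_factor(5))
```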
@@ -120,13 +180,28 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None):
        # Return dummy data if no API key
        progress_bar.progress(100)
        status_text.empty()
        return {
            "file_name": uploaded_file.name,
-             "topics": ["Sample Document"],
            "languages": ["English"],
            "ocr_contents": {
-                 "title": "Sample Document",
-                 "content": "This is sample content. To process real documents, please set the MISTRAL_API_KEY environment variable."
            }
        }
 
@@ -134,22 +209,51 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None):
        progress_bar.progress(20)
        status_text.text("Initializing OCR processor...")

-         # Initialize OCR processor
-         processor = StructuredOCR()

        # Determine file type from extension
        file_ext = Path(uploaded_file.name).suffix.lower()
        file_type = "pdf" if file_ext == ".pdf" else "image"

        # Apply preprocessing if needed
        if any(preprocessing_options.values()) and file_type == "image":
            status_text.text("Applying image preprocessing...")
-             processed_bytes = preprocess_image(uploaded_file.getvalue(), preprocessing_options)
-
-             # Save processed image to temp file
-             with tempfile.NamedTemporaryFile(delete=False, suffix=Path(uploaded_file.name).suffix) as proc_tmp:
-                 proc_tmp.write(processed_bytes)
-                 temp_path = proc_tmp.name

        # Get file size in MB
        file_size_mb = os.path.getsize(temp_path) / (1024 * 1024)
@@ -183,6 +287,12 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None):
        progress_bar.progress(100)
        status_text.empty()

        return result
    except Exception as e:
        progress_bar.progress(100)
@@ -194,25 +304,23 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None):
        if os.path.exists(temp_path):
            os.unlink(temp_path)

# App title and description
st.title("Historical Document OCR")
- st.subheader("Powered by Mistral AI")

- # Create main layout with tabs and columns
- main_tab1, main_tab2 = st.tabs(["Document Processing", "About this App"])

- with main_tab1:
-     # Create a two-column layout for file upload and preview
-     upload_col, preview_col = st.columns([1, 1])
-
-     # File uploader in the left column
-     with upload_col:
-         st.markdown("""
-         Upload an image or PDF file to get started.
-
-         Using the `mistral-ocr-latest` model for advanced document understanding.
-         """)
-         uploaded_file = st.file_uploader("Choose a file", type=["pdf", "png", "jpg", "jpeg"])

# Sidebar with options
with st.sidebar:
@@ -221,9 +329,9 @@ with st.sidebar:
    # Model options
    st.subheader("Model Settings")
    use_vision = st.checkbox("Use Vision Model", value=True,
-                              help="For image files, use the vision model for improved analysis (may be slower)")

-     # Image preprocessing options (collapsible)
    st.subheader("Image Preprocessing")
    with st.expander("Preprocessing Options"):
        preprocessing_options = {}
@@ -235,21 +343,134 @@ with st.sidebar:
                                                      help="Remove noise from the image")
        preprocessing_options["contrast"] = st.slider("Adjust Contrast", -5, 5, 0,
                                                      help="Adjust image contrast (-5 to +5)")

-     # PDF options (collapsible)
    st.subheader("PDF Options")
    with st.expander("PDF Settings"):
        pdf_dpi = st.slider("PDF Resolution (DPI)", 72, 300, 150,
                            help="Higher DPI gives better quality but slower processing")
-         max_pages = st.number_input("Maximum Pages to Process", 1, 20, 5,
                            help="Limit number of pages to process")

- # About tab content
with main_tab2:
    st.markdown("""
    ### About This Application

-     This app uses [Mistral AI's Document OCR](https://docs.mistral.ai/capabilities/document/) to extract text and images from historical documents with enhanced formatting and presentation.

    It can process:
    - Image files (jpg, png, etc.)
@@ -266,427 +487,461 @@ with main_tab2:
    - **Raw JSON**: Complete data structure for developers
    - **With Images**: Document with embedded images preserving original layout

-     **Special Features:**
-     - **Poetry Formatting**: Special handling for poem structure with proper line spacing
-     - **Image Embedding**: Original document images embedded at correct positions
-     - **Multi-page Support**: Pagination controls for navigating multi-page documents
-     - **Typography**: Historical-appropriate fonts for better readability
-     - **Document Export**: Download options for saving in HTML format
-
-     **Technical Features:**
-     - Image preprocessing for better OCR quality
-     - PDF resolution and page controls
-     - Progress tracking during processing
-     - Responsive design optimized for historical document presentation
    """)

with main_tab1:
-     if uploaded_file is not None:
-         # Check file size (cap at 20MB)
-         file_size_mb = len(uploaded_file.getvalue()) / (1024 * 1024)
-
-         if file_size_mb > 20:
-             with upload_col:
-                 st.error(f"File too large ({file_size_mb:.1f} MB). Maximum file size is 20MB.")
-                 st.stop()
-
-         file_ext = Path(uploaded_file.name).suffix.lower()
-
-         # Display document preview in preview column
-         with preview_col:
            st.subheader("Document Preview")
-             if file_ext == ".pdf":
                try:
-                     # Convert first page of PDF to image for preview
                    pdf_bytes = uploaded_file.getvalue()
-                     images = convert_from_bytes(pdf_bytes, first_page=1, last_page=1, dpi=150)

                    if images:
-                         # Convert PIL image to bytes for Streamlit
                        first_page = images[0]
                        img_bytes = io.BytesIO()
                        first_page.save(img_bytes, format='JPEG')
                        img_bytes.seek(0)

-                         # Display the PDF preview
-                         st.image(img_bytes, caption=f"PDF Preview: {uploaded_file.name}", use_container_width=True)
                    else:
-                         st.info(f"PDF uploaded: {uploaded_file.name}")
                except Exception:
-                     # Simply show the file name without an error message
-                     st.info(f"PDF uploaded: {uploaded_file.name}")
-                     st.info("Click 'Process Document' to analyze the content.")
            else:
-                 st.image(uploaded_file, use_container_width=True)
-
-             # Add image preprocessing preview in a collapsible section if needed
-             if any(preprocessing_options.values()) and uploaded_file.type.startswith('image/'):
-                 with st.expander("Image Preprocessing Preview"):
-                     preview_cols = st.columns(2)

-                     with preview_cols[0]:
-                         st.markdown("**Original Image**")
-                         st.image(uploaded_file, use_container_width=True)

-                     with preview_cols[1]:
-                         st.markdown("**Preprocessed Image**")
-                         try:
-                             processed_bytes = preprocess_image(uploaded_file.getvalue(), preprocessing_options)
-                             st.image(io.BytesIO(processed_bytes), use_container_width=True)
-                         except Exception as e:
-                             st.error(f"Error in preprocessing: {str(e)}")
-
-         # Process button - flush left with similar padding as file browser
-         with upload_col:
-             process_button = st.button("Process Document", use_container_width=True)
-
-         # Results section
-         if process_button:
-             try:
-                 # Get max_pages or default if not available
-                 max_pages_value = max_pages if 'max_pages' in locals() else None

-                 # Call process_file with all options
-                 result = process_file(uploaded_file, use_vision, preprocessing_options)

-                 # Single tab for document analysis
-                 with st.container():
-                     # Create two columns for metadata and content
-                     meta_col, content_col = st.columns([1, 2])

-                     with meta_col:
-                         st.subheader("Document Metadata")
-                         st.success("**Document processed successfully**")
-
-                         # Display file info
-                         st.write(f"**File Name:** {result.get('file_name', uploaded_file.name)}")
-
-                         # Display info if only limited pages were processed
-                         if 'limited_pages' in result:
-                             st.info(f"Processed {result['limited_pages']['processed']} of {result['limited_pages']['total']} pages")
-
-                         # Display languages if available
-                         if 'languages' in result:
-                             languages = [lang for lang in result['languages'] if lang is not None]
-                             if languages:
-                                 st.write(f"**Languages:** {', '.join(languages)}")

-                         # Confidence score if available
-                         if 'confidence_score' in result:
-                             confidence = result['confidence_score']
-                             st.write(f"**OCR Confidence:** {confidence:.1%}")
-
-                         # Display topics if available
-                         if 'topics' in result and result['topics']:
-                             st.write(f"**Topics:** {', '.join(result['topics'])}")
-
-                     with content_col:
-                         st.subheader("Document Contents")
-                         if 'ocr_contents' in result:
-                             # Check if there are images in the OCR result
-                             has_images = result.get('has_images', False)

-                             # Create tabs for different views
-                             if has_images:
-                                 view_tab1, view_tab2, view_tab3 = st.tabs(["Structured View", "Raw JSON", "With Images"])
                            else:
-                                 view_tab1, view_tab2 = st.tabs(["Structured View", "Raw JSON"])
-
-                             with view_tab1:
-                                 # Display in a more user-friendly format based on the content structure
-                                 html_content = '<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="UTF-8">\n<meta name="viewport" content="width=device-width, initial-scale=1.0">\n<title>OCR Document</title>\n<style>\n'
-                                 html_content += """
-                                 body {
-                                     font-family: 'Georgia', serif;
-                                     line-height: 1.6;
-                                     margin: 0;
-                                     padding: 20px;
-                                     background-color: #f9f9f9;
-                                     color: #333;
-                                 }
-                                 .container {
-                                     max-width: 1000px;
-                                     margin: 0 auto;
-                                     background-color: #fff;
-                                     padding: 30px;
-                                     border-radius: 8px;
-                                     box-shadow: 0 4px 12px rgba(0,0,0,0.1);
-                                 }
-                                 h1, h2, h3, h4 {
-                                     font-family: 'Bookman', 'Georgia', serif;
-                                     margin-top: 1.5em;
-                                     margin-bottom: 0.5em;
-                                     color: #222;
-                                 }
-                                 h1 { font-size: 2.2em; border-bottom: 2px solid #e0e0e0; padding-bottom: 10px; }
-                                 h2 { font-size: 1.8em; border-bottom: 1px solid #e0e0e0; padding-bottom: 6px; }
-                                 h3 { font-size: 1.5em; }
-                                 h4 { font-size: 1.2em; }
-                                 p { margin-bottom: 1.2em; text-align: justify; }
-                                 ul, ol { margin-bottom: 1.5em; }
-                                 li { margin-bottom: 0.5em; }
-                                 .poem {
-                                     font-family: 'Baskerville', 'Georgia', serif;
-                                     margin-left: 2em;
-                                     line-height: 1.8;
-                                     white-space: pre-wrap;
-                                 }
-                                 .subtitle {
-                                     font-style: italic;
-                                     font-size: 1.1em;
-                                     margin-bottom: 1.5em;
-                                     color: #555;
-                                 }
-                                 blockquote {
-                                     border-left: 3px solid #ccc;
-                                     margin: 1.5em 0;
-                                     padding: 0.5em 1.5em;
-                                     background-color: #f5f5f5;
-                                     font-style: italic;
-                                 }
-                                 dl {
-                                     margin-bottom: 1.5em;
-                                 }
-                                 dt {
-                                     font-weight: bold;
-                                     margin-top: 1em;
-                                 }
-                                 dd {
-                                     margin-left: 2em;
-                                     margin-bottom: 0.5em;
-                                 }
-                                 </style>
-                                 </head>
-                                 <body>
-                                 <div class="container">
-                                 """

-                                 if isinstance(result['ocr_contents'], dict):
-                                     for section, content in result['ocr_contents'].items():
-                                         if not content:  # Skip empty sections
-                                             continue
-
-                                         section_title = section.replace('_', ' ').title()
-
-                                         # Special handling for title and subtitle
-                                         if section.lower() == 'title':
-                                             html_content += f'<h1>{content}</h1>\n'
-                                             st.markdown(f"## {content}")
-                                         elif section.lower() == 'subtitle':
-                                             html_content += f'<div class="subtitle">{content}</div>\n'
-                                             st.markdown(f"*{content}*")
                                        else:
-                                             # Section headers for non-title sections
-                                             html_content += f'<h3>{section_title}</h3>\n'
-                                             st.markdown(f"### {section_title}")
-
-                                         # Process different content types
-                                         if isinstance(content, str):
-                                             # Handle poem type specifically
-                                             if section.lower() == 'type' and content.lower() == 'poem':
-                                                 # Don't add special formatting here, just for the lines
-                                                 st.markdown(content)
-                                                 html_content += f'<p>{content}</p>\n'
-                                             elif 'content' in result['ocr_contents'] and isinstance(result['ocr_contents']['content'], dict) and 'type' in result['ocr_contents']['content'] and result['ocr_contents']['content']['type'] == 'poem' and section.lower() == 'content':
-                                                 # This is handled in the dict case below
-                                                 pass
-                                             else:
-                                                 # Regular text content
-                                                 paragraphs = content.split('\n\n')
-                                                 for p in paragraphs:
-                                                     if p.strip():
-                                                         html_content += f'<p>{p.strip()}</p>\n'
-                                                 st.markdown(content)
-
-                                         elif isinstance(content, list):
-                                             # Handle lists (bullet points, etc.)
-                                             html_content += '<ul>\n'
-                                             for item in content:
-                                                 if isinstance(item, str):
-                                                     html_content += f'<li>{item}</li>\n'
-                                                     st.markdown(f"- {item}")
-                                                 elif isinstance(item, dict):
-                                                     # Format dictionary items in a readable way
-                                                     html_content += f'<li><pre>{json.dumps(item, indent=2)}</pre></li>\n'
-                                                     st.json(item)
-                                             html_content += '</ul>\n'
-
-                                         elif isinstance(content, dict):
-                                             # Special handling for poem type
-                                             if 'type' in content and content['type'] == 'poem' and 'lines' in content:
-                                                 html_content += '<div class="poem">\n'
-                                                 for line in content['lines']:
-                                                     html_content += f'{line}\n'
-                                                     st.markdown(line)
-                                                 html_content += '</div>\n'
-                                             else:
-                                                 # Regular dictionary display
-                                                 html_content += '<dl>\n'
-                                                 for k, v in content.items():
-                                                     html_content += f'<dt>{k}</dt>\n<dd>'
-                                                     if isinstance(v, str):
-                                                         html_content += v
-                                                     elif isinstance(v, list):
-                                                         html_content += ', '.join(str(item) for item in v)
-                                                     else:
-                                                         html_content += str(v)
-                                                     html_content += '</dd>\n'
-                                                     st.markdown(f"**{k}:** {v}")
-                                                 html_content += '</dl>\n'

-                                 # Close HTML document
-                                 html_content += '</div>\n</body>\n</html>'

-                                 # Add download button in a smaller section
-                                 with st.expander("Export Content"):
-                                     # Alternative download button
-                                     html_bytes = html_content.encode()
-                                     st.download_button(
-                                         label="Download as HTML",
-                                         data=html_bytes,
-                                         file_name="document_content.html",
-                                         mime="text/html"
-                                     )

-                             with view_tab2:
-                                 # Show the raw JSON for developers
-                                 st.json(result)

-                             if has_images:
-                                 with view_tab3:
-                                     # Show loading indicator while preparing images
-                                     with st.spinner("Preparing document with embedded images..."):
-                                         try:
-                                             # Import function
-                                             try:
-                                                 from ocr_utils import create_html_with_images
-                                             except ImportError:
-                                                 st.error("Required module ocr_utils not found.")
-                                                 st.stop()
-
-                                             # Check if has_images flag is set
-                                             if not result.get('has_images', False) or 'pages_data' not in result:
-                                                 st.warning("No image data available in the OCR response.")
-                                                 st.stop()
-
-                                             # Count images in the result
-                                             image_count = 0
-                                             for page in result.get('pages_data', []):
-                                                 image_count += len(page.get('images', []))
-
-                                             # Add warning for image-heavy documents
-                                             if image_count > 10:
-                                                 st.warning(f"This document contains {image_count} images. Rendering may take longer than usual.")
-
-                                             # Generate HTML with images
-                                             html_with_images = create_html_with_images(result)
-
-                                             # For multi-page documents, create page navigation
-                                             page_count = len(result.get('pages_data', []))
-
-                                             if page_count > 1:
-                                                 st.info(f"Document contains {page_count} pages. You can scroll to view all pages or use the page selector below.")
-
-                                                 # Create a page selector
-                                                 page_options = [f"Page {i+1}" for i in range(page_count)]
-                                                 selected_page = st.selectbox("Jump to page:", options=page_options, index=0)
-
-                                                 # Extract page number from selection
-                                                 page_num = int(selected_page.split(" ")[1])
-
-                                                 # Add JavaScript to scroll to the selected page
-                                                 st.markdown(f"""
-                                                 <script>
-                                                 document.addEventListener('DOMContentLoaded', function() {{
-                                                     const element = document.getElementById('page-{page_num}');
-                                                     if (element) {{
-                                                         element.scrollIntoView({{ behavior: 'smooth' }});
-                                                     }}
-                                                 }});
-                                                 </script>
-                                                 """, unsafe_allow_html=True)
-
-                                             # Display the HTML content
-                                             st.components.v1.html(html_with_images, height=600, scrolling=True)
-
-                                             # Add download button for the content with images
-                                             st.download_button(
-                                                 label="Download with Images (HTML)",
-                                                 data=html_with_images,
-                                                 file_name="document_with_images.html",
-                                                 mime="text/html"
-                                             )
-
-                                         except Exception as e:
-                                             st.error(f"Could not display document with images: {str(e)}")
-                                             st.info("Try refreshing or processing the document again.")
-                         else:
-                             st.error("No OCR content was extracted from the document.")

-             except Exception as e:
-                 st.error(f"Error processing document: {str(e)}")
-     else:
-         # Display sample images in the main area when no file is uploaded
-         st.info("Upload a document to get started using the file uploader above.")
-
-         # Show example images in a grid
-         # Add a sample images container
-         with st.container():
-             # Find sample images from the input directory to display
-             input_dir = Path(__file__).parent / "input"
-             sample_images = []
-             if input_dir.exists():
-                 # Get all potential image files - exclude PDF files
-                 all_images = []
-                 all_images.extend(list(input_dir.glob("*.jpg")))
-                 all_images.extend(list(input_dir.glob("*.jpeg")))
-                 all_images.extend(list(input_dir.glob("*.png")))
-
-                 # Filter to get a good set of diverse images - not too small, not too large
-                 valid_images = [path for path in all_images if 50000 < path.stat().st_size < 1000000]
-
-                 # Deduplicate any images that might have the same content (like recipe and historical-recipe)
-                 seen_sizes = {}
-                 deduplicated_images = []
-                 for img in valid_images:
-                     size = img.stat().st_size
-                     # If we haven't seen this exact file size before, include it
-                     # This simple heuristic works well enough for images with identical content
-                     if size not in seen_sizes:
-                         seen_sizes[size] = True
-                         deduplicated_images.append(img)

-                 valid_images = deduplicated_images
-
-                 # Select a random sample of 6 images if we have enough
-                 import random
-                 if len(valid_images) > 6:
-                     sample_images = random.sample(valid_images, 6)
-                 else:
-                     sample_images = valid_images

-             if sample_images:
-                 # Create two rows of three columns
-
-                 # First row
-                 row1 = st.columns(3)
-                 for i in range(3):
-                     if i < len(sample_images):
-                         with row1[i]:
-                             try:
-                                 st.image(str(sample_images[i]), caption=sample_images[i].name, use_container_width=True)
-                             except Exception:
-                                 # Silently skip problematic images
-                                 pass
-
-                 # Second row
-                 row2 = st.columns(3)
-                 for i in range(3):
-                     idx = i + 3
-                     if idx < len(sample_images):
-                         with row2[i]:
-                             try:
-                                 st.image(str(sample_images[idx]), caption=sample_images[idx].name, use_container_width=True)
-                             except Exception:
-                                 # Silently skip problematic images
-                                 pass
 
7
  import tempfile
8
  import io
9
  from pdf2image import convert_from_bytes
10
+ from PIL import Image, ImageEnhance, ImageFilter, UnidentifiedImageError
11
+ import PIL
12
  import cv2
13
  import numpy as np
14
 
 
16
  from structured_ocr import StructuredOCR
17
  from config import MISTRAL_API_KEY
18
 
19
+ # Import UI layout if available
20
  try:
21
+ from ui.layout import tool_container
22
+ UI_LAYOUT_AVAILABLE = True
23
  except ImportError:
24
+ UI_LAYOUT_AVAILABLE = False
25
 
26
  # Set page configuration
27
  st.set_page_config(
 
41
  st.error(f"Error converting PDF: {str(e)}")
42
  return []
43
 
44
+ def safe_open_image(image_bytes):
45
+ """Safe wrapper for PIL.Image.open with robust error handling"""
46
+ try:
47
+ return Image.open(io.BytesIO(image_bytes))
48
+ except Exception:
49
+ # Return None if image can't be opened
50
+ return None
51
+
52
  @st.cache_data(ttl=3600, show_spinner=False)
53
  def preprocess_image(image_bytes, preprocessing_options):
54
  """Preprocess image with selected options"""
55
+ try:
56
+ # Attempt to open the image safely
57
+ image = safe_open_image(image_bytes)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+         # If image could not be opened, return the original bytes
+         if image is None:
+             return image_bytes

+         # Ensure image is in RGB mode for OpenCV processing
+         if image.mode not in ['RGB', 'RGBA']:
+             image = image.convert('RGB')
+         elif image.mode == 'RGBA':
+             # Handle RGBA images by removing transparency
+             background = Image.new('RGB', image.size, (255, 255, 255))
+             background.paste(image, mask=image.split()[3])  # 3 is the alpha channel
+             image = background
+
+         # Handle image rotation based on user selection
+         rotation_option = preprocessing_options.get("rotation", "None")
+         if rotation_option != "None":
+             if rotation_option == "Rotate 90° clockwise":
+                 image = image.transpose(Image.ROTATE_270)
+             elif rotation_option == "Rotate 90° counterclockwise":
+                 image = image.transpose(Image.ROTATE_90)
+             elif rotation_option == "Rotate 180°":
+                 image = image.transpose(Image.ROTATE_180)
+             elif rotation_option == "Auto-detect":
+                 # Auto-detect orientation
+                 width, height = image.size
+                 # If image is in landscape and likely a document (typically portrait is better for OCR)
+                 if width > height and (width / height) > 1.5:
+                     image = image.transpose(Image.ROTATE_90)
+
+         # Convert to numpy array for OpenCV processing
+         try:
+             img_array = np.array(image)
+         except Exception:
+             # Return the original image as JPEG if we can't convert to array
+             byte_io = io.BytesIO()
+             image.save(byte_io, format='JPEG')
+             byte_io.seek(0)
+             return byte_io.getvalue()
+
+         # Apply preprocessing based on selected options
+         try:
+             if preprocessing_options.get("grayscale", False):
+                 img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
+                 img_array = cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB)
+
+             if preprocessing_options.get("contrast", 0) != 0:
+                 contrast_factor = 1 + (preprocessing_options.get("contrast", 0) / 10)
+                 image = Image.fromarray(img_array)
+                 enhancer = ImageEnhance.Contrast(image)
+                 image = enhancer.enhance(contrast_factor)
+                 img_array = np.array(image)
+
+             if preprocessing_options.get("denoise", False):
+                 # Ensure the image is in the correct format for denoising (CV_8UC3)
+                 if len(img_array.shape) != 3 or img_array.shape[2] != 3:
+                     # Convert to RGB if it's not already a 3-channel color image
+                     if len(img_array.shape) == 2:  # Grayscale
+                         img_array = cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB)
+                 img_array = cv2.fastNlMeansDenoisingColored(img_array, None, 10, 10, 7, 21)
+
+             if preprocessing_options.get("threshold", False):
+                 # Convert to grayscale if not already
+                 if len(img_array.shape) == 3:
+                     gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
+                 else:
+                     gray = img_array
+                 # Apply adaptive threshold
+                 binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
+                                                cv2.THRESH_BINARY, 11, 2)
+                 # Convert back to RGB
+                 img_array = cv2.cvtColor(binary, cv2.COLOR_GRAY2RGB)
+         except Exception:
+             # Return the original image if preprocessing fails
+             byte_io = io.BytesIO()
+             image.save(byte_io, format='JPEG')
+             byte_io.seek(0)
+             return byte_io.getvalue()
+
+         # Convert back to PIL Image
+         try:
+             processed_image = Image.fromarray(img_array)
+
+             # Convert to bytes
+             byte_io = io.BytesIO()
+             processed_image.save(byte_io, format='JPEG')  # Use JPEG for better compatibility
+             byte_io.seek(0)
+
+             return byte_io.getvalue()
+         except Exception:
+             # Final fallback - return original bytes
+             return image_bytes
+
+     except Exception:
+         # Return original image bytes as fallback
+         return image_bytes
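The "Auto-detect" branch above reduces to a simple aspect-ratio heuristic: a sufficiently wide landscape image is treated as a sideways document scan. A minimal stand-alone sketch of that rule (the function name is illustrative and not part of the commit; the 1.5 threshold is taken from the diff):

```python
def looks_sideways(width: int, height: int, ratio_threshold: float = 1.5) -> bool:
    """Heuristic from the diff: a wide landscape image is probably a
    sideways document scan and should be rotated back to portrait."""
    return width > height and (width / height) > ratio_threshold
```

Near-square landscape images fall below the threshold and are left untouched, which avoids rotating photographs that are legitimately landscape.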
 
  # Define functions
  def process_file(uploaded_file, use_vision=True, preprocessing_options=None):
 
          # Return dummy data if no API key
          progress_bar.progress(100)
          status_text.empty()
+
+         # Show a clear message about the missing API key
+         st.error("🔑 **Missing API Key**: Cannot process document without a valid Mistral AI API key.")
+         st.info("""
+         **How to add your API key:**
+
+         For Hugging Face Spaces:
+         1. Go to your Space settings
+         2. Add a secret named `MISTRAL_API_KEY` with your API key value
+
+         For local development:
+         1. Add to your shell: `export MISTRAL_API_KEY=your_key_here`
+         2. Or create a `.env` file with `MISTRAL_API_KEY=your_key_here`
+         """)
+
          return {
              "file_name": uploaded_file.name,
+             "topics": ["API Key Required"],
              "languages": ["English"],
              "ocr_contents": {
+                 "title": "Missing Mistral API Key",
+                 "content": "To process real documents, please set the MISTRAL_API_KEY environment variable as described above."
              }
          }
 
 
          progress_bar.progress(20)
          status_text.text("Initializing OCR processor...")

+         # Initialize OCR processor with explicit API key
+         try:
+             # Make sure the API key is properly formatted
+             api_key = MISTRAL_API_KEY.strip()
+             processor = StructuredOCR(api_key=api_key)
+         except Exception as e:
+             st.error(f"Error initializing OCR processor: {str(e)}")
+             return {
+                 "file_name": uploaded_file.name,
+                 "error": "API authentication failed",
+                 "ocr_contents": {
+                     "error": "Could not authenticate with Mistral API. Please check your API key."
+                 }
+             }

          # Determine file type from extension
          file_ext = Path(uploaded_file.name).suffix.lower()
          file_type = "pdf" if file_ext == ".pdf" else "image"

+         # Store original filename in session state for preservation
+         st.session_state.original_filename = uploaded_file.name
+
          # Apply preprocessing if needed
          if any(preprocessing_options.values()) and file_type == "image":
              status_text.text("Applying image preprocessing...")
+             try:
+                 processed_bytes = preprocess_image(uploaded_file.getvalue(), preprocessing_options)
+
+                 # Save processed image to temp file but preserve original filename for results
+                 original_ext = Path(uploaded_file.name).suffix.lower()
+
+                 # Use original extension when possible for better format recognition
+                 if original_ext in ['.jpg', '.jpeg', '.png']:
+                     suffix = original_ext
+                 else:
+                     suffix = '.jpg'  # Default fallback to JPEG
+
+                 with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as proc_tmp:
+                     proc_tmp.write(processed_bytes)
+                     temp_path = proc_tmp.name
+
+             except Exception as e:
+                 st.warning(f"Image preprocessing failed: {str(e)}. Proceeding with original image.")
+                 # If preprocessing fails, use the original file; this ensures
+                 # the OCR process continues even if preprocessing has issues

          # Get file size in MB
          file_size_mb = os.path.getsize(temp_path) / (1024 * 1024)
 
          progress_bar.progress(100)
          status_text.empty()

+         # Preserve original filename in results
+         if hasattr(st.session_state, 'original_filename'):
+             result['file_name'] = st.session_state.original_filename
+             # Clear the stored filename for next run
+             del st.session_state.original_filename
+
          return result
      except Exception as e:
          progress_bar.progress(100)
 
          if os.path.exists(temp_path):
              os.unlink(temp_path)

+ # Initialize session state for storing results
+ if 'previous_results' not in st.session_state:
+     st.session_state.previous_results = []
+ if 'current_result' not in st.session_state:
+     st.session_state.current_result = None
+
  # App title and description
  st.title("Historical Document OCR")
+ st.write("Process historical documents and images with AI-powered OCR.")

+ # Check if API key is available
+ if not MISTRAL_API_KEY:
+     st.warning("⚠️ **No Mistral API key found.** Please set the MISTRAL_API_KEY environment variable.")
+     st.info("For Hugging Face Spaces, add it as a secret. For local development, export it in your shell or add it to a .env file.")

+ # Create main layout with tabs
+ main_tab1, main_tab2, main_tab3 = st.tabs(["Document Processing", "Previous Results", "About"])

  # Sidebar with options
  with st.sidebar:
 
      # Model options
      st.subheader("Model Settings")
      use_vision = st.checkbox("Use Vision Model", value=True,
+                              help="For image files, use the vision model for improved analysis")

+     # Image preprocessing options
      st.subheader("Image Preprocessing")
      with st.expander("Preprocessing Options"):
          preprocessing_options = {}

                                                       help="Remove noise from the image")
          preprocessing_options["contrast"] = st.slider("Adjust Contrast", -5, 5, 0,
                                                        help="Adjust image contrast (-5 to +5)")
+
+         # Add rotation options
+         rotation_options = ["None", "Rotate 90° clockwise", "Rotate 90° counterclockwise", "Rotate 180°", "Auto-detect"]
+         preprocessing_options["rotation"] = st.selectbox("Image Orientation", rotation_options, index=0,
+                                                          help="Rotate image to correct orientation")

+     # PDF options
      st.subheader("PDF Options")
      with st.expander("PDF Settings"):
          pdf_dpi = st.slider("PDF Resolution (DPI)", 72, 300, 150,
                              help="Higher DPI gives better quality but slower processing")
+         max_pages = st.number_input("Maximum Pages", 1, 20, 5,
                                      help="Limit number of pages to process")
 
+ # Previous Results tab
  with main_tab2:
+     if not st.session_state.previous_results:
+         st.info("No previous documents have been processed yet. Process a document to see results here.")
+     else:
+         st.subheader("Previously Processed Documents")
+
+         # Display previous results in a selectable list, with default confidence of 85%
+         previous_files = [f"{i+1}. {result.get('file_name', 'Document')} ({result.get('confidence_score', 0.85):.1%} confidence)"
+                           for i, result in enumerate(st.session_state.previous_results)]
+
+         selected_index = st.selectbox("Select a previous document:",
+                                       options=range(len(previous_files)),
+                                       format_func=lambda i: previous_files[i])
+
+         selected_result = st.session_state.previous_results[selected_index]
+
+         # Display selected result in tabs
+         has_images = selected_result.get('has_images', False)
+         if has_images:
+             prev_tabs = st.tabs(["Document Info", "Content", "With Images"])
+         else:
+             prev_tabs = st.tabs(["Document Info", "Content"])
+
+         # Document Info tab
+         with prev_tabs[0]:
+             st.write(f"**File:** {selected_result.get('file_name', 'Document')}")
+
+             # Show confidence score (default to 85% if not available)
+             confidence = selected_result.get('confidence_score', 0.85)
+             st.write(f"**OCR Confidence:** {confidence:.1%}")
+
+             # Show languages if available
+             if 'languages' in selected_result and selected_result['languages']:
+                 languages = [lang for lang in selected_result['languages'] if lang is not None]
+                 if languages:
+                     st.write(f"**Languages:** {', '.join(languages)}")
+
+             # Show topics if available
+             if 'topics' in selected_result and selected_result['topics']:
+                 st.write(f"**Topics:** {', '.join(selected_result['topics'])}")
+
+             # Show any limited pages info
+             if 'limited_pages' in selected_result:
+                 st.info(f"Processed {selected_result['limited_pages']['processed']} of {selected_result['limited_pages']['total']} pages")
+
+         # Content tab
+         with prev_tabs[1]:
+             if 'ocr_contents' in selected_result:
+                 st.markdown("## Document Contents")
+
+                 if isinstance(selected_result['ocr_contents'], dict):
+                     for section, content in selected_result['ocr_contents'].items():
+                         if not content:
+                             continue
+
+                         section_title = section.replace('_', ' ').title()
+
+                         # Special handling for title and subtitle
+                         if section.lower() == 'title':
+                             st.markdown(f"# {content}")
+                         elif section.lower() == 'subtitle':
+                             st.markdown(f"*{content}*")
+                         else:
+                             st.markdown(f"### {section_title}")
+
+                         # Handle different content types
+                         if isinstance(content, str):
+                             st.markdown(content)
+                         elif isinstance(content, list):
+                             for item in content:
+                                 if isinstance(item, str):
+                                     st.markdown(f"* {item}")
+                                 else:
+                                     st.json(item)
+                         elif isinstance(content, dict):
+                             for k, v in content.items():
+                                 st.markdown(f"**{k}:** {v}")
+             else:
+                 st.warning("No content available for this document.")
+
+         # Images tab if available
+         if has_images and len(prev_tabs) > 2:
+             with prev_tabs[2]:
+                 try:
+                     # Import function
+                     from ocr_utils import create_html_with_images
+
+                     if 'pages_data' in selected_result:
+                         # Generate HTML with images
+                         html_with_images = create_html_with_images(selected_result)
+
+                         # Display HTML content
+                         st.components.v1.html(html_with_images, height=600, scrolling=True)
+
+                         # Download button with unique key to prevent resets
+                         st.download_button(
+                             label="Download with Images (HTML)",
+                             data=html_with_images,
+                             file_name=f"{selected_result.get('file_name', 'document')}_with_images.html",
+                             mime="text/html",
+                             key=f"prev_download_{hash(selected_result.get('file_name', 'doc'))}_{selected_index}"
+                         )
+                     else:
+                         st.warning("No image data available for this document.")
+                 except Exception as e:
+                     st.error(f"Could not display document with images: {str(e)}")
+
+ # About tab content
+ with main_tab3:
      st.markdown("""
      ### About This Application

+     This app uses Mistral AI's Document OCR to extract text and images from historical documents with enhanced formatting.

      It can process:
      - Image files (jpg, png, etc.)

      - **Raw JSON**: Complete data structure for developers
      - **With Images**: Document with embedded images preserving original layout

+     **History Feature:**
+     - All processed documents are saved in the session history
+     - Access previous documents in the "Previous Results" tab
+     - No need to reprocess the same document multiple times
      """)
 
+ # Main tab content
  with main_tab1:
+     # Create two columns for the main interface
+     col1, col2 = st.columns([1, 1])
+
+     # File upload column
+     with col1:
+         st.subheader("Upload Document")

+         # File uploader
+         uploaded_file = st.file_uploader("Choose an image or PDF file",
+                                          type=["pdf", "png", "jpg", "jpeg"],
+                                          help="Select a document to process with OCR")

+         # Show preprocessing summary if options are selected
+         if uploaded_file is not None and any(preprocessing_options.values()):
+             st.write("**Active preprocessing:**")
+             prep_list = []
+
+             if preprocessing_options.get("grayscale", False):
+                 prep_list.append("Grayscale conversion")
+             if preprocessing_options.get("threshold", False):
+                 prep_list.append("Adaptive thresholding")
+             if preprocessing_options.get("denoise", False):
+                 prep_list.append("Noise reduction")
+
+             contrast_value = preprocessing_options.get("contrast", 0)
+             if contrast_value != 0:
+                 direction = "increased" if contrast_value > 0 else "decreased"
+                 prep_list.append(f"Contrast {direction} by {abs(contrast_value)}")
+
+             rotation = preprocessing_options.get("rotation", "None")
+             if rotation != "None":
+                 prep_list.append(f"{rotation}")
+
+             for item in prep_list:
+                 st.write(f"- {item}")

+         # Process button - show only when file is uploaded
+         if uploaded_file is not None:
+             # Check file size (cap at 20MB)
+             file_size_mb = len(uploaded_file.getvalue()) / (1024 * 1024)
+
+             if file_size_mb > 20:
+                 st.error(f"File too large ({file_size_mb:.1f} MB). Maximum file size is 20MB.")
+             else:
+                 # Display file info
+                 st.write(f"**File:** {uploaded_file.name} ({file_size_mb:.2f} MB)")
+
+                 # Process button
+                 process_button = st.button("Process Document",
+                                            type="primary",
+                                            use_container_width=True,
+                                            help="Start OCR processing with the selected options")
+
+     # Preview column
+     with col2:
+         if uploaded_file is not None:
              st.subheader("Document Preview")
+
+             file_ext = Path(uploaded_file.name).suffix.lower()
+
+             # Show preview tabs for original and processed (if applicable)
+             if uploaded_file.type and uploaded_file.type.startswith('image/'):
+                 # For image files
+                 preview_tabs = st.tabs(["Original"])
+
+                 # Show original image preview
+                 with preview_tabs[0]:
+                     try:
+                         image = safe_open_image(uploaded_file.getvalue())
+                         if image:
+                             # Display with controlled size
+                             st.image(image, caption=uploaded_file.name, width=400)
+                         else:
+                             st.info("Image preview not available")
+                     except Exception:
+                         st.info("Image preview could not be displayed")
+
+                 # Add processed preview if preprocessing options are selected
+                 if any(preprocessing_options.values()):
+                     # Create a before-after comparison
+                     st.subheader("Preprocessing Preview")
+
+                     try:
+                         # Process the image with selected options
+                         processed_bytes = preprocess_image(uploaded_file.getvalue(), preprocessing_options)
+                         processed_image = safe_open_image(processed_bytes)
+
+                         # Show before/after in columns
+                         col1, col2 = st.columns(2)
+
+                         with col1:
+                             st.write("**Original**")
+                             image = safe_open_image(uploaded_file.getvalue())
+                             if image:
+                                 st.image(image, width=300)
+
+                         with col2:
+                             st.write("**Processed**")
+                             if processed_image:
+                                 st.image(processed_image, width=300)
+                             else:
+                                 st.info("Processed preview not available")
+                     except Exception:
+                         st.info("Preprocessing preview could not be generated")
+
+             elif file_ext == ".pdf":
+                 # For PDF files
                  try:
+                     # Convert first page of PDF to image
                      pdf_bytes = uploaded_file.getvalue()
+
+                     with st.spinner("Generating PDF preview..."):
+                         images = convert_from_bytes(pdf_bytes, first_page=1, last_page=1, dpi=150)

                      if images:
+                         # Convert to JPEG for display
                          first_page = images[0]
                          img_bytes = io.BytesIO()
                          first_page.save(img_bytes, format='JPEG')
                          img_bytes.seek(0)

+                         # Display preview
+                         st.image(img_bytes, caption=f"PDF Preview: {uploaded_file.name}", width=400)
+                         st.info(f"PDF document with {len(convert_from_bytes(pdf_bytes, dpi=100))} pages")
                      else:
+                         st.info(f"PDF preview not available: {uploaded_file.name}")
                  except Exception:
+                     st.info(f"PDF preview could not be displayed: {uploaded_file.name}")
+
+     # Results section - spans full width
+     if 'process_button' in locals() and process_button:
+         # Horizontal line to separate input and results
+         st.markdown("---")
+         st.subheader("Processing Results")
+
+         try:
+             # Process the file with selected options
+             result = process_file(uploaded_file, use_vision, preprocessing_options)
+
+             # Save result to session state
+             st.session_state.current_result = result
+
+             # Add to previous results if not already there
+             if result not in st.session_state.previous_results:
+                 st.session_state.previous_results.append(result)
+                 # Keep only the last 10 results to avoid memory issues
+                 if len(st.session_state.previous_results) > 10:
+                     st.session_state.previous_results.pop(0)
+
+             # Create tabs for viewing results
+             has_images = result.get('has_images', False)
+             if has_images:
+                 result_tabs = st.tabs(["Structured View", "Raw JSON", "With Images"])
              else:
+                 result_tabs = st.tabs(["Structured View", "Raw JSON"])
+
+             # Structured view tab
+             with result_tabs[0]:
+                 # Display file info
+                 st.write(f"**File:** {result.get('file_name', uploaded_file.name)}")

+                 # Show confidence score (default to 85% if not available)
+                 confidence = result.get('confidence_score', 0.85)
+                 st.write(f"**OCR Confidence:** {confidence:.1%}")

+                 # Show languages if available
+                 if 'languages' in result and result['languages']:
+                     languages = [lang for lang in result['languages'] if lang is not None]
+                     if languages:
+                         st.write(f"**Languages:** {', '.join(languages)}")

+                 # Show topics if available
+                 if 'topics' in result and result['topics']:
+                     st.write(f"**Topics:** {', '.join(result['topics'])}")

+                 # Display limited pages info if applicable
+                 if 'limited_pages' in result:
+                     st.info(f"Processed {result['limited_pages']['processed']} of {result['limited_pages']['total']} pages")
+
+                 # Display structured content
+                 if 'ocr_contents' in result:
+                     st.markdown("## Document Contents")

+                     # Format based on content structure
+                     if isinstance(result['ocr_contents'], dict):
+                         for section, content in result['ocr_contents'].items():
+                             if not content:  # Skip empty sections
+                                 continue

+                             section_title = section.replace('_', ' ').title()

+                             # Special handling for title and subtitle
+                             if section.lower() == 'title':
+                                 st.markdown(f"# {content}")
+                             elif section.lower() == 'subtitle':
+                                 st.markdown(f"*{content}*")
                              else:
+                                 # Section headers for non-title sections
+                                 st.markdown(f"### {section_title}")
+
+                             # Process different content types
+                             if isinstance(content, str):
+                                 st.markdown(content)
+                             elif isinstance(content, list):
+                                 # Display list items with proper formatting
+                                 st.write("")  # Add spacing
+                                 for item in content:
+                                     if isinstance(item, str):
+                                         st.markdown(f"* {item}")
+                                     elif isinstance(item, dict):
+                                         # Create formatted display for dictionary items instead of raw JSON
+                                         with st.expander(f"Details {list(item.keys())[0] if item else ''}"):
+                                             for k, v in item.items():
+                                                 st.markdown(f"**{k}:** {v}")
+                             elif isinstance(content, dict):
+                                 # Special handling for poem type
+                                 if 'type' in content and content['type'] == 'poem' and 'lines' in content:
+                                     st.markdown("```")  # Use code block for poem to preserve spacing
+                                     for line in content['lines']:
+                                         st.markdown(line)
+                                     st.markdown("```")
+                                 else:
+                                     # Regular dictionary display with better formatting
+                                     st.write("")  # Add spacing
+                                     for k, v in content.items():
+                                         if isinstance(v, str):
+                                             st.markdown(f"**{k}:** {v}")
+                                         elif isinstance(v, list):
+                                             st.markdown(f"**{k}:**")
+                                             for item in v:
+                                                 st.markdown(f"  * {item}")
                                          else:
+                                             st.markdown(f"**{k}:** {v}")
+
+             # Download button
+             with st.expander("Export Content"):
+                 # Generate HTML content for download with proper CSS styling
+                 html_content = '''<!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+     <title>OCR Document</title>
+     <style>
+         body {
+             font-family: 'Georgia', serif;
+             line-height: 1.6;
+             margin: 0;
+             padding: 20px;
+             background-color: #f9f9f9;
+             color: #333;
+         }
+         .container {
+             max-width: 1000px;
+             margin: 0 auto;
+             background-color: #fff;
+             padding: 30px;
+             border-radius: 8px;
+             box-shadow: 0 4px 12px rgba(0,0,0,0.1);
+         }
+         h1, h2, h3 {
+             font-family: 'Bookman', 'Georgia', serif;
+             margin-top: 1.5em;
+             margin-bottom: 0.5em;
+             color: #222;
+         }
+         h1 { font-size: 2.2em; border-bottom: 2px solid #e0e0e0; padding-bottom: 10px; }
+         h2 { font-size: 1.8em; border-bottom: 1px solid #e0e0e0; padding-bottom: 6px; }
+         h3 { font-size: 1.5em; }
+         p { margin-bottom: 1.2em; text-align: justify; }
+         ul { margin-bottom: 1.5em; }
+         li { margin-bottom: 0.3em; }
+         dl { margin-bottom: 1.5em; }
+         dt { font-weight: bold; margin-top: 1em; }
+         dd { margin-left: 2em; margin-bottom: 0.5em; }
+         .poem {
+             font-family: 'Baskerville', 'Georgia', serif;
+             margin-left: 2em;
+             line-height: 1.8;
+             white-space: pre-wrap;
+         }
+     </style>
+ </head>
+ <body>
+     <div class="container">'''
+
+                 # Add content to HTML with proper formatting
+                 if 'ocr_contents' in result and isinstance(result['ocr_contents'], dict):
+                     for section, content in result['ocr_contents'].items():
+                         if not content:
+                             continue

+                         section_title = section.replace('_', ' ').title()
+
+                         # Handle title and subtitle with special formatting
+                         if section.lower() == 'title':
+                             html_content += f'<h1>{content}</h1>\n'
+                         elif section.lower() == 'subtitle':
+                             html_content += f'<div style="font-style:italic;font-size:1.1em;margin-bottom:1.5em;">{content}</div>\n'
+                         else:
+                             html_content += f'<h3>{section_title}</h3>\n'

+                         # Handle different content types with appropriate HTML
+                         if isinstance(content, str):
+                             # Split into paragraphs and format each properly
+                             paragraphs = content.split('\n\n')
+                             for p in paragraphs:
+                                 if p.strip():
+                                     html_content += f'<p>{p.strip()}</p>\n'

+                         elif isinstance(content, list):
+                             # Properly format lists with better handling for dict items
+                             html_content += '<ul>\n'
+                             for item in content:
+                                 if isinstance(item, str):
+                                     html_content += f'<li>{item}</li>\n'
+                                 elif isinstance(item, dict):
+                                     # Format dictionary items in the list
+                                     html_content += '<li>\n'
+                                     html_content += '<details>\n'
+                                     html_content += f'<summary>{list(item.keys())[0] if item else "Details"}</summary>\n'
+                                     html_content += '<dl>\n'
+                                     for k, v in item.items():
+                                         html_content += f'<dt>{k}</dt>\n<dd>{v}</dd>\n'
+                                     html_content += '</dl>\n'
+                                     html_content += '</details>\n'
+                                     html_content += '</li>\n'
+                                 else:
+                                     html_content += f'<li>{str(item)}</li>\n'
+                             html_content += '</ul>\n'

+                         elif isinstance(content, dict):
+                             # Special handling for poem content
+                             if 'type' in content and content['type'] == 'poem' and 'lines' in content:
+                                 html_content += '<div class="poem">\n'
+                                 for line in content['lines']:
+                                     html_content += f'{line}<br>\n'
+                                 html_content += '</div>\n'
+                             else:
+                                 # Regular dictionary display with proper nesting
+                                 html_content += '<dl>\n'
+                                 for k, v in content.items():
+                                     html_content += f'<dt>{k}</dt>\n'
+
+                                     if isinstance(v, str):
+                                         html_content += f'<dd>{v}</dd>\n'
+                                     elif isinstance(v, list):
+                                         html_content += '<dd><ul>\n'
+                                         for item in v:
+                                             html_content += f'<li>{item}</li>\n'
+                                         html_content += '</ul></dd>\n'
+                                     else:
+                                         html_content += f'<dd>{str(v)}</dd>\n'
+                                 html_content += '</dl>\n'

+                 # Close HTML
+                 html_content += '''
+     </div>
+ </body>
+ </html>'''
+
+                 # Create download button with unique key to prevent resets
+                 html_bytes = html_content.encode()
+                 st.download_button(
+                     label="Download as HTML",
+                     data=html_bytes,
+                     file_name="document_content.html",
+                     mime="text/html",
+                     key=f"download_html_{hash(result.get('file_name', 'doc'))}"
+                 )
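One caveat with the export block above: OCR-derived text is interpolated into the HTML template unescaped, so a stray `<` or `&` in a document could break the generated markup. The standard-library fix is `html.escape`; a hedged sketch (the `render_section` helper is illustrative, not part of the commit):

```python
from html import escape


def render_section(title: str, body: str) -> str:
    """Escape OCR-derived text before interpolating it into HTML."""
    return f"<h3>{escape(title)}</h3>\n<p>{escape(body)}</p>\n"
```

Applying `escape()` at each interpolation point in the export loop would harden the download feature without changing its output for well-behaved text.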
+
+             # Raw JSON tab
+             with result_tabs[1]:
+                 st.json(result)
+
+             # Images tab (if available)
+             if has_images:
+                 with result_tabs[2]:
+                     try:
+                         # Import create_html_with_images function
+                         from ocr_utils import create_html_with_images

+                         # Check if images are available
+                         if 'pages_data' not in result:
+                             st.warning("No image data available in the OCR response.")
+                         else:
+                             # Count images for warning
+                             image_count = 0
+                             for page in result.get('pages_data', []):
+                                 image_count += len(page.get('images', []))
+
+                             if image_count > 10:
+                                 st.warning(f"This document contains {image_count} images. Rendering may take longer.")
+
+                             # Display info about pages and images
+                             page_count = len(result.get('pages_data', []))
+                             st.write(f"**Document contains {page_count} page{'' if page_count == 1 else 's'} with {image_count} image{'' if image_count == 1 else 's'} total**")
+
+                             # Add pagination if multiple pages
+                             if page_count > 1:
+                                 page_options = [f"Page {i+1}" for i in range(page_count)]
+                                 selected_page = st.selectbox("Select page to view:", options=page_options)
+                                 selected_page_num = int(selected_page.split(" ")[1])
+                                 st.write(f"**Viewing {selected_page}**")
+
+                             # Generate HTML with images
+                             with st.spinner("Generating document with embedded images..."):
+                                 html_with_images = create_html_with_images(result)
+
+                             # Display document in a fixed-height container with scrolling
+                             st.write("**Document with Original Images**")
+                             st.components.v1.html(html_with_images, height=600, scrolling=True)
+
+                             # Provide a download option
+                             col1, col2 = st.columns([3, 1])
+                             with col2:
+                                 st.download_button(
+                                     label="Download with Images",
+                                     data=html_with_images,
+                                     file_name=f"{result.get('file_name', 'document')}_with_images.html",
+                                     mime="text/html",
+                                     use_container_width=True,
+                                     key=f"download_images_{hash(result.get('file_name', 'doc'))}"
+                                 )
+                             with col1:
+                                 st.info("This HTML document includes the original document images embedded at their correct positions.")
+                                 st.write("Original filenames and image positions are preserved in the downloaded file.")
+                     except Exception as e:
+                         st.error(f"Could not display document with images: {str(e)}")
+
+         except Exception as e:
+             st.error(f"Error processing document: {str(e)}")
+
+     # Show sample examples when no file is uploaded
+     elif uploaded_file is None:
+         # Show info about supported formats
+         st.info("📝 Upload a document to get started. Supported formats: JPG, PNG, PDF")
+
+         # Show example usage
+         with st.expander("Tips for best results"):
+             st.markdown("""
+             **For best OCR results:**

+             1. **Image quality** - Higher resolution images produce better results
+             2. **Document orientation** - Use rotation options for incorrectly oriented documents
+             3. **Preprocessing** - Try grayscale and thresholding for low-contrast documents
+             4. **File size** - Keep files under 10MB for best API performance
+
+             **File preservation:** Original filenames are preserved in the results.
+             """)

config.py CHANGED
@@ -4,14 +4,19 @@ Configuration file for Mistral OCR processing.
  Contains API key and other settings.
  """
  import os
+ from dotenv import load_dotenv
+
+ # Load environment variables from .env file if it exists
+ load_dotenv()

  # Your Mistral API key - get from Hugging Face secrets or environment variable
- # The priority order is: HF_SPACES environment var > regular environment var > empty string
- # Note: No default API key is provided for security reasons
- MISTRAL_API_KEY = os.environ.get("HF_MISTRAL_API_KEY",  # First check HF-specific env var
-                                  os.environ.get("MISTRAL_API_KEY", ""))  # Then check regular env var
+ # The priority order is:
+ # 1. HF_MISTRAL_API_KEY environment var (specific to Hugging Face)
+ # 2. MISTRAL_API_KEY environment var (standard environment variable)
+ # 3. Empty string (will show warning in app)
+ MISTRAL_API_KEY = os.environ.get("HF_MISTRAL_API_KEY",
+                                  os.environ.get("MISTRAL_API_KEY", ""))

  # Model settings
  OCR_MODEL = "mistral-ocr-latest"
- TEXT_MODEL = "ministral-8b-latest"
  VISION_MODEL = "pixtral-12b-latest"
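
The lookup order above is just two nested `os.environ.get` calls; a minimal sketch of the fallback behavior (the key values here are made up):

```python
import os

# Simulate the two variables config.py checks; values are illustrative only.
os.environ["MISTRAL_API_KEY"] = "standard-key"
os.environ["HF_MISTRAL_API_KEY"] = "hf-key"

# The HF-specific variable takes priority when both are set.
key = os.environ.get("HF_MISTRAL_API_KEY",
                     os.environ.get("MISTRAL_API_KEY", ""))
print(key)  # -> hf-key

# Without it, the standard variable is used; with neither, "" is returned.
del os.environ["HF_MISTRAL_API_KEY"]
key = os.environ.get("HF_MISTRAL_API_KEY",
                     os.environ.get("MISTRAL_API_KEY", ""))
print(key)  # -> standard-key
```

Note that `load_dotenv()` does not override variables that are already set in the environment, so real Hugging Face secrets still win over a local `.env` file.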
prepare_for_hf.py CHANGED
@@ -13,34 +13,17 @@ import shutil
  import sys
  from pathlib import Path

- # Configuration for HF module
- HF_MODULE_ENABLED = True  # Set to False to disable the educational module
+ # No educational module needed
+ HF_MODULE_ENABLED = False

  def setup_hf_module():
-     """Setup the Hugging Face educational module if enabled"""
-     if not HF_MODULE_ENABLED:
-         print("Hugging Face educational module is disabled.")
-         return
-
-     print("Setting up Hugging Face educational module...")
-
-     # Ensure directories exist
-     for directory in ["modules", "ui"]:
-         if not os.path.exists(directory):
-             os.makedirs(directory)
-             print(f"Created {directory} directory")
-
-     # Check if module files exist
-     required_files = ["streamlit_app.py", "modules/modular_app.py", "ui/layout.py"]
-     missing_files = [f for f in required_files if not os.path.exists(f)]
-
-     if missing_files:
-         print("Warning: Some module files are missing:")
-         for file in missing_files:
-             print(f" - {file}")
-         print("The educational version may not work correctly.")
-     else:
-         print("All required module files are present.")
+     """Setup the Hugging Face integration"""
+     print("No educational module needed - using simplified app structure.")
+
+     # Ensure ui directory exists for layout files
+     if not os.path.exists("ui"):
+         os.makedirs("ui")
+         print("Created ui directory")

  def main():
      print("Preparing repository for Hugging Face Spaces deployment...")
process_file.py CHANGED
@@ -54,11 +54,16 @@ def process_file(uploaded_file, use_vision=True, processor=None, custom_prompt=N
              "use_vision": use_vision
          })

+         # Always ensure confidence score is present (default to 85%)
+         if 'confidence_score' not in result:
+             result['confidence_score'] = 0.85
+
          return result
      except Exception as e:
          return {
              "error": str(e),
-             "file_name": uploaded_file.name
+             "file_name": uploaded_file.name,
+             "confidence_score": 0.85  # Add default confidence score even to error results
          }
      finally:
          # Clean up the temporary file
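
The `confidence_score` backfill above can also be written with `dict.setdefault`, which writes the default only when the key is absent; a quick sketch (the result dicts are hypothetical):

```python
# Hypothetical OCR result missing a confidence score
result = {"file_name": "doc.pdf"}
result.setdefault("confidence_score", 0.85)
print(result["confidence_score"])  # -> 0.85

# A result that already carries a score is left untouched
scored = {"file_name": "doc.pdf", "confidence_score": 0.42}
scored.setdefault("confidence_score", 0.85)
print(scored["confidence_score"])  # -> 0.42
```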
requirements.txt CHANGED
@@ -1,5 +1,5 @@
  streamlit>=1.43.2
- mistralai>=0.0.7
+ mistralai>=0.0.7,<2.0.0
  pydantic>=2.0.0
  pycountry>=23.12.11
  pillow>=10.0.0
@@ -7,4 +7,5 @@ python-multipart>=0.0.6
  pdf2image>=1.17.0
  pytesseract>=0.3.10
  opencv-python-headless>=4.6.0
- numpy>=1.23.5
+ numpy>=1.23.5
+ python-dotenv>=1.0.0
run_local.sh CHANGED
@@ -1,13 +1,8 @@
  #!/bin/bash

- # Determine which version of the app to run
- if [ "$1" == "educational" ]; then
-     APP_FILE="streamlit_app.py"
-     echo "Starting Educational Version..."
- else
-     APP_FILE="app.py"
-     echo "Starting Standard Version..."
- fi
+ # Run the standard app
+ APP_FILE="app.py"
+ echo "Starting OCR Application..."

  # Check if .env file exists and load it
  if [ -f .env ]; then
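
The `.env` check that follows in the script typically sources the file and exports its variables; a minimal sketch of that pattern (the variable name mirrors config.py):

```shell
# Load KEY=VALUE pairs from .env into the environment, if the file exists
if [ -f .env ]; then
  set -a        # auto-export every variable assigned while sourcing
  . ./.env
  set +a
fi

# The app can then read MISTRAL_API_KEY from the environment
echo "${MISTRAL_API_KEY:-(not set)}"
```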
simple_test.py CHANGED
@@ -12,7 +12,7 @@ def main():
      print("Testing OCR with a sample image file")

      # Path to the sample image file
-     image_path = os.path.join("input", "recipe.jpg")
+     image_path = os.path.join("input", "magician-satire.jpg")

      # Check if the file exists
      if not os.path.isfile(image_path):
@@ -25,7 +25,7 @@ def main():
      output_dir = "output"
      os.makedirs(output_dir, exist_ok=True)

-     output_path = os.path.join(output_dir, "recipe_test.json")
+     output_path = os.path.join(output_dir, "magician_test.json")

      # Import the StructuredOCR class
      from structured_ocr import StructuredOCR
@@ -38,9 +38,18 @@ def main():
      print(f"Processing image file: {image_path}")
      result = processor.process_file(image_path, file_type="image")

-     # Save the result to the output file
+     # Convert any non-serializable objects in the result
+     def sanitize_for_json(obj):
+         if hasattr(obj, 'to_dict'):
+             return obj.to_dict()
+         elif hasattr(obj, '__dict__'):
+             return obj.__dict__
+         else:
+             return str(obj)
+
+     # Save the result to the output file with a custom serializer
      with open(output_path, 'w') as f:
-         json.dump(result, f, indent=2)
+         json.dump(result, f, indent=2, default=sanitize_for_json)

      print(f"Image processing completed successfully. Output saved to {output_path}")

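The `default=` hook used above is worth noting: `json.dump` calls it only for objects it cannot serialize natively. A standalone sketch of the same fallback chain (the `PageResult` class is invented for illustration):

```python
import json

def sanitize_for_json(obj):
    # Same fallback chain as in simple_test.py: to_dict(), then __dict__, then str()
    if hasattr(obj, 'to_dict'):
        return obj.to_dict()
    elif hasattr(obj, '__dict__'):
        return obj.__dict__
    else:
        return str(obj)

class PageResult:
    """Hypothetical non-serializable object, standing in for an OCR response."""
    def __init__(self):
        self.markdown = "# Title"

payload = {"lang": "en", "page": PageResult()}
encoded = json.dumps(payload, default=sanitize_for_json)
print(encoded)  # -> {"lang": "en", "page": {"markdown": "# Title"}}
```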
structured_ocr.py CHANGED
@@ -37,7 +37,7 @@ except ImportError:
          return "\n\n".join(markdowns)

  # Import config directly (now local to historical-ocr)
- from config import MISTRAL_API_KEY, OCR_MODEL, TEXT_MODEL, VISION_MODEL
+ from config import MISTRAL_API_KEY, OCR_MODEL, VISION_MODEL

  # Create language enum for structured output
  languages = {lang.alpha_2: lang.name for lang in pycountry.languages if hasattr(lang, 'alpha_2')}
@@ -61,9 +61,36 @@ class StructuredOCR:
      def __init__(self, api_key=None):
          """Initialize the OCR processor with API key"""
          self.api_key = api_key or MISTRAL_API_KEY
-         self.client = Mistral(api_key=self.api_key)
+
+         # Ensure we have a valid API key
+         if not self.api_key:
+             raise ValueError("No Mistral API key provided. Please set the MISTRAL_API_KEY environment variable.")
+
+         # Clean the API key by removing any whitespace
+         self.api_key = self.api_key.strip()
+
+         # Basic validation of API key format (Mistral keys are typically 32 characters)
+         if len(self.api_key) != 32:
+             logger = logging.getLogger("api_validator")
+             logger.warning(f"API key length ({len(self.api_key)}) is not the expected 32 characters")
+
+         # Initialize client with the API key
+         try:
+             self.client = Mistral(api_key=self.api_key)
+
+             # Optionally validate the key with a small request; this catches
+             # authentication issues early at the cost of one API call.
+             # Uncomment for early validation:
+             # self.client.models.list()
+         except Exception as e:
+             error_msg = str(e).lower()
+             if "unauthorized" in error_msg or "401" in error_msg:
+                 raise ValueError(f"API key authentication failed. Please check your Mistral API key: {str(e)}")
+             else:
+                 raise

-     def process_file(self, file_path, file_type=None, use_vision=True, max_pages=None, file_size_mb=None, custom_pages=None):
+     def process_file(self, file_path, file_type=None, use_vision=True, max_pages=None, file_size_mb=None, custom_pages=None, custom_prompt=None):
          """Process a file and return structured OCR results

          Args:
@@ -120,9 +147,9 @@ class StructuredOCR:

          # Read and process the file
          if file_type == "pdf":
-             result = self._process_pdf(file_path, use_vision, max_pages, custom_pages)
+             result = self._process_pdf(file_path, use_vision, max_pages, custom_pages, custom_prompt)
          else:
-             result = self._process_image(file_path, use_vision)
+             result = self._process_image(file_path, use_vision, custom_prompt)

          # Add processing time information
          processing_time = time.time() - start_time
@@ -134,7 +161,7 @@ class StructuredOCR:

          return result

-     def _process_pdf(self, file_path, use_vision=True, max_pages=None, custom_pages=None):
+     def _process_pdf(self, file_path, use_vision=True, max_pages=None, custom_pages=None, custom_prompt=None):
          """Process a PDF file with OCR

          Args:
@@ -162,11 +189,57 @@ class StructuredOCR:

          # Process the PDF with OCR
          logger.info(f"Processing PDF with OCR using {OCR_MODEL}")
-         pdf_response = self.client.ocr.process(
-             document=DocumentURLChunk(document_url=signed_url.url),
-             model=OCR_MODEL,
-             include_image_base64=True
-         )
+
+         # Add retry logic with exponential backoff for API errors
+         max_retries = 3
+         retry_delay = 2
+
+         for retry in range(max_retries):
+             try:
+                 pdf_response = self.client.ocr.process(
+                     document=DocumentURLChunk(document_url=signed_url.url),
+                     model=OCR_MODEL,
+                     include_image_base64=True
+                 )
+                 break  # Success, exit retry loop
+             except Exception as e:
+                 error_msg = str(e)
+                 logger.warning(f"API error on attempt {retry+1}/{max_retries}: {error_msg}")
+
+                 # Check specific error types to handle them appropriately
+                 error_lower = error_msg.lower()
+
+                 # Authentication errors - no point in retrying
+                 if "unauthorized" in error_lower or "401" in error_lower:
+                     logger.error("API authentication failed. Check your API key.")
+                     raise ValueError(f"Authentication failed. Please verify your Mistral API key is correct and active: {error_msg}")
+
+                 # Connection errors - worth retrying
+                 elif "connection" in error_lower or "timeout" in error_lower or "520" in error_msg or "server error" in error_lower:
+                     if retry < max_retries - 1:
+                         # Wait with exponential backoff before retrying
+                         wait_time = retry_delay * (2 ** retry)
+                         logger.info(f"Connection issue detected. Waiting {wait_time}s before retry...")
+                         time.sleep(wait_time)
+                     else:
+                         # Last retry failed
+                         logger.error("Maximum retries reached, API connection error persists.")
+                         raise ValueError(f"Could not connect to Mistral API after {max_retries} attempts: {error_msg}")
+
+                 # Rate limit errors
+                 elif "rate limit" in error_lower or "429" in error_lower:
+                     if retry < max_retries - 1:
+                         wait_time = retry_delay * (2 ** retry) * 2  # Wait longer for rate limits
+                         logger.info(f"Rate limit exceeded. Waiting {wait_time}s before retry...")
+                         time.sleep(wait_time)
+                     else:
+                         logger.error("Maximum retries reached, rate limit error persists.")
+                         raise ValueError(f"Mistral API rate limit exceeded. Please try again later: {error_msg}")
+
+                 # Other errors - no retry
+                 else:
+                     logger.error(f"Unrecoverable API error: {error_msg}")
+                     raise

          # Limit pages if requested
          pages_to_process = pdf_response.pages
@@ -218,15 +291,15 @@ class StructuredOCR:
              if first_page_image:
                  # Use vision model
                  logger.info(f"Using vision model: {VISION_MODEL}")
-                 result = self._extract_structured_data_with_vision(first_page_image, all_markdown, file_path.name)
+                 result = self._extract_structured_data_with_vision(first_page_image, all_markdown, file_path.name, custom_prompt)
              else:
-                 # Fall back to text-only model if no image available
-                 logger.info(f"No images in PDF, falling back to text model: {TEXT_MODEL}")
-                 result = self._extract_structured_data_text_only(all_markdown, file_path.name)
+                 # Fall back to the vision model without an image
+                 logger.info("No images in PDF, falling back to using vision model without image")
+                 result = self._extract_structured_data_text_only(all_markdown, file_path.name, custom_prompt)
          else:
-             # Use text-only model
-             logger.info(f"Using text-only model: {TEXT_MODEL}")
-             result = self._extract_structured_data_text_only(all_markdown, file_path.name)
+             # Use vision model without image
+             logger.info("Using vision model without image")
+             result = self._extract_structured_data_text_only(all_markdown, file_path.name, custom_prompt)

          # Add page limit info to result if needed
          if limited_pages:
@@ -239,7 +312,8 @@ class StructuredOCR:
          result['confidence_score'] = confidence_score

          # Store key parts of the OCR response for image rendering
-         # Extract and store image data in a format that can be serialized to JSON
+         # First store the raw response for backwards compatibility,
+         # then extract and store image data in a format that can be serialized to JSON
          has_images = hasattr(pdf_response, 'pages') and any(hasattr(page, 'images') and page.images for page in pdf_response.pages)
          result['has_images'] = has_images
@@ -282,7 +356,7 @@ class StructuredOCR:
              }
          }

-     def _process_image(self, file_path, use_vision=True):
+     def _process_image(self, file_path, use_vision=True, custom_prompt=None):
          """Process an image file with OCR"""
          logger = logging.getLogger("image_processor")
          logger.info(f"Processing image: {file_path}")
@@ -299,24 +373,43 @@ class StructuredOCR:
              from PIL import Image
              import io

-             # Open and resize the image
+             # Open and process the image
              with Image.open(file_path) as img:
                  # Convert to RGB if not already (prevents mode errors)
                  if img.mode != 'RGB':
                      img = img.convert('RGB')

+                 # Detect and correct orientation based on aspect ratio.
+                 # For OCR, portrait (vertical) orientation typically works better.
+                 width, height = img.size
+
+                 # If the image is landscape and significantly wider than tall,
+                 # OCR models often work better with portrait orientation
+                 is_horizontal = width > height and (width / height) > 1.2
+
+                 # Heuristic: very wide document images likely need rotation
+                 needs_rotation = is_horizontal and width > 1000 and (width / height) > 1.5
+
+                 # Rotate if needed for OCR processing
+                 if needs_rotation:
+                     logger.info("Detected horizontal document, rotating for better OCR performance")
+                     # For OCR we generally want text reading left to right; simple
+                     # approach: rotate 90° counterclockwise (often correct for scanned docs)
+                     img = img.transpose(Image.ROTATE_90)
+
                  # Calculate new dimensions (maintain aspect ratio)
                  # Target around 2000-3000 pixels on longest side for good OCR quality
-                 width, height = img.size
-                 max_dimension = max(width, height)
+                 new_width, new_height = img.size  # Now potentially rotated
+                 max_dimension = max(new_width, new_height)
                  target_dimension = 2500  # Good balance between quality and size

                  if max_dimension > target_dimension:
                      scale_factor = target_dimension / max_dimension
-                     new_width = int(width * scale_factor)
-                     new_height = int(height * scale_factor)
-                     img = img.resize((new_width, new_height), Image.LANCZOS)
+                     resized_width = int(new_width * scale_factor)
+                     resized_height = int(new_height * scale_factor)
+                     img = img.resize((resized_width, resized_height), Image.LANCZOS)

                  # Save to bytes with compression
                  buffer = io.BytesIO()
                  img.save(buffer, format="JPEG", quality=85, optimize=True)
@@ -344,11 +437,64 @@ class StructuredOCR:

          # Process the image with OCR
          logger.info(f"Processing image with OCR using {OCR_MODEL}")
-         image_response = self.client.ocr.process(
-             document=ImageURLChunk(image_url=base64_data_url),
-             model=OCR_MODEL,
-             include_image_base64=True
-         )
+
+         # Log API key information (first and last characters only)
+         if self.api_key:
+             key_preview = f"{self.api_key[:3]}...{self.api_key[-3:]}"
+             logger.info(f"Using API key: {key_preview} (length: {len(self.api_key)})")
+         else:
+             logger.error("No API key provided!")
+
+         # Add retry logic with exponential backoff for API errors
+         max_retries = 3
+         retry_delay = 2
+
+         for retry in range(max_retries):
+             try:
+                 image_response = self.client.ocr.process(
+                     document=ImageURLChunk(image_url=base64_data_url),
+                     model=OCR_MODEL,
+                     include_image_base64=True
+                 )
+                 break  # Success, exit retry loop
+             except Exception as e:
+                 error_msg = str(e)
+                 logger.warning(f"API error on attempt {retry+1}/{max_retries}: {error_msg}")
+
+                 # Check specific error types to handle them appropriately
+                 error_lower = error_msg.lower()
+
+                 # Authentication errors - no point in retrying
+                 if "unauthorized" in error_lower or "401" in error_lower:
+                     logger.error("API authentication failed. Check your API key.")
+                     raise ValueError(f"Authentication failed. Please verify your Mistral API key is correct and active: {error_msg}")
+
+                 # Connection errors - worth retrying
+                 elif "connection" in error_lower or "timeout" in error_lower or "520" in error_msg or "server error" in error_lower:
+                     if retry < max_retries - 1:
+                         # Wait with exponential backoff before retrying
+                         wait_time = retry_delay * (2 ** retry)
+                         logger.info(f"Connection issue detected. Waiting {wait_time}s before retry...")
+                         time.sleep(wait_time)
+                     else:
+                         # Last retry failed
+                         logger.error("Maximum retries reached, API connection error persists.")
+                         raise ValueError(f"Could not connect to Mistral API after {max_retries} attempts: {error_msg}")
+
+                 # Rate limit errors
+                 elif "rate limit" in error_lower or "429" in error_lower:
+                     if retry < max_retries - 1:
+                         wait_time = retry_delay * (2 ** retry) * 2  # Wait longer for rate limits
+                         logger.info(f"Rate limit exceeded. Waiting {wait_time}s before retry...")
+                         time.sleep(wait_time)
+                     else:
+                         logger.error("Maximum retries reached, rate limit error persists.")
+                         raise ValueError(f"Mistral API rate limit exceeded. Please try again later: {error_msg}")
+
+                 # Other errors - no retry
+                 else:
+                     logger.error(f"Unrecoverable API error: {error_msg}")
+                     raise

          # Get the OCR markdown from the first page
          image_ocr_markdown = image_response.pages[0].markdown if image_response.pages else ""
@@ -364,16 +510,17 @@ class StructuredOCR:
          # Extract structured data using the appropriate model
          if use_vision:
              logger.info(f"Using vision model: {VISION_MODEL}")
-             result = self._extract_structured_data_with_vision(base64_data_url, image_ocr_markdown, file_path.name)
+             result = self._extract_structured_data_with_vision(base64_data_url, image_ocr_markdown, file_path.name, custom_prompt)
          else:
-             logger.info(f"Using text-only model: {TEXT_MODEL}")
-             result = self._extract_structured_data_text_only(image_ocr_markdown, file_path.name)
+             logger.info("Using vision model without image")
+             result = self._extract_structured_data_text_only(image_ocr_markdown, file_path.name, custom_prompt)

          # Add confidence score
          result['confidence_score'] = confidence_score

          # Store key parts of the OCR response for image rendering
-         # Extract and store image data in a format that can be serialized to JSON
+         # First store the raw response for backwards compatibility,
+         # then extract and store image data in a format that can be serialized to JSON
          has_images = hasattr(image_response, 'pages') and image_response.pages and hasattr(image_response.pages[0], 'images') and image_response.pages[0].images
          result['has_images'] = has_images
@@ -416,7 +563,7 @@ class StructuredOCR:
              }
          }

-     def _extract_structured_data_with_vision(self, image_base64, ocr_markdown, filename):
+     def _extract_structured_data_with_vision(self, image_base64, ocr_markdown, filename, custom_prompt=None):
          """Extract structured data using vision model"""
          try:
              # Parse with vision model with a timeout
@@ -435,6 +582,7 @@ class StructuredOCR:
                              f"For handwritten documents, carefully preserve the structure. "
                              f"For printed texts, organize content logically by sections, maintaining the hierarchy. "
                              f"For tabular content, preserve the table structure as much as possible."
+                             + (f"\n\nAdditional instructions: {custom_prompt}" if custom_prompt else "")
                          ))
                      ],
                  },
@@ -457,12 +605,12 @@ class StructuredOCR:

          return result

-     def _extract_structured_data_text_only(self, ocr_markdown, filename):
-         """Extract structured data using text-only model"""
+     def _extract_structured_data_text_only(self, ocr_markdown, filename, custom_prompt=None):
+         """Extract structured data without using vision capabilities"""
          try:
-             # Parse with text-only model with a timeout
+             # Parse with the vision model, but without an image attached
              chat_response = self.client.chat.parse(
-                 model=TEXT_MODEL,
+                 model=VISION_MODEL,
                  messages=[
                      {
                          "role": "user",
@@ -473,6 +621,7 @@ class StructuredOCR:
                          f"For handwritten documents, carefully preserve the structure. "
                          f"For printed texts, organize content logically by sections. "
                          f"For tabular content, preserve the table structure as much as possible."
+                         + (f"\n\nAdditional instructions: {custom_prompt}" if custom_prompt else "")
                      },
                  ],
                  response_format=StructuredOCRModel,
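
The retry policy above is duplicated verbatim in `_process_pdf` and `_process_image`; it could be factored into a single helper. A sketch under the same rules (auth errors fail fast, connection errors back off exponentially, rate limits wait twice as long; `call_with_backoff` and its parameters are invented here, not part of the commit):

```python
import time

def call_with_backoff(fn, max_retries=3, retry_delay=2, sleep=time.sleep):
    """Retry fn() with exponential backoff, mirroring the policy in the diff."""
    for retry in range(max_retries):
        try:
            return fn()
        except Exception as e:
            msg = str(e).lower()
            # Authentication errors never succeed on retry
            if "unauthorized" in msg or "401" in msg:
                raise
            retriable = any(t in msg for t in ("connection", "timeout", "520", "server error"))
            rate_limited = "rate limit" in msg or "429" in msg
            if (retriable or rate_limited) and retry < max_retries - 1:
                # Rate limits wait twice as long as plain connection errors
                sleep(retry_delay * (2 ** retry) * (2 if rate_limited else 1))
            else:
                raise

# Usage sketch: both OCR calls could then read
#   pdf_response = call_with_backoff(lambda: client.ocr.process(...))

# Quick check with a function that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("connection reset")
    return "ok"

print(call_with_backoff(flaky, sleep=lambda s: None))  # -> ok
```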
ui/custom.css CHANGED
@@ -300,4 +300,44 @@

  .stTabs [data-baseweb="tab-highlight"] {
      background-color: var(--color-blue-600);
  }
+
+ /* Workflow steps */
+ .workflow-step {
+     background-color: var(--color-gray-800);
+     border-radius: 8px;
+     padding: 15px;
+     border-left: 5px solid var(--color-blue-500);
+     margin-bottom: 15px;
+ }
+
+ .workflow-step.active {
+     border-left: 5px solid var(--color-blue-400);
+     background-color: var(--color-blue-900);
+ }
+
+ .workflow-step.complete {
+     border-left: 5px solid var(--color-blue-300);
+     background-color: var(--color-gray-700);
+ }
+
+ /* Before-after comparison */
+ .comparison-container {
+     display: flex;
+     justify-content: space-between;
+     gap: 10px;
+     margin-bottom: 20px;
+ }
+
+ .comparison-image {
+     flex: 1;
+     border-radius: 8px;
+     overflow: hidden;
+     border: 1px solid var(--color-gray-300);
+ }
+
+ .comparison-title {
+     text-align: center;
+     font-weight: bold;
+     margin-bottom: 5px;
+ }