Update documentation and improve HTML formatting
- Enhanced README with detailed features and documentation
- Improved About section with new special features
- Updated HTML formatting for better historical document presentation
- Added specialized poem formatting and multi-page support
- Removed redundant Technical Details expander
- CLAUDE.md +35 -0
- README.md +26 -8
- app.py +285 -167
- ocr_utils.py +161 -2
- structured_ocr.py +58 -6
CLAUDE.md
ADDED
@@ -0,0 +1,35 @@
# Historical OCR Project Guidelines

## Commands
- Run standard app: `./run_local.sh` or `streamlit run app.py`
- Run educational version: `./run_local.sh educational` or `streamlit run streamlit_app.py`
- Run simple test: `python simple_test.py`
- Run PDF test: `python test_pdf.py`
- Process large files: `./run_large_files.sh --server.maxUploadSize=500 --server.maxMessageSize=500`
- Prepare for Hugging Face: `python prepare_for_hf.py`

## Environment Setup
- API key: Set `MISTRAL_API_KEY` in `.env` file or as environment variable
- System dependencies: Install poppler for PDF processing (brew install poppler on macOS)
- Python dependencies: `pip install -r requirements.txt`

## Code Style
- Imports: Standard library → third-party → local imports
- Documentation: Google-style docstrings with Args, Returns sections
- Error handling: Specific exceptions with informative messages, logging
- Naming: snake_case for variables/functions, PascalCase for classes
- Type hints: Pydantic models for structured data, typing module annotations

## Project Structure
- Core: `structured_ocr.py` - OCR processing with Mistral AI
- Utils: `ocr_utils.py` - Text/image processing utilities
- PDF: `pdf_ocr.py` - PDF-specific document handling
- Config: `config.py` - API settings, model selection
- UI: Streamlit interface with modular components
- Testing: Simple test scripts in project root

## Development Workflow
- Use logging for debugging (configured in structured_ocr.py)
- Test with sample files in input/ directory
- Handle large files with specific options for optimal processing
- Check confidence_score in results to evaluate OCR quality
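The environment setup above expects `MISTRAL_API_KEY` to be available before the app starts. A minimal sketch of how the key can be loaded from `.env` (this assumes `python-dotenv`; the project's actual `config.py` may do this differently):

```python
# Hypothetical helper illustrating the .env-based setup described above.
# Assumes python-dotenv is installed; the real config.py may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # reads MISTRAL_API_KEY from a local .env file if present

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
if not MISTRAL_API_KEY:
    raise RuntimeError("MISTRAL_API_KEY is not set; add it to .env or the environment")
```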
README.md
CHANGED
@@ -13,19 +13,23 @@ short_description: Employs Mistral OCR for transcribing historical data
# Historical Document OCR

This application uses Mistral AI's OCR capabilities to transcribe and extract information from historical documents with enhanced formatting and presentation.

## Features

- OCR processing for both image and PDF files
- Automatic file type detection and content structuring
- Advanced HTML formatting with proper document structure preservation
- Specialized formatting for poems and historical texts
- Interactive web interface with Streamlit
- "With Images" view that preserves document layout and embedded images
- Multi-page document support with pagination
- PDF preview functionality
- Smart handling of large PDFs with automatic page limiting
- Image preprocessing options for enhanced OCR accuracy
- Document export in multiple formats (HTML, JSON)
- Responsive design optimized for historical document presentation
- Enhanced typography with appropriate fonts for historical content

## Project Structure

@@ -69,7 +73,8 @@ Historical OCR - Project Structure

│   └─ process_file.py   # File processing for educational app
│
├─ UI Components (ui/)
│   ├─ layout.py          # Shared UI components and styling
│   └─ custom.css         # Custom styling for the application
│
├─ Data Directories
│   ├─ input/             # Sample documents for testing/demo

@@ -117,7 +122,20 @@ streamlit run app.py

1. Upload an image or PDF file using the file uploader
2. Select processing options in the sidebar (e.g., use vision model, image preprocessing)
3. Click "Process Document" to analyze the file
4. View the results in three available formats:
   - **Structured View**: Beautifully formatted HTML with proper document structure
   - **Raw JSON**: Complete data structure for developers
   - **With Images**: Document with embedded images preserving original layout

## Document Output Features

The application provides several specialized features for historical document presentation:

1. **Poetry Formatting**: Special handling for poem structure with proper line spacing and typography
2. **Image Embedding**: Original document images embedded within the text at their correct positions
3. **Multi-page Support**: Pagination controls for navigating multi-page documents
4. **Typography**: Historical-appropriate fonts and styling for better readability of historical texts
5. **Document Export**: Download options for saving the processed document in HTML format

## Application Versions
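The Poetry Formatting feature keys off the structure of `ocr_contents` produced by the OCR step. A sketch of the kind of section the structured view treats as a poem (field names taken from the `app.py` changes below; the example text is illustrative and real model output will vary):

```python
# Illustrative only: a minimal ocr_contents dict with a poem section,
# matching the shape the structured view in app.py checks for.
ocr_contents = {
    "title": "Lines Written in Early Spring",
    "content": {
        "type": "poem",  # triggers the .poem formatting in the structured view
        "lines": [
            "I heard a thousand blended notes,",
            "While in a grove I sate reclined,",
        ],
    },
}
```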
app.py
CHANGED
@@ -146,17 +146,17 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None):
# Get file size in MB
file_size_mb = os.path.getsize(temp_path) / (1024 * 1024)

# Check if file exceeds size limits (20 MB)
if file_size_mb > 20:
    st.error(f"File too large ({file_size_mb:.1f} MB). Maximum file size is 20MB.")
    return {
        "file_name": uploaded_file.name,
        "topics": ["Document"],
        "languages": ["English"],
        "confidence_score": 0.0,
        "error": f"File size {file_size_mb:.2f} MB exceeds limit of 20 MB",
        "ocr_contents": {
            "error": f"Failed to process file: File size {file_size_mb:.2f} MB exceeds limit of 20 MB",
            "partial_text": "Document could not be processed due to size limitations."
        }
    }

@@ -190,7 +190,7 @@ st.title("Historical Document OCR")

st.subheader("Powered by Mistral AI")

# Create main layout with tabs and columns
main_tab1, main_tab2, main_tab3 = st.tabs(["Document Processing", "About this App", "Companion Workshop"])

with main_tab1:
    # Create a two-column layout for file upload and preview

@@ -203,7 +203,7 @@ with main_tab1:

    Using the `mistral-ocr-latest` model for advanced document understanding.
    """)
    uploaded_file = st.file_uploader("Choose a file", type=["pdf", "png", "jpg", "jpeg"], help="Limit 20MB per file")

# Sidebar with options
with st.sidebar:

@@ -240,7 +240,7 @@ with main_tab2:

st.markdown("""
### About This Application

This app uses [Mistral AI's Document OCR](https://docs.mistral.ai/capabilities/document/) to extract text and images from historical documents with enhanced formatting and presentation.

It can process:
- Image files (jpg, png, etc.)

@@ -250,26 +250,71 @@ with main_tab2:

- Text extraction with `mistral-ocr-latest`
- Analysis with language models
- Layout preservation with images
- Enhanced typography for historical documents

View results in three formats:
- **Structured View**: Beautifully formatted HTML with proper document structure
- **Raw JSON**: Complete data structure for developers
- **With Images**: Document with embedded images preserving original layout

**Special Features:**
- **Poetry Formatting**: Special handling for poem structure with proper line spacing
- **Image Embedding**: Original document images embedded at correct positions
- **Multi-page Support**: Pagination controls for navigating multi-page documents
- **Typography**: Historical-appropriate fonts for better readability
- **Document Export**: Download options for saving in HTML format

**Technical Features:**
- Image preprocessing for better OCR quality
- PDF resolution and page controls
- Progress tracking during processing
- Responsive design optimized for historical document presentation
""")

# Workshop tab content
with main_tab3:
    st.markdown("<h3>Hacking AI for Historical Research</h3>", unsafe_allow_html=True)
    st.markdown("<p style='margin-bottom: 20px;'>Interactive workshop resources and materials</p>", unsafe_allow_html=True)

    # Custom CSS to improve the Padlet embed appearance
    st.markdown("""
    <style>
    .padlet-container {
        border-radius: 8px;
        box-shadow: 0 4px 6px rgba(0,0,0,0.1);
        margin-top: 10px;
        margin-bottom: 20px;
        overflow: hidden;
    }
    </style>
    """, unsafe_allow_html=True)

    # Padlet embed with additional container
    st.markdown("""
    <div class="padlet-container">
        <div class="padlet-embed" style="border:1px solid rgba(0,0,0,0.1);border-radius:8px;box-sizing:border-box;overflow:hidden;position:relative;width:100%;background:#F4F4F4">
            <p style="padding:0;margin:0">
                <iframe src="https://padlet.com/embed/y9daf9yabqcj93dq" frameborder="0" allow="camera;microphone;geolocation" style="width:100%;height:650px;display:block;padding:0;margin:0"></iframe>
            </p>
            <div style="display:flex;align-items:center;justify-content:end;margin:0;height:28px">
                <a href="https://padlet.com?ref=embed" style="display:block;flex-grow:0;margin:0;border:none;padding:0;text-decoration:none" target="_blank">
                    <div style="display:flex;align-items:center;">
                        <img src="https://padlet.net/embeds/made_with_padlet_2022.png" width="114" height="28" style="padding:0;margin:0;background:0 0;border:none;box-shadow:none" alt="Made with Padlet">
                    </div>
                </a>
            </div>
        </div>
    </div>
    """, unsafe_allow_html=True)

with main_tab1:
    if uploaded_file is not None:
        # Check file size (cap at 20MB)
        file_size_mb = len(uploaded_file.getvalue()) / (1024 * 1024)

        if file_size_mb > 20:
            with upload_col:
                st.error(f"File too large ({file_size_mb:.1f} MB). Maximum file size is 20MB.")
                st.stop()

        file_ext = Path(uploaded_file.name).suffix.lower()

@@ -331,10 +376,8 @@ with main_tab1:

# Call process_file with all options
result = process_file(uploaded_file, use_vision, preprocessing_options)

# Single tab for document analysis
with st.container():
    # Create two columns for metadata and content
    meta_col, content_col = st.columns([1, 2])

@@ -368,12 +411,7 @@ with main_tab1:

st.subheader("Document Contents")
if 'ocr_contents' in result:
    # Check if there are images in the OCR result
    has_images = result.get('has_images', False)

    # Create tabs for different views
    if has_images:

@@ -383,37 +421,148 @@ with main_tab1:

with view_tab1:
    # Display in a more user-friendly format based on the content structure
    html_content = '<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="UTF-8">\n<meta name="viewport" content="width=device-width, initial-scale=1.0">\n<title>OCR Document</title>\n<style>\n'
    html_content += """
    body {
        font-family: 'Georgia', serif;
        line-height: 1.6;
        margin: 0;
        padding: 20px;
        background-color: #f9f9f9;
        color: #333;
    }
    .container {
        max-width: 1000px;
        margin: 0 auto;
        background-color: #fff;
        padding: 30px;
        border-radius: 8px;
        box-shadow: 0 4px 12px rgba(0,0,0,0.1);
    }
    h1, h2, h3, h4 {
        font-family: 'Bookman', 'Georgia', serif;
        margin-top: 1.5em;
        margin-bottom: 0.5em;
        color: #222;
    }
    h1 { font-size: 2.2em; border-bottom: 2px solid #e0e0e0; padding-bottom: 10px; }
    h2 { font-size: 1.8em; border-bottom: 1px solid #e0e0e0; padding-bottom: 6px; }
    h3 { font-size: 1.5em; }
    h4 { font-size: 1.2em; }
    p { margin-bottom: 1.2em; text-align: justify; }
    ul, ol { margin-bottom: 1.5em; }
    li { margin-bottom: 0.5em; }
    .poem {
        font-family: 'Baskerville', 'Georgia', serif;
        margin-left: 2em;
        line-height: 1.8;
        white-space: pre-wrap;
    }
    .subtitle {
        font-style: italic;
        font-size: 1.1em;
        margin-bottom: 1.5em;
        color: #555;
    }
    blockquote {
        border-left: 3px solid #ccc;
        margin: 1.5em 0;
        padding: 0.5em 1.5em;
        background-color: #f5f5f5;
        font-style: italic;
    }
    dl {
        margin-bottom: 1.5em;
    }
    dt {
        font-weight: bold;
        margin-top: 1em;
    }
    dd {
        margin-left: 2em;
        margin-bottom: 0.5em;
    }
    </style>
    </head>
    <body>
    <div class="container">
    """

    if isinstance(result['ocr_contents'], dict):
        for section, content in result['ocr_contents'].items():
            if not content:  # Skip empty sections
                continue

            section_title = section.replace('_', ' ').title()

            # Special handling for title and subtitle
            if section.lower() == 'title':
                html_content += f'<h1>{content}</h1>\n'
                st.markdown(f"## {content}")
            elif section.lower() == 'subtitle':
                html_content += f'<div class="subtitle">{content}</div>\n'
                st.markdown(f"*{content}*")
            else:
                # Section headers for non-title sections
                html_content += f'<h3>{section_title}</h3>\n'
                st.markdown(f"### {section_title}")

            # Process different content types
            if isinstance(content, str):
                # Handle poem type specifically
                if section.lower() == 'type' and content.lower() == 'poem':
                    # Don't add special formatting here, just for the lines
                    st.markdown(content)
                    html_content += f'<p>{content}</p>\n'
                elif 'content' in result['ocr_contents'] and isinstance(result['ocr_contents']['content'], dict) and 'type' in result['ocr_contents']['content'] and result['ocr_contents']['content']['type'] == 'poem' and section.lower() == 'content':
                    # This is handled in the dict case below
                    pass
                else:
                    # Regular text content
                    paragraphs = content.split('\n\n')
                    for p in paragraphs:
                        if p.strip():
                            html_content += f'<p>{p.strip()}</p>\n'
                    st.markdown(content)

            elif isinstance(content, list):
                # Handle lists (bullet points, etc.)
                html_content += '<ul>\n'
                for item in content:
                    if isinstance(item, str):
                        html_content += f'<li>{item}</li>\n'
                        st.markdown(f"- {item}")
                    elif isinstance(item, dict):
                        # Format dictionary items in a readable way
                        html_content += f'<li><pre>{json.dumps(item, indent=2)}</pre></li>\n'
                        st.json(item)
                html_content += '</ul>\n'

            elif isinstance(content, dict):
                # Special handling for poem type
                if 'type' in content and content['type'] == 'poem' and 'lines' in content:
                    html_content += '<div class="poem">\n'
                    for line in content['lines']:
                        html_content += f'{line}\n'
                        st.markdown(line)
                    html_content += '</div>\n'
                else:
                    # Regular dictionary display
                    html_content += '<dl>\n'
                    for k, v in content.items():
                        html_content += f'<dt>{k}</dt>\n<dd>'
                        if isinstance(v, str):
                            html_content += v
                        elif isinstance(v, list):
                            html_content += ', '.join(str(item) for item in v)
                        else:
                            html_content += str(v)
                        html_content += '</dd>\n'
                        st.markdown(f"**{k}:** {v}")
                    html_content += '</dl>\n'

    # Close HTML document
    html_content += '</div>\n</body>\n</html>'

    # Add download button in a smaller section
    with st.expander("Export Content"):

@@ -437,125 +586,60 @@ with main_tab1:

try:
    # Import function
    try:
        from ocr_utils import create_html_with_images
    except ImportError:
        st.error("Required module ocr_utils not found.")
        st.stop()

    # Check if has_images flag is set
    if not result.get('has_images', False) or 'pages_data' not in result:
        st.warning("No image data available in the OCR response.")
        st.stop()

    # Count images in the result
    image_count = 0
    for page in result.get('pages_data', []):
        image_count += len(page.get('images', []))

    # Add warning for image-heavy documents
    if image_count > 10:
        st.warning(f"This document contains {image_count} images. Rendering may take longer than usual.")

    # Generate HTML with images
    html_with_images = create_html_with_images(result)

    # For multi-page documents, create page navigation
    page_count = len(result.get('pages_data', []))

    if page_count > 1:
        st.info(f"Document contains {page_count} pages. You can scroll to view all pages or use the page selector below.")

        # Create a page selector
        page_options = [f"Page {i+1}" for i in range(page_count)]
        selected_page = st.selectbox("Jump to page:", options=page_options, index=0)

        # Extract page number from selection
        page_num = int(selected_page.split(" ")[1])

        # Add JavaScript to scroll to the selected page
        st.markdown(f"""
        <script>
            document.addEventListener('DOMContentLoaded', function() {{
                const element = document.getElementById('page-{page_num}');
                if (element) {{
                    element.scrollIntoView({{ behavior: 'smooth' }});
                }}
            }});
        </script>
        """, unsafe_allow_html=True)

    # Display the HTML content
    st.components.v1.html(html_with_images, height=600, scrolling=True)

    # Add download button for the content with images
    st.download_button(
        label="Download with Images (HTML)",
        data=html_with_images,
        file_name="document_with_images.html",
        mime="text/html"
    )

@@ -565,10 +649,6 @@ with main_tab1:

    st.info("Try refreshing or processing the document again.")
else:
    st.error("No OCR content was extracted from the document.")

except Exception as e:
    st.error(f"Error processing document: {str(e)}")

@@ -577,25 +657,63 @@ with main_tab1:

st.info("Upload a document to get started using the file uploader above.")

# Show example images in a grid
# Add a sample images container
with st.container():
    # Find sample images from the input directory to display
    input_dir = Path(__file__).parent / "input"
    sample_images = []
    if input_dir.exists():
        # Get all potential image files - exclude PDF files
        all_images = []
        all_images.extend(list(input_dir.glob("*.jpg")))
        all_images.extend(list(input_dir.glob("*.jpeg")))
        all_images.extend(list(input_dir.glob("*.png")))

        # Filter to get a good set of diverse images - not too small, not too large
        valid_images = [path for path in all_images if 50000 < path.stat().st_size < 1000000]

        # Deduplicate any images that might have the same content (like recipe and historical-recipe)
        seen_sizes = {}
        deduplicated_images = []
        for img in valid_images:
            size = img.stat().st_size
            # If we haven't seen this exact file size before, include it
            # This simple heuristic works well enough for images with identical content
            if size not in seen_sizes:
                seen_sizes[size] = True
                deduplicated_images.append(img)

        valid_images = deduplicated_images

        # Select a random sample of 6 images if we have enough
        import random
        if len(valid_images) > 6:
            sample_images = random.sample(valid_images, 6)
        else:
            sample_images = valid_images

    if sample_images:
        # Create two rows of three columns

        # First row
        row1 = st.columns(3)
        for i in range(3):
            if i < len(sample_images):
                with row1[i]:
                    try:
                        st.image(str(sample_images[i]), caption=sample_images[i].name, use_container_width=True)
                    except Exception:
                        # Silently skip problematic images
                        pass

        # Second row
        row2 = st.columns(3)
        for i in range(3):
            idx = i + 3
            if idx < len(sample_images):
                with row2[i]:
                    try:
                        st.image(str(sample_images[idx]), caption=sample_images[idx].name, use_container_width=True)
                    except Exception:
                        # Silently skip problematic images
                        pass
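The new "With Images" path above relies on two fields that `structured_ocr.py` now adds to the result (see the changes further down): `has_images` and `pages_data`. A sketch of the shape that view expects, with a hypothetical data-URI thumbnail standing in for real OCR output:

```python
# Illustrative shape of the fields consumed by the "With Images" view in app.py.
# The tiny data URI below is a placeholder, not real OCR output.
result = {
    "has_images": True,
    "pages_data": [
        {
            "page_number": 1,
            "markdown": "# Sample page\n\n![img-0](img-0)",
            "images": [
                {"id": "img-0", "image_base64": "data:image/png;base64,iVBORw0KGgo="},
            ],
        },
    ],
}
```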
ocr_utils.py
CHANGED
@@ -125,7 +125,7 @@ def ocr_response_to_json(ocr_response, indent: int = 4) -> str:
    response_dict = json.loads(ocr_response.model_dump_json())
    return json.dumps(response_dict, indent=indent)

def get_combined_markdown_compressed(ocr_response, max_width: int = 1200, quality: int = 92) -> str:
    """
    Combine OCR text and images into a single markdown document with compressed images.
    Reduces image sizes to improve performance.

@@ -209,4 +209,163 @@ try:

    display(Markdown(combined_markdown))
except ImportError:
    # IPython not available
    pass


def create_html_with_images(result_with_pages: dict) -> str:
    """
    Create HTML with embedded images from the OCR result.

    Args:
        result_with_pages: OCR result with pages_data containing markdown and images

    Returns:
        HTML string with embedded images
    """
    if not result_with_pages.get('has_images', False) or 'pages_data' not in result_with_pages:
        return "<p>No images available in the document.</p>"

    # Create HTML document
    html = """<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document with Images</title>
    <style>
        body {
            font-family: 'Georgia', serif;
            line-height: 1.6;
            margin: 0;
            padding: 20px;
            background-color: #f9f9f9;
            color: #333;
        }
        .container {
            max-width: 1000px;
            margin: 0 auto;
            background-color: #fff;
            padding: 30px;
            border-radius: 8px;
            box-shadow: 0 4px 12px rgba(0,0,0,0.1);
        }
        h1, h2, h3, h4 {
            font-family: 'Bookman', 'Georgia', serif;
            margin-top: 1.5em;
            margin-bottom: 0.5em;
            color: #222;
        }
        h1 { font-size: 2.2em; border-bottom: 2px solid #e0e0e0; padding-bottom: 10px; }
        h2 { font-size: 1.8em; border-bottom: 1px solid #e0e0e0; padding-bottom: 6px; }
        h3 { font-size: 1.5em; }
        h4 { font-size: 1.2em; }
        p { margin-bottom: 1.2em; text-align: justify; }
        img {
            max-width: 100%;
            height: auto;
            margin: 20px 0;
            border: 1px solid #ddd;
            border-radius: 6px;
            box-shadow: 0 3px 6px rgba(0,0,0,0.1);
            display: block;
        }
        .page {
            margin-bottom: 40px;
            padding-bottom: 30px;
            border-bottom: 1px dashed #ccc;
        }
        .page:last-child {
            border-bottom: none;
        }
        .page-title {
            text-align: center;
            color: #555;
            font-style: italic;
            margin: 30px 0;
        }
        pre {
            background-color: #f5f5f5;
            padding: 15px;
            border-radius: 5px;
            overflow-x: auto;
            font-size: 14px;
            line-height: 1.4;
        }
        blockquote {
            border-left: 3px solid #ccc;
            margin: 1.5em 0;
            padding: 0.5em 1.5em;
            background-color: #f5f5f5;
            font-style: italic;
        }
        .poem {
            font-family: 'Baskerville', 'Georgia', serif;
            margin-left: 2em;
            line-height: 1.8;
            white-space: pre-wrap;
        }
    </style>
</head>
<body>
    <div class="container">
"""

    # Process each page
    pages_data = result_with_pages.get('pages_data', [])
    for page_idx, page in enumerate(pages_data):
        page_number = page.get('page_number', page_idx + 1)
        page_markdown = page.get('markdown', '')
        page_images = page.get('images', [])

        # Add page header
        html += f'<div class="page" id="page-{page_number}">\n'
        if len(pages_data) > 1:
            html += f'<div class="page-title">Page {page_number}</div>\n'

        # Process markdown text and replace image references
        if page_markdown:
            # Replace image markers with actual images
            for img in page_images:
                img_id = img.get('id', '')
                img_base64 = img.get('image_base64', '')

                if img_id and img_base64:
                    # Format image tag
                    img_tag = f'<img src="{img_base64}" alt="Image {img_id}" loading="lazy">'
                    # Replace markdown image reference with HTML image
                    page_markdown = page_markdown.replace(f'![{img_id}]({img_id})', img_tag)

            # Convert line breaks to <p> tags for proper HTML formatting
            paragraphs = page_markdown.split('\n\n')
            for paragraph in paragraphs:
                if paragraph.strip():
                    # Check if this looks like a header
                    if paragraph.startswith('# '):
                        header_text = paragraph[2:].strip()
                        html += f'<h1>{header_text}</h1>\n'
                    elif paragraph.startswith('## '):
                        header_text = paragraph[3:].strip()
                        html += f'<h2>{header_text}</h2>\n'
                    elif paragraph.startswith('### '):
                        header_text = paragraph[4:].strip()
                        html += f'<h3>{header_text}</h3>\n'
                    else:
                        html += f'<p>{paragraph}</p>\n'

        # Add any images that weren't referenced in the markdown
        referenced_img_ids = [img.get('id') for img in page_images if img.get('id') in page_markdown]
        for img in page_images:
            img_id = img.get('id', '')
            img_base64 = img.get('image_base64', '')

            if img_id and img_base64 and img_id not in referenced_img_ids:
                html += f'<img src="{img_base64}" alt="Image {img_id}" loading="lazy">\n'

        # Close page div
        html += '</div>\n'

    # Close main container and document
    html += """    </div>
</body>
</html>"""

    return html
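A short usage sketch for the new helper. The function and its signature come from the diff above; the input dict is a minimal placeholder shaped like the `pages_data` example shown after the app.py diff:

```python
from ocr_utils import create_html_with_images

# Minimal placeholder input; real pages_data comes from StructuredOCR results.
result = {
    "has_images": True,
    "pages_data": [{"page_number": 1, "markdown": "# Page one", "images": []}],
}

html = create_html_with_images(result)
with open("document_with_images.html", "w", encoding="utf-8") as f:
    f.write(html)
```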
structured_ocr.py
CHANGED
@@ -238,8 +238,31 @@ class StructuredOCR:
# Add confidence score
result['confidence_score'] = confidence_score

# Store key parts of the OCR response for image rendering
# Extract and store image data in a format that can be serialized to JSON
has_images = hasattr(pdf_response, 'pages') and any(hasattr(page, 'images') and page.images for page in pdf_response.pages)
result['has_images'] = has_images

if has_images:
    # Create a structured representation of images that can be serialized
    result['pages_data'] = []
    for page_idx, page in enumerate(pdf_response.pages):
        page_data = {
            'page_number': page_idx + 1,
            'markdown': page.markdown if hasattr(page, 'markdown') else '',
            'images': []
        }

        # Extract images if present
        if hasattr(page, 'images') and page.images:
            for img_idx, img in enumerate(page.images):
                if hasattr(img, 'image_base64') and img.image_base64:
                    page_data['images'].append({
                        'id': img.id if hasattr(img, 'id') else f"img_{page_idx}_{img_idx}",
                        'image_base64': img.image_base64
                    })

        result['pages_data'].append(page_data)

logger.info(f"PDF processing completed successfully")
return result

@@ -300,8 +323,31 @@ class StructuredOCR:

# Add confidence score
result['confidence_score'] = confidence_score

# Store key parts of the OCR response for image rendering
# Extract and store image data in a format that can be serialized to JSON
has_images = hasattr(image_response, 'pages') and image_response.pages and hasattr(image_response.pages[0], 'images') and image_response.pages[0].images
result['has_images'] = has_images

if has_images:
    # Create a structured representation of images that can be serialized
    result['pages_data'] = []
    for page_idx, page in enumerate(image_response.pages):
        page_data = {
            'page_number': page_idx + 1,
            'markdown': page.markdown if hasattr(page, 'markdown') else '',
            'images': []
        }

        # Extract images if present
        if hasattr(page, 'images') and page.images:
            for img_idx, img in enumerate(page.images):
                if hasattr(img, 'image_base64') and img.image_base64:
                    page_data['images'].append({
                        'id': img.id if hasattr(img, 'id') else f"img_{page_idx}_{img_idx}",
                        'image_base64': img.image_base64
                    })

        result['pages_data'].append(page_data)

logger.info("Image processing completed successfully")
return result

@@ -336,7 +382,10 @@ class StructuredOCR:

        f"This is a historical document's OCR in markdown:\n"
        f"<BEGIN_IMAGE_OCR>\n{ocr_markdown}\n<END_IMAGE_OCR>.\n"
        f"Convert this into a structured JSON response with the OCR contents in a sensible dictionary. "
        f"Extract topics, languages, document type, date if present, and key entities. "
        f"For handwritten documents, carefully preserve the structure. "
        f"For printed texts, organize content logically by sections, maintaining the hierarchy. "
        f"For tabular content, preserve the table structure as much as possible."
    ))
],
},

@@ -371,7 +420,10 @@ class StructuredOCR:

    "content": f"This is a historical document's OCR in markdown:\n"
               f"<BEGIN_IMAGE_OCR>\n{ocr_markdown}\n<END_IMAGE_OCR>.\n"
               f"Convert this into a structured JSON response with the OCR contents. "
               f"Extract topics, languages, document type, date if present, and key entities. "
               f"For handwritten documents, carefully preserve the structure. "
               f"For printed texts, organize content logically by sections. "
               f"For tabular content, preserve the table structure as much as possible."
},
],
response_format=StructuredOCRModel,
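Because `pages_data` holds only plain dicts, strings, and numbers, the enriched result can be serialized directly, which is what lets app.py show it in the Raw JSON view and export it. A quick check with hypothetical field values:

```python
import json

# pages_data entries contain only built-in types, so the result round-trips through JSON.
result = {
    "confidence_score": 0.92,
    "has_images": True,
    "pages_data": [{"page_number": 1, "markdown": "# Sample page", "images": []}],
}
assert json.loads(json.dumps(result)) == result
```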