	---
	license: apache-2.0
	language:
	- en
	base_model:
	- Qwen/Qwen2-VL-2B-Instruct
	pipeline_tag: image-text-to-text
	library_name: transformers
	tags:
	- latex
	- vLM
	- Vision
	- Codec
	---

	![qwenVL.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/g8zYbOSBt4NSqhSIypaX3.png)

	--------------

	# LatexMind-2B-Codec

	The LatexMind-2B-Codec model is a fine-tuned version of Qwen2-VL-2B-Instruct, optimized for Optical Character Recognition (OCR), image-to-text conversion, and mathematical expression extraction with LaTeX formatting. This model integrates a conversational approach with visual and textual understanding to handle multi-modal tasks effectively.

	# Key Enhancements:

	* SoTA understanding of images with various resolutions & aspect ratios: LatexMind-2B-Codec achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.

	* Advanced LaTeX extraction: The model specializes in extracting structured mathematical expressions from images and documents, converting them into LaTeX format for precise rendering and further computation.

	* Understanding long-duration videos (20min+): LatexMind-2B-Codec can process videos over 20 minutes long, enabling high-quality video-based question answering, mathematical solution explanation, and educational content creation.

	* Agent capabilities for automated operations: With complex reasoning and decision-making abilities, the model can be integrated with mobile devices, robots, and assistive technologies to automate tasks based on visual and textual inputs.

	* Multilingual Support: To serve global users, in addition to English and Chinese, the model supports text recognition inside images across multiple languages, including European languages, Japanese, Korean, Arabic, Vietnamese, etc.

	This model is particularly effective in retrieving mathematical notations and equations from scanned documents, whiteboard images, and handwritten notes, ensuring accurate conversion to LaTeX code for further academic and computational applications.
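
As a rough illustration of the LaTeX extraction use case, the helper below builds a chat message that asks the model to transcribe an equation image. It is a minimal sketch: `build_latex_request`, the prompt wording, and the example file path are hypothetical and not part of any official API for this model. The message layout mirrors the one used in the Transformers example in this README.

```python
def build_latex_request(image, prompt=None):
    """Build a single-turn chat message asking for a LaTeX transcription.

    `image` is a URL or local path. The default prompt wording is an
    assumption, not an officially recommended prompt.
    """
    if prompt is None:
        prompt = "Extract the mathematical expressions in this image as LaTeX."
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Hypothetical local file; replace with your own scan or photo.
messages = build_latex_request("file:///path/to/equation.png")
```

The resulting `messages` list can be passed to `processor.apply_chat_template` and `process_vision_info` in the same way as the `messages` list in the Transformers example.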

	# Sample Inference with Doc

	![latexqwen.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/-h5z3giEudPrdM9qRMMTe.png)

	Demo: https://huggingface.co/prithivMLmods/LatexMind-2B-Codec/blob/main/latexmind/latexmind-codec.ipynb

	# Use it with Transformers


	```python
	import torch  # needed for torch.bfloat16 in the optional flash_attention_2 setup below
	from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
	from qwen_vl_utils import process_vision_info

	# default: Load the model on the available device(s)
	model = Qwen2VLForConditionalGeneration.from_pretrained(
	    "prithivMLmods/LatexMind-2B-Codec", torch_dtype="auto", device_map="auto"
	)

	# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
	# model = Qwen2VLForConditionalGeneration.from_pretrained(
	#     "prithivMLmods/LatexMind-2B-Codec",
	#     torch_dtype=torch.bfloat16,
	#     attn_implementation="flash_attention_2",
	#     device_map="auto",
	# )

	# default processor
	processor = AutoProcessor.from_pretrained("prithivMLmods/Qwen2-VL-OCR-2B-Instruct")

	# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
	# min_pixels = 256 * 28 * 28
	# max_pixels = 1280 * 28 * 28
	# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

	messages = [
	    {
	        "role": "user",
	        "content": [
	            {
	                "type": "image",
	                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
	            },
	            {"type": "text", "text": "Describe this image."},
	        ],
	    }
	]

	# Preparation for inference
	text = processor.apply_chat_template(
	    messages, tokenize=False, add_generation_prompt=True
	)
	image_inputs, video_inputs = process_vision_info(messages)
	inputs = processor(
	    text=[text],
	    images=image_inputs,
	    videos=video_inputs,
	    padding=True,
	    return_tensors="pt",
	)
	# Move the inputs to the same device as the model (works on CPU-only machines too)
	inputs = inputs.to(model.device)

	# Inference: Generation of the output
	generated_ids = model.generate(**inputs, max_new_tokens=128)
	generated_ids_trimmed = [
	    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]
	output_text = processor.batch_decode(
	    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
	)
	print(output_text)
	```
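
The `min_pixels` and `max_pixels` settings mentioned in the example above are derived from visual token counts. In the Qwen2-VL base model, each visual token corresponds to a 28×28 pixel area, so a token budget is multiplied by 28 * 28 to get a pixel budget. The sketch below works through that arithmetic; `pixel_budget` is a hypothetical helper name, and the 28-pixel side length is taken from the base model's documentation and assumed to be unchanged in this fine-tune.

```python
# Pixels per visual-token side, as documented for Qwen2-VL (assumed unchanged here).
PATCH_SIDE = 28


def pixel_budget(num_tokens, patch_side=PATCH_SIDE):
    """Return the pixel count that corresponds to `num_tokens` visual tokens."""
    return num_tokens * patch_side * patch_side


min_pixels = pixel_budget(256)
max_pixels = pixel_budget(1280)
print(min_pixels, max_pixels)  # 200704 1003520
```

Lower budgets reduce memory use and latency at the cost of fine detail, which may matter for small subscripts and superscripts in equations.
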
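	# Buffer

	When streaming output, accumulate the chunks in a buffer and strip end-of-turn tokens before yielding. Here `streamer` is any iterable of text chunks, such as a Transformers `TextIteratorStreamer`.

	```python
	def stream_output(streamer):
	    """Yield the accumulated text after each new chunk from `streamer`."""
	    buffer = ""
	    for new_text in streamer:
	        buffer += new_text
	        # Remove <|im_end|> or similar tokens from the output
	        buffer = buffer.replace("<|im_end|>", "")
	        yield buffer
	```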
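
The buffer loop above needs a `streamer` to iterate over. One common way to obtain one, sketched below under assumptions, is to run `model.generate` in a background thread with a Transformers `TextIteratorStreamer`. The function names `accumulate_stream` and `stream_generate` are hypothetical, and `model`, `processor`, and `inputs` are assumed to be the objects created in the Transformers example in this README.

```python
from threading import Thread


def accumulate_stream(chunks, stop_token="<|im_end|>"):
    """Yield the growing reply text with the stop token stripped."""
    buffer = ""
    for new_text in chunks:
        buffer += new_text
        buffer = buffer.replace(stop_token, "")
        yield buffer


def stream_generate(model, processor, inputs, max_new_tokens=128):
    """Run generation in a background thread and yield partial text."""
    # Imported lazily so accumulate_stream can be used without transformers installed.
    from transformers import TextIteratorStreamer

    streamer = TextIteratorStreamer(
        processor.tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens)
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    # The streamer blocks until new text arrives and stops when generation ends.
    yield from accumulate_stream(streamer)
    thread.join()
```

A caller would then iterate, for example `for partial in stream_generate(model, processor, inputs): ...`, and render each partial string as it arrives.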

	# Intended Use

	LatexMind-2B-Codec is designed for tasks that require image-based text recognition, math equation extraction, and multi-modal understanding. It is particularly useful in the following scenarios:

	* Optical Character Recognition (OCR): Extracting printed and handwritten text from images, documents, and scanned pages.
	* Math Expression Recognition: Converting mathematical notations into structured LaTeX format for further computation and documentation.
	* Image-to-Text Conversion: Generating accurate descriptions for text-rich and math-heavy images.
	* Document and Academic Processing: Assisting researchers, students, and professionals in digitizing handwritten notes and extracting structured content from books, PDFs, and whiteboards.
	* Automated Educational Support: Enabling AI-powered tutors, content summarization, and interactive learning for subjects involving complex equations.
	* Multi-Language OCR: Recognizing text inside images across multiple languages, including English, Chinese, Japanese, Korean, Arabic, and various European languages.
	* Video-Based Question Answering: Understanding long-duration videos for content summarization, question answering, and structured data extraction.
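
For the video scenario, a request can be sketched in the message format that the Qwen2-VL base model and `qwen_vl_utils` document for video inputs. The helper name `build_video_request`, the file path, and the default `fps` and `max_pixels` values are illustrative assumptions, and how well this fine-tune handles video has not been verified here.

```python
def build_video_request(video, question, fps=1.0, max_pixels=360 * 420):
    """Build a single-turn chat message that asks a question about a video.

    `video` is a URL or local path. `fps` and `max_pixels` control how many
    frames are sampled and how large each frame may be.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video, "max_pixels": max_pixels, "fps": fps},
                {"type": "text", "text": question},
            ],
        }
    ]


# Hypothetical lecture recording; replace with your own file.
messages = build_video_request(
    "file:///path/to/lecture.mp4",
    "Write the final equation shown on the whiteboard in LaTeX.",
)
```

Lowering `fps` or `max_pixels` reduces the number of visual tokens, which is usually necessary for long recordings.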

	# Limitations

	Despite its capabilities, LatexMind-2B-Codec has some inherent limitations:

	* Handwritten Text Accuracy: While the model can recognize handwritten equations, performance may degrade with highly unstructured or messy handwriting.
	* Complex LaTeX Formatting: The model may struggle with deeply nested or ambiguous LaTeX expressions, requiring manual corrections for precise formatting.
	* Low-Resolution Images: Blurry or low-resolution images can lead to misinterpretations or OCR errors.
	* Contextual Understanding in Multi-Step Equations: The model recognizes math expressions, but its ability to solve multi-step problems autonomously may be limited.
	* Limited Support for Rare Mathematical Notations: Some specialized or domain-specific symbols may not be recognized with high accuracy.
	* Processing Speed for Large Documents: Processing may slow down when handling extremely large documents or dense mathematical content in real-time applications.
	* Language-Specific OCR Variability: Although multiple languages are supported, OCR accuracy may vary with script complexity and font style.
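
Because deeply nested expressions may need manual correction, it can help to flag structurally suspicious outputs before using them. The check below is a minimal sketch, not a LaTeX parser: `latex_issues` is a hypothetical helper that only looks for unbalanced braces and mismatched `\begin`/`\end` pairs, so it cannot tell whether the transcription is mathematically correct.

```python
import re


def latex_issues(latex):
    """Return a list of structural problems found in a LaTeX string."""
    issues = []

    # Check that unescaped braces are balanced; "\{" and "\}" are skipped
    # because the regex consumes a backslash together with the next character.
    depth = 0
    for match in re.finditer(r"\\.|[{}]", latex, flags=re.DOTALL):
        token = match.group()
        if token == "{":
            depth += 1
        elif token == "}":
            depth -= 1
            if depth < 0:
                issues.append("unmatched closing brace")
                depth = 0
    if depth > 0:
        issues.append("unclosed opening brace")

    # Check that \begin{env} and \end{env} are properly nested.
    stack = []
    for kind, env in re.findall(r"\\(begin|end)\{([^{}]*)\}", latex):
        if kind == "begin":
            stack.append(env)
        elif not stack or stack.pop() != env:
            issues.append(f"unexpected \\end{{{env}}}")
    issues.extend(f"unclosed \\begin{{{env}}}" for env in stack)

    return issues
```

An empty list means no structural problem was detected; any other result suggests the output should be reviewed by hand or regenerated.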