---
license: cc-by-4.0
language:
- en
metrics:
- accuracy
- recall
pipeline_tag: image-to-text
tags:
- agriculture
- leaf
- disease
datasets:
- enalis/LeafNet
library_name: transformers
---

# 🌿 SCOLD: A Vision-Language Foundation Model for Leaf Disease Identification

**SCOLD** (Leaf Diseases Vision-Language) is a multimodal model that maps **images** and **text descriptions** into a shared embedding space. It combines a [Swin Transformer](https://huggingface.co/timm/swin_tiny_patch4_window7_224) as the **image encoder** and [RoBERTa](https://huggingface.co/roberta-base) as the **text encoder**, with both outputs projected into a 512-dimensional common space.

This model is developed for **cross-modal retrieval**, **few-shot classification**, and **explainable AI in agriculture**, especially for plant disease diagnosis from both images and domain-specific text prompts.

---

## 🚀 Model Details

| Component        | Architecture                                 |
|------------------|----------------------------------------------|
| Image Encoder    | Swin Base (patch4, window7, 224 resolution)  |
| Text Encoder     | RoBERTa-base                                 |
| Projection Head  | Linear layer (to 512-D space)                |
| Normalization    | L2 on both embeddings                        |
| Training Task    | Contrastive learning                         |

The final embeddings from the image and text encoders are aligned using cosine similarity.

---

### ✅ Intended Use

- Vision-language embedding for classification or retrieval tasks
- Few-shot learning in agricultural or medical datasets
- Multimodal interpretability or zero-shot transfer

---

## 🧪 How to Use

```python
import torch
from transformers import RobertaTokenizer
from torchvision import transforms
from PIL import Image
from modeling_lvl import LVL  # Replace with your module or package

# Load model
model = LVL()
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model.eval()

# Text preprocessing
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
text = "A maize leaf with bacterial blight"
inputs = tokenizer(text, return_tensors="pt")

# Image preprocessing
image = Image.open("path_to_leaf.jpg").convert("RGB")
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])
image_tensor = transform(image).unsqueeze(0)

# Inference
with torch.no_grad():
    image_emb, text_emb = model(image_tensor, inputs["input_ids"], inputs["attention_mask"])

similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
print(f"Similarity score: {similarity.item():.4f}")
```

Please cite this paper if you find this code useful:

```
@misc{quoc2025visionlanguage,
  author    = {Quoc, K. N. and Thu, L. L. T. and Quach, L. D.},
  title     = {A Vision-Language Foundation Model for Leaf Disease Identification},
  year      = {2025},
  publisher = {Authorea Preprints},
  doi       = {10.36227/techrxiv.174062971.11176782/v1}
}
```

Try the demo [here](https://leafclip.streamlit.app/).
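The usage snippet above imports an `LVL` class from `modeling_lvl`, which is not reproduced in this card. For orientation, here is a minimal, hypothetical sketch of what such a dual-encoder module could look like, following the component table above (Swin image backbone, RoBERTa-base text backbone, linear projection heads into a 512-D space, L2 normalization). This is an illustrative assumption, not the released implementation; the provided `pytorch_model.bin` should be loaded against the actual class.

```python
import torch
import torch.nn as nn
import timm
from transformers import RobertaModel

class LVL(nn.Module):
    """Hypothetical dual-encoder sketch matching the Model Details table."""

    def __init__(self, embed_dim=512):
        super().__init__()
        # Image encoder: Swin backbone; num_classes=0 yields pooled features.
        # pretrained=False since weights come from the released checkpoint.
        self.image_encoder = timm.create_model(
            "swin_base_patch4_window7_224", pretrained=False, num_classes=0
        )
        # Text encoder: RoBERTa-base
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        # Linear projection heads into the shared 512-D space
        self.image_proj = nn.Linear(self.image_encoder.num_features, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.image_proj(self.image_encoder(pixel_values))
        txt_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Use the <s> token's hidden state as the sentence embedding (an assumption)
        txt = self.text_proj(txt_out.last_hidden_state[:, 0])
        # L2-normalize both embeddings so cosine similarity is a plain dot product
        return nn.functional.normalize(img, dim=-1), nn.functional.normalize(txt, dim=-1)
```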
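Because both embeddings are L2-normalized, zero-shot classification reduces to picking the text prompt with the highest cosine similarity to the image. A short sketch, reusing `model`, `tokenizer`, and `image_tensor` from the usage example above (the prompt list is illustrative, not the model's label set):

```python
import torch

# Candidate disease descriptions (hypothetical labels for illustration)
prompts = [
    "A maize leaf with bacterial blight",
    "A healthy maize leaf",
    "A maize leaf with common rust",
]
batch = tokenizer(prompts, padding=True, return_tensors="pt")

with torch.no_grad():
    image_emb, text_emb = model(image_tensor, batch["input_ids"], batch["attention_mask"])
    # Normalized embeddings: matrix product = cosine similarity, shape (1, num_prompts)
    logits = image_emb @ text_emb.T
    probs = logits.softmax(dim=-1).squeeze(0)

for prompt, p in zip(prompts, probs):
    print(f"{p:.3f}  {prompt}")
```

The softmax here only ranks the candidate prompts relative to each other; for calibrated scores you would typically apply the temperature used during contrastive training.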