Model Card: LISAt
Model Description
LISAT (Language-Instructed Segmentation Assistant for Satellite Imagery) is a vision-language model (VLM) designed specifically for complex remote-sensing images. Unlike traditional segmentation models, which are limited to recognizing a pre-defined set of objects, LISAT can reason over intricate user queries that refer to multiple objects of interest. This allows LISAT to generate segmentation masks from complex and implicit query text, offering a more flexible and intuitive approach to image understanding.
LISAT was trained on a new curated geospatial reasoning-segmentation dataset, GRES, which contains 27,615 annotations across 9,205 images, as well as a multi-modal geospatial pre-training dataset, PreGRES, with over 1 million QA pairs. These datasets allow LISAT to not only describe remote-sensing images and answer complex questions but also identify and segment specific objects within these images.
Our model surpasses existing geospatial foundation models, such as RS-GPT4V, by over 10.04% (BLEU-4) on remote-sensing visual description tasks. On remote-sensing reasoning-segmentation tasks, LISAT outperforms state-of-the-art open-domain models by 143.36% (gIoU).
With its advanced reasoning capabilities and strong performance on geospatial tasks, LISAT is a powerful tool for anyone working with remote-sensing data, enabling more accurate and detailed analysis of complex visual information.
Model Details
- Model architecture: Inspired by LISA (Lai et al., 2024), LISAT integrates a multimodal large language model (LLM) with a segmentation model; a minimal sketch of this design appears after this list.
- Training data: We introduce the Geospatial Reasoning Segmentation Dataset (GRES), a collection of vision-and-language data designed around remote-sensing applications. GRES consists of two core components: PreGRES, a pre-training dataset of over 1M remote-sensing-specific visual instruction-tuning Q/A pairs for geospatial models, and GRES itself, a semi-synthetic dataset specialized for reasoning segmentation of remote-sensing data, consisting of 9,205 images and 27,615 natural-language queries and answers grounded in those images. From GRES, we generate train, test, and validation splits of 7,205, 1,500, and 500 images, respectively. The GRES dataset can be downloaded here.
- Implementation details: LISAT and LISAT-Pre are trained on eight DGX A100 80GB GPUs. In the first stage, we pretrain LISAT-Pre (context length = 2048) using LoRA for 1 epoch on PreGRES with a next-token-prediction cross-entropy loss. We employ the AdamW optimizer with a learning rate of 3e-4 and a cosine-decay learning-rate scheduler, setting the batch size to 2 and gradient accumulation steps to 6.
In the second stage, we train LISAT on GRES, as well as two traditional natural-image referring-segmentation datasets, FP-Ref-COCO and ReasonSeg. LoRA is applied to LISAT-Pre, while the SAM decoder undergoes full fine-tuning. The learning rate is set to 3e-4, with all other configurations remaining the same. For the loss function, the text-generation loss weight (λ_txt) and mask loss weight (λ_mask) are both set to 1.0, while the binary cross-entropy (BCE) loss (λ_bce) and Dice loss (λ_dice) are assigned weights of 2.0 and 0.5, respectively; the combined objective is sketched after this list. The total training time was approximately 12 hours on eight DGX A100 80GB GPUs.
- License: cc-by-nc-sa-4.0
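Below is a minimal, illustrative sketch of the LISA-style design referenced above: the multimodal LLM answers the query while emitting a special <SEG> token, and that token's last hidden state is projected into a prompt embedding for the SAM mask decoder. All module names, dimensions, and the dummy decoder here are stand-ins for illustration, not the actual LISAT code.

```python
import torch
import torch.nn as nn

class LisaStyleHead(nn.Module):
    """Projects the <SEG> token's hidden state into a SAM prompt embedding."""
    def __init__(self, llm_dim=4096, sam_dim=256):
        super().__init__()
        self.proj = nn.Linear(llm_dim, sam_dim)

    def forward(self, llm_hidden, seg_positions, sam_image_embeds, sam_decoder):
        # llm_hidden: (B, T, llm_dim) last-layer hidden states from the MLLM
        # seg_positions: (B,) index of the <SEG> token in each sequence
        seg_states = llm_hidden[torch.arange(llm_hidden.size(0)), seg_positions]
        prompt = self.proj(seg_states)                 # (B, sam_dim)
        return sam_decoder(sam_image_embeds, prompt)

def dummy_sam_decoder(image_embeds, prompt):
    # Stand-in for SAM's transformer mask decoder: dot-product similarity
    # between the prompt embedding and each spatial feature gives mask logits.
    return torch.einsum("bchw,bc->bhw", image_embeds, prompt)

head = LisaStyleHead()
mask_logits = head(
    torch.randn(1, 32, 4096),        # fake MLLM hidden states
    torch.tensor([10]),              # position of <SEG> in the sequence
    torch.randn(1, 256, 64, 64),     # fake SAM image embeddings
    dummy_sam_decoder,
)
print(mask_logits.shape)             # torch.Size([1, 64, 64])
```

For the loss weights reported above, a sketch of how they combine in a LISA-style objective (the paper's exact formulation may differ):

$$
\mathcal{L} = \lambda_{\text{txt}}\,\mathcal{L}_{\text{txt}} + \lambda_{\text{mask}}\,\mathcal{L}_{\text{mask}},
\qquad
\mathcal{L}_{\text{mask}} = \lambda_{\text{bce}}\,\mathcal{L}_{\text{bce}} + \lambda_{\text{dice}}\,\mathcal{L}_{\text{dice}},
$$

with $\lambda_{\text{txt}} = \lambda_{\text{mask}} = 1.0$, $\lambda_{\text{bce}} = 2.0$, and $\lambda_{\text{dice}} = 0.5$.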
Model Release Date
LISAT-7B: TBD
Status:
This is a static model trained on a curated geospatial dataset. Future versions of the model will be released as we incorporate community feedback and improve model performance, especially with regard to safety and generalization in remote-sensing image tasks.
Comparative Performance of LISAT-7B on GRES
The following table shows a comparison of LISAT-7B against LISA-7B and LISA-13B-Llama2-v1 on the GRES dataset across different object sizes. LISAT-7B consistently outperforms the baseline models, particularly in the Small object category.
| Model | Object Size | cIoU | gIoU |
|---|---|---|---|
| LISA-7B | All | 0.122 ± 0.014 | 0.113 ± 0.007 |
| LISA-7B | Small | 0.104 ± 0.022 | 0.062 ± 0.008 |
| LISA-7B | Large | 0.157 ± 0.017 | 0.222 ± 0.013 |
| LISA-13B (Llama2) | All | 0.122 ± 0.014 | 0.139 ± 0.006 |
| LISA-13B (Llama2) | Small | 0.106 ± 0.016 | 0.089 ± 0.007 |
| LISA-13B (Llama2) | Large | 0.148 ± 0.018 | 0.244 ± 0.019 |
| LISAT (Ours) | All | **0.245 ± 0.023** | **0.275 ± 0.009** |
| LISAT (Ours) | Small | **0.232 ± 0.024** | **0.240 ± 0.009** |
| LISAT (Ours) | Large | **0.250 ± 0.029** | **0.348 ± 0.015** |
The bolded values represent the best results in each category.
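For reference, gIoU and cIoU follow the standard reasoning-segmentation definitions (as in LISA): gIoU averages per-image IoU across the dataset, while cIoU pools intersections and unions before dividing:

$$
\text{gIoU} = \frac{1}{N}\sum_{i=1}^{N}\frac{|P_i \cap G_i|}{|P_i \cup G_i|},
\qquad
\text{cIoU} = \frac{\sum_{i=1}^{N}|P_i \cap G_i|}{\sum_{i=1}^{N}|P_i \cup G_i|},
$$

where $P_i$ and $G_i$ denote the predicted and ground-truth masks for image $i$.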
Model Usage
Starting with `transformers` version >= 4.45.0, you can run conversational inference using the Transformers pipeline abstraction or by leveraging the Auto classes with the `generate()` function.
To use LISAT-7B with transformers, make sure to update your transformers installation to the latest version using:
```bash
pip install --upgrade transformers
```
Once your installation is updated, you can use LISAT-7B for inference as follows:
```python
# Note: the exact classes depend on how the checkpoint is packaged;
# see the model repository for the definitive loading code.
from PIL import Image
from transformers import AutoModelForImageSegmentation, AutoProcessor

# Load model and processor (the processor handles both image and text inputs)
model = AutoModelForImageSegmentation.from_pretrained("path/to/your/LISAt-7b")
processor = AutoProcessor.from_pretrained("path/to/your/LISAt-7b")

# Example usage for inference
image = Image.open("path/to/your/image.png")  # replace with your input image
query = "Segment the runway in this image."   # natural-language query

# Preprocess the image and tokenize the query together
inputs = processor(images=image, text=query, return_tensors="pt")

# Generate the segmentation-grounded text response
outputs = model.generate(**inputs)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
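As an alternative to the Auto classes, the pipeline abstraction mentioned above may offer a shorter path. This is a hypothetical sketch: it assumes the LISAt-7b checkpoint registers the built-in image-segmentation pipeline task, which you should verify against the model repository.

```python
from transformers import pipeline

# Assumption: the checkpoint supports the "image-segmentation" pipeline task.
segmenter = pipeline("image-segmentation", model="path/to/your/LISAt-7b")

# Each result is a dict with "label", "score", and a PIL "mask".
results = segmenter("path/to/your/image.png")
for result in results:
    print(result["label"], result["score"])
```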
Community
Generative AI safety requires ongoing expertise and collaboration, and we believe in the strength of the open community to accelerate progress in this area. We encourage contributions and collaboration within open consortiums focused on AI safety and responsible model deployment. We also engage with communities through relevant standards, including the MLCommons safety evaluation framework.
We invite the community to contribute to the ongoing development of LISAT-7B, and we provide an open-source toolset to help developers assess and deploy the model in a safe and responsible manner.
Ethical Considerations
This work presents advancements in reasoning segmentation for remote-sensing tasks. LISAT-7B can reason over arbitrary remote-sensing images and output both explanations and segmentation masks for objects of interest. Such workflows are common across many fields: for example, disaster-management personnel may want to know which roads leading to an airport are undamaged, and why. LISAT-7B is the first such model that can answer both components of such questions simultaneously.
Broadly, LISAT-7B has impacts in numerous domains, such as environmental monitoring, urban planning, and search and rescue. However, one of the biggest uses of satellite imagery is surveillance, and our work could also aid intelligence use cases, whether conducted by friendly governments or malicious actors. We offset this risk by basing GRES primarily on the xView series of datasets, which were explicitly created with AI-for-good purposes in mind; as a result, any intelligence use of LISAT-7B would require further dataset curation and training.
We encourage responsible deployment and continued discourse on the implications of geospatial AI in real-world applications.
Citation
If you use LISAt in your research or applications, please cite our paper:
```bibtex
@article{TBD,
  title={LISAt: Language-Instructed Segmentation Assistant for Satellite Imagery},
  author={Quenum, Jerome and Hsieh, Wen-Han and Wu, Tsung-Han and Gupta, Ritwik and Darrell, Trevor and Chan, David M},
  journal={TBD},
  year={2025},
  url={TBD}
}
```