---
language: en
tags:
- vision
license: apache-2.0
---

# Model Card for Mars ViT Base Model

## Model Architecture

- Architecture: Vision Transformer (ViT) Base
- Input Channels: 1 (grayscale images)
- Number of Classes: 0 (feature extraction only)

## Training Method

- Method: Masked Autoencoder (MAE)
- Dataset: 2 million CTX images

## Usage Examples

### Using timm (recommended)

First, download `checkpoint-1199.pth` (backbone only).

```python
import timm
import torch
from torchvision import transforms

model = timm.create_model(
    'vit_base_patch16_224',
    in_chans=1,
    num_classes=0,
    global_pool='',
    checkpoint_path="./checkpoint-1199.pth"  # must be a local path
)
model.eval()

# Images must be converted to a single grayscale channel, resized to
# 224x224, and normalized. Example transform:
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Grayscale(num_output_channels=1),
    transforms.Normalize(mean=[0.5], std=[0.5])
])

x = torch.randn(1, 1, 224, 224)
with torch.no_grad():
    features = model.forward_features(x)  # shape [1, tokens, embed_dim]
print(features.shape)

cls_token = features[:, 0]      # [CLS] token embedding
patch_tokens = features[:, 1:]  # per-patch embeddings
```

### Using transformers

```python
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

model = AutoModel.from_pretrained("jfang/mars-vit-base-ctx2m")
image_processor = AutoImageProcessor.from_pretrained("jfang/mars-vit-base-ctx2m")

# Example usage
image = Image.open("some_image.png").convert("L")  # single channel
inputs = image_processor(image, return_tensors="pt")
outputs = model(**inputs)
```

## MAE Reconstruction

The `./mae` folder contains the full encoder-decoder MAE model and a notebook for visualizing reconstructions.

## Limitations

The model is trained specifically on CTX images and may not generalize well to other types of imagery without further fine-tuning. It is designed for feature extraction and does not include a classification head.
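Since the released checkpoint ships without a classification head, downstream use means attaching your own. Below is a minimal linear-probe sketch on top of the frozen timm backbone; the 4-class task size is a hypothetical placeholder, and the checkpoint path is assumed to be local, as in the example above.

```python
import timm
import torch
import torch.nn as nn

# Load the backbone as a frozen feature extractor.
# global_pool='avg' makes forward() return mean-pooled 768-d features.
backbone = timm.create_model(
    'vit_base_patch16_224',
    in_chans=1,
    num_classes=0,
    global_pool='avg',
    checkpoint_path="./checkpoint-1199.pth"  # assumed local path
)
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# Hypothetical 4-class downstream task; only this head is trained.
head = nn.Linear(backbone.num_features, 4)

x = torch.randn(8, 1, 224, 224)  # batch of preprocessed grayscale images
with torch.no_grad():
    feats = backbone(x)          # [8, 768]
logits = head(feats)             # [8, 4]
```

For full fine-tuning, unfreeze the backbone and give its parameters a lower learning rate than the newly initialized head.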