---
language: en
tags:
- vision
license: apache-2.0
---

# Model Card for Mars ViT Base Model

## Model Architecture

- Architecture: Vision Transformer (ViT) Base
- Input Channels: 1 (grayscale images)
- Number of Classes: 0 (feature extraction only)

## Training Method

- Method: Masked Autoencoder (MAE)
- Dataset: 2 million CTX images

## Usage Examples

### Using timm (recommended)

First, download `checkpoint-1199.pth` (backbone only).

```python
import timm
import torch
from torchvision import transforms

model = timm.create_model(
    'vit_base_patch16_224',
    in_chans=1,
    num_classes=0,
    global_pool='',
    checkpoint_path="./checkpoint-1199.pth"  # must be a local path
)
model.eval()

# Images must be converted to a single grayscale channel, resized to
# 224x224, and normalized. Example transform:
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Grayscale(num_output_channels=1),
    transforms.Normalize(mean=[0.5], std=[0.5])
])

x = torch.randn(1, 1, 224, 224)
with torch.no_grad():
    features = model.forward_features(x)  # shape [1, tokens, embed_dim]
print(features.shape)

cls_token = features[:, 0]      # [CLS] token embedding
patch_tokens = features[:, 1:]  # per-patch embeddings
```

### Using transformers

```python
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

model = AutoModel.from_pretrained("jfang/mars-vit-base-ctx2m")
image_processor = AutoImageProcessor.from_pretrained("jfang/mars-vit-base-ctx2m")

# Example usage
image = Image.open("some_image.png").convert("L")  # single channel
inputs = image_processor(image, return_tensors="pt")
outputs = model(**inputs)
```

## MAE Reconstruction

The `./mae` folder contains the full encoder-decoder MAE model and a notebook for visualizing reconstructions.

## Limitations

The model is trained specifically on CTX images and may not generalize well to other types of imagery without further fine-tuning. It is designed for feature extraction and does not include a classification head.
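Since the released checkpoint ships without a classification head, downstream use means attaching your own. Below is a minimal linear-probe sketch on top of the frozen timm backbone; the 4-class task size is a hypothetical placeholder, and the checkpoint path is assumed to be local, as in the example above.

```python
import timm
import torch
import torch.nn as nn

# Load the backbone as a frozen feature extractor.
# global_pool='avg' makes forward() return mean-pooled 768-d features.
backbone = timm.create_model(
    'vit_base_patch16_224',
    in_chans=1,
    num_classes=0,
    global_pool='avg',
    checkpoint_path="./checkpoint-1199.pth"  # assumed local path
)
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# Hypothetical 4-class downstream task; only this head is trained.
head = nn.Linear(backbone.num_features, 4)

x = torch.randn(8, 1, 224, 224)  # batch of preprocessed grayscale images
with torch.no_grad():
    feats = backbone(x)          # [8, 768]
logits = head(feats)             # [8, 4]
```

For full fine-tuning, unfreeze the backbone and give its parameters a lower learning rate than the newly initialized head.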