Note:
This model is a copy of DNABERT-2 with the Triton-based FlashAttention integration removed. This allows the model to be loaded from HuggingFace without having to uninstall Triton. Running the example code below yields output identical to the original version.
import torch
from transformers import AutoTokenizer, AutoModel
from transformers.models.bert.configuration_bert import BertConfig

device = torch.device("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "quietflamingo/dnabert2-fixed",
    trust_remote_code=True,
)
config = BertConfig.from_pretrained(
    "quietflamingo/dnabert2-fixed",
)
model = AutoModel.from_pretrained(
    "quietflamingo/dnabert2-fixed",
    trust_remote_code=True,
    config=config,
)

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
inputs = inputs.to(device)
model = model.to(device)

hidden_states = model(inputs)[0]  # [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(torch.mean(embedding_mean))  # Outputs 0.0045, matches DNABERT-2

# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(torch.mean(embedding_max))  # Outputs 0.2840, matches DNABERT-2
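To embed several sequences in one batch, the same tokenizer and model can be used with padding and an attention mask. The sketch below is not from the original model card; it assumes the tokenizer defines a pad token and that the model's forward accepts attention_mask as a standard BERT does, and it averages only over non-padded positions.

# Minimal batched-embedding sketch (assumptions noted above); reuses the
# tokenizer, model, and device from the example.
sequences = [
    "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC",
    "GATTACAGATTACAGATTACA",
]
batch = tokenizer(sequences, return_tensors='pt', padding=True)
batch = {k: v.to(device) for k, v in batch.items()}

with torch.no_grad():
    batch_hidden = model(batch["input_ids"], attention_mask=batch["attention_mask"])[0]  # [2, max_len, 768]

# Mask-aware mean pooling: exclude padded positions from the average
mask = batch["attention_mask"].unsqueeze(-1).float()                    # [2, max_len, 1]
batch_embeddings = (batch_hidden * mask).sum(dim=1) / mask.sum(dim=1)   # [2, 768]
print(batch_embeddings.shape)  # expect torch.Size([2, 768])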
If you use this model, please give full attribution to the original authors below: https://huggingface.co/zhihan1996/DNABERT-2-117M
@misc{zhou2023dnabert2,
title={DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
author={Zhihan Zhou and Yanrong Ji and Weijian Li and Pratik Dutta and Ramana Davuluri and Han Liu},
year={2023},
eprint={2306.15006},
archivePrefix={arXiv},
primaryClass={q-bio.GN}
}
Original README:
"""
This is the official pre-trained model introduced in DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome.
We sincerely appreciate the MosaicML team for the MosaicBERT implementation, which serves as the base of DNABERT-2 development.
DNABERT-2 is a transformer-based genome foundation model trained on multi-species genomes.
To load the model from HuggingFace:
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
To calculate the embedding of a DNA sequence:
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors = 'pt')["input_ids"]
hidden_states = model(inputs)[0] # [1, sequence_length, 768]
# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape) # expect to be 768
# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape) # expect to be 768
"""