Twitter bitcoin related spam detection
This model aims to classify tweets related to bitcoin or crypto topics as "human", "spam", or "bot".
Many spam classifiers have already been published, but they often label bitcoin related tweets as "spam" outright, because the topic is commonly associated with phishing sites and scams. Training this model on a bitcoin related dataset removes that bias and makes it possible to work with bitcoin related tweets.
The model is a fine-tuned version of vinai/bertweet-base (a RoBERTa-based model pre-trained on 850M English tweets), trained for spam classification on a bitcoin related dataset.
Example of classification
BERTweet was trained on normalized tweets such as:
tweet = "DHEC confirms HTTPURL via @USER :crying_face:"
so for better results it is recommended to normalize the text before applying the spam detection.
Normalization converts user mentions and web/URL links into the special tokens @USER and HTTPURL, respectively, and applies other preprocessing modifications.
To do so, copy or download the TweetNormalizer provided at the BERTweet webpage:
!git clone https://github.com/VinAIResearch/BERTweet.git
!pip install emoji
from transformers import pipeline
import sys
# (Optional to improve accuracy)
sys.path.append('/content/BERTweet') # Or whatever directory the folder got downloaded in
from TweetNormalizer import normalizeTweet
classifier = pipeline(
    "text-classification",
    model="sandiumenge/twitter-bitcoin-spam-detection",
    tokenizer="sandiumenge/twitter-bitcoin-spam-detection",
    truncation=True,
    padding=True,
    max_length=128,
)
tweet = "I'm winning iPhone XS,BTC,ETH and other Awards. Join with us!@freecoinhunt https://t.co/VIUwLmdy4n"
normalized_tweet = normalizeTweet(tweet)
# "I 'm winning iPhone XS , BTC , ETH and other Awards . Join with us ! @USER HTTPURL"
print(classifier(normalized_tweet))
# Output: [{'label': 'spam', 'score': 0.9803344011306763}]
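When cloning the BERTweet repository is not practical, the mention and URL replacement can be roughly approximated with two regular expressions. The sketch below is an assumption-laden stand-in, not the BERTweet normalizer: `simple_normalize`, `MENTION_RE` and `URL_RE` are hypothetical names, and the tokenization and emoji conversion performed by `normalizeTweet` are not reproduced, so results may differ from those obtained with the official script.

```python
import re

# Hypothetical, simplified stand-in for BERTweet's normalizeTweet.
# It only replaces user mentions and URLs; it does NOT tokenize the text
# or convert emojis, so it is not equivalent to the official normalizer.
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def simple_normalize(tweet: str) -> str:
    """Replace URLs with HTTPURL and user mentions with @USER."""
    # URLs first, so an '@' inside a link is not mistaken for a mention.
    text = URL_RE.sub(" HTTPURL ", tweet)
    text = MENTION_RE.sub(" @USER ", text)
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(text.split())


tweet = "I'm winning iPhone XS,BTC,ETH and other Awards. Join with us!@freecoinhunt https://t.co/VIUwLmdy4n"
print(simple_normalize(tweet))
# I'm winning iPhone XS,BTC,ETH and other Awards. Join with us! @USER HTTPURL
```

The string returned by `simple_normalize` can be passed to the classifier in place of the output of `normalizeTweet`, at the cost of a looser match with the text format BERTweet was trained on.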
It achieves the following results on the evaluation set:
- Loss: 0.4793
- Accuracy: 0.8755
- F1: 0.8767
- Precision: 0.8792
- Recall: 0.8755
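The reported recall equals the reported accuracy (0.8755). That is what support-weighted averaging over the classes produces, although the card does not state which averaging was used, so treat this as an assumption. The sketch below uses made-up labels (not the evaluation data) and the hypothetical helpers `accuracy` and `weighted_recall` to show why the two numbers coincide under that assumption.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall, averaged with weights equal to class support."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    return sum((support[c] / total) * (hits[c] / support[c]) for c in support)


# Illustrative labels only; these are NOT taken from the model's evaluation set.
y_true = ["human", "spam", "bot", "spam", "human", "bot", "spam", "human"]
y_pred = ["human", "spam", "spam", "spam", "bot", "bot", "spam", "human"]

print(f"accuracy:        {accuracy(y_true, y_pred):.4f}")         # 0.7500
print(f"weighted recall: {weighted_recall(y_true, y_pred):.4f}")  # 0.7500
```

Each class weight cancels that class's support in the recall denominator, so the weighted sum reduces to correct predictions divided by total examples, which is the accuracy.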