Twitter bitcoin related spam detection

This model aims to classify tweets related to bitcoin or crypto topics as "human", "spam", or "bot".

The reason for publishing this model, when plenty of spam detectors are already available, is that bitcoin-related tweets are often classified as "spam" outright, because the topic is usually associated with phishing sites or scams. Training the model on a bitcoin-related dataset removes this bias, making it possible to work with bitcoin-related tweets.

The model is a fine-tuned version of vinai/bertweet-base (a RoBERTa-based model pre-trained on 850M English Tweets), trained on a bitcoin-related dataset to classify tweets as "human", "spam", or "bot".

Example of classification

BERTweet was trained on normalized tweets such as tweet = "DHEC confirms HTTPURL via @USER :crying_face:", so for better results it is recommended to normalize the texts before applying the spam detection. Normalization converts user mentions and web/URL links into the special tokens @USER and HTTPURL, respectively, and applies a few other preprocessing modifications. To do so, copy or download the TweetNormalizer provided at the BERTweet webpage:

!git clone https://github.com/VinAIResearch/BERTweet.git
!pip install emoji
from transformers import pipeline
import sys

# Optional, but recommended to improve accuracy: make the BERTweet normalizer importable
sys.path.append('/content/BERTweet') # Or whichever directory the repository was cloned into
from TweetNormalizer import normalizeTweet

classifier = pipeline("text-classification",
                      model="sandiumenge/twitter-bitcoin-spam-detection",
                      tokenizer="sandiumenge/twitter-bitcoin-spam-detection",
                      truncation=True,
                      padding=True,
                      max_length=128
)

tweet = "I'm winning iPhone XS,BTC,ETH and other Awards. Join with us!@freecoinhunt https://t.co/VIUwLmdy4n"
normalized_tweet = normalizeTweet(tweet)
# "I 'm winning iPhone XS , BTC , ETH and other Awards . Join with us ! @USER HTTPURL"

print(classifier(normalized_tweet))

>> [{'label': 'spam', 'score': 0.9803344011306763}]
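
If cloning the BERTweet repository is not an option, the two substitutions described above (mentions to @USER, links to HTTPURL) can be roughly approximated with regular expressions. The sketch below is not the official TweetNormalizer: the name `simple_normalize` and both patterns are assumptions made for illustration, and it skips the emoji conversion and tokenization that the official script performs, so results may differ slightly.

```python
import re

# Hypothetical, simplified stand-in for BERTweet's normalizeTweet.
# Assumption: only user mentions and links are handled here; the official
# script also converts emojis to text and re-tokenizes punctuation.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\w+")


def simple_normalize(tweet: str) -> str:
    """Replace links with HTTPURL and user mentions with @USER."""
    text = URL_RE.sub(" HTTPURL ", tweet)
    text = MENTION_RE.sub(" @USER ", text)
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(text.split())


example = "Join with us!@freecoinhunt https://t.co/VIUwLmdy4n"
print(simple_normalize(example))
```

The returned string can be passed to the classifier in place of normalized_tweet; using the official normalizer is still preferable, since it matches the preprocessing BERTweet was trained with.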

The model achieves the following results on the evaluation set:

  • Loss: 0.4793
  • Accuracy: 0.8755
  • F1: 0.8767
  • Precision: 0.8792
  • Recall: 0.8755
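
The reported recall equals the accuracy, which is what support-weighted averaging over the classes always produces, so the precision, recall, and F1 above are plausibly weighted averages. This is an inference and is not stated by the model card. The sketch below shows how such weighted scores can be computed in plain Python; the function name `weighted_scores` and the toy labels are hypothetical and are not the model's evaluation data.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    precision = recall = f1 = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / actual if actual else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = actual / n  # each class is weighted by its share of true labels
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only (assumption, not real evaluation data).
y_true = ["human", "human", "spam", "spam", "bot", "bot"]
y_pred = ["human", "spam", "spam", "spam", "bot", "human"]
scores = weighted_scores(y_true, y_pred)
print(scores)
```

With this weighting, recall and accuracy coincide by construction, whereas precision and F1 can differ from them, as they do in the figures above.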
