Twitter bitcoin related spam detection

This model aims to classify tweets related to bitcoin or crypto topics as "human", "spam", or "bot".

The reason for publishing this model when plenty of spam classifiers already exist is that bitcoin-related tweets are often classified as "spam" outright, because the topic is commonly associated with phishing sites and scams. Training on a bitcoin-related dataset removes this bias and makes it possible to work with bitcoin-related tweets.

The model is a fine-tuned version of vinai/bertweet-base (a RoBERTa-based model pre-trained on 850M English tweets), trained for spam classification on a bitcoin-related dataset.

Example of classification

BERTweet was trained on normalized tweets such as: tweet = "DHEC confirms HTTPURL via @USER :crying_face:", so for better results it is recommended to normalize the text before applying the spam detection. Normalization converts user mentions and web/URL links into the special tokens @USER and HTTPURL, respectively, and applies a few other preprocessing changes. To do so, copy or download the TweetNormalizer provided in the BERTweet repository:

!git clone https://github.com/VinAIResearch/BERTweet.git
!pip install emoji nltk

from transformers import pipeline
import sys

# (Optional to improve accuracy)
sys.path.append('/content/BERTweet') # Or whatever directory the folder got downloaded in
from TweetNormalizer import normalizeTweet

classifier = pipeline("text-classification",
                      model="sandiumenge/twitter-bitcoin-spam-detection",
                      tokenizer="sandiumenge/twitter-bitcoin-spam-detection",
                      truncation=True,
                      padding=True,
                      max_length=128
)

tweet = "I'm winning iPhone XS，BTC，ETH and other Awards. Join with us!@freecoinhunt https://t.co/VIUwLmdy4n"
normalized_tweet = normalizeTweet(tweet)
# "I 'm winning iPhone XS ， BTC ， ETH and other Awards . Join with us ! @USER HTTPURL"

print(classifier(normalized_tweet))

>> [{'label': 'spam', 'score': 0.9803344011306763}]
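
If cloning the BERTweet repository is not an option, the mention and URL replacement described above can be roughly approximated with two regular expressions. This is only a sketch: `simple_normalize` is a hypothetical helper, not part of BERTweet, and it skips the tokenization and emoji handling that the real `normalizeTweet` performs, so results may differ slightly.

```python
import re

# Hypothetical helper (not from BERTweet): approximates only the
# mention/URL replacement step of the real TweetNormalizer.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
MENTION_RE = re.compile(r"(?<!\w)@\w+")  # lookbehind skips e-mail addresses


def simple_normalize(tweet: str) -> str:
    # Replace URLs first so an "@" inside a link is not treated as a mention.
    text = URL_RE.sub(" HTTPURL ", tweet)
    text = MENTION_RE.sub(" @USER ", text)
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(text.split())


print(simple_normalize("Join with us!@freecoinhunt https://t.co/VIUwLmdy4n"))
# Join with us! @USER HTTPURL
```

The output can be passed to the classifier in place of `normalized_tweet`.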

It achieves the following results on the evaluation set:

Loss: 0.4793
Accuracy: 0.8755
F1: 0.8767
Precision: 0.8792
Recall: 0.8755
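
The card does not say how the per-class scores were averaged. Recall equals accuracy (0.8755), which is what support-weighted averaging always produces, so weighted averaging is a plausible reading, but that is an assumption. The sketch below shows how such weighted scores could be computed; `weighted_scores` is a hypothetical helper and the labels are toy values invented for illustration, not the model's evaluation data.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1, plus plain accuracy."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / actual if actual else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = actual / n  # each class counts in proportion to its support
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels invented for illustration only.
y_true = ["human", "human", "spam", "spam", "bot", "bot"]
y_pred = ["human", "spam", "spam", "spam", "bot", "human"]
scores = weighted_scores(y_true, y_pred)
print(scores)
```

With weighted averaging, the recall value always matches accuracy, while precision and F1 can differ from it.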
