DistilBERT Spam Classifier

A DistilBERT-based model fine-tuned on the Phishing Emails Dataset to detect spam and phishing emails. It reaches 97% accuracy on a test set of 3,021 samples.

Model Overview

  • Base Model: DistilBERT
  • Fine-Tuning: Performed on a phishing email dataset to classify emails as spam (1) or non-spam (0).
  • Format: Available in ONNX format for efficient deployment.

Architecture

The model extends DistilBERT with a custom classification head:

import torch.nn as nn

class DistilBERTSpamClassifier(nn.Module):
    def __init__(self, distilbert):
        super().__init__()
        self.distilbert = distilbert          # pretrained DistilBERT backbone
        self.dropout = nn.Dropout(0.1)        # regularization
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(768, 512)        # hidden size 768 -> 512
        self.fc2 = nn.Linear(512, 2)          # 512 -> 2 classes
        self.softmax = nn.LogSoftmax(dim=1)   # log-probabilities over the class dimension

  • Input: Tokenized email text (processed via the DistilBERT tokenizer).
  • Output: Log-probabilities for two classes (spam or non-spam).
  • Layers:
      • DistilBERT for contextual embeddings (768 dimensions).
      • Dropout (0.1) for regularization.
      • Fully connected layers (768 → 512 → 2) with ReLU activation.
      • LogSoftmax for classification.
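
As a quick check on the size of the custom head, the parameter count of the two fully connected layers follows directly from the dimensions listed above. This is plain arithmetic rather than anything read from the released model. It assumes both layers keep their bias terms (the `nn.Linear` default), and the helper name `linear_params` is made up for illustration.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer (bias assumed enabled)."""
    return in_features * out_features + out_features

fc1 = linear_params(768, 512)  # 393,216 weights + 512 biases
fc2 = linear_params(512, 2)    # 1,024 weights + 2 biases
head_total = fc1 + fc2

print(f"fc1: {fc1:,}  fc2: {fc2:,}  head total: {head_total:,}")
# prints: fc1: 393,728  fc2: 1,026  head total: 394,754
```

Dropout, ReLU, and LogSoftmax have no trainable parameters, so under these assumptions the head adds roughly 0.39M parameters on top of the DistilBERT backbone.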

Performance

Evaluated on a test set of 3,021 samples, the model achieves the following per-class results:

Class          Precision   Recall   F1-Score   Support
Non-Spam (0)   0.98        0.98     0.98       1,870
Spam (1)       0.96        0.97     0.96       1,151

  • Accuracy: 97%
  • Macro Avg: Precision: 0.97, Recall: 0.97, F1-Score: 0.97
  • Weighted Avg: Precision: 0.97, Recall: 0.97, F1-Score: 0.97
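
To show how the table's columns relate, the sketch below recomputes the F1 scores and one support-weighted average from the per-class figures. It starts from the rounded two-decimal values above, not from the raw predictions, so recomputed averages can differ in the last digit from the reported ones. The dictionary layout and function name are illustrative choices.

```python
# Per-class figures copied from the table above (already rounded to two decimals).
report = {
    "Non-Spam (0)": {"precision": 0.98, "recall": 0.98, "support": 1870},
    "Spam (1)": {"precision": 0.96, "recall": 0.97, "support": 1151},
}

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Total number of test samples across both classes.
total = sum(row["support"] for row in report.values())

# F1 per class, derived from that class's precision and recall.
f1 = {name: f1_score(r["precision"], r["recall"]) for name, r in report.items()}

# Weighted average: each class contributes in proportion to its support.
weighted_precision = sum(r["precision"] * r["support"] for r in report.values()) / total

for name, value in f1.items():
    print(f"{name}: F1 = {value:.2f}")
print(f"total support = {total}, weighted precision = {weighted_precision:.2f}")
```

The supports sum to the stated 3,021 samples, and spam makes up about 38% of the test set, which is why the macro and weighted averages land so close together.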

Usage

  1. Install Dependencies:
pip install transformers onnxruntime torch
  2. Load the Model: Run the ONNX model with a compatible inference engine such as ONNX Runtime. Example:
from transformers import DistilBertTokenizer
import onnxruntime as ort
import numpy as np

# Load tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Load ONNX model
session = ort.InferenceSession("path_to_model.onnx")

# Tokenize input
text = "Your example email text here"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)

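# Run inference, feeding only the inputs the exported graph declares
input_names = {i.name for i in session.get_inputs()}
feed = {name: array for name, array in inputs.items() if name in input_names}
outputs = session.run(None, feed)[0]

# outputs has shape (batch, 2); take the class index for the single email
prediction = int(np.argmax(outputs, axis=1)[0])
print("Spam" if prediction == 1 else "Non-Spam")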
  3. Input Requirements:
  • Text input must be tokenized using the DistilBERT tokenizer.
  • Maximum sequence length: 512 tokens.
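
The usage example takes the argmax of the output. Since the Architecture section describes the output as log-probabilities, exponentiating them gives ordinary probabilities, which allows a decision threshold other than 0.5. The following is a minimal sketch in pure Python. It assumes the exported ONNX graph includes the LogSoftmax layer; the function names, the example scores, and the `spam_threshold` parameter are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def to_probabilities(log_probs):
    """Exponentiate log-probabilities to get ordinary probabilities."""
    return [math.exp(lp) for lp in log_probs]

def classify(log_probs, spam_threshold=0.5):
    """Return (label, spam_probability); index 1 is spam, index 0 is non-spam."""
    p_spam = to_probabilities(log_probs)[1]
    label = "Spam" if p_spam >= spam_threshold else "Non-Spam"
    return label, p_spam

# Hypothetical raw scores for one email, standing in for real model output.
example_log_probs = log_softmax([-1.2, 2.3])
label, p_spam = classify(example_log_probs)
print(label, round(p_spam, 3))
```

Raising `spam_threshold` trades recall for precision on the spam class. Any threshold other than 0.5 would need to be validated on held-out data, since the reported metrics correspond to a plain argmax.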

Dataset

The model was fine-tuned on the Phishing Emails Dataset, which contains labeled email samples for spam and phishing detection.

Limitations

  • Only available in ONNX format; no PyTorch or TensorFlow checkpoints.
  • Maximum input length is 512 tokens; longer emails are truncated.
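
One common workaround for the 512-token limit is to split a long email into overlapping windows, score each window separately, and combine the results. The sketch below shows only the windowing and one possible combination rule. It is not part of the released model and has not been evaluated against it; the function names, the 510-token window (leaving room for the [CLS] and [SEP] tokens), the stride, and the "any window is spam" policy are all assumptions.

```python
def sliding_windows(token_ids, max_tokens=510, stride=255):
    """Split token ids into overlapping windows of at most max_tokens each."""
    if not 0 < stride <= max_tokens:
        raise ValueError("stride must be between 1 and max_tokens")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_tokens])
        # Stop once the current window reaches the end of the sequence.
        if start + max_tokens >= len(token_ids):
            break
        start += stride
    return windows

def email_is_spam(window_spam_probs, threshold=0.5):
    """Flag the email if any window looks like spam (one possible policy)."""
    return max(window_spam_probs) >= threshold

# Stand-in token ids for a 1,200-token email; real ids would come from the tokenizer.
ids = list(range(1200))
chunks = sliding_windows(ids)
print(len(chunks), [len(c) for c in chunks])
# prints: 4 [510, 510, 510, 435]
```

Each window would still need the special tokens added before being passed to the model. Taking the maximum favors recall; averaging the per-window probabilities is a more conservative alternative.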