DistilBERT Spam Classifier

A fine-tuned DistilBERT-based model for spam and phishing email detection, trained on the Phishing Emails Dataset. It reaches 97% accuracy on a held-out test set of 3,021 emails (see Performance).

Model Overview

Base Model: DistilBERT
Fine-Tuning: Performed on a phishing email dataset to classify emails as spam (1) or non-spam (0).
Format: Available in ONNX format for efficient deployment.

Architecture

The model extends DistilBERT with a custom classification head:

    import torch.nn as nn


    class DistilBERTSpamClassifier(nn.Module):
        def __init__(self, distilbert):
            super().__init__()
            self.distilbert = distilbert         # pretrained DistilBERT encoder
            self.dropout = nn.Dropout(0.1)       # regularization
            self.relu = nn.ReLU()
            self.fc1 = nn.Linear(768, 512)       # hidden size 768 -> 512
            self.fc2 = nn.Linear(512, 2)         # 512 -> 2 classes
            self.softmax = nn.LogSoftmax(dim=1)  # log-probabilities over the classes

Input: Tokenized email text (processed via DistilBERT tokenizer).
Output: Log-probabilities for two classes (spam or non-spam).
Layers:
DistilBERT for contextual embeddings (768 dimensions).
Dropout (0.1) for regularization.
Fully connected layers (768 → 512 → 2) with ReLU activation.
LogSoftmax for classification.
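
The class definition above does not include a forward pass. The following is a minimal sketch of how the listed layers could be chained. It assumes the head reads the hidden state of the first ([CLS]) token and applies dropout after the ReLU; neither choice is confirmed by this model card. `StubEncoder` and `SpamHeadSketch` are hypothetical names, and the stub returns random hidden states so the sketch runs without downloading weights.

```python
import torch
import torch.nn as nn


class StubEncoder(nn.Module):
    """Hypothetical stand-in for DistilBERT: returns random hidden states."""

    def forward(self, input_ids, attention_mask=None):
        batch, seq_len = input_ids.shape
        # A real DistilBERT encoder returns an output whose first element is
        # the last hidden state with shape (batch, seq_len, 768).
        return (torch.randn(batch, seq_len, 768),)


class SpamHeadSketch(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(768, 512)
        self.fc2 = nn.Linear(512, 2)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask)[0]
        cls = hidden[:, 0]                           # assumption: pool the first token
        x = self.dropout(self.relu(self.fc1(cls)))   # assumption: dropout after ReLU
        return self.softmax(self.fc2(x))             # log-probabilities, shape (batch, 2)


model = SpamHeadSketch(StubEncoder()).eval()
with torch.no_grad():
    log_probs = model(torch.zeros(3, 16, dtype=torch.long))
```

Swapping `StubEncoder()` for a loaded DistilBERT encoder should work the same way, provided its first output is the last hidden state.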

Performance

Evaluated on a test set of 3,021 samples, the model achieves the following results:

Class	Precision	Recall	F1-Score	Support
Non-Spam (0)	0.98	0.98	0.98	1,870
Spam (1)	0.96	0.97	0.96	1,151

Accuracy: 97%
Macro Avg: Precision: 0.97, Recall: 0.97, F1-Score: 0.97
Weighted Avg: Precision: 0.97, Recall: 0.97, F1-Score: 0.97
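
To make the two averages concrete, the sketch below recomputes them from the per-class rows of the table. Because the table values are already rounded to two decimals, the results agree with the reported averages only to within about 0.01. The names `rows`, `macro_avg`, and `weighted_avg` are illustrative.

```python
# Per-class metrics copied from the table: (precision, recall, f1, support)
rows = {
    "non_spam": (0.98, 0.98, 0.98, 1870),
    "spam": (0.96, 0.97, 0.96, 1151),
}

total_support = sum(r[3] for r in rows.values())


def macro_avg(i):
    # Unweighted mean over classes
    return sum(r[i] for r in rows.values()) / len(rows)


def weighted_avg(i):
    # Mean over classes, weighted by each class's support
    return sum(r[i] * r[3] for r in rows.values()) / total_support


for name, i in (("precision", 0), ("recall", 1), ("f1", 2)):
    print(f"{name}: macro={macro_avg(i):.3f} weighted={weighted_avg(i):.3f}")
```

The macro average treats both classes equally, while the weighted average leans toward the larger non-spam class.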

Usage

Install Dependencies:

pip install transformers onnxruntime torch

Load the Model: Use the ONNX model with a compatible inference engine (e.g., ONNX Runtime). Example:

from transformers import DistilBertTokenizer
import onnxruntime as ort
import numpy as np

# Load tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Load ONNX model
session = ort.InferenceSession("path_to_model.onnx")

# Tokenize input
text = "Your example email text here"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)

# Run inference
outputs = session.run(None, dict(inputs))[0]
prediction = int(np.argmax(outputs, axis=1)[0])  # class index for the single input
print("Spam" if prediction == 1 else "Non-Spam")
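
Because the model emits log-probabilities, exponentiating a row recovers class probabilities, which allows a decision threshold other than the argmax. This is a minimal sketch, assuming the output columns are ordered `[non-spam, spam]` as the label mapping suggests. `spam_probability` and the 0.9 threshold are illustrative and not part of the model.

```python
import math


def spam_probability(log_probs):
    """Convert one row of log-probabilities [non_spam, spam] to P(spam)."""
    probs = [math.exp(v) for v in log_probs]
    # Renormalize to guard against small numerical drift
    return probs[1] / sum(probs)


# Example row equal to [log(0.2), log(0.8)]
row = [math.log(0.2), math.log(0.8)]
p_spam = spam_probability(row)
is_spam = p_spam >= 0.9  # stricter than argmax, which would label this row as spam
```

With the ONNX output from the example above, a row would be passed as `spam_probability(outputs[0].tolist())`.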

Input Requirements:

Text input must be tokenized using the DistilBERT tokenizer.
Maximum sequence length: 512 tokens.

Dataset

The model was fine-tuned on the Phishing Emails Dataset, which contains labeled email samples for spam and phishing detection.

Limitations

Only available in ONNX format; no PyTorch or TensorFlow checkpoints.
Maximum input length is 512 tokens; longer emails are truncated.
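
One possible workaround for truncation is to split a long email into overlapping windows, classify each window, and flag the email if any window is flagged. The sketch below covers only the windowing step and is not something this model card describes. `chunk_token_ids`, the stride value, and the any-window rule are all assumptions. It reserves two positions per window for the [CLS] and [SEP] tokens that the DistilBERT tokenizer adds.

```python
def chunk_token_ids(token_ids, max_length=512, stride=128):
    """Split token ids (without special tokens) into overlapping windows.

    Each window holds at most max_length - 2 ids, leaving room for the
    [CLS] and [SEP] tokens. Consecutive windows overlap by `stride` ids.
    """
    window = max_length - 2
    if stride >= window:
        raise ValueError("stride must be smaller than the window size")
    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
        start += step
    return chunks


# 1,200 dummy ids stand in for a long tokenized email
chunks = chunk_token_ids(list(range(1200)))
```

Each chunk would then need the special tokens and an attention mask added before being passed to the ONNX session.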

Repository: masterburator3301/distilbert-spam-phishing-classification-onnx