DistilBERT Spam Classifier
A fine-tuned DistilBERT-based model for phishing email detection, trained on the Phishing Emails Dataset. It classifies emails as spam/phishing (1) or non-spam (0) and reaches 97% accuracy on its test set of 3,021 samples.
Model Overview
- Base Model: DistilBERT
- Fine-Tuning: Performed on a phishing email dataset to classify emails as spam (1) or non-spam (0).
- Format: Available in ONNX format for efficient deployment.
Architecture
The model extends DistilBERT with a custom classification head:
import torch.nn as nn

class DistilBERTSpamClassifier(nn.Module):
    def __init__(self, distilbert):
        super().__init__()
        self.distilbert = distilbert          # pretrained DistilBERT encoder
        self.dropout = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(768, 512)        # 768 = DistilBERT hidden size
        self.fc2 = nn.Linear(512, 2)          # 2 classes: non-spam (0), spam (1)
        self.softmax = nn.LogSoftmax(dim=1)   # log-probabilities over the classes
- Input: Tokenized email text (processed via DistilBERT tokenizer).
- Output: Log-probabilities for two classes (spam or non-spam).
- Layers:
- DistilBERT for contextual embeddings (768 dimensions).
- Dropout (0.1) for regularization.
- Fully connected layers (768 → 512 → 2) with ReLU activation.
- LogSoftmax for classification.
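The class definition in this card lists the layers but not the forward pass. The sketch below shows one plausible way the layers could be wired together; it is an assumption, not the author's confirmed implementation. In particular, pooling on the first (`[CLS]`) token and the position of the dropout are guesses, and `StubEncoder` is a hypothetical stand-in for DistilBERT so the example runs without downloading weights.

```python
import torch
import torch.nn as nn


class DistilBERTSpamClassifier(nn.Module):
    """Layers as listed in the card, plus an ASSUMED forward pass."""

    def __init__(self, distilbert):
        super().__init__()
        self.distilbert = distilbert
        self.dropout = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(768, 512)
        self.fc2 = nn.Linear(512, 2)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input_ids, attention_mask):
        # Assumption: the encoder output's first element is the hidden states,
        # shaped (batch, seq_len, 768), as with Hugging Face's DistilBertModel.
        hidden = self.distilbert(input_ids=input_ids, attention_mask=attention_mask)[0]
        pooled = hidden[:, 0]  # assumption: use the first ([CLS]) token, shape (batch, 768)
        x = self.dropout(self.relu(self.fc1(pooled)))
        return self.softmax(self.fc2(x))  # (batch, 2) log-probabilities


class StubEncoder(nn.Module):
    """Hypothetical stand-in for DistilBERT: embeds token ids into 768 dimensions."""

    def __init__(self, vocab_size=100, hidden_size=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        return (self.embed(input_ids),)


model = DistilBERTSpamClassifier(StubEncoder()).eval()  # eval() disables dropout
input_ids = torch.randint(0, 100, (3, 16))              # 3 fake emails, 16 tokens each
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    log_probs = model(input_ids, attention_mask)

probs = log_probs.exp()  # exponentiating log-probabilities gives probabilities
print(log_probs.shape, probs.sum(dim=1))
```

Because the head ends in `LogSoftmax`, each output row holds log-probabilities, and exponentiating a row gives two class probabilities that sum to 1.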
Performance
Evaluated on a test set of 3,021 samples (1,870 non-spam and 1,151 spam), the model achieves the following per-class metrics:
Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
Non-Spam (0) | 0.98 | 0.98 | 0.98 | 1,870 |
Spam (1) | 0.96 | 0.97 | 0.96 | 1,151 |
- Accuracy: 97%
- Macro Avg: Precision: 0.97, Recall: 0.97, F1-Score: 0.97
- Weighted Avg: Precision: 0.97, Recall: 0.97, F1-Score: 0.97
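As a rough consistency check, the summary figures can be recomputed from the per-class rows of the table. The sketch below uses only the rounded values shown in this card, so its results should match the reported 0.97 figures only to within rounding; the variable names are illustrative.

```python
# Per-class figures copied from the table: (precision, recall, support).
report = {
    "non_spam": (0.98, 0.98, 1870),
    "spam": (0.96, 0.97, 1151),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total = sum(support for _, _, support in report.values())

per_class_f1 = {name: f1(p, r) for name, (p, r, _) in report.items()}

# Macro average: every class counts equally.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(per_class_f1[name] * s for name, (_, _, s) in report.items()) / total

# Recall is the fraction of each class labelled correctly, so the
# support-weighted recall equals overall accuracy.
accuracy = sum(r * s for _, r, s in report.values()) / total

print(total, round(macro_f1, 3), round(weighted_f1, 3), round(accuracy, 3))
```

The macro average treats both classes equally, while the weighted average leans toward the larger non-spam class. Here the two are close because the per-class scores are similar.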
Usage
- Install Dependencies:
pip install transformers onnxruntime torch
- Load the Model: Use the ONNX model with a compatible inference engine (e.g., ONNX Runtime). Example:
from transformers import DistilBertTokenizer
import onnxruntime as ort
import numpy as np
# Load tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
# Load ONNX model
session = ort.InferenceSession("path_to_model.onnx")
# Tokenize input
text = "Your example email text here"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)
# Run inference
outputs = session.run(None, dict(inputs))[0]
prediction = int(np.argmax(outputs, axis=1)[0])  # class index for the single input email
print("Spam" if prediction == 1 else "Non-Spam")
- Input Requirements:
- Text input must be tokenized using the DistilBERT tokenizer.
- Maximum sequence length: 512 tokens.
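The Architecture section describes the model output as log-probabilities for the two classes, so a spam probability can be recovered from one output row. The sketch below is a minimal illustration using only the standard library; `to_probabilities`, `classify`, and the 0.5 default threshold are hypothetical names and choices, not part of the published model. It applies a softmax rather than a plain exponential: on log-probabilities the two give the same result, and the softmax still yields valid probabilities if an exported graph happened to emit raw logits instead.

```python
import math

LABELS = {0: "Non-Spam", 1: "Spam"}


def to_probabilities(scores):
    """Numerically stable softmax over one row of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, threshold=0.5):
    """Return (label, spam_probability) for one output row of length 2."""
    spam_prob = to_probabilities(scores)[1]
    label = LABELS[1] if spam_prob >= threshold else LABELS[0]
    return label, spam_prob


# Made-up row: log-probabilities for a 20% / 80% split.
example_row = [math.log(0.2), math.log(0.8)]
label, spam_prob = classify(example_row)
print(label, round(spam_prob, 3))
```

With the `outputs` array from the usage example, the call would be `classify(outputs[0])`. Raising `threshold` trades recall for precision on the spam class.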
Dataset
The model was fine-tuned on the Phishing Emails Dataset, which contains labeled email samples for spam and phishing detection.
Limitations
- Only available in ONNX format; no PyTorch or TensorFlow checkpoints.
- Maximum input length is 512 tokens; longer emails are truncated.
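One common workaround for the 512-token limit is to split a long email into overlapping windows, score each window separately, and combine the results (for example, flag the email if any window is classified as spam). This approach is not part of the model card and is sketched here only as an illustration. `sliding_windows` and its defaults are hypothetical; the default window of 510 assumes two positions are reserved for the `[CLS]` and `[SEP]` tokens that the tokenizer adds.

```python
def sliding_windows(token_ids, window=510, overlap=64):
    """Split a list of token ids into overlapping windows of at most `window` ids."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    if len(token_ids) <= window:
        return [list(token_ids)]

    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the email
        start += step
    return windows


# Fake token ids standing in for a 1,000-token email.
ids = list(range(1000))
chunks = sliding_windows(ids)
print([len(c) for c in chunks])
```

Each window would still need the special tokens added and an attention mask built before being passed to the model. The overlap reduces the chance that a phishing phrase is cut in half at a window boundary.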
Model tree for masterburator3301/distilbert-spam-phishing-classification-onnx
- Base model: distilbert/distilbert-base-uncased