CrabInHoney committed
Commit 25275d4 · verified · 1 Parent(s): da0dfd6

Upload README.md

  1. README.md +92 -3
README.md CHANGED
@@ -1,3 +1,92 @@
- ---
- license: apache-2.0
- ---
+ ---
+ datasets:
+ - ealvaradob/phishing-dataset
+ language:
+ - en
+ base_model:
+ - CrabInHoney/urlbert-tiny-base-v4
+ pipeline_tag: text-classification
+ tags:
+ - url
+ - urls
+ - links
+ - classification
+ - tiny
+ - phishing
+ - urlbert
+ license: apache-2.0
+ ---
+ This is a very small BERT model that classifies URLs as phishing or non-phishing.
+
+ It is an updated, lighter version of the previous URL classification model.
+
+ Previous version: https://huggingface.co/CrabInHoney/urlbert-tiny-v3-phishing-classifier
+ ##### Comparison with previous versions of the urlbert phishing-classifier:
+
+ | Version | Accuracy | Precision | Recall | F1-score |
+ | ------------ | ------------ | ------------ | ------------ | ------------ |
+ | v2 | 0.9665 | 0.9756 | 0.9522 | 0.9637 |
+ | v3 | 0.9819 | 0.9876 | 0.9734 | 0.9805 |
+ | **v4 (this model)** | **0.9907** | **0.9945** | **0.9855** | **0.9900** |
+
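As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, so each row's F1-score can be recomputed from its precision and recall columns. A minimal sketch in Python follows; the `f1_score` helper and the `reported` dictionary are illustrative names, not part of the model repository.

```python
# (precision, recall, F1) values copied from the comparison table.
reported = {
    "v2": (0.9756, 0.9522, 0.9637),
    "v3": (0.9876, 0.9734, 0.9805),
    "v4": (0.9945, 0.9855, 0.9900),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for version, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{version}: reported F1 = {f1:.4f}, recomputed F1 = {recomputed:.4f}")
```

Because the table values are rounded to four decimals, the recomputed figures can differ from the reported ones in the last digit.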
+ Model size: 3.69M params
+
+ Tensor type: F32
+
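The two figures above give a rough estimate of the memory taken by the weights alone, since each F32 parameter occupies 4 bytes. A minimal sketch of the arithmetic, assuming the rounded 3.69M parameter count and ignoring activations, tokenizer files, and framework overhead:

```python
PARAM_COUNT = 3_690_000   # 3.69M params as reported (a rounded figure)
BYTES_PER_PARAM = 4       # F32 is a 32-bit float, i.e. 4 bytes

# Weights only: no activations, optimizer state, or runtime overhead.
weight_bytes = PARAM_COUNT * BYTES_PER_PARAM
print(f"Approximate weight size: {weight_bytes / 1e6:.2f} MB "
      f"({weight_bytes / 2**20:.2f} MiB)")
```

This puts the weights at roughly 15 MB, which is why the model is practical to run on CPU.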
+ [Dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset "Dataset") (urls.json only)
+
+ Example:
+
+ from transformers import BertTokenizerFast, BertForSequenceClassification, pipeline
+ import torch
+
+ # Use the GPU when one is available, otherwise fall back to the CPU
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+ print(f"Using device: {device}")
+
+ model_name = "CrabInHoney/urlbert-tiny-v4-phishing-classifier"
+
+ tokenizer = BertTokenizerFast.from_pretrained(model_name)
+ model = BertForSequenceClassification.from_pretrained(model_name)
+ model.to(device)
+
+ classifier = pipeline(
+     "text-classification",
+     model=model,
+     tokenizer=tokenizer,
+     device=0 if torch.cuda.is_available() else -1,
+     # Return scores for both classes; newer transformers releases
+     # deprecate this flag in favor of top_k=None
+     return_all_scores=True
+ )
+
+ test_urls = [
+     "huggingface.co/",
+     "hu991ngface.com.ru/"
+ ]
+
+ # Map the raw model labels to readable class names
+ label_mapping = {"LABEL_0": "good", "LABEL_1": "phishing"}
+
+ for url in test_urls:
+     results = classifier(url)
+     print(f"\nURL: {url}")
+     for result in results[0]:
+         label = result['label']
+         score = result['score']
+         friendly_label = label_mapping.get(label, label)
+         print(f"Class: {friendly_label}, probability: {score:.4f}")
+
+ Output:
+
+ Using device: cuda
+
+ URL: huggingface.co/
+ Class: good, probability: 0.9710
+ Class: phishing, probability: 0.0290
+
+ URL: hu991ngface.com.ru/
+ Class: good, probability: 0.0013
+ Class: phishing, probability: 0.9987
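
For downstream use, the per-class scores printed above can be reduced to a single decision. The sketch below is one possible way to do that, assuming the classifier output for a URL is a list of `{'label', 'score'}` dictionaries as in the example, and that `LABEL_1` is the phishing class as in `label_mapping`. The `phishing_probability` and `is_phishing` helpers and the 0.5 threshold are illustrative assumptions, not part of the model repository.

```python
from typing import Dict, List, Union

Score = Dict[str, Union[str, float]]


def phishing_probability(scores: List[Score]) -> float:
    """Return the LABEL_1 (phishing) score from one URL's class scores."""
    for entry in scores:
        if entry["label"] == "LABEL_1":
            return float(entry["score"])
    raise ValueError("LABEL_1 not found in classifier output")


def is_phishing(scores: List[Score], threshold: float = 0.5) -> bool:
    """Flag a URL when its phishing score reaches the (hypothetical) threshold."""
    return phishing_probability(scores) >= threshold


# Scores copied from the example output above, written in pipeline-style form.
sample_outputs = {
    "huggingface.co/": [
        {"label": "LABEL_0", "score": 0.9710},
        {"label": "LABEL_1", "score": 0.0290},
    ],
    "hu991ngface.com.ru/": [
        {"label": "LABEL_0", "score": 0.0013},
        {"label": "LABEL_1", "score": 0.9987},
    ],
}

for url, scores in sample_outputs.items():
    verdict = "phishing" if is_phishing(scores) else "good"
    print(f"{url}: {verdict}")
```

With a live pipeline, `results[0]` from the example would be passed in place of the hard-coded scores. Raising the threshold trades recall for precision, so the right value depends on how costly a missed phishing link is in the target application.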