# Legal-BERT (GDPR Pretrained Version)
This model is based on [nlpaueb/legal-bert-base-uncased](https://huggingface.co/nlpaueb/legal-bert-base-uncased) and has been further pretrained on the full text of the General Data Protection Regulation (GDPR) to adapt it to privacy-law and regulatory-compliance scenarios.
## 🧠 What’s New?
We adapted Legal-BERT through masked language modeling (MLM) on GDPR-specific language, enhancing the model’s understanding of:
- Personal data protection terms
- GDPR article structure
- Typical compliance language and risk descriptions
The training corpus is the official English text of the GDPR, split into clean sentences and formatted for MLM.
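The exact preprocessing script is not published; the snippet below is a minimal sketch of that preparation step, assuming the regulation has been saved to a local plain-text file (the file names and the short-sentence filter are illustrative):

```python
# Hypothetical corpus-preparation sketch; file names and filtering are assumptions.
import re
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")

with open("gdpr_full_text.txt", encoding="utf-8") as f:
    raw = f.read()

# Normalize whitespace, then split the regulation into individual sentences.
text = re.sub(r"\s+", " ", raw).strip()
sentences = [s for s in sent_tokenize(text) if len(s.split()) > 3]

# One sentence per line, ready for MLM training.
with open("gdpr_sentences.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences))
```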
## 🔧 Intended Use
This specialized model is best suited for:
- GDPR compliance assistance
- Legal document classification and clause matching
- Privacy policy analysis
- Regulatory question answering (when further fine-tuned)
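For the fine-tuned use cases, a task-specific head can be attached on top of this checkpoint. The sketch below is illustrative only: the label count and example clause are assumptions, and the classification head is randomly initialized until you fine-tune it on labeled data.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("JQ1984/legalbert_gdpr_pretrained")
# Attach a fresh classification head; num_labels=2 is a hypothetical label set.
model = AutoModelForSequenceClassification.from_pretrained(
    "JQ1984/legalbert_gdpr_pretrained", num_labels=2
)

inputs = tokenizer(
    "The processor shall notify the controller without undue delay "
    "after becoming aware of a personal data breach.",
    return_tensors="pt",
)
logits = model(**inputs).logits  # untrained head: fine-tune before relying on outputs
```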
## 💾 Training Details
- Base model: [nlpaueb/legal-bert-base-uncased](https://huggingface.co/nlpaueb/legal-bert-base-uncased)
- Task: Masked Language Modeling (MLM)
- Corpus: full official English text of the GDPR (about 10,000 sentences)
- Epochs: 3
- Block size: 128
- Batch size: 16
- MLM Probability: 15%
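The original training script is not included with the model; the following is a minimal sketch of how the settings above map onto the Hugging Face `Trainer` API, assuming a one-sentence-per-line file `gdpr_sentences.txt` (that file name and the output directory are illustrative):

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("nlpaueb/legal-bert-base-uncased")

# Assumed input: one GDPR sentence per line (see the corpus note above).
dataset = load_dataset("text", data_files={"train": "gdpr_sentences.txt"})

def tokenize(batch):
    # Truncate to the reported block size of 128 tokens.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Mask 15% of tokens, matching the reported MLM probability.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="legalbert_gdpr_pretrained",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

Trainer(
    model=model, args=args, train_dataset=tokenized, data_collator=collator
).train()
```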
## 🛠 How to Use
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("JQ1984/legalbert_gdpr_pretrained")
model = AutoModelForMaskedLM.from_pretrained("JQ1984/legalbert_gdpr_pretrained")

# Example: predict the token at the [MASK] position
inputs = tokenizer(
    "The data controller shall ensure that personal data is [MASK].",
    return_tensors="pt",
)
outputs = model(**inputs)

# Decode the highest-scoring prediction for the masked position
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(outputs.logits[0, mask_idx].argmax(-1)))
```
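For a quick check of the MLM head, the same checkpoint also works with the `fill-mask` pipeline (the example sentence is illustrative):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="JQ1984/legalbert_gdpr_pretrained")
for pred in fill("The data controller shall ensure that personal data is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```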