Legal-BERT (GDPR Pretrained Version)

This model is based on nlpaueb/legal-bert-base-uncased and has been further pretrained on the full text of the General Data Protection Regulation (GDPR), adapting it to privacy law and regulatory compliance scenarios.

🧠 What’s New?

We adapted Legal-BERT through masked language modeling (MLM) on GDPR-specific language, enhancing the model’s understanding of:

  • Personal data protection terms
  • GDPR article structure
  • Typical compliance language and risk descriptions

The training corpus consists of the official GDPR text, split into clean English sentences and formatted for MLM.

🔧 Intended Use

This specialized model is best suited for:

  • GDPR compliance assistance
  • Legal document classification and clause matching (see the fine-tuning sketch after this list)
  • Privacy policy analysis
  • Regulatory question answering (when further fine-tuned)
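
Note that this checkpoint only carries a masked-language-modeling head, so tasks such as clause classification or QA require adding a task-specific head and fine-tuning it. A minimal sketch with Transformers, assuming a hypothetical label set and your own labelled clauses:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical label set -- replace with the classes relevant to your task.
labels = ["compliant", "non_compliant"]

tokenizer = AutoTokenizer.from_pretrained("JQ1984/legalbert_gdpr_pretrained")
model = AutoModelForSequenceClassification.from_pretrained(
    "JQ1984/legalbert_gdpr_pretrained",
    num_labels=len(labels),  # attaches a freshly initialised classification head
)

clause = "Personal data shall be retained indefinitely for marketing purposes."
inputs = tokenizer(clause, truncation=True, return_tensors="pt")
logits = model(**inputs).logits  # meaningless until the head is fine-tuned on labelled data
```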

💾 Training Details

  • Base model: nlpaueb/legal-bert-base-uncased
  • Task: Masked Language Modeling (MLM)
  • Corpus: Full official GDPR English text (approximately 10,000 sentences)
  • Epochs: 3
  • Block size: 128
  • Batch size: 16
  • MLM Probability: 15%
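
The settings listed above can be approximated with the standard Transformers MLM recipe. The sketch below plugs in those hyperparameters; the corpus file (one GDPR sentence per line) and output directory are placeholders, and the older LineByLineTextDataset helper is used only for brevity:

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("nlpaueb/legal-bert-base-uncased")

# One GDPR sentence per line (placeholder path), chunked to the 128-token block size
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="gdpr_sentences.txt", block_size=128
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="legalbert_gdpr_pretrained",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()
```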

🛠 How to Use

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("JQ1984/legalbert_gdpr_pretrained")
model = AutoModelForMaskedLM.from_pretrained("JQ1984/legalbert_gdpr_pretrained")

# Example: fill in the [MASK] token in a GDPR-style sentence
text = "The data controller shall ensure that personal [MASK] is processed lawfully."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits

# Decode the model's top prediction for the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_index].argmax(dim=-1)))
```

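For quick checks, the same model can also be loaded through the fill-mask pipeline (the example sentence is only illustrative):

```python
from transformers import pipeline

# Wraps tokenization, the forward pass and decoding of the top [MASK] candidates
fill_mask = pipeline("fill-mask", model="JQ1984/legalbert_gdpr_pretrained")
print(fill_mask("Processing of personal data shall be [MASK] to what is necessary."))
```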
