Legal-BERT (GDPR Pretrained Version)

This model is based on nlpaueb/legal-bert-base-uncased and has been further pretrained on the full text of the General Data Protection Regulation (GDPR), adapting it to privacy law and regulatory compliance scenarios.

🧠 What’s New?

We adapted Legal-BERT through masked language modeling (MLM) on GDPR-specific language, enhancing the model’s understanding of:

  • Personal data protection terms
  • GDPR article structure
  • Typical compliance language and risk descriptions

The training corpus consists of the official GDPR text, split into clean English sentences and formatted for MLM.

🔧 Intended Use

This specialized model is best suited for:

  • GDPR compliance assistance
  • Legal document classification and clause matching (see the fine-tuning sketch after this list)
  • Privacy policy analysis
  • Regulatory question answering (when further fine-tuned)
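
Note that this checkpoint only carries a masked-language-modeling head, so tasks such as clause classification or QA require adding a task-specific head and fine-tuning it. A minimal sketch with Transformers, assuming a hypothetical label set and your own labelled clauses:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical label set -- replace with the classes relevant to your task.
labels = ["compliant", "non_compliant"]

tokenizer = AutoTokenizer.from_pretrained("JQ1984/legalbert_gdpr_pretrained")
model = AutoModelForSequenceClassification.from_pretrained(
    "JQ1984/legalbert_gdpr_pretrained",
    num_labels=len(labels),  # attaches a freshly initialised classification head
)

clause = "Personal data shall be retained indefinitely for marketing purposes."
inputs = tokenizer(clause, truncation=True, return_tensors="pt")
logits = model(**inputs).logits  # meaningless until the head is fine-tuned on labelled data
```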

💾 Training Details

  • Base model: nlpaueb/legal-bert-base-uncased
  • Task: Masked Language Modeling (MLM)
  • Corpus: Full official GDPR English text (approximately 10,000 sentences)
  • Epochs: 3
  • Block size: 128
  • Batch size: 16
  • MLM Probability: 15%
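
The settings listed above can be approximated with the standard Transformers MLM recipe. The sketch below plugs in those hyperparameters; the corpus file (one GDPR sentence per line) and output directory are placeholders, and the older LineByLineTextDataset helper is used only for brevity:

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("nlpaueb/legal-bert-base-uncased")

# One GDPR sentence per line (placeholder path), chunked to the 128-token block size
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="gdpr_sentences.txt", block_size=128
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="legalbert_gdpr_pretrained",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()
```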

🛠 How to Use

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("JQ1984/legalbert_gdpr_pretrained")
model = AutoModelForMaskedLM.from_pretrained("JQ1984/legalbert_gdpr_pretrained")

# Example: fill in the [MASK] token in a GDPR-style sentence
text = "The data controller shall ensure that personal [MASK] is processed lawfully."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits

# Decode the model's top prediction for the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_index].argmax(dim=-1)))
```

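For quick checks, the same model can also be loaded through the fill-mask pipeline (the example sentence is only illustrative):

```python
from transformers import pipeline

# Wraps tokenization, the forward pass and decoding of the top [MASK] candidates
fill_mask = pipeline("fill-mask", model="JQ1984/legalbert_gdpr_pretrained")
print(fill_mask("Processing of personal data shall be [MASK] to what is necessary."))
```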
