yamatazen's Collections

LLM censorship

updated 8 days ago

  • GuardReasoner: Towards Reasoning-based LLM Safeguards

    Paper • 2501.18492 • Published Jan 30 • 87

  • Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

    Paper • 2412.19512 • Published Dec 27, 2024 • 8

  • Course-Correction: Safety Alignment Using Synthetic Preferences

    Paper • 2407.16637 • Published Jul 23, 2024 • 27

  • Refusal in Language Models Is Mediated by a Single Direction

    Paper • 2406.11717 • Published Jun 17, 2024 • 3

  • GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning

    Paper • 2505.11049 • Published 18 days ago • 59

  • Lifelong Safety Alignment for Language Models

    Paper • 2505.20259 • Published 8 days ago • 23