Mejurix Medical-Legal Embedding Model
This model is a fine-tuned Transformer (BERT-based) that generates high-quality embeddings for documents in medical and legal domains, with a focus on capturing the semantic relationships between medical and legal concepts. The model leverages NER (Named Entity Recognition) to better understand domain-specific entities and their relationships.
Model Description
Model Architecture
- Base Architecture: BERT (Bidirectional Encoder Representations from Transformers)
- Base Model: medicalai/ClinicalBERT
- Modifications:
- Custom embedding projection layer (768 → 256 dimensions; see the sketch below)
- NER-enhanced attention mechanism
- Domain-specific fine-tuning
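For intuition, the dimensionality reduction can be pictured as a learned projection head on top of the encoder. A minimal sketch assuming a plain linear layer; the released model's actual head (activations, normalization, NER conditioning) may differ:

```python
import torch
import torch.nn as nn

# Hypothetical projection head: maps 768-dim encoder outputs to 256 dims.
projection = nn.Linear(768, 256)

hidden = torch.randn(1, 768)    # stand-in for a BERT [CLS] vector
embedding = projection(hidden)  # [1, 256]
print(embedding.shape)
```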
Key Features
- Domain-Specific Embeddings: Optimized for medical and legal text analysis
- NER-Enhanced Understanding: Utilizes named entity recognition to improve context awareness
- Reduced Dimensionality: 256-dimensional embeddings balance expressiveness and efficiency
- Cross-Domain Connections: Effectively captures relationships between medical findings and legal implications
- Transformer-Based: Leverages bidirectional attention mechanisms for better context understanding
Performance Comparison
Our model outperforms comparable domain-specific models:
| Model | Avg. Similarity | # Params | Notes |
|---|---|---|---|
| Mejurix (ours) | 0.9859 | 110M | Medical-legal + NER fine-tuning |
| ClinicalBERT | 0.9719 | 110M | No NER, no fine-tuning |
| BioBERT | 0.9640 | 110M | Medical domain only |
| LegalBERT | 0.9508 | 110M | Legal domain only |
The Mejurix model shows superior performance across all relationship types, particularly in cross-domain relationships between medical and legal concepts.
Detailed Relationship-Type Comparison
Our model demonstrates consistently higher similarity scores across all relationship types compared to other domain-specific models:
| Relationship Type | Mejurix | ClinicalBERT | BioBERT | LegalBERT |
|---|---|---|---|---|
| DISEASE_MEDICATION | 0.9966 | 0.9921 | 0.9841 | 0.8514 |
| SEVERITY_PROGNOSIS | 1.0000 | 1.0000 | 1.0000 | 0.8381 |
| SEVERITY_COMPENSATION | 0.9997 | 0.9606 | 0.9713 | 0.8348 |
| DISEASE_TREATMENT | 0.9980 | 0.9778 | 0.9645 | 0.8359 |
| DIAGNOSIS_TREATMENT | 0.9995 | 0.9710 | 0.9703 | 0.8222 |
| LEGAL_SIMILAR_MEDICAL_DIFFERENT | 0.9899 | 0.9699 | 0.9792 | 0.8236 |
| TREATMENT_OUTCOME | 0.9941 | 0.9668 | 0.9745 | 0.8103 |
| OUTCOME_SETTLEMENT | 0.9847 | 0.9631 | 0.9534 | 0.7951 |
| MEDICAL_SIMILAR_LEGAL_DIFFERENT | 0.9936 | 0.9434 | 0.9414 | 0.7812 |
| SYMPTOM_DISEASE | 0.9934 | 0.9690 | 0.9766 | 0.8500 |
The Mejurix model particularly excels in cross-domain relationships such as MEDICAL_SIMILAR_LEGAL_DIFFERENT (0.9936) and SEVERITY_COMPENSATION (0.9997), showing significant improvement over other models in these complex relationship types.
How to Use This Model
This model is directly available on the Hugging Face Hub and can be used with the Transformers library for feature extraction, sentence embeddings, and similarity calculations.
Basic Usage with Transformers
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model_name = "mejurix/medical-legal-embedder"  # The model's path on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Generate embeddings for a single text
text = "The patient was diagnosed with L3 vertebral fracture, and a compensation claim is in progress."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token embedding as the sentence representation
embeddings = outputs.last_hidden_state[:, 0, :]  # [CLS] token
print(f"Embedding shape: {embeddings.shape}")  # Should be [1, 256]
```
Using the Model for Similarity Calculation
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model_name = "mejurix/medical-legal-embedder"  # The model's path on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :]  # [CLS] token embedding

def compute_similarity(text1, text2):
    emb1 = get_embedding(text1)
    emb2 = get_embedding(text2)
    return F.cosine_similarity(emb1, emb2).item()

# Example
text1 = "Diagnosed with L3 spinal fracture."
text2 = "Compensation is needed for lumbar injury."
similarity = compute_similarity(text1, text2)
print(f"Similarity: {similarity:.4f}")
```
Using with Hugging Face Pipelines
```python
from transformers import pipeline

# Create a feature-extraction pipeline
extractor = pipeline(
    "feature-extraction",
    model="mejurix/medical-legal-embedder",  # The model's path on the Hugging Face Hub
    tokenizer="mejurix/medical-legal-embedder",
)

# Extract features
text = "The patient requires physical therapy following spinal surgery."
features = extractor(text)
# The output is a nested list with shape [1, sequence_length, hidden_size]
```
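To collapse the per-token features into one fixed-size vector, you can average over the token axis. A small sketch assuming the `features` variable from the call above:

```python
import numpy as np

# Average the per-token vectors into a single sentence vector.
token_vectors = np.array(features[0])         # [sequence_length, hidden_size]
sentence_vector = token_vectors.mean(axis=0)  # [hidden_size]
print(sentence_vector.shape)
```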
Batch Processing
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("mejurix/medical-legal-embedder")
model = AutoModel.from_pretrained("mejurix/medical-legal-embedder")

# Prepare a batch of texts
texts = [
    "The patient was diagnosed with L3 vertebral fracture",
    "Neck pain persisted after the accident",
    "Clinical test results were within normal range",
    "Compensation claim filed for permanent disability",
]

# Tokenize and get embeddings in a single pass
inputs = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Get the [CLS] token embedding for each text in the batch
embeddings = outputs.last_hidden_state[:, 0, :]
print(f"Batch embeddings shape: {embeddings.shape}")  # Should be [4, 256]
```
Intended Uses & Limitations
Intended Uses
- Medical-legal document similarity analysis
- Case relevance assessment
- Document clustering and organization
- Information retrieval in medical and legal domains
- Cross-referencing medical records with legal precedents
- Zero-shot text classification with custom categories (see the sketch after this list)
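One simple embedding-based approach to the zero-shot classification use case above: embed a short description of each candidate category, then assign a text to the category whose description is most similar. A sketch reusing the `get_embedding` helper from the similarity example; the category names and descriptions here are illustrative only:

```python
import torch.nn.functional as F

# Illustrative categories; replace the descriptions with your own.
categories = {
    "medical_finding": "A clinical diagnosis or medical examination result.",
    "legal_claim": "A legal claim, compensation demand, or court filing.",
}

def classify(text):
    # Nearest category by cosine similarity between embeddings.
    text_emb = get_embedding(text)  # helper defined in the similarity example
    scores = {
        label: F.cosine_similarity(text_emb, get_embedding(desc)).item()
        for label, desc in categories.items()
    }
    return max(scores, key=scores.get)

print(classify("The plaintiff seeks damages for a lumbar spine injury."))
```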
Limitations
- Limited understanding of negations (current similarity: 0.7791)
- Temporal context differentiation needs improvement
- May not fully distinguish severity levels in medical conditions
- Maximum context length of 512 tokens (inherited from the BERT architecture); a chunking workaround for longer documents is sketched below
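For documents longer than the context window, a common workaround is to embed overlapping chunks and average the results. A sketch assuming the `tokenizer` and `model` objects from the usage examples, and assuming that averaging chunk embeddings is acceptable for your task:

```python
import torch

def embed_long_text(text, max_length=128, stride=32):
    # Sliding-window tokenization: each overflowing window becomes one batch row.
    inputs = tokenizer(
        text,
        max_length=max_length,
        stride=stride,
        truncation=True,
        padding=True,
        return_overflowing_tokens=True,
        return_tensors="pt",
    )
    inputs.pop("overflow_to_sample_mapping", None)  # tokenizer metadata, not a model input
    with torch.no_grad():
        outputs = model(**inputs)
    chunk_embeddings = outputs.last_hidden_state[:, 0, :]  # [n_chunks, dim]
    return chunk_embeddings.mean(dim=0, keepdim=True)      # [1, dim]
```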
Training and Evaluation
Training
The model was fine-tuned on a specialized dataset of medical-legal document pairs covering various relationship types (disease-treatment, severity-compensation, etc.). Training employed triplet loss with hard negative mining; an illustrative sketch of the objective follows the configuration below.
Training Configuration:
- Base model: medicalai/ClinicalBERT
- Embedding dimension reduction: 768 → 256
- Dropout: 0.5
- Learning rate: 1e-5
- Batch size: 16
- Weight decay: 0.1
- Triplet margin: 2.0
- Epochs: 15
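For reference, the triplet objective with the stated margin can be written with PyTorch's built-in loss. This is an illustrative sketch only; the actual training script, data loading, and hard-negative mining logic are not reproduced here:

```python
import torch
import torch.nn as nn

# Triplet loss with the margin stated above (2.0). The tensors are random
# stand-ins for embedded anchor/positive/negative texts.
triplet_loss = nn.TripletMarginLoss(margin=2.0)

anchor   = torch.randn(16, 256)  # batch size 16, embedding dim 256
positive = torch.randn(16, 256)  # text related to the anchor
negative = torch.randn(16, 256)  # hard negative mined from the corpus
print(triplet_loss(anchor, positive, negative).item())
```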
Performance Observations
Strengths
- Medical-Legal Cross-Concept Connection: Effectively connects medical assessments with legal compensation concepts (0.8348)
- Medical Terminology Recognition: Recognizes equivalent medical expressions across different terminologies (0.8414)
- Causality Understanding: Accurately identifies cause-effect relationships (0.8236)
- Transformer Attention: The bidirectional attention mechanism captures contextual relationships effectively
Areas for Improvement
- Detailed Medical Terminology Differentiation: Better recognition of differences in severity is needed
- Temporal Context Understanding: Temporal differences between medical conditions need sharper differentiation
- Negation Handling: Handling of negated statements needs improvement
- Longer Context Windows: Future versions could benefit from models with longer context lengths
Ethical Considerations
This model should be used as a tool to assist professionals, not as a replacement for medical or legal expertise. Decisions affecting patient care or legal outcomes should not be based solely on this model's output.
Citation
If you use this model in your research, please cite:
```bibtex
@software{mejurix_medicallegal_embedder,
  author  = {Mejurix},
  title   = {Mejurix Medical-Legal Embedding Model},
  year    = {2025},
  version = {0.1.0},
  url     = {https://huggingface.co/mejurix/medical-legal-embedder}
}
```
License
This project is distributed under the MIT License.