Mejurix Medical-Legal Embedding Model

This model is a fine-tuned Transformer (BERT-based) that generates high-quality embeddings for documents in medical and legal domains, with a focus on capturing the semantic relationships between medical and legal concepts. The model leverages NER (Named Entity Recognition) to better understand domain-specific entities and their relationships.

Model Description

Model Architecture

  • Base Architecture: BERT (Bidirectional Encoder Representations from Transformers)
  • Base Model: medicalai/ClinicalBERT
  • Modifications:
    • Custom embedding projection layer (768 → 256 dimensions)
    • NER-enhanced attention mechanism
    • Domain-specific fine-tuning

Key Features

  • Domain-Specific Embeddings: Optimized for medical and legal text analysis
  • NER-Enhanced Understanding: Utilizes named entity recognition to improve context awareness
  • Reduced Dimensionality: 256-dimensional embeddings balance expressiveness and efficiency
  • Cross-Domain Connections: Effectively captures relationships between medical findings and legal implications
  • Transformer-Based: Leverages bidirectional attention mechanisms for better context understanding

Performance Comparison

Our model outperforms comparable domain-specific models:

| Model          | Avg. Similarity | # Params | Notes                            |
|----------------|-----------------|----------|----------------------------------|
| Mejurix (ours) | 0.9859          | 110M     | Medical-legal + NER fine-tuning  |
| ClinicalBERT   | 0.9719          | 110M     | No NER, no fine-tuning           |
| BioBERT        | 0.9640          | 110M     | Medical domain only              |
| LegalBERT      | 0.9508          | 110M     | Legal domain only                |

The Mejurix model matches or outperforms the baselines across all relationship types, with the largest gains in cross-domain relationships between medical and legal concepts.

Detailed Relationship-Type Comparison

Our model's similarity scores match or exceed those of the other domain-specific models across all relationship types:

| Relationship Type               | Mejurix | ClinicalBERT | BioBERT | LegalBERT |
|---------------------------------|---------|--------------|---------|-----------|
| DISEASE_MEDICATION              | 0.9966  | 0.9921       | 0.9841  | 0.8514    |
| SEVERITY_PROGNOSIS              | 1.0000  | 1.0000       | 1.0000  | 0.8381    |
| SEVERITY_COMPENSATION           | 0.9997  | 0.9606       | 0.9713  | 0.8348    |
| DISEASE_TREATMENT               | 0.9980  | 0.9778       | 0.9645  | 0.8359    |
| DIAGNOSIS_TREATMENT             | 0.9995  | 0.9710       | 0.9703  | 0.8222    |
| LEGAL_SIMILAR_MEDICAL_DIFFERENT | 0.9899  | 0.9699       | 0.9792  | 0.8236    |
| TREATMENT_OUTCOME               | 0.9941  | 0.9668       | 0.9745  | 0.8103    |
| OUTCOME_SETTLEMENT              | 0.9847  | 0.9631       | 0.9534  | 0.7951    |
| MEDICAL_SIMILAR_LEGAL_DIFFERENT | 0.9936  | 0.9434       | 0.9414  | 0.7812    |
| SYMPTOM_DISEASE                 | 0.9934  | 0.9690       | 0.9766  | 0.8500    |

The Mejurix model particularly excels in cross-domain relationships such as MEDICAL_SIMILAR_LEGAL_DIFFERENT (0.9936) and SEVERITY_COMPENSATION (0.9997), showing significant improvement over other models in these complex relationship types.


How to Use This Model

This model is directly available on the Hugging Face Hub and can be used with the Transformers library for feature extraction, sentence embeddings, and similarity calculations.

Basic Usage with Transformers

import torch
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model_name = "mejurix/medical-legal-embedder"  # The model's actual path on Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Generate embeddings for a single text
text = "The patient was diagnosed with L3 vertebral fracture, and a compensation claim is in progress."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token embedding for sentence representation
embeddings = outputs.last_hidden_state[:, 0, :]  # [CLS] token
print(f"Embedding shape: {embeddings.shape}")  # Should be [1, 256]

Using the Model for Similarity Calculation

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model_name = "mejurix/medical-legal-embedder"  # The model's actual path on Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :]  # [CLS] token embedding

def compute_similarity(text1, text2):
    emb1 = get_embedding(text1)
    emb2 = get_embedding(text2)
    return F.cosine_similarity(emb1, emb2).item()

# Example
text1 = "Diagnosed with L3 spinal fracture."
text2 = "Compensation is needed for lumbar injury."
similarity = compute_similarity(text1, text2)
print(f"Similarity: {similarity:.4f}")

Using with Hugging Face Pipelines

from transformers import pipeline

# Create a feature-extraction pipeline
extractor = pipeline(
    "feature-extraction",
    model="mejurix/medical-legal-embedder",  # The model's actual path on Hugging Face Hub
    tokenizer="mejurix/medical-legal-embedder"
)

# Extract features
text = "The patient requires physical therapy following spinal surgery."
features = extractor(text)

# The output is a nested list with shape [1, sequence_length, hidden_size]
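
To reduce the pipeline output to a single sentence vector, convert the nested list to an array and pool it; taking index 0 along the sequence axis corresponds to the [CLS] token used in the earlier examples:

import numpy as np

# Pool the pipeline's nested-list output into one vector per sentence.
token_embeddings = np.array(features[0])        # [sequence_length, hidden_size]
cls_embedding = token_embeddings[0]             # [CLS] token vector
mean_embedding = token_embeddings.mean(axis=0)  # mean-pooled alternative
print(cls_embedding.shape, mean_embedding.shape)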

Batch Processing

import torch
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("mejurix/medical-legal-embedder")
model = AutoModel.from_pretrained("mejurix/medical-legal-embedder")

# Prepare batch of texts
texts = [
    "The patient was diagnosed with L3 vertebral fracture",
    "Neck pain persisted after the accident",
    "Clinical test results were within normal range",
    "Compensation claim filed for permanent disability"
]

# Tokenize and get embeddings in a single pass
inputs = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Get CLS token embeddings for each text in the batch
embeddings = outputs.last_hidden_state[:, 0, :]
print(f"Batch embeddings shape: {embeddings.shape}")  # Should be [4, 256]

Intended Uses & Limitations

Intended Uses

  • Medical-legal document similarity analysis
  • Case relevance assessment
  • Document clustering and organization
  • Information retrieval in medical and legal domains
  • Cross-referencing medical records with legal precedents
  • Zero-shot text classification with custom categories (see the sketch below)
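
As a sketch of the zero-shot use case: embed a short description of each candidate category and assign a text to the nearest one. This reuses the get_embedding helper from the similarity example above; the category labels here are purely illustrative:

import torch.nn.functional as F

# Zero-shot classification sketch: pick the category whose description's
# embedding is closest to the text's embedding. Labels are illustrative.
labels = ["medical diagnosis", "legal compensation claim", "treatment plan"]
text = "The plaintiff seeks damages for a work-related spinal injury."

text_emb = get_embedding(text)
scores = {label: F.cosine_similarity(text_emb, get_embedding(label)).item()
          for label in labels}
print(max(scores, key=scores.get))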

Limitations

  • Limited understanding of negations (current similarity: 0.7791)
  • Temporal context differentiation needs improvement
  • May not fully distinguish severity levels in medical conditions
  • Maximum context length of 512 tokens (inherited from the BERT architecture; see the chunking sketch below)
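
One common workaround for the token limit is to split long documents into overlapping chunks, embed each chunk, and average the results. The sketch below assumes the tokenizer and get_embedding helper from the examples above; it is not a feature of the model itself, and re-decoding chunks can shift token boundaries slightly:

import torch

# Sketch: embed a long document by averaging overlapping chunk embeddings.
# chunk_tokens stays below get_embedding's max_length of 128 so chunks
# are not silently truncated.
def embed_long_text(text, chunk_tokens=120, stride=20):
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    step = chunk_tokens - stride
    chunks = [tokenizer.decode(token_ids[i:i + chunk_tokens])
              for i in range(0, max(len(token_ids), 1), step)]
    chunk_embs = torch.cat([get_embedding(chunk) for chunk in chunks], dim=0)
    return chunk_embs.mean(dim=0, keepdim=True)  # single [1, dim] vector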

Training and Evaluation

Training

The model was fine-tuned on a specialized dataset containing medical-legal document pairs with various relationship types (disease-treatment, severity-compensation, etc.). Training employed triplet loss with hard negative mining; an illustrative sketch of the objective follows the configuration below.

Training Configuration:

  • Base model: medicalai/ClinicalBERT
  • Embedding dimension reduction: 768 → 256
  • Dropout: 0.5
  • Learning rate: 1e-5
  • Batch size: 16
  • Weight decay: 0.1
  • Triplet margin: 2.0
  • Epochs: 15
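
The objective can be illustrated with PyTorch's built-in TripletMarginLoss and the margin configured above. This is a sketch of the loss computation only, not the actual training script; the random tensors stand in for 256-dimensional embeddings of an anchor text, a related text, and a mined hard negative:

import torch
import torch.nn as nn

# Illustrative triplet-loss step with the configured margin of 2.0.
triplet_loss = nn.TripletMarginLoss(margin=2.0)
anchor = torch.randn(16, 256)    # batch of anchor embeddings
positive = torch.randn(16, 256)  # embeddings of related texts
negative = torch.randn(16, 256)  # hard-negative embeddings
loss = triplet_loss(anchor, positive, negative)
print(loss.item())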

Performance Observations

Strengths

  1. Medical-Legal Cross-Concept Connection: Effectively connects medical assessments with legal compensation concepts (0.8348)
  2. Medical Terminology Recognition: Recognizes equivalent medical expressions across different terminologies (0.8414)
  3. Causality Understanding: Accurately identifies cause-effect relationships (0.8236)
  4. Transformer Attention: The bidirectional attention mechanism captures contextual relationships effectively

Areas for Improvement

  1. Detailed Medical Terminology Differentiation: Needs better recognition of severity differences
  2. Temporal Context Understanding: Temporal differences in medical conditions need better differentiation
  3. Negation Handling: Improved handling of negations needed
  4. Longer Context Windows: Future versions could benefit from architectures with longer context windows

Ethical Considerations

This model should be used as a tool to assist professionals, not as a replacement for medical or legal expertise. Decisions affecting patient care or legal outcomes should not be based solely on this model's output.

Citation

If you use this model in your research, please cite:

@software{mejurix_medicallegal_embedder,
  author = {Mejurix},
  title = {Mejurix Medical-Legal Embedding Model},
  year = {2025},
  version = {0.1.0},
  url = {https://huggingface.co/mejurix/medical-legal-embedder}
}

License

This project is distributed under the MIT License.


