BERTopic Model for Serverless Inference

A BERTopic model for multilingual topic modeling, specifically tailored for tourism feedback analysis. This model is serialized in safetensors format for optimized loading and is designed for serverless inference in cloud environments. It is a key component of our thesis project, "Enhancing Tourist Destination Management through a Multilingual Web-Based Tourist Survey System with Machine Learning."

Overview

This model leverages BERTopic to extract meaningful topics from a multilingual dataset of tourist reviews. It supports eight languages, making it a versatile tool for understanding diverse customer feedback. It is optimized for serverless deployment, for example on AWS Lambda or Cloud Functions, or behind a FastAPI service.

Thesis Context:
As part of the thesis project, this model works in tandem with a DistilBERT-based sentiment analyzer to improve tourism feedback collection. The combined approach addresses the language barriers and inefficiencies of traditional survey methods and supports data-driven decision-making in tourism management.

Key Features

Multilingual Support:
Supports 8 languages: English, Spanish, French, Chinese, Japanese, German, Korean, and Tagalog.
Pre-trained & Fine-tuned:
Trained on 160k synthetic and real tourist reviews, capturing both emotional tone and topical diversity.
Optimized Serialization:
Uses safetensors for faster and safer model loading.
Serverless Inference Ready:
Tailored for deployment on serverless platforms such as AWS Lambda and Cloud Functions, or behind a FastAPI service.

Model Architecture & Details

Architecture: BERTopic
Embedding Model: paraphrase-multilingual-MiniLM-L12-v2
Dimensionality Reduction: UMAP
Clustering Algorithm: HDBSCAN
Vectorizer: CountVectorizer, followed by BERTopic's class-based TF-IDF (c-TF-IDF) weighting
Dataset: 160k synthetic and real tourist reviews categorized by emotional tone and topics

Model Performance Metrics

Topic Coherence Score: XX.XX (placeholder)
Diversity Score: XX.XX (placeholder)
Sentiment Analysis Accuracy: ≥ 70% (as part of the complementary system)

How to Use

Loading the Model

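from bertopic import BERTopic

# Load the BERTopic model. For a model saved with safetensors serialization,
# BERTopic.load expects the saved model directory, not a single
# .safetensors file.
# If the embedding model was not saved alongside the topic model, pass it
# explicitly via the embedding_model argument of BERTopic.load.
model = BERTopic.load("path/to/model_dir")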

Performing Topic Modeling

# Sample documents for topic modeling
docs = [
    "The hotel had a great view of the beach and excellent service.",
    "Transportation was a bit difficult to find late at night."
]

# Extract topics from the documents
topics, probs = model.transform(docs)
print("Topics:", topics)
print("Probabilities:", probs)
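
The returned topic IDs are integers, and BERTopic uses -1 to mark outliers. To make the output easier to read, each ID can be paired with its top keywords through model.get_topic. The sketch below assumes the model, docs, and topics variables from the snippets above; describe_assignments is a hypothetical helper, not part of BERTopic.

```python
def describe_assignments(model, docs, topics, top_n=5):
    """Pair each document with its topic ID and that topic's top keywords."""
    rows = []
    for doc, topic_id in zip(docs, topics):
        topic_id = int(topic_id)
        # get_topic returns a list of (word, score) pairs, or False when the
        # topic ID is unknown to the model.
        words = model.get_topic(topic_id) or []
        keywords = [word for word, _score in words[:top_n]]
        rows.append(
            {
                "doc": doc,
                "topic": topic_id,
                "outlier": topic_id == -1,
                "keywords": keywords,
            }
        )
    return rows
```

For example, describe_assignments(model, docs, topics) returns one dictionary per input document, which can be printed or serialized to JSON.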

Deployment Guide

Serverless Platforms:
Ensure dependencies such as safetensors, bertopic, and sentence-transformers are included in your deployment package, whether you target AWS Lambda, Cloud Functions, or a FastAPI service.
Memory Optimization:
Serializing with safetensors keeps the saved model small and quick to load, which helps cold-start time. It does not by itself speed up inference.
Scaling Considerations:
Load the model at cold start and reuse it for subsequent requests to efficiently scale in serverless environments.
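
The load-once pattern described above might look like this in an AWS Lambda style handler. It is a minimal sketch under stated assumptions: the MODEL_DIR environment variable, the request body shape ({"docs": [...]}), and the get_model helper are all hypothetical, and the model is loaded on the first request that reaches a container rather than at import time.

```python
import json
import os

_MODEL = None  # cached across warm invocations of the same container


def get_model(loader=None):
    """Return the cached model, loading it only on the first call."""
    global _MODEL
    if _MODEL is None:
        if loader is None:
            # Imported lazily so the import cost is paid once, on first use.
            from bertopic import BERTopic
            loader = BERTopic.load
        # MODEL_DIR is a hypothetical variable pointing at the model directory.
        _MODEL = loader(os.environ.get("MODEL_DIR", "path/to/model_dir"))
    return _MODEL


def handler(event, context):
    """Assign a topic ID to each document in the JSON request body."""
    docs = json.loads(event.get("body") or "{}").get("docs", [])
    if not docs:
        return {"statusCode": 400, "body": json.dumps({"error": "no docs provided"})}
    topics, _probs = get_model().transform(docs)
    # Cast to int so NumPy integer types serialize cleanly to JSON.
    body = json.dumps({"topics": [int(t) for t in topics]})
    return {"statusCode": 200, "body": body}
```

Keeping the model in a module-level variable means only the first request per container pays the loading cost; later requests reuse the same object.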

Limitations

Variable Topic Coherence:
Coherence may vary by language.
Dataset Biases:
The model’s performance may be influenced by biases in the training data.
Latency Constraints:
Not ideal for real-time low-latency applications (<50ms response time).
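
To check whether a given deployment fits a latency budget, the inference call can be timed directly. The helper below is a hypothetical sketch that works with any callable; the number of runs is an arbitrary assumption, and results will depend on hardware, batch size, and document length.

```python
import statistics
import time


def measure_latency_ms(predict, docs, runs=20):
    """Time repeated calls to predict(docs) and return latency stats in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(docs)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return {
        "median_ms": statistics.median(timings),
        "max_ms": timings[-1],
    }
```

With a loaded model, measure_latency_ms(model.transform, docs) reports the median and worst-case latency over the runs.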

License

[Insert License Here]

Citation

@inproceedings{your_citation,
  title={BERTopic Model for Multilingual Tourism Feedback},
  author={Paul Andre D. Tadiar},
  year={2025}
}

For inquiries or contributions, please open an issue on the Hugging Face repository.
