ImmtorLig_DB: Machine Learning Curation for Immune Receptors
Overview
This project focuses on curating and modeling bioactivity data of small molecules targeting immune receptors. Using datasets from ImmtorLig_DB, we applied machine learning techniques to predict interactions between small molecules and immune receptors or cytokines, aiding drug discovery research.
Dataset Description
- Source: The dataset is derived from ImmtorLig_DB, a repertoire of 5000 small molecules screened against 25 immune receptors, as described by Chatterjee et al. in Scientific Reports (DOI: 10.1038/s41598-019-39475-7).
- Structure: Contains molecular properties, SMILES strings, ECFP fingerprints, and receptor/cytokine interaction data.
- Purpose: To identify potential lead compounds to bolster host immunity by targeting immune receptors.
Curation and Analysis Process
Script Breakdown
Data Integration:
- 01_concat_ImmtorLig_DB_ADME.py: Merges multiple Excel sheets into a single dataset, adding receptor/cytokine annotations.
- 02_add_fingerprints.py: Calculates ECFP fingerprints from SMILES strings and adds them to the dataset.
- 03_expand_fingerprints.py: Expands fingerprint features into separate columns for model compatibility.
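The expansion step can be pictured with a minimal sketch. It assumes, purely for illustration, that each fingerprint is stored as a string of 0/1 characters in a column named ecfp. The column name, the fp_ prefix, and the helper expand_fingerprints are hypothetical and are not taken from the repository's scripts, which may store and expand fingerprints differently.

```python
from typing import Dict, List


def expand_fingerprints(
    rows: List[Dict[str, str]],
    fp_column: str = "ecfp",  # hypothetical column name
    prefix: str = "fp_",      # hypothetical prefix for the new columns
) -> List[Dict[str, object]]:
    """Turn a bit-string fingerprint column into one integer column per bit."""
    if not rows:
        return []
    n_bits = len(rows[0][fp_column])
    expanded = []
    for row in rows:
        bits = row[fp_column]
        if len(bits) != n_bits:
            raise ValueError("all fingerprints must have the same length")
        # Keep every non-fingerprint field unchanged.
        new_row: Dict[str, object] = {k: v for k, v in row.items() if k != fp_column}
        # Add one 0/1 column per fingerprint position.
        for i, bit in enumerate(bits):
            new_row[f"{prefix}{i}"] = int(bit)
        expanded.append(new_row)
    return expanded


# Toy records with 8-bit fingerprints (real ECFP vectors are typically much longer).
records = [
    {"smiles": "CCO", "ecfp": "01001000"},
    {"smiles": "c1ccccc1", "ecfp": "10000010"},
]
table = expand_fingerprints(records)
print(table[0])
```

Giving each bit its own column lets tabular learners treat every fingerprint position as a separate binary feature.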
Feature Generation and Dimensionality Reduction:
- 04.1_umap_fingerprints.py: Applies UMAP for dimensionality reduction and clusters molecules using fingerprints.
- 04.2_umap_fp_adme.py: Conducts UMAP on combined fingerprint and ADME data, enhancing visual analysis and clustering.
Target-Specific Analysis:
- 05.1_il2_fingerprints_analysis.py: Focuses on the IL-2 receptor, analyzing its molecular fingerprint interactions.
- 05.2_il2_fp_adme_analysis.py: Extends the IL-2 analysis by incorporating ADME properties.
Receptor Clustering and UMAP:
- 06.1_umap_fingerprint_receptors.py: Generates UMAP plots for various receptor categories using fingerprint data.
- 06.2_umap_fp_adme_receptors.py: Similar to 06.1, but integrates ADME features for enhanced receptor categorization.
Data Preparation for Machine Learning:
- 07.1_prep_data_split_ind.py: Splits data for individual receptor classification tasks, maintaining distribution.
- 07.2_prep_data_split_class.py: Prepares data splits for class-based receptor analysis, handling multi-class scenarios.
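A split that maintains the label distribution is usually called a stratified split. The sketch below shows the idea with the standard library only. The 80/20 ratio, the seed, and the function name stratified_split are assumptions made for illustration; the repository's scripts may use a different ratio or a library routine.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable],
    test_fraction: float = 0.2,  # assumed ratio, not taken from the repository
    seed: int = 42,              # assumed seed for reproducibility
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_class, key=str):
        indices = by_class[label]
        rng.shuffle(indices)
        # Each class contributes the same fraction of its members to the test set.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels: 10 binders and 40 non-binders for one receptor.
labels = ["binder"] * 10 + ["non_binder"] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each class keeps rare classes represented in both partitions, which matters when only a few molecules are annotated for a given receptor.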
Model Training and Evaluation:
- 08.1_h2o_automl_ind.py: Trains H2O AutoML models to predict individual receptor interactions.
- 08.2_h2o_automl_class.py: Develops models to classify broader receptor categories using H2O AutoML.
Model Development and Results
- Tools: Employed H2O AutoML to train models for predicting receptor/cytokine interactions and class associations.
- Performance Metrics: Mean per-class errors and full confusion matrices are saved for detailed performance evaluation.
- Outputs: Leaderboards and detailed evaluation results stored in CSV and text formats.
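For readers unfamiliar with the metric, mean per-class error is commonly defined as the average, over classes, of the fraction of that class's examples that were misclassified. The sketch below computes it from a confusion matrix, assuming rows are actual classes and columns are predicted classes. The function name and the toy counts are illustrative only and are not project results.

```python
from typing import List, Sequence


def mean_per_class_error(matrix: Sequence[Sequence[int]]) -> float:
    """Average the per-class error rates of a square confusion matrix.

    Rows are actual classes, columns are predicted classes.
    """
    errors: List[float] = []
    for i, row in enumerate(matrix):
        total = sum(row)
        if total == 0:
            continue  # skip classes with no actual examples
        errors.append(1.0 - row[i] / total)
    if not errors:
        raise ValueError("confusion matrix contains no examples")
    return sum(errors) / len(errors)


# Toy 3-class confusion matrix (made-up counts, not project results).
confusion = [
    [8, 1, 1],  # class A: 2 of 10 misclassified
    [2, 6, 2],  # class B: 4 of 10 misclassified
    [0, 5, 5],  # class C: 5 of 10 misclassified
]
print(round(mean_per_class_error(confusion), 4))
```

Because every class is weighted equally, this metric penalizes a model that ignores small classes, unlike overall accuracy.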
Usage Instructions
Environment Setup:
- Follow environment.yml to set up the Python environment using Conda.
Running the Scripts:
- Execute the scripts in numerical order, from 01_concat_ImmtorLig_DB_ADME.py through 08.2_h2o_automl_class.py, covering data processing, feature extraction, and model training.
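One way to run the steps in order is a small driver script. This is a sketch rather than part of the repository: the helper names build_commands and run_pipeline are hypothetical, and it assumes the scripts take no command-line arguments and sit together in one directory, neither of which is stated here.

```python
import subprocess
import sys
from pathlib import Path
from typing import List

# Script names as listed in the Script Breakdown, in execution order.
PIPELINE: List[str] = [
    "01_concat_ImmtorLig_DB_ADME.py",
    "02_add_fingerprints.py",
    "03_expand_fingerprints.py",
    "04.1_umap_fingerprints.py",
    "04.2_umap_fp_adme.py",
    "05.1_il2_fingerprints_analysis.py",
    "05.2_il2_fp_adme_analysis.py",
    "06.1_umap_fingerprint_receptors.py",
    "06.2_umap_fp_adme_receptors.py",
    "07.1_prep_data_split_ind.py",
    "07.2_prep_data_split_class.py",
    "08.1_h2o_automl_ind.py",
    "08.2_h2o_automl_class.py",
]


def build_commands(script_dir: str = ".") -> List[List[str]]:
    """Build one interpreter command per pipeline script, in order."""
    return [[sys.executable, str(Path(script_dir) / name)] for name in PIPELINE]


def run_pipeline(script_dir: str = ".") -> None:
    """Run each script in turn and stop at the first failure."""
    for command in build_commands(script_dir):
        print("Running", command[1])
        # check=True raises CalledProcessError if a script exits with an error.
        subprocess.run(command, check=True)
```

Calling run_pipeline() from the directory that holds the scripts would execute all thirteen steps. Stopping at the first failure avoids training models on incomplete intermediate files.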
Data Access:
- All curated datasets and results are available in this repository, ready for further analysis and model refinement.
Citation and Licensing
Paper Reference: Chatterjee, D., Kaur, G., Muradia, S., Singh, B., & Agrewala, J.N. ImmtorLig_DB: repertoire... (DOI: 10.1038/s41598-019-39475-7)
License: This project is licensed under the Creative Commons Attribution 4.0 International License. This permits use, sharing, adaptation, distribution, and reproduction in any medium or format, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. See the LICENSE.md for more details.
Future Work
- Enhance model accuracy through further feature engineering and tuning.
- Expand the dataset with additional receptor interactions and properties.
Contact
For questions or contributions, please contact the project lead or collaborators using the contact details provided in the repository.