---
license: apache-2.0
pipeline_tag: translation
tags:
- chemistry
- biology
---

# **Contributors**

- Sebastian Lindner (GitHub [@Bienenwolf655](https://github.com/Bienenwolf655); Twitter @)
- Michael Heinzinger (GitHub @mheinzinger; Twitter @)
- Noelia Ferruz (GitHub @noeliaferruz; Twitter @ferruz_noelia; Webpage: www.aiproteindesign.com)

# **REXyme: A Translation Machine for the Generation of New-to-Nature Enzymes**

**Work in Progress**

REXyme (Reaction to Enzyme) (manuscript in preparation) is a translation machine for the generation of enzymes that catalyze user-defined reactions. It is possible to provide fine-grained input at the substrate level. Akin to how translation machines have learned to translate between complex language pairs with great success, often diverging in their representation at the character level (e.g., Japanese - English), we posit that an advanced architecture will be able to translate between the chemical and sequence spaces.

REXyme was trained on a set of xx reactions and yy enzyme pairs, and it produces sequences that putatively perform their intended reactions. To run it, you will need to provide a reaction in the SMILES format (Simplified Molecular-Input Line-Entry System), which you can do online here: xxxx. In a reaction SMILES, substrates and products are separated by `>>`; for example, `CCOC(=O)C.O>>CCO.CC(=O)O` describes the hydrolysis of ethyl acetate.

We are still working on the analysis of the model for different tasks, including experimental testing. See below for information about the model's performance in different *in silico* tasks and how to generate your own enzymes.

## **Model description**

REXyme is based on the [Efficient T5 Transformer](xx) architecture (which in turn is very similar to the current version of Google Translate) and contains xx layers with a model dimensionality of xx, totaling xx million parameters. REXyme is a translation machine trained on the xx database, which contains xx reaction-enzyme pairs. Training was performed on pairs of reaction SMILES and enzyme sequences with a sequence-to-sequence translation objective, i.e., the encoder reads the reaction and the decoder learns to predict the corresponding enzyme sequence token by token. Hence, the model learns the dependencies among protein sequence features that enable a specific enzymatic reaction. (An illustrative sketch of this training step is given at the end of this card.)

There are stark differences in the number of members among EC classes, and for this reason, we also tokenized the EC numbers. In this manner, EC numbers 2.7.1.1 and 2.7.1.2 share the first three tokens (six, including separators), and hence the model can infer that there are relationships between the two classes. (A toy sketch of this tokenization idea is also given at the end of this card.)

The figure below summarizes the process of training: (add figure)

## **Model Performance**

The analysis is still in preparation. Planned evaluations include:

- Dataset curation
- General descriptors (ESMFold, IUPred, ...)
- second pgp
- MMseqs2 (average identity?)

## **How to generate from REXyme**

REXyme can be used with the Hugging Face transformers Python package. Detailed installation instructions can be found here: https://huggingface.co/docs/transformers/installation

Since REXyme has been trained on a machine-translation objective, users have to specify a chemical reaction in the SMILES format. An official snippet will be added here; in the meantime, a minimal illustrative sketch is given at the end of this card.

## **A word of caution**

- We have not yet fully tested the ability of the model to generate new-to-nature enzymes, i.e., enzymes for chemical reactions that do not appear in nature (and hence not in the training set either). While this is the intended objective of our work, it is very much a work in progress. We will update the model and documentation shortly.
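## **Illustrative code sketches**

The sketches below are untested illustrations of the ideas described above, not the official code. Checkpoint names, input formatting, tokenization, and generation parameters are assumptions and may differ from the released model.

The first sketch shows a standard T5-style sequence-to-sequence training step, as assumed in the model description: the encoder reads a reaction SMILES, and the decoder is trained with teacher forcing to emit the enzyme sequence. The `t5-small` checkpoint and the whitespace-separated amino-acid format are stand-ins.

```python
# Sketch (untested): one sequence-to-sequence training step with teacher
# forcing. "t5-small" is a stand-in checkpoint; REXyme's tokenizer and
# input formatting are assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

reaction = "CCOC(=O)C.O>>CCO.CC(=O)O"  # hydrolysis of ethyl acetate
enzyme = "M T V S L K"                 # toy target sequence (assumed format)

inputs = tokenizer(reaction, return_tensors="pt")
labels = tokenizer(text_target=enzyme, return_tensors="pt").input_ids

# The model shifts the labels internally and computes cross-entropy over
# the target (enzyme) tokens.
loss = model(**inputs, labels=labels).loss
loss.backward()
```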
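The EC-number tokenization described in the model description can be pictured as splitting each number into per-level tokens plus separators, so that related classes share a common prefix. This toy function only illustrates the idea; it is not the actual tokenizer.

```python
# Toy illustration (not the actual REXyme tokenizer): per-level EC tokens
# with "." separator tokens, so that 2.7.1.1 and 2.7.1.2 share their first
# six tokens (three levels plus separators).
def ec_to_tokens(ec: str) -> list[str]:
    tokens: list[str] = []
    for level in ec.split("."):
        tokens.extend([level, "."])
    return tokens[:-1]  # drop the trailing separator

print(ec_to_tokens("2.7.1.1"))  # ['2', '.', '7', '.', '1', '.', '1']
print(ec_to_tokens("2.7.1.2"))  # ['2', '.', '7', '.', '1', '.', '2']
```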
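Finally, a minimal generation sketch for the "How to generate from REXyme" section. The checkpoint path is a placeholder, and the sampling parameters are arbitrary defaults rather than recommended settings.

```python
# Sketch (untested): generating candidate enzyme sequences for a reaction.
# The checkpoint path below is a placeholder; replace it with the released
# repository id once the model is published.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "path/to/rexyme-checkpoint"  # placeholder repository id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Reaction SMILES: substrates >> products (here, hydrolysis of ethyl acetate)
reaction = "CCOC(=O)C.O>>CCO.CC(=O)O"

inputs = tokenizer(reaction, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=600,        # enzymes are typically a few hundred residues
    do_sample=True,            # sampling yields diverse candidates
    top_p=0.95,
    num_return_sequences=4,
)
for candidate in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(candidate)
```

Depending on the tokenization, the decoded output may need light post-processing (e.g., removing whitespace between residues) before downstream analysis.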