nferruz committed on
Commit
3fc7e66
·
1 Parent(s): b058439

Update README.md

  1. README.md +70 -0
README.md CHANGED
@@ -1,3 +1,73 @@
1
  ---
2
  license: apache-2.0
3
+ pipeline_tag: translation
4
+ tags:
5
+ - chemistry
6
+ - biology
7
  ---
8
+
9
+ # **Contributors**
10
+
11
+ - Sebastian Lindner (GitHub [@Bienenwolf655](https://www.google.com); Twitter @)
12
+ - Michael Heinzinger (GitHub @mheinzinger; Twitter @)
13
+ - Noelia Ferruz (GitHub @noeliaferruz; Twitter @ferruz_noelia; Webpage: www.aiproteindesign.com )
14
+
15
+ # **REXyme: A Translation Machine for the Generation of New-to-Nature Enzymes**
16
+ **Work in Progress**
17
+
18
+ REXyme (Reaction to Enzyme) (manuscript in preparation) is a translation machine for the generation of enzymes that catalyze user-defined reactions.
19
+ It is possible to provide fine-grained input at the substrate level.
20
+ Just as machine translation systems have learned to translate successfully between complex language pairs
21
+ whose representations often diverge at the character level (e.g., Japanese and English), we posit that an advanced architecture will
22
+ be able to translate between the chemical and sequence spaces. REXyme was trained on a set of xx reactions and yy enzyme pairs, and it produces
23
+ sequences that putatively perform their intended reactions.
24
+
25
+ To run it, you will need to provide a reaction in the SMILES format (Simplified Molecular-Input Line-Entry System),
26
+ which you can do online here: xxxx
27
+
28
+ We are still working on the analysis of the model for different tasks, including experimental testing.
29
+ See below for information about the model's performance in different in silico tasks and how to generate your own enzymes.
30
+
31
+ ## **Model description**
32
+ REXyme is based on the [Efficient T5 Transformer](xx) architecture (which in turn is very similar to the one behind the current version of Google Translate)
33
+ and contains xx layers
34
+ with a model dimensionality of xx, totaling xx million parameters.
35
+
36
+ REXyme is a translation machine trained on the xx database containing xx reaction-enzyme pairs.
37
+ The pre-training was done on pairs of SMILES and ... (fasta headers?).
38
+
39
+ ZymCTRL was trained with an autoregressive objective (this is not right, check it ??) i.e., the model learns to predict a missing
40
+ token in the encoder's input. Hence,
41
+ the model learns the dependencies among protein sequence features that enable a specific enzymatic reaction.
42
+
43
+ Sebastian check if this applies?? There are stark differences in the number of members among EC classes, and for this reason, we also tokenized the EC numbers.
44
+ In this manner, EC numbers '2.7.1.1' and '2.7.1.2' share the first three tokens (six, including separators), and hence the model can infer that
45
+ there are relationships between the two classes.
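
The prefix sharing described above can be illustrated with a minimal sketch. The helpers `tokenize_ec` and `shared_prefix` are hypothetical names introduced here for illustration; they only assume that an EC number is split into its level tokens and `.` separators, and the tokenizer actually used for REXyme may differ.

```python
import re


def tokenize_ec(ec_number: str) -> list[str]:
    """Split an EC number into level tokens and '.' separators (illustrative only)."""
    return re.findall(r"[^.]+|\.", ec_number)


def shared_prefix(a: list[str], b: list[str]) -> list[str]:
    """Return the leading tokens that two token lists have in common."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


tokens_a = tokenize_ec("2.7.1.1")
tokens_b = tokenize_ec("2.7.1.2")
common = shared_prefix(tokens_a, tokens_b)

# Six shared tokens in total, three of which are EC levels.
print(common)
print([t for t in common if t != "."])
```

Under this assumption, the two classes differ only in their final token, which is what would let a model relate them.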
46
+
47
+ The figure below summarizes the process of training: (add figure)
48
+
49
+ ## **Model Performance**
50
+
51
+ - explain dataset curation
52
+ - general descriptors (ESMFold, IUPred, ...)
53
+ - second pgp
54
+ - mmseqs (Average?)
55
+
56
+
57
+ ## **How to generate from REXyme**
58
+ REXyme can be used with the Hugging Face `transformers` Python package.
59
+ Detailed installation instructions can be found here: https://huggingface.co/docs/transformers/installation
60
+
61
+ Since REXyme was trained with a machine translation objective, users have to provide a chemical reaction in the SMILES format.
62
+
63
+ [please seb include snippet to generate sequences]
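
Until the official snippet is available, the following is only a rough sketch of what generation might look like, not the authors' code. It assumes that the checkpoint loads with `AutoModelForSeq2SeqLM` (plausible for a T5-style model, but unconfirmed), that the input is a plain reaction SMILES of the form `substrates>>products`, and that sampling-based decoding is appropriate. The helper names, the sampling parameters, and the `model_id` value are placeholders.

```python
def build_reaction_smiles(substrates, products):
    """Join substrates and products into a reaction SMILES: 'A.B>>C.D'."""
    if not substrates or not products:
        raise ValueError("At least one substrate and one product are required.")
    return ".".join(substrates) + ">>" + ".".join(products)


def generate_enzymes(reaction_smiles, model_id, num_sequences=5, max_new_tokens=512):
    """Sample candidate sequences for a reaction (hypothetical usage sketch)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    inputs = tokenizer(reaction_smiles, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,                      # sample instead of greedy decoding
        top_k=50,                            # illustrative value, not tuned
        max_new_tokens=max_new_tokens,
        num_return_sequences=num_sequences,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Example reaction: ethanol + acetic acid -> ethyl acetate + water
reaction = build_reaction_smiles(["CCO", "CC(=O)O"], ["CC(=O)OCC", "O"])
print(reaction)

# The model identifier below is a placeholder; replace it with the real one.
# sequences = generate_enzymes(reaction, model_id="<path-or-hub-id-of-REXyme>")
```

The generation call is left commented out because the model identifier and the exact input format expected by the tokenizer are not stated in this document.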
64
+
65
+
66
+ ## **A word of caution**
67
+
68
+ - We have not yet fully tested the model's ability to generate new-to-nature enzymes, i.e., enzymes
69
+ for chemical reactions that do not appear in Nature (and hence not in the training set either). While this is the intended objective of our work,
70
+ it is very much a work in progress. We'll update the model and documentation shortly.
71
+
72
+
73
+