
Structure to Property: Chemical Element Embeddings and a Deep Learning Approach for Accurate Prediction of Chemical Properties


Shokirbek Shermukhamedov1, Dilorom Mamurjonova2 and Michael Probst1,3

1Institute of Ion Physics and Applied Physics, University of Innsbruck, 6020 Innsbruck, Austria; 2Tashkent Chemical Technological Institute, 100011 Tashkent, Uzbekistan; 3School of Molecular Science and Engineering, Vidyasirimedhi Institute of Science and Technology, 21201 Rayong, Thailand. Correspondence to: Shokirbek Shermukhamedov <2shermux@gmail.com>.

Abstract
The application of machine learning (ML) techniques in computational chemistry has led to significant advances in predicting
molecular properties, accelerating drug discovery, and material design. ML models can extract hidden patterns and
relationships from complex and large datasets, allowing for the prediction of various chemical properties with high accuracy.
The use of such methods has enabled the discovery of molecules and materials that were previously difficult to identify. This
paper introduces a new ML model based on deep learning techniques, such as a multilayer encoder and decoder architecture,
for classification tasks. We demonstrate the opportunities offered by our approach by applying it to various types of input
data, including organic and inorganic compounds. In particular, we developed and tested the model using the Matbench and
MoleculeNet benchmarks, which include crystal-property and drug-design-related tasks. We also conducted a
comprehensive analysis of vector representations of chemical compounds, shedding light on the underlying patterns in
molecular data. The models used in this work exhibit a high degree of predictive power, underscoring the progress that can
be made with refined machine learning when applied to molecular and material datasets. For instance, on the Tox21 dataset,
we achieved an average accuracy of 96%, surpassing the previous best result by 10%. Our code is publicly available at
https://github.com/dmamur/elembert.

1 Introduction
Due to their effectiveness in fitting experimental data and predicting material properties, machine learning models have found
extensive applications in research on batteries1,2, supercapacitors3, thermoelectric4 and photoelectric5 devices, catalysts6 and
in drug design7. In a 'second wave', deep learning models (DLMs) have exhibited remarkable potential in advancing the field
of chemical applications. So-called Word2vec8 DLMs have been used for processing chemical text data extracted from
academic articles. By representing chemical formulas as embeddings or vectors, non-obvious connections between
compounds and chemical properties can be discovered. For instance, the mat2vec 9 NLP model was able to predict materials
with good thermoelectric properties, even when these materials and their properties were not explicitly named in the original
papers. Other NLP-inspired models, such as Bag of Bonds10, mol2vec11, smile2vec12, SPvec13, have used unsupervised
machine learning and have been applied to chemical compound classification tasks, achieving remarkable results. These
models hold immense potential for accelerating the discovery and the design of materials with tailored properties.
In this regard, the type of input data is crucial for ML models. In chemistry, this could be chemical text data, like in mat2vec,
or structural data. Chemical texts make it possible to use reference information of a compound14, such as weight, melting
point, crystallization temperature, and element composition. These types of inputs can, in turn, be used by general deep
learning models, with ELMO, BERT, and GPT-3 (or GPT-4) being the most famous examples.
One of the most common types of input data used for ML-based approaches is structural representation, which provides
valuable information about the atomic environment of a given material. However, text-based data does not normally capture
important structural features, such as interatomic distances. Structural information is crucial for predicting material properties,
as it is key to all pertinent physical and chemical characteristics. This can be understood in the same sense as the Born-
Oppenheimer approximation, which, in short, states that atomic coordinates (and from them the potential energy) are all that is needed
in chemistry. The challenge of linking structural information to material properties is commonly referred to as the "structure
to property" task. Overcoming this challenge has the potential to greatly enhance our ability to predict and design novel
materials with desired properties.
Structure can be translated into properties by graph neural network (GNN) or high-dimensional neural network (HDNN) formalisms. GNNs transform graphs of molecules (or compounds) into node and edge embeddings, which can then be used in state-of-the-art prediction tasks15–21. HDNNs based on converting Cartesian coordinates of atoms to continuous representations use
techniques like the smooth overlap of atomic positions (SOAP) 22, the many-body tensor representation (MBTR)23, or the
atomic centered symmetry functions (ACSF)24 to achieve the same goal. Message passing neural networks (MPNN) are a
subgroup of HDNN that use atomic positions and nuclear charges as input. Examples include SchNet25 and PhysNet26. In

these models, atomic embedding encodes the atomic identifier into vector arrays, which are first initialized randomly and
optimized during training.
Despite the increasing use of deep learning in computational chemistry, many aspects of NLP models have yet to be fully
explored. One of them is the attention mechanism27, which allows the model to focus on specific parts of the input data when
making predictions. It works by assigning different levels of importance, or attention, to different elements in the input
sequence. Additionally, the so-called transformer approach has not yet been fully utilized in chemistry. The transformer
consists of two distinct components: an encoder responsible for processing the input data and a decoder responsible for
generating task-related predictions. In this paper, we introduce a new deep learning model for chemical compounds that
utilizes both of these approaches. Specifically, our model incorporates local attention layers to capture properties of local
atomic environments and then utilizes a global attention layer to make weighted aggregations of these atomic environment
vectors to create a global representation of the entire crystal structure. While the attention mechanism has been previously
used in graph neural networks28, this work introduces an atomic representation deep learning model that can be applied to a
wide range of tasks. From its components, we call this model 'elEmBERT' (element Embeddings and Bidirectional Encoder
Representations from Transformers).
In summary, the main aspects of our work are:
a. We use a transformer mechanism for binary classification based on structural information.
b. Our model is flexible and can be easily adapted to different types of datasets.
c. Benchmarks show the state-of-the-art performance of our model for a variety of material property prediction
problems, both involving organic and inorganic compounds.

2 Methods
As input to the neural network (NN), we utilize atomic pair distribution functions (PDFs) and the atom types that compose
the compounds. The PDF represents the probability of finding an atom at a distance r from a selected atom29. To prepare the training data, we calculate PDFs employing the ASE library30 with a cutoff radius of 10 Å. The second input for the NN consists of element embedding vectors. To achieve this, all elements in all crystals are mapped to integers (typically the atomic number), creating an elemental vocabulary of size Vsize = 101. These embeddings are then passed
to the BERT module.
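For illustration, the following minimal Python sketch outlines this preprocessing step. It is not the elEmBERT implementation: the per-atom histogram stand-in for the PDF, the bin count, and the toy NaCl structure are illustrative assumptions; only the 10 Å cutoff and the atomic-number tokens follow the text above.

```python
import numpy as np
from ase.build import bulk

CUTOFF = 10.0   # cutoff radius in Angstrom, as stated above
NBINS = 100     # illustrative number of histogram bins (not specified in the text)

def element_pdfs(atoms, cutoff=CUTOFF, nbins=NBINS):
    """Per-atom histograms of interatomic distances, grouped by element.

    A simplified stand-in for the atomic pair distribution function:
    returns a dict {element symbol: (n_atoms_of_element, nbins) array}.
    """
    dists = atoms.get_all_distances(mic=True)      # minimum-image pair distances
    bins = np.linspace(0.0, cutoff, nbins + 1)
    pdfs = {}
    for i, sym in enumerate(atoms.get_chemical_symbols()):
        d = dists[i]
        d = d[(d > 0.0) & (d < cutoff)]            # drop self-distance and atoms beyond the cutoff
        hist, _ = np.histogram(d, bins=bins)
        pdfs.setdefault(sym, []).append(hist.astype(float))
    return {sym: np.array(h) for sym, h in pdfs.items()}

# Toy structure: a rocksalt NaCl supercell large enough that the minimum-image
# convention covers the 10 Angstrom cutoff.
atoms = bulk("NaCl", "rocksalt", a=5.64) * (4, 4, 4)
tokens = atoms.get_atomic_numbers().tolist()       # element tokens (atomic numbers)
pdfs = element_pdfs(atoms)
print(len(tokens), {k: v.shape for k, v in pdfs.items()})
```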
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained deep learning model originally designed
for natural language processing (NLP) tasks. It employs a bidirectional transformer encoder to capture word context in
sentences, allowing it to generate accurate text representations. BERT employs masked language modeling (MLM), where
some tokens in a sentence are masked or replaced with a [MASK] token, and the model is trained to predict the original word
based on the surrounding context. Additionally, BERT uses next sentence prediction, training on pairs of sentences to predict
whether the second sentence follows the first.
Our model is illustrated in Fig. 1. It can use various combinations of embedding sizes, encoder-decoder layers, and attention
heads. In chemical applications, the atomic composition of a compound can be equated to a sentence, with individual atoms
serving as constituent tokens. Leveraging this analogy, we introduce four new tokens to the vocabulary: [MASK] for MLM,
[UNK] for unseen tokens, [CLS] for classification, and [SEP] for separating two compounds.
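The sentence analogy can be sketched with a toy tokenizer as follows. The special tokens follow the paper, but the [PAD] token, the integer ids, and the fixed sequence length are illustrative assumptions rather than the actual elEmBERT vocabulary.

```python
# Toy tokenizer treating a compound as a "sentence" of atom tokens.
# [MASK], [UNK], [CLS], [SEP] follow the paper; [PAD] and the ids are assumptions.
SPECIAL = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]

def build_vocab(element_tokens):
    """Map special tokens plus all (sub)element symbols to integer ids."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL)}
    for el in sorted(set(element_tokens)):
        vocab.setdefault(el, len(vocab))
    return vocab

def encode(compound, vocab, max_len=64):
    """Convert a list of (sub)element symbols into padded token ids."""
    ids = [vocab["[CLS]"]] + [vocab.get(el, vocab["[UNK]"]) for el in compound]
    ids = ids[:max_len]
    return ids + [vocab["[PAD]"]] * (max_len - len(ids))

vocab = build_vocab(["Li", "Co", "O"])
# Li8CoO6 (cf. Fig. 2b) as a 15-atom "sentence", padded to length 20
print(encode(["Li"] * 8 + ["Co"] + ["O"] * 6, vocab, max_len=20))
```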

Figure 1. Classification Model Architecture: The initial step involves computing the pair distribution function for each element
based on atom positions within the chemical compound. This information is then passed through the PCAKM layer.
Subsequently, the resulting subelements are converted into tokens, with additional tokens incorporated before input into the
BERT module. The [CLS] token output vector from BERT is used for the classification task.
In a chemical compound, each element can exhibit different oxidation states or formal charges, indicating the relative electron
loss or gain during chemical reactions. Considering the foundations of chemistry, it is evident that the elements composing a compound do not interact uniformly, but instead exhibit specific interactions with neighboring atoms. In inorganic substances, such interactions can manifest as ionic interactions represented by the oxidation state, while in organic substances they typically take the form of covalent bonding. To create a universal criterion for understanding these interactions, we
considered the number of electrons that can participate in a chemical reaction. Using this criterion, we categorized the elements
in our training dataset into subelements based on the number of electrons in their outer shell or their oxidation states. However,
it is important to note that information about the type of interaction between atoms in molecular structures is often missing,
and existing algorithms can be prone to errors. In view of this and recognizing that the length of the chemical bond carries
information about the type of interaction, we used Principal Component Analysis (PCA) to reduce the dimensionality of PDF
vectors. We then employed a k-means algorithm to cluster the outputs and categorize elements into subelement classes. This is
similar to what has often been done manually when developing classical force fields. Examples of such differentiation are
presented in Fig. 2.
We trained an individual model for each element in our dataset, resulting in a total of 192 models, including one PCA and one
k-means model for each element. The final dictionary size was Vsize = 565.
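A minimal sketch of this per-element PCA plus k-means step, using scikit-learn, is given below. The numbers of principal components and clusters are illustrative assumptions; the text above fixes only the resulting totals (192 PCA/k-means models and a subelement vocabulary of 565 tokens).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def fit_subelement_models(pdfs_by_element, n_components=2, n_clusters=4):
    """Fit one PCA and one k-means model per element on its PDF vectors.

    pdfs_by_element: dict {element symbol: (n_atoms, n_bins) array of PDFs}.
    n_components / n_clusters are illustrative choices, not the paper's values.
    """
    models = {}
    for el, X in pdfs_by_element.items():
        pca = PCA(n_components=min(n_components, X.shape[0], X.shape[1]))
        Z = pca.fit_transform(X)                       # reduce PDF dimensionality
        km = KMeans(n_clusters=min(n_clusters, len(Z)), n_init=10, random_state=0)
        km.fit(Z)                                      # cluster into subelements
        models[el] = (pca, km)
    return models

def subelement_token(element, pdf_vector, models):
    """Map one atom to its subelement token, e.g. 'O' -> 'O_2'."""
    pca, km = models[element]
    cluster = km.predict(pca.transform(pdf_vector.reshape(1, -1)))[0]
    return f"{element}_{cluster}"
```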

Figure 2. Two examples illustrating the division of elements into sub-elements based on their environment: a hypothetical
organic compound (a) and the Li8CoO6 crystal with Materials Project ID mp-27920 (b). The numbers at the top right of the elements correspond to
subelements.

In the following sections, we will present the results of prediction models with specific parameters, including an embedding
size of 32, 2 attention heads, and 2 layers. We explored two model versions, V0 (where the PCA-Km block is omitted) and
V1, as discussed previously. These calculations were carried out three times for each dataset, using the random seeds 12345, 67890, and 234567.
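The following schematic PyTorch stand-in illustrates this configuration (embedding size 32, 2 attention heads, 2 encoder layers, classification from the [CLS] vector). It is not the actual elEmBERT code, which is available in the linked repository; the positional embedding, feed-forward width, and maximum sequence length are assumptions.

```python
import torch
import torch.nn as nn

class TinyElementBERT(nn.Module):
    """Schematic stand-in: token + position embeddings, a small transformer
    encoder, and a classification head read from the [CLS] position."""

    def __init__(self, vocab_size=565, d_model=32, n_heads=2,
                 n_layers=2, n_classes=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)    # one output neuron per class

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); position 0 is assumed to hold [CLS]
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)
        x = self.encoder(x)
        return self.head(x[:, 0])                    # classify from the [CLS] vector

model = TinyElementBERT()
logits = model(torch.randint(0, 565, (4, 64)))       # dummy batch of 4 compounds
```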

3 Results
We trained our elEmBERT model to perform various classification tasks. To do this, we used the [CLS] token and added an
additional layer to the BERT module with the same number of neurons as there are classes in the dataset. Our first task
involved using the Materials Project (MP) metallicity dataset to predict the metallicity of materials based on crystal structure
information31,32. Next, we employed a portion of the datasets gathered for the CegaNN model 33. This led us to undertake a
classification task known as the Liquid-Amorphous (LA) task, which revolves around distinguishing between liquid and
amorphous phases of silicon (Si). The LA dataset comprises 2,400 Si structures, evenly divided between amorphous and
liquid phases (50% each). Importantly, these Si structures lack symmetry and differ solely in terms of density and coordination
number. In addition to these tasks, we evaluated the elEmBERT model's ability to classify material polymorphs across
different dimensionalities, specifically clusters (0D), sheets (2D), and bulk structures (3D). Carbon, with its wide range of
allotropes spanning these dimensionalities, served as an excellent system for assessing the efficiency of our network model
in dimensionality classification (DIM task). The DIM dataset contained 1,827 configurations. Finally, we ventured into
characterizing the space group of crystal structures, encompassing a total of 10,517 crystal structures distributed among eight
distinct space groups (SG task)34.
Expanding beyond inorganic material datasets, we incorporated organic compounds, which greatly outnumber their inorganic
counterparts. This expansion encompasses an extended range of properties, including biochemical and pharmaceutical
aspects. To rigorously validate our model, we turned to benchmark datasets from MoleculeNet35, specifically BBBP (Blood-
Brain Barrier Penetration), ClinTox (Clinical Toxicity), BACE (β-Secretase), SIDER (Side Effect Resource)36, and Tox21.
These datasets cover a diverse array of chemical compounds and provide a comprehensive assessment of our model's
predictive performance for binary properties or activities associated with organic molecules. In this context, a positive instance
signifies that a molecule possesses a specific property, while a negative instance indicates its absence. The MoleculeNet
dataset primarily comprises organic molecules represented in SMILES format. For analysis purposes, we converted these
SMILES formulas into the standard XYZ format using the Open Babel software 37 and RDKit package38. To evaluate our
model's performance, we employed the 'Receiver Operating Characteristic - Area Under the Curve' (ROC-AUC) metric, a
common measure for assessing binary classification quality. ROC-AUC quantifies the model's ability to differentiate between
positive and negative classes based on predicted probabilities. We divided the datasets into three subsets: the training set, the
validation set, and the test set, with an 80:10:10 ratio. The ROC-AUC results reported in Table 1 are based on the test set.
These results serve as a reliable metric for evaluating prediction capabilities and the model's ability to generalize to new
instances. Notably, the Tox21 and SIDER datasets encompass 12 and 27 individual tasks, respectively, each corresponding
to specific toxicity predictions.
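As an illustration of this preprocessing and evaluation protocol, the sketch below embeds a SMILES string in 3D with RDKit (only one of the two conversion routes mentioned above; Open Babel is the other) and indicates the 80:10:10 split and ROC-AUC computation with scikit-learn. The embedding and optimization settings are illustrative choices, and the features, labels, and model scores are random placeholders.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def smiles_to_xyz(smiles: str) -> str:
    """Generate a 3D geometry for a SMILES string and return an XYZ block.
    Settings are illustrative; MolToXYZBlock needs a reasonably recent RDKit."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMolecule(mol, randomSeed=42)      # generate 3D coordinates
    AllChem.MMFFOptimizeMolecule(mol)              # quick force-field relaxation
    return Chem.MolToXYZBlock(mol)

print(smiles_to_xyz("c1ccccc1O"))                  # phenol, as a toy example

# 80:10:10 split and test-set ROC-AUC, shown here on random placeholder data.
X = np.random.rand(1000, 16)                       # placeholder compound features
y = np.random.randint(0, 2, size=1000)             # placeholder binary labels
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.8, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
scores = np.random.rand(len(y_test))               # placeholder model scores
print("test ROC-AUC:", roc_auc_score(y_test, scores))
```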
Table 1. Performance of different models applied to the datasets (Matbench to Tox21) used in this work. Bold font indicates the best performance, underlining indicates the second-best, and the last column presents the best previous result obtained from other models. V0 denotes models that use chemical element embeddings, while V1 uses subelement embeddings as input for the BERT module.
Benchmark        V0              V1              Best previous
MP metallicity   0.961 ± 0.001   0.965 ± 0.001   0.950 (ref. 39)
SG               0.944 ± 0.003   0.968 ± 0.002   1.0 (ref. 33)
LA               0.475 ± 0.014   0.980 ± 0.003   1.0 (ref. 33)
DIM              0.893 ± 0.013   0.958 ± 0.003   1.0 (ref. 33)
BACE             0.827 ± 0.005   0.856 ± 0.010   0.888 (ref. 40)
BBBP             0.900 ± 0.020   0.905 ± 0.025   0.932 (ref. 40)
ClinTox          0.945 ± 0.011   0.951 ± 0.016   0.948 (ref. 41)
HIV              0.978 ± 0.002   0.979 ± 0.003   0.776 (ref. 42)
SIDER            0.778 ± 0.032   0.777 ± 0.028   0.659 (ref. 40)
Tox21            0.961 ± 0.006   0.958 ± 0.007   0.860 (ref. 41)

Table 1 provides clear evidence that the accuracy of predictions improves as the number of subelement types increases, particularly
for inorganic compounds. In the LA task, using single-element inputs, such as Si, results in only 50% accuracy, which is
comparable to random guessing. However, incorporating sub-elements significantly enhances the performance, leading to an
impressive ROC-AUC of 0.98. Our approach also demonstrates improved accuracy across other datasets. While further
increasing the number of sub-elements has a relatively small impact, it still leads to higher accuracy. In the subsequent
sections, we will delve into each dataset, from Matbench to Tox21, and examine the elEmBERT-V1 model in more detail,
providing comprehensive insights into the predictions.

MP metallicity
Figure 3a illustrates the confusion matrix and presents the performance of the elEmBERT-V1 model in classifying MP
metallicity. In this task, the objective is to predict or estimate whether a material or chemical compound is a metal or not. The
dataset for this task comprises 106,113 samples of training structures and 21,222 samples of test structures. Our trained model
achieves a binary accuracy of approximately 0.91 and an AUC of 0.965 on the test set.

Figure 3. Confusion matrix (a) and visualization of [CLS] token embeddings for the MP metallicity dataset for the reference
(b) and predicted (c) datasets: blue circles denote negative labels (not metal) and orange dots represent positive labels
(metal).
In Figure 3b, the t-SNE (t-distributed stochastic neighbor embedding) plot shows the embeddings of the entire reference
dataset, categorized by labels, revealing a smooth differentiation among labels within the feature space. Figure 3c
demonstrates how our model classifies the reference dataset. It is evident that the classification models create a clear separation
in the feature space, in contrast to the diffuse boundary in the reference dataset. The primary errors are located at this boundary, where the model sometimes struggles to capture the diffuse behavior. The metallicity prediction task highlights elEmBERT's remarkable capability to characterize binary properties of crystals. The achieved accuracy surpasses that of previously published models, including GNNs.
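Such projections can be reproduced with scikit-learn as sketched below, assuming the [CLS] vectors have been extracted from the trained model into an array; the perplexity setting and the random placeholder data are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# cls_vectors: (n_compounds, 32) array of [CLS] embeddings extracted from the
# trained model; labels: binary metallicity labels. Both are placeholders here.
cls_vectors = np.random.rand(500, 32)
labels = np.random.randint(0, 2, size=500)

# Project the 32-dimensional [CLS] vectors onto two dimensions for plotting.
proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(cls_vectors)

for value, name in [(0, "not metal"), (1, "metal")]:
    mask = labels == value
    plt.scatter(proj[mask, 0], proj[mask, 1], s=5, label=name)
plt.legend()
plt.savefig("mp_metallicity_tsne.png", dpi=150)
```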

LA, DIM and SG


This section presents the results obtained from benchmarks conducted for the CegaNN model, beginning with a focus on the
LA classification task. Figs. 4a and 4b show the embedding representation of Si structures based on their labels, reduced
through the t-SNE algorithm. Our model effectively segregates the structures into distinct clusters, with two clusters clearly
corresponding to their respective classes. However, one cluster exhibits intermixing of structures, which challenges accurate
recognition by the model.
The confusion matrices shown in Fig. 4c-e provide insights into the performance of the elEmBERT-V1 model across the LA,
DIM, and SG datasets. The model achieves a high accuracy of approximately 0.958 on the DIM task's test set and a slightly
higher accuracy of 0.968 on the SG dataset. These confusion matrices illustrate the model's ability to identify and categorize
each structure accurately. It is worth noting that the model faces challenges in distinguishing the bcc (229) structure from
others in the SG dataset. This challenge arises from the structural similarities between the bcc structure and others, resulting
in identical geometrical representations unless the orientational order of the particles is considered.
While the CegaNN model achieved 100% efficiency in this benchmark, our model does not reach this level of performance.
Nonetheless, it demonstrates strengths in terms of versatility, speed, and simplicity for this benchmark as well.

Figure 4. Top row: Visualization of [CLS] Token Embeddings for the LA Dataset: a) reference labels and b) predicted labels.
The embeddings are represented using blue circles for liquid phase labels and orange dots for amorphous labels. Bottom
row: Confusion matrix analysis of the LA (c), DIM (d), and SG (e) datasets.

BACE
The BACE dataset consists of compounds classified as either active or inactive inhibitors of the β-secretase enzyme, which
plays a crucial role in the production of amyloid-beta peptides associated with Alzheimer's disease. This dataset contains a
total of 1,513 compounds, including 681 positive instances, making it a valuable resource for developing and evaluating
predictive models aimed at identifying potential BACE enzyme inhibitors35.
Our elEmBERT-V1 model, trained on the BACE dataset, achieved a ROC-AUC value of 0.86 in classifying compounds as
active or inactive inhibitors. The visualization of our model's predictions on the BACE dataset is presented in Fig. 5. Our
model predicts the presence of two distinct clusters, with some infiltration of both labels within each other (Fig. 5c). However,
it is important to note that the reference labels are not distributed uniformly across these visible clusters; instead, the two labels intermix, leading to errors in both active (true) and inactive (false) predictions. Nevertheless, the attained AUC value of 0.856 closely approximates the best performance obtained from a GNN model. We believe that exploring alternative
combinations of model parameters may further enhance these results.

Figure 5. Classification of BACE data: a) Confusion matrix of predicted labels on the test set. b) t-SNE feature representation
of the entire reference dataset according to their labels. c) Feature representation of the predicted labels.

BBBP
Next, we used the BBBP dataset, which comprises 2,039 chemical compounds annotated based on their ability to penetrate
the blood-brain barrier. This dataset serves as a valuable resource for training and evaluating models aimed at predicting drug
candidates' permeability through the blood-brain barrier35. Remarkably, our predictive model achieved a high ROC-AUC
value of 0.905, ranking as the second-best value among other models.
As before, we present Fig. 6, which includes the confusion matrix of the test set and t-SNE plots, illustrating the feature
representation of the labels. As can be seen, the model successfully separates compounds according to their labels; however, the primary source of errors again arises from the diffuse boundary region of the reference data, across which our model draws a sharp decision boundary.

Figure 6. Classification of BBBP data: a) Confusion matrix of predicted labels on the test set. b) t-SNE feature representation
of the entire reference dataset according to their labels. c) Feature representation of the predicted labels.

Clintox
The ClinTox dataset is a valuable resource for studying the clinical toxicity profiles of chemical compounds. It provides data
on two crucial toxicity endpoints: clinical trial toxicity and FDA approval status. Researchers use this dataset to develop
predictive models and evaluate the safety profiles of compounds, aiding in the early identification of potentially toxic
substances during the drug development process35. The ClinTox dataset contains 1,491 compounds.
The ClinTox model achieves a ROC-AUC of approximately 0.951 on the FDA approval task, as demonstrated in Fig. 7. Nevertheless, this dataset features only 94 negative instances, which leads to the confusion matrix showing zero predicted False values. A more detailed analysis of the t-SNE projections shows that our model identifies the region with the highest concentration of negative values, yielding accurate predictions for all points within this limited area; reliably selecting the remaining negative instances, however, would require a more complex criterion. We believe that increasing the embedding size, the number of attention heads, and the number of encoder layers may further enhance these results.

Figure 7. Classification of ClinTox FDA approval task: a) The confusion matrix of predicted labels on the test set. b) The t-
SNE feature representation of the entire reference dataset according to their labels. c) The feature representation of the
predicted labels.

HIV
The HIV dataset from MoleculeNet originates from an antiviral screen that tested the ability of compounds to inhibit HIV replication; the associated machine learning task is to predict whether a compound is active against the virus. The dataset contains approximately 41,000 distinct structures, of which 1,443 are considered
positive cases. For the task at hand, the achieved AUC score is an impressive 0.98. Notably, the primary source of erroneous
predictions lies in the positive classification of negative compounds, as presented in the confusion matrix plot (Fig. 8a). Fig.
8b illustrates that some positive data points continue to mix with negative ones, contributing to these misclassifications.
Furthermore, a diffuse region housing negative values (found in the lower-right region) also contributes to these inaccuracies.
Despite the notable occurrence of incorrect positive predictions, it's worth emphasizing that our model demonstrates a robust
capability to effectively categorize HIV compounds. Importantly, the highest AUC score achieved by other models remains
considerably lower at 0.778, significantly lagging behind our results.

Figure 8. Classification of HIV data: a) Confusion matrix of predicted labels on the test set. b) t-SNE feature
representation of the entire reference dataset according to their labels. c) Feature representation of the predicted labels.

SIDER
The SIDER dataset serves as a comprehensive pharmacovigilance resource, containing structured information on
drug-associated side effects. Curated from diverse sources such as clinical trials, regulatory reports, and medical
literature, it offers a systematic compilation of adverse drug reactions associated with various pharmaceutical
interventions. The SIDER dataset plays a vital role in assessing drug safety, understanding adverse reaction
patterns, and informing clinical decision-making and drug development. It comprises 27 individual tasks, each
requiring a corresponding model for fitting. The training results are presented in Table 2, where the average AUC
value across all tasks is approximately 0.78. This notably exceeds the previous best-predicted value of 0.659 (over
all tasks). The task with the lowest AUC value was the initial SIDER task, concerning Hepatobiliary disorders. The Meta-MGNN (MMGNN) model achieved a score of 0.763 (ref. 43) for this task, compared to 0.635 for our model. However, in the other tasks, our model demonstrates comparable or superior performance. The average AUC across all tasks for our model is also higher than the value reported for MMGNN, which was averaged over only the first six tasks. In
Fig. 9, both the confusion matrix of the test set and t-SNE plots for the compound embeddings are illustrated.
These visualizations reveal that labels within the feature space of the reference data are intermingled, necessitating
a more intricate model than the one employed in this study for effective label separation. Nonetheless, the model
proves better suited for other tasks, providing satisfactory results and improvements over prior predictions.

Table 2. ROC-AUC performances of various models on the SIDER dataset. MMGNN denotes the prior top-
performing results43. The last column presents the elEmBERT model's average performance across all tasks. The
Bold entries signify the highest performance, while underlined values indicate the second-best performance.
SIDER N 1 2 3 4 5 6 7 8 9 10 11 12 13 14
V0 0.626 0.756 0.972 0.735 0.843 0.736 0.958 0.846 0.775 0.712 0.748 0.930 0.802 0.859
V1 0.635 0.723 0.976 0.700 0.881 0.677 0.957 0.865 0.757 0.662 0.769 0.918 0.785 0.838
MMGNN 0.754 0.693 0.723 0.744 0.817 0.741 - - - - - - - -
SIDER N 15 16 17 18 19 20 21 22 23 24 25 26 27 Ave
V0 0.841 0.675 0.898 0.812 0.792 0.761 0.843 0.750 0.909 0.669 0.758 0.952 0.798 0.778
V1 0.877 0.732 0.921 0.833 0.781 0.798 0.873 0.781 0.918 0.545 0.726 0.962 0.731 0.777
MMGNN - - - - - - - - - - - - - 0.747

Figure 9. Classification of SIDER-1 data: a) Confusion matrix of predicted labels on the test set. b) t-SNE feature
representation of the entire reference dataset according to their labels. c) Feature representation of the predicted labels.

Tox21
The Tox21 dataset is a collection of chemical compounds evaluated for their toxicity against a panel of 12 different biological
targets. With over 8,000 compounds, it serves as a valuable resource for predicting the toxicity and potential adverse effects
of various chemical compounds. Our model, trained on the Tox21 dataset, demonstrated impressive performance, achieving
an average AUC of 0.96 across all 12 toxicity prediction tasks35. The results of these individual tasks are presented in Table
3, enabling a comprehensive evaluation of the model's performance on each toxicity prediction within the Tox21 dataset.
Comparing our results with those of the MMGNN model highlights the significant advantages of our approach. Fig. 10 shows
the confusion matrix of the test set and the t-SNE projection representing the features of the sr-mmp task in the Tox21 dataset.
As shown, our model predicts distinct patterns in the t-SNE projections, with each label value occupying a specific region
(Fig. 10b). The molecular embedding visualizations are also available in the MMGNN model report for the sr-mmp task43. In
contrast, our feature space exhibits more structure, with positive values being less dispersed across all compounds; only a few points lie far from the positive-value region. Both elEmBERT models
successfully identify the boundary between these two classes and make predictions (Fig. 10c). Errors primarily arise from
diffuse boundary regions and points located far from the true cluster. This observation holds true for all tasks within the Tox21
dataset.

Table 3. ROC-AUC performances of different tasks from the Tox21 dataset. MMGNN denotes the prior top-
performing results43. The last column presents the elEmBERT model's average performance across all tasks. The
Bold entry signifies the highest performance, while underlined values indicate the second-best performance.
Model   nr-ahr  nr-ar-lbd  nr-arom  nr-ar  nr-er-lbd  nr-er  nr-ppar-g  sr-are  sr-atad5  sr-hse  sr-mmp  sr-p53  Ave
V0      0.947   0.987      0.973    0.982  0.972      0.924  0.991      0.908   0.982     0.975   0.935   0.970   0.961
V1      0.953   0.981      0.972    0.982  0.976      0.930  0.989      0.911   0.984     0.976   0.941   0.970   0.958
MMGNN   -       -          -        -      -          -      -          -       -         0.748   0.804   0.790   0.781

Figure 10. Classification of sr-mmp data from Tox21 dataset: a) Confusion matrix of predicted labels on the test set. b) t-SNE feature
representation of the entire reference dataset according to their labels. c) Feature representation of the predicted labels.

The binary classification results of our model for organic compounds exemplify its exceptional efficiency in predicting the
behavior of interactions between organic compounds and protein molecules. By accurately classifying these compounds, our
model provides valuable insights into their potential effects and interactions within biological systems. This capability holds
significant promise for drug discovery, as it enables the identification of organic compounds that have a high likelihood of
binding to specific protein targets and exerting desired therapeutic effects.

4 Conclusions
In conclusion, the deep learning model presented in this paper represents a significant advancement in the application of
machine learning to computational chemistry. By integrating the attention mechanism and a transformer-based approach, our
model can capture both local and global properties of chemical compounds, enabling highly accurate predictions of chemical
properties that outperform similar approaches. Our innovative combination of principal component analysis and k-means
clustering for sub-elements accounts for the nuanced effects stemming from electronic structure, a fact confirmed through the
analysis of numerous chemical databases. Our classification approach, which relies on compound embeddings, has
substantially improved prediction accuracy compared to previously published scores. Additionally, t-SNE projections provide
valuable insights into the classification mechanisms and can pinpoint sources of erroneous predictions. Beyond accurately
predicting desired properties, we believe that our model has the potential to illuminate the underlying reasons behind
structure/property relationships.

Acknowledgements
The work has partially been carried out within the framework of the EUROfusion Consortium and received funding from the
Euratom research and training programme by Grant Agreement No. 101052200-EUROfusion. SS has received funding from
the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Grant
Agreement No. 847476. The views and opinions expressed herein do not necessarily reflect those of the European
Commission. The computational results have been obtained using the HPC infrastructure LEO of the University of Innsbruck.

Data availability
All data used in this paper are publicly available and can be accessed from various sources. The structure files for the MP
metallicity dataset are accessible at https://matbench.materialsproject.org/. The LA, SG, and DIM datasets are available at
https://github.com/sbanik2/CEGANN/tree/main/pretrained. The BACE, BBBP, Clintox, HIV, and SIDER datasets can be
retrieved from https://moleculenet.org/. Structure files for the Tox21 dataset can be obtained from
https://tripod.nih.gov/tox21/challenge/data.jsp. The source code used in this study is available at
https://github.com/dmamur/elembert, and detailed Python notebooks for replicating all calculations can be found on the
corresponding GitHub page.

References
1. Ng, M.-F., Zhao, J., Yan, Q., Conduit, G. J. & Seh, Z. W. Predicting the state of charge and health of batteries using
data-driven machine learning. Nat. Mach. Intell. 2, 161–170 (2020).
2. Liu, Y., Guo, B., Zou, X., Li, Y. & Shi, S. Machine learning assisted materials design and discovery for
rechargeable batteries. Energy Storage Mater. 31, 434–450 (2020).
3. Sawant, V., Deshmukh, R. & Awati, C. Machine learning techniques for prediction of capacitance and remaining
useful life of supercapacitors: A comprehensive review. J. Energy Chem. 77, 438–451 (2023).
4. Iwasaki, Y. et al. Machine-learning guided discovery of a new thermoelectric material. Sci. Rep. 9, 1–7 (2019).
5. Akhter, M. N., Mekhilef, S., Mokhlis, H. & Shah, N. M. Review on forecasting of photovoltaic power generation
based on machine learning and metaheuristic techniques. IET Renew. Power Gener. 13, 1009–1023 (2019).
6. Toyao, T. et al. Machine Learning for Catalysis Informatics: Recent Applications and Prospects. ACS Catal. 10,
2260–2297 (2020).
7. Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov.
18, 463–477 (2019).
8. Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. 1st Int.
Conf. Learn. Represent. ICLR 2013 - Work. Track Proc. 1–12 (2013).
9. Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature.
Nature 571, 95–98 (2019).
10. Hansen, K. et al. Machine learning predictions of molecular properties: Accurate many-body potentials and
nonlocality in chemical space. J. Phys. Chem. Lett. 6, 2326–2331 (2015).
11. Jaeger, S., Fulle, S. & Turk, S. Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. J.
Chem. Inf. Model. 58, 27–35 (2018).
12. Goh, G. B., Hodas, N. O., Siegel, C. & Vishnu, A. SMILES2Vec: An Interpretable General-Purpose Deep Neural
Network for Predicting Chemical Properties. (2017) doi:10.475/123.
13. Zhang, Y. F. et al. SPVec: A Word2vec-Inspired Feature Representation Method for Drug-Target Interaction
Prediction. Front. Chem. 7, 1–11 (2020).
14. Stanev, V. et al. Machine learning modeling of superconducting critical temperature. npj Comput. Mater. 4, (2018).
15. Chen, C., Ye, W., Zuo, Y., Zheng, C. & Ong, S. P. Graph Networks as a Universal Machine Learning Framework
for Molecules and Crystals. Chem. Mater. 31, 3564–3572 (2019).
16. Kong, S. et al. Density of states prediction for materials discovery via contrastive learning from probabilistic
embeddings. Nat. Commun. 13, 1–12 (2022).
17. Gori, M., Monfardini, G. & Scarselli, F. A new model for learning in graph domains. Proc. Int. Jt. Conf. Neural
Networks 2, 729–734 (2005).
18. Zhang, S., Liu, Y. & Xie, L. Molecular Mechanics-Driven Graph Neural Network with Multiplex Graph for
Molecular Structures. 1–14 (2020).
19. Shui, Z. & Karypis, G. Heterogeneous molecular graph neural networks for predicting molecule properties. Proc. -
IEEE Int. Conf. Data Mining, ICDM 2020-Novem, 492–500 (2020).
20. Fung, V., Zhang, J., Juarez, E. & Sumpter, B. G. Benchmarking graph neural networks for materials chemistry. npj
Comput. Mater. 7, 1–8 (2021).
21. Xie, T. & Grossman, J. C. Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable
Prediction of Material Properties. Phys. Rev. Lett. 120, (2018).
22. Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. Phys. Rev. B - Condens. Matter
Mater. Phys. 87, 1–16 (2013).
23. Huo, H. & Rupp, M. Unified Representation of Molecules and Crystals for Machine Learning. Mach. Learn. Sci.
Technol. 3, 045017 (2022).
24. Behler, J. Four Generations of High-Dimensional Neural Network Potentials. Chem. Rev. 121, 10037–10072
(2021).
25. Schütt, K. T. et al. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions.
Adv. Neural Inf. Process. Syst. 2017-December, 992–1002 (2017).
26. Unke, O. T. & Meuwly, M. PhysNet: A Neural Network for Predicting Energies, Forces, Dipole Moments, and
Partial Charges. J. Chem. Theory Comput. 15, 3678–3693 (2019).
27. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017-Decem, 5999–6009 (2017).
28. Louis, S. Y. et al. Graph convolutional neural networks with global attention for improved materials property
prediction. Phys. Chem. Chem. Phys. 22, 18141–18148 (2020).
29. Billinge, S. J. L. The rise of the X-ray atomic pair distribution function method: A series of fortunate events. Philos.
Trans. R. Soc. A Math. Phys. Eng. Sci. 377, (2019).
30. Hjorth Larsen, A. et al. The atomic simulation environment—a Python library for working with atoms. J. Phys.
Condens. Matter 29, 273002 (2017).
31. Jain, A. et al. Commentary: The materials project: A materials genome approach to accelerating materials
innovation. APL Mater. 1, (2013).
32. Ong, S. P. et al. The Materials Application Programming Interface (API): A simple, flexible and efficient API for
materials data based on REpresentational State Transfer (REST) principles. Comput. Mater. Sci. 97, 209–215
(2015).
33. Banik, S. et al. CEGANN: Crystal Edge Graph Attention Neural Network for multiscale classification of materials
environment. npj Comput. Mater. 9, 1–12 (2023).
34. Ziletti, A., Kumar, D., Scheffler, M. & Ghiringhelli, L. M. Insightful classification of crystal structures using deep
learning. Nat. Commun. 9, 1–10 (2018).
35. Wu, Z. et al. MoleculeNet: A benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
36. Kuhn, M., Letunic, I., Jensen, L. J. & Bork, P. The SIDER database of drugs and side effects. Nucleic Acids Res. 44,
D1075–D1079 (2016).
37. O’Boyle, N. M. et al. Open Babel. J. Cheminform. 3, 1–14 (2011).
38. RDKit: Open-source cheminformatics. https://www.rdkit.org.
39. Chen, C. & Ong, S. P. AtomSets as a hierarchical transfer learning framework for small and large materials datasets.
npj Comput. Mater. 7, (2021).
40. Li, Y. et al. GLAM: An adaptive graph learning method for automated molecular interactions and properties
predictions. Nat. Mach. Intell. 4, 645–651 (2022).
41. Li, P. et al. TrimNet: learning molecular representation from triplet messages for biomedicine. Brief. Bioinform. 22,
bbaa266 (2021).
42. Baek, J., Kang, M. & Hwang, S. J. Accurate Learning of Graph Representations with Graph Multiset Pooling. 1, 1–
22 (2021).
43. Guo, Z. et al. Few-shot graph learning for molecular property prediction. Web Conf. 2021 - Proc. World Wide Web
Conf. WWW 2021 2559–2567 (2021) doi:10.1145/3442381.3450112.
