
Conference Paper · January 2018


DOI: 10.1145/3152494.3152522


Priority Based Functional Group Identification of Organic
Molecules using Machine Learning
Rushikesh Nalla
Group in Computational Science and HPC
DA-IICT, Gandhinagar, India

Rajdeep Pinge
Group in Computational Science and HPC
DA-IICT, Gandhinagar, India

Manish Narwaria
Group in Computational Science and HPC
DA-IICT, Gandhinagar, India
manish_narwaria@daiict.ac.in

Bhaskar Chaudhury
Group in Computational Science and HPC
DA-IICT, Gandhinagar, India
bhaskar_chaudhury@daiict.ac.in

ABSTRACT

Functional groups in organic compounds determine the properties of the compounds/molecules. When multiple functional groups are present, the dominant functional group determines the majority of the properties of the compound. Hence priority based identification of functional groups is an important problem in chemistry. Fourier-transform Infrared spectroscopy (FTIR) is a commonly used spectroscopic method for identifying the presence or absence of functional groups within a compound, and the current approach for this task mainly relies on visual inspection and analysis of the FTIR spectral data. However, such a visual identification process by humans is error prone, especially when patterns in the FTIR spectrum overlap, resulting in loss of uniqueness of the features which help in identification of different functional groups in the unknown sample. Therefore, the main goal of this paper is to develop a machine-learning based classification system which can perform priority based functional group identification of organic molecules. To the best of our knowledge, this is the first effort to address this problem using machine learning (ML), and a unique aspect of our study is the incorporation of domain specific information into the process of classification by employing a set of priority rules generated from expert knowledge. We have carried out an extensive study on real IR spectral data, first using a rule based approach and then using ML in an effort to improve the classification accuracy. Our analysis indicates that the basic rule based method is reasonably effective in predicting the presence (or absence) of functional groups. However, such an approach is practically not accurate enough for the more challenging problem of priority based identification, and ML based classification offers much higher identification accuracies in this case. The primary reason is that an ML algorithm can adaptively exploit data patterns to classify the functional group, unlike the rule-based approach which uses a fixed set of rules for the said purpose. Finally, we have also carried out extensive statistical analysis of the results by using confidence intervals and permutation tests, in an effort to gain more descriptive information about the learning process, and not simply treat it as a black box.

KEYWORDS

Organic Compounds, Chemical Bonds, Functional Groups, Functional Group Priority, Fourier Transform Infrared (FTIR) Spectroscopy, Pattern Identification, Machine Learning (ML)

1 INTRODUCTION

In Organic Chemistry, Functional Groups are specific groups of bound atoms which appear together within molecules and determine the chemical and physical properties of the compounds/molecules. Regardless of which molecule contains it, and however large or small that molecule is, the same functional group will go through similar chemical reactions, and a molecule containing a particular functional group is expected to exhibit reactions characteristic of that functional group. In other words, functional groups are centers of chemical reactivity of a molecule and it is necessary to identify the functional groups within a molecule when naming it [1]. Accurate identification of functional groups has significant applications in several fields such as biochemistry, drug discovery, medicinal chemistry, toxicity assessment, pharmaceuticals, molecular biology and chemical nomenclature. There are many real-life situations where knowledge of the chemical properties of a mixture of unknown organic molecules is important in order to understand its behaviour. In such cases, it is even more important to identify the functional groups present in the mixture since they represent the behaviour and properties of the mixture.

Fourier-transform Infrared (FTIR) spectroscopy is an important, commonly used spectroscopic method for identifying the presence or absence of functional groups within a compound and thereby helps in the structural identification of unknown molecules [2, 3]. It is based on the interaction of infrared light with molecules present in a sample and the absorption of particular frequencies of the IR radiation. Each chemical bond within a molecule has a certain characteristic vibrational frequency and corresponding wave-number (the reciprocal of the wavelength). When IR light is incident on a sample, the frequencies of absorption correlate to the vibration of specific chemical bonds present in the molecule. Examination of the transmitted light gives a spectrum of transmittance with respect to the incident frequency of IR light, thereby revealing how much energy is absorbed at each frequency (or wavelength). Generally a conventional infrared instrument records spectra from an upper limit of around 4000 cm−1 down to 400 cm−1. Characteristic absorption (or transmittance) patterns in the IR spectrum of a sample molecule help in deducing the presence or absence of a specific functional group in the sample. The spectrum is rich in information, with complex patterns, and consists of specific features. A spectroscopist (human) interprets the data by using the well established relationships between the molecular structure and patterns in the
obtained spectra. In addition to these rules, the visual aspects of spectroscopy, which include recognizing characteristic patterns and interpreting them relative to the structure, require experience and play an important role in accurate interpretation. The IR spectrum of an unknown sample can be compared with previously known reference spectra patterns, leading to identification of unknown functional groups in the sample molecule, and this forms the basis of machine learning based spectral searching.

There are millions of organic compounds and the most important reason for classifying compounds by their functional groups is that it classifies their chemical behaviour. However, if more than one functional group is present in the compound, then the properties of the compound are determined by the most dominant functional group present in the compound [4]. The dominant functional group is determined by the priority order of functional groups and plays an important role in determining the properties of complex organic molecules. Therefore the interpretation of FTIR spectra is not simply a matter of assigning group frequencies to molecules but requires classification based on functional group priority. Function f denotes the mapping between features and corresponding functional groups.

f : Features → Functional Group of Compound.

Usually, function f is identified by humans through certain rules as well as by experience. The common features include the transmittance level or absorption level of IR light, the width of peaks appearing in the spectrum, the number of peaks in the given range, etc. We believe machines can identify a better f through supervised machine learning methods, which will help in more accurate identification of priority based functional groups. Most of the works in existing literature regarding application of machine learning for interpretation of IR spectra focus on determining the presence or absence of a functional group in a molecule.

To the best of our knowledge, this is the first report describing the priority based functional group identification of organic molecules using machine learning on FTIR data. We first look at the rule based method, in which we try to replicate how a human being would classify the spectrum, and state its limitations in identifying the dominant functional group. We then see if a machine can identify better structures among the human identifiable features in the intermediate approach. Finally, we use the entire spectrum and let the machine extract the best hidden features by using different machine learning algorithms. In the end we compare the results obtained in all three approaches and validate our models by performing statistical tests.

2 RELATED WORK

During the early 1990s, a lot of research work involving basic ANN (Artificial Neural Network) methods for identification of chemical structural features from the FT-IR spectrum of organic molecules was reported. A very detailed study of identifying functional groups from IR spectra using ANN was published by Robb and Munk [5] in 1990. They used a one-layer architecture of ANN to classify a large number of molecules. They represented the spectrum range of 4000 cm−1 to 400 cm−1 using 640 uniformly spaced data points and emphasized identifying the presence or absence of about 128 structural features. So essentially, there were 128 binary classifiers in the working model. But the main drawback of this work was that it used a linear activation function to learn and classify molecules into different functional groups, which was criticized in the book by Minsky and Papert [6], while Fessenden and Gyorgyi [7] termed it an "oversimplification" of the problem. Robb and Munk, along with Madison, published another paper [8] which used one hidden layer in the architecture on the same dataset. In this case, the number of features was reduced to 36 and a sigmoid activation function was used. A detailed analysis of the method was done by varying various parameters. The accuracy achieved on the training data set was nearly 80% while it was just over 60% on the testing data set. Fessenden and Gyorgyi [7] also used a single hidden layer architecture but they did the experiment only on 6 major functional groups and also checked for the presence of 3 more bonds. So, in total, 9 element vector predictions were done. This was insufficient since many of the important functional groups as well as their priority order were not considered during identification.

In the next few years a lot of studies were conducted by varying the number of input points, number of structural features, number of classes, output threshold, etc. [8–14]. Meyer and Hobert [15] further used Principal Component Analysis (PCA) to reduce the spectral data required for correct prediction. Visser, Luinge and van der Maas [16] focused on identification of individual bonds which have overlapping IR spectrum ranges. However, all these works have two major drawbacks. Firstly, they used only the standard Multilayer Perceptron (MLP) ANN with one hidden layer, and secondly they predicted only the presence/absence of each functional group and/or structural feature. Therefore the above works lacked the vital information about the priority order of functional groups, which helps in determining the name and chemical properties of the compound if it contains multiple functional groups.

In 2000, Tchistiakov and Ruckebusch [17] used different preprocessing techniques like wavelet and Fourier-transform coefficients for reduction of spectra in combination with different Artificial Neural Networks (ANNs) for non-linear hierarchical modelling to cope with scarcity of samples. Brown and Lo [18] did extensive analysis on classification using the Radial Basis Function (RBF) classification technique of ANN. But these two also neglected the priority order of functional groups and focused only on the presence or absence of each feature individually.

In 2001, Tanabe and group [19] used 10000 samples with 15 functional groups and more than 100 structural features. They also checked only for presence or absence of all these structural features although the scale of the experiment was very large. They were able to achieve an average accuracy of 80% for classification of the main functional groups and around 70% when halogens and sulphur groups were considered. They believe that this reflects the limitation of the infrared method in identifying structures from spectra. The spectral database system (SDBS) [20] constructed by the above group is an open-source website containing over 50000 infrared spectra. We have used IR spectra from this open-source database (images of IR spectra) for our study.

From the above literature review it is clear that most of the work on this topic considered only the presence/absence of individual functional groups, bonds and other structural features in an FTIR spectrum of a molecule. Functional group priority order was not considered by any of these works. Furthermore, the works were done using only the basic ANN methods. Our aim is to consider
and exploit the priority order of functional groups while assigning a single label to a molecule, since the properties of the molecule are determined by the dominant functional group.

3 PROBLEM DESCRIPTION AND IR SPECTRAL DATA

3.1 Problem Description

Let X be an n × m data matrix (the i-th row and j-th column vectors can be denoted as $X_i$ and $X_j$, respectively). As we are using FTIR spectrum data, the elements of X typically represent the value of transmittance (feature) that is related to a given functional group. Further, a discrete value $y_i \in C$, where $C = [c_1, c_2, \ldots, c_B]$ represents the discrete class label vector (in our case B = 14), is associated with each data point $X_i$.

The labelled data set will be $D = \{(X_i, y_i)\}_{i=1}^{n}$. Then the problem can be formulated as one of learning the function f which maps the data values into the corresponding class label, i.e. f : X → C. However, unlike the traditional supervised classification problem where the class label C is well-defined (i.e. the given sample belongs completely to one of the classes), the classes in our case tend to be fuzzy. Specifically, a compound will almost always contain more than one functional group. Therefore, a seemingly logical approach from the machine learning view point would be the use of a fuzzy classification approach. However, from the view point of the application (i.e. functional group identification) and the spectroscopic data, the chemical properties of the compound are determined, to a large extent, by the dominant group, and are practically less affected by the presence of other groups. In light of such context and data information, a more effective approach would be priority based classification. Such a priority based approach entails exploiting the hierarchy of functional groups, i.e. the more dominant group will decide the chemical properties of the given test sample. Such context-specific information is available from domain experts, and is shown in Table 1. The ordering in this table shows the priority-wise arrangement of importance of functional groups. For instance, if the IR data indicates the presence of carboxylic acid, acid halide and amide, then the corresponding sample will inherit the properties of carboxylic acid (which is first in the order of priority), and needs to be classified as carboxylic acid. Therefore, the mapping f : X → C needs to consider this information.

Table 1: PRIORITY ORDER OF FUNCTIONAL GROUPS

Priority  Functional Group              Prefix           Suffix
1         Carboxylic Acid               carboxy-         -carboxylic acid, -oic acid
2         Ester                         (R)-oxycarbonyl  -oate
3         Acid Halide                   halocarbonyl-    -oyl halide
4         Amide                         carbonyl-        -carboxamide, -amide
5         Nitrile                       cyano-           -nitrile
6         Aldehyde                      formyl-          -al, -carbaldehyde
7         Ketone                        oxo-             -one
8         Alcohol                       hydroxy-         -ol
9         Thiol                         mercapto-        -thiol
10        Amine                         amino-           -amine
11        Arene (cyclic arrays of C=C)  -                benzene
12        Alkene                        alkenyl          -ene
13        Alkyne                        alkynyl          -yne
14        Alkane                        alkyl            -ane
15        Ether                         alkoxy           -ane
16        Alkyl Halide                  halo-            -ane
17        Nitro                         nitro-           -ane

As already stated, the current approach to estimate f largely depends on visual inspection by experts. Given the peculiarities and fine patterns in IR spectroscopic data, such human-based classification is not only error prone (we provide specific motivating examples in the next sub-section) but also time-consuming. Therefore, in this paper, we consider the use of machine learning to derive the mapping f, and evaluate its effectiveness against a rule-based approach that humans typically use to classify functional groups from FT-IR data.

3.2 Peculiarities in IR Spectral Data

As shown in Figure 1, the spectrum of a molecule is a graph of Transmittance (%) vs Wave-number (cm−1). A basic characterization (absence/presence of a functional group) of an unknown sample is possible by investigating the IR spectrum using first principles and well established rules for interpreting IR spectra patterns. However, it is generally difficult to identify the dominant functional group in the spectrum because of the presence of multiple functional groups. The presence of various chemical bonds introduces specific patterns in the IR spectrum. These patterns are unique to each functional group but may not be distinguishable upon visual inspection. For example, the presence of chemical bonds causes one or more peaks ('inverted peaks', which are from here on referred to simply as 'peaks') to appear in the spectra, and these peaks are generally Gaussian shaped. Various chemical bonds have different wave-number ranges and some even have overlapping ranges. Some of the peaks may superimpose on each other if the ranges overlap. So, it is difficult to identify which bonds are actually present from this mixture of Gaussian shaped peaks. Also, chemical effects or interactions between atoms can cause a shift in the peaks. This makes it difficult to visually identify all the unique patterns in the spectrum. Therefore, to substitute for the current, heavily used and error prone visual inspection process, machine learning based methods are required which will aid in better identification of the dominant functional group present in a molecule/compound.

Figure 1: Examples of FTIR spectrum. The spectrum on the left is the original spectrum taken from the SDBS database [20]. It also shows the rule based features that have been considered in the rule based approach: (1) Transmittance Level, (2) Width of Peak, (3) Number of Peaks in the given Range and (4) Sum of Widths of All Peaks in the given Range. The spectrum on the right is the one regenerated after extracting the data in quantified form from the original image on the left.

Studying the visual identification process further, a human would be able to easily identify the functional group in some cases. For example, the graphs in Figure 2 show how a human would observe the spectrum to extract essential features which would help in identification of the dominant functional group. In the spectrum on the left, two functional groups, namely Carboxylic Acid (made of O-H, C=O, C-O and C-H bonds) and Alcohol (made of O-H, C-O and C-H), are present. But the Carboxylic Acid group is higher in the priority order given in Table 1. Hence, according to the rules, any average human being will be able to identify this as a Carboxylic Acid. On the contrary, for the spectrum on the right in Figure 2, it would be very difficult for an amateur, inexperienced human being to identify the compound because the spectrum is complex and the standard rules are unable to distinguish it.

Figure 2: Steps in visual identification of functional groups. The spectrum on the left shows that visual identification of functional groups is possible due to the simplicity of the spectrum. The spectrum on the right shows the peculiarities, and hence the problems in identification, due to its complex nature. The original images are taken from the SDBS database [20].

In order to understand the problems in the widely used approach of visual identification, we first investigate the rule-based approach used by humans for classification. Secondly, we study an intermediate method where features used by humans are given to the machine learner to understand the patterns and classify the compounds. Results of both these methods are then compared. Finally, machine learning is used on the whole data so that the machine itself can identify its own features and patterns from the data and build a mapping to identify the functional groups. Furthermore, we compare the different machine learning techniques to check whether using different learners changes the functional group accuracy and we also check the validity of the obtained results by performing certain
statistical tests. On the whole, the goal is to check whether manual feature extraction and human judgment can be replaced by machine based feature extraction and machine judgment.

3.3 Data Collection

The SDBS Database [20] is a large open-source database with spectra of over 50,000 compounds. However, data on the SDBS Database is in the form of images which cannot be used directly. The spectrum on the left in Figure 1 is an example of the image of the FT-IR spectrum of a molecule.

Quantification of data from the images is an essential step before data processing. Since our aim is to predict a single dominant functional group present in the molecule, we do not check for the presence or absence of any of the sub-functional groups. We rather let the machine identify these sub-features on its own and accordingly predict the functional group of the molecule. Furthermore, we can work on a small data-set because the number of labels we focus on is comparatively small. Also, we can do with fewer points than the 3600 suggested by Tanabe et al. [19] because the broad pattern of the spectrum remains the same even for a smaller number of points.

Since we are using the same database as used by Tanabe et al., we have taken the help of their paper to finalize the list of functional groups to be considered for our experiment. Out of the 15 functional groups used by Tanabe et al. we have used 12. Two of the three remaining groups contain Phosphorus (P) and Sulfur (S) as part of the compounds. These are minor groups and S is generally a substitute for Oxygen (O) while P is a substitute for Nitrogen (N). Hence, we have not considered them. The wave-number range for halogen bonds is from 1400 cm−1 to 500 cm−1, which is part of the fingerprint region of the spectrum. In this region, there are a lot of peaks irrespective of the presence of halogen bonds because of the characteristics of the IR spectrum. Thus we might encounter a lot of false-positives if we include halogens, which might reduce the overall accuracy. Hence we have not included the last remaining group, the halogens. Instead, we have added two more functional groups - pure alkanes (C-C and C-H bond) and alkynes (C triple bond C, C-C, and C-H bond). The reason for this addition is that many of these pure hydrocarbons exist in nature and from Table 2 it is clear that they have distinct ranges where the peaks may exist and which can be identified uniquely. Among the groups mentioned in Table 1, we have taken into consideration the functional groups in white background whereas the groups in gray background have not been considered. So, in effect we have taken 14 functional groups for our problem. Except for the 'nitro' and 'alkyne' groups, all other groups have 100 samples each. Due to unavailability of samples in the SDBS Database, we could get only 92 samples for the 'nitro' group and 49 samples for the 'alkyne' group.

The obtained data in image form is further processed and quantified. Data cleaning, proper x-y mapping, scaling and interpolation are performed to prepare data for feature extraction. Finally, after all the processing, we obtain the spectrum on the right in Figure 1. We can observe that, visually, this graph is very close to the graph in the original data image on the left. This means that our processing accurately captures the behaviour and variations in the original data. The regenerated spectrum image contains 945 data points instead of 3600 or more covering the entire range. This shows that there is no need for 1 point for each x-value. This is because this range is sufficiently wide to contain data points which cover all the variations in the spectrum. So instead of increasing the data by interpolation, which may introduce some error, we consider a smaller number of data points.

Table 2: Wave-number Ranges of Chemical Bonds

Bond               Wave-number (cm−1)
C-H                2850-2950 (Alkane)
                   3000-3100 (Alkene)
                   3290-3310 (Alkyne)
                   3000-3040 (Aromatic)
C=C                1620-1680 (Alkene)
                   1400-1620 (Aromatic)
C triple bond C    2100-2260
C=O                1690-1740 (Aldehyde)
                   1710-1780 (Acid)
                   1630-1690 (Amide)
                   1680-1750 (Ketone)
                   1735-1750 (Ester)
N-H                3300-3500 (Amine)
                   3100-3500 (Amide)
O-H                2500-3200 (Acid)
                   3200-3700 (Alcohol/Phenol)
C triple bond N    2220-2260 (Nitrile)
C-N                1025-1220 (alkyl, amine, amide)
                   1250-1360 (aryl)
C-O                1040-1210 (alcohols/phenols)
                   1210-1320 (acid)
Nitro N-O bonds    1515-1560 and 1345-1385

4 RULE BASED METHOD

We refer to the functional group identification process based on the standard rules used for classification as the 'Rule Based Method'. In this method, a set of deterministic rules is used to identify the functional group in a given compound. The result is reproducible and can be easily explained because the process is a white-box, as against the black-box machine learning methods. This Rule Based Method is an attempt to replicate how an average human being (spectroscopist) would classify the spectrum given a set of standard rules.

The bond ranges given in Table 2 are the base on which the basic features and rules of identification depend heavily. These ranges give the exact locations in the spectra where the specific bonds may exist and from which rules of identification of functional groups can be built; the well established rules can be found in any standard reference such as [4, 21–24].

Due to the peculiarities in IR spectra mentioned in Section 3.2, it is clear that the visual identification process is complex even when the rules of identification are well-defined. Hence, it is important to first get a benchmark for the Rule Based Method in order to understand what the accuracy would be given the set of standard rules.
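As a rough illustration of how such deterministic rules could be combined with the priority order of Table 1, the sketch below marks a bond as present when the transmittance dips below a threshold inside its Table 2 range, derives the set of detected groups, and returns the one that ranks highest. The threshold value, the per-group rules in `GROUP_RULES`, and all function names are hypothetical and chosen only for illustration; the rule set actually used in this study is not spelled out in the text.

```python
# Minimal sketch of a priority-aware rule check. The threshold and the
# per-group rules are illustrative assumptions, not the paper's rule set.

# Priority order copied from Table 1 (earlier entry = higher priority).
PRIORITY = [
    "carboxylic acid", "ester", "acid halide", "amide", "nitrile", "aldehyde",
    "ketone", "alcohol", "thiol", "amine", "arene", "alkene", "alkyne",
    "alkane", "ether", "alkyl halide", "nitro",
]

# A few wave-number ranges (cm^-1) copied from Table 2.
BOND_RANGES = {
    "O-H (acid)": (2500, 3200),
    "O-H (alcohol)": (3200, 3700),
    "C=O (acid)": (1710, 1780),
    "C triple bond N": (2220, 2260),
}

# Hypothetical rules: a group counts as detected when every listed bond
# shows a peak. Real rule sets also use peak widths and peak counts.
GROUP_RULES = {
    "carboxylic acid": ["O-H (acid)", "C=O (acid)"],
    "nitrile": ["C triple bond N"],
    "alcohol": ["O-H (alcohol)"],
}

PEAK_THRESHOLD = 60.0  # assumed: transmittance (%) below this is a peak


def has_peak(spectrum, lo, hi, threshold=PEAK_THRESHOLD):
    """True if any (wavenumber, transmittance) point in [lo, hi] dips below threshold."""
    return any(lo <= w <= hi and t < threshold for w, t in spectrum)


def detect_groups(spectrum):
    """Presence/absence step: the set of groups whose bonds all show a peak."""
    return {
        group
        for group, bonds in GROUP_RULES.items()
        if all(has_peak(spectrum, *BOND_RANGES[b]) for b in bonds)
    }


def dominant_group(detected):
    """Priority step: return the detected group that ranks highest in Table 1."""
    for group in PRIORITY:
        if group in detected:
            return group
    return None


# Toy spectrum with dips for an acid O-H, an acid C=O and an alcohol O-H.
toy = [(3400, 40.0), (3000, 35.0), (2240, 95.0), (1740, 20.0)]
print(sorted(detect_groups(toy)))          # ['alcohol', 'carboxylic acid']
print(dominant_group(detect_groups(toy)))  # carboxylic acid
```

Because the priority step returns the first match, a false detection of a high-priority group hides every lower-priority group, which is one way to see why a priority-aware rule check is less forgiving than a plain presence/absence check.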
Table 3: Results of Rule Based Method

                            Functional Group Priority    Functional Group Priority
                            Not Considered               Considered
Functional Group   Total    Predictions                  Predictions
                   Samples  Matched      Accuracy        Matched      Accuracy
Carboxylic Acid    100      28           0.28            28           0.28
Ester              100      0            0               0            0
Amide              100      6            0.06            6            0.06
Cyanide/Nitrile    100      85           0.85            72           0.72
Aldehyde           100      50           0.5             16           0.16
Ketone             100      68           0.68            9            0.09
Alcohol            100      96           0.96            84           0.84
Amine              100      72           0.72            0            0
Arene/Aromatic     100      76           0.76            11           0.11
Alkene             100      73           0.73            14           0.14
Alkyne             49       7            0.14            0            0
Alkane             100      98           0.98            72           0.72
Ether              100      99           0.99            0            0
Nitro              92       32           0.35            0            0
Overall Average    1341     790          0.5891          312          0.2327

First, the rules are applied without considering the priority order of functional groups. This is what most of the past research has focused on. In this method, it is simply checked whether the predicted functional group matches the actual functional group without considering the priority. The results obtained are shown in columns 3 and 4 under the title 'Functional Group Priority Not Considered' in Table 3. For this method, the overall accuracy is 58.91%. For eight of the functional groups, the accuracy is more than 60%. This is because these functional groups have some distinctive patterns which can be easily identified upon observing the spectrum and hence can be included as a rule. For example, the cyanide or nitrile group can be easily confirmed from the presence of C triple bond N in the spectrum. Hence it shows high accuracy in column 4 in Table 3. On the other hand, some functional groups like carboxylic acid, ester, amide, aldehyde and ketone have many overlapping features and hence are difficult to distinguish by simple observations or basic rules. Therefore we get very low accuracy for these functional groups in column 4 in Table 3.

Furthermore, when the priority is considered in the Rule Based Approach (see columns 5 and 6 under the title 'Functional Group Priority Considered' in Table 3), it is more difficult to identify the functional groups because if the higher priority group is identified incorrectly, then the remaining groups are not checked for and the compound is classified incorrectly. Hence for this method, the overall accuracy is 23.27%, which is very low.

This shows that if an amateur in the field is given a set of rules and is asked to identify functional groups of organic molecules based on priority, the accuracy would hardly be 25%.

To improve this accuracy, a more sophisticated technique is required which would be able to identify patterns and correlations between these basic features. To this end, using machine learning is a viable option.

5 MACHINE LEARNING BASED METHOD

There are two possible ways of finding features for the machine learning algorithms. The first is to use the basic human identifiable features and check whether the machine is able to detect better structures and mappings. We call this the 'Intermediate Approach'. The second way is to allow the machine to develop its own features from the data, which the machine can use for classification. We have termed it the 'Complete Automation Approach'.

5.1 Intermediate Approach

In this approach we try to extract features which a human would use to correctly identify the functional group of the molecule. We term them "handcrafted features". Here we have used the bond ranges mentioned in Table 2 to identify the transmittance levels and widths of peaks in the given ranges.

For each of the 23 bond ranges that we have considered, we identify 4 features -

• Transmittance - The least transmittance value in the range, i.e. that of the deepest peak (trough).
• Peak Width - The width of the peak with the least transmittance value.
• Number of Peaks - The count of peaks in the given bond range.
• Sum of peak widths - The sum of all the peak widths in the bond range.

So in all there are 92 features for each sample which are given as input to the machine. As the features are manually extracted and the classification is done by the machine, we consider it an intermediate approach.
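The four handcrafted features per bond range can be sketched as follows. The text does not say how a peak is delimited, so this sketch assumes that a peak is a run of consecutive sample points whose transmittance lies below a fixed threshold, and that its width is the wave-number span of that run. The threshold of 80%, the default values returned for an empty range, and the function names are all assumptions made for illustration.

```python
# Sketch of the four per-range features; the peak definition is an assumption.

def find_peaks(points, threshold):
    """Group consecutive below-threshold points into peaks (troughs)."""
    peaks, current = [], []
    for w, t in points:
        if t < threshold:
            current.append((w, t))
        elif current:
            peaks.append(current)
            current = []
    if current:
        peaks.append(current)
    return peaks


def range_features(spectrum, lo, hi, threshold=80.0):
    """Return [min transmittance, width of deepest peak, peak count, sum of widths]."""
    points = sorted((w, t) for w, t in spectrum if lo <= w <= hi)
    if not points:
        return [100.0, 0.0, 0, 0.0]  # assumed default: nothing absorbed
    peaks = find_peaks(points, threshold)
    min_t = min(t for _, t in points)
    if not peaks:
        return [min_t, 0.0, 0, 0.0]
    widths = [p[-1][0] - p[0][0] for p in peaks]
    # Index of the peak that contains the lowest transmittance value.
    deepest = min(range(len(peaks)), key=lambda i: min(t for _, t in peaks[i]))
    return [min_t, widths[deepest], len(peaks), sum(widths)]


def feature_vector(spectrum, bond_ranges):
    """Concatenate the four features over all ranges (23 ranges give 92 values)."""
    features = []
    for lo, hi in bond_ranges:
        features.extend(range_features(spectrum, lo, hi))
    return features


# Toy data: one deep peak around 1720 cm^-1 and one shallow dip at 1750 cm^-1.
toy = [(1700, 90.0), (1710, 70.0), (1720, 50.0), (1730, 70.0), (1740, 90.0),
       (1750, 75.0), (1760, 90.0), (1770, 95.0), (1780, 95.0)]
print(range_features(toy, 1700, 1780))  # [50.0, 20, 2, 20]
```

Passing the 23 ranges of Table 2 to `feature_vector` yields the 92-element input described above; a different peak definition (for example one based on local minima) would change the width and count features but not the overall layout.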
5.2 Complete Automation Approach
In the second approach, we give all data points (the complete spectrum) as features to the machine learning algorithm. The transmittance levels corresponding to all the available wavenumbers are given as input features, and we expect the algorithms to identify and extract better hidden features and patterns which would allow the machine to classify the samples correctly.
We consider a truncated range of 4000 cm⁻¹ to 1000 cm⁻¹ because, as mentioned earlier, the 1400 cm⁻¹ to 500 cm⁻¹ range is called the fingerprint region and contains many peaks based on bending vibrations within the molecule. It may therefore produce some false positives. Since none of the bonds considered by us lie in the region 1000 cm⁻¹ to 400 cm⁻¹, this part of the fingerprint region should not affect the classification accuracy and hence can be neglected. As we are working with a small number of classes, we can work with few features as well. We split the range uniformly such that we get a data set with 250 features for each sample, which acts as input to the machine learning algorithm.

5.3 Implementation
We have implemented four supervised machine learning algorithms, namely Multilayer Perceptron (MLP), Support Vector Machine (SVM), K-Nearest Neighbors (KNN) and Random Forest Classifier (RFC), and compared the results for this problem [25].
Each sample is labelled according to the functional group to which it belongs. Each functional group is given a different class label in [0, n-1], where n is the number of functional groups. Since we have 14 functional groups, we have labelled them from 0 to 13. The data is preprocessed before applying a machine learning algorithm because most of the methods work better when the features are standardized, i.e. have mean 0 and variance 1. It also helps in reducing the overall computation.
To analyze the performance of each classifier, i.e. how it behaves on unseen data, a K-Fold cross validation technique is used with a K value of 10. It is ensured that in each fold every class has an equal representation. For the K-Fold cross validation accuracy, the training data is split into 10 folds, with 9 folds used for training the model and 1 fold used for testing. This is done K times (K=10) and the average accuracy is reported.

6 TEST RESULTS AND STATISTICAL ANALYSIS
We analyze the performance of various machine learning algorithms for both the intermediate approach and the complete automation approach, and compare them with the rule-based approach. To that end, the 10 fold CV accuracies are indicated in Fig. 3, in which the error bars denote 95% confidence intervals for the corresponding mean values. The results indicate that the complete automation approach performs better than the intermediate approach from both a statistical and a practical viewpoint. This may be explained by the fact that the features for the latter approach are somewhat restrictive and may lack discrimination abilities (refer to Section 3.2, where we have discussed how such handcrafted features may not provide clear distinction between functional groups). On the other hand, in the complete automation approach, we allow the ML algorithm to adaptively learn the desired mapping function f directly from the data patterns. This, however, does not rule out the possibility of using either standard or domain specific data preprocessing to make the classification process more robust and accurate.

Figure 3: Comparison of the 10 Fold CV accuracies for the intermediate and complete automation methods. The error bands indicate 95% confidence intervals of the mean accuracies.

The more important result, from the viewpoint of the targeted application, is that both the ML based approaches perform better (both practically and statistically) than the rule-based approach. This is in line with the analysis done in previous sections of the paper, and is an important conclusion from the viewpoint of applying ML to priority based functional group identification.

6.1 Further analysis using permutation tests
The results presented in the previous section indicate that the ML based approach provides promising results in comparison to the traditional and oft-utilized rule-based approach. However, ML methods require training, and the learning process is typically a black box. This is one notable limitation of the ML based approach in comparison to the rule-based method, which is completely transparent in terms of the rules employed (i.e. a white box model) to classify a given sample. Nevertheless, we can employ permutation test based analysis to gain some insights and obtain more descriptive information about the learning process in the ML based approach. We, therefore, employed two permutation tests proposed in [26]. The first one tests the null hypothesis that the ML algorithm does not exploit class structure (i.e. the connection between data and class labels). The second test assesses method performance in terms of using the features describing the data (or the data itself in case no features are extracted), and tests the null hypothesis that the ML algorithm does not use feature dependency to increase classification accuracy. For both tests, an empirical p value can be estimated as the ratio of the number of times the classification accuracy on randomized data is better than (or equal to) that on the original data (i.e. without any randomization) to the number of randomizations applied [26].
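The first test and its empirical p value can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used for the reported results: the classifier is a stand-in 1-nearest-neighbour rule, the helper names (`stratified_folds`, `cv_accuracy`, `label_permutation_p_value`) are hypothetical, and the estimate adds 1 to both the count and the number of randomizations. That correction is an assumption, chosen because it is consistent with the value 0.0099 ≈ 1/101 reported for 100 randomizations.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def stratified_folds(y: List[int], k: int, rng: random.Random) -> List[List[int]]:
    """Split sample indices into k folds, dealing each class out round-robin
    so that every class is represented as evenly as possible in each fold."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds: List[List[int]] = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


def predict_1nn(train_X, train_y, x: Sequence[float]) -> int:
    """Stand-in classifier: label of the closest training sample."""
    nearest = min(
        range(len(train_X)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(train_X[j], x)),
    )
    return train_y[nearest]


def cv_accuracy(X, y: List[int], k: int, rng: random.Random) -> float:
    """Mean accuracy over k stratified folds (train on k-1 folds, test on 1)."""
    accuracies = []
    for fold in stratified_folds(y, k, rng):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        hits = sum(predict_1nn(train_X, train_y, X[i]) == y[i] for i in fold)
        accuracies.append(hits / len(fold))
    return sum(accuracies) / k


def label_permutation_p_value(
    X, y: List[int], k: int = 10, n_perm: int = 100, seed: int = 0
) -> Tuple[float, float]:
    """Return (CV accuracy on original labels, empirical p value)."""
    rng = random.Random(seed)
    original = cv_accuracy(X, y, k, rng)
    at_least_as_good = 0
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)  # breaks the link between data and labels
        if cv_accuracy(X, shuffled, k, rng) >= original:
            at_least_as_good += 1
    # Assumed +1 correction in numerator and denominator (see lead-in).
    return original, (at_least_as_good + 1) / (n_perm + 1)
```

With this estimator, a classifier that is never matched by any of 100 label-shuffled runs gets p = 1/101, and one that is matched or beaten every time gets p = 1.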
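The randomization behind the second test, which permutes feature values within each class, can be sketched in a similar spirit. The sketch assumes that each feature column is shuffled independently among the samples of one class, which is how we read the procedure of [26]; the function name `permute_features_within_class` is hypothetical.

```python
import random
from typing import List, Sequence


def permute_features_within_class(
    X: Sequence[Sequence[float]], y: Sequence[int], rng: random.Random
) -> List[List[float]]:
    """Return a copy of X in which every feature column is shuffled
    independently among the samples that share a class label.

    Per-class distributions of each single feature are preserved, while any
    dependency between features inside a class is broken.
    """
    permuted = [list(row) for row in X]
    n_features = len(X[0])
    for label in set(y):
        rows = [i for i, lab in enumerate(y) if lab == label]
        for j in range(n_features):
            column = [X[i][j] for i in rows]
            rng.shuffle(column)
            for i, value in zip(rows, column):
                permuted[i][j] = value
    return permuted


if __name__ == "__main__":
    # Toy data: within each class the second feature is 10x the first.
    X = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50], [6, 60]]
    y = [0, 0, 0, 1, 1, 1]
    print(permute_features_within_class(X, y, random.Random(0)))
```

Re-running the cross-validation on many such randomized copies and comparing the resulting accuracies with the accuracy on the original data gives the empirical p value for this test.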
For the first permutation test, the class labels are randomly permuted between the samples. As a result, the underlying class structure is disturbed, and the goal of the test is to assess systematically the performance of the ML based method on such unstructured data. In this case, for each ML algorithm, the empirical p value was 0.0099 (< 0.05) (the data was permuted 100 times, i.e. 100 randomizations). For each iteration the 10 fold CV accuracy obtained was between 5.5% and 9.5% (in comparison, the said accuracy on the original data was nearly 80%, refer to Figure 3). In light of this, the corresponding null hypothesis can be rejected, and we conclude that all the ML algorithms exploit the class structure in the data.
As mentioned, the second permutation test was performed to test whether the ML algorithm exploits feature dependency to increase classification accuracy. Therefore, feature (data) values were permuted within each class so that the dependency between features (data), if it exists, is broken. In such a case, we expect that the classifier will obtain lower classification accuracies as compared to the one on the original data. We performed 100 randomizations, and obtained the 10 fold CV accuracies in each iteration.
As examples, we show in Fig. 4 the histograms of the resultant 10 fold CV accuracies for KNN and SVM. The original classification accuracy is also indicated in the respective plots. From these, we observe that KNN is exploiting data dependency because the accuracies on randomized data are typically lower than on the original data. This results in a low p value for KNN under this permutation test. On the other hand, the case of SVM is different in that when the feature structure is broken, the classification accuracy increases (similar results were obtained for the rest of the classifiers, namely MLP with 1 hidden layer, MLP with 2 hidden layers and RFC). Consequently, the p value for SVM (as well as for MLP-1, MLP-2 and RFC) is 1. Note that these results are in agreement with those published in [26], where simpler classifiers such as KNN tend to obtain lower p values as compared to the more complex classifiers under the second permutation test.

Figure 4: Results of feature permutation test. The graph on the left shows the distribution of mean accuracy for KNN. The graph on the right shows the distribution of mean accuracy for SVM.

Such high p values for the second permutation test also indicate that these classifiers (SVM, MLP-1, MLP-2 and RFC) are not fully exploiting the feature dependency to increase classification accuracies. Instead they may rely on more dominant feature/data values to classify the given test sample, and hence are less affected by permuted values due to randomizations. We also note that KNN is a simpler classifier as compared to SVM, MLP and RFC, and achieves an accuracy of about 77.5%, while SVM obtains a relatively higher accuracy of 85.96%. This, in combination with the results of the second permutation test, suggests that data preprocessing and/or domain specific feature extraction (instead of just the handcrafted features) may reduce redundant data information, and more complex classifiers such as SVM may actually exploit feature (or data) dependency to further increase the classification accuracies.

7 CONCLUSION
This paper discusses an ML based approach towards priority based identification of functional groups in organic molecules. To the best of our knowledge, this is the first such work which considers the priority of the functional groups and gives a single label for each sample. The aim was to explore the possibilities of using an accurate automated process to supplement the manual rule based approach with a machine based feature extraction and identification process. In the context of finding the most dominant functional group in the molecule, we find that the general rule based approach is less effective (recall that it gives only 23.27% accuracy, as against 58.91% when we just check whether a functional group is present or not without considering priority). Thus, we explored an ML based approach wherein we considered two cases. In the first case, features which a human would identify were used, while in the second case, the ML algorithm is allowed to identify patterns directly from the given spectroscopy data. The accuracy increased to 69.18% in the first case, while it improved further to 86.71% in the second case.
We also supported our analysis using confidence intervals and permutation tests in order to properly consider the statistical significance of the classifier performances. The permutation tests in particular revealed some deficiencies in complex classifiers such as SVM, and this may be remedied by using a reduced feature space obtained from application specific information as well as other standard dimensionality reduction techniques. Overall, the paper provides positive initial steps towards priority based functional group identification of organic molecules using the ML approach.

ACKNOWLEDGMENTS
We express our sincere gratitude to National Institute of Advanced Industrial Science and Technology, Japan for the open source SDBS database of IR spectra, without which this work would not be possible. Author B. Chaudhury would like to acknowledge fruitful discussions on spectroscopy with Dr. K. S. Maiti of Max-Planck Institute for Quantum Optics.

REFERENCES
[1] Saul Patai. Patai’s Chemistry of Functional Groups. Wiley, 1964-1995.
[2] John Coates. Interpretation of infrared spectra, a practical approach. Encyclopedia
of Analytical Chemistry, John Wiley & Sons Ltd, pages 10815–10837, 2000.
[3] Baker et al. Using Fourier transform IR spectroscopy to analyze biological
materials. Nature America, Inc., Nature Protocols, Vol. 9, No. 8, pages 1771–1791,
2014.
[4] H. Favre and W. Powell. Nomenclature of Organic Chemistry: IUPAC
Recommendations and Preferred Names. Royal Society of Chemistry, 1st edition, 2013.
[5] E. W. Robb and M. E. Munk. A neural network approach to infrared spectrum
interpretation. Mikrochim. Acta [Wien] I, pages 131–155, 1990.
[6] M. Minsky and S. Papert. Perceptrons. MIT Press, Cambridge, MA, 1969.
[7] R. J. Fessenden and L. Györgyi. Identifying functional groups in IR spectra
using an artificial neural network. j. Chem, 2:1755–1762, 1991.
[8] M. E. Munk, M. S. Madison, and E. W. Robb. Neural network models for infrared
spectrum interpretation. Mikrochim. Acta [Wien] II, pages 505–514, 1991.
[9] M. Meyer and T. Weigelt. Interpretation of infrared spectra by artificial neural
networks. Anal. Chim. Acta, 265:183–190, 1992.
[10] D. Ricard, C. Cachet, and D. Cabrol-Bass. Neural network approach to structure
feature recognition from infrared spectra. j. Chem, 33:202–210, 1993.
[11] P. N. Penchev, G. N. Andreev, and K. Varmuza. Automatic classification of
infrared spectra using a set of improved expert-based features. Anal. Chim. Acta,
388:145–159, 1999.
[12] C. Klawun and C. L. Wilkins. Joint neural network interpretation of infrared and
mass spectra. j. Chem, 36:249–257, 1996.
[13] C. Klawun and C. L. Wilkins. Optimization of functional group prediction from
infrared spectra using neural networks. j. Chem, 36:69–81, 1996.
[14] Judit Ambro. Classifying organic compounds using expert system and neural
networks. Theses, Dissertations, Professional Papers, 5104, 1991.
[15] M. Meyer, K. Meyer, and H. Hobert. Neural networks for interpretation of infrared
spectra using extremely reduced spectral data. Anal. Chim. Acta, 282:407–415,
1993.
[16] T. Visser, H. J. Luinge, and J. H. Van der Maas. Recognition of visual characteristics
of infrared spectra by artificial neural networks and partial least squares regression.
296:141–154, 1994.
[17] V. Tchistiakov, C. Ruckebusch, L. Duponchel, J. P. Huvenne, and P. Legrand.
Neural network modelling for very small spectral data sets: reduction of the
spectra and hierarchical approach. Chemometrics and Intelligent Laboratory
Systems, 54:93–106, 2000.
[18] Chris W. Brown and Su-Chin Lo. Chemical information based on neural network
processing of near-ir spectra. Anal. Chem, 70(14):2983–2990, 1998.
[19] Kazutoshi Tanabe et al. Identification of chemical structures from infrared spectra
by using neural networks. Appl. Spectrosc, 55:1394–1403, 2001.
[20] K. Tanabe, S. Kinugasa, and T. Tamura. SDBSWeb. http://sdbs.db.aist.go.jp
(National Institute of Advanced Industrial Science and Technology, Japan),
accessed May-July, 2017.
[21] R. M. Silverstein, G. C. Bassler, and T. C. Morrill. Spectrometric Identification of
Organic Compounds. John Wiley, 5th edition, 1991.
[22] George Socrates. Infrared and Raman Characteristic Group Frequencies: Tables
and Charts. John Wiley and Sons, 3rd edition, 2004.
[23] Peter Larkin. Infrared and Raman Spectroscopy. Elsevier, 1st edition, 2011.
[24] Leroy G. Wade, Jr. Organic Chemistry. Pearson Education, 6th edition, 2007.
[25] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (2nd
Edition). Wiley-Interscience, 2000.
[26] Markus Ojala and Gemma C. Garriga. Permutation tests for studying classifier
performance. J. Mach. Learn. Res., 11:1833–1863, August 2010.
