Abstract—Identifying birds is a challenging task for bird watchers due to the similarity of the birds' forms/image backgrounds and the watchers' lack of experience, so an image-based computer system is needed to help birdwatchers identify birds. This study aims at investigating the use of deep learning for bird identification, using a convolutional neural network to extract features from images. The investigation was performed on a database of 4340 images collected by the paper's author from Jordan. Principal Component Analysis (PCA) was applied on layers 6 and 7, as well as on the statistical operations merging the two layers: average, minimum, maximum and combining both layers. The datasets were investigated with the following classifiers: Artificial Neural Networks, K-Nearest Neighbor, Random Forest, Naïve Bayes and Decision Tree, where the metrics used for each classifier are accuracy, precision, recall, and F-Measure. The results of the investigation include, but are not limited to, the following: the PCA applied to the deep features does not only reduce the dimensionality, and therefore significantly reduce the training/testing time, but also allows increasing the identification accuracy, particularly when using the Artificial Neural Networks classifier. Based on the results of the classifiers, Artificial Neural Networks showed high classification accuracy (70.9908), precision (0.718), recall (0.71) and F-Measure (0.708) compared to the other classifiers.

Keywords—Birds identification; deep learning convolutional neural networks (CNN); VGG-19; principal component analysis (PCA)

I. INTRODUCTION

Many people are interested in observing and studying wildlife, especially in birdwatching. The role of birdwatching is to preserve nature by observing birds' behavior and migration patterns. The challenge for bird watchers in identifying birds from images remains difficult due to the similarity of the birds' forms/image backgrounds and the watchers' lack of experience in this field [1].

As mentioned in [17], bird voices or videos were used in earlier techniques to predict bird species, but these techniques face many challenges in giving an accurate result due to background bird/animal voices. So, images can be the best choice to use for identifying bird species. To implement this technique, images of all bird species need to be trained on to generate a model. A deep learning algorithm then converts an uploaded image into grayscale format and applies it to the trained model to predict the best-matching species name for the uploaded image.

Also, during previous years, artificial intelligence has been used in the field of image-based bird watching using different algorithms and methods [1][3][4][7][14], but this study differs from others in using the following operations: combining fc6/fc7, the maximum of fc6/fc7, the minimum of fc6/fc7, and the average of fc6/fc7, based on VGG-19. Hence, the field of birdwatching needs more investigation to develop systems with new techniques that help identify birds.

The database of images was collected from Jordan; the statistics of birds in Jordan, as stated in [13], are 434 species belonging to 66 families.

This study aims at investigating the use of deep learning for bird identification using VGG-19 to extract features from images. To achieve this aim, the performance of different classifiers (KNN, Decision Tree, Random Forest, and ANN) was investigated on the collected, reliable database of images of the birds available in Jordan.

VGG-19 is considered one of the most important models of Convolutional Neural Networks (CNN), and CNN is considered the strongest deep learning technique used in image identification [9].

The main reason for using VGG-19 is to provide high precision by finding features with distinctive details in the image, like differences in lighting conditions and other objects surrounding the birds [3]. Moreover, PCA can be employed as a dimensionality reduction tool on these features, which helps reduce the number of features and shortens the training time.

The motivation to conduct this study is represented by: 1) The shortage in the field of identifying birds based on images. 2) To the best of our knowledge, we have not come across any study conducted using VGG-19 for identifying birds. 3) There is a shortage of databases available in the world except the two databases available in [1][18]. This is also applicable to Jordan, as there is no database of bird images, and no program has been developed to identify birds.

Based on the features extracted using VGG-19, the contribution of this study is to provide the research field with a comparison between the results of the different aforementioned classifiers.

This study is organized into six sections. Section II introduces the overview of previous studies on all related subjects. Section III describes the used database. Section IV discusses the model design and the methodology for the experiment. Section V then discusses the experimental results, and finally, Section VI presents the paper's conclusion.
II. RELATED WORK

Machine learning (ML) represents a set of techniques that allow systems to discover the representations required for feature detection or classification from raw data. The performance of a classification system depends on the quality of the features. As this study can be categorized under the field of ML, this area was searched for the studies belonging to bird identification.

In the literature, there are a number of studies conducted in the field of identifying birds, but they were conducted with different algorithms and methods, as follows:

A number of studies identified birds based on audio/video, like [4][11][6][10], while other studies identified birds from images using AI algorithms [1][3][14], but not in the way this study was conducted. This study used different operations, namely MAX, MIN, AVERAGE, and combining between the layers fc6/fc7, based on the VGG-19 algorithm.

In the field of image-based bird databases and bird identification systems, the researchers in [19] conducted a study on data collected mostly from North America for 200 bird species, which they called Caltech-UCSD Birds 200 (CUB-200). They conducted their study based on two simple features: image sizes and color histograms. For image sizes, they represented each image by its width and height in pixels; for the color histograms, they used 10 bins per channel, and Principal Component Analysis was applied. Their results showed how the performance of the NN classifier degrades as the number of classes in the dataset is increased, as in [18]. The performance of the image size features is close to chance at 0.6% for the 200 classes, while the color histogram features increase the performance to 1.7%. In another example of studies conducted in this field, the researchers in [18] increased the number of images to 11788 (it was 6033 in [19]), using RGB color histograms and histograms of vector-quantized SIFT descriptors with a linear SVM. The classification accuracy obtained in their study is 17.3%.

Also in the field of bird identification systems, the researchers in [14] proposed a new feature to distinguish the types of birds. In their study, they used the ratio between the distance from the eye to the beak root and the beak width. This feature was integrated in a decision tree, and then in an SVM. The proposal was applied to the CUB-200-2011 database mentioned in [18]. The correct classification rate achieved is 84%.

Another study on bird identification used a database collected in India by the researchers, available in [1]. Their database consisted of 300-400 different images covering a number of bird species. In their study, the algorithm used to extract image features is AlexNet, followed by classification with an SVM classifier. The resulting accuracy is 85%.

The researchers in [11] used multiple pretrained CNN networks (AlexNet, VGG19 and GoogleNet) on the bird dataset called Caltech-UCSD Birds-200-2011. Based on an approach that combines the aforementioned algorithms, the results showed that this approach improved the accuracy, which reached 81.91% when applied to the Caltech-UCSD Birds-200-2011 dataset, compared to other datasets used in the same study.

Another study, conducted by [4] in the field of image-based bird databases and bird identification systems, aimed to classify birds during flight from video clips. They collected approximately 952 clips and extracted about 161,907 frame photos of 13 bird species. In order to improve the accuracy, the researchers used two kinds of features: appearance and motion features. They then compared their proposed method with the classifiers (VGG, MobileNet). The proposed method achieved a 90% correct classification rate when using the Random Forest classifier.

In the field of bird identification systems, the researchers in [3] applied different methods: 1) softmax regression using manually engineered features on the Caltech-UCSD-Birds-200 dataset [19]; 2) a multi-class SVM applied on HOG and RGB features extracted from images; 3) a CNN applied using a transfer learning algorithm to classify birds. The result of comparing the three methods was 46% when using the CNN.

In the next section, the database content, number of images, source of images, and the challenges in classifying the images are explained.

III. DATABASE DESIGN

The database of bird images was collected from Jordan, and it consists of 4340 images of 434 bird species. The database images were obtained from scientific sources and were approved by the Jordanian Bird Watching Association based on their scientific names [13].

The images have different backgrounds: some were taken in shadow conditions or against bright backgrounds, and some have other objects in the image as background. This added a huge challenge for the researchers to extract features and provide high accuracy.

IV. PROPOSED METHOD

This section presents the procedures used for the proposed method of identifying birds using VGG-19. Fig. 1 shows the proposed model.
The following steps explain the proposed model of this study:

Step 1): The feature vectors are extracted from the images automatically using MATLAB with the pretrained VGG-19, building the datasets of the feature layers fc6 and fc7. Each dataset (e.g., fc6) contains 4096 columns (representing the feature vector) and 4340 rows (representing the number of samples/images).
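The paper performs this extraction in MATLAB with the pretrained VGG-19; as a rough equivalent, the following sketch (an assumption, not the authors' code) reads the fc6/fc7 activations with PyTorch/torchvision, where the image list and helper names are illustrative:

```python
# Sketch: extract fc6/fc7 activations from a pretrained VGG-19 (PyTorch).
import torch
import numpy as np
from torchvision import models, transforms
from PIL import Image

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()

# In torchvision's VGG-19, classifier[0] is the fc6 layer and classifier[3] is fc7.
fc6 = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten(),
                          vgg.classifier[0])                 # 4096-d output
fc7 = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten(),
                          *vgg.classifier[:4])               # 4096-d output

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract(paths):
    """Return (n, 4096) fc6 and fc7 feature matrices for a list of image paths."""
    f6, f7 = [], []
    with torch.no_grad():
        for p in paths:                      # e.g., the 4340 bird images
            x = preprocess(Image.open(p).convert("RGB")).unsqueeze(0)
            f6.append(fc6(x).squeeze(0).numpy())
            f7.append(fc7(x).squeeze(0).numpy())
    return np.stack(f6), np.stack(f7)
```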
Step 2): The statistical operations (min, max, average, and combining both together) were performed on the original/pure fc6 and fc7 layers, to obtain new datasets to be used in the next stage (Step 3) and with the classifiers. The statistical operations are explained as follows (a small sketch of these operations follows this list):

• Max: takes, for each feature, the larger of the two values in fc6 and fc7 and puts it in a new set.

• Min: takes, for each feature, the smaller of the two values in fc6 and fc7 and puts it in a new set.

• Average: takes, for each feature, the average of the two values in fc6 and fc7 and puts it in a new set.

• Combine: concatenates the first set (4096 features) with the second set (4096 features), producing a new set that contains 8192 features.
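Each of the four merges is an element-wise (or concatenating) operation on the two 4340 × 4096 feature matrices. A minimal NumPy sketch, assuming F6 and F7 are the matrices produced in Step 1:

```python
import numpy as np

# F6, F7: (n_images, 4096) fc6/fc7 feature matrices from Step 1 (assumed names).
def merge_layers(F6, F7):
    return {
        "max":     np.maximum(F6, F7),     # element-wise larger value
        "min":     np.minimum(F6, F7),     # element-wise smaller value
        "average": (F6 + F7) / 2.0,        # element-wise mean
        "combine": np.hstack([F6, F7]),    # (n_images, 8192) concatenation
    }
```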
Step 3): PCA is applied on the original/pure fc6 and fc7 datasets and on the datasets obtained from the previous stage (Step 2), to produce new, reduced datasets.

The feature vectors obtained using the pre-trained VGG-19 are very large (4096 features), so PCA was implemented to reduce the number of features. In PCA, a set of percentages was used to specify how much variance of the data (of the 4096 features) to retain in the result: 95%, 97% and 99%.
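In scikit-learn, retaining a given variance fraction maps directly onto the n_components argument of PCA; a hedged sketch of this step, where X is one of the datasets from Steps 1-2 (an assumed name):

```python
from sklearn.decomposition import PCA

# X: (n_images, n_features) dataset from Step 1 or Step 2 (assumed).
def reduce_with_pca(X, variance=0.95):      # 0.95, 0.97 or 0.99
    pca = PCA(n_components=variance)        # keep enough PCs for this variance
    X_reduced = pca.fit_transform(X)        # e.g., 8192 -> a few hundred dims
    return X_reduced, pca
```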
Step 4): The results were produced by applying a set of classifiers to the datasets obtained from Steps 2 and 3.
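A minimal sketch of this evaluation step with scikit-learn, assuming a feature matrix X and species labels y; the 70/30 split is illustrative only, since the paper does not state its split here:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

classifiers = {
    "1KNN": KNeighborsClassifier(n_neighbors=1),
    "3KNN": KNeighborsClassifier(n_neighbors=3),
    "5KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
    "ANN": MLPClassifier(max_iter=500),
}

def evaluate(X, y):
    """Report accuracy, precision, recall and F-measure per classifier."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y)
    for name, clf in classifiers.items():
        y_pred = clf.fit(X_tr, y_tr).predict(X_te)
        p, r, f, _ = precision_recall_fscore_support(
            y_te, y_pred, average="weighted", zero_division=0)
        print(f"{name}: acc={accuracy_score(y_te, y_pred):.4f} "
              f"P={p:.3f} R={r:.3f} F={f:.3f}")
```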
V. EXPERIMENTAL RESULTS AND DISCUSSIONS

This section presents the performance evaluation results for the study dataset, including the accuracy, F-measure, recall, precision and training time for each of the following classifiers: 1KNN, 3KNN, 5KNN, ANN, Naïve Bayes, Random Forest and Decision Tree. The results of this study are displayed as follows:

A. Results of Both Original/Pure fc6/fc7 Datasets Separately

Table I shows the results of both the original fc6 and fc7 datasets. Naïve Bayes achieved the highest accuracy results for fc6 and fc7, which are (59.002) and (56.106). As for the time spent on training and testing, Decision Tree spent a long time (1406.69s), while the KNNs spent the least time (0s) compared to the other classifiers. This is because KNN has no training model; a test example is compared directly to the examples in the training set, which is why it is slow in testing, particularly when a large number of training examples is used [8][16]. These results match the results in [5][12].

B. Results of the Statistical Operations on fc6 and fc7 Datasets

This section shows the results of the three datasets obtained by applying the statistical operations (average, maximum, minimum) between the fc6 and fc7 layers.

Table II shows the results of the statistical operations on the fc6/fc7 datasets, where Naïve Bayes achieved the highest accuracy results for AVERAGE, MAX, and MIN, which are (57.30), (60.99), and (57.60) respectively. Not only did Naïve Bayes score acceptable accuracy, F-measure, recall, and precision, outperforming all classifiers, it also achieved this with an acceptable training time. This result does not match other studies [2][15].

C. Results of Combine between (Original fc6/fc7) Dataset

A new dataset, called combine, was obtained by combining fc6 (4096) and fc7 (4096), containing an 8192-feature vector; the results obtained on it are as follows:
Table III shows the birds identification results, where Naïve Bayes achieved the highest accuracy on combine (59.4009). The second highest accuracy is 1KNN, which achieved an accuracy of 50.2074. As for the time spent on training and testing, Decision Tree spent a long time (2484.01s), while the KNNs spent the least time (0s) compared with the other classifiers.

D. Results of Both Original/Pure fc6/fc7 after Applying PCA

Tables IV and V show the identification results for each classifier after applying PCA (95%, 97% and 99%).

In Table IV, the classifier ANN appears although it was not used in the previous Tables I to III. This can be explained as follows: ANN is the best classifier to use for deep features if and only if it is provided with a smaller number of deep features; otherwise, i.e., if it is applied to the original/pure deep features obtained from VGG-19 layer 6 or 7, or any merging of them both, the training time would be unacceptably long [2][15][11].
TABLE III. IDENTIFICATION RESULTS OF COMBINE BETWEEN (ORIGINAL FC6/ FC7) DATASET
TABLE IV. IDENTIFICATION RESULTS OF ORIGINAL/PURE FC6 AFTER APPLYING PCA (95%,97%,99%)
Applying PCA influenced the training time for fc6, making it shorter for all classifiers in Table IV compared to the training times in Tables I to III (before applying PCA), especially for Random Forest and Naïve Bayes. The highest accuracies resulting from applying PCA at (95%, 97% and 99%) are in favor of ANN, with (68.8018, 70 and 62.3733%) respectively, which can be attributed to the reduced feature vector.

It is worth mentioning that the ANN classifier was not used with any sets except those obtained after applying PCA, because of its otherwise unacceptable training time. This matches previous studies stating that the training time for ANN is long compared with other classifiers [2][15].

Table V shows the birds identification results for fc7, where the highest accuracies resulting from applying PCA at (95%, 97% and 99%) are in favor of ANN, with (65.2995, 65.2995 and 67.9493) respectively.

The second highest accuracy resulting from applying PCA at all percentages (95%, 97% and 99%) belongs to Naïve Bayes, which achieved accuracies of (58.3641, 56.9585 and 56.3825%) respectively.

E. Results of the Statistical Operations on (fc6 and fc7) after Applying PCA

This section presents the identification results of the statistical operations (average, maximum and minimum) between fc6 and fc7 after applying PCA
(95%, 97%, 99%), as well as the training time for each classifier, as follows:

Table VI shows the birds identification results for the average between fc6 and fc7, where the highest accuracies resulting from applying PCA at (95%, 97% and 99%) are in favor of ANN, with (69.5622, 69.9078 and 65.5069) respectively. The second highest accuracy at all percentages (95%, 97% and 99%) belongs to Naïve Bayes, which achieved accuracies of (53.3871, 49.7926 and 39.8157%) respectively. As for the time spent on training and testing, ANN spent a long time (58379.22s), and PCA 95% required less time compared to PCA 97% and PCA 99%.

Table VII shows the birds identification results for the maximum between fc6 and fc7, where the highest accuracy resulting from applying PCA at (95%) is in favor of ANN, with (66.9816). Note that ANN results appear only for PCA (95%), not for the percentages (97% and 99%). This is because of the large number of features for PCA (97% and 99%), which reached (1428 and 2117) features respectively; therefore there are no ANN results for them, due to its unacceptable training time (taking days to provide the results). As for the time spent on training and testing, ANN spent a long time (54151.88s).

Table VIII shows the birds identification results for the minimum between fc6 and fc7, where the highest accuracy resulting from applying PCA at (95%) is in favor of ANN, with (70.8295). Note that ANN results appear only for PCA (95%), not for the percentages (97% and 99%). This is because of the large number of features for PCA (97% and 99%), which reached (1205 and 1910) features respectively, and the consequent unacceptable training time (taking days to provide the results). Naïve Bayes achieved accuracies from applying PCA at all percentages (95%, 97% and 99%) of (48.7327, 44.1014 and 35%) respectively. As for the time spent on training and testing, ANN spent a long time (42677.02s).

F. Results of Combining Feature Vector after Applying PCA

This section shows the results of combining fc6 (4096) and fc7 (4096), which reached 8192 features; this number of features was reduced after applying PCA (95%, 97%, 99%) to (250, 440 and 1080) features respectively. The results of combine are as follows:
TABLE VI. IDENTIFICATION RESULTS OF AVERAGE BETWEEN (FC6 AND FC7) AFTER APPLYING OF PCA (95%,97%,99%)
TABLE VII. IDENTIFICATION RESULTS OF MAXIMUM BETWEEN (FC6 AND FC7) AFTER APPLYING OF PCA (95%,97%,99%)
Table IX shows the birds identification results for the combine between fc6 and fc7, where the highest accuracies resulting from applying PCA at (95%, 97% and 99%) are in favor of ANN, with (69.5392, 70.9908 and 67.9263) respectively. The second highest accuracy at all percentages (95%, 97% and 99%) belongs to Naïve Bayes, which achieved accuracies of (57.235, 54.1475 and 43.7558%) respectively. As for the time spent on training and testing, ANN spent a long time (56279.29s).

Comparison between the proposed work and previous researchers' works: Table X compares the results of the proposed approach with three similar approaches for bird identification. Table X confirms that the output of our proposal can be considered an interesting study compared to previous research, for several reasons:

1) Some previous studies were conducted on small bird datasets (categories), like [4] and [7], which used (13) and (16) categories respectively, compared to this study, which used (434).

2) Some other previous studies were conducted on datasets containing a large number of training images (examples), like [4], [3] and [14], which used (161907), (11788) and (11788) examples respectively, compared to this study, which contained few images (4340 examples). A small number of images (examples) per bird usually leads to low accuracy compared to a large number of examples, but in contrast that was not the case here. This gives more confidence in the results of this study.

3) Some studies identified birds using different algorithms and methods based on audio/video, like [4][11][6][10], while other studies identified birds from images using AI algorithms [1][3][17]. This differs from what was conducted in this study, which used deep-learning algorithms and different statistical operations, namely MAX, MIN, AVERAGE, and combine between the layers fc6/fc7, based on the VGG-19 algorithm.

4) This study was conducted with different methods: combine between fc6/fc7, max of fc6/fc7, min of fc6/fc7, and the average of fc6/fc7, based on VGG-19.
TABLE VIII. IDENTIFICATION RESULTS OF MINIMUM BETWEEN (FC6 AND FC7) AFTER APPLYING OF PCA (95%,97%,99%)
TABLE IX. IDENTIFICATION RESULTS OF COMBINE ON (FC6 AND FC7) AFTER APPLYING PCA (95%, 97%, 99%)
The PCA used on the deep features does not only reduce the dimensionality, and therefore significantly reduce the training/testing time, but also allows for increasing the identification accuracy, particularly when using the ANN classifier. Based on the results of the classifiers, ANN showed high classification accuracy (70.9908), precision (0.718), recall (0.71) and F-Measure (0.708) compared to the other classifiers.

It is recommended to conduct more investigation to improve the accuracy results and to reduce the training time using different algorithms.

REFERENCES

[1] Tayal, Madhuri, Atharva Mangrulkar, Purvashree Waldey, and Chitra Dangra. 2018. "Bird Identification by Image Recognition." Helix 8(6): 4349–4352.
[2] Albustanji, Abeer. 2019. "Veiled-Face Recognition Using Deep Learning." Mutah University.
[3] Alter, Anne L, and Karen M Wang. 2017. "An Exploration of Computer Vision Techniques for Bird Species Classification."
[4] Atanbori, John et al. 2018. "Classification of Bird Species from Video Using Appearance and Motion Features." Ecological Informatics 48: 12–23.
[5] Brownlee, Jason. 2016. "How To Use Classification Machine Learning Algorithms in Weka." Retrieved from https://machinelearningmastery.com/use-classification-machine-learning-algorithms-weka/.
[6] Cai, J., Ee, D., Pham, B., Roe, P., & Zhang, J. 2007. "Sensor network for the monitoring of ecosystem: Bird species recognition." In 2007 3rd International Conference on Intelligent Sensors, Sensor Networks and Information: 293–298. IEEE.
[7] Kumar, A., & Das, S. D. 2018. "Bird Species Classification Using Transfer Learning with Multistage Training." In Workshop on Computer Vision Applications: 28–38. Springer, Singapore.
[8] Hassanat, A. 2018. "Furthest-pair-based binary search tree for speeding big data classification using k-nearest neighbors." Big Data 6(3): 225–235.
[9] Hijazi, Samer, Rishi Kumar, and Chris Rowen. 2015. "Using Convolutional Neural Networks for Image Recognition." IP Group, Cadence. Retrieved from https://ip.cadence.com/uploads/901/cnn_wp-pdf.
[10] Incze, A., Jancsó, H. B., Szilágyi, Z., Farkas, A., & Sulyok, C. 2018. "Bird sound recognition using a convolutional neural network." In 2018 IEEE 16th International Symposium on Intelligent Systems and Informatics (SISY): 295–300. IEEE.
[11] Korzh, Oxana, Mikel Joaristi, and Edoardo Serra. 2018. "Convolutional Neural Network Ensemble Fine-Tuning for Extended Transfer." In International Conference on Big Data: 110–123. Retrieved from http://dx.doi.org/10.1007/978-3-319-94301-5_9.
[12] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems 25: 1–9.
[13] The Royal Society For The Conservation Of Nature. 2017. "Birdwatching in Jordan." Retrieved from https://migratorysoaringbirds.birdlife.org/sites/default/files/jordan_birding_brochure.pdf.
[14] Qiao, Baowen, Zuofeng Zhou, Hongtao Yang, and Jianzhong Cao. 2017. "Bird Species Recognition Based on SVM Classifier and Decision Tree." In 2017 First International Conference on Electronics Instrumentation & Information Systems: 1–4.
[15] Sarayrah, Bayan Mahmoud. 2019. "Finger Knuckle Print Recognition Using Deep Learning." Mutah University.
[16] S. Al-Showarah et al. 2020. "The Effect of Age and Screen Sizes on the Usability of Smartphones Based on Handwriting of English Words on the Touchscreen." Mu'tah Lil-Buhuth wad-Dirasat, Natural and Applied Sciences series 35(1). ISSN: 1022-6812.
[17] Triveni, G., Malleswari, G. N., Sree, K. N. S., & Ramya, M. 2020. "Bird Species Identification using Deep Fuzzy Neural Network." Int. J. Res. Appl. Sci. Eng. Technol. (IJRASET) 8: 1214–1219.
[18] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. 2011. "The Caltech-UCSD Birds-200-2011 Dataset." Technical Report CNS-TR-2011-001, California Institute of Technology.
[19] Welinder, Peter et al. 2010. "Caltech-UCSD Birds 200." Technical Report CNS-TR-2010-001, California Institute of Technology.
Local compressed convex spectral embedding for bird species identification

Anshul Thakur,a) Vinayak Abrol, Pulkit Sharma, and Padmanabhan Rajan
School of Computing and Electrical Engineering, IIT Mandi, Mandi, Himachal Pradesh-175005, India

(Received 30 November 2017; revised 14 April 2018; accepted 18 April 2018; published online 29 June 2018)
This paper proposes a multi-layer alternating sparse-dense framework for bird species identification. The framework takes audio recordings of bird vocalizations and produces compressed convex spectral embeddings (CCSE). Temporal and frequency modulations in bird vocalizations are ensnared by concatenating frames of the spectrogram, resulting in a high dimensional and highly sparse super-frame-based representation. Random projections are then used to compress these super-frames. Class-specific archetypal analysis is employed on the compressed super-frames for acoustic modeling, obtaining the convex-sparse CCSE representation. This representation efficiently captures species-specific discriminative information. However, many bird species exhibit high intra-species variations in their vocalizations, making it hard to appropriately model the whole repertoire of vocalizations using only one dictionary of archetypes. To overcome this, each class is clustered using Gaussian mixture models (GMM), and for each cluster, one dictionary of archetypes is learned. To calculate CCSE for any compressed super-frame, one dictionary from each class is chosen using the responsibilities of individual GMM components. The CCSE obtained using this GMM-archetypal analysis framework is referred to as local CCSE. Experimental results corroborate that local CCSE either outperforms or exhibits comparable performances to existing methods including support vector machine powered by dynamic kernels and deep neural networks.
© 2018 Acoustical Society of America. https://doi.org/10.1121/1.5042241
supervised, multi-layer, alternating dense-sparse framework to obtain feature representations for bird species identification. In the proposed method, a given recorded audio signal (dense) is converted into a magnitude spectrogram (sparse). This concept of sparsity comes from the analysis that most of the bird vocalizations usually occupy only a few frequency bins in the spectrogram (Ref. 19). The frequency and temporal modulations present in bird vocalizations provide species-specific signatures. However, applying matrix factorization techniques on spectrograms directly may not capture these modulations effectively. To overcome this issue, a certain number of frames are concatenated around each frame of the spectrogram for embedding the context. This results in a high dimensional (sparse) super-frame representation that is capable of capturing the frequency and temporal modulations more effectively. These high dimensional super-frames are unsuitable for acoustic modeling due to high computational complexity. Since the spectrogram is sparse, this super-frame representation is also sparse. Hence, super-frames can be compressed without losing too much information. Random projections (Ref. 20), which preserve pairwise distance according to the Johnson-Lindenstrauss (JL) lemma, are used to compress these super-frames to a low-dimensional representation (dense). In the next step, the vocalizations of each bird species are modeled using restricted robust archetypal analysis (AA). AA provides a compact, probabilistic and interpretable representation (Ref. 23) in comparison to other matrix factorization techniques such as non-negative matrix factorization (NMF) and sparse dictionary learning (Ref. 22). The learned archetypal dictionaries are used to obtain a sparse-convex representation for the compressed super-frames. These representations are designated as compressed convex spectral embeddings (CCSE). This CCSE representation captures species-specific signatures effectively and can be used as feature representation in any classification framework.

CCSE assumes that the compressed super-frames of a bird species lie on only one manifold. However, a particular bird species can have a large repertoire of vocalizations that often occupy different manifolds in the feature space (Ref. 3). Therefore, a single archetypal dictionary per bird species may not be able to model the variations present in a single bird class. We address this problem by proposing to use multiple archetypal dictionaries to model one bird species. In order to learn multiple dictionaries, the compressed super-frames are clustered using GMM and for each cluster, an individual archetypal dictionary is learned. To obtain the CCSE for a compressed super-frame, a dictionary is chosen for each class using the responsibility terms of the class-specific GMM. The CCSE obtained using this GMM-AA-based framework is designated as local CCSE.

The archetypes learned using AA approximate the convex-hull of the data, and the estimation of these archetypes is often expensive in terms of computation (Ref. 24). Hence, in order to speed up the process of finding archetypes, we use a restricted version of AA. In restricted AA, only the data points around the convex hull/boundary are used for determining the archetypes. Conventionally, AA is performed individually for each class and without any separate effort to increase the inter-class discrimination. Hence, there can be a high correlation between atoms of inter-class dictionaries, which may degrade the discriminative ability of these dictionaries. Supervised dictionary learning methods such as label-consistent K-singular value decomposition (Ref. 25) overcome this problem by learning dictionaries in a supervised manner. Nevertheless, these supervised dictionary learning techniques are computationally expensive (both in time and space) and are not feasible when a substantial number of classes are involved. In order to overcome this issue, we propose an efficient greedy procedure to choose atoms from each dictionary such that the overall correlation among all dictionaries is decreased. This procedure not only reduces the gross-correlation among dictionaries but also helps in reducing their size. Decreasing the dictionary size reduces the computational complexity, which can be helpful for large-scale species identification.

The major contributions of this work are summarized as follows:

(1) Local CCSE, a supervised feature representation, that handles intra-class variations efficiently (Algorithm 2).
(2) The application of a restricted version of archetype analysis for acoustic modeling.
(3) A greedy procedure to choose a subset of atoms from each dictionary such that the overall correlation among all local dictionaries of all classes is reduced (Algorithm 1).

The rest of this paper is organized as follows. In Sec. II, we describe the CCSE-based framework. In Sec. III, the proposed local CCSE framework along with the proposed pruning procedure to decrease the inter-dictionary correlation is discussed. Experimental setup and observations are in Secs. IV and V, respectively. Section VI concludes the paper.

II. COMPRESSED CONVEX SPECTRAL EMBEDDINGS (CCSE)

In this section, the overall process to obtain CCSE from any input recording is described (Fig. 1). First, we describe the process of obtaining a compressed super-frame-based representation from any input audio recording. Then, we explain the procedure to learn an archetypal dictionary for each bird species. Finally, we describe the process to obtain CCSE for any audio recording.

FIG. 1. (Color online) Proposed pipeline for obtaining CCSE from an audio signal.

A. Computing compressed super-frames

The short time Fourier transform (STFT) is applied to obtain a magnitude spectrogram $S$ ($m \times N$; $m$ is the number of frequency bins, $N$ is the number of frames) from each input audio recording. Short term Fourier analysis often leads to the smearing of temporal and frequency modulations present in bird vocalizations. In order to capture these modulations more effectively, context information is ingrained into the current frame (under processing) of the spectrogram by concatenating $W$ previous and $W$ next frames around the current frame. This concatenation produces a high dimensional [$(2Wm + m) \times 1$] representation called a super-frame. The pooled spectrograms of all the training examples of a particular class, $\hat{S}$ ($m \times l$; $m$ is the number of frequency bins, $l$ is the number of pooled frames), are converted into super-frame representation, $F \in \mathbb{R}^{(2Wm+m) \times l}$, using the aforementioned concatenation
process. These super-frames are high-dimensional, which makes them computationally expensive to process for acoustic modeling. However, these super-frame representations are sparse. The sparsity of the spectrogram and super-frames is illustrated in Fig. 2. Due to this sparsity, super-frames are suitable to attain a high degree of compression. Hence, building upon the JL lemma (Ref. 21), random projections are used to compress these super-frames. Gaussian random matrices satisfy the JL lemma with high probability (Ref. 26). Hence, these random matrices preserve the pair-wise distance between super-frames in the projected space. In particular, a random Gaussian matrix $G$ (of dimensions $K \times (2Wm + m)$) is used to achieve the transformation, $\phi: \mathbb{R}^{2Wm+m} \to \mathbb{R}^{K}$, which compresses the super-frames. This compressed representation, $X = GF$, $X \in \mathbb{R}^{K \times l}$, is used to learn the archetypal dictionaries. Figure 2(c) depicts the compressed super-frame representation obtained for the spectrogram shown in Fig. 2(a). Similarly, for a test audio recording, compressed super-frames are obtained using the same procedure.

B. Restricted robust AA for dictionary learning

The CCSE framework employs archetypal analysis (AA) for acoustic modeling. The compressed super-frames corresponding to the bird vocalization regions are used for learning the archetypes. The bird vocalization regions are identified (in the input recordings) using a semi-supervised segmentation method (Ref. 27) proposed in one of our earlier studies. Using AA, which is a non-negative matrix factorization technique, the matrix of compressed super-frames, $X$, is decomposed to obtain the representation matrix $A$ as $X = DA$. The dictionary, $D$, consists of the archetypes, which lie on the convex hull of the data. These archetypes are confined to be the convex combination of the individual data points, i.e., $D = XB$, $D \in \mathbb{R}^{K \times d}$ ($d$ is the number of archetypes) and $B \in \mathbb{R}^{l \times d}$.
1. Restricting AA

Generally, matrix factorization is a computationally expensive process and AA is no exception. However, it is
known that archetypes lie on the boundary or convex hull of the data. This property can be used to restrict the archetypal search space to the data points existing around the boundary. This restricted search reduces the computational time required to learn the archetypes.

Let $\mathcal{B}$ be the index of compressed super-frames that lie around the boundary. To find these super-frames, the following objective function is minimized:

$$\min_{\substack{a_i \in \Delta_d,\ b_j \in \Delta_p,\ w_i}}\ \frac{1}{2}\sum_{i=1}^{p}\left[\frac{1}{w_i}\,\lVert x_i - \hat{X}Ba_i\rVert_2^2 + w_i\right],$$
$$\Delta_p \triangleq \{b_j \succeq 0,\ \lVert b_j\rVert_1 = 1\},\quad \Delta_d \triangleq \{a_i \succeq 0,\ \lVert a_i\rVert_1 = 1\},\quad \forall i: 1 \to p\ \text{and}\ \forall j: 1 \to d. \tag{2}$$

Here $x_i$, $a_i$, and $b_j$ are the columns of $\hat{X} \in \mathbb{R}^{k \times p}$, $A \in \mathbb{R}^{d \times p}$ and $B \in \mathbb{R}^{p \times d}$, respectively, $w_i$ is a scalar, and $\epsilon$ is a positive constant. In contrast to conventional AA employing Euclidean loss, robust AA employs a Huber loss function $h(\cdot)$. For scalars $u$ and $\epsilon$, the Huber function is defined as $h(u) = \frac{1}{2}\min_{w \geq \epsilon}\,[u^2/w + w]$ (Ref. 23). The use of Huber loss introduces a weight $w_i = \max(\lVert x_i - \hat{X}Ba_i\rVert_2,\ \epsilon)$ for $x_i$ in the optimization process, i.e., $w_i$ weighs the contribution of $x_i$ in the estimation of archetypes. After the optimization, the weight $w_i$ becomes larger for the outliers, reducing their importance in finding the archetypes. In this work, the optimization problem in Eq. (2) is solved using an iterative procedure proposed by Chen et al. (algorithm 3 in Ref. 23).

FIG. 3. (Color online) Average running time recorded for robust AA and restricted AA.
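To make the Huber re-weighting concrete, a minimal NumPy sketch of the weight update implied by Eq. (2) (function and variable names assumed; the complete alternating solver is algorithm 3 in Ref. 23):

```python
import numpy as np

def huber_weights(X, B, A, eps=1e-3):
    """One weight update of robust AA: w_i = max(||x_i - X @ B @ a_i||_2, eps).

    Columns with large residuals (outliers) get large w_i; since their squared
    residual is divided by w_i in Eq. (2), they contribute less to the archetypes.
    """
    residuals = np.linalg.norm(X - X @ B @ A, axis=0)   # per-column ||.||_2
    return np.maximum(residuals, eps)
```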
class-specific signatures and can be used as a feature representation for species classification. This behavior is illustrated in Fig. 4, which shows the average of CCSEs obtained for an exemplar vocalization of three different species. These average CCSEs are obtained using the final dictionary ($D$) derived from the individual dictionaries of all three species. The final dictionary contains 128 atoms per class (the first 128 for black-throated tit, the next 128 for black-yellow grosbeak and the last 128 for black-crested tit). In average CCSEs, the coefficients exhibit higher amplitude for the atoms of $D$ which correspond to the true class. This corroborates our claim of the discriminative nature of CCSE.

III. PROPOSED LOCAL CCSE-BASED FRAMEWORK

Song phrases and various calls such as alarm calls, feeding calls and flight calls form the repertoire of vocalizations that a species can produce. The nature of different kinds of vocalizations can vary considerably (Ref. 3). A single archetypal dictionary (as used in CCSE) cannot effectively model all these within-class variations. An effective way to handle this problem is to learn local archetypal dictionaries. The CCSE learned from these local dictionaries provides better representation for a bird species. Keeping these facts in account and improvising over the CCSE framework, we propose a local CCSE-based framework which can handle the variations present in vocalizations of various bird species. In this framework, multiple local dictionaries are learned for each class. The different local dictionaries model the different sets of vocalizations of a particular species. Out of these local dictionaries, one dictionary per class is chosen to obtain convex sparse representations (CCSE) for a super-frame. This framework also utilizes a greedy iterative procedure to decrease the gross correlation between intra and inter-dictionary atoms. This reduces the size of dictionaries, making the proposed framework computationally efficient.

A. Learning local dictionaries

The compressed super-frames corresponding to the bird vocalizations present in the training audio recordings are extracted and pooled together in a class-specific manner as described in Sec. II A. These pooled super-frames are used for learning multiple local dictionaries of a bird class. First, a GMM with $Z$ components is used to cluster these super-frames. Then, restricted robust AA (Sec. II) is applied to get an archetypal dictionary for each of these $Z$ clusters. Hence, one bird species/class is modeled by $Z$ archetypal dictionaries. It has to be noted that the number of GMM components can be different for different classes, e.g., $Z$ can be large for a class having large variations in vocalizations (e.g., Cassin's vireo) as compared to one with fewer variations (e.g., Hutton vireo). Since the clusters within a class can exhibit more overlap, GMM provides better clustering than hard-clustering techniques like K-means or K-medoids.

B. Decreasing the inter-dictionary correlation

In Sec. III A, all dictionaries are learned independently, which may lead to high correlation between the inter-dictionary atoms. This high correlation is not a big issue for the dictionaries of one class. However, if correlation is high among the dictionaries of different classes, it can affect the classification performance. In order to address this problem, a greedy pruning procedure is proposed to choose a subset of atoms from each dictionary, such that the gross correlation among all the dictionaries is decreased.

Let us denote the $j$th pruned dictionary of the $q$th class by $\bar{D}^q_j$. The proposed algorithm starts by choosing the independent atoms from the first dictionary of the first class, $D^1_1$, iteratively using the following metric:

$$i^{*} = \max_{i \notin Z}\ \lVert d^1_{1i} - D^1_{1Z}\,{D^1_{1Z}}^{\dagger}\, d^1_{1i}\rVert_2^2 \quad \text{s.t.}\ {D^1_{1Z}}^{T} D^1_{1Z}\ \text{is invertible.}$$
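A small NumPy sketch of this greedy selection (names and the seeding rule are assumptions): at each step, the candidate atom with the largest squared residual after projection onto the span of the already-selected atoms is added.

```python
import numpy as np

def greedy_select_atoms(D, n_atoms):
    """Greedily pick columns of D that are least explained by the atoms
    already chosen, following the projection-residual metric above."""
    chosen = [int(np.argmax(np.linalg.norm(D, axis=0)))]  # seed: largest atom
    while len(chosen) < n_atoms:
        Dz = D[:, chosen]
        P = Dz @ np.linalg.pinv(Dz)               # projector onto span(Dz)
        residual = np.linalg.norm(D - P @ D, axis=0) ** 2
        residual[chosen] = -np.inf                # never re-pick an atom
        chosen.append(int(np.argmax(residual)))
    return chosen
```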
ALGORITHM 1: Proposed greedy procedure to decrease the inter-dictionary correlation.

ALGORITHM 2: Procedure to obtain average local CCSE for a bird vocalization.
C. Computing local CCSE representation

In order to obtain the local CCSE for any super-frame $y_i$, one dictionary from the $Z_q$ local dictionaries of the $q$th class is chosen. The responsibility of each GMM component/cluster in defining $y_i$ is calculated, and the dictionary corresponding to the component exhibiting maximum responsibility is chosen. This is achieved using the following equation:

$$z^{q} = \arg\max_{z}\ \gamma_z(y_i) = \frac{w^q_z\,\mathcal{N}(y_i \mid \mu^q_z, \Sigma^q_z)}{\sum_{p=1}^{Z_q} w^q_p\,\mathcal{N}(y_i \mid \mu^q_p, \Sigma^q_p)}. \tag{4}$$

Here $w^q_z$, $\mu^q_z$, and $\Sigma^q_z$ are the weight, the mean and the covariance of the $z$th GMM component of the $q$th class. The pruned dictionary corresponding to this $z$th component/cluster, i.e., $\bar{D}^q_z$, is chosen. This procedure is iterated to select $Q$ dictionaries, one for each class, which are used for obtaining the local CCSE. These dictionaries are concatenated to form the final dictionary $D_{if}$. The local CCSE for $y_i$ is obtained by projecting it on a simplex corresponding to dictionary $D_{if}$, using the quadratic programming-based active-set method proposed by Chen et al. (algorithm 2 in Ref. 23). This local CCSE exhibits high coefficient values corresponding to true class atoms of $D_{if}$ and low coefficient values corresponding to the atoms of other classes (plots similar to Fig. 4 are obtained). The distinction in local CCSE for super-frames of different classes makes them an appropriate feature representation for classification.
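Eq. (4) is the standard GMM posterior (responsibility). A minimal sketch of the per-class dictionary choice, assuming fitted scikit-learn GaussianMixture models as stand-ins for the class-specific GMMs:

```python
import numpy as np
# gmms[q]: a fitted sklearn.mixture.GaussianMixture for class q (assumed);
# dicts[q][z]: the pruned archetypal dictionary of component z of class q.

def choose_dictionaries(y, gmms, dicts):
    """For each class q, pick the dictionary whose GMM component takes the
    maximum responsibility for super-frame y (Eq. 4), then concatenate."""
    selected = []
    for q, gmm in enumerate(gmms):
        resp = gmm.predict_proba(y.reshape(1, -1))[0]   # gamma_z(y) for all z
        selected.append(dicts[q][int(np.argmax(resp))])
    return np.hstack(selected)                          # final dictionary D_if
```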
A segmented bird vocalization is represented by the average of the local CCSE of all the super-frames corresponding to this vocalization. Algorithm 2 describes the procedure to obtain average local CCSE for a bird vocalization. These average local CCSEs are used as a feature representation for bird species identification. As an illustration, Fig. 5 shows the two-dimensional (2-D) plot of average local CCSEs for vocalizations of seven different bird species computed using t-distributed stochastic neighbor embedding (t-SNE) (Ref. 29). It must be noted that the parameters used for obtaining these average local CCSE are for illustration purposes only and may not be optimal. In this illustration, the super-frame representation of 1285 dimensions (for $W = 2$ and NFFT = 512) is used. Random projections are used to obtain a compressed 500-dimensional representation of these super-frames. Each species is modeled by a three-component GMM and a 32-atom dictionary is learned for each component/cluster. One such 32-atom dictionary is illustrated in Fig. 6. Hence, each vocalization is represented by a 224 (32 × 7)-dimensional average local CCSE. The analysis of Fig. 5 makes it clear that the proposed feature representation, i.e., average local CCSE, shows different characteristics for different bird species, making them suitable for bird species identification. The small overlap observed between vocalizations of grey bush chat, black-crested tit and golden bush-robin could be due to the similarity between the properties (frequency range and modulations) of the vocalizations of these species.

FIG. 5. (Color online) Two-dimensional t-SNE visualization of 224-dimensional average local CCSE obtained for seven different bird species.
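A visualization like Fig. 5 can be reproduced with scikit-learn; a hedged sketch, where features is an (n_vocalizations × 224) matrix of average local CCSEs and labels holds the species names (both assumed):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    """2-D t-SNE map of average local CCSE vectors, colored by species."""
    emb = TSNE(n_components=2, perplexity=30).fit_transform(features)
    for species in sorted(set(labels)):
        idx = [i for i, s in enumerate(labels) if s == species]
        plt.scatter(emb[idx, 0], emb[idx, 1], s=10, label=species)
    plt.legend(fontsize=7)
    plt.show()
```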
optimal values of d and J are determined empirically. The classifier used in this work is linear SVM, with an empirically tuned penalty parameter. The average local CCSE obtained from each segmented vocalization is used as the feature representation. Hence, the proposed framework provides segment/vocalization level classification decisions.
evaluated when there is a significant mismatch in training-testing conditions.
FIG. 10. (Color online) Comparison of the classification performance of the proposed local CCSE-based framework with various comparative methods.
methods is illustrated in Fig. 10. It is evident from the figure that the local CCSE-based framework outperforms the other methods considered in this study. The classification accuracy obtained using the proposed local CCSE-based framework is higher than the GMM, GMM-UBM, and SVM powered by various dynamic kernels. The local CCSE-based framework shows a relative improvement of 14.77%, 10.99%, 8.54%, 10.32%, 6.45%, 7.32%, and 6.82% over classification accuracies of GMM, GMM-UBM, PSK, GMMSV, GUMI, IMK, and PMK, respectively. Also, a relative improvement of 4.6% is observed over the framework using random forest and unsupervised feature representations obtained using spherical K-means. However, the performance of DNN is comparable to the proposed framework. A small relative improvement of 1.11% is obtained by the proposed framework over the classification accuracy achieved by DNN. Also, the local CCSE outperforms CCSE by a relative improvement of 3.89%.

E. Robustness comparison

The performances of most classification frameworks are known to degrade when training and testing conditions vary significantly. For the task in hand, these variations can arise due to differences in the recording ambiance and differences in recording devices (e.g., omni-directional vs directional microphones). We conduct an experiment to analyze the robustness of the proposed framework against differences in recording environments. Five recordings of each of the 50 species considered in this study are downloaded from Xeno-Canto (Ref. 32), which is a crowd-sourced bird vocalization database. The recording conditions of the Xeno-Canto audio recordings (XC) are different from the recordings in the dataset used for classification comparison in the previous sub-section.

XC recordings are used for testing while all the recordings used in previous experiments are used for training (75% of the vocalizations for dictionary learning and 25% for training SVM). The performance of the proposed framework and other classification methods is depicted in Fig. 11. The analysis of Fig. 11 shows that the proposed local CCSE framework shows a relative improvement of 10.94%, 8.68%, 7.52%, 6.67%, 6.8%, 5.53%, 6.23%, 5.12%, 2.04%, and 3.49% over the classification accuracies of GMM, GMM-UBM, PSK, GMMSV, GUMI, IMK, PMK, SK-means, DNN, and CCSE, respectively. This shows that the proposed framework is more robust to the mismatched conditions in comparison to the other comparative methods.

VI. CONCLUSION

In this work, we proposed a local CCSE-based framework for bird species identification using audio recordings. We demonstrated that local CCSE provides good species discrimination and can be used as a feature representation in a classification framework. By utilizing super-frames, information about time-frequency modulations is effectively utilized. Apart from this, we also used a restricted version of AA which only processes the data points around the boundary to find archetypes. To reduce the size of archetypal dictionaries, we proposed a greedy iterative procedure which chooses a subset of atoms from each dictionary such that the gross-correlation across atoms of all the dictionaries is decreased. Experimental evaluation showed that the local CCSE-based framework outperformed all the existing methods considered in this study. The framework also performed well when there was a difference in training-testing recording conditions.

Future work will include enforcing group sparsity for obtaining CCSE. This can further enhance the discriminative properties of local CCSE. Also, instead of using a simple linear classifier such as linear SVM, incorporating ensemble classifiers like random forest and neural
FIG. 11. (Color online) Classification performance of different methods on Xeno-Canto recordings.
15
networks can improve the classification performance of the D. Chakraborty, P. Mukker, P. Rajan, and A. Dileep, “Bird call identifica-
tion using dynamic kernel based support vector machines and deep neural
local CCSE-based representation.
networks,” in Proceedings of Int. Conf. Mach. Learn. App. (December
2016), pp. 280–285.
ACKNOWLEDGMENT 16
A. D. Dileep and C. C. Sekhar, “GMM-based intermediate matching ker-
nel for classification of varying length patterns of long duration speech
This work is partially supported by IIT Mandi under the using support vector machines,” IEEE Trans. Neural Net. Learn. Syst.
project IITM/SG/PR/39 and Science and Engineering 25(8), 1421–1432 (2014).
17
Research Board, Government of India under the project V. Bisot, R. Serizel, S. Essid, and G. Richard, “Acoustic scene classifica-
tion with matrix factorization for unsupervised feature learning,” in
SERB/F/7229/2016-2017. Proceedings of Int. Conf. Acoust. Speech, Signal Process. (March 2016),
pp. 6445–6449.
1 18
M. Clout and J. Hay, “The importance of birds as browsers, pollinators P. Giannoulis, G. Potamianos, P. Maragos, and A. Katsamanis, “Improved
and seed dispersers in New Zealand forests,” N. Z. J. Ecol. 12, 27–33 dictionary selection and detection schemes in sparse-CNMF-based over-
(1989). lapping acoustic event detection,” in Proceedings of the Detection and
2
T. S. Brandes, “Automated sound recording and analysis techniques for Classification of Acoustic Scenes and Events 2016 Workshop
bird surveys and conservation,” Bird Conserv. Int. 18(S1), S163–S173 (DCASE2016) (2016), pp. 25–29.
(2008). 19
[3] C.-H. Lee, C.-C. Han, and C.-C. Chuang, "Automatic classification of bird species from their sounds using two-dimensional cepstral coefficients," IEEE/ACM Trans. Audio, Speech, Language Process. 16(8), 1541–1550 (2008).
[4] D. E. Kroodsma, E. H. Miller, and H. Ouellet, Acoustic Communication in Birds: Song Learning and Its Consequences (Academic, New York, 1982), Vol. 2.
[5] A. L. McIlraith and H. C. Card, "Birdsong recognition using backpropagation and multivariate statistics," IEEE Trans. Signal Process. 45(11), 2740–2748 (1997).
[6] A. Harma and P. Somervuo, "Classification of the harmonic structure in bird vocalization," in Proceedings of Int. Conf. Acoust. Speech, Signal Process. (May 2004), pp. 701–704.
[7] P. Somervuo, A. Harma, and S. Fagerlund, "Parametric representations of bird sounds for automatic species recognition," IEEE/ACM Trans. Audio, Speech, Lang. Process. 14(6), 2252–2263 (2006).
[8] P. Somervuo and A. Harma, "Bird song recognition based on syllable pair histograms," in Proceedings of Int. Conf. Acoust. Speech, Signal Process. (May 2004), Vol. 5, pp. V-825.
[9] S. Fagerlund, "Bird species recognition using support vector machines," EURASIP J. Appl. Signal Process. 2007(1), 038637.
[10] D. Stowell and M. D. Plumbley, "Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning," PeerJ 2, e488 (2014).
[11] E. Sprengel, M. Jaggi, Y. Kilcher, and T. Hofmann, "Audio based bird species identification using deep learning techniques," in CLEF (Working Notes) (2016), pp. 547–559.
[12] B. P. Toth and B. Czeba, "Convolutional neural networks for large-scale bird song classification in noisy environment," in CLEF (Working Notes) (2016), pp. 560–568.
[13] K. J. Piczak, "Recognizing bird species in audio recordings using deep convolutional neural networks," in CLEF (Working Notes) (2016), pp. 534–543.
[14] R. Narasimhan, X. Z. Fern, and R. Raich, "Simultaneous segmentation and classification of bird song using CNN," in Proceedings of Int. Conf. Acoust. Speech, Signal Process. (March 2017), pp. 146–150.
[19] N.-C. Wang, R. E. Hudson, L. N. Tan, C. E. Taylor, A. Alwan, and R. Yao, "Change point detection methodology used for segmenting bird songs," in Proceedings of Int. Conf. Signal Info. Process. (2013), pp. 206–209.
[20] J. Haupt and R. Nowak, "Signal reconstruction from noisy random projections," IEEE Trans. Info. Theory 52(9), 4036–4048 (2006).
[21] P. Frankl and H. Maehara, "The Johnson–Lindenstrauss lemma and the sphericity of some graphs," J. Comb. Theory, Ser. B 44(3), 355–362 (1988).
[22] I. Tosic and P. Frossard, "Dictionary learning," IEEE Signal Process. Mag. 28(2), 27–38 (2011).
[23] Y. Chen, J. Mairal, and Z. Harchaoui, "Fast and robust archetypal analysis for representation learning," in Proceedings of Comp. Vis. Pattern Recog. (June 2014), pp. 1478–1485.
[24] V. Abrol, P. Sharma, and A. K. Sao, "Identifying archetypes by exploiting sparsity of convex representations," in Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS) (2017).
[25] Z. Jiang, Z. Lin, and L. S. Davis, "Learning a discriminative dictionary for sparse coding via label consistent K-SVD," in Proceedings of Comp. Vis. Pattern Recog. (June 2011), pp. 1697–1704.
[26] S. Dasgupta and A. Gupta, "An elementary proof of a theorem of Johnson and Lindenstrauss," Random Struct. Algorithms 22(1), 60–65 (2003).
[27] A. Thakur, V. Abrol, P. Sharma, and P. Rajan, "Rényi entropy based mutual information for semi-supervised bird vocalization segmentation," in Proceedings of MLSP (September 2017).
[28] S. Mair, A. Boubekki, and U. Brefeld, "Frame-based data factorizations," in Proceedings of Int. Conf. Mach. Learn. (August 2017), Vol. 70, pp. 2305–2313.
[29] L. Maaten and G. Hinton, "Visualizing data using t-SNE," J. Mach. Learn. Res. 9(Nov.), 2579–2605 (2008).
[30] "Art-sci center, University of California," http://artsci.ucla.edu/birds/database.html/ (Last viewed October 10, 2017).
[31] "Macaulay library," http://www.macaulaylibrary.org/ (Last viewed November 14, 2017).
[32] "Xeno-canto," http://www.xeno-canto.org (Last viewed October 14, 2017).
Bird Species Identification Using Convolutional Neural Network
R. Dharaniya et al.
Advances in Parallel Computing Algorithms, Tools and Paradigms
D.J. Hemanth et al. (Eds.)
© 2022 The authors and IOS Press.
This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
doi:10.3233/APC220053
1. Introduction
In recent years, deep learning techniques like convolutional neural networks (CNNs) have caught the attention of environmental researchers. Deep learning methods are implemented in the field of ecology to successfully identify animal, bird, or plant species from images. Bird species classification has received considerable attention because of its prominence in computer vision and its promising applications in environmental studies. The identification of bird species is a challenging research task, as it may sometimes lead to uncertainty due to the varied appearances of birds, backgrounds, and environmental changes. Recent developments in the deep learning field have made the classification of bird species more flexible.
Birds play an essential role in the ecosystem by directly influencing food production, human health, and ecological balance. Ornithologists have faced various kinds of challenges for decades regarding the identification of bird species. Ornithologists study the characteristics and attributes of birds and distinguish them by the way they live within their environment and by their ecological influence. Studies on bird species have led to the development of applications that can be used in tourism, in sanctuaries, and additionally by bird watchers.
1 R. Dharaniya, Assistant Professor, Easwari Engineering College, Chennai, Tamil Nadu, India. E-mail: rdharaniya@gmail.com.
Several bird species in the world are critically endangered, vulnerable, or near threatened. The development of a bird species classification system can help the authorities to keep track of birds in a particular area by observing each species of bird.
In our work, the dataset is collected using internet resources.
Before using the dataset for the classification, the images will be preprocessed. The
CNN algorithm is used for the classification. The preprocessed images will be used for
feature extraction and classification. The model will be trained and tested to produce a
favorable outcome.
2. Prior work
3. Algorithm
A CNN can detect objects in an image and identify patterns in it. CNNs consist of the input layer, which is a grayscale image; the output layer, which is a binary or multi-class label; and the hidden layers, which are convolution layers, ReLU layers, pooling layers, and a fully connected neural network. In the field of image processing, CNN is a powerful algorithm; these algorithms are currently among the best available for automating the processing of images. Images are made up of RGB combinations.
Three Layers of CNN:
Convolutional layer: In an ordinary neural network, every input neuron is connected to each hidden neuron. In a CNN, only a small portion of the input neurons are connected to the hidden neurons.
Pooling layer: This layer reduces the dimensions of the feature maps. There will be multiple activation and pooling layers within the hidden part of the CNN.
Fully connected layer: The fully connected layers are the last few layers in the network. The output from the final pooling or convolutional layer is flattened and fed into the fully connected layers. A minimal sketch of this layer stack follows.
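As an illustration of the layer stack just described, the following is a minimal Keras sketch; the filter counts, kernel sizes, input resolution, and the six-class output are illustrative assumptions, not a configuration given in the paper.

```python
# A minimal CNN with the three layer types described above (hypothetical sizes).
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu",
                  input_shape=(128, 128, 3)),   # convolutional layer + ReLU on an RGB input
    layers.MaxPooling2D((2, 2)),                # pooling layer halves the feature-map size
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                           # flatten the final feature maps
    layers.Dense(128, activation="relu"),       # fully connected layer
    layers.Dense(6, activation="softmax"),      # one output per bird class
])
model.summary()
```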
4. Dataset
First, the dataset was collected from the resources available on the internet. There are
six different bird species or classes with more than 100 images per class. The bird
species are American Goldfinch, Barn Owl, Carmine Bee-eater, Downy Woodpecker,
Emperor Penguin, and Flamingo. The model will be trained on this dataset.
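Loading such a folder-per-species dataset is straightforward in Keras. A hedged sketch follows; the root folder `birds/`, the 80/20 train/validation split, and the image size are assumptions, as the paper does not state them.

```python
# Assumes birds/ contains six subfolders, one per species (hypothetical layout).
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "birds/", validation_split=0.2, subset="training", seed=42,
    image_size=(128, 128), batch_size=128)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "birds/", validation_split=0.2, subset="validation", seed=42,
    image_size=(128, 128), batch_size=128)
print(train_ds.class_names)   # the six species names, inferred from folder names
```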
5. Preprocessing
Image pre-processing involves working with images at the lowest level of abstraction. These operations do not increase the amount of information in the image; if entropy is taken as the measure of information, they decrease it. In addition to cleaning image data for model input, preprocessing can also decrease model training time and increase model inference speed. If input images are exceptionally large, reducing them will dramatically improve training time. Preprocessing reduces distortions or enhances certain features for further processing, and geometric transformations (e.g. rotation, scaling, translation) are often necessary as well.
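A small sketch of the resizing and pixel-scaling steps described above, applied to the datasets from the previous sketch; the target size and the [0, 1] scaling are common defaults rather than values given in the paper.

```python
# Shrinking large inputs speeds up training; scaling stabilizes optimization.
import tensorflow as tf

def preprocess(images, labels):
    images = tf.image.resize(images, (128, 128))   # reduce exceptionally large images
    images = tf.cast(images, tf.float32) / 255.0   # normalize pixel values to [0, 1]
    return images, labels

train_ds = train_ds.map(preprocess)
val_ds = val_ds.map(preprocess)
```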
6. Implementation
After preprocessing the images and splitting them into training and validation datasets, a network architecture for the model will be created. Different types of layers are used according to their features, namely:
4. Dense: the dense layer produces its output as the dot product of the input and the kernel.
5. Finally, a softmax layer will be used as the activation function because this is a multi-class classification problem.
Figure 2. Flowchart
Now the model will be trained for 50 epochs with a batch size of 128. During each epoch, the model's performance can be determined from the training and validation accuracy. Next, the accuracy of the model over the training history and the loss of the model over the training history will be plotted. The prediction and the original label of the image will be displayed using the argmax() function. At last, a web application will be developed to display the result of the model.
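A sketch of the training and prediction steps just described (50 epochs, batch size 128 set at loading time, accuracy history plot, and argmax() over the softmax outputs). The optimizer and loss are assumptions, since the paper does not name them; `model`, `train_ds`, and `val_ds` come from the earlier sketches.

```python
import numpy as np
import matplotlib.pyplot as plt

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
history = model.fit(train_ds, validation_data=val_ds, epochs=50)

# Accuracy over the training history
plt.plot(history.history["accuracy"], label="train accuracy")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("epoch"); plt.ylabel("accuracy"); plt.legend(); plt.show()

# Prediction vs. original label for one validation image
images, labels = next(iter(val_ds))
pred = model.predict(images)
print("predicted:", int(np.argmax(pred[0])), "actual:", int(labels[0]))
```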
7. Result
A bird image will be given as input to the model and the species of the bird will be
displayed along with the image.
The following graph shows the model accuracy and was plotted with epochs along the
x-axis and accuracy rate along the y-axis.
Identifying bird species from an image input by the user is the main goal of the project. The Convolutional Neural Network was used as it provides good precision and accuracy; the accuracy was about 87%–92%. Wildlife researchers can use this to keep track of wildlife movement and behavior in specific habitats. Various deep learning techniques can be applied in the future to enhance the accuracy and performance of the model. The future work also includes developing a mobile application for more convenient use. Also, this can be implemented in real-time monitoring of bird species in sanctuaries.
References
[10] Pratik Ghosh, Ankit Agarwalla. Classification of Birds Species Using Artificial Neural Network.
International Journal for Research in Engineering Application & Management (IJREAM), vol. 7, no. 3,
June 2021.
[11] Pankaj Prakash Patil, Atharva Dhananjay Kulkarni, Aakash Ajay Dhembare, Krishna Adar, Rahul
Sonkamb. Bird Species Classification using multi-scale Convoluted Neural Network with Data
Augmentation Techniques. International Journal of Engineering Development and Research, vol. 9, no.
2, 2021.
All-Conv Net for Bird Activity Detection: Significance of Learned Pooling
Arjun Pankajakshan, Anshul Thakur, Daksh Thapar, Padmanabhan Rajan & Aditya Nigam
Abstract
Bird activity detection (BAD) deals with the task of predicting the presence or absence of bird vocalizations in a given audio recording. In this paper, we propose an all-convolutional neural network (all-conv net) for bird activity detection. All the layers of this network including pooling and dense layers are implemented using convolution operations. The pooling operation implemented by convolution is termed as learned pooling. This learned pooling takes into account the inter feature-map correlations which are ignored in traditional max-pooling. This helps in learning a pooling function which aggregates the complementary information in various feature maps, leading to better bird activity detection. Experimental observations confirm this hypothesis. The performance of the proposed all-conv net is evaluated on the BAD Challenge 2017 dataset. The proposed all-conv net achieves state-of-art performance with a simple architecture and does not employ any data pre-processing or data augmentation techniques.
Index Terms: bird activity detection, all-conv net, learned pooling

1. Introduction
Manual monitoring of birds can be a tedious and difficult task due to the wide range of habitats, such as islands, marshes and swamps, occupied by different bird species [1]. Many bird species are nocturnal, which makes it less feasible to manually monitor them. Moreover, it requires an experienced bird watcher to accurately identify the bird species. Acoustic monitoring bypasses the need of manual labour and provides a convenient way to monitor birds in their natural habitats without any physical intrusion [2].
Recently, there is a boom in sophisticated automatic recording devices which can be programmed to record audio data for many days. This vast amount of audio data can be used for acoustic monitoring, which can give insight into the avian diversity, migration patterns and population trends in different bird species. Bird activity detection (BAD) [3] is generally the first module in any bioacoustics monitoring system. BAD distinguishes the audio recordings having bird vocalizations from those recordings which do not contain any bird call or song. Hence, BAD helps in removing the audio recordings which do not contain any bird vocalizations from further processing. This reduces the amount of data to be processed for other tasks such as bird vocalization segmentation and species identification.
The task of BAD becomes challenging due to the variations present in bird vocalizations across different bird species. These variations can be due to the different frequency profiles or different frequency-temporal modulations of different species. This behaviour is more evident in the vocalizations of the Emerald dove and Cassin's vireo. The sounds produced by the Emerald dove are characterized by low frequency and have little variation in frequency along time. On the other hand, the vocalizations of Cassin's vireo relatively occupy high frequency bands and exhibit different kinds of frequency and temporal modulations. These variations present in bird sounds make it difficult to model the bird vocalizations class effectively. Apart from this, the background of audio recordings is also unpredictable. Different biotic and abiotic non-bird sounds can form the background in any audio recording. Hence, the task of BAD can be seen as the classification between the universe of bird sounds and the universe of non-bird sounds. The extreme variations present in both sets make this simple two-class classification problem challenging. An ideal bird activity detector should work well across different bird species in different recording environments.
Many studies in the literature have targeted the task of BAD. In our earlier studies [4, 5], frameworks based on SVM powered with dynamic kernels are proposed for the task of BAD. In [4], a GMM based variant of the existing probabilistic sequence kernels is proposed, while in [5], an archetypal analysis based sequence kernel is successfully utilized for BAD. Both these frameworks require a small amount of training data and generalize well on unseen data. However, the classification performance of these frameworks is not up to the level of state-of-art BAD frameworks. In [6], a masked non-negative matrix factorization (Masked NMF) approach for bird detection is proposed. Masked NMF can handle weak labels by applying a binary mask on the activation matrix during dictionary learning. In [7], a convolutional recurrent neural network (CRNN) for BAD is proposed. Convolutional layers of this network extract local frequency shift invariant features while recurrent layers learn the long-term temporal relations between the features extracted from short-term frames. The classification performance of this network is comparable to the state-of-art methods on the evaluation data of the BAD 2017 dataset. This method does not utilize any pre-processing or data augmentation. However, the recurrent layers are computationally expensive to train, which increases the overall computational time required to train the CRNN. In [8], two CNNs (sparrow and bulbul) are proposed. The ensemble of these two networks provides state-of-art bird activity detection on this dataset.
Inspired by the success of CNN-based architectures for BAD and the simplicity of the all-conv neural network proposed in [9], we propose an all-convolutional neural network (all-conv net) for BAD. This network is characterized by alternating convolution and pooling layers followed by dense layers, where each of these layers is implemented by the convolution operation itself. Instead of max-pooling, the local features obtained using a convolution layer are pooled using a learned aggregation, implemented using convolution operations. This aggregation function is referred to as learned pooling, which aggregates the complementary information present in different feature-maps. On the contrary, max-pool operations aggregate the information present in each feature-map individually (see section 2). This behaviour of learned pooling helps in obtaining better discriminating features, leading to a better classification performance in comparison to normal max-pooling. Moreover, the proposed all-conv net has a smaller number of trainable parameters in comparison to the existing state-of-art neural networks for the task at hand and does not utilize any pre-processing or data augmentation.
The rest of this paper is organized as follows. In Section
2, the proposed deep all-conv framework is described in detail.
Performance analysis and conclusion are in sections 3 and 4,
respectively.
2. Proposed Framework
In this section, we describe the proposed all-conv net for BAD.
First, we describe Mel-spectrogram, a spatio-temporal feature
representation obtained from the raw audio recordings. Then,
we describe the proposed all-conv net which maps the input
Mel-spectrogram to a two dimensional probabilistic score vec-
tor containing probabilities for the presence and absence of any
bird activity.
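To make the learned-pooling idea concrete, here is a minimal Keras sketch under stated assumptions: the filter counts, kernel sizes, and the Mel-spectrogram input shape are illustrative, not the authors' exact architecture. The point is that each pooling step is a strided convolution whose kernel spans all incoming feature maps, whereas max-pooling aggregates each map independently.

```python
# Learned pooling as a strided convolution (illustrative sizes, not the paper's).
import tensorflow as tf
from tensorflow.keras import layers

def learned_pool(x, n_maps):
    # Halves the spatial resolution like 2x2 max-pooling, but the kernel spans
    # all input feature maps, so inter feature-map correlations can be learned.
    return layers.Conv2D(n_maps, kernel_size=2, strides=2, activation="relu")(x)

inp = layers.Input(shape=(80, 300, 1))                    # e.g. an 80-band Mel-spectrogram
x = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
x = learned_pool(x, 16)                                   # in place of MaxPooling2D(2)
x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
x = learned_pool(x, 32)
x = layers.Conv2D(2, 1)(x)                                # "dense" layer as a 1x1 convolution
x = layers.GlobalAveragePooling2D()(x)
out = layers.Softmax()(x)                                 # P(bird present), P(bird absent)
model = tf.keras.Model(inp, out)
```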
[Table: number of trainable parameters. RCNN [7]: 806,000; Bulbul [8]: 373,169; Sparrow [8]: 309,843; All-conv (proposed): 154,414.]
[Figure: feature coefficients (feature index 1–16) for speech+birdcall and music examples; spectrogram panels (frequency in kHz versus time in seconds) with Background, Human and Bird regions marked; normalized MI versus frame index.]
[Fig. 2: box plots of coefficients of (a) 100 background frames, (b) 100 human speech frames and (c) 100 bird sound frames.]
Fig. 1: (a) Spectrogram of an audio recording containing human speech, background noise and vocalization of Cassin's vireo, (b) feature representation of the above spectrogram, (c) normalized MI extracted from the feature representations depicted in (b).

Figure 1(b) depicts the feature representation learned from the time-frequency representation of an audio recording shown in Figure 1(a). This feature representation is obtained by projecting the time-frequency representation on the first 5 columns of U. This U is learned by factorizing the pooled time-frequency representations of the vocalizations of Cassin's vireo, a North American song bird, using SVD. The audio corresponding to the spectrogram shown in Figure 1(a) contains human speech and two Cassin's vireo vocalizations. By analyzing Figure 1(b), it is clear that information corresponding to the human speech and other background disturbances is not reflected in the feature domain. Each of the coefficients calculated for any non-bird frame has magnitude close to zero. Hence, the variance of coefficients of any non-bird frame is low. On the other hand, each coefficient calculated for any bird frame has a larger magnitude in comparison to the non-bird frame. The variance of coefficients within a bird sound frame is also high. This is due to the fact that none of the learned basis vectors have any contribution in defining the background frames. However, a bird vocalization frame can be represented as a combination of the scaled versions of the learned basis vectors [12]. The contribution of some basis vectors in defining the input bird frame is higher than the others. This leads to the difference in magnitudes of the coefficients calculated for bird frames.

This behavior is highlighted in Figure 2. The box plots of coefficients of 100 human speech frames, 100 background frames and 100 bird vocalization frames are shown in Figure 2. From this figure it is evident that the magnitude of coefficients for human speech and background frames is almost constant. However, a significant amount of variation is present for the bird vocalizations.

3.2. Rényi entropy based mutual information

The feature representation of the nth frame, x_n, is converted into a normalized vector using the softmax function:

(x_n)_j = e^{(x_n)_j} / \sum_{k=1}^{K} e^{(x_n)_k}, for j = 1, 2, ..., K.

Since the feature coefficients of each non-bird frame approach zero, each coefficient of the frame becomes almost equal after applying the softmax function. However, for any bird vocalization frame, some coefficients exhibit higher values than others. Hence, the normalized feature representations for all the non-bird frames are nearly similar and more variation occurs for the bird vocalization frames. Considering these feature vectors as sampled random vectors in R^K, this behavior can be discriminated by using mutual information.

Mutual information (MI) between normalized feature representations of each pair of consecutive frames (i.e. between the nth and (n-1)th frames) is calculated. This serves the purpose of considering the previous frame along with the current frame in making the segmentation decisions. MI of a random vector with itself is highest; therefore MI between two almost similar feature representations will be higher than between two representations which are different. Hence, for non-bird regions, MI will be high as compared to the regions having vocalizations, as depicted in Figure 1(c). Also, since the feature representations for the frames of non-bird regions are almost similar, MI across all these regions is almost constant.

MI between feature representations of two consecutive frames, i.e. x_n and x_{n-1}, each of dimension K x 1, can be calculated as

MI(x_n, x_{n-1}) = H(x_n) + H(x_{n-1}) - H(x_n, x_{n-1})   (1)

Here H(·) represents the entropy. Rényi entropy is used in this work to calculate the MI. The Rényi entropy of the pth order for the feature representation of the nth frame, x_n, can be calculated as [13]:

H(x_n) = \frac{p}{1-p} \log(\|x_n\|_p)   (2)

where p controls the sensitivity towards the shape of the probability distribution of the coefficients of x_n [14]. The value of p (0 < p < 1) is determined experimentally.

3.3. Segmentation using thresholding

The nature of the MI calculated from the feature representations makes the task of thresholding simple. Since the MI for background regions is almost constant, any drop in the value of MI signifies the presence of bird vocalization. The calculated MI is smoothed using a moving average filter and normalized to lie between 0 and 1. This results in the MI being close to 1 for background regions, as can be seen in Figure 1(c). Thus, a threshold t, just below one, is able to reliably discriminate call regions from other regions. (A code sketch of this pipeline is given at the end of this section.)

4. PERFORMANCE ANALYSIS

4.1. Datasets used

Experimental validation of the proposed algorithm is performed on three datasets. Two datasets consist of the recordings of Cassin's vireo, a North American songbird. The third dataset has the recordings of another song bird, the California thrasher. The first Cassin's vireo dataset (CV1) contains twelve audio recordings and is available at [15]. These audio recordings were collected over two months in 2010 and contain almost 800 bird vocalizations or song phrases of 65 different types. The second Cassin's vireo dataset (CV2) and the California thrasher recordings (CT) are available at [16]. Out of the available 459 recordings of Cassin's vireo, we have used only 100 recordings here. The recordings having the longest durations and the maximum number of vocalizations are chosen. These recordings contain almost 25000 Cassin's vireo vocalizations of 123 different types. Similarly, out of the available 698 California thrasher recordings, we have chosen 100 recordings having maximum durations and numbers of vocalizations. These 100 recordings contain about 15000 bird vocalizations. All the recordings from these three sources are field recordings and contain various types of background noise including human speech. These recordings are 16-bit mono WAV files having a sampling rate of 44.1 kHz.

To test the proposed algorithm in extreme conditions, background sounds are artificially added to the recordings of the CV1 dataset. Different types of background sounds, i.e. rain, waterfall, river and cicadas, at 0 dB, 5 dB, 10 dB, 15 dB and 20 dB SNR are added using the Filtering and Noise Adding Tool (FaNT) [17]. The sound files are downloaded from FreeSound [18].

4.2. Experiments

Two different experiments are performed to evaluate various aspects of the proposed algorithm. In the first experiment, we compare the performance of the proposed algorithm with existing unsupervised segmentation methods such as short-term energy (STE), spectral entropy (SE) [6], inverse spectral flatness (ISF) [19] and the two-pass unsupervised method (US) [11]. Apart from these methods, the performance is also compared with the supervised template-based method (TM) in [8], and with two variants of the proposed algorithm. The first variant uses non-negative matrix factorization (NMF) for learning the basis vectors instead of SVD. The second variant uses the normalized energy of the feature coefficients (CE) instead of Rényi entropy based mutual information. The second experiment is to demonstrate the general nature of the proposed method by testing on unseen vocalizations.

F1 score, defined as the harmonic mean of precision and recall, is used as the metric for evaluation, by comparing with the manually labeled ground truth. Both experiments use 10-fold cross-validation. During each fold, one audio recording was used for learning bases and the rest were used as test examples. The average results of these 10 folds are presented in Figure 3 and Table 1 for the first and second experiments, respectively.

A frame length of 20 ms with a 10 ms overlap, a Hamming window and 512 FFT points are used to compute the time-frequency representations of the input audio. For calculating the feature coefficients, we project the time-frequency representation of the test audio file on the top K = 5 left singular vectors. For calculating Rényi entropy, an order of p = 0.7 is used, and a moving average filter of length 10 is used to smooth the MI. A threshold t = 0.9999 is applied on the MI to segment the bird vocalizations. These values of K, p and t are chosen experimentally using a validation set. Two recordings from CV1 having the shortest durations are chosen to form the validation set. These audio recordings are not used for either learning basis vectors or testing in any of the experiments. After validation, the same values of K, p and t are used for all the experiments (including the noisy cases).

The parameter settings used in [11] are used for implementing STE, SE, ISF and US. Similarly, the parameter values discussed in [8] are used for implementing TM. The parameter setting used in the proposed algorithm is also used for implementing NMF and CE. However, in the NMF variant, 256 basis vectors are learned from the training data. This number is chosen experimentally.

4.2.1. Comparison of the proposed algorithm with other methods

The performance of the proposed algorithm is compared with other methods on dataset CV1 and on the artificially created noisy versions of CV1. In the proposed algorithm, the NMF variant and the CE variant, the basis vectors are learned from labeled bird vocalizations extracted from a single audio recording of CV1 and the rest of the recordings are used for testing during each of the 10 folds. For segmenting noisy versions of CV1, the bases learned from the clean audio recordings of CV1 are used. The segmentation performance of the proposed algorithm along with the other methods is summarized in Figure 3.

By analyzing Figure 3, it can be concluded that the performance of the proposed algorithm is better than all the other methods except TM in both noisy and clean conditions. However, TM is a template based method which may not be scalable and requires prior knowledge of all the vocalizations we wish to segment, in the form of templates. Also, the performance of the proposed method is not affected as vigorously as the performances of STE, SE, ISF and US in low SNR conditions. The NMF and CE variants also outperform these methods. The use of the top K left singular vectors in the proposed algorithm instead of the NMF based dictionary provided better segmentation in all conditions. The CE variant also gave good segmentation performance. This shows that a simple energy-based segmentation of the coefficients learnt from the basis vectors is good enough to segment the bird vocalizations. But since Rényi entropy based MI uses context information in terms of the previous and current frames, it provides slightly better segmentation.

In the first part of the second experiment, we learn basis vectors from CV1 and segment the audio recordings of CV2. In the second part, we use the basis vectors learnt from CV1 to segment the recordings of CT, i.e. we learn the basis vectors from the recordings of one species and segment the audio recordings of another. Again, 10-fold cross-validation was used. During each fold, one recording from CV1 was used to learn basis vectors and testing was performed on all the recordings of CV2 and CT. These results are tabulated in Table 1. The analysis of Table 1 shows that the proposed algorithm is able to segment CV2 recordings having 123 different types of Cassin's vireo vocalizations using the basis vectors learned from a single audio recording of CV1 which has 10 to 25 different types of Cassin's vireo vocalizations (the number of vocalization types in the training recording depends on the fold). Hence, the proposed method is able to segment vocalizations which are not used in learning the basis vectors.

Also, the proposed algorithm is able to segment recordings of the California thrasher using basis vectors learned from Cassin's vireo. The segmentation performance obtained in this cross-species experiment is also compared with the segmentation performance obtained by using the basis vectors learned from the vocalizations of the California thrasher. No significant difference is observed in the performances, which further supports the generic nature of the proposed algorithm. Table 1 also depicts the performance of the other methods.

[Figure 3: F1-scores of STE, SE, ISF, US, TM, NMF, CE and the proposed method on CV1 and its noisy versions, plotted against SNR (dB).]

Table 1:
Method | Training Dataset | Testing on CV2 | Testing on CT
STE    | -                | 0.53           | 0.55
SE     | -                | 0.56           | 0.57
ISF    | -                | 0.6            | 0.61
US     | -                | 0.64           | 0.63
TM     | CV2, CT          | 0.79           | 0.78
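The segmentation pipeline of sections 3.2 and 3.3 is compact enough to sketch end to end. The following is a minimal sketch under stated assumptions: plain NumPy/SciPy; `U` is a precomputed matrix of left singular vectors obtained by SVD on pooled training spectrograms; and, because this excerpt does not define the joint entropy H(x_n, x_{n-1}), the sketch substitutes the normalized element-wise product of the two frame vectors, which is a placeholder rather than the authors' definition. The paper's 512 FFT points are also relaxed to the frame length, since a 20 ms frame at 44.1 kHz is longer than 512 samples.

```python
# Minimal sketch of the MI-based segmentation of sections 3.2-3.3.
# Assumptions (not from the paper): NumPy/SciPy implementation, U precomputed
# via SVD, and a placeholder joint distribution inside mutual_information().
import numpy as np
from scipy.signal import stft

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def renyi_entropy(x, p=0.7):
    # Eq. (2): H(x) = (p / (1 - p)) * log ||x||_p = log(sum_j x_j^p) / (1 - p)
    return np.log(np.power(x, p).sum()) / (1.0 - p)

def mutual_information(x, y, p=0.7):
    joint = x * y
    joint /= joint.sum()   # assumed joint distribution; a placeholder choice
    # Eq. (1): MI = H(x_n) + H(x_{n-1}) - H(x_n, x_{n-1})
    return renyi_entropy(x, p) + renyi_entropy(y, p) - renyi_entropy(joint, p)

def segment(audio, fs, U, K=5, p=0.7, t=0.9999, smooth_len=10):
    # 20 ms Hamming-windowed frames with 10 ms overlap; nfft defaults to the
    # frame length here (the paper fixes 512 FFT points)
    _, _, S = stft(audio, fs=fs, window="hamming",
                   nperseg=int(0.02 * fs), noverlap=int(0.01 * fs))
    feats = np.abs(S).T @ U[:, :K]   # project frames on top-K left singular vectors
    X = np.array([softmax(f) for f in feats])
    mi = np.zeros(len(X))
    for n in range(1, len(X)):
        mi[n] = mutual_information(X[n], X[n - 1], p)
    mi[0] = mi[1]
    mi = np.convolve(mi, np.ones(smooth_len) / smooth_len, mode="same")
    mi = (mi - mi.min()) / (mi.max() - mi.min() + 1e-12)   # normalize to [0, 1]
    return mi < t   # True where the MI drops, i.e. likely bird vocalization
```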