


(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 12, No. 4, 2021

Birds Identification System using Deep Learning


Suleyman A. Al-Showarah1, Sohyb T. Al-qbailat2
Faculty of Information Technology
Mutah University, Karak
Jordan

Abstract—Identifying birds is a challenging task for bird watchers, due to the similarity of birds' forms/image backgrounds and to watchers' lack of experience. An image-based computer system is therefore needed to help birdwatchers identify birds. This study aims at investigating the use of deep learning for bird identification, using a convolutional neural network to extract features from images. The investigation was performed on a database of 4340 images collected by the authors from Jordan. Principal Component Analysis (PCA) was applied to layers 6 and 7, as well as to the statistical operations that merge the two layers: average, minimum, maximum and a combination of both layers. The datasets were investigated with the following classifiers: Artificial Neural Networks, K-Nearest Neighbor, Random Forest, Naïve Bayes and Decision Tree, and the metrics used for each classifier were accuracy, precision, recall, and F-measure. The results of the investigation include, but are not limited to, the following: applying PCA to the deep features not only reduces the dimensionality, and thereby significantly reduces the training/testing time, but also increases the identification accuracy, particularly when using the Artificial Neural Networks classifier. Based on the classifier results, Artificial Neural Networks showed the highest classification accuracy (70.9908), precision (0.718), recall (0.71) and F-measure (0.708) compared to the other classifiers.

Keywords—Birds identification; deep learning convolutional neural networks (CNN); VGG-19; principal component analysis (PCA)

I. INTRODUCTION

Many people are interested in observing and studying wildlife, especially birdwatching. Birdwatching helps preserve nature through the observation of birds' behavior and migration patterns. Identifying birds from images remains difficult for bird watchers, due to the similarity of birds' forms/image backgrounds and to watchers' lack of experience in this field [1].

As mentioned in [17], bird voices or videos were used in earlier techniques to predict species, but such techniques struggle to give accurate results because of background bird/animal sounds. Images are therefore the better choice for identifying bird species. To implement this technique, images of all bird species must be used to train a model; a deep learning algorithm then converts an uploaded image into grayscale format and applies the trained model to predict the best-matching species name for the uploaded image.

Also, in recent years, artificial intelligence has been applied to image-based birdwatching using different algorithms and methods [1][3][4][7][14], but this study differs from others in using the following operations based on VGG-19: combining fc6/fc7, the maximum of fc6/fc7, the minimum of fc6/fc7, and the average of fc6/fc7. Hence, the field of birdwatching needs more investigation to develop systems with new techniques that help to identify birds.

The database of images was collected from Jordan; according to the statistics in [13], the birds of Jordan comprise 434 species belonging to 66 families.

This study aims at investigating the use of deep learning for bird identification, using VGG-19 to extract features from images. To achieve this aim, the performance of different classifiers (KNN, Decision Tree, Random Forest, and ANN) was investigated on the collected, reliable database of images of the birds found in Jordan.

VGG-19 is considered one of the most important models of Convolutional Neural Networks (CNN), and CNN is considered the strongest deep learning technique for image identification [9].

The main reason for using VGG-19 is that it provides high precision by finding features with distinctive details in the image, such as differences in lighting conditions and other objects surrounding the birds [3]. Moreover, PCA can be employed as a dimensionality-reduction tool on these features, which helps reduce the number of features and thus shortens the training time.

The motivation to conduct this study is threefold: 1) the shortage of work in the field of identifying birds from images; 2) to the best of our knowledge, no study has been conducted using VGG-19 for identifying birds; 3) there is a shortage of databases available worldwide, apart from the two databases in [1][18]. This also applies to Jordan, where there is no image database for birds and no program has been developed to identify them. Based on the features extracted using VGG-19, the contribution of this study is to provide the research field with a comparison between the results of the different aforementioned classifiers.

This study is organized into six sections. Section II gives an overview of previous studies on all related subjects. Section III describes the database used. Section IV discusses the model design and the methodology of the experiment. Section V then discusses the experimental results, and finally Section VI presents the conclusion.


II. RELATED WORK

Machine learning (ML) represents a set of techniques that allow systems to discover the representations required for feature detection or classification from raw data. The performance of a classification system depends on the quality of the features. As this study can be categorized under the field of ML, the literature search covered studies in this area that belong to bird identification.

In the literature, a number of studies have been conducted in the field of identifying birds, using different algorithms and methods, as follows.

A number of studies identify birds from audio/video, such as [4][11][6][10], while other studies identify birds from images using AI algorithms [1][3][14], but not in the way conducted in this study, which uses different operations (MAX, MIN, AVERAGE, and combining the layers fc6/fc7) based on the VGG-19 algorithm.

In the field of image-based bird databases and bird identification systems, the researchers in [19] conducted a study on data collected mostly from North America covering 200 bird species, which they called Caltech-UCSD Birds 200 (CUB-200). They conducted their study with two simple features: image sizes and color histograms. For image sizes, they represented each image by its width and height in pixels; for the color histograms, they used 10 bins per channel and applied Principal Component Analysis. Their results showed how the performance of the NN classifier degrades as the number of classes in the dataset increases, as in [18]: the performance of the image-size features is close to chance, at 0.6% for the 200 classes, while the color-histogram features increase the performance to 1.7%. In another example from this field, the researchers in [18] increased the number of images to 11788 (it was 6033 in [19]) and used RGB color histograms and histograms of vector-quantized SIFT descriptors with a linear SVM. The classification accuracy obtained in their study is 17.3%.

Also in the field of bird identification systems, the researchers in [14] proposed a new feature to distinguish bird types: the ratio of the distance from the eye to the beak root, and the beak width. This feature was integrated into a decision tree and then into an SVM. The proposal was applied to the CUB-200-2011 dataset mentioned in [18], and the correct classification rate achieved is 84%.

Another study was conducted on bird identification with a database collected in India by the researchers, available in [1]. Their database consisted of 300-400 different images covering a number of bird species. The algorithm used to extract image features was AlexNet, followed by classification with an SVM classifier; the resulting accuracy is 85%.

The researchers in [11] used multiple pre-trained CNN networks (AlexNet, VGG19 and GoogleNet) on the bird dataset called Caltech-UCSD Birds-200-2011. Based on an approach that combines the aforementioned networks, the results showed improved accuracy, reaching 81.91% when applied to the Caltech-UCSD Birds-200-2011 dataset, compared to the other datasets used in the same study.

Another study, conducted by [4] in the field of image-based bird databases and bird identification systems, aimed to classify birds in flight from video clips. They collected approximately 952 clips and extracted about 161,907 frame photos of 13 bird species. To improve accuracy, the researchers used two kinds of features, appearance and motion features, and then compared their proposed method with the classifiers VGG and MobileNet. The proposed method achieved a 90% correct classification rate when using the Random Forest classifier.

In the field of bird identification systems, the researchers in [3] applied different methods: 1) softmax regression using manually designed features on the Caltech-UCSD Birds-200 dataset [19]; 2) a multi-class SVM applied to HOG and RGB features extracted from the images; 3) a CNN applied with a transfer learning algorithm to classify birds. When the three methods were compared, the CNN achieved 46%.

In the next section, the database content, number of images, source of images, and the challenges of classifying the images are explained.

III. DATABASE DESIGN

The database of bird images was collected from Jordan and consists of 4340 images of 434 bird species. The database images were obtained from scientific sources and were approved by the Jordanian Bird Watching Association based on their scientific names [13].

The images have different backgrounds: some were taken in shadow conditions or against bright backgrounds, and some have other objects in the image as background. This added a considerable challenge for the researchers in extracting features and providing high accuracy.

IV. PROPOSED METHOD

This section presents the procedures used by the proposed method for identifying birds using VGG-19. Fig. 1 shows the proposed model.


Fig. 1. Proposed Model.

The following steps explain the proposed model of this study (a code sketch of Steps 1-3 follows Step 4 below):

Step 1): The feature vectors are extracted from the images automatically using MATLAB with the pretrained VGG-19, to build the datasets of feature vectors fc6 and fc7. Each dataset (e.g. fc6) contains 4096 columns (representing the feature vector) and 4340 rows (representing the number of samples/images).

Step 2): The statistical operations (min, max, average, and combining the layers) were performed on the original/pure fc6 and fc7 layers to obtain new datasets for the next stage (Step 3) and for the classifiers. The statistical operations are as follows:

• Max: take the larger of the two values in fc6 and fc7 and put it in a new group.

• Min: take the smaller of the two values in fc6 and fc7 and put it in a new group.

• Average: take the average of the two values in fc6 and fc7 and put it in a new group.

• Combined together: concatenate the first group (4096) with the second group (4096), yielding a new group that contains 8192 features.

Step 3): PCA is applied to the original/pure fc6 and fc7, and to the datasets obtained from the previous stage (Step 2), to produce new datasets. The feature vector obtained using the pre-trained VGG-19 is very large (4096), so PCA was implemented to reduce the number of features. A set of retained-variance percentages was used: 95%, 97% and 99% of the variance of the data (the 4096 features).

Step 4): The results were produced by applying the set of classifiers to the datasets obtained from Steps 2 and 3.
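As an illustration of Steps 1-3, here is a minimal Python sketch. The paper used MATLAB's pretrained VGG-19, so this torchvision/scikit-learn version is only an analogue, not the authors' code; whether fc6/fc7 were taken before or after the ReLU is not stated in the paper, and the helper names (fc_features, merged_datasets, reduce_dim) are hypothetical.

```python
import numpy as np
import torch
from torchvision import models, transforms
from PIL import Image
from sklearn.decomposition import PCA

# Pretrained VGG-19; in torchvision's layout, fc6 and fc7 are the first two
# fully connected layers of the classifier head (indices 0 and 3).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def fc_features(img_path):
    """Step 1: return the 4096-dim fc6 and fc7 activations for one image.
    Pre-ReLU activations are an assumption; the paper does not say which
    variant its MATLAB extraction produced."""
    x = preprocess(Image.open(img_path).convert("RGB")).unsqueeze(0)
    x = vgg.avgpool(vgg.features(x)).flatten(1)
    fc6 = vgg.classifier[0](x)                  # Linear 25088 -> 4096
    fc7 = vgg.classifier[3](torch.relu(fc6))    # Linear 4096 -> 4096
    return fc6.squeeze(0).numpy(), fc7.squeeze(0).numpy()

def merged_datasets(FC6, FC7):
    """Step 2: the four merged datasets (rows = images, columns = features)."""
    return {
        "max": np.maximum(FC6, FC7),
        "min": np.minimum(FC6, FC7),
        "avg": (FC6 + FC7) / 2.0,
        "combine": np.hstack([FC6, FC7]),       # 8192 features
    }

def reduce_dim(X, variance=0.95):
    """Step 3: PCA keeping 95%, 97% or 99% of the variance."""
    return PCA(n_components=variance).fit_transform(X)

# Usage: FC6, FC7 = map(np.vstack, zip(*(fc_features(p) for p in paths)))
```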
V. EXPERIMENTAL RESULTS AND DISCUSSIONS

This section presents the performance evaluation results for the study dataset, including the accuracy, F-measure, recall, precision and training time for each of the following classifiers: 1KNN, 3KNN, 5KNN, ANN, Naïve Bayes, Random Forest and Decision Tree. The results of this study are presented as follows.

A. Results of Both Original/Pure fc6/fc7 Datasets Separately

Table I shows the results for both the original fc6 and fc7 datasets. Naïve Bayes achieved the highest accuracies for fc6 and fc7, at (59.0092) and (56.106). As for the time spent on testing and training, Decision Tree took the most time (1406.69s), while the KNNs took the least (0s) compared to the other classifiers. This is because KNN has no training model; a test example is compared directly to the examples in the training set, which is why it is slow at testing, particularly with a large number of training examples [8][16]. These results match the results in [5][12].

B. Results of the Statistical Operations on fc6 and fc7 Datasets

This section shows the results for the three datasets obtained by applying the statistical operations (average, maximum, minimum) between the fc6 and fc7 layers.

Table II shows the results of the statistical operations on the fc6/fc7 datasets, where Naïve Bayes achieved the highest accuracies for AVERAGE, MAX, and MIN, at (57.30), (60.99) and (57.60) respectively. Not only did Naïve Bayes score acceptable accuracy, F-measure, recall, and precision that outperformed all classifiers, it also did so with acceptable training time. This result disagrees with other studies [2][15].

C. Results of Combine between (Original fc6/fc7) Datasets

A new dataset, called combine, was obtained by combining fc6 (4096) and fc7 (4096), giving an 8192-dimensional feature vector; the corresponding results follow.
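To make the evaluation protocol behind Tables I to IX concrete, the following scikit-learn sketch trains each classifier and reports the four metrics plus training time. The paper does not specify its toolkit settings (its Weka reference [5] suggests Weka defaults were used), so the train/test split, the classifier hyperparameters, and the weighted multi-class averaging are assumptions.

```python
import time
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

classifiers = {
    "1KNN": KNeighborsClassifier(n_neighbors=1),
    "3KNN": KNeighborsClassifier(n_neighbors=3),
    "5KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "ANN": MLPClassifier(max_iter=500),
}

def evaluate(X, y):
    """Train each classifier and report the four metrics plus training time."""
    Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
    for name, clf in classifiers.items():
        t0 = time.time()
        clf.fit(Xtr, ytr)
        t = time.time() - t0                  # training time in seconds
        pred = clf.predict(Xte)
        acc = 100 * accuracy_score(yte, pred)
        pre = precision_score(yte, pred, average="weighted", zero_division=0)
        rec = recall_score(yte, pred, average="weighted")
        f1 = f1_score(yte, pred, average="weighted")
        print(f"{name}: acc={acc:.4f} prec={pre:.3f} "
              f"rec={rec:.3f} F={f1:.3f} train={t:.2f}s")
```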


Table III shows the bird identification results, where Naïve Bayes achieved the highest accuracy for combine (59.4009). The second-highest accuracy is that of 1KNN, at 50.2074. As for the time spent on testing and training, Decision Tree took the most time (2484.01s), while the KNNs took the least (0s) compared with the other classifiers.

D. Results of Both Original/Pure fc6/fc7 after Applying PCA

Tables IV and V show the identification results for each classifier after applying PCA (95%, 97%, and 99%).

In Table IV, the ANN classifier appears, although it was not used in the previous Tables I to III. This can be explained as follows: ANN is the best classifier for deep features if and only if it is provided with a smaller number of deep features; otherwise, i.e. if it is applied to the original/pure deep features obtained from VGG-19 layer 6 or 7, or any merging of the two, the training time would be unacceptably long [2][15][11].

TABLE I. IDENTIFICATION RESULT OF BOTH ORIGINAL/PURE FC6/FC7 DATASETS

Classifiers Accuracy Precision Recall F-measure Training Time (seconds)


fc6 results
1KNN 47.0507 0.542 0.471 0.479 0
3KNN 41.7512 0.506 0.418 0.421 0
5KNN 44.3318 0.535 0.443 0.451 0
Naïve Bayes 59.0092 0.642 0.59 0.601 9.15
Random forest 14.447 0.227 0.144 0.153 35.93
Decision Tree 12.8802 0.133 0.129 0.127 1438.85
fc7 results
1KNN 50.8065 0.552 0.508 0.511 0
3KNN 46.1751 0.544 0.462 0.463 0
5KNN 47.4885 0.556 0.475 0.48 0.02
Naïve Bayes 56.106 0.609 0.561 0.571 8.23
Random forest 21.2442 0.295 0.212 0.22 60.53
Decision Tree 17.5115 0.185 0.175 0.174 1406.69

TABLE II. IDENTIFICATION RESULTS OF AVERAGE, MAXIMUM AND MINIMUM (FC6⊕FC7)

Classifiers Accuracy Precision Recall f-measure Training Time (sec)


Avg results
1KNN 47.0276 0.536 0.47 0.478 0
3KNN 41.5668 0.497 0.416 0.418 0
5KNN 43.8479 0.523 0.438 0.444 0
Naïve Bayes 57.3041 0.624 0.573 0.584 7.11
Random forest 14.424 0.22 0.144 0.148 44.96
Decision tree 13.3641 0.139 0.134 0.131 1278.55
Max results
1KNN 49.6313 0.577 0.496 0.505 0
3KNN 44.9309 0.555 0.449 0.456 0.11
5KNN 47.5806 0.583 0.476 0.487 0
Naïve Bayes 60.9908 0.67 0.61 0.622 7.08
Random forest 16.8433 0.265 0.168 0.176 31.28
Decision tree 14.9078 0.154 0.149 0.148 1467.56
Min results
1KNN 44.9309 0.513 0.449 0.456 0
3KNN 39.2627 0.491 0.393 0.396 0.02
5KNN 40.1152 0.494 0.401 0.408 0
Naïve Bayes 57.6037 0.632 0.576 0.586 7.19
Random forest 12.9493 0.204 0.129 0.133 53.8
Decision tree 11.0829 0.118 0.111 0.11 1198.16


TABLE III. IDENTIFICATION RESULTS OF COMBINE BETWEEN (ORIGINAL FC6/ FC7) DATASET

Classifiers Accuracy Precision Recall f-measure Training Time (sec)


Combine results
1KNN 50.2074 0.555 0.502 0.506 0.01
3KNN 45.2995 0.539 0.453 0.455 0.74
5KNN 47.6728 0.563 0.477 0.482 0
Naïve Bayes 59.4009 0.64 0.594 0.603 14.68
Random forest 18.1797 0.256 0.182 0.185 47.38
Decision tree 16.129 0.166 0.161 0.159 2484.01

TABLE IV. IDENTIFICATION RESULTS OF ORIGINAL/PURE FC6 AFTER APPLYING PCA (95%,97%,99%)

Classifiers Accuracy Precision Recall F-measure Training Time (sec)

fc6 (PCA 95%) results


ANN 68.8018 0.695 0.688 0.685 23378.32
1KNN 27.6959 0.52 0.277 0.317 0
3KNN 15.2074 0.391 0.152 0.185 0
5KNN 15.2765 0.416 0.153 0.19 0
Naïve Bayes 52.0737 0.631 0.521 0.549 0.41
Random forest 6.7281 0.13 0.067 0.067 20.99
Decision tree 14.5392 0.153 0.145 0.144 107.12
fc6 (PCA 97%) results
ANN 70 0.658 0.65 0.648 19022.88
1KNN 19.2857 0.49 0.193 0.237 0
3KNN 8.1797 0.278 0.082 0.104 0
5KNN 8.1567 0.292 0.082 0.107 0
Naïve Bayes 48.318 0.622 0.483 0.52 1.11
Random forest 3.8018 0.085 0.038 0.038 27.03
Decision tree 14.1014 0.154 0.141 0.141 188.18
fc6 (PCA 99%) results
ANN 62.3733 0.642 0.624 0.623 48850.24
1KNN 8.6636 0.325 0.087 0.113 0
3KNN 1.8433 0.072 0.018 0.022 0
5KNN 1.9355 0.079 0.019 0.023 0
Naïve Bayes 37.9032 0.581 0.379 0.428 1.44
Random forest 2.0507 0.04 0.021 0.02 28.25
Decision tree 13.1567 0.143 0.132 0.132 471.77

Applying PCA reduced the training time for fc6 for all classifiers in Table IV, compared to the training times in Tables I to III before applying PCA, especially for Random Forest and Naïve Bayes. The highest accuracies after applying PCA at (95%, 97% and 99%) are in favor of ANN, with (68.8018, 70 and 62.3733%) respectively, which can be attributed to the reduced feature vector.

It is worth mentioning that the ANN classifier was not used on any sets except those obtained after applying PCA, because of its otherwise unacceptable training time. This matches previous studies, which reported that the training time of ANN is large compared with other classifiers [2][15].

Table V shows the bird identification results for fc7, where the highest accuracies after applying PCA at (95%, 97% and 99%) are in favor of ANN, with (64.977, 65.2995 and 67.9493) respectively.

The second-highest accuracies after applying PCA at all percentages (95%, 97% and 99%) belong to Naïve Bayes, which achieved accuracies of (58.3641, 56.9585 and 56.3825%) respectively.

E. Results of the Statistical Operations on (fc6 and fc7) after Applying PCA

This section presents the identification results of the statistical operations (average, maximum and minimum) between fc6 and fc7 after applying PCA (95%, 97%, 99%), as well as the training time for each classifier, as follows:


Table VI shows the bird identification results for the average of (fc6 and fc7), where the highest accuracies after applying PCA at (95%, 97% and 99%) are in favor of ANN, with (69.5622, 69.9078 and 65.5069) respectively. The second-highest accuracies at all percentages (95%, 97% and 99%) belong to Naïve Bayes, which achieved accuracies of (53.3871, 49.7926 and 39.8157%) respectively. As for the time spent on testing and training, ANN took the most time (58379.22s), and PCA 95% took less time than PCA 97% and PCA 99%.

Table VII shows the bird identification results for the maximum of (fc6 and fc7), where the highest accuracy after applying PCA at (95%) is in favor of ANN, with (66.9816). Note that ANN results appear only for PCA (95%), not for (97% and 99%). This is because of the large number of features at PCA (97% and 99%), which reached (1428 and 2117) features respectively; consequently there are no ANN results for those settings, due to ANN's unacceptable training time (it takes days to produce results). As for the time spent on testing and training, ANN took the most time (54151.88s).

Table VIII shows the bird identification results for the minimum of (fc6 and fc7), where the highest accuracy after applying PCA at (95%) is in favor of ANN, with (70.8295). Note that the ANN result appears only for PCA (95%), not for (97% and 99%). This is because of the large number of features at PCA (97% and 99%), which reached (1205 and 1910) features respectively, and because of ANN's unacceptable training time (it takes days to produce results). Naïve Bayes achieved accuracies after applying PCA at all percentages (95%, 97% and 99%) of (48.7327, 44.1014 and 35%) respectively. As for the time spent on testing and training, ANN took the most time (42677.02s).

F. Results of Combining Feature Vector after Applying PCA

This section shows the results of combining fc6 (4096) and fc7 (4096), which gives 8192 features; this number of features was reduced after applying PCA (95%, 97%, 99%) to (250, 440 and 1080) features, respectively. The combine results are as follows:

TABLE V. IDENTIFICATION RESULTS OF ORIGINAL/PURE FC7 AFTER APPLYING PCA (95%,97%,99%)

Classifiers Accuracy Precision Recall F measure Training Time (sec)

fc7 (PCA 95%) results


ANN 64.977 0.658 0.65 0.648 12295.32
1KNN 41.4055 0.509 0.414 0.427 0
3KNN 34.9078 0.502 0.349 0.365 0
5KNN 36.7051 0.52 0.367 0.386 0
Naïve Bayes 58.3641 0.643 0.584 0.598 0.06
Random forest 15.8986 0.24 0.159 0.167 15.5
Decision tree 17.0737 0.177 0.171 0.169 40.36
fc7 (PCA 97%) results
ANN 65.2995 0.66 0.653 0.651 15658
1KNN 38.6175 0.532 0.386 0.409 0.01
3KNN 29.7926 0.507 0.298 0.326 0
5KNN 30.4147 0.52 0.304 0.337 0
Naïve Bayes 56.9585 0.646 0.57 0.588 0.11
Random forest 12.2811 0.211 0.123 0.13 17.32
Decision tree 16.4977 0.173 0.165 0.164 71.95
fc7 (PCA 99%) results
ANN 67.9493 0.686 0.679 0.676 23197.76
1KNN 27.3272 0.565 0.273 0.324 0.01
3KNN 15.2995 0.45 0.153 0.195 0
5KNN 15.4147 0.464 0.154 0.198 0
Naïve Bayes 56.3825 0.678 0.564 0.592 0.53
Random forest 4.3779 0.088 0.044 0.044 22.7
Decision tree 14.7926 0.151 0.148 0.146 137.95


TABLE VI. IDENTIFICATION RESULTS OF AVERAGE BETWEEN (FC6 AND FC7) AFTER APPLYING OF PCA (95%,97%,99%)

Classifiers Accuracy Precision Recall F measure Training Time (sec)


AVG (PCA95%) results
ANN 69.5622 0.703 0.696 0.693 16452.89
1KNN 29.1705 0.523 0.292 0.331 0
3KNN 16.6359 0.418 0.166 0.202 0.02
5KNN 16.659 0.429 0.167 0.205 0
Naïve Bayes 53.3871 0.635 0.534 0.56 1.35
Random forest 5.7143 0.111 0.057 0.057 15.99
Decision tree 15.2304 0.165 0.152 0.152 85.88
AVG (PCA 97%) results
ANN 69.9078 0.707 0.699 0.696 24498.83
1KNN 20.6221 0.503 0.206 0.254 0
3KNN 8.5484 0.282 0.085 0.109 0
5KNN 8.8249 0.316 0.088 0.115 0
Naïve Bayes 49.7926 0.631 0.498 0.53 0.48
Random forest 3.9862 0.082 0.04 0.039 22.49
Decision tree 14.0553 0.149 0.141 0.14 138.67
AVG (PCA 99%) results
ANN 65.5069 0.666 0.655 0.652 58379.22
1KNN 9.4009 0.34 0.094 0.121 0
3KNN 2.1889 0.089 0.022 0.027 0
5KNN 2.2811 0.094 0.023 0.028 0
Naïve Bayes 39.8157 0.607 0.398 0.45 2.2
Random forest 2.0968 0.04 0.021 0.019 26.89
Decision tree 13.9171 0.149 0.139 0.139 314.45

TABLE VII. IDENTIFICATION RESULTS OF MAXIMUM BETWEEN (FC6 AND FC7) AFTER APPLYING OF PCA (95%,97%,99%)

Classifiers Accuracy Precision Recall F measure Training Time (sec)


MAX (PCA95%) results
ANN 66.9816 0.68 0.67 0.668 54151.88
1KNN 13.2488 0.432 0.132 0.172 0
3KNN 4.7235 0.169 0.047 0.059 0
5KNN 4.447 0.161 0.044 0.055 0
Naïve Bayes 46.5207 0.658 0.465 0.517 1.75
Random forest 2.9032 0.071 0.029 0.032 30.5
Decision tree 15.0922 0.16 0.151 0.15 423.62
MAX (PCA 97%) results
ANN - - - - -
1KNN 7.8341 0.32 0.078 0.106 0
3KNN 2.0507 0.07 0.021 0.022 0
5KNN 2.3272 0.073 0.023 0.026 0
Naïve Bayes 41.682 0.648 0.417 0.478 2.46
Random forest 2.6959 0.07 0.027 0.028 33
Decision tree 14.8618 0.158 0.149 0.148 556.78
MAX (PCA 99%) results
ANN - - - - -
1KNN 3.2258 0.132 0.032 0.043 0
3KNN 0.9908 0.026 0.01 0.008 0
5KNN 1.0599 0.026 0.011 0.009 0
Naïve Bayes 32.1889 0.585 0.322 0.38 3.52
Random forest 1.7512 0.043 0.018 0.02 35.19
Decision tree 14.6774 0.156 0.147 0.146 1517.2


Table IX shows the bird identification results for the combination of fc6 and fc7, where the highest accuracies after applying PCA at (95%, 97% and 99%) are in favor of ANN, with (69.5392, 70.9908 and 67.9263) respectively. The second-highest accuracies at all percentages (95%, 97% and 99%) belong to Naïve Bayes, which achieved accuracies of (57.235, 54.1475 and 43.7558%) respectively. As for the time spent on testing and training, ANN took the most time (56279.29s).

Comparing the proposed work with previous researchers' works, Table X compares the results of the proposed approach with three similar approaches for bird identification.

Table X confirms that the output of our proposal can be considered an interesting study compared to previous research, for several reasons:

1) Some previous studies were conducted on a small number of bird categories, such as [4] and [7], which used 13 and 16 categories respectively, compared to the 434 used in this study.

2) Some other previous studies were conducted on datasets containing a large number of training images (examples), such as [4], [3] and [14], which used 161907, 11788 and 11788 examples respectively, compared to this study, which contained few images (4340 examples). A small number of images (examples) per bird usually leads to low accuracy compared to a large number of examples, but in contrast it did not here. This gives more confidence in the results of this study.

3) There were studies conducted for identifying birds using different algorithms and methods based on audio/video, such as [4][11][6][10], while other studies identified birds from images using AI algorithms [1][3][17]. This differs from what was conducted in this study, which used deep learning algorithms and different statistical operations (MAX, MIN, AVERAGE, and combining the layers fc6/fc7) based on the VGG-19 algorithm.

4) This study was conducted with different methods: combining fc6/fc7, the max of fc6/fc7, the min of fc6/fc7, and the average of fc6/fc7, based on VGG-19.
TABLE VIII. IDENTIFICATION RESULTS OF MINIMUM BETWEEN (FC6 AND FC7) AFTER APPLYING OF PCA (95%,97%,99%)

Classifiers Accuracy Precision Recall F measure Training Time (sec)


MIN (PCA 95%) results
ANN 70.8295 0.715 0.708 0.993 42677.02
1KNN 17.6037 0.515 0.176 0.223 0
3KNN 6.106 0.234 0.061 0.078 0
5KNN 6.1982 0.238 0.062 0.078 0
Naïve Bayes 48.7327 0.661 0.487 0.539 1.2
Random forest 3.5023 0.093 0.035 0.038 33.74
Decision tree 13.6636 0.153 0.142 0.142 2829.83
MIN (PCA 97%) results
ANN - - - - -
1KNN 9.5853 0.371 0.096 0.129 0
3KNN 2.5115 0.079 0.025 0.027 0
5KNN 2.6267 0.096 0.026 0.029 0
Naïve Bayes 44.1014 0.652 0.441 0.501 2.5
Random forest 2.5115 0.057 0.025 0.025 29.52
Decision tree 13.6636 0.147 0.137 0.136 1007.75
MIN (PCA 99%) results
ANN - - - - -
1KNN 3.871 0.176 0.039 0.051 0.01
3KNN 0.8756 0.024 0.009 0.007 0
5KNN 0.7834 0.019 0.008 0.007 0
Naïve Bayes 35 0.615 0.35 0.414 3.38
Random forest 2.0507 0.046 0.021 0.021 31.93
Decision tree 13.341 0.144 0.133 0.134 547.58


TABLE IX. IDENTIFICATION RESULTS OF COMBINE ON (FC6 AND FC7) AFTER APPLYING PCA (95%, 97%, 99%)

Classifiers Accuracy Precision Recall F measure Training Time (sec)


Combine (PCA 95%) results
ANN 69.5392 0.703 0.695 0.693 20103.19
1KNN 35 0.544 0.35 0.382 0
3KNN 24.3088 0.459 0.243 0.275 0
5KNN 24.1705 0.487 0.242 0.28 0
Naïve Bayes 57.235 0.653 0.572 0.592 0.89
Random forest 7.3041 0.166 0.073 0.08 167.7
Decision tree 16.0599 0.167 0.161 0.159 96.93
Combine (PCA 97%) results
ANN 70.9908 0.718 0.71 0.708 24033.56
1KNN 27.1659 0.547 0.272 0.319 0
3KNN 13.7558 0.397 0.138 0.172 0
5KNN 14.5161 0.436 0.145 0.186 0
Naïve Bayes 54.1475 0.654 0.541 0.568 0.66
Random forest 5.1152 0.115 0.051 0.054 26.12
Decision tree 15.4839 0.161 0.155 0.153 128.03
Combine (PCA 99%) results
ANN 67.9263 0.685 0.679 0.675 56279.29
1KNN 10.8065 0.39 0.108 0.142 0.02
3KNN 2.9493 0.106 0.029 0.035 0
5KNN 2.9493 0.112 0.029 0.036 0
Naïve Bayes 43.7558 0.647 0.438 0.49 3.59
Random forest 2.3733 0.038 0.024 0.022 113.17
Decision tree 14.8618 0.153 0.149 0.147 403.92

TABLE X. COMPARISON BETWEEN PROPOSAL OF THIS STUDY AND RELATED WORKS

Method | Dataset (name) | # of examples | # of categories | Result
[4] CNN + Random Forest | Frames of videos | 161907 | 13 | ACC = 90%
[3] Regularized Softmax Reg. w/ Broad Classes | CUB-200-2011 | 11788 | 200 | ACC = 70%
[14] HSV + SVM | CUB-200-2011 | 11788 | 200 | ACC = 83.87%
[7] Mask R-CNN + Ensemble Model | CVIP 2018 Bird Species challenge | 150 | 16 | Precision = 56.58
Proposal of this study: (Combine original (fc6+fc7) after PCA) + ANN | JOP (new dataset) | 4340 | 434 | ACC = 71%
VI. CONCLUSION

This study investigated the use of deep learning for a bird identification system, using VGG-19 to extract features from images. VGG-19 is one of the pre-trained convolutional neural network (CNN) models used for image identification, and it was used in this paper to extract the features from the bird images.

The database of this study contains 4340 images of 434 bird species, obtained from scientific sources and approved by the Jordanian Bird Watching Association based on their scientific names.

In this study, two layers in the structure of VGG-19 were used to get the features: layer 6 (called fc6) and layer 7 (called fc7); each layer yields 4096 features.

Since the size of the deep feature vector obtained from VGG-19's layers (6 or 7) is very large (4096), we opted for Principal Component Analysis (PCA) to perform the dimensionality reduction. Moreover, more feature vectors, called statistical operations, were created to generate more datasets from fc6 and fc7, using the average, minimum, maximum and a combination of both layers.

The created datasets (i.e. with PCA and without PCA), as well as the datasets created from the statistical operations, were used as input for classification using various machine learning classifiers, including Artificial Neural Networks (ANN), K-Nearest Neighbor (KNN), Random Forest, Naïve Bayes and Decision Tree.


The results of the investigation in this study include, but are not limited to, the following: applying PCA to the deep features not only reduces the dimensionality, and thereby significantly reduces the training/testing time, but also increases the identification accuracy, particularly when using the ANN classifier. Based on the classifier results, ANN showed the highest classification accuracy (70.9908), precision (0.718), recall (0.71) and F-measure (0.708) compared to the other classifiers.

It is recommended to conduct more investigation to improve the accuracy results and to reduce the training time using different algorithms.

REFERENCES
[1] Tayal, Madhuri, Atharva Mangrulkar, Purvashree Waldey, and Chitra Dangra. 2018. "Bird Identification by Image Recognition." Helix 8(6): 4349-4352.
[2] Albustanji, Abeer. 2019. "Veiled-Face Recognition Using Deep Learning." Mutah University.
[3] Alter, Anne L., and Karen M. Wang. 2017. "An Exploration of Computer Vision Techniques for Bird Species Classification."
[4] Atanbori, John, et al. 2018. "Classification of Bird Species from Video Using Appearance and Motion Features." Ecological Informatics 48: 12-23.
[5] Brownlee, Jason. 2016. "How To Use Classification Machine Learning Algorithms in Weka." Retrieved from https://machinelearningmastery.com/use-classification-machine-learning-algorithms-weka/.
[6] Cai, J., Ee, D., Pham, B., Roe, P., and Zhang, J. 2007. "Sensor Network for the Monitoring of Ecosystem: Bird Species Recognition." In 2007 3rd International Conference on Intelligent Sensors, Sensor Networks and Information: 293-298. IEEE.
[7] Kumar, A., and Das, S. D. 2018. "Bird Species Classification Using Transfer Learning with Multistage Training." In Workshop on Computer Vision Applications: 28-38. Springer, Singapore.
[8] Hassanat, A. 2018. "Furthest-Pair-Based Binary Search Tree for Speeding Big Data Classification Using K-Nearest Neighbors." Big Data 6(3): 225-235.
[9] Hijazi, Samer, Rishi Kumar, and Chris Rowen. 2015. "Using Convolutional Neural Networks for Image Recognition." IP Group, Cadence. Retrieved from https://ip.cadence.com/uploads/901/cnn_wp-pdf.
[10] Incze, A., Jancsó, H. B., Szilágyi, Z., Farkas, A., and Sulyok, C. 2018. "Bird Sound Recognition Using a Convolutional Neural Network." In 2018 IEEE 16th International Symposium on Intelligent Systems and Informatics (SISY): 295-300. IEEE.
[11] Korzh, Oxana, Mikel Joaristi, and Edoardo Serra. 2018. "Convolutional Neural Network Ensemble Fine-Tuning for Extended Transfer." In International Conference on Big Data: 110-123. Retrieved from http://dx.doi.org/10.1007/978-3-319-94301-5_9.
[12] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems 25: 1-9.
[13] The Royal Society for the Conservation of Nature. 2017. "Birdwatching in Jordan." Retrieved from https://migratorysoaringbirds.birdlife.org/sites/default/files/jordan_birding_brochure.pdf.
[14] Qiao, Baowen, Zuofeng Zhou, Hongtao Yang, and Jianzhong Cao. 2017. "Bird Species Recognition Based on SVM Classifier and Decision Tree." In 2017 First International Conference on Electronics Instrumentation & Information Systems: 1-4.
[15] Sarayrah, Bayan Mahmoud. 2019. "Finger Knuckle Print Recognition Using Deep Learning." Mutah University.
[16] Al-Showarah, S., et al. 2020. "The Effect of Age and Screen Sizes on the Usability of Smartphones Based on Handwriting of English Words on the Touchscreen." Mu'tah Lil-Buhuth wad-Dirasat, Natural and Applied Sciences Series 35(1). ISSN: 1022-6812.
[17] Triveni, G., Malleswari, G. N., Sree, K. N. S., and Ramya, M. 2020. "Bird Species Identification Using Deep Fuzzy Neural Network." Int. J. Res. Appl. Sci. Eng. Technol. (IJRASET) 8: 1214-1219.
[18] Wah, C., S. Branson, P. Welinder, P. Perona, and S. Belongie. 2011. "The Caltech-UCSD Birds-200-2011 Dataset." Technical Report CNS-TR-2011-001, California Institute of Technology.
[19] Welinder, Peter, et al. 2010. "Caltech-UCSD Birds 200." Technical Report CNS-TR-2010-001, California Institute of Technology.

Local compressed convex spectral embedding for bird species identification

Anshul Thakur,a) Vinayak Abrol, Pulkit Sharma, and Padmanabhan Rajan
School of Computing and Electrical Engineering, IIT Mandi, Mandi, Himachal Pradesh-175005, India

Citation: The Journal of the Acoustical Society of America 143, 3819 (2018); doi: 10.1121/1.5042241
View online: https://doi.org/10.1121/1.5042241
View Table of Contents: http://asa.scitation.org/toc/jas/143/6
Published by the Acoustical Society of America

(Received 30 November 2017; revised 14 April 2018; accepted 18 April 2018; published online 29 June 2018)

a) Electronic mail: anshul_thakur@students.iitmandi.ac.in
This paper proposes a multi-layer alternating sparse-dense framework for bird species identification. The framework takes audio recordings of bird vocalizations and produces compressed convex spectral embeddings (CCSE). Temporal and frequency modulations in bird vocalizations are ensnared by concatenating frames of the spectrogram, resulting in a high dimensional and highly sparse super-frame-based representation. Random projections are then used to compress these super-frames. Class-specific archetypal analysis is employed on the compressed super-frames for acoustic modeling, obtaining the convex-sparse CCSE representation. This representation efficiently captures species-specific discriminative information. However, many bird species exhibit high intra-species variations in their vocalizations, making it hard to appropriately model the whole repertoire of vocalizations using only one dictionary of archetypes. To overcome this, each class is clustered using Gaussian mixture models (GMM), and for each cluster, one dictionary of archetypes is learned. To calculate CCSE for any compressed super-frame, one dictionary from each class is chosen using the responsibilities of individual GMM components. The CCSE obtained using this GMM-archetypal analysis framework is referred to as local CCSE. Experimental results corroborate that local CCSE either outperforms or exhibits comparable performance to existing methods, including support vector machines powered by dynamic kernels and deep neural networks.
© 2018 Acoustical Society of America. https://doi.org/10.1121/1.5042241

[PG] Pages: 3819-3828

I. INTRODUCTION

Birds play many important roles in upholding ecological balance, from maintaining the forest cover by seed dispersal and pollination, to occupying various levels in the food chain.1 However, due to human-induced climate change and habitat destruction, many bird species are facing the threat of population decline.2 This has led to several conservation efforts, of which surveying and monitoring are integral components. These include maintaining records of avian diversity and the populations of various species in a particular area of interest.3 The manual surveying of birds in their natural habitat can be difficult, as birds occupy a wide range of habitats. Moreover, it is time-consuming and expensive, and experienced bird watchers are required. Thus, there is a need to develop automatic methods for surveying birds in their natural habitat.

Acoustic communication in birds is very rich,4 hence the presence of many bird species can be detected by analyzing their sounds or vocalizations. This makes acoustic monitoring a convenient and passive method to monitor birds in their respective habitats. Recent advancements in programmable recording devices have made acoustic monitoring feasible. These devices can record a large amount of acoustic data, which can be used for monitoring avian diversity. In this work, we target the problem of bird species identification from recorded acoustic data, which forms the backbone of an acoustic monitoring system.

Various methods have been proposed in the literature for the problem of bird species identification/classification from recorded bird songs or calls. In an initial study, McIlraith and Card5 proposed to use a two-layer feed-forward neural network with back propagation for bird song classification. Harma and Somervuo6-8 used sinusoidal modeling of syllables (the basic unit of bird song) for species classification. Fagerlund9 proposed a decision-tree-based hierarchical classification framework for bird species recognition, where each node of the tree is a support vector machine (SVM); the feature representation used is Mel frequency cepstral coefficients (MFCC) and low-level signal descriptors. Lee et al.3 proposed to use two-dimensional cepstral coefficients for bird species identification. Their study also proposed to tackle within-class variation by prototyping each class using vector quantization and Gaussian mixture models. Stowell and Plumbley10 proposed spherical K-means-based unsupervised representations for bird species classification. Apart from these methods, many studies have targeted various bioacoustic problems using deep learning; e.g., deep convolutional neural networks (CNN) have been used for bird species identification.11-14 Chakraborty et al.15 utilized a three-layered deep neural network (DNN) for bird species classification, where MFCCs are used as the feature representation. Apart from the DNN, their study also explored the Gaussian mixture model (GMM), GMM-UBM (universal background model) and SVM powered by various dynamic kernels16 for species identification.

Leveraging the success of learned feature representations obtained by factorizing spectrograms for acoustic scene classification17 and acoustic event detection,18 we propose a
supervised, multi-layer, alternating dense-sparse framework to obtain feature representations for bird species identification. In the proposed method, a given recorded audio signal (dense) is converted into a magnitude spectrogram (sparse). This notion of sparsity comes from the observation that most bird vocalizations usually occupy only a few frequency bins in the spectrogram.19 The frequency and temporal modulations present in bird vocalizations provide species-specific signatures. However, applying matrix factorization techniques on spectrograms directly may not capture these modulations effectively. To overcome this issue, a certain number of frames are concatenated around each frame of the spectrogram to embed the context. This results in a high dimensional (sparse) super-frame representation that is capable of capturing the frequency and temporal modulations more effectively. These high dimensional super-frames are unsuitable for acoustic modeling due to high computational complexity. Since the spectrogram is sparse, this super-frame representation is also sparse. Hence, super-frames can be compressed without losing too much information. Random projections,20 which preserve pairwise distances according to the Johnson-Lindenstrauss (JL) lemma, are used to compress these super-frames to a low-dimensional representation (dense). In the next step, the vocalizations of each bird species are modeled using restricted robust archetypal analysis (AA). AA provides a compact, probabilistic and interpretable representation23 in comparison to other matrix factorization techniques such as non-negative matrix factorization (NMF) and sparse dictionary learning.22 The learned archetypal dictionaries are used to obtain a sparse-convex representation for the compressed super-frames. These representations are designated as compressed convex spectral embeddings (CCSE). This CCSE representation captures species-specific signatures effectively and can be used as the feature representation in any classification framework.

CCSE assumes that the compressed super-frames of a bird species lie on only one manifold. However, a particular bird species can have a large repertoire of vocalizations that often occupy different manifolds in the feature space.3 Therefore, a single archetypal dictionary per bird species may not be able to model the variations present in a single bird class. We address this problem by proposing to use multiple archetypal dictionaries to model one bird species. In order to learn multiple dictionaries, the compressed super-frames are clustered using a GMM and, for each cluster, an individual archetypal dictionary is learned. To obtain the CCSE for a compressed super-frame, a dictionary is chosen for each class using the responsibility terms of the class-specific GMM. The CCSE obtained using this GMM-AA-based framework is designated as local CCSE.

The archetypes learned using AA approximate the convex hull of the data, and the estimation of these archetypes is often expensive in terms of computation.24 Hence, in order to speed up the process of finding archetypes, we use a restricted version of AA. In restricted AA, only the data points around the convex hull/boundary are used for determining the archetypes. Conventionally, AA is performed individually for each class and without any separate effort to increase the inter-class discrimination. Hence, there can be a high correlation between atoms of inter-class dictionaries, which may degrade the discriminative ability of these dictionaries. Supervised dictionary learning methods such as label-consistent K-singular value decomposition (Ref. 25) overcome this problem by learning dictionaries in a supervised manner. Nevertheless, these supervised dictionary learning techniques are computationally expensive (both in time and space) and are not feasible when a substantial number of classes is involved. In order to overcome this issue, we propose an efficient greedy procedure to choose atoms from each dictionary such that the overall correlation among all dictionaries is decreased. This procedure not only reduces the gross correlation among dictionaries but also helps in reducing their size. Decreasing the dictionary size reduces the computational complexity, which can be helpful for large-scale species identification.

The major contributions of this work are summarized as follows:

(1) Local CCSE, a supervised feature representation that handles intra-class variations efficiently (Algorithm 2).
(2) The application of a restricted version of archetypal analysis for acoustic modeling.
(3) A greedy procedure to choose a subset of atoms from each dictionary such that the overall correlation among all local dictionaries of all classes is reduced (Algorithm 1).

The rest of this paper is organized as follows. In Sec. II, we describe the CCSE-based framework. In Sec. III, the proposed local CCSE framework, along with the proposed pruning procedure to decrease the inter-dictionary correlation, is discussed. Experimental setup and observations are in Secs. IV and V, respectively. Section VI concludes the paper.

II. COMPRESSED CONVEX SPECTRAL EMBEDDINGS (CCSE)

In this section, the overall process to obtain CCSE from any input recording is described (Fig. 1). First, we describe the process of obtaining a compressed super-frame-based representation from any input audio recording. Then, we explain the procedure to learn an archetypal dictionary for each bird species. Finally, we describe the process to obtain CCSE for any audio recording.

FIG. 1. (Color online) Proposed pipeline for obtaining CCSE from an audio signal.

A. Computing compressed super-frames

The short time Fourier transform (STFT) is applied to obtain a magnitude spectrogram $S$ ($m \times N$; $m$ is the number of frequency bins, $N$ is the number of frames) from each input audio recording. Short-term Fourier analysis often leads to the smearing of temporal and frequency modulations present in bird vocalizations. In order to capture these modulations more effectively, context information is ingrained into the current frame (under processing) of the spectrogram by concatenating the $W$ previous and $W$ next frames around the current frame. This concatenation produces a high dimensional [$(2Wm + m) \times 1$] representation called a super-frame. The pooled spectrograms of all the training examples of a particular class, $\hat{S}$ ($m \times l$; $m$ is the number of frequency bins, $l$ is the number of pooled frames), are converted into the super-frame representation $F \in \mathbb{R}^{(2Wm+m) \times l}$ using the aforementioned concatenation process. These super-frames are high-dimensional, which makes them computationally expensive to process for acoustic modeling. However, these super-frame representations are sparse. The sparsity of the spectrogram and super-frames is illustrated in Fig. 2.

FIG. 2. (Color online) (a) Spectrogram of a Cassin's vireo vocalization. (b) 1285-dimensional ($2Wm + m = 1285$ and $m = 257$) super-frame representation obtained from (a) using $W = 2$; $W$ is the window size for concatenation and $m$ is the number of frequency bins. (c) Compressed super-frames of 500 dimensions ($K = 500$) obtained by projecting (b) on a random Gaussian matrix.

Due to this sparsity, super-frames are suitable for attaining a high degree of compression. Hence, building upon the JL lemma,21 random projections are used to compress these super-frames. Gaussian random matrices satisfy the JL lemma with high probability.26 Hence, these random matrices preserve the pairwise distances between super-frames in the projected space. In particular, a random Gaussian matrix $G$ (of dimensions $K \times (2Wm + m)$) is used to achieve the transformation $\phi: \mathbb{R}^{2Wm+m} \rightarrow \mathbb{R}^{K}$, which compresses the super-frames. This compressed representation, $X = G \cdot F$, $X \in \mathbb{R}^{K \times l}$, is used to learn the archetypal dictionaries. Figure 2(c) depicts the compressed super-frame representation obtained for the spectrogram shown in Fig. 2(a). Similarly, for a test audio recording, compressed super-frames are obtained using the same procedure.
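A minimal NumPy sketch of Sec. II A under stated assumptions: the paper does not specify edge handling for the frame concatenation (zero-padding is assumed here) or any scaling of the Gaussian matrix (1/sqrt(K) is a common convention); superframes and compress are hypothetical helper names.

```python
import numpy as np

def superframes(S, W=2):
    """Concatenate the W previous and W next frames around each frame of a
    magnitude spectrogram S (m x N), giving super-frames of size (2W+1)*m.
    Edges are zero-padded (an assumption; the paper does not say)."""
    m, N = S.shape
    P = np.pad(S, ((0, 0), (W, W)))                     # pad along time
    cols = [P[:, t - W:t + W + 1].reshape(-1, order="F")
            for t in range(W, N + W)]
    return np.stack(cols, axis=1)                       # ((2W+1)*m, N)

def compress(F, K=500, seed=0):
    """Random-projection compression X = G @ F with a Gaussian matrix G of
    size K x (2W+1)*m; pairwise distances are preserved with high
    probability (JL lemma)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((K, F.shape[0])) / np.sqrt(K)
    return G @ F

# The paper's Fig. 2 dimensions: m = 257 and W = 2 give 1285-dim
# super-frames, compressed to K = 500.
S = np.abs(np.random.randn(257, 100))   # stand-in magnitude spectrogram
X = compress(superframes(S, W=2), K=500)
print(X.shape)                           # (500, 100)
```

Note that the same matrix G must be reused at test time, as the paper states in Sec. II C.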
B. Restricted robust AA for dictionary learning

The CCSE framework employs archetypal analysis (AA) for acoustic modeling. The compressed super-frames corresponding to the bird vocalization regions are used for learning the archetypes. The bird vocalization regions are identified (in the input recordings) using a semi-supervised segmentation method27 proposed in one of our earlier studies. Using AA, which is a non-negative matrix factorization technique, the matrix of compressed super-frames, $X$, is decomposed to obtain the representation matrix $A$ as $X = DA$. The dictionary, $D$, consists of the archetypes, which lie on the convex hull of the data. These archetypes are confined to be convex combinations of the individual data points, i.e., $D = XB$, with $D \in \mathbb{R}^{K \times d}$ ($d$ is the number of archetypes) and $B \in \mathbb{R}^{l \times d}$.

1. Restricting AA

Generally, matrix factorization is a computationally expensive process and AA is no exception. However, it is
1. Restricting AA

Generally, matrix factorization is a computationally expensive process and AA is no exception. However, it is known that archetypes lie on the boundary, or convex hull, of the data. This property can be used to restrict the archetypal search space to the data points existing around the boundary. This restricted search reduces the computational time required to learn the archetypes.

Let B be the index set of the compressed super-frames that lie around the boundary. To find these super-frames, the following objective function is minimized:

$$ \|X - XC\|_F^2 \quad \text{s.t.} \quad \mathrm{diag}(C) = 0,\ c_i \succeq 0,\ \text{and}\ \|c_i\|_1 = 1, \qquad (1) $$

where diag(·) denotes the diagonal elements. The solution C (having columns $c_i$) that minimizes the given objective function can be interpreted as the coefficient matrix for representing each compressed super-frame ($x_i$) in X as a linear combination of the other compressed super-frames.24 The significant values (i.e., high magnitude values) of the solution correspond to the boundary points $x_z$, such that $z \in B$. These values are obtained by maximizing the negative gradient of the error cost in Eq. (1) (involving inner products) with respect to $c_i$. The principles of convex geometry state that the inner product between two points is maximum when one of the points lies on the boundary of the data.28 As a result, the solution that minimizes the error cost in Eq. (1) ensures that the union of the indices of the high magnitude elements of each $c_i$ refers to super-frames around the boundary. Hence, using this procedure, $X \in \mathbb{R}^{K \times l}$ is reduced to $\hat{X} \in \mathbb{R}^{K \times p}$ (p is the number of chosen boundary super-frames, such that $p \ll l$). The problem in Eq. (1) can be solved using a fast quadratic programming (QP) solver such as MATLAB's quadprog, and is a one-time procedure.
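As a rough illustration of Eq. (1), the sketch below solves the simplex-constrained least-squares problem column by column with SciPy's general-purpose SLSQP routine, standing in for a dedicated QP solver such as MATLAB's quadprog. The function names and the threshold used to declare a coefficient "significant" are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def simplex_ls(A, x):
    """min_c ||x - A c||^2 s.t. c >= 0 and sum(c) = 1 (an SLSQP stand-in
    for a dedicated QP solver)."""
    n = A.shape[1]
    res = minimize(lambda c: np.sum((x - A @ c) ** 2),
                   np.full(n, 1.0 / n),
                   bounds=[(0.0, None)] * n,                      # c >= 0
                   constraints=({'type': 'eq',
                                 'fun': lambda c: c.sum() - 1.0},),  # ||c||_1 = 1
                   method='SLSQP')
    return res.x

def boundary_indices(X, thresh=0.1):
    """Eq. (1) column by column: represent each x_i as a convex combination
    of the other columns (diag(C) = 0) and keep indices of the columns that
    receive high-magnitude coefficients, i.e., boundary super-frames."""
    l = X.shape[1]
    B = set()
    for i in range(l):
        others = np.delete(np.arange(l), i)   # exclude x_i itself
        c = simplex_ls(X[:, others], X[:, i])
        B.update(others[np.flatnonzero(c > thresh)].tolist())
    return sorted(B)
```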
2. Restricted robust AA

The presence of outliers in the data changes the convex hull, which affects the performance of AA. The outliers can arise due to noise or segmentation errors. In order to address this issue, we propose to use robust AA (Ref. 23) on $\hat{X}$, which mitigates the effects of outliers to a large extent. In particular, the archetypal dictionary, D, is computed by optimizing the following function:23

$$ \underset{B,\,A}{\arg\min}\ \sum_{i=1}^{p} h\big(\|x_i - D a_i\|_2\big), \quad b_j \in \Delta_p,\ a_i \in \Delta_d $$
$$ = \underset{B,\,A}{\arg\min}\ \frac{1}{2}\sum_{i=1}^{p}\left[\frac{\|x_i - \hat{X} B a_i\|_2^2}{w_i} + w_i\right], $$
$$ \Delta_p \triangleq \{b_j \succeq 0,\ \|b_j\|_1 = 1\}, \quad \Delta_d \triangleq \{a_i \succeq 0,\ \|a_i\|_1 = 1\}, \quad \forall i: 1 \to p \ \text{and}\ \forall j: 1 \to d. \qquad (2) $$

Here $x_i$, $a_i$, and $b_j$ are the columns of $\hat{X} \in \mathbb{R}^{K \times p}$, $A \in \mathbb{R}^{d \times p}$, and $B \in \mathbb{R}^{p \times d}$, respectively, $w_i$ is a scalar, and $\epsilon$ is a positive constant. In contrast to conventional AA employing a Euclidean loss, robust AA employs a Huber loss function $h(\cdot)$. For scalars $u$ and $\epsilon$, the Huber function is defined as $h(u) = \frac{1}{2}\min_{w \geq \epsilon}\left[u^2/w + w\right]$.23 The use of the Huber loss introduces a weight $w_i = \max(\|x_i - \hat{X} B a_i\|_2, \epsilon)$ for $x_i$ in the optimization process, i.e., $w_i$ weighs the contribution of $x_i$ in the estimation of archetypes. After the optimization, the weight $w_i$ becomes larger for the outliers, reducing their importance in finding the archetypes. In this work, the optimization problem in Eq. (2) is solved using an iterative procedure proposed by Chen et al.23 (algorithm 3 in Ref. 23).

3. Computational efficiency

The computational saving obtained using restricted AA is highlighted in Fig. 3. The average running times recorded for learning 32 archetypes from different numbers of super-frames, using restricted robust AA and traditional robust AA, are depicted in Fig. 3. This experiment is conducted on a PC running Ubuntu 16.0 with 16 GB of RAM and an Intel i7 CPU with a 3.00 GHz clock speed. The implementation is in Matlab 2014a. Each super-frame is of 500 dimensions and 100 iterations are used for learning the archetypes in both setups. The analysis of Fig. 3 shows that, for all configurations, the average running time for restricted robust AA is significantly less than that of robust AA. The restricted AA shows a relative drop of 67.5% in average running time across all configurations.

FIG. 3. (Color online) Average running time recorded for robust AA and restricted AA.
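The effect of the Huber loss can be summarized in a few lines: after each update of B and A, the per-point weights are recomputed from the residuals, so outlying super-frames are down-weighted in the next reweighted least-squares step. A minimal sketch, with names of our own choosing:

```python
import numpy as np

def huber_weights(X_hat, B, A, eps=1e-3):
    """w_i = max(||x_i - X_hat B a_i||_2, eps), as in Sec. II B 2.
    Large residuals give large w_i, and the 1/w_i factor in Eq. (2)
    then reduces the influence of those (outlying) points."""
    R = X_hat - X_hat @ B @ A                 # column-wise residuals
    return np.maximum(np.linalg.norm(R, axis=0), eps)
```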

C. Computing CCSE representation

The compressed super-frames are obtained for an audio recording using the procedure discussed in Sec. II A. Here, the vocalization regions are identified and super-frames corresponding to these regions are extracted. The same Gaussian random matrix is employed for obtaining compressed super-frames during training and testing. The final dictionary, D, is obtained by concatenating the individual dictionaries of each bird species/class, i.e., $D = [D^1\ D^2 \cdots D^Q]$, where $D^q$ is the archetypal dictionary learned for the qth class using restricted robust AA (discussed in Sec. II B 2). The CCSE for any compressed super-frame, $y_i$, is obtained by projecting $y_i$ onto a simplex corresponding to the dictionary D, as further described in Sec. III. This CCSE contains strong class-specific signatures and can be used as a feature representation for species classification. This behavior is illustrated in Fig. 4, which shows the average of the CCSEs obtained for an exemplar vocalization of three different species. These average CCSEs are obtained using the final dictionary (D) derived from the individual dictionaries of all three species. The final dictionary contains 128 atoms per class (the first 128 for black-throated tit, the next 128 for black-yellow grosbeak, and the last 128 for black-crested tit). In the average CCSEs, the coefficients exhibit higher amplitude for the atoms of D which correspond to the true class. This corroborates our claim of the discriminative nature of CCSE.

FIG. 4. (Color online) Average CCSEs obtained for a vocalization of (a) black-throated tit, (b) black-yellow grosbeak and (c) black-crested tit. Each bird species is modeled by an archetypal dictionary having 128 atoms.

III. PROPOSED LOCAL CCSE-BASED FRAMEWORK

Song phrases and various calls such as alarm calls, feeding calls and flight calls form the repertoire of vocalizations that a species can produce. The nature of different kinds of vocalizations can vary considerably.3 A single archetypal dictionary (as used in CCSE) cannot effectively model all these within-class variations. An effective way to handle this problem is to learn local archetypal dictionaries. The CCSE learned from these local dictionaries provides a better representation for a bird species. Keeping these facts in account and improvising over the CCSE framework, we propose a local CCSE-based framework which can handle the variations present in the vocalizations of various bird species. In this framework, multiple local dictionaries are learned for each class. The different local dictionaries model the different sets of vocalizations of a particular species. Out of these local dictionaries, one dictionary per class is chosen to obtain convex sparse representations (CCSE) for a super-frame. This framework also utilizes a greedy iterative procedure to decrease the gross correlation between intra- and inter-dictionary atoms. This reduces the size of the dictionaries, making the proposed framework computationally efficient.

A. Learning local dictionaries

The compressed super-frames corresponding to the bird vocalizations present in the training audio recordings are extracted and pooled together in a class-specific manner as described in Sec. II A. These pooled super-frames are used for learning multiple local dictionaries of a bird class. First, a GMM with Z components is used to cluster these super-frames. Then, restricted robust AA (Sec. II) is applied to get an archetypal dictionary for each of these Z clusters. Hence, one bird species/class is modeled by Z archetypal dictionaries. It has to be noted that the number of GMM components can be different for different classes, e.g., Z can be large for a class having large variations in vocalizations (e.g., Cassin's vireo) as compared to one with less variation (e.g., Hutton's vireo). Since the clusters within a class can exhibit more overlap, GMM provides better clustering than hard-clustering techniques like K-means or K-medoids.

B. Decreasing the inter-dictionary correlation

In Sec. III A, all dictionaries are learned independently, which may lead to high correlation between the inter-dictionary atoms. This high correlation is not a big issue for the dictionaries of one class. However, if the correlation is high among the dictionaries of different classes, it can affect the classification performance. In order to address this problem, a greedy pruning procedure is proposed to choose a subset of atoms from each dictionary, such that the gross correlation among all the dictionaries is decreased.

Let us denote the jth pruned dictionary of the qth class by $\bar{D}_j^q$. The proposed algorithm starts by choosing the independent atoms from the first dictionary of the first class, $D_1^1$, iteratively using the following metric:

$$ i^* = \underset{i \notin Z}{\arg\max}\ \left\| d_{1i}^{1} - D_{1Z}^{1}\left(D_{1Z}^{1}\right)^{\dagger} d_{1i}^{1} \right\|_2^2 \quad \text{s.t.}\ \left(D_{1Z}^{1}\right)^{T} D_{1Z}^{1}\ \text{is invertible}. \qquad (3) $$

Here $d_{1i}^1$ is an atom of $D_1^1$, † denotes the pseudo-inverse, Z denotes the set of indices of the selected atoms, and $D_{1Z}^1 \subset D_1^1$ denotes the current set of selected atoms. Equation (3) computes the distance of an atom $d_{1i}^1$ to the space spanned by the atoms in $D_{1Z}^1$, and selects the one which lies at maximum distance from the span of $D_{1Z}^1$. This atom exhibits minimum correlation to the atoms present in the already selected set, $D_{1Z}^1$. In order to choose J atoms from $D_1^1$, Eq. (3) is iterated J times. Hence, a pruned dictionary, $\bar{D}_1^1 \subset D_1^1$, is obtained. This whole procedure is repeated for each local dictionary of each class to find the uncorrelated atoms with respect to the previously selected atoms from all the dictionaries. Algorithm 1 describes the procedure to obtain the pruned versions of all the dictionaries. All local dictionaries of each class are given as input to Algorithm 1. The output is a set of pruned dictionaries, each having J (J < d) atoms. Hence, along with the correlation, this procedure also decreases the size of the dictionaries, thus reducing the computational complexity of the whole framework.
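A hedged NumPy sketch of the greedy selection metric in Eq. (3) follows; the function signature is our own, and W is assumed to already contain at least one atom (Algorithm 1 seeds it with the first atom of the first dictionary).

```python
import numpy as np

def select_uncorrelated_atoms(D, W, J):
    """Greedily pick J atoms (columns) of dictionary D lying farthest
    from the span of the already selected atoms W, per Eq. (3)."""
    selected = []
    for _ in range(J):
        P = W @ np.linalg.pinv(W)              # projector onto span(W)
        dists = np.linalg.norm(D - P @ D, axis=0) ** 2
        if selected:
            dists[selected] = -np.inf          # do not reselect an atom
        i_star = int(np.argmax(dists))
        selected.append(i_star)
        W = np.hstack([W, D[:, [i_star]]])     # grow the selected set
    return D[:, selected], W
```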

ALGORITHM 1: Proposed greedy procedure to decrease the inter-dictionary correlation.

input: D_z^q, the zth dictionary of the qth class, for all q: 1 -> Q (number of classes) and all z: 1 -> Z_q (number of local dictionaries in the qth class); d_zi^q, the ith atom of D_z^q; J, the number of atoms to be selected per dictionary; W = [ ], the set of currently selected dictionary atoms.
output: D_bar = [D_bar_1^1 D_bar_2^1 ... D_bar_Z^1 ... D_bar_{Z-1}^Q D_bar_Z^Q], the set of pruned dictionaries.

1:  D_bar = [ ]; W = [W d_11^1]
2:  for q <- 1 to Q do
3:    for z <- 1 to Z_q do
4:      S = [ ]   // set to store indices of selected atoms
5:      for j <- 1 to J do
6:        i* = arg max_i ||d_zi^q - W W† d_zi^q||_2^2 s.t. W^T W is invertible   // for all i: 1 -> d (number of atoms)
7:        W = [W d_zi*^q]
8:        S = S ∪ i*
9:      end for
10:     D_bar_z^q = D_z^q[:, S]
11:     D_bar = [D_bar D_bar_z^q]
12:   end for
13: end for

ALGORITHM 2: Procedure to obtain average local CCSE for a bird vocalization.

input: D_bar_z^q, for all q: 1 -> Q and all z: 1 -> Z_q; G_q, the GMM of the qth class, for all q: 1 -> Q; Y (K × I), the compressed super-frames of a bird vocalization.
output: LC_avg, the average local CCSE for Y, of dimensions Qd × 1.

1: for i <- 1 to I do
2:   D_f^i = [ ]
3:   for q <- 1 to Q do
4:     z* = arg max_z γ_z(y_i), for all z: 1 -> Z_q   // using Eq. (4)
5:     D_f^i = [D_f^i D_bar_z*^q]
6:   end for
7:   a_i = simplexProjection(D_f^i, y_i)   // convex decomposition using an active-set QP solver; y_i is the ith column of Y
8: end for
9: LC_avg = (1/I) Σ_{i=1}^{I} a_i

C. Computing local CCSE representation

In order to obtain the local CCSE for any super-frame $y_i$, one dictionary from the $Z_q$ local dictionaries of the qth class is chosen. The responsibility of each GMM component/cluster in defining $y_i$ is calculated and the dictionary corresponding to the component exhibiting maximum responsibility is chosen. This is achieved using the following equation:

$$ z^{*} = \underset{z}{\arg\max}\ \gamma_z(y_i) = \frac{w_z^q\,\mathcal{N}\!\left(y_i \mid \mu_z^q, \Sigma_z^q\right)}{\sum_{p=1}^{Z_q} w_p^q\,\mathcal{N}\!\left(y_i \mid \mu_p^q, \Sigma_p^q\right)}. \qquad (4) $$

Here $w_z^q$, $\mu_z^q$, and $\Sigma_z^q$ are the weight, the mean, and the covariance of the zth GMM component of the qth class. The pruned dictionary corresponding to this $z^*$th component/cluster, i.e., $\bar{D}_{z^*}^q$, is chosen. This procedure is iterated to select Q dictionaries, one for each class, which are used for obtaining the local CCSE. These dictionaries are concatenated to form the final dictionary $D_f^i$. The local CCSE for $y_i$ is obtained by projecting it on a simplex corresponding to the dictionary $D_f^i$, using the quadratic programming-based active-set method proposed by Chen et al.23 (algorithm 2 in Ref. 23). This local CCSE exhibits high coefficient values corresponding to the true class atoms of $D_f^i$ and low coefficient values corresponding to the atoms of other classes (plots similar to Fig. 4 are obtained). The distinction in local CCSE for super-frames of different classes makes them an appropriate feature representation for classification.

A segmented bird vocalization is represented by the average of the local CCSEs of all the super-frames corresponding to this vocalization. Algorithm 2 describes the procedure to obtain the average local CCSE for a bird vocalization. These average local CCSEs are used as a feature representation for bird species identification. As an illustration, Fig. 5 shows the two-dimensional (2-D) plot of average local CCSEs for vocalizations of seven different bird species computed using t-distributed stochastic neighbor embedding (t-SNE).29 It must be noted that the parameters used for obtaining these average local CCSEs are for illustration purposes only and may not be optimal. In this illustration, the super-frame representation of 1285 dimensions (for W = 2 and NFFT = 512) is used. Random projections are used to obtain a compressed 500-dimensional representation of these super-frames. Each species is modeled by a three-component GMM and a 32-atom dictionary is learned for each component/cluster. One such 32-atom dictionary is illustrated in Fig. 6. Hence, each vocalization is represented by a 224 (32 × 7)-dimensional average local CCSE. The analysis of Fig. 5 makes it clear that the proposed feature representation, i.e., the average local CCSE, shows different characteristics for different bird species, making it suitable for bird species identification. The small overlap observed between vocalizations of grey bush chat, black-crested tit and golden bush-robin could be due to the similarity between the properties (frequency range and modulations) of the vocalizations of these species.

FIG. 5. (Color online) Two-dimensional t-SNE visualization of 224-dimensional average local CCSE obtained for seven different bird species.

FIG. 6. (Color online) A 32-atom archetypal dictionary learned for one cluster of black-yellow grosbeak.

IV. EXPERIMENTAL SETUP

In this section, we discuss the dataset used, along with the various parameters used in the experimental evaluation. In addition, the methods used for the comparative study are also listed here.

A. Dataset used

Audio recordings containing vocalizations of 50 different bird species are used for evaluating the classification performance of the proposed local CCSE. These audio recordings are obtained from three different sources. Recordings of 26 bird species were obtained from the Great Himalayan National Park (GHNP), in north India. These recordings were collected manually using a directional microphone. The recordings of seven bird species were obtained from the bird audio database maintained by the Art & Science center, UCLA.30 The audio recordings of the remaining 17 bird species were obtained from the Macaulay Library.31 These recordings are provided on an academic research license. All the recordings available are 16-bit WAV files having a sampling rate of 44.1 kHz, with durations ranging from 18 s to 3 min. Although most of the recordings are mono channel, dual channel recordings are also present, of which the first channel is used here. The information about these 50 species along with the total number of recordings and vocalizations per species is available at http://goo.gl/cAu4Q1.

B. Parameter setting

In our experiments, each recording is converted to a spectrogram using STFT (with 512 FFT points) on a frame-by-frame basis, with a frame size of 20 ms and 50% overlap. The super-frames are obtained using a window length of seven (W = 7), which are compressed using random projections to have a dimension of K = 1000. These optimal values of the window length and the dimensions of compressed super-frames are determined experimentally as discussed in Sec. V. The number of GMM components (Z_q) ranges from 3 to 8 for different classes. The optimal number of GMM components is selected using the Bayesian information criterion (BIC); the GMM giving the least BIC is used. The number of atoms in each archetypal dictionary (learned for each GMM component) is d = 128. These atoms are pruned down to J = 32, using the procedure described in Algorithm 1. These optimal values of d and J are determined empirically. The classifier used in this work is linear SVM, with an empirically tuned penalty parameter. The average local CCSE obtained from each segmented vocalization is used as the feature representation. Hence, the proposed framework provides segment/vocalization level classification decisions.

1. Train/test data distribution

A threefold cross-validation is used to compare the classification performance of the proposed local CCSE framework and the comparative methods. 33.33% of the vocalizations present in each fold (per class) are used for training while the remaining are used for testing. 75% of these 33.33% training vocalizations are used for learning dictionaries while the remaining 25% of the vocalizations are used to obtain the average local CCSE for training the SVM. The results presented here are averaged across all three folds.

2. Comparative methods

The classification performance of the proposed local CCSE framework is compared with GMM, GMM-UBM, SVM powered by dynamic kernels, and DNN-based classifiers. The different dynamic kernels used in this study are: probabilistic sequence kernel (PSK), Gaussian mixture model super-vector kernel (GMMSV), GMM-UBM mean interval kernel (GUMI), GMM-based pyramid match kernel (PMK), and GMM-based intermediate matching kernel (IMK). The DNN used for comparison is a three-layered fully connected network with 512 hidden units.15 To tackle over-fitting, a drop-out rate of 10% is used. MFCCs with delta and acceleration coefficients, with a temporal context of seven previous and seven next MFCC frames, are used as feature representations in the above-mentioned methods. For the GMM-based classifier, the optimal number of GMM components per class is learned using BIC. Further, a UBM built by pooling the frames of all classes and fitting a 128-component GMM is used for the GMM-UBM method. In addition, spherical K-means-based unsupervised feature representation10 is also used for comparison. Here, features are obtained using 500 cluster means and a random forest classifier (with 200 decision trees) is used for classification.

The performance of local CCSE is also compared with CCSE (see Sec. II). For classification, each vocalization is represented as the average of the CCSEs obtained for all the super-frames of that vocalization. Each class is modeled by a single dictionary having 128 archetypes and a linear SVM is used for classification purposes.

V. EXPERIMENTAL OBSERVATIONS

In this section, first, we describe the effects of the size of the context window, the extent of compression in super-frames, and the size of pruned dictionaries on the classification performance of the proposed framework. Then, the classification performance of the proposed framework is evaluated against the performances of the various existing methods. Finally, the performance of the proposed framework and local CCSE is

evaluated when there is a significant mismatch in training-testing conditions.

A. Effect of context window size (W)

A smaller value of W leads to a super-frame representation having less context information and lower dimensionality. On the other hand, a larger value of W produces super-frames having more context and high dimensionality. Although these high dimensional super-frames are compressed using random projections, obtaining a larger compression ratio may lead to the loss of information. Hence, an appropriate value of W is chosen empirically. The minimum value of W which gives the maximum classification performance can be considered optimal. Figure 7 shows the classification performance achieved by the local CCSE-based framework for different values of W. It is clear from the figure that the incorporation of context information improves the classification performance. The maximum accuracy is achieved for W = 7. Increasing W further does not lead to better classification. Hence, W = 7 is chosen for all the experiments in this study. It must be noted that for all the values of W, a compression ratio of 75% was maintained for obtaining the compressed super-frames. Using a very large value of W (W > 10) can lead to over-fitting by affecting the generative nature of the proposed method, as shown in Fig. 7.

FIG. 7. (Color online) Effect of the size of context window on classification performance.

B. Compression vs classification/computation trade-off

The computational complexity of robust AA and active-set simplex decomposition is directly dependent on the dimensionality of the data points.23 Hence, reducing the dimensionality of super-frames makes the proposed framework computationally more efficient. As discussed earlier, a window size of W = 7 is used in our experimentation. This gives rise to 3855-dimensional super-frames [FFT points = 512, 3855 = 257 × (7 + 1 + 7)]. To determine the extent of compression that can be achieved in the super-frames, we experimented with different compression rates and the results are shown in Fig. 8. It can be observed that one can achieve a 75% compression (K = 1000 from an original dimension of 3855) without any decrease in the classification accuracy. This high compression can be attributed to the highly sparse nature of the super-frames. Figure 8 also shows the increment in average running time (average time recorded for 10 runs) for learning local dictionaries of 50 classes (used in the experimentation) as the dimensionality of the compressed super-frames is increased. Hence, compressing the super-frames provides significant computational gain in the proposed framework.

FIG. 8. (Color online) Effect of compression on classification performance and average running time required for learning local dictionaries.

C. Size of pruned dictionaries vs classification performance

The pruning procedure given in Algorithm 1 decreases the size of dictionaries by choosing a subset of atoms from each dictionary. In this experiment, we analyzed the extent to which the size of dictionaries can be reduced without showing performance degradation. Originally, each dictionary has 128 atoms. We pruned these dictionaries to have 64, 32, 16, and 8 atoms. Figure 9 depicts the classification performance of local CCSE for each of these cases. It can be observed from Fig. 9 that using pruned dictionaries having 32 atoms each provides the same classification performance as the original dictionaries.

FIG. 9. (Color online) Number of chosen atoms vs classification accuracy.

D. Classification performance

The comparison of the classification performance of the proposed local CCSE-based framework with various comparative

FIG. 10. (Color online) Comparison of the classification performance of the proposed local CCSE-based framework with various comparative methods.

methods is illustrated in Fig. 10. It is evident from the figure that the local CCSE-based framework outperforms the other methods considered in this study. The classification accuracy obtained using the proposed local CCSE-based framework is higher than the GMM, GMM-UBM, and SVM powered by various dynamic kernels. The local CCSE-based framework shows a relative improvement of 14.77%, 10.99%, 8.54%, 10.32%, 6.45%, 7.32%, and 6.82% over the classification accuracies of GMM, GMM-UBM, PSK, GMMSV, GUMI, IMK, and PMK, respectively. Also, a relative improvement of 4.6% is observed over the framework using random forest and unsupervised feature representations obtained using spherical K-means. However, the performance of DNN is comparable to the proposed framework. A small relative improvement of 1.11% is obtained by the proposed framework over the classification accuracy achieved by DNN. Also, the local CCSE outperforms CCSE by a relative improvement of 3.89%.

E. Robustness comparison

The performances of most classification frameworks are known to degrade when training and testing conditions vary significantly. For the task in hand, these variations can arise due to differences in the recording ambiance and differences in recording devices (e.g., omni-directional vs directional microphones). We conduct an experiment to analyze the robustness of the proposed framework against differences in recording environments. Five recordings of each of the 50 species considered in this study are downloaded from Xeno-Canto,32 which is a crowd-sourced bird vocalization database. The recording conditions of the Xeno-Canto audio recordings (XC) are different from the recordings in the dataset used for the classification comparison in the previous sub-section.

XC recordings are used for testing while all the recordings used in the previous experiments are used for training (75% of the vocalizations for dictionary learning and 25% for training the SVM). The performance of the proposed framework and the other classification methods is depicted in Fig. 11. The analysis of Fig. 11 shows that the proposed local CCSE framework shows a relative improvement of 10.94%, 8.68%, 7.52%, 6.67%, 6.8%, 5.53%, 6.23%, 5.12%, 2.04%, and 3.49% over the classification accuracies of GMM, GMM-UBM, PSK, GMMSV, GUMI, IMK, PMK, SK-means, DNN, and CCSE, respectively. This shows that the proposed framework is more robust to mismatched conditions in comparison to the other comparative methods.

FIG. 11. (Color online) Classification performance of different methods on Xeno-Canto recordings.

VI. CONCLUSION

In this work, we proposed a local CCSE-based framework for bird species identification using audio recordings. We demonstrated that local CCSE provides good species discrimination and can be used as a feature representation in a classification framework. By utilizing super-frames, information about time-frequency modulations is effectively utilized. Apart from this, we also used a restricted version of AA which only processes the data points around the boundary to find archetypes. To reduce the size of the archetypal dictionaries, we proposed a greedy iterative procedure which chooses a subset of atoms from each dictionary such that the gross correlation across the atoms of all the dictionaries is decreased. Experimental evaluation showed that the local CCSE-based framework outperformed all the existing methods considered in this study. The framework also performed well when there was a difference in training-testing recording conditions.

Future work will include enforcing group sparsity for obtaining CCSE. This can further enhance the discriminative properties of local CCSE. Also, instead of using a simple linear classifier such as linear SVM, incorporating ensemble classifiers like random forest and neural

networks can improve the classification performance of the local CCSE-based representation.

ACKNOWLEDGMENT

This work is partially supported by IIT Mandi under the project IITM/SG/PR/39 and the Science and Engineering Research Board, Government of India, under the project SERB/F/7229/2016-2017.

1. M. Clout and J. Hay, "The importance of birds as browsers, pollinators and seed dispersers in New Zealand forests," N. Z. J. Ecol. 12, 27-33 (1989).
2. T. S. Brandes, "Automated sound recording and analysis techniques for bird surveys and conservation," Bird Conserv. Int. 18(S1), S163-S173 (2008).
3. C.-H. Lee, C.-C. Han, and C.-C. Chuang, "Automatic classification of bird species from their sounds using two-dimensional cepstral coefficients," IEEE/ACM Trans. Audio, Speech, Language Process. 16(8), 1541-1550 (2008).
4. D. E. Kroodsma, E. H. Miller, and H. Ouellet, Acoustic Communication in Birds: Song Learning and Its Consequences (Academic, New York, 1982), Vol. 2.
5. A. L. McIlraith and H. C. Card, "Birdsong recognition using backpropagation and multivariate statistics," IEEE Trans. Signal Process. 45(11), 2740-2748 (1997).
6. A. Harma and P. Somervuo, "Classification of the harmonic structure in bird vocalization," in Proceedings of Int. Conf. Acoust. Speech, Signal Process. (May 2004), pp. 701-704.
7. P. Somervuo, A. Harma, and S. Fagerlund, "Parametric representations of bird sounds for automatic species recognition," IEEE/ACM Trans. Audio, Speech, Lang. Process. 14(6), 2252-2263 (2006).
8. P. Somervuo and A. Harma, "Bird song recognition based on syllable pair histograms," in Proceedings of Int. Conf. Acoust. Speech, Signal Process. (May 2004), Vol. 5, pp. V-825.
9. S. Fagerlund, "Bird species recognition using support vector machines," EURASIP J. Appl. Signal Process. 2007(1), 038637.
10. D. Stowell and M. D. Plumbley, "Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning," PeerJ 2, e488 (2014).
11. E. Sprengel, M. Jaggi, Y. Kilcher, and T. Hofmann, "Audio based bird species identification using deep learning techniques," in CLEF (Working Notes) (2016), pp. 547-559.
12. B. P. Toth and B. Czeba, "Convolutional neural networks for large-scale bird song classification in noisy environment," in CLEF (Working Notes) (2016), pp. 560-568.
13. K. J. Piczak, "Recognizing bird species in audio recordings using deep convolutional neural networks," in CLEF (Working Notes) (2016), pp. 534-543.
14. R. Narasimhan, X. Z. Fern, and R. Raich, "Simultaneous segmentation and classification of bird song using CNN," in Proceedings of Int. Conf. Acoust. Speech, Signal Process. (March 2017), pp. 146-150.
15. D. Chakraborty, P. Mukker, P. Rajan, and A. Dileep, "Bird call identification using dynamic kernel based support vector machines and deep neural networks," in Proceedings of Int. Conf. Mach. Learn. App. (December 2016), pp. 280-285.
16. A. D. Dileep and C. C. Sekhar, "GMM-based intermediate matching kernel for classification of varying length patterns of long duration speech using support vector machines," IEEE Trans. Neural Net. Learn. Syst. 25(8), 1421-1432 (2014).
17. V. Bisot, R. Serizel, S. Essid, and G. Richard, "Acoustic scene classification with matrix factorization for unsupervised feature learning," in Proceedings of Int. Conf. Acoust. Speech, Signal Process. (March 2016), pp. 6445-6449.
18. P. Giannoulis, G. Potamianos, P. Maragos, and A. Katsamanis, "Improved dictionary selection and detection schemes in sparse-CNMF-based overlapping acoustic event detection," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016) (2016), pp. 25-29.
19. N.-C. Wang, R. E. Hudson, L. N. Tan, C. E. Taylor, A. Alwan, and R. Yao, "Change point detection methodology used for segmenting bird songs," in Proceedings of Int. Conf. Signal Info. Process. (2013), pp. 206-209.
20. J. Haupt and R. Nowak, "Signal reconstruction from noisy random projections," IEEE Trans. Info. Theory 52(9), 4036-4048 (2006).
21. P. Frankl and H. Maehara, "The Johnson-Lindenstrauss lemma and the sphericity of some graphs," J. Comb. Theory, Ser. B 44(3), 355-362 (1988).
22. I. Tosic and P. Frossard, "Dictionary learning," IEEE Signal Process. Mag. 28(2), 27-38 (2011).
23. Y. Chen, J. Mairal, and Z. Harchaoui, "Fast and robust archetypal analysis for representation learning," in Proceedings of Comp. Vis. Pattern Recog. (June 2014), pp. 1478-1485.
24. V. Abrol, P. Sharma, and A. K. Sao, "Identifying archetypes by exploiting sparsity of convex representations," in Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS) (2017).
25. Z. Jiang, Z. Lin, and L. S. Davis, "Learning a discriminative dictionary for sparse coding via label consistent k-svd," in Proceedings of Comp. Vis. Pattern Recog. (June 2011), pp. 1697-1704.
26. S. Dasgupta and A. Gupta, "An elementary proof of a theorem of Johnson and Lindenstrauss," Random Struct. Algorithms 22(1), 60-65 (2003).
27. A. Thakur, V. Abrol, P. Sharma, and P. Rajan, "Renyi entropy based mutual information for semi-supervised bird vocalization segmentation," in Proceedings of MLSP (September 2017).
28. S. Mair, A. Boubekki, and U. Brefeld, "Frame-based data factorizations," in Proceedings of Int. Conf. Mach. Learn. (August 2017), Vol. 70, pp. 2305-2313.
29. L. Maaten and G. Hinton, "Visualizing data using t-SNE," J. Mach. Learn. Res. 9(Nov.), 2579-2605 (2008).
30. "Art-Sci center, University of California," http://artsci.ucla.edu/birds/database.html/ (Last viewed October 10, 2017).
31. "Macaulay library," http://www.macaulaylibrary.org/ (Last viewed November 14, 2017).
32. "Xeno-canto," http://www.xeno-canto.org (Last viewed October 14, 2017).
Advances in Parallel Computing Algorithms, Tools and Paradigms
D.J. Hemanth et al. (Eds.)
© 2022 The authors and IOS Press.
This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
doi:10.3233/APC220053

Bird Species Identification Using Convolutional Neural Network

Dharaniya R^a,1, Preetha M^b and Yashmi S^c
^a Assistant Professor, Department of Computer Science and Engineering,
^b,c UG students, Department of Computer Science and Engineering,
Easwari Engineering College, Chennai, India

Abstract. Identification of bird species is an important task in biodiversity preservation and ecosystem maintenance. Birds also help in various activities like agriculture, landscape formation, coral reef formation, etc. Identification and observation of bird species is a vital task in ecology. Due to the development in the field of machine learning, the automatic classification of bird species has been made simple. The advancements in technology led to the development of various bird species classification systems. Mostly, a web application or mobile application will be developed to display the result of the classification model. This paper presents the implementation details of bird species identification using a Convolutional Neural Network.

Keywords. Bird species classification, Feature extraction, Deep learning, Convolution Neural Network (CNN).

1. Introduction

In recent years, deep learning techniques, like convolutional neural networks (CNNs), have caught the attention of environmental researchers. Deep learning techniques and methods are implemented in the field of ecology and research to successfully identify animal, bird, or plant species from images. A lot of importance is given to bird species classification because of its attention in the field of computer vision and its promising applications in environmental studies. The identification of bird species is a challenging task in the research field as it may sometimes lead to uncertainty due to the various appearances of birds, backgrounds, and environmental changes. Recent developments in the deep learning field have made the classification of bird species more flexible.

Birds play an essential role in the ecosystem by directly influencing food production, human health, and ecological balance. Various kinds of challenges have been faced by ornithologists for decades regarding the identification of bird species. Ornithologists study the characteristics and attributes of birds and distinguish them by their living within the atmosphere and their ecological influence. In recent years, studies on bird species have led to the development of applications that can be used in tourism, sanctuaries, and additionally by bird watchers.

1 R. Dharaniya, Assistant Professor, Easwari Engineering College, Chennai, Tamil Nadu, India. E-mail: rdharaniya@gmail.com.

Several bird species in the world are critically endangered, vulnerable, or near threatened. The development of a bird species classification system can help the authorities to keep track of birds in a particular area by observing each species of bird. In our work, the dataset is collected using internet resources. Before using the dataset for classification, the images will be preprocessed. The CNN algorithm is used for the classification. The preprocessed images will be used for feature extraction and classification. The model will be trained and tested to produce a favorable outcome.

2. Prior work

[1] In this work, nine color-based features (the measures of mean, standard deviation, and skewness of the red, green, and blue (RGB) planes) are extracted from bird images. The SVM algorithm was implemented for feature extraction and classification. A fast detection model known as SSD was used for predicting the locations of multiple-category objects in an image. The stochastic gradient descent (SGD) algorithm was used to train the SVM classifiers. In [3], a CNN-based architecture had been proposed for bird species classification. Histogram of Oriented Gradients (HOG) had been utilized for feature extraction and the LeNet model was chosen for the classification process. [4] This paper proposed a machine learning approach to identify Bangladeshi birds. The VGG-16 network was applied for feature extraction and SVM was applied for the classification of bird species. A MobileNet model [5] was proposed for the classification of Indian species. A transfer learning technique was used to retrain the MobileNet model. [6] A bird species classification model using a deep convolutional neural network was proposed. A SoftMax layer was used in the CNN to improve the performance of the system.

A deep convolutional neural network [7] was used, parallel processing was carried out using GPU technology, and the GoogleNet framework had been applied to identify the bird images. In [8], a novel deep learning model was proposed to classify bird species, along with another deep learning model using a pre-trained ResNet architecture. An end-to-end deep network for fine-grained visual categorization [9], called Collaborative Convolutional Network (CoCoNet), was proposed, and the implementation and performance of the model were based on an Indian birds dataset. [10] A transfer learning-based method using InceptionResNet-v2 was developed to detect and classify bird species. Swapping of misclassified data between training and validation datasets and fivefold cross-validation were performed. An Artificial Neural Network was proposed after selecting a combination of features from shape, color, and texture features [11]. The classification was done by using a Multilayer Perceptron (MLP). [12] Here, a multi-scale convolutional neural network with data augmentation techniques was used to train the system and a skip connection method was used to improve feature extraction.

3. Algorithm

This paper discusses the implementation of CNN to identify bird species. A convolutional neural network (CNN) is a class of deep learning algorithms that enables machines to take in input images, assign weights and biases to various aspects and objects in the image, and identify patterns in the image. CNNs consist of the input layer, which is a grayscale image; the output layer, which is a binary or multi-class label; and the hidden layers, which are convolution layers, ReLU layers, pooling layers, and a fully connected neural network. In the field of image processing, CNN is a powerful algorithm. These algorithms are currently the best available for automating the processing of images. Images are made up of RGB combinations.

Three layers of CNN:

Convolutional layer: In a standard neural network, an input neuron is connected to each hidden layer. In a CNN, only a small portion of the input neurons is connected to the hidden neurons.

Pooling layer: Feature map dimensions are reduced by this layer. There will be multiple activation and pooling layers within the hidden part of the CNN.

Fully-connected layer: The last few layers in the network are fully connected. The output from the final pooling or convolutional layer is flattened and fed into the fully connected layers.

4. Dataset

First, the dataset was collected from the resources available on the internet. There are
six different bird species or classes with more than 100 images per class. The bird
species are American Goldfinch, Barn Owl, Carmine Bee-eater, Downy Woodpecker,
Emperor Penguin, and Flamingo. The model will be trained on this dataset.

Figure 1. Bird Images

5. Preprocessing

Image pre-processing involves working with images at the lowest level of abstraction
possible. These operations do not increase the level of information in the image, but
rather decrease it if entropy is a measure of information. In addition to cleaning image
data for model input, image preprocessing can also decrease model training time and
increase model inference speed. If input images are exceptionally large, reducing these
images will dramatically improve model training time. It reduces distortions or
enhances certain features for further processing, although geometric transformations
(e.g. rotation, scaling, translation) are often necessary.
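A hedged sketch of such a preprocessing step using tf.keras is shown below. The directory layout ("birds/<species>/*.jpg"), image size, and split ratio are illustrative assumptions, not details from this work.

```python
import tensorflow as tf

IMG_SIZE = (224, 224)  # assumed target size; resizing large images speeds up training

train_ds = tf.keras.utils.image_dataset_from_directory(
    'birds/', validation_split=0.2, subset='training', seed=42,
    image_size=IMG_SIZE, batch_size=128)
val_ds = tf.keras.utils.image_dataset_from_directory(
    'birds/', validation_split=0.2, subset='validation', seed=42,
    image_size=IMG_SIZE, batch_size=128)

# Rescale pixel values to [0, 1] before feeding them to the model
normalize = tf.keras.layers.Rescaling(1.0 / 255)
train_ds = train_ds.map(lambda x, y: (normalize(x), y))
val_ds = val_ds.map(lambda x, y: (normalize(x), y))
```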

6. Implementation

After preprocessing the images, that is, after splitting the images into training and validation datasets, a network architecture for the model will be created. The different types of layers are used according to their features, namely (a minimal model sketch follows this list):

1. Conv2D: It is used to create a convolution kernel that is convolved with the input layer to produce the output tensor.

2. MaxPooling2D: It is a downsampling technique that takes the maximum value over the window defined by the pool size.

3. Flatten: It flattens the input and creates a 1D output.

4. Dense: The dense layer produces the output as the dot product of the input and the kernel.

5. At the last layer, softmax will be used as the activation function because it is a multi-class classification problem.
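The following is a minimal Keras sketch of a model built from the layers listed above; the filter counts, kernel sizes, and input size are illustrative assumptions rather than the exact configuration used in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 6  # six bird species in the dataset described above

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),               # assumed input size
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(NUM_CLASSES, activation='softmax'),  # multi-class output
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # integer class labels
              metrics=['accuracy'])
```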

Figure 2. Flowchart

Now the model will be trained for 50 epochs with a batch size of 128. During each epoch, the model performance can be determined by the training and validation accuracy. Next, the accuracy of the model over the training history and the loss of the model over the training history will be plotted. The prediction and the original label of the image will be displayed using the argmax() function. At last, a web application will be developed to display the result of the model.
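A hedged sketch of these training and prediction steps is given below; it assumes the model, train_ds, and val_ds objects from the earlier sketches.

```python
import numpy as np

history = model.fit(train_ds, validation_data=val_ds,
                    epochs=50)  # batch size was fixed when building the datasets

# Predict on one validation batch; argmax() recovers the class index
images, labels = next(iter(val_ds))
probs = model.predict(images)
print('predicted:', np.argmax(probs, axis=1))
print('original :', labels.numpy())
```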

Figure 3. Implementation process

7. Result

A bird image will be given as input to the model and the species of the bird will be
displayed along with the image.

Figure 4. Output Screenshot

The following graph shows the model accuracy and was plotted with epochs along the
x-axis and accuracy rate along the y-axis.
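A minimal matplotlib sketch of such a plot, using the history object returned by model.fit in the training sketch above:

```python
import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Model accuracy')
plt.show()
```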

8. Conclusion and Future work

Identifying bird species from an image input by the user is the main goal of the project. The Convolutional Neural Network was used as it provides good numerical precision. The accuracy was about 87%-92%. Wildlife researchers can use this to keep track of wildlife movement and behavior in specific habitats. Various deep learning techniques can be applied in the future to enhance the accuracy and performance of the model. The future work also includes developing a mobile application for more convenient use. Also, this can be implemented in real-time monitoring of bird species in sanctuaries.

Figure 5. Accuracy Graph

References

[1] Rosniza Roslan, Nur Amalina Nazery, Nursuriati Jamil, Raseeda Hamzah. Color-based bird image classification using Support Vector Machine. 2017 IEEE 6th Global Conference on Consumer Electronics (GCCE).
[2] Susanto Kumar Ghosh, Mohammad Rafiqul Islam. Convolutional Neural Network Based on HOG Feature for Bird Species Detection and Classification. International Conference on Recent Trends in Image Processing and Pattern Recognition, Springer, vol. 1035, pp. 363-373, 2018.
[3] Shazzadul Islam, Sabit Ibn Ali Khan, Md. Minhazul Abedin Khan, Mohammad Habibullah, Amit Kumar Das. Bird Species Classification from an Image Using VGG-16 Network. Proceedings of the 2019 7th International Conference on Computer and Communications Management, ACM, pp. 38-42, 2019.
[4] Md. Romyull Islam, Nishat Tasnim, Shaon Bhatta Shuvo. MobileNet Model for Classifying Local Birds of Bangladesh from Image Content Using Convolutional Neural Network. 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), IEEE.
[5] Sofia K. Pillai, M. M. Raghuwanshi, Urmila Shrawankar. Deep Learning Neural Network for Identification of Bird Species. Computing and Network Sustainability, Springer, vol. 75, pp. 291-298, 2019.
[6] Pralhad Gavali, J. Saira Banu. Bird Species Identification using Deep Learning on GPU platform. 2020 International Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE), IEEE.
[7] Kazi Md Ragib, Raisa Taraman Shithi, Shihab Ali Haq, Md Hasan, Kazi Mohammed Sakib, Tanjila Farah. PakhiChini: Automatic Bird Species Identification Using Deep Learning. 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), IEEE.
[8] Tapabrata Chakraborti, Brendan McCane, Steven Mills, Umapada Pal. CoCoNet: A Collaborative Convolutional Network applied to fine-grained bird species classification. 2020 35th International Conference on Image and Vision Computing New Zealand (IVCNZ), IEEE.
[9] Yo-Ping Huang, Haobijam Basanta. Recognition of Endemic Bird Species Using Deep Learning Models. IEEE Access, vol. 9, pp. 102975-102984, 2021.
[10] Pratik Ghosh, Ankit Agarwalla. Classification of Birds Species Using Artificial Neural Network. International Journal for Research in Engineering Application & Management (IJREAM), vol. 7, no. 3, June 2021.
[11] Pankaj Prakash Patil, Atharva Dhananjay Kulkarni, Aakash Ajay Dhembare, Krishna Adar, Rahul Sonkamb. Bird Species Classification using multi-scale Convoluted Neural Network with Data Augmentation Techniques. International Journal of Engineering Development and Research, vol. 9, no. 2, 2021.
All-Conv Net for Bird Activity Detection: Significance of Learned Pooling
Arjun Pankajakshan, Anshul Thakur, Daksh Thapar, Padmanabhan Rajan & Aditya Nigam

School of Computing and Electrical Engineering, Indian Institute of Technology, Mandi


arjunp@projects.iitmandi.ac.in, {anshul_thakur,s16007}@students.iitmandi.ac.in,
{padman,aditya}@iitmandi.ac.in

Abstract

Bird activity detection (BAD) deals with the task of predicting the presence or absence of bird vocalizations in a given audio recording. In this paper, we propose an all-convolutional neural network (all-conv net) for bird activity detection. All the layers of this network including pooling and dense layers are implemented using convolution operations. The pooling operation implemented by convolution is termed as learned pooling. This learned pooling takes into account the inter feature-map correlations which are ignored in traditional max-pooling. This helps in learning a pooling function which aggregates the complementary information in various feature maps, leading to better bird activity detection. Experimental observations confirm this hypothesis. The performance of the proposed all-conv net is evaluated on the BAD Challenge 2017 dataset. The proposed all-conv net achieves state-of-art performance with a simple architecture and does not employ any data pre-processing or data augmentation techniques.

Index Terms: bird activity detection, all-conv net, learned pooling

1. Introduction

Manual monitoring of birds can be a tedious and difficult task due to the wide range of habitats, such as islands, marshes and swamps, occupied by different bird species [1]. Many bird species are nocturnal, which makes it less feasible to manually monitor them. Moreover, it requires an experienced bird watcher to accurately identify the bird species. Acoustic monitoring bypasses the need for manual labour and provides a convenient way to monitor birds in their natural habitats without any physical intrusion [2].

Recently, there is a boom in sophisticated automatic recording devices which can be programmed to record audio data for many days. This vast amount of audio data can be used for acoustic monitoring, which can give insight into the avian diversity, migration patterns and population trends of different bird species. Bird activity detection (BAD) [3] is generally the first module in any bioacoustics monitoring system. BAD distinguishes the audio recordings having bird vocalizations from those recordings which do not contain any bird call or song. Hence, BAD helps in removing the audio recordings which do not contain any bird vocalizations from further processing. This reduces the amount of data to be processed for other tasks such as bird vocalization segmentation and species identification.

The task of BAD becomes challenging due to the variations present in bird vocalizations across different bird species. These variations can be due to the different frequency profiles or different frequency-temporal modulations of different species. This behaviour is more evident in the vocalizations of the Emerald dove and Cassin's vireo. The sounds produced by the Emerald dove are characterized by low frequency and have little variation in frequency along time. On the other hand, the vocalizations of Cassin's vireo relatively occupy high frequency bands and exhibit different kinds of frequency and temporal modulations. These variations present in bird sounds make it difficult to model the bird vocalization class effectively. Apart from this, the background of audio recordings is also unpredictable. Different biotic and abiotic non-bird sounds can form the background in any audio recording. Hence, the task of BAD can be seen as the classification between the universe of bird sounds and the universe of non-bird sounds. The extreme variations present in both sets make this simple two-class classification problem challenging. An ideal bird activity detector should work well across different bird species in different recording environments.

Many studies in the literature have targeted the task of BAD. In our earlier studies [4, 5], frameworks based on SVMs powered with dynamic kernels are proposed for the task of BAD. In [4], a GMM based variant of the existing probabilistic sequence kernels is proposed, while in [5], an archetypal analysis based sequence kernel is successfully utilized for BAD. Both these frameworks require less training data and generalize well on unseen data. However, the classification performance of these frameworks is not up to the level of state-of-art BAD frameworks. In [6], a masked non-negative matrix factorization (Masked NMF) approach for bird detection is proposed. Masked NMF can handle weak labels by applying a binary mask on the activation matrix during dictionary learning. In [7], a convolutional recurrent neural network (CRNN) for BAD is proposed. Convolutional layers of this network extract local frequency shift invariant features while recurrent layers learn the long-term temporal relations between the features extracted from short-term frames. The classification performance of this network is comparable to the state-of-art methods on the evaluation data of the BAD 2017 dataset. This method does not utilize any pre-processing or data augmentation. However, the recurrent layers are computationally expensive to train, which increases the overall computational time required to train the CRNN. In [8], two CNNs (sparrow and bulbul) are proposed. The ensemble of these two networks provides state-of-art bird activity detection on this dataset.

Inspired by the success of CNN-based architectures for BAD and the simplicity of the all-conv neural network proposed in [9], we propose an all-convolutional neural network (all-conv net) for BAD. This network is characterized by alternating convolution and pooling layers followed by dense layers, where each of these layers is implemented by the convolution operation itself. Instead of max-pooling, the local features obtained using a convolution layer are pooled using a learned aggregation, implemented using convolution operations. This aggregation function is referred to as learned pooling, which aggregates the complementary information present in different feature-maps. On the contrary, max-pool operations aggregate the information present in each feature-map individually (see Section 2). This behaviour of learned pooling helps in obtaining better discriminating features, leading to a better classification performance in comparison to normal max-pooling. Moreover, the proposed all-conv net has a smaller number of trainable parameters in comparison to the existing state-of-art neural networks for the task in hand and does not utilize any pre-processing or data augmentation.

The rest of this paper is organized as follows. In Section 2, the proposed deep all-conv framework is described in detail. Performance analysis and conclusion are in Sections 3 and 4, respectively.

2. Proposed Framework

In this section, we describe the proposed all-conv net for BAD. First, we describe the Mel-spectrogram, a spatio-temporal feature representation obtained from the raw audio recordings. Then, we describe the proposed all-conv net which maps the input Mel-spectrogram to a two-dimensional probabilistic score vector containing the probabilities for the presence and absence of any bird activity.

2.1. Feature representation

Mel-spectrograms, a spatio-temporal feature representation obtained from the audio recordings, are used as input to the proposed all-conv net. Short-term Fourier transform (STFT) is employed to obtain the spectrogram from the input audio recordings. The audio signal is windowed using a frame size of 20 ms with no overlap and a Hamming window. 882 FFT points are used to obtain the Fourier transform of an audio frame. This process converts a 10 second (the duration of recordings in the BAD 2017 dataset) audio recording into a 441 × 500 dimensional spectrogram representation. Each frame of this spectrogram is converted into a 40-dimensional vector of log filter bank energies using a Mel filter-bank [7]. Hence, each 10-second audio recording is represented by a 40 × 500 dimensional Mel-spectrogram.

Figure 1: Proposed all-conv architecture for bird activity detection.
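A hedged sketch of this front end is shown below, using librosa (an assumption; the paper does not name an implementation library), with the frame and filter-bank parameters stated above.

```python
# Mel-spectrogram front end of Sec. 2.1: 20 ms Hamming frames, no overlap,
# 882 FFT points, and 40 Mel filter-bank log energies.
import librosa
import numpy as np

y, sr = librosa.load('recording.wav', sr=44100)   # a hypothetical 10-second clip
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=882, hop_length=882, win_length=882,
    window='hamming', n_mels=40)
log_mel = np.log(mel + 1e-10)                     # roughly 40 x 500 network input
```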

2.2. All-conv net for BAD

The proposed architecture consists of four pairs of convolution and learned pooling layers followed by two (1, 1) convolution layers. The input to the network is a 40 × 500 Mel-spectrogram. A kernel size of 5 × 5 with a stride of 1 × 1 is used in the convolutional layers. To avoid over-fitting, as data is not abundant, the proposed architecture has only 16 filters in each convolutional layer. In order to pool the feature maps, convolutions with kernel sizes of 5 × 1 and 2 × 1, with strides of 5 × 1 and 2 × 1 respectively, are used at the subsequent layers. In the later part of the network, we have two 1 × 1 convolutional layers with 196 and 2 filters, respectively. The output of these layers is a two-dimensional feature map, which is converted into a probabilistic score vector using the soft-max function. The elements of this vector signify the probability of the presence and absence of bird activity in any input Mel-spectrogram.

For regularization, batch normalization [10] has been used after the convolutional layers, and dropout [11] with probabilities of 0.25 and 0.5 has been used after the convolutional layers and learned pooling layers, respectively. The weights of the network have been initialized using a random normal distribution. The network is optimized using the Adam optimizer [12] with a learning rate of 0.001 and a decay of 10^-6, and binary cross-entropy as the loss function. The network is shown in Fig. 1. An exhaustive grid search is applied to establish the various parameters such as the number of convolutional layers, the number of filters, the size of the filter kernel and the drop-out rates. The following are the main highlights of the proposed network:

Input: The input to the network is a Mel-spectrogram of dimensions 40 × 500, extracted from audio wave files (see Section 2.1).

Learned pooling: Replacing max-pooling with convolution pooling layers incorporates the inter-feature dependencies. To pool the feature maps, we utilise strided convolutions instead of max-pooling. Throughout the learning process, the temporal dimension is kept constant. Although, for BAD, preserving the temporal dimension may not make any difference, for other applications which are extensions of BAD, such as bird activity/event segmentation, it is essential to preserve the time information. Through the series of convolution and learned pooling operations, the input data is transformed into a tensor of size 16 × 1 × 500, i.e., 16 unique spatial feature representations. Intuitively, each of these 1 × 500 feature vectors can be interpreted as a position vector which points to certain regions in the input Mel-spectrogram. The elements of these position vectors are coordinates of the peak responses learned by the corresponding filters.

After this stage, these 16 feature vectors are stacked together to form a new 2D feature representation of size 500 × 16 corresponding to each input Mel-spectrogram. Temporal pooling is applied on this 2D representation to obtain a 16-dimensional feature vector. This temporal pooling is implemented using 1D strided convolution. This helps in compressing the 2D matrix representation into a low dimensional feature representation.

Convolutional fully connected layers: The fully connected layers at the end of the network are implemented by using 1 × 1 convolutions in which the number of filters in each of the convolution layers is equivalent to the number of neurons in a fully connected layer. The first layer consists of 196 filters and finally, we reduce our feature map to 2 dimensions to get the required probability of the presence/absence of bird activity.

Activations: ReLU (Rectified Linear Unit) activation has been applied over all the convolutional and learned pooling layers, while on the fully connected layer of 196 filters (which has been implemented using a convolutional layer), we have applied sigmoid activation. Softmax is applied over the final 2-dimensional output of the network to get the probabilities of the presence/absence of bird activity.
Figure 3: (a) Mel-spectrogram of an audio recording containing
bird activity (b) Response of the 8th filter, learned in the first
convolution layer, for the input Mel-spectrogram shown in (a)
Figure 2: Illustration of the difference between (a) max pooling (c) Response of the 11th filter, learned in the first convolution
and (b) learned pooling. layer, for the input Mel-spectrogram shown in (a).

filters. the contrary, 11th filter is learning the background information


After this stage, these 16 feature vectors are stacked together to form a new 2D feature representation of size 500 × 16 corresponding to each input Mel-spectrogram. Temporal pooling is applied on this 2D representation to obtain a 16-dimensional feature vector. This temporal pooling is implemented using 1D strided convolution, which helps in compressing the 2D matrix representation into a low dimensional feature representation.

Convolutional fully connected layers: The fully connected layers at the end of the network are implemented using 1 × 1 convolutions in which the number of filters in each of the convolution layers is equivalent to the number of neurons in a fully connected layer. The first layer consists of 196 filters and, finally, we reduce our feature map to 2 dimensions to get the required probability of the presence/absence of bird activity.

Activations: ReLU (Rectified Linear Unit) activation has been applied over all the convolutional and the learned pooling layers, while on the fully connected layer of 196 filters (which has been implemented using a convolutional layer) we have applied sigmoid activation. Softmax is applied over the final 2-dimensional output of the network to get the probabilities of presence/absence of bird activity.
2.3. Learned pooling vs. max-pooling

As discussed earlier, learned pooling learns an aggregation function that utilizes the correlations among different feature maps. Max-pooling, on the other hand, does not take these inter-feature-map relations into consideration: each feature map is processed individually, and any aggregated feature map is a function of only one input feature map. In learned pooling, however, an aggregated feature map is a function of multiple input feature maps. This behaviour is illustrated in Fig. 2. The analysis of the filters learned at the first convolution layer of the proposed all-conv net confirms that the information learned by each filter can be complementary to that of the other filters. This is illustrated in Fig. 3: the analysis of Fig. 3(b) shows that the 8th filter of the first convolution layer of the proposed all-conv net is only concerned with learning the bird vocalizations, while, on the contrary, the 11th filter is learning the background information (Fig. 3(c)). Thus, it can be deduced that each filter is learning a different behaviour or event. The utilization of this complementary information in the aggregation function can lead to more discriminative features. The learned pooling learns this sort of aggregation, leading to better bird activity detection (see Section 3).

Figure 2: Illustration of the difference between (a) max pooling and (b) learned pooling.

Figure 3: (a) Mel-spectrogram of an audio recording containing bird activity (b) Response of the 8th filter, learned in the first convolution layer, for the input Mel-spectrogram shown in (a) (c) Response of the 11th filter, learned in the first convolution layer, for the input Mel-spectrogram shown in (a).
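A compact way to see the difference: max-pooling reduces each channel independently, whereas a strided convolution produces each pooled channel as a learned weighted combination of all input channels. The following sketch (shapes chosen arbitrarily for illustration) contrasts the two:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 40, 500)  # (batch, channels, freq, time)

# Max-pooling: each output channel depends on exactly one input channel.
max_pool = nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1))

# Learned pooling: a strided convolution whose every output channel is a
# weighted combination of all 16 input channels, with weights learned
# jointly with the rest of the network.
learned_pool = nn.Conv2d(16, 16, kernel_size=(2, 1), stride=(2, 1))

print(max_pool(x).shape)      # torch.Size([1, 16, 20, 500])
print(learned_pool(x).shape)  # torch.Size([1, 16, 20, 500])
print(sum(p.numel() for p in learned_pool.parameters()))  # 16*16*2*1 + 16 = 528
```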
2.4. Discriminative nature of the learned features

The first 10 layers of the proposed all-conv net can be seen as a feature learner, while the last 2 dense layers act as a classifier. In this subsection, we analyse the discriminative ability of the 16-dimensional features generated by the network at the 10th layer (before the dense layers). Fig. 4 shows the feature representations for five different audio files present in the BAD 2017 dataset. These files cover five different cases: the first contains a birdcall mimicked by a human (fake birdcall), the second contains the background without any birdcall (pure noise), the third contains birdcalls (pure birdcall), the fourth has both speech and birdcall (speech+birdcall) and the fifth contains music only (music). The analysis of Fig. 4 highlights the ability of the proposed all-conv net to differentiate between bird activity and various other acoustic events. The feature coefficients corresponding to the first few filters (i.e. 1, 3, 4 and 6) exhibit high magnitudes for birdcalls only. As a result, a clear discrimination between birdcalls and other sounds, such as speech and music, is evident in these features. This highlights the effectiveness of the proposed all-conv net for the task in hand.

3. Performance Analysis

3.1. Dataset and evaluation metric

The performance of the proposed all-conv net is evaluated on the BAD challenge 2017 dataset [13]. The dataset is divided into two parts: development and evaluation. The development dataset has 15690 audio recordings, out of which 7710 have bird activity. The evaluation dataset has 8620 recordings. All of these recordings are 10 seconds long and are sampled at 44.1 kHz. More information about the source and recording conditions of the BAD 2017 dataset can be found at [13].
80% of the development data is used for training the network, while the remaining 20% is used for validation. The number of examples in the bird and non-bird classes is forced to be the same.

Table 1: Number of training parameters in various deep learning frameworks considered in this study

Framework            Trainable parameters
DenseNet [15]        328004
RCNN [7]             806000
Bulbul [8]           373169
Sparrow [8]          309843
All-conv (Proposed)  154414

Figure 4: 16-dimensional feature vectors obtained before the dense layers (magnitude vs. feature index for the fake birdcall, pure noise, pure birdcall, speech+birdcall and music examples).
The area under the Receiver Operating Characteristic curve
(AUC) [14] is used as a metric to evaluate the performance of
the proposed network.
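For reference, a minimal sketch of computing this metric with scikit-learn (the labels and scores below are toy placeholders, not challenge data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# y_true: ground-truth bird presence (1) / absence (0) for each recording.
# y_score: the network's predicted probability of bird presence.
y_true = np.array([1, 0, 1, 1, 0])
y_score = np.array([0.92, 0.10, 0.66, 0.81, 0.34])

auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.4f}")  # 1.0 for this toy example
```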

3.2. Comparative methods

The performance of the proposed all-conv net is compared with various other methods proposed for BAD on the evaluation dataset of the BAD challenge 2017. We have used rapid probabilistic sequence kernels (RPSK) [4], masked NMF (MNMF) [6], RCNN [7], the bulbul and sparrow networks [8] and DenseNets [15] as comparative methods. Apart from the methods listed here, the performance of the proposed all-conv net can be compared with all the entries of the BAD 2017 challenge; the performance of all the entries on the evaluation dataset is listed at [16].

Figure 5: Classification performance (AUC, %) of the proposed all-conv net along with various other methods. Bar values as read from the chart: RPSK 75.2, MNMF 80.1, DenseNet 88.2, RCNN 88.5, bulbul 88.76, sparrow 88.4, All-conv 88.9.
3.3. Classification performance

The classification performance of the proposed all-conv net along with the various comparative methods is depicted in Fig. 5. The analysis of this figure makes it clear that the non-deep-learning methods (RPSK and MNMF) are not able to perform up to the level of the deep learning based frameworks. The performance of the all-conv net is comparable to state-of-the-art deep learning frameworks, including bulbul-net, the winning entry of the BAD challenge 2017. Also, it must be noted that the proposed all-conv net does not utilize any pre-processing or data augmentation; incorporating these steps in the proposed architecture may lead to an improvement in the classification performance.

Number of trainable parameters: Table 1 shows the number of trainable parameters in the deep frameworks considered in this study. Although the comparative deep learning frameworks show performances comparable to the proposed all-conv net, the number of trainable parameters in the all-conv net is significantly smaller than in the other methods.

Effect of max-pooling and learned pooling on the classification performance: To analyse the effect of learned pooling on the classification performance, we replaced each learned pooling layer of the proposed all-conv net with a max-pooling layer. This max-pool variant of the proposed network is shown in Fig. 6. Replacing learned pooling with max-pooling showed a relative drop of 4.39% (from 88.9% to 85.01%) in AUC. This justifies the claim that utilizing the complementary feature-map information during pooling leads to discriminative feature representations that provide better classification performance.

Figure 6: Architecture of the all-conv net and its max-pool variant.

4. Conclusion

In this paper, we proposed an all-conv net for bird activity detection. This network is characterized by the use of convolution operations for implementing all the layers, including the pooling and the fully connected layers. The performance of the proposed network is comparable to state-of-the-art bird activity detection methods. This paper also highlights the use of learned pooling over max-pooling in the proposed network. The experimental observations verify the superiority of learned pooling over traditional max-pooling for the task of bird activity detection. Future work may include exploring learned pooling on other types of acoustic classification tasks to see if the trends observed in bird activity detection replicate themselves on the other tasks.
5. References
[1] T. S. Brandes, “Automated sound recording and analysis tech-
niques for bird surveys and conservation,” Bird Conservation In-
ternational, vol. 18, no. S1, pp. S163–S173, 2008.
[2] C.-H. Lee, C.-C. Han, and C.-C. Chuang, “Automatic classification of bird species from their sounds using two-dimensional cepstral coefficients,” IEEE Trans. Audio, Speech, Language Process., vol. 16, no. 8, pp. 1541–1550, 2008.
[3] D. Stowell, M. Wood, Y. Stylianou, and H. Glotin, “Bird detection
in audio: a survey and a challenge,” in IEEE Int. Workshop Mach.
Learn. Sig. Process., 2016, pp. 1–6.
[4] A. Thakur, R. Jyothi, P. Rajan, and A. D. Dileep, “Archetypal analysis based sparse convex sequence kernel for bird activity detection,” in Proc. Eusipco, Aug. 2017, pp. 1754–1758.
[5] V. Abrol, P. Sharma, A. Thakur, P. Rajan, A. D. Dileep, and A. K. Sao, “Archetypal analysis based sparse convex sequence kernel for bird activity detection,” in Proc. Eusipco, Aug. 2017, pp. 4436–4440.
[6] I. Sobieraj, Q. Kong, and M. Plumbley, “Masked non-negative
matrix factorization for bird detection using weakly labelled data,”
in Proc. Eusipco, 2017, pp. 1819–1823.
[7] E. Cakir, S. Adavanne, G. Parascandolo, K. Drossos, and T. Virta-
nen, “Convolutional recurrent neural networks for bird audio de-
tection,” in Proc. Eusipco, 2017, pp. 1744–1748.
[8] T. Grill and J. Schlüter, “Two convolutional neural networks for
bird detection in audio signals,” in Proc. Eusipco, 2017, pp. 1764–
1768.
[9] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller,
“Striving for simplicity: The all convolutional net,” in Proc. ICLR,
2015.
[10] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” in Proc.
ICML, 2015, pp. 448–456.
[11] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov, “Dropout: A simple way to prevent neural net-
works from overfitting,” J. Mach. Learn. Research, vol. 15, no. 1,
pp. 1929–1958, 2014.
[12] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
[13] “BAD challenge,” http://machine-listening.eecs.qmul.ac.uk/bird-audio-detection-challenge/, accessed: 2017-2-1.
[14] J. A. Hanley and B. J. McNeil, “The meaning and use of the area
under a receiver operating characteristic (roc) curve.” Radiology,
vol. 143, no. 1, pp. 29–36, 1982.
[15] T. Pellegrini, “Densely connected CNNs for bird audio detection,” in Proc. Eusipco, 2017, pp. 1734–1738.
[16] “BAD Challenge 2017 Results,” http://c4dm.eecs.qmul.ac.uk/events/badchallenge_results/, accessed: 2018-02-20.
2017 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 25–28, 2017, TOKYO, JAPAN

RÉNYI ENTROPY BASED MUTUAL INFORMATION FOR SEMI-SUPERVISED BIRD


VOCALIZATION SEGMENTATION

Anshul Thakur, Vinayak Abrol, Pulkit Sharma and Padmanabhan Rajan

School of Computing and Electrical Engineering


Indian Institute of Technology, Mandi
Email: {anshul_thakur, vinayak_abrol, pulkit_s}@students.iitmandi.ac.in, padman@iitmandi.ac.in

ABSTRACT

In this paper we describe a semi-supervised algorithm to segment bird vocalizations using matrix factorization and Rényi entropy based mutual information. Singular value decomposition (SVD) is applied on pooled time-frequency representations of bird vocalizations to learn basis vectors. By utilizing only a few of the bases, a compact feature representation is obtained for input test data. Rényi entropy based mutual information is calculated between the feature representations of consecutive frames. After some simple post-processing, a threshold is used to reliably distinguish bird vocalizations from other sounds. The algorithm is evaluated on field recordings of different bird species under different SNR conditions. The results highlight the effectiveness of the proposed method in all SNR conditions, its improvements over other methods, and its generality.

Index Terms— Bird call segmentation, feature learning using PCA, Rényi entropy

1. INTRODUCTION

Birds play important roles in maintaining the balance of ecosystems. They are present at various steps of the food chain, help in pollination and in seed dispersal. Many bird species are under the threat of population decline due to habitat destruction. Surveying and monitoring are essential for their conservation. Acoustic monitoring provides a convenient and passive way to monitor bird populations in their natural habitats. With the advent of automated recording devices (e.g. the SongMeter series from Wildlife Acoustics Inc.), acoustic monitoring has become easier. These sophisticated devices can collect large amounts of bioacoustic data. By analyzing audio recordings containing bird vocalizations collected in this manner, it is possible to perform tasks such as species identification, tracking of migrant species or examining the avian biodiversity of a given region. Typically, the collected data is processed off-line. In this process, the first step is usually to determine the regions of interest in the recording, also termed segmenting the vocalizations/calls from the background. The segmentation task becomes challenging due to the presence of various background sounds such as rain, insects, other animals and passing traffic. In field conditions, background sounds are unpredictable, which makes acoustic modeling of the background difficult. There is a need for unsupervised or semi-supervised segmentation techniques in which modeling the background is not required.

In this work, we propose a semi-supervised bird vocalization segmentation algorithm which can work in different recording environments and is, to a large extent, not influenced by background disturbances. A dictionary of basis vectors modeling bird vocalizations is learnt from a small amount of labeled training data. The proposed method uses singular value decomposition (SVD) to learn these basis vectors from the time-frequency representation (spectrogram) of bird vocalizations. By projecting the time-frequency representation of a test audio recording onto these bases, a compact representation is obtained for each short-time frame. By estimating the Rényi entropy based mutual information between each pair of consecutive frames, and with some simple post-processing, bird vocalizations can be easily distinguished from other sounds. This is an advantage over some other methods, which are prone to false alarms by being unable to distinguish naturally occurring background sounds like the sounds of insects. Our experimental evaluation on the vocalizations of passerine birds demonstrates the generic nature of the method.

The rest of this paper is organized as follows. In section 2, we discuss some of the methods proposed in the literature targeting bird vocalization segmentation, along with their drawbacks, and describe how the proposed algorithm overcomes these drawbacks. In section 3, the proposed algorithm is described in detail. Performance analysis and conclusion are in sections 4 and 5 respectively.

2. COMPARISONS TO PRIOR WORK

Most methods for segmenting bird vocalizations have either utilized time-domain representations, or have used
spectrogram-based representations. We briefly categorize some of these below.

Manual segmentation: A few bioacoustic studies dealing with bird vocalizations, such as species identification [1] [2], have used manually segmented bird vocalizations. Manual segmentation can be tedious and unfeasible if the amount of data to be processed is large.

Energy or entropy-based methods: Energy calculated in the time domain has been used to segment bird vocalizations in [3] [4] [5]. Spectral entropy and KL-divergence based segmentation methods are proposed in [6] and [7] respectively. The regions containing bird vocalizations exhibit low entropy in comparison to the background regions. In [7], the KL-divergence between the normalized power spectral density of an audio frame and the uniform distribution is computed. This KL-divergence is a measure of entropy; higher KL-divergence corresponds to less entropy and vice-versa. These methods are unsupervised in nature, which is desirable. However, they are not able to distinguish bird vocalizations from any other sound event. Also, energy-based segmentation is affected by the presence of background noise.

Template-based method: A noise-robust template matching based method is proposed in [8]. This method uses dynamic time warping (DTW) and high-energy regions of spectrograms to build noise-robust templates for each type of vocalization. Each vocalization template is built using various examples of that vocalization. This method is effective for most background conditions. However, the disadvantage is that we must know beforehand what vocalizations we wish to segment. Hence, this method may not be scalable in real-world scenarios.
Other methods: In [9], time-frequency based segmentation using a random forest classifier is proposed to segment bird vocalizations in noisy conditions. This method requires a large amount of training examples. A spherical K-means based feature learning method [10] is proposed to model the bird vocalizations of different species. In [11], an unsupervised two-pass segmentation method is proposed. In the first pass, training labels are generated using inverse spectral flatness (ISF) from the input recording itself. ISF is used to distinguish vocalizations and background regions in the input recordings. These vocalizations and background regions are used to build Gaussian mixture models, which are used in the second pass to classify each input frame as either background or bird vocalization. However, like energy and entropy, the ISF used in the first pass is also unable to distinguish bird sounds from non-bird sounds.

2.1. Advantages of the proposed algorithm

Ability to discriminate other background sounds: Using only a few of the basis vectors helps in retaining the information corresponding only to the bird vocalizations and not to the other background sounds. Hence, unlike some of the earlier methods, the proposed method can discriminate between bird vocalizations and non-bird sounds (see section 3).

Better precision: The mutual information criterion based on Rényi entropy is calculated for each pair of consecutive frames, i.e. between the current frame under process and the previous frame, providing more precision as compared to the time-frequency window based entropy calculated in [6]. Entropy calculated from a time-frequency window several frames long will indicate the presence of a bird vocalization even if the vocalization occupies only the first or last few frames of the window, leading to a decrease in segmentation precision.

Generalization: Since the learnt dictionary is a generative model, a bird vocalization not used in learning the basis vectors can still be approximated by the learned basis vectors. This makes the proposed algorithm more generic as compared to the template based technique described in [8]. This behavior is analyzed in detail in section 4.

3. PROPOSED METHOD

3.1. Dictionary learning

To learn the basis vectors, the time-frequency representations of bird vocalizations are extracted using a small amount of labeled training data. The training labels provide information about the start and end times of the vocalizations. These extracted vocalizations are pooled together to form a matrix M of dimensions D × N. Here N is the number of pooled frames and D represents the number of FFT bins used in the spectrogram. This matrix, M, is factorized using singular value decomposition (SVD): M = U × Σ × V^*. Here U is a D × D unitary matrix whose columns contain the left singular vectors, Σ is a D × N diagonal matrix containing the singular values and V is an N × N unitary matrix whose columns contain the right singular vectors. The columns of U are used as the basis vectors of the subspace on which the time-frequency representation of the audio recording is projected to get the features in the testing stage.

Typically, vocalizations of many song birds occupy only a few frequency bins at any given time. Hence the information regarding bird vocalization regions is mostly consolidated in the first few columns of U, which correspond to the directions of highest singular values and hence highest variances. Hence, to retain only the bird vocalization information in the feature domain, the input test audio recording is projected on the first K columns of U as: F = B^T × P. Here B is a matrix of dimensions D × K whose columns are the first K columns of U. P is the time-frequency representation of the test audio signal having D × M dimensions, where M is the number of frames and D is the number of frequency bins. F is the matrix of dimension K × M whose columns contain the feature representations of the input frames. The value of K is determined experimentally.
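A minimal NumPy sketch of this learning and projection step is given below. The toy arrays stand in for real spectrograms; K = 5 and the 257 frequency bins (from 512 FFT points) follow the experimental settings reported later in Section 4.

```python
import numpy as np

def learn_bases(vocalization_spectrograms, K=5):
    """Learn basis vectors from pooled time-frequency representations.

    vocalization_spectrograms: list of (D, n_i) magnitude spectrograms
    cut out of the labeled training vocalizations.
    Returns B, a (D, K) matrix holding the first K left singular vectors.
    """
    M = np.hstack(vocalization_spectrograms)        # (D, N) pooled frames
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :K]                                 # directions of largest variance

def extract_features(B, P):
    """Project a test spectrogram P (D, M) onto the bases: F = B^T P."""
    return B.T @ P                                  # (K, M), one feature vector per frame

# Toy usage with random stand-ins for real spectrograms (D = 257 bins).
train = [np.abs(np.random.randn(257, 80)) for _ in range(10)]
B = learn_bases(train, K=5)
F = extract_features(B, np.abs(np.random.randn(257, 300)))
print(F.shape)  # (5, 300)
```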
Fig. 1: (a) Spectrogram of an audio recording containing human speech, background noise and vocalization of Cassin’s vireo (b) Feature representation of the above spectrogram (c) Normalized MI extracted from the feature representations depicted in (b).

Fig. 2: Box plots of the coefficients calculated for (a) 100 background frames (b) 100 human speech frames and (c) 100 bird sound frames.

Figure 1(b) depicts the feature representation learned from the time-frequency representation of an audio recording shown in Figure 1(a). This feature representation is obtained by projecting the time-frequency representation on the first 5 columns of U. This U is learned by factorizing the pooled time-frequency representations of the vocalizations of Cassin’s vireo, a North American song bird, using SVD. The audio corresponding to the spectrogram shown in Figure 1(a) contains human speech and two Cassin’s vireo vocalizations. By analyzing Figure 1(b), it is clear that information corresponding to the human speech and other background disturbances is not reflected in the feature domain. Each of the coefficients calculated for any non-bird frame has magnitude close to zero. Hence, the variance of the coefficients of any non-bird frame is low. On the other hand, each coefficient calculated for any bird frame has a larger magnitude in comparison to a non-bird frame, and the variance of the coefficients within a bird sound frame is also high. This is due to the fact that none of the learned basis vectors has any contribution in defining the background frames, whereas a bird vocalization frame can be represented as a combination of scaled versions of the learned basis vectors [12]. The contribution of some basis vectors in defining the input bird frame is higher than that of others. This leads to the difference in magnitudes of the coefficients calculated for bird frames.

This behavior is highlighted in Figure 2, which shows box plots of the coefficients of 100 human speech frames, 100 background frames and 100 bird vocalization frames. From this figure it is evident that the magnitude of the coefficients for human speech and background frames is almost constant. However, a significant amount of variation is present for the bird vocalizations.

3.2. Rényi entropy based mutual information

The feature representation of the nth frame, x_n, is converted into a normalized vector using the softmax function:

(x_n)_j = e^{(x_n)_j} / Σ_{k=1}^{K} e^{(x_n)_k}, for j = 1, 2, .., K.

Since the feature coefficients of each non-bird frame approach zero, each coefficient of the frame becomes almost equal after applying the softmax function. However, for any bird vocalization frame, some coefficients exhibit higher values than others. Hence, the normalized feature representations for all the non-bird frames are nearly similar, and more variation occurs for the bird vocalization frames. Considering these feature vectors as sampled random vectors in R^K, this behavior can be discriminated by using mutual information.

Mutual information (MI) between the normalized feature representations of each pair of consecutive frames (i.e. between the nth and (n − 1)th frames) is calculated. This serves the purpose of considering the previous frame along with the current frame in making the segmentation decisions. The MI of a random vector with itself is highest; therefore, the MI between two almost similar feature representations will be higher than that between two representations which are different. Hence, for non-bird regions, MI will be high as compared to the regions having vocalizations, as depicted in Figure 1(c). Also, since the feature representations for the frames of non-bird regions are almost similar, MI across all these regions is almost constant.

The MI between the feature representations of two consecutive frames, x_n and x_{n−1}, each of dimension K × 1, can be calculated as

MI(x_n, x_{n−1}) = H(x_n) + H(x_{n−1}) − H(x_n, x_{n−1})    (1)

Here H(·) represents the entropy. Rényi entropy is used in this work to calculate the MI. The Rényi entropy of order p for the feature representation of the nth frame, x_n, can be calculated as [13]:

H(x_n) = (p / (1 − p)) log(||x_n||_p)    (2)
where p controls the sensitivity towards the shape of the probability distribution of the coefficients of x_n [14]. The value of p (0 < p < 1) is determined experimentally.

3.3. Segmentation using thresholding

The nature of the MI calculated from the feature representations makes the task of thresholding simple. Since the MI for background regions is almost constant, any drop in the value of MI signifies the presence of a bird vocalization. The calculated MI is smoothed using a moving average filter and normalized to lie between 0 and 1. This results in the MI being close to 1 for background regions, as can be seen in Figure 1(c). Thus, a threshold t, just below one, is able to reliably discriminate call regions from other regions.
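A NumPy sketch of this decision stage is given below. Equations (1) and (2) fix the per-frame entropies, but this excerpt does not spell out how the joint entropy H(x_n, x_{n−1}) is estimated, so the `joint_entropy` stand-in here (Rényi entropy of the renormalized elementwise product of the two vectors) is only one plausible choice; the parameter values (p = 0.7, filter length 10, t = 0.9999) follow the experimental settings reported in Section 4.2.

```python
import numpy as np

P_ORDER = 0.7  # Renyi order p, chosen experimentally in the paper

def softmax(x):
    e = np.exp(x - x.max())          # subtract max for numerical stability
    return e / e.sum()

def renyi_entropy(x, p=P_ORDER):
    """Eq. (2): H(x) = p/(1-p) * log(||x||_p)."""
    return (p / (1.0 - p)) * np.log(np.linalg.norm(x, ord=p))

def joint_entropy(x, y, p=P_ORDER):
    """Stand-in joint-entropy estimate (an assumption, see the text above)."""
    joint = x * y
    return renyi_entropy(joint / joint.sum(), p)

def mutual_information(x, y, p=P_ORDER):
    """Eq. (1): MI(x, y) = H(x) + H(y) - H(x, y)."""
    return renyi_entropy(x, p) + renyi_entropy(y, p) - joint_entropy(x, y, p)

def segment(F, t=0.9999, smooth_len=10):
    """F: (K, M) feature matrix; returns a boolean bird/non-bird mask per frame pair."""
    X = np.apply_along_axis(softmax, 0, F)                     # normalize each frame
    mi = np.array([mutual_information(X[:, n], X[:, n - 1])    # MI of consecutive frames
                   for n in range(1, X.shape[1])])
    mi = np.convolve(mi, np.ones(smooth_len) / smooth_len, mode="same")  # moving average
    mi = (mi - mi.min()) / (mi.max() - mi.min())               # normalize to [0, 1]
    return mi < t                                              # drops in MI mark bird frames
```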
4. PERFORMANCE ANALYSIS

4.1. Datasets used

Experimental validation of the proposed algorithm is performed on three datasets. Two datasets consist of recordings of the Cassin’s vireo, a North American songbird. The third dataset has the recordings of another song bird, the California thrasher. The first Cassin’s vireo dataset (CV1) contains twelve audio recordings and is available at [15]. These audio recordings were collected over two months in 2010 and contain almost 800 bird vocalizations or song phrases of 65 different types. The second Cassin’s vireo dataset (CV2) and the California thrasher recordings (CT) are available at [16]. Out of the available 459 recordings of Cassin’s vireo, we have used only 100 recordings here, choosing the recordings with the longest durations and the maximum number of vocalizations. These recordings contain almost 25000 Cassin’s vireo vocalizations of 123 different types. Similarly, out of the available 698 California thrasher recordings, we have chosen the 100 recordings with the maximum durations and numbers of vocalizations. These 100 recordings contain about 15000 bird vocalizations. All the recordings from these three sources are field recordings and contain various types of background noise, including human speech. These recordings are 16-bit mono WAV files having a sampling rate of 44.1 kHz.

To test the proposed algorithm in extreme conditions, background sounds are artificially added to the recordings of the CV1 dataset. Four different types of background sounds, i.e. rain, waterfall, river and cicadas, at 0 dB, 5 dB, 10 dB, 15 dB and 20 dB SNR, are added using the Filtering and Noise Adding Tool (FaNT) [17]. The sound files are downloaded from FreeSound [18].
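The SNR-controlled mixing that FaNT performs can be reproduced in a few lines; the sketch below uses simple global-energy scaling and ignores FaNT’s filtering options, so it is an approximation rather than a reimplementation of the tool:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise energy ratio equals `snr_db`, then mix.

    clean, noise: 1-D float arrays of equal length (mono, same sample rate).
    """
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Required scale so that 10*log10(p_clean / (scale^2 * p_noise)) = snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# e.g. corrupt a recording with rain noise at the five SNRs used in the paper:
# noisy = {snr: mix_at_snr(cv1_wav, rain_wav, snr) for snr in (0, 5, 10, 15, 20)}
```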
4.2. Experiments

Two different experiments are performed to evaluate various aspects of the proposed algorithm. In the first experiment, we compare the performance of the proposed algorithm with existing unsupervised segmentation methods such as short-term energy (STE), spectral entropy (SE) [6], inverse spectral flatness (ISF) [19] and the two-pass unsupervised method (US) [11]. Apart from these methods, the performance is also compared with the supervised template-based method (TM) in [8], and with two variants of the proposed algorithm. The first variant uses non-negative matrix factorization (NMF) for learning the basis vectors instead of SVD. The second variant uses the normalized energy of the feature coefficients (CE) instead of the Rényi entropy based mutual information. The second experiment is designed to demonstrate the general nature of the proposed method by testing on unseen vocalizations.

The F1 score, defined as the harmonic mean of precision and recall, is used as the evaluation metric, computed against the manually labeled ground truth. Both experiments use 10-fold cross-validation. During each fold, one audio recording was used for learning bases and the rest were used as test examples. The average results of these 10 folds are presented in Figure 3 and Table 1 for the first and second experiments respectively.

A frame length of 20 ms with a 10 ms overlap, a Hamming window and 512 FFT points are used to compute the time-frequency representations of the input audio. For calculating the feature coefficients, we project the time-frequency representation of the test audio file on the top K = 5 left singular vectors. For calculating the Rényi entropy, an order of p = 0.7 is used, and a moving average filter of length 10 is used to smooth the MI. A threshold t = 0.9999 is applied on the MI to segment the bird vocalizations. These values of K, p and t are chosen experimentally using a validation set: two recordings from CV1 having the shortest durations form the validation set, and these recordings are not used for either learning basis vectors or testing in any of the experiments. After validation, the same values of K, p and t are used for all the experiments (including the noisy cases).

The parameter settings used in [11] are used for implementing STE, SE, ISF and US. Similarly, the parameter values discussed in [8] are used for implementing TM. The parameter setting used in the proposed algorithm is also used for implementing the NMF and CE variants. However, in the NMF variant, 256 basis vectors are learned from the training data; this number is chosen experimentally.

4.2.1. Comparison of the proposed algorithm with other methods

The performance of the proposed algorithm is compared with the other methods on dataset CV1 and on the artificially created noisy versions of CV1. In the proposed algorithm, the NMF variant and the CE variant, the basis vectors are learned from labeled bird vocalizations extracted from a single audio recording of CV1, and the rest of the recordings are used for testing during each of the 10 folds. For segmenting the noisy versions of CV1, the bases learned from the clean
audio recordings of CV1 are used. The segmentation performance of the proposed algorithm along with the other methods is summarized in Figure 3.

By analyzing Figure 3, it can be concluded that the performance of the proposed algorithm is better than all the other methods except TM, in both noisy and clean conditions. However, TM is a template based method which may not be scalable and requires prior knowledge, in the form of templates, of all the vocalizations we wish to segment. Also, the performance of the proposed method is not affected as strongly as the performances of STE, SE, ISF and US in low SNR conditions. The NMF and CE variants also outperform these methods. The use of the top K left singular vectors in the proposed algorithm instead of the NMF based dictionary provided better segmentation in all conditions. The CE variant also gave good segmentation performance. This shows that a simple energy-based segmentation is good enough to segment the bird vocalizations learnt from basis vectors. But since the Rényi entropy based MI uses context information in terms of the previous and current frames, it provides slightly better segmentation.

Fig. 3: Results of experiment 1: Comparison of the segmentation performances (F1-score vs. SNR in dB) of the different segmentation methods (STE, SE, ISF, US, TM, NMF, CE, Prop.) on noisy variations of CV1 generated by adding noise types (a) rain, (b) river, (c) waterfall, (d) cicada and (e) on clean CV1.

4.2.2. Generic nature of the proposed algorithm

The second experiment has two parts, and establishes the generic nature of the proposed algorithm. In the first part, we learn basis vectors from CV1 and segment the audio recordings of CV2. In the second part, we use the basis vectors learnt from CV1 to segment the recordings of CT, i.e. we learn the basis vectors from the recordings of one species and segment the audio recordings of another. Again, 10-fold cross-validation was used. During each fold, one recording from CV1 was used to learn basis vectors and testing was performed on all the recordings of CV2 and CT. The results are tabulated in Table 1. The analysis of Table 1 shows that the proposed algorithm is able to segment CV2 recordings having 123 different types of Cassin’s vireo vocalizations using the basis vectors learned from a single audio recording of CV1, which has 10 to 25 different types of Cassin’s vireo vocalizations (the number of vocalizations in the training recording depends on the fold). Hence, the proposed method is able to segment vocalizations which are not used in learning the basis vectors.

Also, the proposed algorithm is able to segment recordings of the California thrasher using basis vectors learned from Cassin’s vireo. The segmentation performance obtained in this cross-species experiment is also compared with the segmentation performance obtained by using basis vectors learned from the vocalizations of the California thrasher itself. No significant difference is observed in the performances, which further supports the generic nature of the proposed algorithm. Table 1 also depicts the performance of the other methods alongside the proposed method. By analyzing Table 1, it is clear that the proposed method outperforms the other methods except TM, which requires templates of the vocalizations.

Table 1: Results of experiment 2: Performance of the proposed method for various train-test conditions. (-) indicates that the method is unsupervised. The TM method uses templates of the vocalizations.

Method   Training Dataset   Testing on CV2   Testing on CT
STE      -                  0.53             0.55
SE       -                  0.56             0.57
ISF      -                  0.6              0.61
US       -                  0.64             0.63
TM       CV2, CT            0.79             0.78
NMF      CV1                0.72             0.71
NMF      CT                 0.7              0.73
CE       CV1                0.74             0.71
CE       CT                 0.71             0.73
Prop.    CV1                0.76             0.73
Prop.    CT                 0.74             0.75

4.2.3. Discussion

The proposed method provides reliable segmentation performance only if the vocalizations represented by the bases are similar to the ones in the evaluation data. If the bases
are learned from bird vocalizations which exhibit rapid frequency and temporal modulations but the target vocalizations are wideband in nature, the proposed algorithm will fail. For example, the basis vectors learned from CV1 are not able to segment the sounds of greater sooty owls and forest ravens, which are wide-band in nature. On the other hand, sounds from other passerine birds like the Verditer flycatcher and blue magpies (found in the Indian subcontinent) are effectively segmented by the bases learnt from Cassin’s vireo. Five recordings of the vocalizations of these species (downloaded from [20]) resulted in F1-scores of 0.32, 0.38, 0.85 and 0.79 respectively.

5. CONCLUSION

This paper presented an algorithm for segmenting birdcalls from the background using matrix factorization and Rényi entropy based MI. Experimental evaluation, including comparisons with six existing methods, demonstrated the effectiveness and the generality of the proposed method. The results also indicate that the method can be utilized for segmenting the vocalizations of similar birds in other geographic regions, irrespective of the data used in learning the basis vectors.

6. REFERENCES

[1] V. M. Trifa, A. N. G. Kirschel, C. E. Taylor, and E. E. Vallejo, “Automated species recognition of antbirds in a Mexican rainforest using hidden Markov models,” Jnl. Acoust. Soc. Amer., vol. 123, no. 4, pp. 2424–2431, Apr 2008.

[2] C. H. Lee, C. C. Han, and C. C. Chuang, “Automatic classification of bird species from their sounds using two-dimensional cepstral coefficients,” IEEE Trans. Audio, Speech, Language Process., vol. 16, no. 8, pp. 1541–1550, Nov 2008.

[3] A. Harma and P. Somervuo, “Classification of the harmonic structure in bird vocalization,” in Proc. Int. Conf. Acoust. Speech, Signal Process., 2004, pp. 701–704.

[4] P. Somervuo, A. Harma, and S. Fagerlund, “Parametric representations of bird sounds for automatic species recognition,” IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 6, pp. 2252–2263, Nov 2006.

[5] S. Fagerlund, “Bird species recognition using support vector machines,” EURASIP J. Appl. Signal Process., vol. 2007, no. 1, pp. 64–64, Jan. 2007.

[6] N. C. Wang, R. E. Hudson, L. N. Tan, C. E. Taylor, A. Alwan, and K. Yao, “Bird phrase segmentation by entropy-driven change point detection,” in Proc. Int. Conf. Acoust. Speech, Signal Process., 2013, pp. 773–777.

[7] B. Lakshminarayanan, R. Raich, and X. Fern, “A syllable-level probabilistic framework for bird species identification,” in Proc. Int. Conf. Mach. Learn. Applicat., 2009, pp. 53–59.

[8] K. Kaewtip, L. N. Tan, C. E. Taylor, and A. Alwan, “Bird-phrase segmentation and verification: A noise-robust template-based approach,” in Proc. Int. Conf. Acoust. Speech, Signal Process., 2015, pp. 758–762.

[9] L. Neal, F. Briggs, R. Raich, and X. Z. Fern, “Time-frequency segmentation of bird song in noisy acoustic environments,” in Proc. Int. Conf. Acoust. Speech, Signal Process., 2011, pp. 2012–2015.

[10] D. Stowell and M. D. Plumbley, “Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning,” PeerJ, vol. 2, pp. e488, 2014.

[11] A. Thakur and P. Rajan, “Model-based unsupervised segmentation of birdcalls from field recordings,” in Proc. Int. Conf. Signal Process. Commun. Syst., 2016.

[12] J. Wright, A. Y. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Analysis Mach. Intel., vol. 31, no. 2, pp. 210–227, 2009.

[13] S. Dolinar, “Maximum-entropy probability distributions under lp-norm constraints,” in TDA Progress Report, NASA, Communications Systems Research Section, February 1991, pp. 74–87.

[14] T. Maszczyk and W. Duch, “Comparison of Shannon, Renyi and Tsallis entropy used in decision trees,” Art. Intel. Soft Comp., pp. 643–651, 2008.

[15] “Cassin’s vireo recordings,” http://taylor0.biology.ucla.edu/al/bioacoustics/, Accessed: 2016-03-20.

[16] “Art-sci center, University of California,” http://artsci.ucla.edu/birds/database.html/, Accessed: 2016-07-10.

[17] “Filtering and noise adding tool,” http://dnt.kr.hs-niederrhein.de/, Accessed: 2016-11-14.

[18] “Freesound,” http://freesound.org/, Accessed: 2017-3-13.

[19] B. Yegnanarayana, C. Avendano, H. Hermansky, and P. S. Murthy, “Speech enhancement using linear prediction residual,” Speech Commun., vol. 28, no. 1, pp. 25–42, 1999.

[20] “Xeno-canto,” http://www.xeno-canto.org, Accessed: 2017-3-14.
