Article
DeepDetection: Privacy-Enhanced Deep Voice Detection and
User Authentication for Preventing Voice Phishing
Yeajun Kang , Wonwoong Kim, Sejin Lim, Hyunji Kim and Hwajeong Seo *
Abstract: The deep voice detection technology currently being researched causes personal information leakage because the input voice data are stored in the detection server. To overcome this problem, in this paper we propose a novel system (i.e., DeepDetection) that can detect deep voices and authenticate users without exposing voice data to the server. Voice phishing prevention is achieved with a two-step approach: primary verification through deep voice detection and secondary verification, through user authentication, of whether the caller is the legitimate sender. Since voice preprocessing is performed on the user's local device, voice data are not stored on the detection server. Thus, we can overcome the security vulnerabilities of existing detection research. Using ASVspoof 2019, we achieved an F1 score of 100% in deep voice detection and an F1 score of 99.05% in user authentication. Additionally, the average EER achieved for user authentication was 0.15. Therefore, this work can be effectively used to prevent deep voice-based phishing.
Keywords: voice phishing; deep voice detection; user authentication; privacy preservation;
autoencoder; convolutional neural networks
This approach prevents voice phishing in two ways: primary verification through deep voice detection and secondary verification, through user authentication, of whether the caller is the legitimate sender. Since the voice preprocessing for the classifier is performed on the user's local device, the voice data are not stored in the verification server, thereby overcoming the security vulnerabilities of the existing techniques.
1.1. Contribution
1.1.1. Preprocessing Method to Protect Privacy Using an Autoencoder
For deep voice detection, preprocessing must be performed before classification. However, existing approaches have the vulnerability that the voice data are exposed to the server. We designed the encoder of the autoencoder to perform the preprocessing independently by deploying it to users' devices. The model can thus be separated, with the preprocessing and classification tasks performed on the user's device and the server, respectively. This approach has two advantages. First, it overcomes the problem of original data exposure, which is the vulnerability of existing deep learning-based deep voice detection methods. Second, since meaningful features are extracted through the autoencoder, effective data can be used for user authentication and deep voice detection.
1.1.2. Method for Performing User Authentication and Deep Voice Detection
We propose the DeepDetection system, which can simultaneously perform user authentication and deep voice detection using the ASVspoof 2019 dataset. The fake data in this dataset are generated with speech synthesis technology and closely resemble the genuine users in the real dataset, so we adopted it in our work. To the best of our knowledge, this is the first system that can perform both user authentication and deep voice detection with ASVspoof 2019 simultaneously.
Voice phishing damage caused by simple impersonation can be prevented by authenticating users through user classification; that is, voice phishing can be prevented more effectively when user authentication is added. This method prevents voice phishing in two ways: primary verification through deep voice detection and secondary verification, through user authentication, of whether the caller is the legitimate sender. With the proposed method, an F1 score of 100% was achieved in deep voice detection, and an F1 score of 99.05% was achieved in user authentication.
The remainder of this paper is organized as follows. Section 2 presents related works
concerning AI, neural networks, deepfakes, and datasets, while Section 3 presents the
proposed system (i.e., DeepDetection), with its evaluation given in Section 4. Section 5
concludes the paper.
2. Related Works
2.1. Artificial Neural Networks
2.1.1. Convolutional Neural Networks
A convolutional neural network (CNN) combines convolution operations with a conventional neural network. It was designed to imitate human visual processing and shows strong performance in image processing. A CNN consists of convolution layers, which extract and reinforce features from the input data using kernels (i.e., filters), and a classifier, which classifies images based on the extracted features. In a convolution layer, the same filter traverses the input data and performs convolution operations to extract the features of an image. Because a CNN shares the same weights across positions through these filters, the number of parameters to learn is relatively small, so training takes less time. After the convolution operation, an activation function enables faster and more effective learning. Pooling simplifies the output by performing nonlinear downsampling and reducing the number of parameters the network learns [1]. The early convolution layers extract general features, higher-level layers extract more specific and unique features, and the output of each layer is used as the input to the next; that is, particular features are identified by repeatedly extracting features from the input data and learning based on them. Finally, the resulting feature maps are flattened into a one-dimensional array, which is a form that can be classified, and input to the fully connected layer to perform classification. The more diverse the training data are, the more accurate the classification becomes [2].
The one-dimensional CNN (1D CNN) is effective for training on time series data. In a 2D CNN, which is well known as a neural network for image learning, the kernel moves both horizontally and vertically to learn the local features of the input image, whereas in a 1D CNN the kernel moves in only one direction. This characteristic makes the 1D CNN well suited to learning from time series data.
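As a small illustration of this difference (our own example, not the paper's code), the following PyTorch snippet shows that a one-dimensional convolution slides its kernel along the time axis only:

```python
import torch
import torch.nn as nn

# Conv1d expects input of shape (batch, channels, time) and slides its
# kernel along the time dimension only.
conv = nn.Conv1d(in_channels=8, out_channels=16, kernel_size=15, stride=1)

x = torch.randn(4, 8, 163)   # 4 sequences, 8 channels, 163 time steps
y = conv(x)
print(y.shape)               # torch.Size([4, 16, 149]); 163 - 15 + 1 = 149
```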
2.1.2. Autoencoder
Figure 1 shows the structure of the autoencoder model. The autoencoder consists of
an encoder and a decoder, and it is an artificial neural network trained with an unsupervised learning method that learns without data labels [3]. It first learns an encoded representation of the data and then aims to produce an output from this learned representation that is as close as possible to the input data. Thus, the output of the autoencoder is a prediction of the input. The encoder neural network extracts the features of the input data and generates a latent variable. After this, the latent variable (i.e., the output of the encoder) is input into the decoder neural network, which reconstructs the original data (i.e., the input of the encoder) based on the latent variable. Unlike a classifier, an autoencoder extracts the features of the data from the input and then reconstructs the input from them. Due to these characteristics, it is mainly used for noise removal [4], data visualization and reconstruction, and semantic extraction.
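As a concrete illustration of this encoder/decoder structure, the following minimal PyTorch sketch trains an autoencoder with a mean squared error reconstruction loss (Equation (1)); the layer sizes are our own assumptions and do not reflect the DeepDetection architecture:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Illustrative autoencoder: the encoder compresses the input into a
    latent variable and the decoder reconstructs the original input."""
    def __init__(self, input_dim: int = 1000, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)          # latent variable (extracted features)
        return self.decoder(z)       # reconstruction of the input

model = AutoEncoder()
criterion = nn.MSELoss()             # reconstruction loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 1000)            # a batch of length-1000 voice segments
optimizer.zero_grad()
recon = model(x)
loss = criterion(recon, x)           # the target is the input itself
loss.backward()
optimizer.step()
```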
2.1.3. Mean Squared Error
The mean squared error (MSE) is the average of the squared differences between the predicted values and the actual values, and it is calculated by Equation (1):

$\mathrm{MSE} = \frac{1}{n}\sum_{t=1}^{n}(y_t - \hat{y}_t)^2$  (1)
2.1.4. Precision
Precision is the proportion of the data predicted by the model to be true that are actually true. It is calculated by Equation (2):

$\mathrm{Precision} = \frac{TP}{TP + FP}$  (2)
2.1.5. Recall
Recall is the proportion of the data that are actually true that the model correctly identifies as true. It is calculated by Equation (3):

$\mathrm{Recall} = \frac{TP}{TP + FN}$  (3)
2.1.6. F1-Score
The F1 score, also called the F-measure, is used to measure accuracy in statistical analysis. Plain accuracy can appear high under a data imbalance even when a low-performance model is used. The F1 score therefore gives a more reliable result in the case of a data imbalance. It is the harmonic mean of the precision and recall and is calculated by Equation (4):

$\text{F1-score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$  (4)
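As a quick check of these definitions, the following short Python sketch (our own illustrative example using scikit-learn, not part of the paper's pipeline) computes the precision, recall, and F1 score for a toy set of labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]    # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]    # model predictions

# Precision = TP / (TP + FP), Recall = TP / (TP + FN),
# F1 = 2 * Precision * Recall / (Precision + Recall)
print(precision_score(y_true, y_pred))   # 0.75  (3 TP, 1 FP)
print(recall_score(y_true, y_pred))      # 0.75  (3 TP, 1 FN)
print(f1_score(y_true, y_pred))          # 0.75
```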
2.2. Deepfake
2.2.1. Deepfake
Deepfake is a compound word formed from "deep learning" and "fake", and it refers to false video, images, and voices generated through the steps of extraction, learning, and generation using deep learning technology. As artificial intelligence advances, increasingly sophisticated generation technologies are emerging. Deepfakes can be applied in various fields, but they can also be abused with malicious intent.
2.3. Dataset
ASVspoof
Automatic speaker verification (ASV) is the authentication of individuals by perform-
ing analysis of speech utterances [5,6]. The ASVspoof challenge is a series of challenges that
promote the development of countermeasures to protect ASV from the threat of spoofing [7].
The ASVspoof 2019 challenge provides a standard database [8] for anti-spoofing. For our experiment, the logical access (LA) part of ASVspoof 2019 was used. The LA portion of the dataset consists of bona fide user speech and synthetic speech generated with the latest state-of-the-art text-to-speech (TTS) and voice conversion (VC) technologies, including Tacotron 2 [9] and WaveNet [10]. The spoofed data in the ASVspoof 2019 database have been shown to be similar to the target speakers [8]. Most datasets for deep voice detection contain deep voices for only a single person. However, our proposal enhances voice phishing prevention by verifying the correct sender after primary verification through deep voice detection, which requires a deep voice dataset for each user. Because ASVspoof 2019 contains a deep voice for each user, it is suitable for our work, and we adopted it to demonstrate the utility of DeepDetection.
The false acceptance rate (FAR) is calculated by Equation (6):

$\mathrm{FAR} = \frac{FP}{FP + TN}$  (6)
3. Proposed Method
The proposed system extracts features through an autoencoder from the voice data in
a user’s mobile device. After that, the extracted features are transmitted to the server and
input into the classifier, and deep voice detection and user authentication are performed.
Figure 2 shows a schematic overview of the proposed method. In the specific process, after the voice data are received as input, the size of the voice signal is unified through data preprocessing. The preprocessed data are input into the autoencoder, and meaningful features are extracted during encoding. The encoder can be deployed on users' devices. As shown in Figure 2, when the user's mobile and server environments are separated, only the preprocessed features, and not the original voice, are stored on the server. Since the original voice is never exposed, privacy is protected. The extracted feature vector is input into the classifier, which classifies the input voice data as one of the users or as a deep voice. Through this process, user authentication and deep voice detection are possible simultaneously.
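The division of labor between device and server can be sketched as follows (an illustrative outline under our own assumptions; the file names, function names, and class mapping are hypothetical, not the paper's implementation):

```python
import torch

# --- On the user's mobile device -------------------------------------------
# Only the encoder part of the trained autoencoder is deployed here.
encoder = torch.load("encoder.pt")        # hypothetical file name
encoder.eval()

def extract_features(voice_segment: torch.Tensor) -> torch.Tensor:
    """Run preprocessing locally; only the latent features leave the device."""
    with torch.no_grad():
        return encoder(voice_segment)

# --- On the detection server ------------------------------------------------
# The server receives only the latent features, never the raw voice.
classifier = torch.load("classifier.pt")  # hypothetical file name
classifier.eval()

def detect(features: torch.Tensor) -> int:
    """Return the predicted class (e.g., 0 = deep voice, 1-20 = users; assumed mapping)."""
    with torch.no_grad():
        logits = classifier(features)
        return int(logits.argmax(dim=-1))
```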
3.1. Dataset
3.1.1. Dataset Configuration
Table 1 shows the details of the dataset for the proposed system.
We classified 20 users and one deep voice class; in other words, the entries were divided into a total of 21 classes. To construct a dataset suitable for our system, the deep voice data were sampled at the same ratio from each user's deep voice and then combined into a single class, regardless of whose voice had been synthesized. The dataset constructed according to these criteria was preprocessed for training and inference.
Table 1. Details of the dataset (U, D: the number of users and deep voices).
3.1.2. Preprocessing
The voices are time series data, and the raw audio could not be used directly as input for the neural network. Therefore, we preprocessed the raw audio into a NumPy array for use as input to the neural network. Because the sequence length of the original voice data was at least 50,000 samples, a very large amount of RAM was required for training and inference, and an out-of-memory (OOM) error occurred in our system with 50 GB of RAM. Since the length of each voice sample differed, it was also necessary to unify the sequence length. To reduce the sequence length of the voice data without performance degradation, each voice was sliced into segments of length 1000. Through this, we reduced the overhead and increased the amount of voice data from 2450 to 130,516 samples. The 130,516 samples were divided into 60% for training, 20% for validation, and 20% for testing.
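The slicing step can be sketched as follows (our own illustration; the soundfile library and file names are assumptions rather than the paper's exact pipeline):

```python
import numpy as np
import soundfile as sf   # assumed audio-loading library

SEGMENT_LEN = 1000       # sequence length used for training

def slice_voice(path: str) -> np.ndarray:
    """Load one voice file and slice it into fixed-length segments."""
    audio, _sr = sf.read(path)                 # raw waveform as a NumPy array
    n_segments = len(audio) // SEGMENT_LEN     # drop the trailing remainder
    segments = audio[: n_segments * SEGMENT_LEN].reshape(n_segments, SEGMENT_LEN)
    return segments.astype(np.float32)

# Example: one recording of ~50,000 samples yields ~50 training segments.
# segments = slice_voice("speaker01_utt001.flac")   # hypothetical file name
```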
3.3. Convolutional Neural Network for User Authentication and Deep Voice Detection
Figure 4 shows the structure of the neural network for user authentication and deep voice detection. A feature vector extracted from speech through the separate autoencoder is input into the classifier; in our scenario, the autoencoder is deployed on a mobile device. The classifier distinguishes a total of 21 classes, consisting of the deep voice class and User 1 to User 20. Unlike a two-dimensional CNN, a one-dimensional CNN is mainly used for processing sequential data, and since our dataset consists of voices, i.e., sequential data, we used a one-dimensional CNN. The input vector passes through four one-dimensional convolution layers, is then flattened, and is input into two fully connected layers. The output layer does not include a softmax activation function for multi-class classification, because the cross-entropy loss function provided by PyTorch applies softmax internally; thus, a separate activation function is not used for the output layer. Table 3 shows the details of the hyperparameters, which were derived through hyperparameter tuning. The input shape of the classifier was (8, 163), and the number of neurons in the output layer was 21, matching the number of labels. The output channels of the Conv1D layers were 32, 64, 32, and 8, the kernel size of the convolution layers was 15, and the stride was 1. After the convolutions, the output was flattened into a one-dimensional array of length 856. The numbers of output neurons in the fully connected layers were 64 and 21. Since this is a multi-class classification task, cross-entropy loss was used, and Adam with a learning rate of 0.0001 was used as the optimizer. Finally, the number of epochs was set to 50, and the batch size was set to 128.
Table 3. Details of the hyperparameters.

Hyperparameters               Descriptions
Shape of Input and Output     Input (8, 163), Output (21)
Number of Labels              21 (1 Deep Voice and 20 Users)
Neurons of Layers             Conv1D (149, 135, 121, 107), Flatten (856), Linear (64, 21)
Channels of Layers            Conv1D (32, 64, 32, 8)
Kernel Size                   15
Stride                        1
Batch Size                    128
Loss Function                 Cross-Entropy Loss
Activation Function           ReLU
Optimizer                     Adam (lr = 0.0001)
Epochs                        50
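A minimal PyTorch sketch consistent with Table 3 is shown below; details not specified in the paper (e.g., the absence of padding, pooling, and dropout) are our assumptions:

```python
import torch
import torch.nn as nn

class DeepDetectionClassifier(nn.Module):
    """Sketch of the 21-class classifier (1 deep voice class + 20 users), following Table 3."""
    def __init__(self, num_classes: int = 21):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(8, 32, kernel_size=15, stride=1), nn.ReLU(),   # -> (32, 149)
            nn.Conv1d(32, 64, kernel_size=15, stride=1), nn.ReLU(),  # -> (64, 135)
            nn.Conv1d(64, 32, kernel_size=15, stride=1), nn.ReLU(),  # -> (32, 121)
            nn.Conv1d(32, 8, kernel_size=15, stride=1), nn.ReLU(),   # -> (8, 107)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                    # 8 * 107 = 856
            nn.Linear(856, 64), nn.ReLU(),
            nn.Linear(64, num_classes),      # no softmax: CrossEntropyLoss applies it internally
        )

    def forward(self, x):                    # x: (batch, 8, 163) encoded features
        return self.classifier(self.features(x))

model = DeepDetectionClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(128, 8, 163)                 # one batch of encoded feature vectors
logits = model(x)                             # shape: (128, 21)
loss = criterion(logits, torch.randint(0, 21, (128,)))
```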
4. Evaluation
4.1. Experiment Environment
We used Google Colaboratory Pro+, a cloud-based service, for this experiment. The
operating system was Ubuntu 18.04.5 LTS, and the RAM was 51 GB. The GPU was a Tesla
P100 16 GB, and the version of CUDA was 11.1. We used Python 3.7.13 and PyTorch
1.11.0+cu113.
probability of 0.04 when classifying 21 classes. In other words, since user authentication
is possible only when classified with a high probability, the proposed model is robust in
determining whether the sender is a deep voice or the correct sender.
Deep Voice: 0.1; User 1: 0.14; User 2: 0.11; User 3: 0.09; User 4: 0.08; User 5: 0.14; User 6: 0.25; User 7: 0.07; User 8: 0.05; User 9: 0.3;
User 10: 0.08; User 11: 0.1; User 12: 0.2; User 13: 0.12; User 14: 0.04; User 15: 0.15; User 16: 0.18; User 17: 0.12; User 18: 0.13; User 19: 0.21; User 20: 0.35
Table 6. Comparison with previous works (O = method provided; X = method not provided).
5. Conclusions
In this paper, we proposed a system (i.e., DeepDetection) that simultaneously performs
user authentication and deep voice detection to prevent voice phishing without privacy
leakage. We designed a deep voice detection and user authentication model that achieves an F1 score of 100% in deep voice detection and an F1 score of 99.05% in user authentication.
Author Contributions: Conceptualization, Y.K.; Software, Y.K., W.K., S.L. and H.K.; Supervision,
H.S.; Writing—original draft, Y.K.; Writing—review & editing, W.K., S.L. and H.K. All authors have
read and agreed to the published version of the manuscript.
Funding: This work was supported by Institute for Information & Communications Technology
Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00264, Research on
Blockchain Security Technology for IoT Services, 100%).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017,
60, 84–90. [CrossRef]
2. Bhatt, D.; Patel, C.; Talsania, H.; Patel, J.; Vaghela, R.; Pandya, S.; Modi, K.; Ghayvat, H. CNN variants for computer vision:
History, architecture, application, challenges and future scope. Electronics 2021, 10, 2470. [CrossRef]
3. Ali, M.H.; Jaber, M.M.; Abd, S.K.; Rehman, A.; Awan, M.J.; Vitkutė-Adžgauskienė, D.; Damaševičius, R.; Bahaj, S.A. Harris
Hawks Sparse Auto-Encoder Networks for Automatic Speech Recognition System. Appl. Sci. 2022, 12, 1091. [CrossRef]
4. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In
Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 June 2008; pp. 1096–1103.
5. Delac, K.; Grgic, M. A survey of biometric recognition methods. In Proceedings of the Elmar-2004. 46th International Symposium
on Electronics in Marine, Zadar, Croatia, 18 June 2004; pp. 184–193.
6. Naika, R. An overview of automatic speaker verification system. In Intelligent Computing and Information and Communication;
Springer: Berlin/Heidelberg, Germany, 2018; pp. 603–610.
7. Todisco, M.; Wang, X.; Vestman, V.; Sahidullah, M.; Delgado, H.; Nautsch, A.; Yamagishi, J.; Evans, N.; Kinnunen, T.; Lee, K.A.
ASVspoof 2019: Future horizons in spoofed and fake audio detection. arXiv 2019, arXiv:1904.05441.
8. Wang, X.; Yamagishi, J.; Todisco, M.; Delgado, H.; Nautsch, A.; Evans, N.; Sahidullah, M.; Vestman, V.; Kinnunen, T.; Lee, K.A.;
et al. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Comput. Speech Lang. 2020,
64, 101114. [CrossRef]
9. Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerry-Ryan, R.; et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4779–4783.
10. Van Den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.W.; Kavukcuoglu, K.
WaveNet: A generative model for raw audio. SSW 2016, 125, 2.
11. AlBadawy, E.A.; Lyu, S.; Farid, H. Detecting AI-Synthesized Speech Using Bispectral Analysis. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 104–109.
12. Wang, R.; Juefei-Xu, F.; Huang, Y.; Guo, Q.; Xie, X.; Ma, L.; Liu, Y. Deepsonar: Towards effective and robust detection of
ai-synthesized fake voices. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16
October 2020; pp. 1207–1216.
13. Ballesteros, D.M.; Rodriguez-Ortega, Y.; Renza, D.; Arce, G. Deep4SNet: Deep learning for fake speech classification. Expert Syst.
Appl. 2021, 184, 115465. [CrossRef]
14. Lim, S.Y.; Chae, D.K.; Lee, S.C. Detecting Deepfake Voice Using Explainable Deep Learning Techniques. Appl. Sci. 2022, 12, 3926.
[CrossRef]
15. Gomez-Alanis, A.; Peinado, A.M.; Gonzalez, J.A.; Gomez, A.M. A light convolutional GRU-RNN deep feature extractor for ASV
spoofing detection. In Proceedings of the Conference of the International Speech Communication Association, Graz, Austria,
15–19 September 2019; Volume 2019, pp. 1068–1072.
16. Chen, T.; Kumar, A.; Nagarsheth, P.; Sivaraman, G.; Khoury, E. Generalization of audio deepfake detection. In Proceedings of the
Odyssey 2020 The Speaker and Language Recognition Workshop, Tokyo, Japan, 1–5 November 2020; pp. 132–137.
17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
18. Wu, Z.; Das, R.K.; Yang, J.; Li, H. Light convolutional neural network with feature genuinization for detection of synthetic speech
attacks. arXiv 2020, arXiv:2009.09637.
19. Ma, H.; Yi, J.; Tao, J.; Bai, Y.; Tian, Z.; Wang, C. Continual learning for fake audio detection. arXiv 2021, arXiv:2104.07286.
20. Wei, L.; Long, Y.; Wei, H.; Li, Y. New Acoustic Features for Synthetic and Replay Spoofing Attack Detection. Symmetry 2022,
14, 274. [CrossRef]
21. Wu, Z.; Li, H. Voice conversion versus speaker verification: An overview. In APSIPA Transactions on Signal and Information
Processing; Cambridge University Press: Cambridge, UK, 2014; Volume 3.