Privacy-preserving big data analytics: a comprehensive survey

H.-Y. Tran (hongyen.tran@student.adfa.edu.au), J. Hu (j.hu@adfa.edu.au)

Article history: Received 20 January 2019; received in revised form 31 July 2019; accepted 23 August 2019; available online 7 September 2019.

Abstract

In this paper, we present a comprehensive survey of privacy-preserving big data analytics. We introduce well-designed taxonomies which offer both systematic views and a detailed classification of this challenging research field. We give insights into recent studies on active topics in the field. Furthermore, we identify open future research directions for privacy-preserving big data analytics. This survey can serve as a good reference source for the development of modern privacy-preserving techniques to address the various privacy-related scenarios encountered in practice.

Keywords: Privacy-preserving big data analytics; anonymity; private learning; secure outsourcing; social networks

1.2. Related work

In this part, we review the existing surveys in the literature to obtain knowledge of previous efforts and to differentiate our work from existing ones.
Table 1
Brief summary of related surveys. Each work is compared with ours along the following aspects: big data security, privacy disclosure, analytics operations, protection mechanisms, privacy vs. other concepts, generalization of studies to scenarios, and open research problems. Works compared: Matwin [56], Li et al. [48], Senosi and Sibiya [74], Wang et al. [91], Xu et al. [96], Vennila and Priyadarshini [85], Yu [103], Fang et al. [29], Wang et al. [93], Lv et al. [52], Terzi et al. [83], Ye et al. [101], Pham et al. [65], Abawajy et al. [1], Hu and Vasilakos [36], Desai et al. [23], and our work.
For privacy-preserving data mining, a number of surveys have been conducted. In 2008, Aggarwal and Yu in [3] organized the topics in privacy-preserving data mining in such a way that they were explored independently by the cryptography, database, and statistical disclosure control communities. This study provides a comprehensive survey of traditional techniques on general data, encompassing randomization and anonymization techniques. It also considers encryption methods for distributed data mining. Subsequent studies [48,56,74,91] are quite similar, but they do not sufficiently provide a broader and deeper perspective.

Some recent reviews have discussed the challenges and opportunities of big data security and privacy from different approaches. Vennila and Priyadarshini in [85] focused on global recording anonymization for preserving big data privacy using MapReduce on clouds. Yu et al. in [103] reviewed two major research categories of privacy from different disciplines, focusing on mathematical models and metrics of a privacy framework. In the reference [96], Xu et al. summarized privacy-related concerns and countermeasures with a user-role-based approach. Different privacy attacks and a privacy-preserving framework from communication perspectives were analyzed in [93]. The references [29,52] classified key privacy-preserving techniques in big data. Terzi et al. in [83] reviewed big data privacy and security approaches in the literature in terms of infrastructure, application, and data. The study in [101] introduced security and privacy issues arising from the effects of big data characteristics. Based on some primary privacy issues in social networks, recent developments in big data analytics were discussed in [1,65]. Several reviews of privacy-preserving big data analytics for smart grids [23,36] were also conducted.

There are a number of survey articles about privacy issues in big data and privacy-preserving data mining techniques. However, few of them focus on privacy-preserving big data analytics with sufficient big data flavor in fundamental data analytics operations. Related survey articles are briefly presented in Table 1 and compared with our work. Our study is distinct from previous studies, as we extensively cover the area of privacy-preserving big data analytics by a systematic, multi-dimensional, and practical scenario-based approach, particularly focusing on active research domains such as social networks. Our survey also presents the close relationships between different aspects of the field.

1.3. Main contributions

In this paper, we present a survey reflecting the latest developments in the field of privacy-preserving big data analytics, with a focus on social big data. Our contributions are manifold:

• It provides a comprehensive survey covering both systematic and multi-dimensional views of privacy-preserving big data analytics in an integrated framework, with the consideration of different typical practical scenarios.
• It proposes several well-designed taxonomies for privacy-preserving big data analytics, which can help understand different aspects of the field as well as the intriguing relationships between them.
• We discuss the difference between privacy and confidentiality, and their different approaches to protection.
• Selected practical privacy-related scenarios, with examples from emerging applications such as social networks, are presented.
• We discuss the challenges of privacy preservation in the context of big data and, with respect to these, identify open research problems for future work.

The rest of the paper is organized as follows. We introduce an overview of big data analytics in Section 2. Generic privacy-preserving mechanisms are explored in Section 3. Section 4 reviews existing research on privacy-preserving big data analytics. Finally, a summary and open research problems in the field are given in Section 5.
2. Overview of big data analytics

Big data is generated in huge volumes, in a variety of data formats, and at a rapid velocity. With such properties, traditional system platforms and data analytics are not able to handle such big data. Therefore, hardware platforms (high performance computing clusters, multi-core CPUs, graphics processing units, etc.) as well as software frameworks (Hadoop, Spark, etc.) [78,84] have been developed to deal with the problem of the high computational complexity (time complexity and space complexity) of analysis operations due to big data characteristics. The proposed taxonomy in Fig. 1 includes the primary aspects of the field. It covers the sources and characteristics of big data, the need for developing high performance computing platforms, and appropriate data processing methods for efficiently performing fundamental data analytics operations in providing intelligent analytics services, with the consideration of big data security and privacy.

First of all, there is the consideration of big data itself, including data sources and fundamental intrinsic data properties. As information technology spreads fast, most data have been generated digitally from emerging applications and technologies, such as social media and the Internet of Things (IoT). These data are large-scale, high-dimensional, dynamic, real-time, and mostly noisy and of poor quality. Besides, such data are presented in different forms, including structured, semi-structured, and unstructured data [42].

Because of these properties, big data analytics requires dedicated platforms and processing mechanisms. Distributed Relational Database Management Systems (RDBMS), Distributed File Systems (DFS), and NoSQL are some well-known big data platforms [7,78]. In response to the problems of analyzing such big data, several processing methods have been studied: batch processing, parallel and distributed computing, interactive and iterative computation, as well as incremental and approximate approaches. As shown in the taxonomy, data analytics is a complete set of activities, which takes care of the preparation, exploration, modeling, and mining of data to extract meaningful insights for descriptive, predictive, and prescriptive intelligent analytics applications, such as monitoring systems [49,106,113]. Data mining is a fundamental operation in such a whole data analytics process. Machine learning, particularly deep learning, can be utilized to address learning problems in big data analytics, including extracting high-level complex patterns from massive volumes of data [61].

Big data generated from various application sources, like social media and smart meters, can contain individual users' private information. This sensitive information may be in the explicit form of data input or the implicit form of data output, which is revealed after data analytics processes. Therefore, although data analytics is useful in decision making, it potentially leads to serious privacy concerns. With this in mind, infrastructure security and data security/privacy have become significant considerations besides the utility aspect of analytics.

3. Generic privacy-preserving mechanisms

Privacy commonly relates to the notion that sensitive information about individuals or groups is not disclosed to others. Although privacy and confidentiality overlap in some contexts, they are somewhat different in terms of their concepts and protection methods. Confidentiality is considered a "data-oriented" concept, which means it is about the data itself, with the purpose of keeping the data known only to authorized parties [74], while privacy involves an additional "data-owner-oriented" concept. Privacy focuses on the data owners, which might be individuals or groups, with the aim of protecting their private information while utilizing their data for data analytics [29,93]. Therefore, the concept of privacy includes the consideration of the utility trade-off when the public utilizes the data, where the sensitive information should still be kept incomprehensible and/or anonymized. Differences in their concepts lead to differences in their protection methods. Privacy-preserving protection methods take into account the utility of the data besides the protection of privacy. Confidentiality is realized primarily by encryption and access control [10]. Cryptography is also utilized to preserve privacy in terms of keeping the data confidential; however, the employed encryption schemes and cryptographic protocols have to be able to guarantee utilities on encrypted data, which also means that the data can be analyzed without decryption. For confidentiality, this capability does not need to be guaranteed.
Besides cryptographic tools, privacy is also preserved by perturbation (to mask the true data values) and anonymization (to hide the 'data–data owner' links) techniques. Another difference is that loss of confidentiality is mostly caused by weaknesses of access control or encryption schemes, while loss of privacy is associated with not only these but also vulnerabilities in the control of data utilization and data analytics (leading to linking attacks, inference attacks, etc.).

In this section, generic data-driven and computation-driven privacy-preserving mechanisms will be presented, which differ from other approaches such as policy-driven or system-driven ones. We develop a framework-based taxonomy of privacy-preserving mechanisms consisting of three key components: protection methods, models, and metrics. As a whole, privacy-preserving mechanisms ensure data privacy by a protection method satisfying a privacy model and being evaluated by a metric measurement (see Fig. 2).

3.1. Protection methods

Privacy-preserving protection methods are classified into cryptographic, non-perturbative, and perturbative methods. Cryptographic methods guarantee privacy by means of encryption schemes and protocols, while the others preserve privacy by anonymization and perturbation.

3.1.1. Cryptographic methods

Cryptographic methods mostly employ Secure Multi-party Computation (SMC) [121], implemented using some form of Homomorphic Encryption (HE) scheme [2]. They preserve privacy by keeping private data items in encrypted form, performing utility functionalities on the encrypted individual data items to obtain encrypted aggregate results, and finally decrypting these results to get the same outputs as if the functionalities had been computed on the plain individual data items.

SMC is a generic cryptographic primitive that allows distributed parties to jointly compute a functionality without revealing their own private inputs. SMC was introduced by Yao in [100], and a great number of theoretical studies as well as application-oriented SMC protocols [121] have been conducted to solve security and privacy issues in emerging technologies such as cloud computing, mobile computing, and the Internet of Things. In a privacy-preserving context, SMC can be utilized when: (1) the functions are known, (2) the computation is distributed over multiple data owners, and (3) there is a privacy requirement for each site. Formally, SMC can be stated as follows: there are n parties P1, ..., Pn owning n databases DB1, ..., DBn, and SMC helps compute a joint function f(DB1, ..., DBn) while protecting the private input DBi of each party.

SMC is mainly implemented by using an HE scheme. An encryption scheme is called homomorphic over an operation '⋆' if it supports the equation E(m1) ⋆ E(m2) = E(m1 ⋆ m2) for all m1, m2 ∈ M, where E is the encryption algorithm and M is the set of all possible messages. Informally, HE is a type of encryption that allows a computation performed on ciphertexts to generate an encrypted result such that, when decrypted, it equals the result of the same computation performed on the plaintexts. Although cryptographic computation can guarantee "perfect" privacy in terms of keeping data confidential, thanks to the security strength of cryptographic primitives, its weaknesses are challenges in scalability and implementation efficiency, mostly due to the high computational complexity of existing homomorphic encryption schemes.
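Since the discussion of HE above stays abstract, a minimal Python sketch of the textbook Paillier cryptosystem (an additively homomorphic scheme) may help make the property E(m1) ⋆ E(m2) = E(m1 ⋆ m2) concrete. The primes below are toy-sized and the code is illustrative only; a real deployment would use a vetted library and keys of 2048 bits or more.

```python
import math
import random

def L_func(x, n):
    # Paillier's L function: L(x) = (x - 1) / n over the integers
    return (x - 1) // n

def keygen(p, q):
    # Toy key generation (requires Python 3.9+ for math.lcm and the
    # modular inverse via pow(x, -1, n)); real keys use huge primes.
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1                                  # standard generator choice
    mu = pow(L_func(pow(g, lam, n * n), n), -1, n)
    return (n, g), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    r = random.randrange(2, n)                 # fresh randomness per ciphertext
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (L_func(pow(c, lam, n * n), n) * mu) % n

pk, sk = keygen(1789, 1907)                    # toy primes only
c1, c2 = encrypt(pk, 42), encrypt(pk, 58)
c_sum = (c1 * c2) % (pk[0] ** 2)               # ciphertext product ...
assert decrypt(pk, sk, c_sum) == 100           # ... decrypts to the plaintext sum
```

Here '⋆' is multiplication on ciphertexts and addition on plaintexts, which is exactly the additive homomorphism that the aggregate-then-decrypt workflow described above relies on.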
3.1.2. Non-perturbative methods

A non-perturbative method preserves privacy by sanitizing identifiable information, thereby preventing an identity from being connected with an adversary's background information. This method was originally implemented on structured data [73] and typically considers data in the form of a private table consisting of multiple records. Each record is composed of the following four types of attributes: Identifiers (ID), Quasi-Identifiers (QID), Sensitive Attributes (SA), and Non-sensitive Attributes (NSA). Basically, the original private table D is transformed into the sanitized table T by removing ID and modifying QID to QID′ through non-perturbative operations on QID, such as generalization and suppression, so that several records become indistinguishable with respect to QID′, thereby preserving privacy by ambiguity and anonymity:

D(ID, QID, SA, NSA)  --anonymize-->  T(QID′, SA, NSA)

Non-perturbative methods based on the same idea have been developed for semi-structured and unstructured data, such as social network data [1,38,41], trajectory data, and data streams [72], as well as to deal with scalability issues [58,63,104,107,109,119].
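As a concrete illustration of the D → T transformation, the following sketch (on hypothetical toy records) suppresses the identifier, generalizes the age and ZIP-code quasi-identifiers, and keeps only the QID′ groups in which at least k records are indistinguishable.

```python
from collections import defaultdict

# Toy private table D: (ID, age, zipcode, disease).
# ID is suppressed; age and zipcode form the QID; disease is the SA.
records = [
    ("alice", 34, "53711", "flu"),
    ("bob",   37, "53715", "hiv"),
    ("carol", 36, "53712", "flu"),
    ("dave",  52, "53703", "cancer"),
    ("erin",  55, "53706", "flu"),
    ("frank", 58, "53708", "hiv"),
]

def generalize(age, zipcode):
    # Generalization: age -> decade range, ZIP -> 3-digit prefix
    decade = (age // 10) * 10
    return (f"{decade}-{decade + 9}", zipcode[:3] + "**")

def anonymize(records, k):
    groups = defaultdict(list)
    for _id, age, zipcode, sa in records:      # drop the identifier
        groups[generalize(age, zipcode)].append(sa)
    # Suppress any QID' group smaller than k (k-indistinguishability)
    return {qid: sas for qid, sas in groups.items() if len(sas) >= k}

for qid, sas in anonymize(records, k=3).items():
    print(qid, sas)   # each printed group hides >= 3 records behind one QID'
```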
3.1.3. Perturbative methods

The general idea of a perturbative method is to disturb the original data values so that the statistical information computed from the disturbed data does not differ significantly from that of the original data. There are two data perturbation techniques: input perturbation and output perturbation. In the first technique, the input data are perturbed, and the perturbed values are published. In the other, the data are kept secret with no transformation, but noise is added to the answer/output of a query on the data. A perturbative method is generally implemented by synthetic data, which shares some statistical attributes with the original data, and by random additive/multiplicative noise, which preserves some simple statistical information, such as means and correlations, for data analytics.
One strength of a perturbative method is that it is relatively simple and efficient, and it does not require knowledge of the distribution of the other records when adding noise to a given record, whereas a non-perturbative method requires knowledge of the remaining records in order to implement non-perturbative operations. However, this is also its weakness, because outliers are more vulnerable to probabilistic attacks, and it is difficult to mask them without reducing data utility [4]. Another strength, but also weakness, of a perturbative method is that the perturbed records do not correspond to any real-world record owners, thereby preventing linking attacks; however, these records are "not real" [92], while non-perturbative operations still preserve the "truthfulness" of the data, although making the data less specific.
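The toy sketch below contrasts the two techniques on a hypothetical salary column: input perturbation publishes noisy records whose aggregate statistics remain approximately correct, while output perturbation keeps the records secret and adds noise only to the query answer. The noise parameters are illustrative, not calibrated to any formal guarantee.

```python
import random

salaries = [48_000, 52_000, 61_000, 75_000, 90_000]
mean = lambda xs: sum(xs) / len(xs)

# Input perturbation: the perturbed records themselves are published.
noisy_records = [s + random.gauss(0, 2_000) for s in salaries]
print(mean(salaries), mean(noisy_records))     # means stay close

# Output perturbation: data stay secret; only the noisy answer leaves.
def noisy_mean_query(data, sigma=1_000):
    return mean(data) + random.gauss(0, sigma)

print(noisy_mean_query(salaries))
```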
3.2. Models

A number of privacy models have been proposed in the literature to formalize the above-mentioned privacy protection methods. This subsection summarizes several important privacy models: k-anonymity, l-diversity, t-closeness, and ϵ-differential privacy.

3.2.1. k-anonymity, l-diversity, t-closeness

k-anonymity is a well-known anonymity privacy model against record linkage attacks, in which an adversary, with the help of additional knowledge, can uniquely identify the victim's record [73]. The principle of k-anonymity is to hide the victim's record among k − 1 other records (dummies) with the same QID. Thus, it is difficult for an adversary to identify the actual record. However, if there is a predominant value of the SA, it is possible for the adversary to implement attribute linkage attacks, which associate the victim with her SA value without having to identify her record. To prevent these attacks, Machanavajjhala et al. in [54] proposed l-diversity, which requires every QID group to contain at least l distinct values of the SA. However, l-diversity has the limitation of implicitly assuming that each SA takes values uniformly over its domain; therefore, l-diversity cannot prevent attribute linkage attacks when the overall distribution of an SA is skewed. To solve this problem, Li et al. in [47] proposed the t-closeness model, which requires the distribution of an SA in any QID group to be close to the distribution of the attribute in the overall dataset. The closeness between two distributions of the SA value is measured by a specific function such that its value is within t. However, enforcing t-closeness would greatly degrade the data utility, because it requires the distribution of SA values to be nearly the same in all QID groups.

3.2.2. ϵ-Differential privacy

Differential privacy (DP) is another privacy model based on the perturbation approach. This model was proposed by Dwork in [25]. The aim of DP is to mask the differences in the computation results of a function f on neighboring datasets, which differ in at most one data item. DP captures the intuition that releasing aggregate results should not reveal too much information about any individual data item that contributes to these results.

Differential privacy defines privacy with very loose assumptions about the background knowledge of an adversary. Therefore, compared to the previous privacy models, differential privacy is able to resist most privacy attacks, including linkage attacks. Besides, differential privacy can provide a provable privacy guarantee and quantitatively analyze the risk of privacy disclosure based on probabilistic statistical models [27,28]. However, if the data is correlated, as with time-series data, applying differential privacy is challenged to guarantee utility [26,122].
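For numeric queries, ϵ-differential privacy is commonly achieved with the Laplace mechanism, which adds noise scaled to sensitivity/ϵ to the true answer. The sketch below applies it to a counting query, whose sensitivity is 1 because neighboring datasets change the count by at most one; it is an illustration rather than a production mechanism.

```python
import math
import random

def laplace_noise(scale):
    # Inverse-transform sampling from Laplace(0, scale)
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_count(dataset, predicate, epsilon):
    # A count has sensitivity 1, so Laplace(0, 1/epsilon) noise
    # suffices for epsilon-differential privacy.
    true_count = sum(1 for x in dataset if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [34, 37, 36, 52, 55, 58]
print(dp_count(ages, lambda a: a >= 50, epsilon=0.5))
```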
3.3. Metrics

To assess the efficacy of a privacy-preserving method, privacy metrics and utility metrics are employed to measure the level of privacy provided as well as the utility guaranteed.

3.3.1. Privacy metrics

In [86], Wagner et al. thoroughly scrutinized a selection of over 80 existing privacy metrics and proposed categorizations based on four common characteristics: adversary models, data sources, inputs for the computation of metrics, and output measures. Adversary models describe the adversary's goal and the capabilities he/she is assumed to have. Data sources describe which data need to be protected and how the adversary is assumed to gain access to the data. Privacy metrics also rely on different kinds of input data (adversary's estimate, adversary's resources, prior knowledge, ground truth, parameters) to compute different types of privacy output values (uncertainty, information gain or loss, data similarity, indistinguishability, adversary's success probability, error, time, accuracy or precision). The availability of input data and appropriate assumptions determine whether a metric can be used in a specific scenario. However, in most scenarios a single metric cannot capture the entire concept of privacy.

3.3.2. Utility metrics

An important aspect of privacy is the utility trade-off, which means that a privacy-preserving protection method should produce outputs that remain useful in terms of application services. Utility metrics are used to quantify the usefulness of protected data for data analytics purposes, and they can be classified into general-purpose and specific-purpose metrics [31]. For general purposes, information loss metrics are often used to quantify the similarity between the original data and the transformed data. They are generally measured by the extent to which the transformed data preserve the aggregate statistical information of the original data. For a specific purpose, such as a machine learning, data mining, or statistical analysis task, the sanitized data is used as the input to the analytics task, and the quality of the result is then evaluated (mostly by accuracy or error rate) and compared to the case of the original data.
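As a simple general-purpose instance, the sketch below scores information loss as the average relative error of a few aggregate statistics between the original and the sanitized data; this particular score is an illustrative choice of ours, not a metric prescribed by [31].

```python
def relative_error(orig, sanitized):
    return abs(orig - sanitized) / abs(orig)

def information_loss(original, sanitized):
    # Average relative error of mean and variance:
    # 0 means the sanitized data preserve these statistics exactly.
    mean = lambda xs: sum(xs) / len(xs)
    var = lambda xs: sum((x - mean(xs)) ** 2 for x in xs) / len(xs)
    pairs = [(mean(original), mean(sanitized)),
             (var(original), var(sanitized))]
    return sum(relative_error(o, s) for o, s in pairs) / len(pairs)

original = [48_000, 52_000, 61_000, 75_000, 90_000]
sanitized = [50_000, 50_000, 60_000, 75_000, 90_000]   # e.g., generalized
print(information_loss(original, sanitized))
```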
4. Privacy-preserving big data analytics taxonomy

In this section, we propose a novel taxonomy of privacy-preserving big data analytics built on a scenario-based approach. First of all, we present typical scenarios of big data analytics with a variety of privacy requirements and assumptions on data and models. These scenarios are the motivating factors of studies in the field.

4.1. Motivating privacy-related scenarios

Consider privacy-related scenarios with three main actors: "data owners" who own the original data, "data holders" who collect the data from the data owners, and "data consumers" who carry out data analytics. Data holders may be trusted, semi-trusted, or untrusted. Trusted data holders will implement the necessary privacy protection methods on the data; if the data holders are semi-trusted or untrusted, the data owners will do this job. A data consumer plays the role of either an honest data analyst or an adversary.

Among the many big data sources, Online Social Networks (OSNs) contribute a considerable amount of data. Although massive social data offer valuable merits, they raise strong privacy concerns [65]. Some generic privacy concerns are revealed in the scenarios below, with specific illustrative examples from social networks and other emerging technologies.

Scenario 1: Privacy-preserving big social network data publishing

A data holder collects data from data owners and publishes the data to data consumers without knowing the specific data analysis techniques the data consumers will carry out.
The data holder needs to implement protection methods to guarantee the privacy of the data owners against privacy risks, including identity disclosure risks, which link a real data owner to a specific node/vertex in the released anonymous social graph data; link disclosure risks, which reveal private relationships between the data owners; and content disclosure risks, which occur when a sensitive attribute associated with a vertex or edge in the released social network data is compromised. A compromised attribute can help pinpoint a specific data owner in the anonymous data.

Example: The Facebook operator stores data collected from Facebook users and shares the data with data consumers such as SNA researchers. Although they have access to the published dataset (directly, or indirectly through the outputs of queries), the data consumers do not know the identity of each data item in the analyzed protected dataset; in other words, they are prevented from linking each data item to a specific real individual (the case of identity privacy), from investigating private follower–followee relationships, which can reveal personal information such as political or sexual tendencies (link privacy), and from knowing user data content (such as text-based comments or status updates).

Scenario 2: Secure private outsourced data search

Data owners outsource their private data to a data holder. However, the data holder is untrusted (such as a public cloud). Therefore, the data owners encrypt their data before sending them to the data holder, and they send search queries to the data holder. The problem is how to keep their data and their search keywords unknown to the data holder while still guaranteeing the service for them.

Example: Individuals outsource their encrypted data to a cloud service provider (e.g., Dropbox, Google Drive). The question is how the cloud service provider can process search queries on their encrypted data while their data and search keywords are kept private.

Scenario 3: Privacy-preserving learning over outsourced data

Assume that a data holder also plays a data consumer's role, which means the data holder provides both storage and computing services. When a data owner wants to use these services, she/he has to outsource her/his data to the data holder in order to let the data holder perform the computational learning tasks on the data. However, the data owner requires assurance that she/he receives good services and that her/his own private outsourced data are not exposed to the data holder.

Example: Having very large and complex data about its business and customers, a commercial company chooses to adopt a cloud machine learning service (e.g., Microsoft Azure Machine Learning Service) to build a learning model for the company's intelligent big data analytics application. However, at the same time, it is required that the company's proprietary data be kept private from the public service provider.

Scenario 4: Privacy-preserving collaborative learning with secure aggregation

Two or more data owners need to implement the same learning task on their own data. In order to benefit from the rich data of multiple sources, thus avoiding over-fitting and getting better results, they collaboratively build a shared master model, which is expected to be better than their own local models. The problem is how to protect their own data in a joint learning process.

Example: Mobile users use Google's text message applications with an intelligent function for predicting the next word when they compose a text message. Each user securely maintains her private database of text messages on her own mobile device, because text messages often contain personal sensitive information. This is the case in which each mobile user contributes to a deep neural network model under the coordination of a central Google server in a privacy-preserving collaborative learning setting.

Scenario 5: Privacy-preserving model evaluating

A data holder is assumed to play a data consumer's role. A data owner has to send her/his data to the data holder in order to use the learning services the data holder provides. The requirement is that the data holder should learn nothing about the raw input data or the output prediction.

Example: An individual wants to use a cloud learning service (such as Google Cloud Machine Learning or Azure Machine Learning) under the condition of keeping her/his data and the resulting prediction private.

4.2. Proposed taxonomy

Fig. 3 gives an overview of our proposed privacy-preserving big data analytics taxonomy. This taxonomy includes the primary aspects of the field.

Big data analytics, in terms of analytics operations dealing with the characteristics of high dimensionality, massive volume, fast velocity, and large variety, has challenged existing privacy-preserving techniques. In [24], the authors analyzed the impact of big data on current anonymization practices, mainly on k-anonymity models. They found that, in big data, any attribute can be a QID; thus, a straightforward approach would apply k-anonymity with many QID attributes, and this would lead to the "curse of dimensionality", which means a lot of information will be lost when sanitizing such QIDs. From that, they proposed alternatives to adapt the current k-anonymity algorithms to the big data context, so that they can remain effective with structured data in a big data environment. However, semi-structured and unstructured data formats were not considered. Other studies have focused on proposing privacy-preserving big data analytics techniques for various semi-structured and unstructured data, such as stream sensor readings [72,109], transactional data [59], social network graph data [82,98,108], and text data [94]. To deal with the scalability of privacy-preserving techniques in big data environments, MapReduce has been employed in a large number of studies [20,58,63,67,107,110,118,119].

Privacy-preserving big data analytics can be considered an appropriate combination of big data analytics and privacy-preserving mechanisms. The key challenge is how to adapt basic privacy-preserving mechanisms to deal with not only the computational complexity (both storage and time complexity) of big data analytics operations but also the far greater privacy risks (especially from many forms of linkage attacks) arising in big data environments, while guaranteeing the utility of intelligent big data analytics applications. In the previous sections, we have discussed generic big data analytics and privacy-preserving mechanisms. In the following sections, we will investigate studies trying to solve this challenging problem with our scenario-based approach for each fundamental data analytics operation (data preparation, data exploration, data mining), mostly on big social media network data.

4.2.1. Privacy-preserving data preparation

As mentioned before, OSNs contribute a considerable amount of data and serve as one of the important big data sources. Preserving the privacy of big social network data publishing is much more challenging than that of conventional tabular data (structured data) in a relational database. This is because of the diversity and complexity of the huge graph data (semi-structured) as well as the massive multimedia data (mostly unstructured) in social networks. From the view of an adversary, she/he can take advantage of the massive and diverse data that are publicly available in the social network environment to build better background information about the target and carry out privacy-related attacks, especially linkage attacks.
Scenario 1: Privacy-preserving big social network data publishing

In the literature, there are a number of studies on designing different privacy models and algorithms for generating sanitized social network datasets which satisfy the proposed privacy models against the three main privacy disclosure risks mentioned above (identity, link, and content disclosure). Most of these models are k-anonymity based, and they primarily differ in the assumptions on the adversary's prior knowledge for re-identification attacks [1]. The background knowledge coming from heterogeneous big social network data can be generally categorized into structural attributes (e.g., degree, neighborhood) and non-structural attributes, which describe a social network user and are assigned to a vertex or an edge (e.g., name, address, age, status, comments). Identity (vertex) privacy is the protection of anonymous user identities in a published social graph from being linked to the corresponding users in the real social network (identity disclosure risk). Under the assumption that the adversary can take advantage of background knowledge about graph-structural attributes, such as degrees, subgraphs, neighbors, and communities, to realize re-identification attacks, some variants of k-anonymity models were proposed to obtain different types of anonymization: degree anonymization with k-degree anonymity [17,18,53,55]; subgraph anonymization with k-automorphism [99,120,123], k-isomorphism [22], and k-symmetry [95]; neighborhood subgraph anonymization with k-neighborhood anonymity [44,89,112]; and community subgraph anonymization with k-structural diversity anonymity (k-SDA) [81] for a static network scenario and kw-structural diversity anonymity (kw-SDA) [80] for a dynamic network scenario with sequential publications. To provide k-neighborhood anonymity at a lower anonymization cost, due to a less strong privacy condition, the reference [89] defined a new privacy concept called "probability indistinguishability" in lieu of isomorphism and designed a heuristic indistinguishable group anonymization (HIGA) scheme to generate an anonymized graph by a randomization process of editing the graph structure (adding/removing edges randomly) with a certain probability p. Focusing on community subgraph anonymization, the study in [81] proposed the k-SDA model for privacy-preserving static social graph data publishing. This model requires that for each vertex v there exist at least k communities such that each community contains at least one vertex with the same degree as v's degree. In the subsequent study [80], the authors introduced a community anonymization method with the kw-SDA model, developed from the k-SDA model, for privacy-preserving sequential social graph data publishing in a dynamic social network, in which w is the time period of monitoring for addressing degree-trail attacks [57].

Besides the line of research aiming at vertex privacy, link privacy protection guarantees that sensitive edges/links between vertices in a social graph are not exposed to the adversary. There are two approaches to the problem of preserving link privacy in social networks: the anonymization-based approach [40] and the randomization-based approach [30,97,102]. Jing et al. in [40] presented the k-sensitive edge anonymity model, which requires that the number of sensitive edges be at least k for each node that has at least one sensitive edge in the graph. Some studies proposed grouping vertices and edges into subgraphs first and then anonymizing the subgraphs. In [102], Ying et al. adopted the edge randomization approach via addition/deletion of edges. They focused on to what extent a given sensitive edge can be breached by exploiting proximity measure values of node pairs. Fard et al. in their paper [30] proposed the neighborhood randomization scheme, which is a structure-aware randomization technique. In their scheme, a social network is considered a directed graph, and a randomized graph is obtained by probabilistically randomizing the destination of a link within a local neighborhood, thereby hiding a sensitive link.
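To make the randomization idea concrete, the following sketch perturbs an undirected graph by deleting each true edge and injecting each non-edge with probability p, in the spirit of the add/delete edge randomization of [102]; practical schemes additionally balance the numbers of added and deleted edges to preserve graph statistics.

```python
import random
from itertools import combinations

def randomize_graph(nodes, edges, p):
    # Each true edge survives with probability 1 - p; each non-edge
    # is injected with probability p, so an observer cannot tell
    # whether any particular published edge is genuine.
    true_edges = {frozenset(e) for e in edges}
    published = set()
    for pair in map(frozenset, combinations(nodes, 2)):
        if pair in true_edges:
            if random.random() > p:
                published.add(pair)
        elif random.random() < p:
            published.add(pair)
    return published

nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(randomize_graph(nodes, edges, p=0.2))
```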
In terms of privacy-preserving methods resilient to the maximum background knowledge of an adversary, protection methods satisfying differential privacy have been developed to release statistics about social graph data [11,69].

Different from the two above privacy scenarios, in which privacy attacks are based on the graph structure, content privacy scenarios with user-linkage attacks [5,8] based on the disclosure of social media user data (such as text data) have recently been investigated in [16,111]. Almishari et al. in [5] tried to map an anonymous user to the real user given a subset of reviews made by the anonymous user. To prevent such user-linkage attacks, Zhang et al. in [111] proposed perturbing users' textual data, developing a protection framework that satisfies the differential privacy condition with an "ϵ-Text Indistinguishability" privacy concept for textual user data. The authors also considered the "curse of dimensionality" problem arising when original differential privacy is deployed on high-dimensional textual data, and they proposed the idea of limiting the sensitivity of the whole text vector in the user–keyword matrix to the norm of the vector, instead of the individual elements, to address the root cause. Similarly, Beigi et al. in [9] perturbed the textual information, and Cai et al. in [16] proposed PEATSE. By adding Laplacian noise, these studies guarantee differential privacy and prevent content disclosure as well as user-linkage attacks.

4.2.2. Privacy-preserving data exploration

Scenario 2: Secure private outsourced data search

Processing high-dimensional queries is crucial for performing privacy-preserving big data analytics operations, such as similarity-oriented services, which serve as a foundation for a wide range of data analytics applications. However, existing solutions for privacy-preserving querying or searchable encryption in the context of big data still face significant challenges in terms of computational complexity [34,70,114]. In recent years, there have been some efforts to address such obstacles [19,39,51,87,88,90].

In [19], Cash et al. theoretically designed a dynamic symmetric searchable encryption scheme supporting fully parallel single-keyword search with asymptotically optimal server index size. This scheme is implemented for secure search on very large encrypted databases with tens of billions of record–keyword pairs and achieves quantified minimal leakage. Two terabyte-scale datasets were used to evaluate the performance of their method. Wang et al. [88] proposed a searchable encryption scheme for nearest neighbor search, one of the most fundamental queries on massive datasets, over encrypted data on semi-trusted clouds. They modified the tree-based search algorithm for nearest neighbors to adapt it to lightweight cryptographic primitives without increasing the original search complexity. Experimental results on Amazon EC2 show that their scheme can be practical over massive datasets. For similarity search, the study [51] presented EncSIM, an encrypted and scalable similarity search service for distributed high-dimensional datasets, based on a variant of the state-of-the-art similarity search algorithm called all-pairs locality-sensitive hashing and a novel encrypted index construction. In the reference [87], Wang et al. proposed a scheme for privacy-preserving similarity search over encrypted feature-rich multimedia data that supports efficient file and index updates. They considered the search criterion to be a high-dimensional feature vector instead of a keyword and built solutions on fuzzy Bloom filters, which utilize locality-sensitive hashing to encode an index associating the file identifiers and feature vectors.
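Locality-sensitive hashing, the recurring primitive in [39,51,87], can be illustrated with random-hyperplane LSH for cosine similarity: similar vectors receive bit signatures that differ in few positions, so candidate neighbors can be found by comparing short signatures instead of full vectors. The sketch below shows only this plaintext primitive; the encrypted index constructions built on top of it are beyond a few lines.

```python
import random

def random_hyperplanes(dim, n_bits, seed=1):
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_signature(vec, planes):
    # One bit per hyperplane: which side of the plane the vector lies on
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return tuple(int(dot(vec, p) >= 0) for p in planes)

planes = random_hyperplanes(dim=4, n_bits=16)
a = [0.9, 0.1, 0.0, 0.2]
b = [0.8, 0.2, 0.1, 0.2]    # close to a in direction
c = [-0.5, 0.9, -0.3, 0.1]  # far from a

sig = lambda v: lsh_signature(v, planes)
hamming = lambda s, t: sum(x != y for x, y in zip(s, t))
# The similar pair typically agrees on far more bits than the dissimilar one
print(hamming(sig(a), sig(b)), hamming(sig(a), sig(c)))
```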
The references [39,90] are among the studies attempting to solve the problem of searchable encryption in specific applications such as Facebook and smart grids. In 2017, Wang et al. [90] introduced the ID-based multi-user searchable encryption (IDB-MUSE) scheme to address secure encrypted data searching and sharing among a group of social network users while preserving search keyword privacy and user identity privacy. In the paper [39], Jiang et al. proposed a high-performance and privacy-preserving query scheme over encrypted multidimensional big metering data to address the challenge of outsourcing metering data to heterogeneous distributed systems. Their approach is based on locality-sensitive hashing and enhanced ciphertext-policy attribute-based encryption to secure similarity search and control access to the search results.

4.2.3. Privacy-preserving data mining

Privacy-preserving data mining (PPDM) mainly adopts cryptographic and perturbation methods to address three objectives: (1) privacy of the input data, which is used for learning a model or for using/evaluating an existing model, (2) privacy of the model, and (3) privacy of the model's output. The privacy-preserving solutions are often tightly coupled with the data mining algorithms and focus on preserving features at the dataset level, which can be obtained from the original data or from synthetic data.

The existing literature on PPDM mostly targets conventional machine learning algorithms, such as decision trees, regression, association rules, Naive Bayes, k-means, etc. [6,35,46,60,62,76,91]. In recent years, deep learning has proved its efficiency in situations involving complex models with large numbers of model parameters. This means that if there are more training data, which is the case in big data contexts (e.g., social network data, sensor data), then adopting deep learning will possibly lead to better results. Therefore, with the prevalence of big data, deep learning has also developed significantly to benefit from such data. In the past few years, deep learning has played an important role in big data analytics solutions [116].

a. Privacy-preserving model training with secure aggregation and big data feature learning

Scenario 3: Privacy-preserving feature learning over outsourced data

In 2016, Zhang et al. [115] presented a deep computation model to improve the efficiency of big data feature learning while preserving private data. This paper is one of the solutions to the problem of training a model from encrypted data. The proposed model is based on the Tensor Auto-Encoder (TAE) and the BGV encryption scheme [15], a leveled fully homomorphic encryption scheme, to protect sensitive information. Employing a similar idea of protecting privacy as in [115], Zhang et al. in the reference [117] proposed a privacy-preserving double-projection deep computation model, replacing each hidden layer of the conventional deep computation model with a double-projection layer that projects the input into two separate nonlinear subspaces.

Besides the conventional mechanism of implementing deep learning schemes with a central server that stores data and learns the model, there has been an emerging approach of implementing deep learning in a collaborative manner in a distributed environment. Collaborative deep learning is a deep learning framework that enables input sharing and jointly building a deep learning model. In order to utilize big data from multiple sources while keeping each provider's data private, it is an emerging trend to adopt collaborative deep learning with privacy-preserving schemes for the joint learning process [105]. There are two approaches to collaborative learning: sharing data and sharing models. In the former, each user needs to upload sensitive local data to the server; in this process, privacy-preserving protection schemes ensure that the user's data cannot be recognized by the server while retaining utility for the learning computation. To do this, cryptographic methods with HE are mainly adopted. In the latter, the need for user data encryption can be eliminated, because data can be kept locally for building a local model.
Hence, it is only necessary to share local models from multiple users to build a global model and to implement appropriate privacy protection methods to prevent inference attacks, which can infer the original data from the model's parameters [50,66,77].

Scenario 4: Privacy-preserving collaborative learning with secure aggregation

Shokri et al. introduced in [77] a privacy-preserving deep learning system that enables multiple parties to jointly learn a neural network model without sharing their input datasets. In their design, each party independently trains a local model on its own dataset and then shares selected small subsets of its model parameters during the training process. They used differential privacy, motivated by some previous research [64,68,79], to prevent indirect leakage of individual data points in the training dataset from the model parameters. The reference [50] proposed a collaborative privacy-preserving supervised deep learning system in a mobile environment based on the idea in [77]. However, Phong et al. in [66] demonstrated that, even from a small portion of the gradients stored on an honest-but-curious server, local data can be extracted. From that, they introduced a privacy-preserving deep learning system bridging deep learning and cryptography via additive HE to protect the model parameters (gradients) against an honest-but-curious cloud server.

Federated learning is one of the latest approaches being explored in machine learning. It enables multiple parties to collaboratively learn a shared prediction model while keeping all the training data on their own devices, decoupling the ability to train models from the need to store the data in the cloud. In this context, Bonawitz et al. in [12] designed a protocol for secure aggregation of high-dimensional data while guaranteeing communication-efficiency and failure-robustness requirements. In this protocol, a server computes the sum of large, high-dimensional vectors from mobile devices in a secure manner, which can be used in a federated learning setting to aggregate user-provided model updates for a master deep neural network.
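The core idea behind such secure aggregation can be sketched with pairwise additive masks: every pair of clients shares a random mask vector that one adds and the other subtracts, so each submitted update looks random to the server while all masks cancel in the sum. The toy sketch below omits the key agreement and dropout recovery that [12] builds around this idea.

```python
import random

MOD = 2 ** 32   # arithmetic over a fixed modulus, as in additive masking

def masked_updates(updates):
    # updates: dict client_id -> list[int] (a model update vector).
    # For each pair (i, j) with i < j, a shared random mask is added
    # by i and subtracted by j; the server never sees raw updates.
    ids = sorted(updates)
    dim = len(next(iter(updates.values())))
    masked = {cid: list(vec) for cid, vec in updates.items()}
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            mask = [random.randrange(MOD) for _ in range(dim)]
            for k in range(dim):
                masked[ids[a]][k] = (masked[ids[a]][k] + mask[k]) % MOD
                masked[ids[b]][k] = (masked[ids[b]][k] - mask[k]) % MOD
    return masked

updates = {"u1": [3, 1], "u2": [2, 5], "u3": [7, 0]}
masked = masked_updates(updates)
total = [sum(vec[k] for vec in masked.values()) % MOD for k in range(2)]
print(total)   # [12, 6]: the true sum, though each share looks random
```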
b. Privacy-preserving model using/evaluating

Scenario 5: Privacy-preserving model using/evaluating

Graepel et al. in [33] proposed a confidential protocol for machine learning tasks, called ML Confidential, based on HE. They defined a new class of confidential machine learning algorithms for binary classification based on low-degree polynomial versions of classification algorithms, in which the algorithm's predictions, viewed as functions of the input data, can be expressed as polynomials of bounded degree. In [32], Gilad-Bachrach et al. presented CryptoNets, a convolutional neural network (CNN) that can be applied to encrypted data to make encrypted predictions. To make a network compatible with HE, some modifications are needed, including using polynomial activation functions and scaled mean pooling instead of non-polynomial functions (sigmoid, rectified linear) and max pooling, respectively. With such modifications, and by employing the HE scheme presented in [13], which is a leveled FHE scheme, the utility server, although it does not have access to the keys needed to decrypt clients' encrypted data, is still capable of applying CryptoNets to make encrypted predictions, which can be sent back to the owner, who holds the secret keys to decrypt them. Therefore, the utility server gains no information about the raw data nor about the predictions it made.

Bost et al. in [14] introduced privacy-preserving classification protocols based on secure two-party computation in the semi-honest model for three classifiers (hyperplane decision, Naive Bayes, and decision trees) by employing three additively homomorphic cryptographic systems. In the considered cases, the learning models are computed by the server and kept secret from the client after running the training phase on plaintexts as usual. Only the classification phase needs to be privacy-preserving, in the sense that the client should learn the classification result for his/her input but nothing else about the server's model, while the server should not learn anything about the client's input or the classification result. Their primary technique is to identify a set of core operations (comparison, argmax, and dot product) over encrypted data that compose many classification protocols.

5. Summary and open research directions

In this paper, we have surveyed the latest developments in the field of privacy-preserving big data analytics. We have provided a comprehensive and systematic coverage of generic big data analytics and privacy-preserving schemes before presenting a novel taxonomy of privacy-preserving big data analytics coupled with motivating privacy scenarios in the big data context. We specifically focus on three main privacy-preserving data analytics problems: data publishing, data querying, and data mining. The proposed taxonomy is expected to give a systematic and multi-dimensional picture of emerging topics in the field. Although in recent years both industry and academia have put a lot of effort into privacy-preserving big data analytics, this field is still irrefutably challenging, and it leaves room for improving the existing schemes as well as developing novel approaches with regard to increasing performance and privacy levels. Some potential open research problems are listed below:

• Developing application-oriented privacy-preserving schemes for specific emerging technologies with complex big data (e.g., stream data, trajectory data) under different threat models.
• With the rise of cloud machine learning services in the big data era, research on secure machine learning as a service would be a promising but challenging problem.
• Research on privacy-preserving collaborative/federated deep learning for applications in distributed environments with numerous resource-limited devices, such as mobile social networks, wireless sensor networks, and smart metering, should be considered.
• Developing new techniques for deep learning and refined analysis of privacy costs within the framework of differential privacy.
• Designing new deep learning models over encrypted data, based on approximately transforming non-linear operations with regard to the employed cryptographic tools, without compromising performance.
• Developing modern lightweight cryptographic tools (e.g., lattice-based cryptography) to solve the current problems of high overhead and impractical performance.
• Constructing privacy-preserving functional building blocks which can be flexibly integrated into a general privacy-preserving big data analytics framework.
• Research on anonymization and perturbation techniques which are robust to different types of privacy attacks under various assumptions about the adversary's knowledge.
• Designing privacy models, such as adopting game theory with adversarial models, to solve the trade-off between data utility and privacy.

All of the above-mentioned open research problems should be considered and solved from both theoretical and practical perspectives in order to obtain strong, semantically secure guarantees as well as efficacy in the big data era. The prevalence of distributed data sources and outsourced computing, together with the development of machine learning as a service and collaborative learning, has motivated cutting-edge solutions for privacy-related issues in the future.
Declaration of competing interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have an impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.jpdc.2019.08.007.

Acknowledgments

This work is financially supported by ARC grants with project IDs LP180100663 and DP190103662.

References

[1] J.H. Abawajy, M.I.H. Ninggal, T. Herawan, Privacy preserving social network data publication, IEEE Commun. Surv. Tutor. 18 (3) (2016) 1974–1997.
[2] A. Acar, H. Aksu, A. Uluagac, M. Conti, A survey on homomorphic encryption schemes: theory and implementation, ACM Comput. Surv. 51 (4) (2018) 79.
[3] C.C. Aggarwal, S.Y. Philip, A general survey of privacy-preserving data mining models and algorithms, in: Privacy-Preserving Data Mining, Springer, 2008, pp. 11–52.
[4] C. Aggarwal, P. Yu, Privacy-preserving Data Mining: Models and Algorithms, Springer US, 2008, pp. 431–460.
[5] M. Almishari, G. Tsudik, Exploring linkability of user reviews, in: European Symposium on Research in Computer Security, Springer, 2012, pp. 307–324.
[6] Y. Aono, T. Hayashi, L. Trieu Phong, L. Wang, Scalable and secure logistic regression via homomorphic encryption, in: Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, ACM, 2016, pp. 142–144.
[7] S. Babu, H. Herodotou, et al., Massively parallel databases and MapReduce systems, Found. Trends Databases 5 (1) (2013) 1–104.
[8] G. Beigi, H. Liu, Identifying novel privacy issues of online users on social media platforms, SIGWEB Newsl. (Winter) (2019) 4:1–4:7.
[9] G. Beigi, K. Shu, Y. Zhang, H. Liu, Securing social media user data: an adversarial approach, in: Proceedings of the 29th on Hypertext and Social Media, ACM, New York, NY, USA, 2018, pp. 165–173.
[10] E. Bertino, E. Ferrari, Big data security and privacy, in: A Comprehensive Guide Through the Italian Database Research over the Last 25 Years, Springer, 2018, pp. 425–439.
[11] J. Blocki, A. Blum, A. Datta, O. Sheffet, Differentially private data analysis of social networks via restricted sensitivity, in: Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, ACM, 2013, pp. 87–96.
[12] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H.B. McMahan, S. Patel, D. Ramage, A. Segal, K. Seth, Practical secure aggregation for privacy-preserving machine learning, in: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ACM, 2017, pp. 1175–1191.
[13] J.W. Bos, K. Lauter, J. Loftus, M. Naehrig, Improved security for a ring-based fully homomorphic encryption scheme, in: IMA International Conference on Cryptography and Coding, Springer, 2013, pp. 45–64.
[14] R. Bost, R.A. Popa, S. Tu, S. Goldwasser, Machine learning classification over encrypted data, in: NDSS, 2015.
[15] Z. Brakerski, C. Gentry, V. Vaikuntanathan, (Leveled) fully homomorphic encryption without bootstrapping, ACM Trans. Comput. Theory 6 (3) (2014) 13.
[16] H. Cai, F. Ye, Y. Yang, Y. Zhu, J. Li, Towards privacy-preserving data trading for web browsing history, in: Proceedings of the International Symposium on Quality of Service, ACM, New York, NY, USA, 2019, pp. 25:1–25:10.
[17] J. Casas-Roma, J. Herrera-Joancomarti, V. Torra, An algorithm for k-degree anonymity on large networks, in: 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2013, pp. 677–681.
[18] J. Casas-Roma, J. Herrera-Joancomarti, V. Torra, k-Degree anonymity and edge selection: improving data utility in large networks, Knowl. Inf. Syst. 50 (2) (2017) 447–474.
[19] D. Cash, J. Jaeger, S. Jarecki, C.S. Jutla, H. Krawczyk, M.-C. Rosu, M. Steiner, Dynamic searchable encryption in very-large databases: data structures and implementation, in: NDSS, Vol. 14, 2014, pp. 23–26.
[20] A. Chakravorty, C. Rong, K.R. Jayaram, S. Tao, Scalable, efficient anonymization with INCOGNITO - framework & algorithm, in: 2017 IEEE 6th International Congress on Big Data, 2017, pp. 39–48, http://dx.doi.org/10.1109/BigDataCongress.2017.15.
[21] H. Chen, R.H.L. Chiang, V.C. Storey, Business intelligence and analytics:
[22] J. Cheng, A.W.-c. Fu, J. Liu, k-Isomorphism: privacy preserving network publication against structural attacks, in: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, 2010, pp. 459–470.
[23] S. Desai, R. Alhadad, N. Chilamkurti, A. Mahmood, A survey of privacy preserving schemes in IoT enabled smart grid advanced metering infrastructure, Cluster Comput. (2018) 1–27.
[24] J. Domingo-Ferrer, J. Soria-Comas, Anonymization in the time of big data, in: Privacy in Statistical Databases: UNESCO Chair in Data Privacy, in: Lecture Notes in Computer Science, vol. 9867, 2016, pp. 57–68.
[25] C. Dwork, Differential privacy, in: Automata, Languages and Programming, in: Lecture Notes in Computer Science, vol. 4052, Springer-Verlag Berlin, 2006, pp. 1–12.
[26] C. Dwork, Differential privacy: a survey of results, in: International Conference on Theory and Applications of Models of Computation, Springer, 2008, pp. 1–19.
[27] C. Dwork, A firm foundation for private data analysis, Commun. ACM 54 (1) (2011) 86–95.
[28] C. Dwork, A. Roth, The algorithmic foundations of differential privacy, Found. Trends Theor. Comput. Sci. 9 (3–4) (2013) 211–487.
[29] W. Fang, X.Z. Wen, Y. Zheng, M. Zhou, A survey of big data security and privacy preserving, IETE Tech. Rev. 34 (5) (2017) 544–560.
[30] A.M. Fard, K. Wang, Neighborhood randomization for link privacy in social network analysis, World Wide Web 18 (1) (2015) 9–32.
[31] B. Fung, K. Wang, R. Chen, P. Yu, Privacy-preserving data publishing: a survey of recent developments, ACM Comput. Surv. 42 (4) (2010).
[32] R. Gilad-Bachrach, N. Dowlin, K. Laine, K. Lauter, M. Naehrig, J. Wernsing, CryptoNets: applying neural networks to encrypted data with high throughput and accuracy, in: International Conference on Machine Learning, 2016, pp. 201–210.
[33] T. Graepel, K. Lauter, M. Naehrig, ML Confidential: machine learning on encrypted data, in: Lecture Notes in Computer Science, vol. 7839, 2013, pp. 1–21.
[34] F. Han, J. Qin, J. Hu, Secure searches in the cloud: a survey, Future Gener. Comput. Syst. 62 (2016) 66–75.
[35] A. Hekmatyar, N. Nematbakhsh, M. Dehkordi, A survey on association rule hiding in privacy preserving data mining, Majlesi J. Electr. Eng. 10 (4) (2016) 39–48.
[36] J. Hu, A.V. Vasilakos, Energy big data analytics and security: challenges and opportunities, IEEE Trans. Smart Grid 7 (5) (2016) 2423–2436.
[37] IBM, P. Zikopoulos, C. Eaton, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, first ed., McGraw-Hill Osborne Media, 2011.
[38] S. Ji, P. Mittal, R. Beyah, Graph data anonymization, de-anonymization attacks, and de-anonymizability quantification: a survey, IEEE Commun. Surv. Tutor. 19 (2) (2017) 1305–1326.
[39] R. Jiang, R. Lu, K.-K.R. Choo, Achieving high performance and privacy-preserving query over encrypted multidimensional big metering data, Future Gener. Comput. Syst. 78 (2018) 392–401.
[40] L. Jing, H. Jianmin, L. Fangwei, J. Xiaoqiang, k-Sensitive edge anonymity model for sensitive relationship preservation on publishing social network, in: 3rd International Conference on Information Technology and Computer Science (ITCS 2011), 2011, pp. 146–149.
[41] T. Karle, D. Vora, Privacy preservation in big data using anonymization techniques, in: 2017 1st IEEE International Conference on Data Management, Analytics and Innovation (ICDMAI), 2017, pp. 340–343.
[42] S. Khalifa, Y. Elshater, K. Sundaravarathan, A. Bhat, P. Martin, F. Imam, D. Rope, M. Mcroberts, C. Statchuk, The six pillars for building big data analytics ecosystems, ACM Comput. Surv. 49 (2) (2016) 33.
[43] B. Kitchenham, Procedures for performing systematic reviews, Keele University, UK, 33 (2004) 1–26.
[44] L. Lan, S. Liu, H. Jin, Personalized privacy preservation in social networks, J. Comput. Inf. Syst. 8 (3) (2012) 1297–1309.
[45] S. LaValle, E. Lesser, R. Shockley, M.S. Hopkins, N. Kruschwitz, Big data, analytics and the path from insights to value, MIT Sloan Manage. Rev. 52 (2) (2011) 21.
[46] T. Li, J. Li, Z. Liu, P. Li, C. Jia, Differentially private Naive Bayes learning over multiple data sources, Inform. Sci. 444 (2018) 89–104.
[47] N. Li, T. Li, S. Venkatasubramanian, t-Closeness: privacy beyond k-anonymity and l-diversity, in: 2007 IEEE 23rd International Conference on Data Engineering, 2007, pp. 106–115.
[48] X. Li, Z. Yan, P. Zhang, A review on privacy-preserving data mining, Institute of Electrical and Electronics Engineers Inc., 2014, pp. 769–774.
[49] L. Liu, O. De Vel, Q.-L. Han, J. Zhang, Y. Xiang, Detecting and preventing cyber insider threats: a survey, IEEE Commun. Surv. Tutor. 20 (2) (2018)
From big data to big impact, MIS Q. 36 (4) (2012) 1165–1188. 1397–1417.
[50] M. Liu, H. Jiang, J. Chen, A. Badokhon, X. Wei, M.-C. Huang, A collaborative privacy-preserving deep learning system in distributed mobile environment, in: 2016 International Conference on Computational Science and Computational Intelligence (CSCI), IEEE, 2016, pp. 192–197.
[51] X. Liu, X. Yuan, C. Wang, EncSIM: An encrypted similarity search service for distributed high-dimensional datasets, in: 2017 IEEE/ACM 25th International Symposium on Quality of Service (IWQoS), 2017.
[52] D. Lv, S. Zhu, H. Xu, R. Liu, A review of big data security and privacy protection technology, in: 2018 IEEE 18th International Conference on Communication Technology (ICCT), 2018, pp. 1082–1091.
[53] T. Ma, Y. Zhang, J. Cao, J. Shen, M. Tang, Y. Tian, A. Al-Dhelaan, M. Al-Rodhaan, KDVEM: a k-degree anonymity with vertex and edge modification algorithm, Computing 97 (12) (2015) 1165–1184.
[54] A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, l-diversity: privacy beyond k-anonymity, in: 22nd International Conference on Data Engineering (ICDE'06), 2006, p. 24.
[55] K.R. Macwan, S.J. Patel, K-degree anonymity model for social network data publishing, Adv. Electr. Comput. Eng. 17 (4) (2017) 117–124.
[56] S. Matwin, Privacy-preserving data mining techniques: Survey and challenges, Stud. Appl. Phil. Epistemol. Ration. Ethics 3 (2013) 209–221.
[57] N. Medforth, K. Wang, Privacy risk in graph stream publishing for social network data, in: 2011 IEEE 11th International Conference on Data Mining, IEEE, 2011, pp. 437–446.
[58] B.B. Mehta, U.P. Rao, Privacy preserving big data publishing: a scalable k-anonymization approach using MapReduce, IET Softw. 11 (5) (2017) 271–276.
[59] N. Memon, G. Loukides, J. Shao, A parallel method for scalable anonymization of transaction data, in: 2015 14th International Symposium on Parallel and Distributed Computing (ISPDC), 2015, pp. 235–241.
[60] F. Meskine, S.N. Bahloul, Privacy preserving K-means clustering: A survey research, Int. Arab J. Inf. Technol. 9 (2) (2012) 194–200.
[61] M.M. Najafabadi, F. Villanustre, T.M. Khoshgoftaar, N. Seliya, R. Wald, E. Muharemagic, Deep learning applications and challenges in big data analytics, J. Big Data 2 (1) (2015) 1.
[62] G. Navale, S. Mali, A survey on sensitive association rules hiding methods, IEEE, 2018.
[63] J.J.V. Nayahi, V. Kavitha, Privacy and utility preserving data clustering for data anonymization and distribution on Hadoop, Future Gener. Comput. Syst. 74 (2017) 393–408.
[64] M. Pathak, S. Rane, B. Raj, Multiparty differential privacy via aggregation of locally trained classifiers, in: Advances in Neural Information Processing Systems, 2010, pp. 1876–1884.
[65] V. Pham, S. Yu, K. Sood, L. Cui, Privacy issues in social networks and analysis: A comprehensive survey, IET Networks 7 (2) (2018) 74–84.
[66] L.T. Phong, Y. Aono, T. Hayashi, L. Wang, S. Moriai, Privacy-preserving deep learning via additively homomorphic encryption, IEEE Trans. Inf. Forensics Secur. 13 (5) (2018) 1333–1345.
[67] N.K.S. Prasaad, T.R. Pratheek, Providing anonymity using top down specialization on big data using Hadoop framework, in: 2015 Annual IEEE India Conference (INDICON), 2015.
[68] A. Rajkumar, S. Agarwal, A differentially private stochastic gradient descent algorithm for multiparty classification, in: Artificial Intelligence and Statistics, 2012, pp. 933–941.
[69] V. Rastogi, M. Hay, G. Miklau, D. Suciu, Relationship privacy: output perturbation for queries with joins, in: Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM, 2009, pp. 107–116.
[70] H. Ren, H. Li, Y. Dai, K. Yang, X. Lin, Querying in internet of things with privacy preserving: Challenges, solutions and opportunities, IEEE Network (2018).
[71] S. Sagiroglu, D. Sinanc, Big data: A review, in: 2013 International Conference on Collaboration Technologies and Systems (CTS), 2013, pp. 42–47.
[72] A. Sakpere, A. Kayem, A State-of-the-Art Review of Data Stream Anonymization Schemes, IGI Global, 2014, pp. 24–50.
[73] P. Samarati, L. Sweeney, Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression, Tech. Rep., SRI International, 1998.
[74] A. Senosi, G. Sibiya, Classification and evaluation of privacy preserving data mining: A review, IEEE, 2017, pp. 849–855.
[75] J.A. Shamsi, M.A. Khojaye, Understanding privacy violations in big data systems, IT Prof. 20 (3) (2018) 73–81.
[76] S. Sharma, S. Ahuja, Privacy preserving data mining: A review of the state of the art, Adv. Intell. Syst. Comput. 741 (2018) 1–15.
[77] R. Shokri, V. Shmatikov, Privacy-preserving deep learning, in: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, CO, USA, October 12–16, 2015, pp. 1310–1321.
[78] D. Singh, C.K. Reddy, A survey on platforms for big data analytics, J. Big Data 2 (1) (2015) 8.
[79] S. Song, K. Chaudhuri, A.D. Sarwate, Stochastic gradient descent with differentially private updates, in: Global Conference on Signal and Information Processing (GlobalSIP), IEEE, 2013, pp. 245–248.
[80] C.-H. Tai, P.-J. Tseng, P.S. Yu, M.-S. Chen, Identities anonymization in dynamic social networks, in: 2011 IEEE 11th International Conference on Data Mining, IEEE, 2011, pp. 1224–1229.
[81] C.-H. Tai, P.S. Yu, D.-N. Yang, M.-S. Chen, Structural diversity for privacy in publishing social networks, in: Proceedings of the 2011 SIAM International Conference on Data Mining, SIAM, 2011, pp. 35–46.
[82] T. Tassa, D.J. Cohen, Anonymization of centralized and distributed social networks by sequential clustering, IEEE Trans. Knowl. Data Eng. 25 (2) (2013) 311–324.
[83] D.S. Terzi, R. Terzi, S. Sagiroglu, A survey on security and privacy issues in big data, in: 2015 10th International Conference for Internet Technology and Secured Transactions (ICITST), 2015, pp. 202–207.
[84] C.-W. Tsai, C.-F. Lai, H.-C. Chao, A.V. Vasilakos, Big data analytics: a survey, J. Big Data 2 (1) (2015) 21.
[85] S. Vennila, J. Priyadarshini, Scalable privacy preservation in big data: a survey, in: Big Data, Cloud and Computing Challenge, Procedia Computer Science, vol. 50, 2015, pp. 369–373.
[86] I. Wagner, D. Eckhoff, Technical privacy metrics: a systematic survey, ACM Comput. Surv. 51 (3) (2018) 57.
[87] Q. Wang, M. He, M. Du, S.S.M. Chow, R.W.F. Lai, Q. Zou, Searchable encryption over feature-rich data, IEEE Trans. Dependable Secure Comput. 15 (3) (2018) 496–510.
[88] B. Wang, Y. Hou, M. Li, Practical and secure nearest neighbor search on encrypted large-scale data, in: IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, IEEE, 2016, pp. 1–9.
[89] G. Wang, Q. Liu, F. Li, S. Yang, J. Wu, Outsourcing privacy-preserving social networks to a cloud, in: 2013 Proceedings IEEE INFOCOM, IEEE, 2013, pp. 2886–2894.
[90] X. Wang, Y. Mu, R. Chen, Privacy-preserving data search and sharing protocol for social networks through wireless applications, Concurr. Comput.: Pract. Exper. 29 (7) (2017) e3870.
[91] A. Wang, C. Wang, M. Bi, J. Xu, A review of privacy-preserving machine learning classification, in: Lecture Notes in Computer Science, vol. 11066, Springer, 2018, pp. 671–682.
[92] T. Wang, Z. Xu, D. Wang, H. Wang, Influence of data errors on differential privacy, Cluster Comput. (2017) 1–8.
[93] T. Wang, Z. Zheng, M.H. Rehmani, S. Yao, Z. Huo, Privacy preservation in big data from the communication perspective: A survey, IEEE Commun. Surv. Tutor. (2018).
[94] M. Willemsen, Anonymizing unstructured data to prevent privacy leaks during data mining, in: Proceedings of the 25th Twente Student Conference on IT, 2016.
[95] W. Wu, Y. Xiao, W. Wang, Z. He, Z. Wang, K-symmetry model for identity anonymization in social networks, in: Proceedings of the 13th International Conference on Extending Database Technology, ACM, 2010, pp. 111–122.
[96] L. Xu, C. Jiang, J. Wang, J. Yuan, Y. Ren, Information security in big data: Privacy and data mining, IEEE Access 2 (2014) 1151–1178.
[97] L. Yang, Y. Fei, X. Zhang, Protecting link privacy for large correlated social networks, in: 2016 7th International Conference on Cloud Computing and Big Data (CCBD), 2016, pp. 203–208.
[98] Y. Yang, J. Lutes, F. Li, B. Luo, P. Liu, Stalking online: on user privacy in social networks, in: Proceedings of the Second ACM Conference on Data and Application Security and Privacy, ACM, 2012, pp. 37–48.
[99] J. Yang, B. Wang, X. Yang, H. Zhang, G. Xiang, A secure k-automorphism privacy preserving approach with high data utility in social networks, Secur. Commun. Netw. 7 (9) (2014) 1399–1411.
[100] A.C. Yao, Protocols for secure computations, in: 23rd Annual Symposium on Foundations of Computer Science (SFCS 1982), IEEE, 1982, pp. 160–164.
[101] H. Ye, X. Cheng, M. Yuan, L. Xu, J. Gao, C. Cheng, A survey of security and privacy in big data, in: 2016 16th International Symposium on Communications and Information Technologies (ISCIT), 2016, pp. 268–272.
[102] X. Ying, X. Wu, On link privacy in randomizing social networks, Knowl. Inf. Syst. 28 (3) (2011) 645–663.
[103] S. Yu, Big privacy: Challenges and opportunities of privacy study in the age of big data, IEEE Access 4 (2016) 2751–2763.
[104] H. Zakerzadeh, C.C. Aggarwal, K. Barker, Managing dimensionality in data privacy anonymization, Knowl. Inf. Syst. 49 (1) (2016) 341–373.
[105] D. Zhang, X. Chen, D. Wang, J. Shi, A survey on collaborative deep learning and privacy-preserving, in: 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), IEEE, 2018, pp. 652–658.
[106] J. Zhang, X. Chen, Y. Xiang, W. Zhou, J. Wu, Robust network traffic classification, IEEE/ACM Trans. Netw. 23 (4) (2015) 1257–1270.
[107] X. Zhang, W. Dou, J. Pei, S. Nepal, C. Yang, C. Liu, J. Chen, Proximity-aware local-recoding anonymization with MapReduce for scalable big data privacy preservation in cloud, IEEE Trans. Comput. 64 (8) (2015) 2293–2307.
[108] L. Zhang, X.-Y. Li, K. Liu, T. Jung, Y. Liu, Message in a sealed bottle: Privacy preserving friending in mobile social networks, IEEE Trans. Mob. Comput. 14 (9) (2015) 1888–1902.
[109] J. Zhang, H. Li, X. Liu, Y. Luo, F. Chen, H. Wang, L. Chang, On efficient and robust anonymization for privacy protection on massive streaming categorical information, IEEE Trans. Dependable Secure Comput. 14 (5) (2017) 507–520.
[110] X. Zhang, C. Liu, S. Nepal, C. Yang, W. Dou, J. Chen, Combining top-down and bottom-up: Scalable sub-tree anonymization over big data using MapReduce on cloud, in: 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom 2013), 2013, pp. 501–508.
[111] J. Zhang, J. Sun, R. Zhang, Y. Zhang, X. Hu, Privacy-preserving social media data outsourcing, in: IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, IEEE, 2018, pp. 1106–1114.
[112] W. Zhang, X.-R. Wang, J. Wang, Y.-F. Chen, Privacy preservation in dynamic social networks based on k-neighborhood isomorphism, Journal of Nanjing University of Posts and Telecommunications (Natural Science) 34 (5) (2014) 9–16.
[113] J. Zhang, Y. Xiang, Y. Wang, W. Zhou, Y. Xiang, Y. Guan, Network traffic classification using correlation information, IEEE Trans. Parallel Distrib. Syst. 24 (1) (2012) 104–117.
[114] R. Zhang, R. Xue, L. Liu, Searchable encryption for healthcare clouds: A survey, IEEE Trans. Serv. Comput. 11 (6) (2018) 978–996, http://dx.doi.org/10.1109/TSC.2017.2762296.
[115] Q. Zhang, L.T. Yang, Z. Chen, Privacy preserving deep computation model on cloud for big data feature learning, IEEE Trans. Comput. 65 (5) (2016) 1351–1362.
[116] Q. Zhang, L.T. Yang, Z. Chen, P. Li, A survey on deep learning for big data, Inf. Fusion 42 (2018) 146–157.
[117] Q. Zhang, L.T. Yang, Z. Chen, P. Li, M.J. Deen, Privacy-preserving double-projection deep computation model with crowdsourcing on cloud for big data feature learning, IEEE Internet Things J. 5 (4) (2018) 2896–2903.
[118] X. Zhang, L.T. Yang, C. Liu, J. Chen, A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud, IEEE Trans. Parallel Distrib. Syst. 25 (2) (2014) 363–373.
[119] X. Zhang, C. Yang, S. Nepal, C. Liu, W. Dou, J. Chen, A MapReduce based approach of scalable multidimensional anonymization for big data privacy preservation on cloud, in: 2013 IEEE Third International Conference on Cloud and Green Computing (CGC 2013), 2013, pp. 105–112.
[120] L.J. Zhang, Y.H. Zhang, J.P. Zhang, J. Yang, L. Guo, An enhanced privacy-preserving method based on K-automorphism mode, in: International Conference on Advances in Management Engineering and Information Technology (AMEIT 2015), 2015, pp. 224–232.
[121] C. Zhao, S. Zhao, M. Zhao, Z. Chen, C.-Z. Gao, H. Li, Y.-A. Tan, Secure multi-party computation: Theory, practice and applications, Inform. Sci. 476 (2019) 357–372.
[122] T. Zhu, G. Li, W. Zhou, P. Yu, Differentially private data publishing and analysis: A survey, IEEE Trans. Knowl. Data Eng. 29 (8) (2017) 1619–1638.
[123] L. Zou, L. Chen, M.T. Özsu, K-automorphism: A general framework for privacy preserving network publication, Proc. VLDB Endowment 2 (1) (2009) 946–957.

Ms. Hong-Yen Tran is a Ph.D. student at the University of New South Wales Canberra, Australia. Her research interest is in the field of privacy-preserving data analytics and cyber security.

Prof. Jiankun Hu received the Ph.D. degree in control engineering from the Harbin Institute of Technology, China, in 1993, and the master's degree in computer science and software engineering from Monash University, Australia, in 2000. He was a Research Fellow with Delft University, The Netherlands, from 1997 to 1998, and with The University of Melbourne, Australia, from 1998 to 1999. His main research interest is in the field of cyber security, including biometrics security, where he has published many papers in high-quality conferences and journals, including the IEEE Transactions on Pattern Analysis and Machine Intelligence. He has served on the editorial boards of up to seven international journals, including the IEEE Transactions on Information Forensics and Security, and as a Security Symposium Chair of the IEEE flagship conferences IEEE ICC and IEEE GLOBECOM. He has obtained eight Australian Research Council (ARC) grants and has served on the prestigious Mathematics, Information and Computing Sciences panel of the ARC ERA (Excellence in Research for Australia) Evaluation Committee.