
PPFL: Privacy-preserving Federated Learning with Trusted Execution Environments

Fan Mo∗ (Imperial College London), Hamed Haddadi (Imperial College London), Kleomenis Katevas (Telefónica Research), Eduard Marin (Telefónica Research), Diego Perino (Telefónica Research), Nicolas Kourtellis (Telefónica Research)

∗ Work performed while at Telefónica Research.
ABSTRACT
We propose and implement a Privacy-preserving Federated Learning (PPFL) framework for mobile systems to limit privacy leakages in federated learning. Leveraging the widespread presence of Trusted Execution Environments (TEEs) in high-end and mobile devices, we utilize TEEs on clients for local training, and on servers for secure aggregation, so that model/gradient updates are hidden from adversaries. Challenged by the limited memory size of current TEEs, we leverage greedy layer-wise training to train each model's layer inside the trusted area until its convergence. The performance evaluation of our implementation shows that PPFL can significantly improve privacy while incurring small system overheads at the client-side. In particular, PPFL can successfully defend the trained model against data reconstruction, property inference, and membership inference attacks. Furthermore, it can achieve comparable model utility with fewer communication rounds (0.54×) and a similar amount of network traffic (1.002×) compared to the standard federated learning of a complete model. This is achieved while only introducing up to ∼15% CPU time, ∼18% memory usage, and ∼21% energy consumption overhead on PPFL's client-side.

CCS CONCEPTS
• Security and privacy → Privacy protections; Distributed systems security; • Computing methodologies → Distributed algorithms.

ACM Reference Format:
Fan Mo, Hamed Haddadi, Kleomenis Katevas, Eduard Marin, Diego Perino, and Nicolas Kourtellis. 2021. PPFL: Privacy-preserving Federated Learning with Trusted Execution Environments. In The 19th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys '21), June 24-July 2, 2021, Virtual, WI, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3458864.3466628

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MobiSys '21, June 24-July 2, 2021, Virtual, WI, USA
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8443-8/21/07. $15.00
https://doi.org/10.1145/3458864.3466628

1 INTRODUCTION
Training deep neural networks (DNNs) on multiple devices locally and building an aggregated global model on a server, namely federated learning (FL), has drawn significant attention from academia (e.g., [17, 27, 42]) and industry, and is even being deployed in real systems (e.g., Google Keyboard [7]). Unlike traditional machine learning (ML), where a server collects all user data at a central point and trains a global model, in FL, users only send the locally updated model parameters to the server. This allows training a model without the need for users to reveal their data, thus preserving their privacy. Unfortunately, recent works have shown that adversaries can execute attacks to retrieve sensitive information from the model parameters themselves [16, 20, 45, 78]. Prominent examples of such attacks are data reconstruction [16, 20] and various types of inference attacks [20, 45]. The fundamental reason why these attacks are possible is that, as a DNN learns its main task, it also learns irrelevant information from users' training data that is inadvertently embedded in the model [73]. Note that in FL scenarios, such attacks can be launched at both the server and client sides.

Motivated by these attacks, researchers have recently introduced several countermeasures to prevent them. Existing solutions can be grouped into three main categories depending on whether they rely on: (i) homomorphic encryption (e.g., [2, 42]), (ii) multi-party computation (e.g., [8]), or (iii) differential privacy (e.g., [14, 17, 44]). While homomorphic encryption is practical in both high-end and mobile devices, it only supports a limited number of arithmetic operations in the encrypted domain. Alternatively, fully homomorphic encryption allows arbitrary operations in the encrypted domain and thus supports ML. Yet, this comes with too much computational overhead, making it impractical for mobile devices [51, 63]. Similarly, multi-party computation-based solutions incur significant computational overhead. Also, in some cases, differential privacy can fail to provide sufficient privacy, as shown in [45]. Furthermore, it can negatively impact the utility and fairness of the model [3, 25], as well as the system performance [66, 68]. Overall, none of the existing solutions meets all requirements, hampering their adoption.

More recently, the use of hardware-based Trusted Execution Environments (TEEs) has been proposed as a promising way to preclude attacks against DNN model parameters and gradients. TEEs allow to securely store data and execute arbitrary code on an untrusted device almost at native speed through secure memory compartments. All these advantages, together with the recent commoditization of TEEs both in high-end and mobile devices, make TEEs a suitable candidate to allow fully privacy-preserving ML modeling. However, in order to keep the Trusted Computing Base (TCB) as small as possible, current TEEs have limited memory. This makes it impossible to simultaneously place all DNN layers inside the TEE. As a result, prior work has opted for using TEEs to conceal only the most sensitive DNN layers from adversaries, leaving other layers unprotected [18, 49]. While this approach was sufficient to mitigate some attacks against traditional ML, where clients obtain only the final model, in FL scenarios the attack surface is significantly larger. FL client devices are able to observe distinct snapshots of the model throughout the training, allowing them to realize attacks at different stages [20, 45]. Therefore, it is of utmost importance to protect all DNN layers using the TEE.

In this paper, we propose Privacy-preserving Federated Learning (PPFL), the first practical framework to fully prevent private information leakage at both server and client side under FL scenarios. PPFL is based on greedy layer-wise training and aggregation, overcoming the constraints posed by the limited TEE memory, and providing comparable accuracy to complete model training at the price of a tolerable delay. Our layer-wise approach supports sophisticated settings such as training one or more layers (a block) each time, which can potentially better deal with heterogeneous data at the client-side and speed up the training process.

To show its feasibility, we implemented and evaluated a full prototype of the PPFL system, including the server-side (with Intel SGX) and client-side (with Arm TrustZone) elements of the design, and the secure communication between them. Our experimental evaluation shows that PPFL provides full protection against data reconstruction, property inference, and membership inference attacks, whose outcomes are degraded to random guessing (e.g., white-noise images or 50% precision scores). PPFL is practical as it does not add significant overhead to the training process. Compared to regular end-to-end FL, PPFL introduces a 3× or higher delay for completing the training of all DNN layers. However, PPFL achieves comparable ML performance when training only the first few layers, meaning that it is not necessary to train all DNN layers. Due to this flexibility of layer-wise training, PPFL can provide a similar ML model utility as end-to-end FL, with fewer communication rounds (0.54×) and a similar amount of network traffic (1.002×), with only ∼15% CPU time, ∼18% memory usage, and ∼21% energy consumption overhead at the client-side.

2 BACKGROUND AND RELATED WORK
In this section, we provide the background needed to understand the way TEEs work (Sec. 2.1), existing privacy risks in FL (Sec. 2.2), privacy-preserving ML techniques using TEEs (Sec. 2.3), as well as the core ideas behind layer-wise DNN training for FL (Sec. 2.4).

2.1 Trusted Execution Environments (TEE)
A TEE enables the creation of a secure area on the main processor that provides strong confidentiality and integrity guarantees to any data and code it stores or processes. TEEs realize strong isolation and attestation of secure compartments by enforcing a dual-world view where even compromised or malicious system (i.e., privileged) software in the normal world, also known as the Rich Operating System Execution Environment (REE), cannot gain access to the secure world. This allows for a drastic reduction of the TCB, since only the code running in the secure world needs to be trusted. Another key aspect of TEEs is that they allow arbitrary code to run inside almost at native speed. In order to keep the TCB as small as possible, current TEEs have limited memory; beyond this limit, TEEs are required to swap pages between secure and unprotected memory, which incurs a significant overhead and hence must be avoided.

Over the last few years, significant research and industry efforts have been devoted to developing secure and programmable TEEs for high-end devices (e.g., servers; recently, cloud providers also offer TEE-enabled infrastructure-as-a-service solutions to their customers, e.g., Microsoft Azure Confidential) and mobile devices (e.g., smartphones). In our work, we leverage Intel Software Guard Extensions (Intel SGX) [13] at the server-side, while on the client devices we rely on the Open Portable Trusted Execution Environment (OP-TEE) [40]. OP-TEE is a widely known open-source TEE framework that is supported by different boards equipped with Arm TrustZone. While some TEEs allow the creation of fixed-sized secure memory regions (e.g., of 128MB in Intel SGX), some others (e.g., Arm TrustZone) do not place any limit on the TEE size. However, creating large TEEs is considered to be bad practice since it has proven to significantly increase the attack surface. Therefore, the TEE size must always be kept as small as possible independently of the type of TEEs and devices being used. This principle has already been adopted by industry; e.g., in the HiKey 960 board the TEE size is only 16MiB.

2.2 Privacy Risks in FL
Below we give a brief overview of the three main categories of privacy-related attacks in FL: data reconstruction, property inference, and membership inference attacks.

Data Reconstruction Attack (DRA). The DRA aims at reconstructing original input data based on the observed model or its gradients. It works by inverting model gradients using techniques similar to generative adversarial attacks [2, 16, 78], and consequently reconstructing the corresponding original data used to produce the gradients. DRAs are effective when attacking a DNN's early layers, and when gradients have been updated on only a small batch of data (i.e., fewer than 8 samples) [16, 49, 78]. As the server typically observes the updated models of each client in plaintext, this type of leakage is more likely to exist at the server. By subtracting the updated models from the global model, the server obtains the gradients computed w.r.t. clients' data during local training.

Property Inference Attack (PIA). The goal of PIAs is to infer the value of private properties in the input data. This attack is achieved by building a binary classifier trained on model gradients updated with auxiliary data, and can be conducted on both the server and client sides [45]. Specifically, property information, which also refers to the feature/latent information of the input data, is more easily carried through stronger aggregation [47]. Even though clients in FL only observe multiple snapshots of broadcast global models that have been linearly aggregated from participating clients' updates, property information can still be well preserved, providing attack points to client-side adversaries.

Membership Inference Attack (MIA). The purpose of MIAs is to learn whether specific data instances are present in the training dataset. One can follow a similar attack mechanism as PIAs to build a binary classifier when conducting MIAs [52], although there are other methods, e.g., using shadow models [64]. The risk of MIAs exists on both the server and client sides. Moreover, because membership is 'high-level' latent information, adversaries can perform MIAs on the final (well-trained) model and its last layer [52, 64, 73].
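To make the structure shared by PIAs and gradient-based MIAs concrete, the sketch below shows an adversary who flattens observed per-round model updates into feature vectors and trains an auxiliary binary classifier on updates produced from data with and without the targeted property. This is only an illustrative sketch, not the exact attack pipelines of [45, 52]; the shapes, the logistic-regression attack model, and the helper names are our own assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def flatten_update(update_tensors):
    # Concatenate all layers of one observed update into a single feature vector.
    return np.concatenate([t.ravel() for t in update_tensors])

# Hypothetical auxiliary data: 200 observed updates of a small two-tensor model,
# labeled 1 if the batch that produced the update contained the targeted property.
aux_updates = [[rng.normal(size=(32, 16)), rng.normal(size=(16,))] for _ in range(200)]
aux_labels = rng.integers(0, 2, size=200)

X = np.stack([flatten_update(u) for u in aux_updates])
attack_model = LogisticRegression(max_iter=1000).fit(X, aux_labels)

# At attack time, the adversary scores a victim's observed update.
victim_update = [rng.normal(size=(32, 16)), rng.normal(size=(16,))]
p_property = attack_model.predict_proba(flatten_update(victim_update)[None, :])[0, 1]
print(f"inferred probability that the private property is present: {p_property:.2f}")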

2.3 Privacy-preserving ML using TEEs
Running ML inside TEEs can hide model parameters from REE adversaries and consequently preserve privacy, as already done for light data analytics on servers [54, 62] and for heavy computations such as DNN training [18, 23, 49, 70]. However, due to TEEs' limited memory size, previous studies run only part of the model (e.g., sensitive layers) inside the TEE [18, 48, 49, 70]. In the on-device training case, DarkneTZ [49] runs the last layers with a Trusted Application inside the TEE to defend against MIAs, and leaves the first layers unprotected. DarkneTZ's evaluation showed no more than 10% overhead in CPU, memory, and energy on edge-like devices, demonstrating its suitability for client-side model updates in FL. In an orthogonal direction, several works leveraged clients' TEEs for verifying the integrity of local model training [10, 75], but did not consider privacy. Considering a broader range of attacks (e.g., DRAs and PIAs), it is essential to protect all layers instead of only the last layers, which is what PPFL does.

2.4 Layer-wise DNN Training for FL
Instead of training the complete DNN model in an end-to-end fashion, one can train the model layer-by-layer from scratch, i.e., greedy layer-wise training [6, 35]. This method starts by training a shallow model (e.g., one layer) until its convergence. Next, it appends one more layer to the converged model and trains only this new layer [5]. Usually, for each greedily added layer, the model developer builds a new classifier on top of it in order to output predictions and compute the training loss. Consequently, these classifiers provide multiple early exits, one per layer, during the forward pass in inference [29]. Furthermore, this method was recently shown to scale to large datasets such as ImageNet and to achieve performance comparable to regular end-to-end ML [5]. Notably, all previous studies on layer-wise training focused on generic ML.

Contribution. Our work is the first to build a DNN model in an FL setting with privacy-preserving guarantees using TEEs, by leveraging greedy layer-wise training and training each DNN layer inside each FL client's TEE. Thus, PPFL satisfies the constraint of the TEE's limited memory while protecting the model from the aforementioned privacy attacks. Interestingly, the classifiers built atop each layer may also provide personalization opportunities for the participating FL clients.
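The sketch below illustrates the greedy layer-wise procedure described in Sec. 2.4 in a plain (non-federated, non-TEE) setting: each new convolutional layer is trained with its own small classifier head while all previously trained layers stay frozen. It is a minimal PyTorch sketch under our own assumptions (toy layer sizes, CIFAR10-like 32×32 inputs, a fixed step budget instead of a convergence test), not the authors' implementation.

import torch
import torch.nn as nn

torch.manual_seed(0)
layer_channels = [3, 32, 64, 128]            # toy architecture: three conv layers
trained_layers = nn.ModuleList()             # frozen, already-"converged" layers

def make_layer(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

for depth in range(1, len(layer_channels)):
    new_layer = make_layer(layer_channels[depth - 1], layer_channels[depth])
    # Temporary classifier head built atop the new layer (global pool + linear to 10 classes).
    head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(layer_channels[depth], 10))
    optimizer = torch.optim.SGD(list(new_layer.parameters()) + list(head.parameters()),
                                lr=0.01, momentum=0.5)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):                  # stand-in for "train until convergence"
        x = torch.randn(16, 3, 32, 32)       # stand-in for a batch of local data
        y = torch.randint(0, 10, (16,))
        with torch.no_grad():                # frozen layers: forward pass only
            for frozen in trained_layers:
                x = frozen(x)
        logits = head(new_layer(x))          # only the new layer and head receive gradients
        loss = loss_fn(logits, y)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    for p in new_layer.parameters():
        p.requires_grad_(False)
    trained_layers.append(new_layer)         # freeze and keep the trained layer
    print(f"layer {depth} trained, last loss {loss.item():.3f}")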

3 THREAT MODEL AND ASSUMPTIONS
Threat model. We consider a standard FL context where multiple client devices train a DNN locally and send their (local) model parameters to a remote, centralized server, which aggregates these parameters to create a global model [7, 27, 43]. The goal of adversaries is to obtain sensitive information embedded in the global model through data reconstruction [16, 78] or inference attacks [45, 52]. We consider two types of (passive) adversaries: (i) users of client devices, who have access to distinct snapshots of the global model, and (ii) the server's owner (e.g., a cloud or edge provider), who has access to the updated model gradients. Adversaries are assumed to be honest-but-curious, meaning that they allow FL algorithms to run as intended while trying to infer as much information as possible from the global model or gradients. Adversaries can have full control (i.e., root privileges) of the server or the client device, and can perform their attacks against any DNN layer. However, attacks against the TEE, such as side-channel attacks (e.g., Voltpillager [12]), physical attacks (e.g., Platypus [41]), and those that exploit weaknesses in TEEs (e.g., [38]) and their SDKs (e.g., [71]), are out of scope for this paper.

Assumptions. We assume that the server and enough participating FL client devices have a TEE whose memory size is larger than the largest layer of the DNN to be trained. This is the case in current FL DNNs. However, in the unlikely case that a layer does not fit in the available TEEs, the network design needs to be adjusted with smaller but more layer(s), or a smaller training batch size. We also assume that there is a secure way to bootstrap trust between the server TEE and each of the client device TEEs (e.g., using a slightly modified version of the SIGMA key exchange protocol [32, 77], or attested TLS [30]), and that key management mechanisms exist to update and revoke keys when needed [55]. Finally, we assume that the centralized server will forward data to/from its TEE. Yet, it is important to note that if the server were malicious and did not do this, it would only affect the availability of the system (i.e., the security and privacy properties of our solution remain intact). This type of Denial-of-Service (DoS) attack is hard to defend against and is not considered within the standard TEE threat model.

4 PPFL FRAMEWORK
In this section, we first present an overview of the proposed system and its functionalities (Sec. 4.1), and then detail how the framework employs layer-wise training and aggregation in conjunction with TEEs in FL (Sec. 4.2).

4.1 System Overview
[Figure 1: A schematic diagram of the PPFL framework (server and clients with TEEs, public knowledge and private datasets, public and private layers with per-layer classifiers, forward/backward passes, and moving to the next block of layers after convergence). The main phases follow the system design in [7].]

We propose a Privacy-preserving Federated Learning framework which allows clients to collaboratively train a DNN model while keeping the model's layers always inside TEEs during training. Figure 1 provides an overview of the framework and the various steps of the greedy layer-wise training and aggregation. In general, starting from the first layer, each layer is trained until convergence before moving to the next layer. In this way, PPFL aims to achieve full privacy preservation without significantly increasing system cost. PPFL's design provides the following functionalities:

Privacy-by-design Guarantee. PPFL ensures that layers are always protected from adversaries while they are being updated. Privacy risks depend on the aggregation level and frequency with which the model or its layers are exposed [16, 45, 47]. In PPFL, lower-level information (i.e., original data and attributes) is not exposed because the gradients updated during training are not accessible to adversaries (they stay inside the TEEs). This protects against DRAs and PIAs. However, when one of these layers is exposed after convergence, there is a risk of MIAs. We follow a more practical approach based on the observation that membership-related information is only sensitive in the last DNN layer, making it vulnerable to MIAs, as indicated in previous research [47, 49, 52, 59]. To avoid this risk on the final model, PPFL can keep the last layer inside the clients' TEEs after training.

Device Selection. After the server and a set of TEE-enabled clients agree on the training of a DNN model via FL, clients inform the server about their TEE's memory constraints. The server then (re)constructs a DNN model suitable for this set of clients and selects the clients that can accommodate the model layers within their TEE. In each round, the server can select new clients, and the device selection algorithm can follow existing FL approaches [21, 53].

Secure Communication Channels. The server establishes two secure communication channels with each of its clients: (i) one from its REE to the client's REE (e.g., using TLS) to exchange data with clients, and (ii) a logical one from its TEE to the client's TEE for securely exchanging private information (e.g., model layer training information). In the latter case, the transmitted data is encrypted using cryptographic keys known only to the server and client TEEs and is sent over the REE-REE channel. It is important to note that the secure REE-REE channel is only an additional security layer. All privacy guarantees offered by PPFL are based on the hardware-backed cryptographic keys stored inside TEEs.

Model Initialization and Configuration. The server configures the model architecture, decides the layers to be protected by TEEs, and then initializes the model parameters inside the TEE (step ②, Fig. 1). The latter ensures clients' local training starts with the same weight distribution [43, 72]. In addition, the server configures other training hyper-parameters such as learning rate, batch size, and epochs, before transmitting these settings to the clients (step ③, Fig. 1).

In typical ML tasks such as image recognition, where public knowledge is available in the form of pre-trained DNN models or public datasets with features similar to the clients' private data, the server can transfer this knowledge (especially in cross-device FL [27]) in order to bootstrap and speed up the training process. In both cases, this knowledge is contained in the first layers. Thus, the clients leave the first layers frozen and only train the last several layers of the global model. This training process is similar to the concept of transfer learning [9, 56, 69], where, in our case, public knowledge is transferred in a federated manner. In PPFL, the server can learn from public models: during initialization, the server first chooses a model pre-trained on public data that has a distribution similar to the private data. The server keeps the first layers, removes the last layer(s), and assembles new layer(s) atop the reserved first ones. These first layers are transferred to the clients and are always kept frozen (step ①, Fig. 1). New layers, attached to the reserved layers, are trained inside each client's TEE, and then aggregated inside the server's TEE (steps ②-⑥, Fig. 1). When learning from public datasets, the server first performs an initial training to build the model based on these datasets.

Local Training. After model transmission and configuration over secure channels, each client starts local training on its data, one layer at a time, via a model partitioned execution technique (step ④, Fig. 1). We detail this step in Sec. 4.2.

Reporting and Aggregation. Once the local training of a layer is completed inside the TEEs, all participating clients report the layer parameters to the server through secure channels (step ⑤, Fig. 1). Finally, the server securely aggregates the received parameters within its TEE and applies FedAvg [43], resulting in a new global model layer (step ⑥, Fig. 1).
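To illustrate the aggregation step (step ⑥), the sketch below shows the FedAvg-style averaging that PPFL applies to a single layer's parameters once all participating clients have reported them. In the real system this averaging runs inside the server's SGX enclave on decrypted updates; here it is plain Python over in-memory tensors, and the unweighted mean over participating clients (matching Algorithm 1) and the variable names are our own illustrative choices.

import torch

def aggregate_layer(client_layer_params):
    # FedAvg for one layer: element-wise mean of the clients' reported
    # parameter tensors (e.g., {'weight': ..., 'bias': ...}).
    n = len(client_layer_params)
    keys = client_layer_params[0].keys()
    return {k: sum(p[k] for p in client_layer_params) / n for k in keys}

# Toy example: 3 clients report the parameters of one convolutional layer.
clients = [{"weight": torch.randn(32, 3, 3, 3), "bias": torch.randn(32)} for _ in range(3)]
new_global_layer = aggregate_layer(clients)
print(new_global_layer["weight"].shape, new_global_layer["bias"].shape)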

4.2 Layer-wise Training and Aggregation
In order to address the problem of limited memory inside a TEE when training a DNN model, we modify the greedy layer-wise learning technique proposed in [6] for general DNN training [5] to work in the FL setting. The procedure of layer-wise training and aggregation is detailed in Algorithms 1 and 2.

Algorithm 1: PPFL-Server with TEE
Input:
  • Number of all clients: N
  • TEE memory size of client n: S(n)
  • Memory usage of layers {1, ..., L} in training (forward and backward pass in total): {S_1, ..., S_L}
  • Communication rounds: R
Output: Aggregated final parameters: {θ_1^0, ..., θ_L^0}
  % Layer-wise client updates
  for l ∈ {1, ..., L} do
      % Select clients with enough TEE memory
      Initialize participating client list J = {}
      for n ∈ {1, ..., N} do
          if S(n) > S_l then J ← J ∪ {n}
      Initialize θ_l (parameters of layer l)                              [in TEE]
      for r ∈ {1, ..., R} do
          for j ∈ J do
              % Clients' local updating: see Algorithm 2
              θ_l^(j) = ClientUpdate(l, θ_l)
          % FedAvg with secure aggregation
          θ_l = (1 / size(J)) · Σ_{j∈J} θ_l^(j)                           [in TEE]
      Save θ_l from TEE as θ_l^0 in REE
  return {θ_1^0, ..., θ_L^0}

Algorithm 2: ClientUpdate(l, θ_l) with TEEs
Initialization:
  • Local dataset X: data {x} and labels {y}
  • Trained final parameters of all previous layers, i.e., θ_1^0, θ_2^0, ..., θ_{l−1}^0
  • Number of local training epochs: E
  • Activation function: σ() and loss function: ℓ
  • Classifier: C()
Input:
  • Target layer: l
  • Broadcast parameters of layer l: θ_l
Output: Updated parameters of layer l: θ_l
  % Weights and biases of layers 1, ..., (l−1) and l
  for i ∈ {1, ..., l−1} do
      {W_i, b_i} ← θ_i^0
  {W_l, b_l} ← θ_l                                                        [in TEE]
  % Training process
  for e ∈ {1, ..., E} do
      for {x, y} ∈ X do
          % Forward pass
          Intermediate representation T_0 = x
          for i ∈ {1, ..., l−1} do
              T_i = σ(W_i T_{i−1} + b_i)
          T_l = σ(W_l T_{l−1} + b_l)                                      [in TEE]
          ℓ ← ℓ(C(T_l), y)
          % Backward pass
          Use ∂ℓ/∂C to update the parameters of C                         [in TEE]
          % Updating layer l
          W_l ← W_l + ∂ℓ/∂W_l ;  b_l ← b_l + ∂ℓ/∂b_l
          θ_l = {W_l, b_l}                                                [in TEE]
  return θ_l

Algorithm 1. This algorithm details the actions taken by PPFL on the server side. When not specified, operations are carried out outside the TEE (i.e., in the REE). First, the server initializes the global DNN model with random weights or public knowledge (steps ①-②, Fig. 1). Thus, each layer l to be trained is initialized (θ_l) and prepared for broadcast. The server checks all available devices and constructs a set of participating clients whose TEE is larger than the required memory usage of l. Then, it broadcasts the model's layer to these participating clients (step ③, Fig. 1), via ClientUpdate() (see Algorithm 2). Upon receiving updates from all participating clients, the server decrypts the layer weights, performs secure layer aggregation and averaging inside its TEE (step ⑥), and broadcasts the new version of l to the clients for the next FL round. Steps ②-⑥ are repeated until the training of l converges, or a fixed number of rounds is completed. Then, this layer is considered fully trained (θ_l^0), it is passed to the REE, and is broadcast to all clients to be used for training the next layer. Interestingly, PPFL also allows grouping multiple layers into blocks and training each block inside client TEEs in a similar fashion to individual layers. This option allows for better utilization of the memory space available inside each TEE and reduces the communication rounds needed for the convergence of more than one layer at a time.

Algorithm 2. This algorithm details the actions taken by PPFL on the client side. Clients load the received model parameters from the server, and decrypt and load the target training layer l inside their TEEs. More specifically, in the front, this new layer l connects to the previous pre-trained layer(s), which are frozen during training. In the back, the clients attach to l their own derived classifier, which consists of fully connected layers and a softmax layer as the model exit. Then, for each epoch, the training process iteratively goes through batches of data and performs both forward and backward passes [36] to update both the layer under training and the classifier inside the TEE (step ④, Fig. 1). During this process, a model partitioned execution technique is utilized, where intermediate representations of the previously trained layers are passed from the REE to the TEE via shared memory in the forward pass. After local training is completed (i.e., all batches and epochs are done), each client sends via the secure channel the (encrypted) layer's weights from its TEE to the server's TEE (step ⑤).

Model Partitioned Execution. The above learning process is based on a technique that conducts model training (including both forward and backward passes) across REEs and TEEs, namely model partitioned execution. The transmission of the forward activations (i.e., intermediate representations) and updated parameters happens between the REE and the TEE via shared memory. At a high level, when a set of layers is in the TEE, activations are transferred from the REE to the TEE (see Algorithm 2). Assuming global layer l is under training, the layer with its classifier C(.) is executed in the TEE, and the previous layers (i.e., 1 to l−1) are in the REE.

Before training, layer l's parameters are loaded and decrypted securely within the TEE. During the forward pass, local data x are input, and the REE processes the previous layers from 1 to l−1 and invokes a command to transfer layer l−1's activations (i.e., T_{l−1}) to the secure memory through a buffer in shared memory. The TEE switches to the corresponding invoked command in order to receive layer l−1's activations, and processes the forward pass of layer l and classifier C(.) in the TEE.

During the backward pass, the TEE computes C(.)'s gradients based on the received labels y and the outputs of C(.) (produced in the forward pass), and uses them to compute the gradients of layer l in the TEE. The training of this batch of data (i.e., x) finishes here, and there is no need to transfer l's errors from the TEE to the REE via shared memory, as the previous layers are frozen outside the TEE. After that, the parameters of layer l are encrypted and passed to the REE, ready to be uploaded to the server, corresponding to FedSGD [11]. Further, FedAvg [43], which requires multiple batches to be processed before updating, repeats the same number of forward and backward passes across the REE and the TEE for each batch of data.
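The following sketch mimics the model partitioned execution described above in ordinary Python: an "REE" function runs the frozen layers and writes the intermediate activation into a shared buffer, and a "TEE" function reads it, runs the layer under training plus its classifier, and updates only those parameters. The world switch, shared-memory buffer, and encryption are simulated with a plain dictionary; the names and shapes are our own assumptions rather than the OP-TEE/DarkneTZ implementation.

import torch
import torch.nn as nn

torch.manual_seed(0)
shared_memory = {}                        # stand-in for the REE<->TEE shared buffer

# "Normal world": frozen, already-trained layers 1..l-1 (plaintext, outside the TEE).
ree_frozen = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
for p in ree_frozen.parameters():
    p.requires_grad_(False)

# "Secure world": the layer l under training and its classifier head C(.).
tee_layer = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
tee_classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(list(tee_layer.parameters()) +
                            list(tee_classifier.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def ree_forward(x):
    # REE: run frozen layers 1..l-1 and place T_{l-1} in the shared buffer.
    with torch.no_grad():
        shared_memory["activation"] = ree_frozen(x)

def tee_train_step(labels):
    # TEE: read T_{l-1}, run layer l and classifier C, and backpropagate inside the TEE.
    t = shared_memory["activation"]
    logits = tee_classifier(tee_layer(t))
    loss = loss_fn(logits, labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

x = torch.randn(4, 3, 32, 32)             # one (toy) local batch
y = torch.randint(0, 10, (4,))
ree_forward(x)                            # forward pass of the frozen layers in the REE
print("loss computed inside the simulated TEE:", tee_train_step(y))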

Algorithmic Complexity Analysis. Next, we analyze the algorithmic complexity of PPFL and compare it to standard end-to-end FL. For the global model's layers l ∈ {1, ..., L}, we denote the forward and backward pass costs of layer l as F_l and B_l, respectively. The corresponding costs of the classifier are denoted F_c and B_c. Then, in end-to-end FL, the total training cost for one client is:

    ( Σ_{l=1}^{L} (F_l + B_l) + F_c + B_c ) · S · E        (1)

where S is the number of steps in one epoch (i.e., the number of samples in the local dataset divided by the batch size) and E is the number of local epochs. As in PPFL all layers before the training layer l are kept frozen, the cost of training layer l is ( Σ_{k=1}^{l} F_k + F_c + B_l + B_c ) · S · E. Then, by summation, we get the total cost over all layers as:

    ( Σ_{l=1}^{L} Σ_{k=1}^{l} F_k + Σ_{l=1}^{L} B_l + L · (F_c + B_c) ) · S · E        (2)

By comparing Equations 1 and 2, we see that the overhead of PPFL comes from: (i) the repeated forward pass through the previous layers (k ∈ {1, ..., l−1}) when training layer l, and (ii) the repeated forward and backward passes of the classifier atop each layer l.

5 IMPLEMENTATION & EVALUATION SETUP
In this section, we first describe the implementation of the PPFL system (Sec. 5.1), and then detail how we assess its performance on various DNN models and datasets (Sec. 5.2) using different metrics (Sec. 5.3). We follow common setups of past FL systems [43, 72] and on-device TEE works [1, 49].

5.1 PPFL Prototype
We implement the client-side of PPFL by building on top of DarkneTZ [49], in order to support on-device FL with Arm TrustZone. In total, we changed 4075 lines of DarkneTZ code in C. We run the client-side on a HiKey 960 board, which has four Arm Cortex-A73 and four Arm Cortex-A53 cores configured at 2362MHz and 533MHz, respectively, as well as 4GB of LPDDR4 RAM with 16MiB of TEE secure memory (i.e., TrustZone). Since the CPU power/frequency setting can impact TrustZone's performance [1], we execute the on-device FL training at full CPU frequency. In order to emulate multiple device clients and their participation in FL rounds, we use the HiKey board in a repeated, iterative fashion, one time per client device. We implement the server-side of PPFL on the generic Darknet ML framework [58] by adding 751 lines of C code based on Microsoft OpenEnclave [46] with Intel SGX. For this, an Intel Next Unit of Computing (ver. NUC8BEK, i3-8109U CPU, 8GB DDR4-2400MHz) with SGX-enabled capabilities was used.

Besides, we developed a set of bash shell scripts to control the FL process and create the communication channels. For the communication channels between server and client to be secure, we employ standard cryptographic network protocols such as SSH and SCP. All data leaving the TEE are encrypted using the Advanced Encryption Standard (AES) in Cipher Block Chaining (CBC) mode with random Initialization Vectors (IVs) and 128-bit cryptographic keys. Without loss of generality, we opted for manually hardcoding the cryptographic keys inside the TEEs ourselves. Despite key management in TEE-to-TEE channels being an interesting research problem, we argue that establishing, updating, and revoking keys do not happen frequently, and hence the overhead these tasks introduce is negligible compared to that of the DNN training.

The implementation of the PPFL server and client is available for replication and extension: https://github.com/mofanv/PPFL.

Table 1: DNNs used in the evaluation of PPFL.

DNN              | Architecture
LeNet [37, 43]   | C20-MP-C50-MP-FC500-FC10
AlexNet [5, 34]  | C128×3-AP16-FC10
VGG9 [65, 72]    | C32-C64-MP-C128×2-MP-D0.05-C256×2-MP-D0.1-FC512×2-FC10
VGG16 [65]       | C64×2-MP-C128×2-MP-C256×3-C512×3-MP-FC4096×2-FC1000-FC10
MobileNetv2 [61] | 68 layers, unmodified; refer to [61] for details

Architecture notation: Convolution layer (C) with a given number of filters; the filter size is 5×5 in LeNet and 3×3 in AlexNet, VGG9, and VGG16. Fully Connected layer (FC) with a given number of neurons. All C and FC layers are followed by ReLU activation functions. MaxPooling (MP). AveragePooling (AP) with a given stride size. Dropout layer (D) with a given dropping rate.

5.2 Models and Datasets
We focus on Convolutional Neural Networks (CNNs) since the privacy risks we consider (Sec. 3 and 4.1) have been extensively studied on such DNNs [45, 52]. Also, layer-based learning methods mostly target CNN-like DNNs [5]. Specifically, in our PPFL evaluation, we employ DNNs commonly used in the relevant literature (Table 1).

For our experimental analysis, we used MNIST and CIFAR10, two datasets commonly employed by FL researchers. Note that, in practice, FL training needs labeled data locally stored at the clients' side. Indeed, the number of labeled examples expected to be present in a real setting could be fewer than what these datasets allocate per FL client. Nonetheless, using them allows comparison of our results with state-of-the-art end-to-end FL methods [16, 39, 72].

Specifically, LeNet is tested on MNIST [37] and all other models are tested on CIFAR10 [33]. The former is a handwritten digit image (28×28) dataset consisting of 60k training samples and 10k test samples with 10 classes. The latter is an object image (32×32×3) dataset consisting of 50k training samples and 10k test samples with 10 classes. We follow the setup in [43] to partition the training datasets into 100 parts, one per client, in two versions: i) Independent and Identically Distributed (IID), where a client has samples of all classes; ii) Non-Independent and Identically Distributed (Non-IID), where a client has samples from only two random classes.
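As a concrete illustration of the partitioning just described, the sketch below splits a CIFAR10-sized label array into 100 client shards, either IID (each client draws from all classes) or Non-IID (each client mostly holds samples of two classes). It is a simplified sketch in the spirit of the setup in [43], under our own assumptions (equal shard sizes, NumPy only), not the authors' exact partitioning code.

import numpy as np

rng = np.random.default_rng(42)
labels = rng.integers(0, 10, size=50_000)        # stand-in for the CIFAR10 training labels
num_clients = 100

def partition_iid(labels, num_clients):
    # IID: shuffle all sample indices and deal them out evenly to the clients.
    return np.array_split(rng.permutation(len(labels)), num_clients)

def partition_noniid(labels, num_clients, shards_per_client=2):
    # Non-IID: sort indices by label, cut them into 2*num_clients shards, and give
    # each client two shards, so each client mostly holds samples of two classes.
    order = np.argsort(labels, kind="stable")
    shards = np.array_split(order, num_clients * shards_per_client)
    shard_ids = rng.permutation(num_clients * shards_per_client)
    return [np.concatenate([shards[s] for s in
                            shard_ids[i * shards_per_client:(i + 1) * shards_per_client]])
            for i in range(num_clients)]

iid_clients = partition_iid(labels, num_clients)
noniid_clients = partition_noniid(labels, num_clients)
print(len(iid_clients[0]), np.unique(labels[noniid_clients[0]]))   # ~500 samples; ~2 classes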

5.3 Performance Metrics
The evaluation of the PPFL prototype presented in the next section assesses the framework from the point of view of (i) privacy of data, (ii) ML model performance, and (iii) client-side system cost. Although ML computations (i.e., model training) have the same precision and accuracy whether they run in REEs or TEEs, PPFL changes the FL model training process into a layer-based training. This affects ML accuracy and the number of communication rounds needed for the model to converge (among others). Thus, we devise several metrics and perform extensive measurements to assess overall PPFL performance. We conduct system cost measurements only on client devices since their computational resources are more limited compared to the server. All experiments are done with 10% of the total number of clients (i.e., 10 out of 100) participating in each communication round. We run FL experiments on our PPFL prototype (Sec. 5.1) to measure the system cost. To measure privacy risks and ML model performance, we perform simulations on a cluster with multiple NVIDIA RTX6000 GPU (24GB) nodes running PyTorch v1.4.0 under Python v3.6.0.

Model Performance. We measure three metrics to assess the performance of the model and the PPFL-related process:
(1) Test Accuracy: ML accuracy on test data for a given FL model, after a fixed number of communication rounds.
(2) Communication Rounds: Iterations of communication between server and clients needed to achieve a particular test accuracy.
(3) Amount of Communication: Total amount of data exchanged to reach a given test accuracy. Transmitted data sizes may differ among communication rounds when considering the different layers' sizes in layer-wise training.

Privacy Assessment. We measure the privacy risk of PPFL by applying three FL-applicable, privacy-related attacks:
(1) Data Reconstruction Attack (DRA) [78]
(2) Property Inference Attack (PIA) [45]
(3) Membership Inference Attack (MIA) [52]
We follow the proposing papers and their settings to conduct each attack on the model trained in the FL process.

Client-side System Cost. We monitor the efficiency of client on-device training, and measure the following device costs for the PPFL-related process (a minimal measurement sketch follows after this list):
(1) CPU Execution Time (s): Time the CPU was used for processing the on-device model training, including time spent in the REE and the TEE's user and kernel time, as reported by the function getrusage(RUSAGE_SELF).
(2) Memory Usage (MB): We add the REE memory (the maximum resident set size in RAM, accessible via getrusage()) and the allocated TEE memory (accessible via mdbg_check(1)) to get the total memory usage.
(3) Energy Consumption (J): All energy used to perform one on-device training step when the model runs with/without TEEs. For this, we use the Monsoon High Voltage Power Monitor [50]. We configure the power to the HiKey board as 12V while recording the current at a 50Hz sampling rate. Training with a high-performance power setting can lead to high temperature and consequently under-clocking; thus, we run each trial with 2000 steps continuously, starting with 120s of cooling time.
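The sketch below shows how the REE side of such a measurement can be taken from Python with the standard resource module, which wraps the same getrusage(RUSAGE_SELF) call named above. It covers only the CPU-time and maximum-resident-set-size parts (the TEE memory counter and the Monsoon power readings require the device-side tooling), and the unit conversion assumes Linux, where ru_maxrss is reported in kilobytes.

import resource
import time

def measure(step_fn):
    # Run one training step and report CPU time (user+system, seconds) and the
    # peak resident set size (MB) of this process, via getrusage(RUSAGE_SELF).
    before = resource.getrusage(resource.RUSAGE_SELF)
    t0 = time.monotonic()
    step_fn()
    wall = time.monotonic() - t0
    after = resource.getrusage(resource.RUSAGE_SELF)
    cpu_s = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    peak_mb = after.ru_maxrss / 1024          # Linux reports ru_maxrss in KB
    return cpu_s, wall, peak_mb

def dummy_training_step():                    # stand-in for one on-device training step
    sum(i * i for i in range(1_000_000))

cpu_s, wall_s, peak_mb = measure(dummy_training_step)
print(f"CPU: {cpu_s:.3f}s  wall: {wall_s:.3f}s  peak RSS: {peak_mb:.1f} MB")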

Table 2: Results of three privacy-related attacks (DRA, PIA and MIA) on PPFL vs. end-to-end (E2E) FL. Average score reported, with the 95% confidence interval in parenthesis.

Learning Method | Model   | DRA, in MSE (a) | PIA, in AUC (b) | MIA, in Prec. (c)
E2E             | AlexNet | 0.017 (0.01)    | 0.930 (0.03)    | 0.874 (0.01)
E2E             | VGG9    | 0.008 (<0.01)   | 0.862 (0.05)    | 0.765 (0.04)
PPFL            | AlexNet | ∼1.3            | ∼0.5            | 0.506 (0.01)
PPFL            | VGG9    | ∼1.3            | ∼0.5            | 0.507 (<0.01)

(a) MSE (mean-square error) measures the difference between reconstructed images and target images (range [0, ∞); the lower the MSE, the larger the privacy loss). (b) AUC refers to the area under the receiver operating curve. (c) Prec. refers to Precision. The range of both AUC and Prec. is [0.5, 1] (0.5 corresponds to random guessing); the higher the AUC or Prec., the larger the privacy loss.

6 EVALUATION RESULTS
In this section, we present the experimental evaluation of PPFL, aiming to answer a set of key questions.

6.1 How Effectively does PPFL Thwart Known Privacy-related Attacks?
To measure the exposure of the model to known privacy risks, we conduct data reconstruction, property inference, and membership inference attacks (i.e., DRAs, PIAs, and MIAs) on the PPFL model while training AlexNet and VGG9 models on CIFAR10 in an IID setting. We compare the exposure of PPFL to these attacks against a standard, end-to-end FL-trained model. Table 2 shows the average performance of each attack, measured in the same way as in the literature [45, 52, 78]: Mean-Square-Error (MSE) for the DRA, Area-Under-Curve (AUC) for the PIA, and Precision for the MIA.

From the results, it becomes clear that, while these attacks can successfully disclose private information in regular end-to-end FL, they fail in PPFL. As DRAs and PIAs rely on intermediate training models (i.e., gradients) that remain protected, PPFL can fully defend against them. The DRA can only reconstruct a fully noised image for any target image (i.e., an MSE of ∼1.3 for the specific dataset), while the PIA always reports a random guess on private properties (i.e., an AUC of ∼0.5). Regarding the MIA on the final trained models, as PPFL keeps the last layer and its outputs always protected inside the client's TEE, it forces the adversary to access only the previous layers, which significantly drops the MIA's advantage (i.e., Precision ≈ 0.5). Thus, PPFL fully addresses the privacy issues raised by such attacks.

6.2 What is the PPFL Communication Cost?
Predefined ML Performance. Next, we measure PPFL's communication cost to complete the FL process when a specific ML performance is desired. For this, we first execute the standard end-to-end FL without TEEs for 150 rounds and record the achieved ML performance. Subsequently, we set the same test accuracy as a requirement, and measure the number of communication rounds and the amount of communication required by PPFL to achieve this ML performance.

In this experiment, we set the number of local epochs at the clients to 10. We use SGD as the optimization algorithm and set the learning rate to 0.01, with a decay of 0.99 after each epoch. Momentum is set to 0.5 and the batch size to 16. When training each layer locally, we build one classifier on top of it. The classifier's architecture follows the last convolutional (Conv) layer and fully-connected (FC) layers of the target model (e.g., AlexNet or VGG9). Thus, the training of each global model layer progresses until all Conv layers are finished. We choose AlexNet and VGG9 on CIFAR10, because MNIST is too simple for this test. Then, the classifier atop all Conv layers is finally trained to provide the outputs of the global model. Note that we also aggregate the client classifiers while training one global layer, to provide the test accuracy after each communication round. We perform these experiments on IID and Non-IID data.

Table 3: Communication overhead (rounds and amount) of PPFL to reach the same accuracy as the end-to-end FL system.

Model   | Data    | Baseline Acc. (a) | Comm. Rounds   | Comm. Amount
LeNet   | IID     | 98.93%            | 56 (0.37×) (b) | 0.38×
LeNet   | Non-IID | 97.06% (c)        | -              | -
AlexNet | IID     | 68.50%            | 97 (0.65×)     | 0.63×
AlexNet | Non-IID | 49.49%            | 79 (0.53×)     | 0.53×
VGG9    | IID     | 63.09%            | 171 (1.14×)    | 2.87×
VGG9    | Non-IID | 46.70%            | 36 (0.24×)     | 0.60×

(a) Acc.: Test accuracy after 150 communication rounds in end-to-end FL; (b) 1× refers to no overhead; (c) PPFL reaches a maximum of 95.99%.

Overall, the results in Table 3 show that, while trying to reach the ML performance achieved by the standard end-to-end FL system, PPFL adds little communication overhead, if any, to the FL process. In fact, in some cases it can even reduce the communication cost, while preserving privacy using TEEs. As expected, using Non-IID data leads to lower ML performance across the system, which also implies less communication cost for PPFL.

The reason why PPFL often has reduced communication cost, while still achieving comparable ML performance, is that training these models on datasets such as CIFAR10 may not require training the complete model. Instead, during the early stage of PPFL's layer-wise training (e.g., first global layer + classifier), it can already reach good ML performance, and in some cases even better than training the entire model. We explore this aspect further in the next subsection. Consequently, and because fewer rounds are needed, the amount of communication is also reduced.

The increased cost when training VGG9 is due to the large number of neurons in the classifier's FC layer connected to the first Conv layer. Thus, even if the number of total layers considered (one global layer + classifier) is smaller compared to the later stages (multiple global layers + classifier), the model size (i.e., number of parameters) can be larger.

Indeed, we are aware that by training any of these models on CIFAR10 [43] for more communication rounds, either PPFL or regular end-to-end FL can reach a higher test accuracy, such as 85% with standard FedAvg. However, the training rounds used here are sufficient for our needs, as our goal is to evaluate the performance of PPFL (i.e., what is the cost of reaching the same accuracy), and not to achieve the best possible accuracy on this classification task.
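To give a feel for how the per-round "amount of communication" relates to layer-wise training, the sketch below counts the parameters exchanged per round for each stage of a small PyTorch model: in end-to-end FL every round moves the whole model down to and back from each client, while in layer-wise FL a round moves only the layer under training plus its temporary classifier head. The toy model, the 4-bytes-per-parameter assumption, and the exclusion of protocol overhead are our own simplifications, not the exact accounting behind Table 3.

import torch.nn as nn

def num_params(module):
    return sum(p.numel() for p in module.parameters())

# Toy CIFAR10-style CNN: three conv "global layers" plus a final classifier.
conv_layers = [nn.Conv2d(3, 32, 3, padding=1),
               nn.Conv2d(32, 64, 3, padding=1),
               nn.Conv2d(64, 128, 3, padding=1)]
classifier = nn.Linear(128, 10)
bytes_per_param, n_clients = 4, 10            # float32 updates, 10 clients per round

e2e_per_round = 2 * n_clients * bytes_per_param * (sum(map(num_params, conv_layers))
                                                   + num_params(classifier))
print(f"end-to-end FL: {e2e_per_round / 1e6:.2f} MB moved per round")

for i, layer in enumerate(conv_layers, start=1):
    head = nn.Linear(layer.out_channels, 10)  # temporary per-layer classifier head
    per_round = 2 * n_clients * bytes_per_param * (num_params(layer) + num_params(head))
    print(f"layer-wise FL, training layer {i}: {per_round / 1e6:.2f} MB moved per round")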

of training. Indeed, models can be (re)trained on longer timescales (a) LeNet on MNIST
100%
(e.g., weekly, monthly), and rounds can have a duration of 10s of ●




●●
●●●
● ●●●●●●●
●● ●

●●●●●●●●●●●●●●●● ●●●
●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●● ●●
●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

minutes, while being executed in an asynchronous manner. On the 90% ●

Test Acc.
other hand, training one layer in PPFL costs similar time to the 80%
end-to-end FL training of the complete model. This highlights that Learning methods

E2E & IID
70%
the minimum client contribution time is the same as end-to-end FL: E2E & Non−IID
PPFL & IID

clients can choose to participate in portions of an FL round, and in 60%


PPFL & Non−IID

For PPFL: layer 1 layer 2 layer 3 classifier


just a few FL rounds. For example, a client may contribute to the
1 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200
model training for only a few layers in any given FL round. Communication rounds
Among all FL phases, local training costs the most, while the (b) AlexNet on CIFAR10
time spent in server aggregation and averaging is trivial, regardless 90%
80%
if it is non-secure (i.e., end-to-end FL) or secure (PPFL). Regarding 70%
●●●●●
● ●●●●●
●●●● ●
●●●●●●●●●
●●●●●●●●
● ●●●●●●
●●●●●●●● ●
● ●●●●●●●●●●●●●●●
● ●
●●●●●●●●●● ●●●●●●●●●●●● ●●
●●●●●●●●● ●●●●●●

●●●●●●●●●●●●●●●●●●●●●● ●●●●
●●●●●●●●●●●●●●●●●

Test Acc.
●●●

60%
●●●●
●●●●

VGG9, layer-wise training of early layers significantly increases the


●●●●
●●●
●●
●●
●●●

●●
●●

50% ●
●●

●●
●●

●●

communication time in broadcast and upload, because the Conv



40%





30% ●

layers are with a small number of filters and consequently the 20%
10%
following classifier’s FC layer has a large size. This finding hints 0%
that selecting suitable DNNs to be trained in PPFL (e.g., AlexNet vs. For PPFL: layer 1 layer 2 layer 3 classifier

VGG9) is crucial for practical performance. Moreover, and according 1 10 20 30 40 50 60 70 80 90 100


Communication rounds
110 120 130 140 150 160 170 180 190 200

to the earlier FL performance results (also see Table 2), it may not (c) VGG9 on CIFAR10
be necessary to train all layers to reach the desired ML utility. 90%
80%
70%
Test Acc.
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

60%
●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●● ●●●●●●●●●●●● ●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●● ●●●●●●●●●
●●●●● ●●●●●
●●●●●●

6.3 Is the PPFL ML Performance Comparable


●●●●●●●
●●●●●
●●
●●
●●

50%
●●
●●●
●●


40%

to State-of-art FL? 30%


20%



In these experiments, we reduce the number of communication 10% ●●●

0%
rounds that each layer in PPFL is trained to 50, finish the training For PPFL: layer 1 layer 2 layer 3 layer 4 layer 5 layer 6 classifier
process per layer, and compare its performance with centralized 1 25 50 75 100 125 150 175 200 225 250 275 300 325 350
Communication rounds
layer-wise training, as well as regular end-to-end FL. The latter
trains the full model for all rounds up to that point. For example, if
Figure 2: Test accuracy of training LeNet, AlexNet, and VGG9
PPFL trains the first layer for 50 rounds, and then the second layer
models on IID and Non-IID datasets when using PPFL. Hor-
for 50 rounds, the end-to-end FL will train all the model (end-to-end)
izontal dashed lines refer to the accuracy that the central-
for 100 rounds.
ized training reaches after every 50 epochs. Note: end-to-end
Figure 2: Test accuracy of training LeNet, AlexNet, and VGG9 models on IID and Non-IID datasets when using PPFL. Horizontal dashed lines refer to the accuracy that centralized training reaches after every 50 epochs. Note: end-to-end (E2E) FL trains the complete model rather than each layer, and the 'Layer No.' on the x-axis applies only to PPFL.

As shown in Figure 2, training LeNet on the "easy" task of MNIST data (IID or not) quickly leads to high ML performance, regardless of the FL system used. Training AlexNet on IID and Non-IID CIFAR10 data reaches test accuracies of 74% and 60.78%, respectively, while centralized training reaches 83.34%. Training VGG9, a more complex model, on IID and Non-IID CIFAR10 data leads to lower performances of 74.60% and 38.35%, respectively, while centralized training reaches 85.09%. We note the drop in PPFL's performance whenever a new layer is brought into the training. This is to be expected, since PPFL starts the new layer from scratch, leading to a significant performance drop in the first FL rounds. Towards the end of the 50 rounds, however, PPFL's performance matches, and in some cases surpasses, that of end-to-end FL.

In general, as more layers are included in the training, the test accuracy increases. Interestingly, in more complex models (e.g., VGG9) with Non-IID data, PPFL can suffer a drop in ML performance as the number of layers keeps increasing. In fact, in these experiments it only reaches ∼55% after finishing the second layer and then drops. One possible reason for this degradation is that the first layers of VGG9 are small and may not be capable of capturing heterogeneous features among Non-IID data, which consequently has a negative influence on the training of the later layers. On the other hand, this suggests that we can allow early exits for greedy layer-wise PPFL on Non-IID data: for example, clients that do not have enough data, or that already reach high test accuracy after training the first layers, can quit before participating in further communication rounds. Overall, layer-wise training outperforms end-to-end FL during the training of the first or second layer.

We further discuss possible reasons for PPFL's better ML performance compared to end-to-end FL. On the one hand, this could be due to some DNN architectures (e.g., VGG9) being more suitable for layer-wise FL. For example, training each layer separately may allow PPFL to overcome local optima at which backward propagation can "get stuck" in end-to-end FL. On the other hand, hyper-parameter tuning may help improve performance in both layer-wise and end-to-end FL, always with the risk of overfitting the data. Indeed, achieving the best possible ML performance was not our focus, and more in-depth study is needed in the future to understand under which setups layer-wise training can perform better than end-to-end FL.

6.4 What is the PPFL Client-Side System Cost?
We further investigate the system performance and costs on the client devices with respect to CPU execution time, memory usage, and energy consumption. Figure 3 shows the results for all three metrics when training LeNet on MNIST, and AlexNet and VGG9 on CIFAR10, on IID data. The metrics are computed for one step of training (i.e., one batch of data). More training steps require analogously more CPU time and energy, but do not influence memory usage, since the memory allocated for the model is reused for all subsequent steps. Here, we compare PPFL with layer-wise training without TEEs, to measure the overhead of using the TEE. Among the trained models, the maximum overhead is 14.6% for CPU time, 18.31% for memory usage, and 21.19% for energy consumption. In addition, when training each layer, PPFL has comparable results with end-to-end training (i.e., the horizontal dashed lines in Figure 3).
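For intuition on how these per-step overhead percentages can be derived, the sketch below (illustrative Python, not our actual measurement harness) times one training step with and without TEE protection and reports the relative difference; memory would be compared analogously via peak usage per step, while energy requires an external power monitor attached to the device.

    # Illustrative sketch: relative per-step CPU-time overhead of TEE-protected training.
    # `step_with_tee` and `step_without_tee` are callables, assumed to run exactly one
    # training step (one batch) of the same model with and without the TEE.
    import time

    def mean_cpu_time(step_fn, repeats=20):
        # Average CPU time of one training step, in seconds (CPU time, not wall-clock).
        start = time.process_time()
        for _ in range(repeats):
            step_fn()
        return (time.process_time() - start) / repeats

    def tee_overhead_percent(step_with_tee, step_without_tee):
        # Overhead (%) of placing the layer under training inside the TEE.
        t_tee = mean_cpu_time(step_with_tee)
        t_ref = mean_cpu_time(step_without_tee)
        return 100.0 * (t_tee - t_ref) / t_ref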

Figure 3: System performance of the client devices when training LeNet, AlexNet, and VGG9 using PPFL, measured on one step of training (i.e., one batch of data). The light grey bar refers to learning without TEEs, and the black bar refers to the overhead when the layer under training is inside the TEE. The percentage (%) of the overhead (averaged over one model) is shown above these bars. Horizontal dashed lines signify the cost of end-to-end FL. On the x-axis, 'c' refers to 'classifier'.
6.5 What are the PPFL ML and System Costs if Blocks of Layers are Trained in Clients?

As explained in Algorithm 1 of Sec. 4.2, if the TEE can hold more than one layer, it is also possible to put a block of layers inside the TEE for training. Indeed, heterogeneous devices and TEEs can have different memory sizes, thus supporting a wide range of block sizes. For these experiments, we assume all devices have the same TEE size, construct 2-layer blocks, and measure the resulting ML performance and system cost on CIFAR10. The performance of three or more layers inside TEEs could be measured in a similar fashion (if the TEE's memory can fit them). We do not test LeNet on MNIST because it easily reaches high accuracy (around 99%), as shown earlier and in previous studies [43, 72].
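As a rough illustration of how layers could be grouped under a given TEE memory budget, the sketch below (PyTorch-style Python; this is not Algorithm 1 itself) greedily packs consecutive layers into blocks whose parameters fit the budget. The 4-bytes-per-parameter estimate is a simplification made for this sketch, and a realistic budget would also need to cover activations, gradients, and the TEE runtime.

    # Illustrative sketch: greedily group consecutive layers into blocks that fit a
    # TEE memory budget (in bytes). Only parameter storage is counted here.
    def layer_bytes(layer, bytes_per_param=4):
        return sum(p.numel() for p in layer.parameters()) * bytes_per_param

    def make_blocks(layers, tee_budget_bytes):
        blocks, current, used = [], [], 0
        for layer in layers:
            size = layer_bytes(layer)
            if current and used + size > tee_budget_bytes:
                blocks.append(current)      # close the current block
                current, used = [], 0
            current.append(layer)           # a layer larger than the budget still
            used += size                    # forms its own (oversized) block here
        if current:
            blocks.append(current)
        return blocks                       # each block is then trained as one unit

With identical TEE budgets on all devices, such a grouping degenerates to fixed-size blocks; in the experiments below we simply use blocks of two consecutive layers.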
Figure 4: Test accuracy of training AlexNet and VGG9 models on CIFAR10 (IID and Non-IID) when using PPFL with blocks of two layers in the TEE; (a) AlexNet on CIFAR10, (b) VGG9 on CIFAR10. Note: horizontal dashed lines refer to the accuracy that end-to-end (E2E) FL reaches after 50 communication rounds.

Figure 5: System performance of the client devices when training AlexNet and VGG9 models on CIFAR10 when using PPFL with blocks of two layers in the TEE (same settings as in Figure 4), measured on one step of training. The light grey bar refers to learning without TEEs, and the black bar refers to the overhead when the block's layers under training are inside the TEE. The percentage (%) of the overhead is shown above these bars. Horizontal dashed lines refer to the cost of end-to-end FL. 'c' refers to 'classifier'.
Results in Figure 4 indicate that training blocks of layers can reach similar or even better ML performance compared to training each layer separately (i.e., see Fig. 2). It can also improve the test accuracy of complex models such as VGG9, for which we noted a degradation of ML performance caused by the first layer's small size and its incapacity to model the data (see Fig. 2). In addition, compared to training one layer at a time, training 2-layer blocks reduces the total communication required to reach the desired ML performance. In fact, while aiming to reach the same baseline accuracy as in Table 3, training 2-layer blocks requires half or less of the communication cost of 1-layer blocks (see Table 5). Also, layer-wise training outperforms end-to-end FL for similar reasons as outlined for Figure 2.

Table 5: Reduction of communication rounds and amount when training 2-layer instead of 1-layer blocks.

Model     Data      Comm. Rounds       Comm. Amount
AlexNet   IID       0.65× → 0.18×      0.63× → 0.27×
AlexNet   Non-IID   0.53× → 0.29×      0.53× → 0.44×
VGG9      IID       1.14× → 0.43×      2.87× → 1.07×
VGG9      Non-IID   0.24× → 0.11×      0.60× → 0.27×

Regarding the system cost, results across models show that the maximum overhead is 13.24% in CPU time, 32.71% in memory usage, and 14.47% in energy consumption (see Fig. 5). Compared to training one layer at a time, training layer blocks does not always increase the overhead. For example, the overhead when running VGG9 changes from 13.22% to 8.46% in CPU time, from 2.44% to 3.17% in memory usage, and from 21.19% to 14.47% in energy consumption. One explanation is that combining layers into blocks amortizes the cost of "expensive" with "cheap" layers. Interestingly, PPFL still has a comparable cost with end-to-end FL training.

6.6 Can Bootstrapping PPFL with Public Knowledge Help?

We investigate how the backend server of PPFL can use existing, public models to bootstrap the training process for a given task. For this purpose, we transfer two models (MobileNetv2 and VGG16), pre-trained on ImageNet, to the classification task on CIFAR10. Because these pre-trained models contain sufficient knowledge relevant to the target task, training the last few layers is already adequate for good ML performance. Consequently, we can freeze all Conv layers and train the last FC layers within TEEs, thus protecting them as well. By default, MobileNetv2 has one FC layer and VGG16 has three FC layers at the end. We test both cases, in which one and three FC layers are attached and re-trained for the two models, respectively. CIFAR10 is resized to 224 × 224 in order to fit the input size of these pre-trained models. We start with a smaller learning rate of 0.001 to avoid divergence, and a momentum of 0.9, because the feature extractors are already well-trained.
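This client-side setup can be sketched as follows in PyTorch-style Python (for illustration only; it is not our actual implementation, which keeps the trainable FC layers inside the on-device TEE): the pre-trained feature extractor is frozen, a small FC head for CIFAR10 is attached, inputs are resized to 224 × 224, and only the head is optimized with the stated learning rate and momentum.

    # Illustrative sketch: bootstrap from a public ImageNet model by freezing the
    # convolutional feature extractor and training only the FC head, i.e., the part
    # that PPFL would keep and update inside the TEE.
    import torch
    import torch.nn as nn
    from torchvision import models, transforms

    model = models.mobilenet_v2(pretrained=True)         # public feature extractor
    for p in model.features.parameters():
        p.requires_grad = False                           # freeze all Conv layers

    model.classifier = nn.Linear(model.last_channel, 10)  # 1 FC layer for CIFAR10

    preprocess = transforms.Compose([
        transforms.Resize(224),                           # CIFAR10 resized to 224x224
        transforms.ToTensor(),
    ])

    optimizer = torch.optim.SGD(model.classifier.parameters(), lr=0.001, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

For the three-FC-layer variant, the single Linear head would be replaced by a small stack of FC layers; in either case, only the head parameters are trained locally and exchanged with the server.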
Test Accuracy. Figure 6 shows that the use of pre-trained first layers (i.e., feature extractors) to bootstrap the learning process can help the final PPFL models reach test accuracy similar to centralized training. Interestingly, transferring pre-trained layers from VGG16 reaches higher test accuracy than from MobileNetv2. This is expected, because VGG16 contains many more DNN parameters than MobileNetv2, which provides better feature extraction capabilities. Surprisingly, attaching and training more FC layers at the end of either model does not improve test accuracy. This can be due to the bottleneck of the transferred feature extractors, which, since they are frozen, do not allow the model to fully capture the variability of the new data.

Figure 6: Test accuracy of training on CIFAR10 (IID and Non-IID) with public models (MobileNetv2 and VGG16) pre-trained on ImageNet. Both models are trained and tested with 1 and 3 FC layers attached at the end of each model.

Client-side System Cost. In order to measure the client-side cost under this setting, we need to make some experimental adjustments. VGG16 (even its last FC layers) is too large to fit in TEEs. Thus, we reduce the batch size to 1 and proportionally scale down all layers (e.g., from 4096 to 1024 neurons for one FC layer). Indeed, scaling layers may bias the results, but the actual performance cannot be worse than this estimation: as shown in [49], larger models have less overhead because their last layers are relatively smaller compared to the complete size of the model.

Interestingly, the results shown in Figure 7 indicate that when we train and keep the last FC layers inside the client's on-device TEEs, only a small overhead is incurred in terms of CPU time (6.9%), memory usage (1.3%), and energy consumption (5.5%) in either model. These results highlight that transferring knowledge can be a good alternative for bootstrapping PPFL training while keeping system overhead low. In addition, we note that when the server does not have suitable public models, it is possible to first train a model on public datasets that have a similar distribution to the local datasets. We refer to Appendix A.1 for more details on the experimental results.

Figure 7: System performance of client devices when training with transferred public models on CIFAR10, measured on one step of training. The light grey bar refers to learning without TEEs; the black bar refers to the overhead when the layers under training are in the TEE. The percentage (%) of overhead is shown above the bars. MN1: MobileNetv2 with one FC layer for training (i.e., '1 layer' in Figure 6a). VGGs: a scaled-down version of VGG16.

7 DISCUSSION & FUTURE WORK

Key Findings. PPFL's experimental evaluation showed that:
• Protecting the training process (i.e., gradient updates) inside TEEs, and exposing layers only after convergence, can thwart data reconstruction and property inference attacks. Also, keeping a model's last layer inside TEEs mitigates membership inference attacks.
• Greedy layer-wise FL can achieve ML utility comparable to end-to-end FL. While layer-wise FL increases the total number of communication rounds needed to finish all layers, it can reach the same test accuracy as end-to-end FL with fewer rounds (0.538×) and a comparable amount of communication (1.002×).
• Most of PPFL's system cost comes from the clients' local training: up to ∼15% in CPU time, ∼18% in memory usage, and ∼21% in energy consumption when training different models and data, compared to training without TEEs.
• Training 2-layer blocks decreases communication cost by at least half, and only slightly increases system overhead (i.e., CPU time, memory usage, energy consumption) in the case of small models.
• Bootstrapping the PPFL training process with pre-trained models can significantly increase ML utility, and reduce the overall cost in communication and system overhead.

Dishonest Attacks. The attacks tested here assume the classic 'honest-but-curious' adversary [57]. In FL, however, there are also dishonest attacks such as backdoor [4, 67] or poisoning attacks [15], whose goal is to actively change the global model's behavior, e.g., for surreptitious unauthorized access to the global model [26]. In the future, we will investigate how TEEs' security properties can defend against such attacks.

Privacy and Cost Trade-off. PPFL guarantees 'full' privacy by keeping layers inside TEEs. However, executing computations in secure environments inevitably leads to system costs. To reduce such costs, one can relax the privacy requirements, potentially increasing privacy risks due to inference attacks with a higher 'advantage' [76]. For example, clients who do not care about high-level information leakages (i.e., learned model parameters), but want to protect the original local data, can choose to hide only the first layers of the model in TEEs. We also expect that by dropping clients that already achieve good performance when training the latter layers, we could gain better performance. This may further benefit personalization and achieve better privacy, utility, and cost trade-offs.

Model Architectures. The models tested in our layer-wise FL have linear links across consecutive layers. However, our framework can be easily extended to other model architectures that have been studied in standard layer-wise training. For example, one can perform layer-wise training on (i) Graph Neural Networks, by disentangling feature aggregation and feature transformation [74], and (ii) Long Short-Term Memory networks (LSTMs), by adding hidden layers [60]. There are other architectures that contain skip connections jumping over some layers, such as ResNet [19]. No layer-wise training has been investigated for ResNets, but training a block of layers could be attempted by including the jumping shortcut inside a block.

Accelerating Local Training. PPFL uses only the CPU of client devices for local training, and training each layer does not introduce parallel processing on a device. Indeed, more effective ways to perform this compute load can be devised. One way is for clients to use specialized processors (i.e., GPUs) to accelerate training. PPFL's design can integrate such advances mainly in two ways. First, the client can outsource the first, well-trained, but non-sensitive layers to specialized processors that can share the computation and speed up local training. Second, recently proposed GPU-based TEEs can support intensive deep-learning computation in high-end servers [22, 24]; such TEEs on client devices could greatly speed up local training. However, as GPU-based TEEs still require a small TCB to restrict the attack surface, PPFL's design can provide a way to leverage the limited TEE space for privacy-preserving local training.

Federated Learning Paradigms. PPFL was tested with FedAvg, but other state-of-the-art FL paradigms are also compatible with PPFL. PPFL leverages greedy layer-wise learning but does not modify the hyper-parameter determination and loss function (which have been improved in FedProx [39]) or the aggregation (which is neuron-matching-based in FedMA [72]). Compared with PPFL, which trains one layer until convergence, FedMA, which also uses layer-wise learning, trains each layer for one round and then moves to the next layer; after finishing all layers, it starts again from the first. Thus, FedMA is still vulnerable, because the gradients of one layer are accessible to adversaries. PPFL could leverage FedMA's neuron-matching technique when dealing with heterogeneous (i.e., Non-IID) data [28]. Besides, our framework is compatible with other privacy-preserving techniques (e.g., differential privacy) in FL. This is useful during the model usage phase, where some users may not have TEEs. PPFL can also be useful to systems such as FLaaS [31] that enable third-party applications to build collaborative ML models on the device shared by said applications.

8 CONCLUSION
In this work, we proposed PPFL, a practical, privacy-preserving federated learning framework which protects clients' private information against known privacy-related attacks. PPFL adopts greedy layer-wise FL training and always updates layers inside Trusted Execution Environments (TEEs) at both the server and the clients. We implemented PPFL with a mobile-like TEE (i.e., TrustZone) and a server-like TEE (i.e., Intel SGX) and empirically tested its performance. For the first time, we showed the possibility of fully guaranteeing privacy while achieving ML model utility comparable to regular end-to-end FL, without significant communication and system overhead.

9 ACKNOWLEDGMENTS
We acknowledge the constructive feedback from the anonymous reviewers. The research leading to these results received partial funding from the EU H2020 Research and Innovation programme under grant agreements No 830927 (Concordia), No 871793 (Accordion), No 871370 (Pimcity), and EPSRC Databox and DADA grants (EP/N028260/1, EP/R03351X/1). These results reflect only the authors' view, and the Commission and EPSRC are not responsible for any use that may be made of the information they contain.

REFERENCES
[1] Amacher, J., and Schiavoni, V. On the performance of arm trustzone. In IFIP International Conference on Distributed Applications and Interoperable Systems (2019), Springer, pp. 133–151.
[2] Aono, Y., Hayashi, T., Wang, L., Moriai, S., et al. Privacy-preserving deep learning via additively homomorphic encryption. IEEE Transactions on Information Forensics and Security 13, 5 (2017), 1333–1345.
[3] Bagdasaryan, E., Poursaeed, O., and Shmatikov, V. Differential privacy has disparate impact on model accuracy. In Advances in Neural Information Processing Systems (2019), pp. 15479–15488.
[4] Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., and Shmatikov, V. How to backdoor federated learning. In International Conference on Artificial Intelligence and Statistics (2020), PMLR, pp. 2938–2948.
[5] Belilovsky, E., Eickenberg, M., and Oyallon, E. Greedy layerwise learning can scale to imagenet. In International Conference on Machine Learning (2019), PMLR, pp. 583–593.
[6] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems 19 (2006), 153–160.
[7] Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., Kiddon, C., Konečný, J., Mazzocchi, S., McMahan, H. B., et al. Towards federated learning at scale: System design. In Conference on Machine Learning and Systems (2019).
[8] Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., Ramage, D., Segal, A., and Seth, K. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (2017), pp. 1175–1191.
[9] Brownlee, J. A Gentle Introduction to Transfer Learning for Deep Learning, 2019 (accessed November 11, 2020).
[10] Chen, H., Fu, C., Rouhani, B. D., Zhao, J., and Koushanfar, F. Deepattest: An end-to-end attestation framework for deep neural networks. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA) (2019), IEEE, pp. 487–498.
[11] Chen, J., Pan, X., Monga, R., Bengio, S., and Jozefowicz, R. Revisiting distributed synchronous sgd. In ICLR Workshop Track (2016).
[12] Chen, Z., Vasilakis, G., Murdock, K., Dean, E., Oswald, D., and Garcia, F. D. Voltpillager: Hardware-based fault injection attacks against intel SGX enclaves using the SVID voltage scaling interface. In 30th USENIX Security Symposium (Vancouver, B.C., Aug. 2021).
[13] Costan, V., and Devadas, S. Intel sgx explained. IACR Cryptol. ePrint Arch. 2016, 86 (2016), 1–118.
[14] Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9, 3-4 (2014), 211–407.
[15] Fang, M., Cao, X., Jia, J., and Gong, N. Local model poisoning attacks to byzantine-robust federated learning. In 29th USENIX Security Symposium (2020), pp. 1605–1622.
[16] Geiping, J., Bauermeister, H., Dröge, H., and Moeller, M. Inverting gradients – how easy is it to break privacy in federated learning? arXiv preprint arXiv:2003.14053 (2020).
[17] Geyer, R. C., Klein, T., and Nabi, M. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557 (2017).
[18] Gu, Z., Huang, H., Zhang, J., Su, D., Jamjoom, H., Lamba, A., Pendarakis, D., and Molloy, I. Yerbabuena: Securing deep learning inference data via enclave-based ternary model partitioning. arXiv preprint arXiv:1807.00969 (2018).
[19] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.
[20] Hitaj, B., Ateniese, G., and Perez-Cruz, F. Deep models under the gan: information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (2017), pp. 603–618.
[21] Huang, T., Lin, W., Wu, W., He, L., Li, K., and Zomaya, A. Y. An efficiency-boosting client selection scheme for federated learning with fairness guarantee. arXiv preprint arXiv:2011.01783 (2020).
[22] Hunt, T., Jia, Z., Miller, V., Szekely, A., Hu, Y., Rossbach, C. J., and Witchel, E. Telekine: Secure computing with cloud gpus. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI '20) (2020), pp. 817–833.
[23] Hunt, T., Song, C., Shokri, R., Shmatikov, V., and Witchel, E. Chiron: Privacy-preserving machine learning as a service. arXiv preprint arXiv:1803.05961 (2018).
[24] Jang, I., Tang, A., Kim, T., Sethumadhavan, S., and Huh, J. Heterogeneous isolated execution for commodity gpus. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (2019), pp. 455–468.
[25] Jayaraman, B., and Evans, D. Evaluating differentially private machine learning in practice. In 28th USENIX Security Symposium (USENIX Security 19) (Santa Clara, CA, Aug. 2019), USENIX Association, pp. 1895–1912.
[26] Jere, M. S., Farnan, T., and Koushanfar, F. A taxonomy of attacks on federated learning. IEEE Security & Privacy (2020).
[27] Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977 (2019).
[28] Katevas, K., Bagdasaryan, E., Waterman, J., Safadieh, M. M., Birrell, E., Haddadi, H., and Estrin, D. Policy-based federated learning. arXiv preprint arXiv:2003.06612 (2021).
[29] Kaya, Y., Hong, S., and Dumitras, T. Shallow-deep networks: Understanding and mitigating network overthinking. In International Conference on Machine Learning (2019), PMLR, pp. 3301–3310.
[30] Knauth, T., Steiner, M., Chakrabarti, S., Lei, L., Xing, C., and Vij, M. Integrating remote attestation with transport layer security. arXiv preprint arXiv:1801.05863 (2018).
[31] Kourtellis, N., Katevas, K., and Perino, D. Flaas: Federated learning as a service. In Workshop on Distributed ML (2020), ACM CoNEXT.
[32] Krawczyk, H. Sigma: The 'sign-and-mac' approach to authenticated diffie-hellman and its use in the ike protocols. In Annual International Cryptology Conference (2003), Springer, pp. 400–425.
[33] Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Citeseer (2009).
[34] Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. Communications of the ACM 60, 6 (2017), 84–90.
[35] Larochelle, H., Bengio, Y., Louradour, J., and Lamblin, P. Exploring strategies for training deep neural networks. Journal of Machine Learning Research 10, 1 (2009).
[36] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature 521, 7553 (2015), 436–444.
[37] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.
[38] Lee, J., Jang, J., Jang, Y., Kwak, N., Choi, Y., Choi, C., Kim, T., Peinado, M., and Kang, B. B. Hacking in Darkness: Return-Oriented Programming against Secure Enclaves. In Proceedings of the 26th USENIX Conference on Security Symposium (2017), pp. 523–539.
[39] Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127 (2018).
[40] Linaro.org. Open Portable Trusted Execution Environment, 2020 (accessed September 3, 2020).
[41] Lipp, M., Kogler, A., Oswald, D., Schwarz, M., Easdon, C., Canella, C., and Gruss, D. PLATYPUS: Software-based Power Side-Channel Attacks on x86. In 2021 IEEE Symposium on Security and Privacy (SP) (2021), IEEE.
[42] Liu, Y., Kang, Y., Xing, C., Chen, T., and Yang, Q. A secure federated transfer learning framework. IEEE Intelligent Systems (2020).
[43] McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics (2017), PMLR, pp. 1273–1282.
[44] McMahan, H. B., Ramage, D., Talwar, K., and Zhang, L. Learning differentially private recurrent language models. In International Conference on Learning Representations (2018).
[45] Melis, L., Song, C., De Cristofaro, E., and Shmatikov, V. Exploiting unintended feature leakage in collaborative learning. In 2019 IEEE Symposium on Security and Privacy (SP) (2019), IEEE, pp. 691–706.
[46] Microsoft. Open Enclave SDK, 2020 (accessed December 4, 2020).
[47] Mo, F., Borovykh, A., Malekzadeh, M., Haddadi, H., and Demetriou, S. Layer-wise characterization of latent information leakage in federated learning. ICLR Distributed and Private Machine Learning Workshop (2021).
[48] Mo, F., and Haddadi, H. Efficient and private federated learning using tee. In EuroSys (2019).
[49] Mo, F., Shamsabadi, A. S., Katevas, K., Demetriou, S., Leontiadis, I., Cavallaro, A., and Haddadi, H. Darknetz: towards model privacy at the edge using trusted execution environments. In Proceedings of the 18th International Conference on Mobile Systems, Applications, and Services (2020), pp. 161–174.
[50] Monsoon. Monsoon solutions inc. home page. https://www.msoon.com/, 2020 (accessed November 12, 2020).
[51] Naehrig, M., Lauter, K., and Vaikuntanathan, V. Can homomorphic encryption be practical? In Proceedings of the 3rd ACM Workshop on Cloud Computing Security (2011), pp. 113–124.
[52] Nasr, M., Shokri, R., and Houmansadr, A. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In 2019 IEEE Symposium on Security and Privacy (SP) (2019), IEEE, pp. 739–753.
[53] Nishio, T., and Yonetani, R. Client selection for federated learning with heterogeneous resources in mobile edge. In IEEE International Conference on Communications (ICC) (2019), IEEE, pp. 1–7.
[54] Ohrimenko, O., Schuster, F., Fournet, C., Mehta, A., Nowozin, S., Vaswani, K., and Costa, M. Oblivious multi-party machine learning on trusted processors. In 25th USENIX Security Symposium (2016), pp. 619–636.
[55] Paladi, N., Karlsson, L., and Elbashir, K. Trust Anchors in Software Defined Networks. In Computer Security (2018), pp. 485–504.
[56] Pan, S. J., and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2009), 1345–1359.
[57] Paverd, A., Martin, A., and Brown, I. Modelling and automatically analysing privacy properties for honest-but-curious adversaries. Tech. Rep. (2014).
[58] Redmon, J. Darknet: Open source neural networks in c. http://pjreddie.com/darknet/, 2013–2016.
[59] Sablayrolles, A., Douze, M., Schmid, C., Ollivier, Y., and Jégou, H. White-box vs black-box: Bayes optimal strategies for membership inference. In International Conference on Machine Learning (2019), pp. 5558–5567.
[60] Sagheer, A., and Kotb, M. Unsupervised pre-training of a deep lstm-based stacked autoencoder for multivariate time series forecasting problems. Scientific Reports 9, 1 (2019), 1–16.
[61] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 4510–4520.
[62] Schuster, F., Costa, M., Fournet, C., Gkantsidis, C., Peinado, M., Mainar-Ruiz, G., and Russinovich, M. Vc3: Trustworthy data analytics in the cloud using sgx. In 2015 IEEE Symposium on Security and Privacy (2015), IEEE, pp. 38–54.
[63] Microsoft SEAL (release 3.5). https://github.com/Microsoft/SEAL, Apr. 2020. Microsoft Research, Redmond, WA.
[64] Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP) (2017), IEEE, pp. 3–18.
[65] Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[66] Subramani, P., Vadivelu, N., and Kamath, G. Enabling fast differentially private sgd via just-in-time compilation and vectorization. arXiv preprint arXiv:2010.09063 (2020).
[67] Sun, Z., Kairouz, P., Suresh, A. T., and McMahan, H. B. Can you really backdoor federated learning? arXiv preprint arXiv:1911.07963 (2019).
[68] Testuggine, D., and Mironov, I. Introducing Opacus: A high-speed library for training PyTorch models with differential privacy, 2020 (accessed January 1, 2021).
[69] Torrey, L., and Shavlik, J. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global, 2010, pp. 242–264.
[70] Tramèr, F., and Boneh, D. Slalom: Fast, verifiable and private execution of neural networks in trusted hardware. In International Conference on Learning Representations (ICLR) (2019).
[71] Van Bulck, J., Oswald, D., Marin, E., Aldoseri, A., Garcia, F. D., and Piessens, F. A Tale of Two Worlds: Assessing the Vulnerability of Enclave Shielding Runtimes. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS) (2019), pp. 1741–1758.
[72] Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., and Khazaeni, Y. Federated learning with matched averaging. arXiv preprint arXiv:2002.06440 (2020).
[73] Yeom, S., Giacomelli, I., Fredrikson, M., and Jha, S. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF) (2018), IEEE, pp. 268–282.
[74] You, Y., Chen, T., Wang, Z., and Shen, Y. L2-gcn: Layer-wise and learned efficient training of graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), pp. 2127–2135.
[75] Zhang, X., Li, F., Zhang, Z., Li, Q., Wang, C., and Wu, J. Enabling execution assurance of federated learning at untrusted participants. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications (2020), IEEE, pp. 1877–1886.
[76] Zhao, B. Z. H., Kaafar, M. A., and Kourtellis, N. Not one but many tradeoffs: Privacy vs. utility in differentially private machine learning. In Cloud Computing Security Workshop (2020), ACM CCS.
[77] Zhao, S., Zhang, Q., Qin, Y., Feng, W., and Feng, D. Sectee: A software-based approach to secure enclave architecture using tee. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (2019), pp. 1723–1740.
[78] Zhu, L., Liu, Z., and Han, S. Deep leakage from gradients. In Advances in Neural Information Processing Systems (2019), pp. 14774–14784.

A APPENDIX

A.1 Transferring Public Datasets
The server can potentially gather data that have a similar distribution to the clients' private data. During initialization, the server trains a global model on the gathered data rather than using an existing model. Then, the server broadcasts the trained model to the clients' devices. Clients feed their private data into the model but update only the last layers inside the TEE during local training. Also, only the last layers being trained are uploaded to the server for secure aggregation. Because the server holds public data, we expect it to retrain the complete model before each communication round in order to keep fine-tuning the first layers. Here, we fix the number of communication rounds to 20 and measure only the test accuracy. We expect the system cost to be similar to that of transferring from public models, because, similarly, only the last layers are trained at the client-side.
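A minimal sketch of this bootstrapping loop is shown below in illustrative Python; the server and client methods (train_on, broadcast, local_update_last_layers, aggregate_last_layers) are hypothetical placeholders standing in for the corresponding PPFL components, not actual APIs.

    # Illustrative sketch: bootstrap FL from a public dataset held by the server.
    # Clients only update the last `k` layers (inside their TEEs) each round, while
    # the server keeps fine-tuning the full model on its public data.
    def bootstrap_with_public_data(server, clients, public_data, k, rounds=20):
        server.model = server.train_on(public_data)            # initial global model
        for _ in range(rounds):
            server.broadcast(server.model)
            updates = [c.local_update_last_layers(server.model, k) for c in clients]
            server.model = server.aggregate_last_layers(server.model, updates)
            # Re-train the complete model on public data to keep fine-tuning the
            # first layers, which stay frozen at the clients.
            server.model = server.train_on(public_data, init=server.model)
        return server.model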
Test accuracy results are shown in Figure 8. They indicate that, in general, when the server holds more public data, the final global model can reach a higher test accuracy. This is as expected, since the server gathers a larger part of the training datasets; with the complete training datasets, this process finally becomes centralized training. Nevertheless, this does not always hold. For example, in the IID case (see the two left plots in Figure 8), when training all layers, servers with a 0.1 fraction of public data outperform servers without public data (i.e., end-to-end FL), while for Non-IID CIFAR10, servers with a 0.1 fraction cannot outperform the one without public data (see the right plots in Figure 8b). One reason is that the first layers, which are trained on the public datasets, cannot represent all features of the private datasets. We also observe that when the server does not have enough public data (e.g., a 0.1 fraction), training only the last 1 or 2 layers can lead to extremely low performance or even failure. Again, this is because the first layers cannot represent the clients' datasets sufficiently.

Another observation is that the number of trained last layers does not have a significant influence on test accuracy in the IID cases, especially when the server holds more public data. This is because learning from IID public data is able to represent the feature space of the complete (private) training datasets. However, the results change in the Non-IID case, where the number of trained last layers has a significant influence on test accuracy. For instance, for VGG9, training only the last 1 or 2 layers at the client-side performs much worse than training 3 or 4 layers (see the right plots in Figure 8b). Moreover, training 3 or 4 layers tends to give better test accuracy than training more layers (e.g., all layers). One explanation is that the feature extraction capability of the first layers is already good enough when the server has a lot of public data, so fine-tuning these first layers at the client (e.g., training all layers) may destroy the model and consequently drop the accuracy.

Overall, by training only the last several layers at the client-side, PPFL with public datasets can guarantee privacy and, at the same time, achieve better performance than training all layers.

Figure 8: Test accuracy when learning with public datasets: (a) transfer from public MNIST, to train LeNet; (b) transfer from public CIFAR10, to train VGG9. Lines correspond to training different numbers of last layers at the clients (from 1 layer up to all layers); the short red line starting from the y-axis refers to end-to-end FL. Each trial runs 10 times, and error bars refer to the 95% confidence interval. (Note: in the top left figure, test accuracy is very high and almost the same, as the y-axis range is set the same for the same dataset, i.e., MNIST here. In the bottom right figure, i.e., for CIFAR10, several trials fail to train and thus the corresponding points are not plotted.)
