Privacy-Preserving Deep Action Recognition: An Adversarial Learning Framework and A New Dataset
Abstract—We investigate privacy-preserving, video-based action recognition in deep learning, a problem with growing importance in
smart camera applications. A novel adversarial training framework is formulated to learn an anonymization transform for input videos
such that the trade-off between target utility task performance and the associated privacy budgets is explicitly optimized on the
anonymized videos. Notably, the privacy budget, often defined and measured in task-driven contexts, cannot be reliably indicated using
any single model performance because strong protection of privacy should sustain against any malicious model that tries to steal
private information. To tackle this problem, we propose two new optimization strategies of model restarting and model ensemble to
achieve stronger universal privacy protection against any attacker models. Extensive experiments have been carried out and analyzed.
On the other hand, given few public datasets available with both utility and privacy labels, the data-driven (supervised) learning
cannot exert its full power on this task. We first discuss an innovative heuristic of cross-dataset training and evaluation, enabling the
use of multiple single-task datasets (one with target task labels and the other with privacy labels) in our problem. To further address
this dataset challenge, we have constructed a new dataset, termed PA-HMDB51, with both target task labels (action) and selected
privacy attributes (skin color, face, gender, nudity, and relationship) annotated on a per-frame basis. This first-of-its-kind video dataset
and evaluation protocol can greatly facilitate visual privacy research and open up other opportunities. Our codes, models, and the
PA-HMDB51 dataset are available at: https://github.com/VITA-Group/PA-HMDB51.
1 INTRODUCTION
generally outperforms the others under our framework and give intuitive explanations.

• Practical Approximations of "Universal" Privacy Protection. The privacy budget in our framework cannot be defined w.r.t. one model that predicts privacy attributes. Instead, the ideal protection of privacy must be universal and model-agnostic, i.e., preventing every possible attacker model from predicting private information. To resolve this so-called "∀ challenge", we propose two effective strategies, i.e., restarting and ensembling, to enhance the generalization capability of the learned anonymization to defend against unseen models. We leave it as our future work to find better methods for this challenge.

• A New Dataset with Action and Privacy Annotations. When it comes to evaluating privacy protection on complicated privacy attributes, there is no off-the-shelf video dataset with both action (utility) and privacy attributes annotated, either for training or testing. Such a dataset challenge is circumvented in our previous work [13] by using the VISPR [14] dataset as an auxiliary dataset to provide privacy annotations for cross-dataset evaluation (details in Section 3.6). However, this protocol inevitably suffers from the domain gap between the two datasets: while the utility was evaluated on one dataset, the privacy was measured on a different dataset. The incoherence in utility and privacy evaluation datasets makes the obtained utility-privacy trade-off less convincing. To reduce this gap, in this paper, we construct the very first testing benchmark dataset, dubbed Privacy-Annotated HMDB51 (PA-HMDB51), to evaluate privacy protection and action recognition on the same videos simultaneously. The new dataset consists of 515 videos originally from HMDB51. For each video, privacy labels (five attributes: skin color, face, gender, nudity, and relationship) are annotated on a per-frame basis. We benchmark our proposed framework on the new dataset and justify its effectiveness.

The paper is built upon our prior work [13] with multiple improvements: (1) a detailed discussion and comparison of three optimization strategies for the proposed framework; (2) a more extensive experimental and analysis section; and (3) most importantly, the construction of the new PA-HMDB51 dataset, together with the associated benchmark results.

2 RELATED WORK

2.1 Privacy Protection in Computer Vision

With pervasive cameras for surveillance or smart home devices, privacy-preserving action recognition has drawn increasing interest from both industry and academia.

Transmitting Feature Descriptors. A seemingly reasonable and computationally cheaper option is to extract feature descriptors from raw images and transmit only those features. Unfortunately, previous studies [15]–[19] revealed that considerable details of the original images could still be recovered from standard HOG, SIFT, LBP, 3D point clouds, Bag-of-Visual-Words, or neural network activations (even if they look visually distinct from natural images).

Homomorphic Cryptographic Solutions. Most classical cryptographic solutions secure communication against unauthorized access from attackers. However, they are not immediately applicable to preventing authorized agents (such as the back-end analytics) from the unauthorized abuse of information, causing privacy breach concerns. A few encryption-based solutions, such as Homomorphic Encryption (HE) [20], [21], were developed to locally encrypt visual information. The server can only get access to the encrypted data and conduct a utility task on it. However, many encryption-based solutions will incur high computational costs at local platforms. It is also challenging to generalize the cryptosystems to more complicated classifiers. Chattopadhyay et al. [22] combined the detection of regions of interest with real encryption techniques to improve privacy while allowing general surveillance to continue.

Anonymization by Empirical Obfuscations. An alternative approach towards a privacy-preserving vision system is based on the concept of anonymized videos. Such videos are intentionally captured or processed by empirical obfuscations to be in special low-quality conditions, which only allow for recognizing some target events or activities while avoiding the unwanted leak of the identity information of the human subjects in the video.

Ryoo et al. [10] showed that even at extremely low resolutions, reliable action recognition could be achieved by learning appropriate downsampling transforms, with neither unrealistic activity-location assumptions nor extra specific hardware resources. The authors empirically verified that conventional face recognition easily failed on the generated low-resolution videos. Butler et al. [11] used image operations like blurring and superpixel clustering to get anonymized videos, while Dai et al. [12] used extremely low resolution (e.g., 16×12) camera hardware to get anonymized videos. Winkler et al. [23] used cartoon-like effects with a customized version of mean shift filtering. Wang et al. [24] proposed a lens-free coded aperture (CA) camera system, producing visually unrecognizable and unrestorable image encodings. Pittaluga & Koppal [25], [26] proposed to use privacy-preserving optics to filter sensitive information from the incident light-field before sensor measurements are made, by k-anonymity and defocus blur. Earlier work of Jia et al. [27] explored privacy-preserving tracking and coarse pose estimation using a network of ceiling-mounted time-of-flight low-resolution sensors. Tao et al. [28] adopted a network of ceiling-mounted binary passive infrared sensors. However, both works [27], [28] handled only a limited set of activities performed at specific constrained areas in the room.

The usage of low-quality anonymized videos by obfuscations was computationally cheap and compatible with sensors' bandwidth constraints. However, the proposed obfuscations were not learned towards protecting any visual privacy, thus having limited effects. In other words, privacy protection came as a "side product" of obfuscation, and was not a result of any optimization, making the privacy protection capability very limited. What is more, the privacy-preserving effects were not carefully analyzed and evaluated by human studies or deep learning-based privacy recognition approaches. Lastly, none of the aforementioned empirical obfuscations extended their efforts to study deep learning-based action recognition, making their task performance less competitive. Similarly, the recent progress of low-resolution object recognition [29]–[31] also puts their privacy protection effects in jeopardy.
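To make the contrast with learned anonymization concrete, here is a minimal NumPy sketch of the simplest empirical obfuscation discussed above: spatial downsampling by block averaging, then upsampling back to the original frame size. The 4x factor and the block-averaging filter are illustrative assumptions rather than the exact settings of the cited works.

```python
import numpy as np

def downsample_obfuscate(frame: np.ndarray, factor: int = 4) -> np.ndarray:
    """Empirical obfuscation: average `factor` x `factor` blocks, then repeat each
    averaged pixel so the output keeps the input resolution."""
    h, w, c = frame.shape
    h, w = h - h % factor, w - w % factor            # crop to a multiple of `factor`
    blocks = frame[:h, :w].reshape(h // factor, factor, w // factor, factor, c)
    low_res = blocks.mean(axis=(1, 3))               # block average = box-filter downsample
    return np.repeat(np.repeat(low_res, factor, axis=0), factor, axis=1)

# Toy 120x160 RGB frame; a real pipeline would obfuscate every frame of the clip.
frame = np.random.randint(0, 256, (120, 160, 3)).astype(np.float32)
print(downsample_obfuscate(frame, factor=4).shape)   # (120, 160, 3)
```

Because nothing in this transform is optimized against a privacy objective, any protection it offers is a side effect, which is exactly the limitation discussed above.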
Learning-based Solutions. Very recently, a few learning-based approaches have been proposed to address privacy protection or fairness problems in vision-related tasks [13], [32]–[40]. Many of them exploited ideas from adversarial learning. They addressed this problem by learning data representations that simultaneously reduce the budget cost of privacy or fairness while maintaining the utility task performance.

Wu et al. [32] proposed an adversarial training framework dubbed Nuisance Disentangled Feature Transform (NDFT) to utilize free meta-data (i.e., altitudes, weather conditions, and viewing angles) in conjunction with associated UAV images to learn domain-robust features for object detection in UAV images. Pittaluga et al. [34] preserved the utility by maintaining the variance of the encoding or favoring a second classifier for a different attribute in training. Bertran et al. [35] motivated the adversarial learning framework as a distribution matching problem and defined the objective and the constraints in terms of mutual information. Roy & Boddeti [36] measured the uncertainty in the privacy-related attributes by the entropy of the discriminator's prediction. Oleszkiewicz et al. [41] proposed an empirical data-driven privacy metric based on mutual information to quantify the privatization effects on biometric images. Zhang et al. [37] presented an adversarial debiasing framework to mitigate the biases concerning demographic groups. Ren et al. [38] learned a face anonymizer in video frames while maintaining the action detection performance. Shetty et al. [39] presented an automatic object removal model that learns how to find and remove objects from general scene images via a generative adversarial network (GAN) framework.

2.2 Privacy Protection in Social Media/Photo Sharing

User privacy protection is also a topic of extensive interest in the social media field, especially for photo sharing. The most common means to protect user privacy in an uploaded photo is to add empirical obfuscations, such as blurring, mosaicing, or cropping out certain regions (usually faces) [42]. However, extensive research showed that such an empirical approach could be easily hacked [43], [44]. A recent work [45] described a game-theoretical system in which the photo owner and the recognition model strive for the antagonistic goals of disabling versus enabling recognition, and better obfuscation could be learned from their competition. However, their system was only designed to confuse one specific recognition model via finding its adversarial perturbations. Fooling only one recognition model can cause obvious overfitting, as merely changing to another recognition model will likely put the learning efforts in vain: such perturbations cannot even protect privacy from human eyes. The problem setting in [45] thus differs from our target problem. Another notable difference is that, in social photo sharing, one usually hopes to generate minimum perceptual quality loss to photos after applying any privacy-preserving transform to them. There is no such restriction in our scenario: we can apply a much more flexible and aggressive transformation to the image.

The visual privacy issues faced by blind people were revealed in [46] with the first dataset in this area. Concrete privacy attributes were defined in [14] together with their correlation with image content. The authors categorized possible private information in images, and they ran a user study to understand privacy preferences. They then provided a sizable set of 22k images annotated with 68 privacy attributes, on which they trained privacy attribute predictors.

3 METHOD

3.1 Problem Definition

Objective. Assume our training data X (raw visual data captured by a camera) are associated with a target utility task T and a privacy budget B. Since T is usually a supervised task, e.g., action recognition or visual tracking, a label set Y_T is provided on X, and a standard cost function L_T (e.g., cross-entropy) is defined to evaluate the task performance on T. Usually, there is a state-of-the-art deep neural network f_T, which takes X as input and predicts the target labels. On the other hand, we need to define a budget cost function J_B to evaluate its input data's privacy leakage: the smaller J_B(·) is, the less private information its input contains.

We seek an optimal anonymization function f_A^* to transform the original X into anonymized visual data f_A^*(X), and an optimal target model f_T^*, such that:

• f_A^* has filtered out the private information in X, i.e.,

    J_B(f_A^*(X)) ≪ J_B(X);

• the performance of f_T is minimally affected when using the anonymized visual data f_A^*(X) compared to when using the original data X, i.e.,

    L_T(f_T^*(f_A^*(X)), Y_T) ≈ min_{f_T} L_T(f_T(X), Y_T).

To achieve these two goals, we mathematically formulate the problem as solving the following optimization problem:

    f_A^*, f_T^* = argmin_{f_A, f_T} [ L_T(f_T(f_A(X)), Y_T) + γ J_B(f_A(X)) ].    (1)

Definition of J_B and L_T. The definition of the privacy budget cost J_B is not straightforward. Practically, it needs to be placed in concrete application contexts, often in a task-driven way. For example, in smart workplaces or smart homes with video surveillance, one might often want to avoid disclosure of the face or identity of persons. Therefore, reducing J_B could be interpreted as suppressing the success rate of identity recognition or verification. Other privacy-related attributes, such as race, gender, or age, can be similarly defined. We denote the privacy-related annotations (such as identity labels) as Y_B, and rewrite J_B(f_A(X)) as J_B(f_B(f_A(X)), Y_B), where f_B denotes the privacy budget model, which takes (anonymized or original) visual data as input and predicts the corresponding private information. Different from L_T, minimizing J_B will encourage f_B(f_A(X)) to diverge from Y_B. Without loss of generality, we assume both f_T and f_B to be classification models that output class labels.
Under this assumption, we choose both L_T and L_B as the cross-entropy function, and J_B as the negative cross-entropy function:

    J_B ≜ −H(Y_B, f_B(f_A(X))),

where H(·,·) is the cross-entropy function.

Two Challenges. Such a supervised, task-driven definition of J_B poses at least two challenges: (1) Dataset challenge: The privacy budget-related annotations, denoted as Y_B, often have less availability than target utility task labels. Specifically, it is often challenging to have both Y_T and Y_B available on the same X. (2) ∀ challenge: Considering the nature of privacy protection, it is not sufficient to merely suppress the success rate of one f_B model. Instead, we define a privacy prediction function family

    P : f_A(X) ↦ Y_B,

so that the ideal privacy protection by f_A should be reflected as suppressing every possible model f_B from P. That differs from the common supervised training goal, where only one model needs to be found to fulfill the target utility task successfully.

We address the dataset challenge in two ways: (1) cross-dataset training and evaluation (Section 3.4); and, more importantly, (2) building a new dataset annotated with both utility and privacy labels (Section 5). We defer their discussion to the respective experimental paragraphs.

Handling the ∀ challenge is more difficult. Firstly, we re-write the general form in Eq. (1) with the task-driven definition of J_B as follows:

    f_A^*, f_T^* = argmin_{(f_A, f_T)} [ L_T(f_T(f_A(X)), Y_T) + γ sup_{f_B ∈ P} J_B(f_B(f_A(X)), Y_B) ].    (2)

The ∀ challenge is the infeasibility of directly solving Eq. (2), due to the infinite search space of f_B in P. Secondly, we propose to solve the following approximate problem by setting f_B as a neural network with a fixed structure:

    f_A^*, f_T^* = argmin_{(f_A, f_T)} [ L_T(f_T(f_A(X)), Y_T) + γ max_{f_B} J_B(f_B(f_A(X)), Y_B) ].    (3)

Lastly, we propose "model ensemble" and "model restarting" (Section 3.5) to handle the ∀ challenge better and boost the experimental results further.

Considering the ∀ challenge, the evaluation protocol for privacy-preserving action recognition is more intricate than for the traditional action recognition task. We propose a two-step protocol (as described in Section 3.6) to evaluate f_A^* and f_T^* on the trade-off they have achieved between target task utility and privacy protection budget.

Solving the Minimax. Solving Eq. (3) is still challenging because the minimax problem is hard by its nature. Traditional minimax optimization algorithms based on alternating gradient descent can only find minimax points for convex-concave problems, and they achieve sub-optimal solutions on deep neural networks since these are neither convex nor concave. Some very recent minimax algorithms, such as K-Beam [47], have been shown to be promising in non-convex-concave and deep neural network applications. However, these methods rely on heavy parameter tuning and are effective only in limited situations. Besides, our optimization goal in Eq. (3) is even harder than common minimax objectives like those in GANs, which are often interpreted as a two-party competition game. In contrast, our Eq. (3) is more "hybrid" and can be interpreted as a more complicated three-party competition, where (adopting machine learning security terms) f_A is an obfuscator, f_T is a utilizer collaborating with the obfuscator, and f_B is an attacker trying to breach the obfuscator. Therefore, we see no obvious best choice among the off-the-shelf minimax algorithms to achieve our objective.

We are thus motivated to try different state-of-the-art minimax optimization algorithms on our framework. We tested two state-of-the-art minimax optimization algorithms, namely GRL [48] and K-Beam [47], on our framework and proposed an innovative entropy maximization method to solve Eq. (3). We empirically show that our entropy maximization algorithm outperforms both state-of-the-art minimax optimization algorithms and discuss its advantages. In Section 3.3, we present the comparison of the three methods and hope it will benefit future research on similar problems.

3.2 Basic Framework

Pipeline. Our framework is a privacy-preserving action recognition pipeline that uses video data as input. It is a prototype of the in-demand privacy protection in smart camera applications. Figure 1 depicts the basic framework implementing the proposed formulation in Eq. (3). The framework consists of three parts: the anonymization model f_A, the target utility model f_T, and the privacy budget model f_B. f_A takes raw video X as input, filters out private information in X, and outputs the anonymized video f_A(X). f_T takes f_A(X) as input and carries out the target utility task. f_B also takes f_A(X) as input and tries to predict the private information from f_A(X). All three models are implemented with deep neural networks, and their parameters are learnable during the training procedure. The entire pipeline is trained under the guidance of the hybrid loss of L_T and J_B. The training procedure has two goals. The first goal is to find an optimal anonymization model f_A^* that can filter out the private information in the original video while keeping useful information for the target utility task. The second goal is to find a target model that can achieve good performance on the target utility task using the anonymized videos f_A^*(X). Similar frameworks have been used in feature disentanglement [49]–[52]. After training, the learned anonymization model can be applied on a local device (e.g., a smart camera), by designing an embedded chipset responsible for the anonymization at the hardware level [38]. We can convert raw video to anonymized video locally and only transfer the anonymized video through the Internet to the backend (e.g., cloud) for target utility task analysis. The private information in the raw videos will thus be unavailable on the backend.

Implementation. Specifically, f_A is implemented using the model in [53], which can be taken as a 2D convolution-based frame-level filter. In other words, f_A converts each frame in X into a feature map of the same shape as the original frame. We use the state-of-the-art human action recognition model C3D [54] as f_T and state-of-the-art image classification models, such as ResNet [55] and MobileNet [56], as f_B.
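As a rough illustration of how the three models interact under Eq. (3), below is a minimal PyTorch-style sketch with tiny stand-in networks in place of the actual f_A (the transform of [53]), f_T (C3D), and f_B (ResNet/MobileNet). The module definitions, batch handling, and the single-frame budget input are simplifying assumptions; the real training follows the alternating schedules of Section 3.3 rather than a single loss call.

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the real architectures: f_A is a frame-level 2D conv filter,
# f_T a video-level action classifier (C3D in the paper), and f_B an image-level
# privacy classifier (ResNet / MobileNet variants in the paper).
f_A = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 3, 3, padding=1))
f_T = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(3, 8))    # 8 actions
f_B = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 13))   # 13 identities

ce = nn.CrossEntropyLoss()
gamma = 0.5  # trade-off weight between utility loss and privacy budget

def hybrid_losses(clip, y_T, y_B):
    """clip: (N, C, T, H, W) raw video; y_T: action labels; y_B: privacy labels."""
    n, c, t, h, w = clip.shape
    frames = clip.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
    anon = f_A(frames)                                 # anonymize every frame independently
    anon_clip = anon.reshape(n, t, c, h, w).permute(0, 2, 1, 3, 4)
    L_T = ce(f_T(anon_clip), y_T)                      # utility loss on anonymized video
    ce_B = ce(f_B(anon[::t]), y_B)                     # attacker loss on one frame per clip
    # With J_B = -H(Y_B, f_B(f_A(X))), Eq. (3) means: f_A and f_T minimize
    # L_T + gamma * J_B = L_T - gamma * ce_B, while f_B is trained to minimize ce_B.
    return L_T - gamma * ce_B, ce_B

clip = torch.randn(2, 3, 16, 112, 112)
obf_loss, attacker_loss = hybrid_losses(clip, torch.tensor([0, 1]), torch.tensor([2, 3]))
print(obf_loss.item(), attacker_loss.item())
```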
Algorithm 2: Ours-K-Beam Algorithm
1  Initialize θ_A, θ_T, and {θ_B^i}_{i=1}^K;
2  for t_0 ← 1 to max_iter do
3      /* L_T step: */
4      while Acc(f_T(f_A(X^v)), Y_T^v) ≤ th_T do
5          Update θ_T, θ_A using Eq. (6a)
6      end
7      /* L_B max step: */
8      Update j using Eq. (6b-i)
9      for t_1 ← 1 to d_iter do
10         Update θ_A using Eq. (6b-ii)
11     end
12     /* L_B min step: */
13     for i ← 1 to K do
14         while Acc(f_B^i(f_A(X^t)), Y_B^t) ≤ th_B do
15             Update θ_B^i using Eq. (6c)
16         end
17     end
18 end

where H(·) is the entropy function. Minimizing −H_B is equivalent to maximizing entropy, which will encourage "uncertain" predictions. We replace J_B in Eq. (4a) by −H_B, abbreviate H_B(f_B(f_A(X))) as H_B(θ_A, θ_B), and propose the following new update scheme:

    θ_A ← θ_A − α_A ∇_{θ_A} ( L_T(θ_A, θ_T) − γ H_B(θ_A, θ_B) ),    (7a)
    θ_T, θ_A ← θ_T, θ_A − α_T ∇_{θ_T, θ_A} L_T(θ_A, θ_T),    (7b)
    θ_B ← θ_B − α_B ∇_{θ_B} L_B(θ_A, θ_B),    (7c)

where L_T and L_B are still cross-entropy loss functions as in Eq. (4). Unlike in Eq. (4b), where we only update θ_T when minimizing L_T, we train θ_T and θ_A in an end-to-end manner as shown in Eq. (7b), since we find it achieves better performance in practice.

We denote this method as Ours-Entropy in the following parts and give the details in Algorithm 3.

Algorithm 3: Ours-Entropy Algorithm
1  Initialize θ_A, θ_T, and θ_B;
2  for t_0 ← 1 to max_iter do
3      Update θ_A using Eq. (7a)
4      while Acc(f_T(f_A(X^v)), Y_T^v) ≤ th_T do
5          Update θ_T, θ_A using Eq. (7b)
6      end
7      while Acc(f_B(f_A(X^t)), Y_B^t) ≤ th_B do
8          Update θ_B using Eq. (7c)
9      end
10 end

3.4 Cross-Dataset Training and Evaluation

When defining the privacy budget over privacy attributes, we run into the dataset challenge: in the literature, no existing datasets have both human action labels and privacy attributes provided on the same videos. Given the observation that a privacy attribute predictor trained on VISPR can correctly identify privacy attributes occurring in UCF101 and HMDB51 videos (examples in Appendix C), we hypothesize that the privacy attributes have good "transferability" across UCF101/HMDB51 and VISPR. Therefore, we can use a privacy prediction model trained on VISPR to assess the privacy leak risk on UCF101/HMDB51.

In view of that, we propose to use cross-dataset training and evaluation as a workaround. In brief, we train action recognition (the target utility task) on human action datasets, such as UCF101 [58] and HMDB51 [59], and train privacy protection (the budget task) on the visual privacy dataset VISPR [14], while letting the two interact via their shared component, the learned anonymization model. More specifically, during training, we have two pipelines: one is f_A and f_T trained on UCF101 or HMDB51 for action recognition; the other is f_A and f_B trained on VISPR to suppress multiple privacy attribute predictions. The two pipelines share the same parameters for f_A. During the evaluation, we evaluate model utility (i.e., action recognition) on the testing set of UCF101 or HMDB51 and privacy protection performance on the testing set of VISPR. Such cross-dataset training and evaluation sheds new possibilities for training privacy-preserving recognition models, even under the practical shortage of datasets annotated for both tasks. Notably, "cross-dataset training" and "cross-dataset testing (or evaluation)" are two independent strategies used in this paper; they can be used either together or separately. Details of our three experiments (SBU, UCF101, and HMDB51) are explained as follows:

• SBU (Section 4.1): we train and evaluate our framework on the same video set by considering actor identity as a simple privacy attribute. Neither cross-training nor cross-evaluation is involved.
• UCF101 (Section 4.2): we perform both cross-training and cross-evaluation, on UCF101 + VISPR. Such a method provides an alternative to flexibly train and test privacy-preserving video recognition for different utility/privacy combinations, without annotating specific datasets.
• HMDB51 (Section 5.5): we use cross-training on HMDB51 + VISPR, similarly to the UCF101 experiment; but for testing, we evaluate both utility and privacy performance on the same, newly-annotated PA-HMDB51 testing set. Therefore, it involves cross-training, but no cross-evaluation.

Beyond the above initial attempt, we further construct a new dataset dedicated to the privacy-preserving action recognition task, which will be presented in Section 5.
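A minimal sketch of the two shared-anonymizer pipelines described above is given below. Tiny stand-in networks and random tensors replace the real UCF101/HMDB51 and VISPR loaders; the frame-level utility head, the multi-label BCE budget loss, the learning rates, and the fixed alternation are illustrative assumptions, whereas the actual schedule follows the threshold-driven Algorithms 1-4.

```python
import torch
import torch.nn as nn

# Shared anonymizer f_A plus two heads: f_T (action, trained on UCF101/HMDB51 data)
# and f_B (privacy attributes, trained on VISPR images). Tiny stand-in networks.
f_A = nn.Conv2d(3, 3, 3, padding=1)
f_T = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 51))   # 51 actions
f_B = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 5))    # 5 attributes

opt_util = torch.optim.SGD(list(f_A.parameters()) + list(f_T.parameters()), lr=1e-3)
opt_anon = torch.optim.SGD(f_A.parameters(), lr=1e-3)
opt_budget = torch.optim.SGD(f_B.parameters(), lr=1e-3)
ce, bce, gamma = nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss(), 0.5

for step in range(3):                                  # toy batches stand in for real loaders
    act_frames, y_T = torch.randn(4, 3, 112, 112), torch.randint(0, 51, (4,))
    vispr_imgs, y_B = torch.randn(4, 3, 112, 112), torch.randint(0, 2, (4, 5)).float()

    # Utility pipeline (action dataset): update f_A and f_T to keep recognition accurate.
    opt_util.zero_grad(); ce(f_T(f_A(act_frames)), y_T).backward(); opt_util.step()

    # Budget pipeline (VISPR): f_A is updated to *increase* the attribute loss ...
    opt_anon.zero_grad(); (-gamma * bce(f_B(f_A(vispr_imgs)), y_B)).backward(); opt_anon.step()
    # ... while f_B is updated to predict the attributes as well as it can.
    opt_budget.zero_grad(); bce(f_B(f_A(vispr_imgs).detach()), y_B).backward(); opt_budget.step()
```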
3.5.1 Privacy Budget Model Restarting

Motivation. The max step over J_B(f_B(f_A(X)), Y_B) in Eq. (3) can leave the optimizer stuck in bad local solutions (similar to "mode collapse" in GANs), which hinders the entire minimax optimization. Model restarting provides a mechanism to "bypass" a bad solution when it occurs, thus enabling the minimax optimizer to explore better solutions.

Approach. At a certain point of training (e.g., when the privacy budget L_B(f_B(f_A(X)), Y_B) stops decreasing any further), we re-initialize f_B with random weights. Such random restarting aims to avoid trivial overfitting between f_B and f_A (i.e., f_A being specialized only at confusing the current f_B), without requiring more parameters. We then start to train the new model f_B to be a strong competitor w.r.t. the current f_A(X): specifically, we freeze the training of f_A and f_T, and switch to minimizing L_B(f_B(f_A(X)), Y_B), until the new f_B has been trained from scratch to become a strong privacy prediction model over the current f_A(X). We then resume adversarial training by unfreezing f_A and f_T, as well as switching the loss for f_B back to the adversarial loss (negative entropy or negative cross-entropy). Such random restarting can be repeated multiple times.

3.5.2 Privacy Budget Model Ensemble

Motivation. Ideally, in Eq. (3) we should maximize the error over the "current strongest possible" attacker f_B from P (a large and continuous f_B family), over which searching/sampling is impractical. Therefore we propose a privacy budget model ensemble as an approximation strategy, where we approximate the continuous P with a discrete set of M sample functions. Such a strategy is empirically verified in Sections 4 and 5 to address the critical "∀ challenge" in privacy protection, i.e., enhancing the defense against unseen attacker models (compared to the clear "attacker overfitting" phenomenon when sticking to one f_B during training).

Approach. Given the budget model ensemble P̄_t ≜ {f_B^i}_{i=1}^M, where M is the number of f_B's in the ensemble during training, we turn to minimize the following discretized surrogate of Eq. (2):

    f_A^*, f_T^* = argmin_{f_A, f_T} [ L_T(f_T(f_A(X)), Y_T) + γ max_{f_B^i ∈ P̄_t} J_B(f_B^i(f_A(X)), Y_B) ].    (8)

The previous basic framework is a special case of Eq. (8) with M = 1. The ensemble strategy can be easily incorporated with restarting.

3.5.3 Incorporating Budget Model Restarting and Budget Model Ensemble with Ours-Entropy

Budget model restarting and budget model ensemble can be easily incorporated with all three optimization schemes described in Section 3.3. We take Ours-Entropy as an example here. When the model ensemble is used, we abbreviate L_B(f_B^i(f_A(X)), Y_B) and H_B(f_B^i(f_A(X))) as L_B(θ_A, θ_B^i) and H_B(θ_A, θ_B^i), respectively. The new parameter updating scheme is:

    θ_A ← θ_A − α_A ∇_{θ_A} ( L_T + γ max_{θ_B^i ∈ P̄_t} −H_B(θ_A, θ_B^i) ),    (9a)
    θ_A, θ_T ← θ_A, θ_T − α_T ∇_{(θ_A, θ_T)} L_T(θ_A, θ_T),    (9b)
    θ_B^i ← θ_B^i − α_B ∇_{θ_B^i} L_B(θ_A, θ_B^i),  ∀i ∈ {1, . . . , M}.    (9c)

That is to say, when updating the anonymization model f_A, we only suppress the model f_B^i with the largest privacy leakage −H_B, i.e., the one "most confident" about its current privacy prediction. But we still update all M budget models on the budget task. The formal description of Ours-Entropy with model restarting and ensemble is given in Algorithm 4, where {θ_B^i}_{i=1}^M is reinitialized every rstrt_iter iterations. Likewise, GRL and Ours-K-Beam can also be incorporated with restarting and ensemble, whose details are shown in Appendix A.

Algorithm 4: Ours-Entropy Algorithm (with Model Restarting and Model Ensemble)
1  Initialize θ_A, θ_T, and {θ_B^i}_{i=1}^M;
2  for t_0 ← 1 to max_iter do
3      if t_0 ≡ 0 (mod rstrt_iter) then
4          Reinitialize {θ_B^i}_{i=1}^M
5      end
6      Update θ_A using Eq. (9a)
7      while Acc(f_T(f_A(X^v)), Y_T^v) ≤ th_T do
8          Update θ_T, θ_A using Eq. (9b)
9      end
10     for i ← 1 to M do
11         while Acc(f_B^i(f_A(X^t)), Y_B^t) ≤ th_B do
12             Update θ_B^i using Eq. (9c)
13         end
14     end
15 end

3.6 Two-Step Evaluation Protocol

The solution to Eq. (2) gives an anonymization model f_A^* and a target utility task model f_T^*. We need to evaluate f_A^* and f_T^* on the trade-off they have achieved between target task utility and privacy protection in two steps: (1) whether the learned target utility task model maintains satisfactory performance on anonymized videos; (2) whether the performance of an arbitrary privacy prediction model on anonymized videos will deteriorate.
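Before detailing this protocol, the sketch below makes the training-side update of Eq. (9a) and the restart schedule of Algorithm 4 concrete. The stand-in networks, the tiny restart interval, and the omission of the L_T and L_B updates (Eqs. (9b) and (9c)) are simplifications for illustration only.

```python
import torch
import torch.nn as nn

def make_budget_model():
    return nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 13))

f_A = nn.Conv2d(3, 3, 3, padding=1)
budget_ensemble = [make_budget_model() for _ in range(4)]     # M = 4 attackers
opt_anon = torch.optim.SGD(f_A.parameters(), lr=1e-3)
gamma, rstrt_iter = 2.0, 2   # tiny restart interval so the toy loop actually restarts

def prediction_entropy(logits):
    p = torch.softmax(logits, dim=1)
    return -(p * torch.log(p + 1e-8)).sum(dim=1).mean()       # H_B averaged over the batch

for t in range(4):                                            # toy training loop
    if t > 0 and t % rstrt_iter == 0:                         # model restarting
        budget_ensemble = [make_budget_model() for _ in range(4)]
    anon = f_A(torch.randn(4, 3, 112, 112))
    # Eq. (9a): among the ensemble, find the "most confident" attacker, i.e. the one
    # with the lowest prediction entropy (largest -H_B), and update f_A to raise it.
    entropies = torch.stack([prediction_entropy(fb(anon)) for fb in budget_ensemble])
    most_confident = int(entropies.argmin())
    opt_anon.zero_grad()
    (-gamma * entropies[most_confident]).backward()           # maximize that attacker's entropy
    opt_anon.step()
    # Eq. (9b) (update f_T, f_A on L_T) and Eq. (9c) (train every budget model on its
    # own cross-entropy loss) would follow here; they are omitted for brevity.
```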
Suppose we have a training dataset X^t with target and budget task ground truth labels Y_T^t and Y_B^t, and an evaluation dataset X^e with target and budget task ground truth labels Y_T^e and Y_B^e. In the first step, when evaluating the target task utility, we follow the traditional routine: compare f_T^*(f_A^*(X^e)) with Y_T^e to get the evaluation accuracy on the target utility task, denoted as A_T, which we expect to be as high as possible. In the second step, when evaluating the privacy protection, it is insufficient to only observe that the learned f_A^* and f_B^* lead to poor classification accuracy on X^e, because of the ∀ challenge: the attacker can select any privacy budget model to steal private information from the anonymized videos f_A^*(X^e). To empirically verify that f_A^* prohibits reliable privacy prediction by other possible budget models, we propose a novel procedure:

• We randomly re-sample N privacy budget prediction models P̄_e ≜ {f_B^i}_{i=1}^N from P for evaluation. Note that these N models used in evaluation, P̄_e, have no overlap with the M privacy budget models of the training ensemble P̄_t, i.e., P̄_e ∩ P̄_t = ∅.
• We train these N models P̄_e on the anonymized training videos f_A^*(X^t) to make correct predictions on private information, i.e., min_{f_B^i} L_B(f_B^i(f_A^*(X^t)), Y_B^t), ∀ f_B^i ∈ P̄_e. Note that f_A^* is fixed during this training procedure.
• After that, we apply each f_B^i to the anonymized evaluation videos f_A^*(X^e) and compare the outputs f_B^i(f_A^*(X^e)) with Y_B^e to get the privacy budget accuracy of the i-th budget model, i.e., Acc(f_B^i(f_A^*(X^e)), Y_B^e).
• We select the highest accuracy among all N privacy budget models and use it as the final privacy budget accuracy A_B^N, which we expect to be as low as possible. Specifically, we have

    A_B^N = max_{f_B^i ∈ P̄_e} Acc(f_B^i(f_A^*(X^e)), Y_B^e).    (10)

4 SIMULATION EXPERIMENTS

We show the effectiveness of our framework on privacy-preserving action recognition on existing datasets.

Overview of Experiment Settings. The target utility task is human action recognition, since it is a highly demanded feature in smart home and smart workplace applications. Experiments are carried out on three widely used human action recognition datasets: the SBU Kinect Interaction Dataset [61], UCF101 [58], and HMDB51 [59]. The privacy budget task varies in different settings. In the SBU dataset experiments, the privacy budget is to prevent the videos from leaking human identity information. In the experiments on UCF101 and HMDB51, the privacy budget is to protect the visual privacy attributes defined in [14]. We emphasize that the general framework proposed in Section 3.2 can be used for a large variety of target utility task and privacy budget task combinations, not only the aforementioned settings.

Following the notations in Section 3.2, on all the video action recognition datasets, including SBU, UCF101, and HMDB51, we set W = 112, H = 112, C = 3, and T = 16 (C3D's required temporal length and spatial resolution). Note that the original resolutions for SBU, UCF101, and HMDB51 are 640×480, 320×240, and 320×240, respectively. We downsample video frames to a resolution of 160×120. To reduce the spatial resolution to 112×112, we use random-crop and center-crop in training and evaluation, respectively.

Baseline Approaches. We consistently use two groups of approaches as baselines across the three action recognition datasets. These two groups of baselines are naive downsamples and empirical obfuscations. The group of naive downsamples chooses downsample rates from {1, 2, 4, 8, 16}, where 1 stands for no downsampling. The group of empirical obfuscations includes approaches selected from different combinations in {box, segmentation} × {blurring, blackening} × {face, human body}. Details are listed below:

• Naive Downsample: Spatially downsample each frame.
• Box-Black-Face: Boxing and blackening faces.
• Box-Black-Body: Boxing and blackening bodies.
• Seg-Black-Face: Segmenting and blackening faces.
• Seg-Black-Body: Segmenting and blackening bodies.
• Box-Blur-Face: Boxing and blurring faces.
• Box-Blur-Body: Boxing and blurring bodies.
• Seg-Blur-Face: Segmenting and blurring faces.
• Seg-Blur-Body: Segmenting and blurring bodies.

Our Proposed Approaches. The previous two groups of baselines are compared with our three proposed approaches:

• GRL: as described in Section 3.3.1.
• Ours-K-Beam: as described in Section 3.3.2. We have tried K = 1, 2, 4, 8.
• Ours-Entropy: as described in Section 3.3.3. In the privacy budget model ensemble P̄_t, the M models are chosen from the MobileNet-V2 [62] family with different width multipliers. We have tried M = 1, 2, 4, 8.

All three approaches are evaluated with and without privacy budget model restarting.

Evaluation. In the two-step evaluation (as described in Section 3.6), we have used N = 10 different state-of-the-art classification networks, namely ResNet-V1-{50,101} [55], ResNet-V2-{50,101} [63], Inception-V1 [64], Inception-V2 [65], and MobileNet-V1-{0.25,0.5,0.75,1} [56], as P̄_e. Note that P̄_e ∩ P̄_t = ∅. All detailed numerical results reported in the following sections can be found in Appendix B.

4.1 Identity-Preserving Action Recognition on SBU: Single-Dataset Training

We compare our proposed approaches with the groups of baseline approaches to show our methods' significant superiority in balancing privacy protection and model utility. We use the three different optimization schemes described in Section 3.3 on our framework and empirically show that all three largely outperform the baseline methods. We also show that adding the model ensemble and model restarting described in Section 3.5 to the optimization procedure can further improve the performance of our method.

4.1.1 Experiment Setting

The SBU Kinect Interaction Dataset [61] is a two-person interaction dataset for video-based action recognition. 7 participants performed actions, and the dataset is composed of 21 sets. Each set uses a pair of actors to perform all 8 interactions. However, some sets use the same two actors but with the acting and reacting roles switched. For example, in set 1, actor 1 is acting and actor 2 is reacting; in set 4, actor 2 is acting and actor 1 is reacting. These two sets have the same actors, so we combine them as one class to better fit our experimental setting. In this way, we combine all sets with the same actors and finally get 13 different actor pairs. The target utility task on this dataset is action recognition, which can be taken as a classification task with 8 classes. The privacy budget task is to recognize the actor pair in a video, which can be taken as a classification task with 13 classes.

4.1.2 Implementation Details

In Algorithms 1-3, we set step sizes α_T = 10^-5, α_B = 10^-2, α_A = 10^-4, and accuracy thresholds th_T = 85%, th_B = 99%
Fig. 3 (plot omitted; x-axis: privacy attributes cMAP A_B^N (%), y-axis: action accuracy A_T (%)): The trade-off between privacy budget and action utility on the UCF101/VISPR dataset. For the Naive Downsample method, a larger marker means a larger downsampling rate is adopted. For the Ours-K-Beam method, a larger marker means a larger K (number of beams) in Algorithm 2. For Ours-Entropy and Ours-Entropy (restarting), a larger marker means a larger M (number of ensemble models) in Algorithm 4. Methods with a "+" superscript are incorporated with model restarting. Vertical and horizontal purple dashed lines indicate A_B^N and A_T on the original non-anonymized videos, respectively. The black dashed line indicates where A_B^N = A_T. Detailed experimental settings and numerical results for each method can be found in Appendix B.

4.2.3 Results and Analyses

We present the experimental results in Figure 3. All naive downsample and empirical obfuscation methods cause A_T to drop dramatically while A_B^N only drops a little, which means the utility of the videos is greatly reduced while the private information is hardly filtered out. In contrast, with the help of model restarting and model ensemble, Ours-Entropy can decrease A_B^N by 7% while keeping A_T as high as that on the original raw videos, meaning the privacy is protected at almost no cost to the utility. Hence, Ours-Entropy outperforms all naive downsample and empirical obfuscation baselines in this experiment. It also shows an advantage over GRL and Ours-K-Beam.

4.3 Anonymized Video Visualization

We provide the visualization of the anonymized videos on SBU, UCF101, and our new dataset PA-HMDB51 (see Section 5) in Figure 4. To save space, we only show the center frame of each anonymized video. The visualization shows that the privacy attributes in the anonymized videos are filtered out, but it is still possible to recognize the actions.

Fig. 4 (images omitted; columns: Original, M=1, M=1+, M=4+): The center frame of example videos before (column 1) and after (columns 2-4) applying the anonymization transform learned by Ours-Entropy. The first row shows a frame from a "pushing" video in the SBU dataset; the second row shows a frame from a "handstand" video in the UCF101 dataset; the third row shows a frame from a "push-up" video in the PA-HMDB51 dataset. Privacy attributes in the last two rows include semi-nudity, face, gender, and skin color. Model restarting and ensemble settings are indicated below each anonymized image. M is the number of ensemble models. Methods with a "+" superscript are incorporated with model restarting.

5 PA-HMDB51: A NEW BENCHMARK

5.1 Motivation

There is no public dataset in the literature containing both human action and privacy attribute labels on the same videos. This poses two challenges. Firstly, the lack of available datasets has increased the difficulty of employing a data-driven joint training method. Secondly, this complication has made it impossible to directly evaluate the trade-off between privacy budget and action utility achieved by a learned anonymization model f_A^*. To solve this problem, we annotate and present the very first human action video dataset with privacy attributes labeled, named PA-HMDB51 (Privacy-Annotated HMDB51). We evaluate our method on this newly built dataset and further demonstrate our method's effectiveness.

5.2 Selecting and Labeling Privacy Attributes

A recent work [14] has defined 68 privacy attributes that could be disclosed by images. However, most of them seldom make any appearance in public human action datasets. We carefully select, out of the 68 attributes from [14], the 5 privacy attributes that are most relevant to our smart home setting: skin color, gender, face, nudity, and personal relationship (only intimate relationships such as friends, couples, or family members are considered in our setting). The detailed description of each attribute, their possible ground truth values, and the corresponding meanings are listed in Table 1. Some annotated frames in our PA-HMDB51 dataset are shown in Table 2 as examples.

Privacy attributes may vary during a video clip. For example, in some frames we may see a person's full face, while in the next frames the person may turn around, and his/her face is no longer visible. Therefore, we decided to label all the privacy attributes on each frame (see footnote 4).

The annotation of privacy labels was manually performed by a group of students at the CSE department of Texas A&M University. Each video was annotated by at least three individuals and then cross-checked.

Footnote 4: A tiny portion of frames in some HMDB51 videos do not contain any person. No privacy attributes are annotated on those frames.
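For concreteness, one possible in-memory representation of such a per-frame annotation is sketched below, using the integer codes later listed in Table 1 and mirroring the first example of Table 2. The field names and the JSON-like layout are assumptions for illustration and not necessarily the schema of the released annotation files.

```python
# Hypothetical record layout: one action label per video, and one privacy-attribute
# code vector per frame (codes follow Table 1). Frames with no person carry no labels.
example_annotation = {
    "video": "brush_hair_0001",   # hypothetical clip identifier
    "action": "brush hair",
    "frames": [
        {"idx": 0, "skin_color": 1, "face": 0, "gender": 2, "nudity": 2, "relationship": 0},
        {"idx": 1, "skin_color": 1, "face": 1, "gender": 2, "nudity": 2, "relationship": 0},
    ],
}

def attribute_histogram(annotation, attribute):
    """Count how often each code of one privacy attribute occurs across the frames,
    which is the kind of frame-level statistic reported later in Fig. 6."""
    counts = {}
    for frame in annotation["frames"]:
        counts[frame[attribute]] = counts.get(frame[attribute], 0) + 1
    return counts

print(attribute_histogram(example_annotation, "face"))   # {0: 1, 1: 1}
```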
5.3 HMDB51 as the Data Source

Now that we have defined the 5 privacy attributes, we need to identify a source of human action videos for annotation. There are a number of choices available, such as [58], [59], [67]–[69]. We choose HMDB51 [59] to label privacy attributes since it contains more diverse private information, especially nudity/semi-nudity and relationship.

We provide a per-frame annotation of the selected 5 privacy attributes on 515 videos selected from HMDB51. In this paper, we treat all 515 videos as testing samples (see footnote 5). Our ultimate goal would be to create a larger-scale version of PA-HMDB51 that allows for both training and testing coherently on the same benchmark. For now, we use PA-HMDB51 to facilitate better testing, while still considering cross-dataset training as a rough yet useful option to train privacy-preserving video recognition (before the larger dataset becomes available).

5.4 Dataset Statistics

5.4.1 Action Distribution

When selecting videos from the HMDB51 dataset, we consider two criteria on action labels. First, the action labels should be balanced. Second (and more implicitly), we select more videos with non-trivial privacy labels. For example, the "brush hair" action contains many videos with a "semi-nudity" attribute, and the "pull-up" action contains many videos with a "partially visible face" attribute. Despite their practical importance, these privacy attributes are relatively rare in the entire HMDB51 dataset, so we tend to select more videos with these attributes, regardless of their action classes. The resultant distribution of action labels is depicted in Figure 5 (left panel), showing a relative class balance.

Fig. 5 (plots omitted): Left: action distribution of PA-HMDB51. Each bar shows the number of videos with a certain action; e.g., the last bar shows there are 25 "brush hair" videos in the PA-HMDB51 dataset. Right: action-attribute correlation in the PA-HMDB51 dataset. The x-axis lists all possible values, grouped by bracket, for each privacy attribute; the y-axis lists the different action types. The color represents the ratio of the number of frames of an action annotated with a specific privacy attribute value w.r.t. the total number of frames of that action.

5.4.2 Privacy Attribute Distribution

We try to make the label distribution for each privacy attribute as balanced as possible by manually selecting for labeling those videos that contain uncommon privacy attribute values in the original HMDB51. For instance, videos with semi-nudity are overall uncommon, so we deliberately select videos containing semi-nudity into our PA-HMDB51 dataset. Naturally, people are reluctant to release data with privacy concerns to the public, so the privacy attributes are highly unbalanced in any public video dataset. Although we have used this method to reduce the data imbalance, PA-HMDB51 is still unbalanced. Frame-level label distributions of all 5 privacy attributes are shown in Figure 6.

Fig. 6: Label distribution per privacy attribute in PA-HMDB51. SC, RL, FC, ND, and GR stand for skin color, relationship, face, nudity, and gender, respectively. The rounded ratios are given in % scale. Definitions of the label values (0, 1, 2, 3, 4) for each attribute are described in Table 1.

        0    1    2    3    4
  SC    3   71   14   10    2
  RL   84   16
  FC   25   38   37
  ND   40   44   15
  GR    2   55   36    7

5.4.3 Action-Attribute Correlation

If there were a strong correlation between a privacy attribute and an action, it would be harder to remove the private information from the videos without much harm to the action recognition task. For example, we would expect a high correlation between the attribute "gender" and the action "brush hair", since this action is carried out much more often by females than by males. We show the correlation between privacy attributes and actions in Figure 5 (right panel) and more details in Appendix D.

Footnote 5: Labeling per-frame privacy attributes on a video dataset is extremely labor-consuming and subjective (needing individual labeling and then cross-checking). As a result, the current size of PA-HMDB51 is limited. So far, we have only used PA-HMDB51 as the testing set, and we seek to annotate more data and hopefully expand PA-HMDB51 for training as future work.

Footnote 6: For "skin color" and "gender," we allow multiple labels to coexist. For example, if a frame showed a black person shaking hands with a white person, we would label both "black" and "white" for the "skin color" attribute. In the visualization, we use "coexisting" to represent the multi-label coexistence and do not show in detail whether it is "white and black coexisting" or "black and yellow coexisting." For the remaining three attributes, we label each attribute using the highest privacy-leakage risk among all persons in the frame. E.g., given a frame where a group of people are hugging, if there is at least one complete face visible, we would label the "face" attribute as "completely visible."
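As a small worked example of the statistic shown in the right panel of Figure 5, the sketch below computes, for one attribute, the fraction of an action's frames carrying each label value. The toy records and the handling of only single-label values (ignoring the coexisting cases of footnote 6) are simplifying assumptions.

```python
from collections import Counter, defaultdict

# Toy per-frame records of the form (action, {attribute: code}); in practice these
# would be read from the PA-HMDB51 annotations.
frames = [("brush hair", {"gender": 2}),
          ("brush hair", {"gender": 2}),
          ("situp", {"gender": 1})]

def action_attribute_ratio(frames, attribute):
    """Fraction of each action's frames annotated with each value of one privacy
    attribute, i.e. the quantity visualized in the right panel of Fig. 5."""
    per_action, totals = defaultdict(Counter), Counter()
    for action, attrs in frames:
        totals[action] += 1
        per_action[action][attrs[attribute]] += 1
    return {action: {value: count / totals[action] for value, count in counts.items()}
            for action, counts in per_action.items()}

print(action_attribute_ratio(frames, "gender"))
# {'brush hair': {2: 1.0}, 'situp': {1: 1.0}}
```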
TABLE 1: Attribute definition in the PA-HMDB51 dataset

Skin Color
  0: Skin color of the person(s) is unidentifiable.
  1: Skin color of the person(s) is white.
  2: Skin color of the person(s) is brown/yellow.
  3: Skin color of the person(s) is black.
  4: Persons with different skin colors are coexisting (see footnote 6).
Face
  0: Invisible (< 10% of the area is visible).
  1: Partially visible (≥ 10% and ≤ 70% of the area is visible).
  2: Completely visible (> 70% of the area is visible).
Gender
  0: The gender(s) of the person(s) is/are unidentifiable.
  1: The person(s) is/are male.
  2: The person(s) is/are female.
  3: Persons of different genders are coexisting.
Nudity
  0: No nudity, with long sleeves and pants.
  1: Partial nudity, with short sleeves, skirts, or shorts.
  2: Semi-nudity, with a half-naked body.
Relationship
  0: Personal relationship is unidentifiable.
  1: Personal relationship is identifiable.

TABLE 2: Examples of the annotated frames in the PA-HMDB51 dataset (frame images omitted)

Action: Brush hair. Privacy attributes: skin color: white; face: invisible; gender: female; nudity: semi-nudity; relationship: unrevealed.
Action: Situp. Privacy attributes: skin color: black; face: completely visible; gender: male; nudity: semi-nudity; relationship: unrevealed.

5.5 Benchmark Results on PA-HMDB51: Cross-Dataset Training

5.5.1 Experiment Setting

We train our models using cross-dataset training on the HMDB51 and VISPR datasets as we did in Section 4.2, except that we use the 5 attributes defined in Table 1 on VISPR instead of the 7 used in Section 4.2. The trained models are directly evaluated on the PA-HMDB51 dataset (see footnote 7) for both the target utility task T and the privacy budget task B, without any re-training or adaptation. We exclude the videos in PA-HMDB51 from HMDB51 to get the training set. Similar to the UCF101 experiments, the target utility task T (i.e., action recognition) can be taken as a video classification problem with 51 classes, and the privacy budget task B (i.e., privacy attribute prediction) can be taken as a multi-label image classification task with two classes for each privacy attribute label. Notably, although PA-HMDB51 provides concrete multi-class labels with specific privacy attribute classes, we convert them into binary labels during testing. For example, for the "gender" attribute, we have provided the ground truth labels "male," "female," "coexisting," and "cannot tell," but we only use "can tell" and "cannot tell" in our experiments, by combining "male," "female," and "coexisting" into the one class "can tell." This is because we must keep the testing protocol on PA-HMDB51 consistent with the training protocol on VISPR (a multi-label, "either-or" type binary classification task), so that our models cross-trained on UCF101-VISPR can be evaluated directly. We hope to extend training to PA-HMDB51 in the future so that the privacy budget task can be formulated and evaluated as a multi-label classification problem.

All implementation details are identical to the UCF101 case, except that we adjust th_T = 0.7 and th_B = 0.95.

Footnote 7: We only use PA-HMDB51 as the testing set so far, since the current size of PA-HMDB51 is limited for training.

5.5.2 Results and Analysis

The results on PA-HMDB51 are shown in Figure 7. Our methods achieve a much better trade-off between privacy budget and action utility compared with the baseline methods. When M = 4, our methods can decrease privacy cMAP by around 8% with little harm to utility accuracy. Overall, the privacy gains are more limited compared to the previous two experiments, because no (re-)training is performed; but the overall comparison shows the same consistent trends.

Asymmetrical Privacy Attribute Protection Cost. Different privacy attributes have different protection costs. After applying the learned anonymization optimized by Ours-Entropy (restarting, M=4) on PA-HMDB51, the drop in AP of "face" is much more significant than that of "gender," which indicates that the "gender" attribute is much harder to suppress than "face." Such an observation agrees with the fact that the gender attribute can be revealed by face, body, clothing, and even hairstyle. In future work, we will take such cost asymmetry into account by using a weighted loss combination of different privacy attributes or by training a dedicated privacy protector for the most informative private attribute.

Human Study on the Privacy Protection of Our Learned Anonymization. We use a human study to evaluate the trade-off between privacy budget and action utility achieved by our learned anonymization transform. We take both privacy protection and action recognition into account in the study. We emphasize here that both privacy protection and action recognition are evaluated at the video level. There are 515 videos distributed over 51 actions in PA-HMDB51. For each action in PA-HMDB51, we randomly pick one video for the human study. Among the 51 selected videos, we only keep 30 videos to reduce the human evaluation cost. There were 40 volunteers involved in the human study. In the study, they were asked to label all the privacy attributes and the action type on the raw videos and on the anonymized videos. According to the experimental results (shown in Appendix E), the actions in the anonymized videos are still distinguishable to humans, but the privacy attributes are not recognizable at all. This human study further justifies that our learned anonymization transform can protect privacy and maintain target utility task performance simultaneously.

6 CONCLUSION

We propose an innovative framework to address the newly-established problem of privacy-preserving action recognition. To tackle the challenging adversarial learning process, we investigate three different optimization schemes. To further tackle the ∀ challenge of universal privacy protection,
[34] F. Pittaluga, S. Koppal, and A. Chakrabarti, “Learning privacy preserving encodings through adversarial training,” in WACV, 2019.
[35] M. Bertran, N. Martinez, A. Papadaki, Q. Qiu, M. Rodrigues, G. Reeves, and G. Sapiro, “Adversarially learned representations for information obfuscation and inference,” in ICML, 2019.
[36] P. C. Roy and V. N. Boddeti, “Mitigating information leakage in image representations: A maximum entropy approach,” in CVPR, 2019.
[37] B. H. Zhang, B. Lemoine, and M. Mitchell, “Mitigating unwanted biases with adversarial learning,” in AIES, 2018.
[38] Z. Ren, Y. Jae Lee, and M. S. Ryoo, “Learning to anonymize faces for privacy preserving action detection,” in ECCV, 2018.
[39] R. R. Shetty, M. Fritz, and B. Schiele, “Adversarial scene editing: Automatic object removal from weak supervision,” in NeurIPS, 2018.
[40] T. Wang, J. Zhao, M. Yatskar, K.-W. Chang, and V. Ordonez, “Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations,” in ICCV, 2019.
[41] W. Oleszkiewicz, P. Kairouz, K. Piczak, R. Rajagopal, and T. Trzciński, “Siamese generative adversarial privatizer for biometric data,” in ACCV, 2018.
[42] Y. Li, N. Vishwamitra, B. P. Knijnenburg, H. Hu, and K. Caine, “Blur vs. block: Investigating the effectiveness of privacy-enhancing obfuscation for images,” in CVPRW, 2017.
[43] S. J. Oh, R. Benenson, M. Fritz, and B. Schiele, “Faceless person recognition: Privacy implications in social media,” in ECCV, 2016.
[44] R. McPherson, R. Shokri, and V. Shmatikov, “Defeating image obfuscation with deep learning,” arXiv, 2016.
[45] S. J. Oh, M. Fritz, and B. Schiele, “Adversarial image perturbation for privacy protection: A game theory perspective,” in ICCV, 2017.
[46] D. Gurari, Q. Li, C. Lin, Y. Zhao, A. Guo, A. Stangl, and J. P. Bigham, “VizWiz-Priv: A dataset for recognizing the presence and purpose of private visual information in images taken by blind people,” in CVPR, 2019.
[47] J. Hamm and Y.-K. Noh, “K-beam minimax: Efficient optimization for deep adversarial learning,” in ICML, 2018.
[48] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in ICML, 2015.
[49] X. Xiang and T. D. Tran, “Linear disentangled representation learning for facial actions,” TCSVT, 2018.
[50] G. Desjardins, A. Courville, and Y. Bengio, “Disentangling factors of variation via generative entangling,” arXiv, 2012.
[51] A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio, “Image-to-image translation for cross-domain disentanglement,” arXiv, 2018.
[52] S. Reddy, I. Labutov, S. Banerjee, and T. Joachims, “Unbounded human learning: Optimal scheduling for spaced repetition,” in SIGKDD, 2016.
[53] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV, 2016.
[54] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in ICCV, 2015.
[55] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
[56] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv, 2017.
[57] D.-Z. Du and P. M. Pardalos, Minimax and Applications, 2013.
[58] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv, 2012.
[59] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: A large video database for human motion recognition,” in ICCV, 2011.
[60] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in CVPR, 2018.
[61] K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras, “Two-person interaction detection using body-pose features and multiple instance learning,” in CVPRW, 2012.
[62] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in CVPR, 2018.
[63] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in ECCV, 2016.
[64] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, 2015.
[65] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in CVPR, 2016.
[66] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
[67] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, “ActivityNet: A large-scale video benchmark for human activity understanding,” in CVPR, 2015.
[68] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar et al., “AVA: A video dataset of spatio-temporally localized atomic visual actions,” in CVPR, 2018.
[69] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., “The Kinetics human action video dataset,” arXiv, 2017.

Zhenyu Wu received the M.S. and B.E. degrees from the Ohio State University and Shanghai Jiao Tong University, in 2017 and 2015 respectively. He is currently a Ph.D. student at Texas A&M University, advised by Prof. Zhangyang Wang. His research interests include privacy/fairness in machine learning, efficient vision, object detection, and adversarial learning.

Haotao Wang received the B.E. degree in EE from Tsinghua University, China, in 2018. He is working toward a Ph.D. degree at the University of Texas at Austin, under the supervision of Prof. Zhangyang Wang. His research interests include computer vision and machine learning, especially fairness/privacy in machine learning, adversarial robustness, and model compression.

Zhaowen Wang received the B.E. and M.S. degrees from Shanghai Jiao Tong University, China, in 2006 and 2009 respectively, and the Ph.D. degree in ECE from UIUC in 2014. He is currently a Senior Research Scientist with the Creative Intelligence Lab, Adobe Inc. His research focuses on understanding and enhancing images, videos and graphics via machine learning algorithms, with a particular interest in sparse coding and deep learning.

Hailin Jin is a Senior Principal Scientist at Adobe Research. He received his M.S. and Ph.D. in EE from WUSTL in 2000 and 2003. Between fall 2003 and fall 2004, he was a postdoc researcher at the CS Department, UCLA. His current research interests include deep learning, computer vision, and natural language processing. His work can be found in many Adobe products, including Photoshop, After Effects, Premiere Pro, and Photoshop Lightroom.

Zhangyang Wang is currently an Assistant Professor of ECE at UT Austin. He was an Assistant Professor of CSE at TAMU from 2017 to 2020. He received his Ph.D. in ECE from UIUC in 2016, and his B.E. in EEIS from USTC in 2012. Prof. Wang is broadly interested in the fields of machine learning, computer vision, optimization, and their interdisciplinary applications. His latest interests focus on automated machine learning (AutoML), learning-based optimization, machine learning robustness, and efficient deep learning.
APPENDIX A
GRL AND OURS-K-BEAM WITH MODEL RESTARTING

In this section, we provide the formal descriptions of the GRL and Ours-K-Beam algorithms with model restarting in Algorithm 5 and Algorithm 6, respectively.
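To give a concrete feel for the restarting strategy, the following is a minimal PyTorch-style sketch of the restarting idea only, not a transcription of Algorithm 5 or 6: the toy models, the single adversarial objective (target loss minus budget loss), and the hyper-parameter restart_period are placeholders introduced purely for illustration.

```python
# Illustrative sketch of adversarial training with periodic restarting of the
# privacy-budget model. fA anonymizes the input, fT is the target (action) model,
# and fB is the budget (attacker) model; all three are toy stand-ins here.
import torch
import torch.nn as nn

def make_budget_model():
    # Toy stand-in for a privacy-attribute classifier.
    return nn.Sequential(nn.Flatten(), nn.Linear(16, 2))

fA = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1))    # anonymizer (toy)
fT = nn.Sequential(nn.Flatten(), nn.Linear(16, 5))   # target-task model (toy)
fB = make_budget_model()                             # budget/attacker model

opt_AT = torch.optim.Adam(list(fA.parameters()) + list(fT.parameters()), lr=1e-4)
opt_B = torch.optim.Adam(fB.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()
restart_period = 100  # assumed hyper-parameter: restart fB every 100 iterations

for step in range(1000):
    x = torch.randn(8, 1, 4, 4)        # dummy batch of frames
    y_t = torch.randint(0, 5, (8,))    # dummy action labels
    y_b = torch.randint(0, 2, (8,))    # dummy privacy labels

    # (1) Update fA and fT: minimize the target loss while maximizing the budget loss.
    x_anon = fA(x)
    loss = ce(fT(x_anon), y_t) - ce(fB(x_anon), y_b)
    opt_AT.zero_grad(); loss.backward(); opt_AT.step()

    # (2) Update fB: keep the attacker strong on the anonymized inputs.
    loss_b = ce(fB(fA(x).detach()), y_b)
    opt_B.zero_grad(); loss_b.backward(); opt_B.step()

    # (3) Model restarting: periodically re-initialize fB so that fA does not
    #     overfit to defeating one particular, weakened attacker.
    if (step + 1) % restart_period == 0:
        fB = make_budget_model()
        opt_B = torch.optim.Adam(fB.parameters(), lr=1e-4)
```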
APPENDIX B
DETAILED NUMERICAL RESULTS

In Table 3, we provide the detailed numerical results reported in Figures 2, 3, and 7.

TABLE 3: Detailed numerical results of all the experiments. A_T stands for the target utility task (action recognition) accuracy, while A_B stands for the privacy budget prediction performance (accuracy in the classification task and cMAP in the multi-label classification task). r is the sampling rate for the downsampling baselines. {box: X, segmentation: S} × {blurring: B, blackening: K} × {face: F, human body: D} are the different empirical obfuscation baselines. K is the number of different sets of budget model parameters tracked by Ours-K-Beam. M is the number of ensemble budget models used by Ours-Entropy. Methods with a “+” superscript incorporate model restarting.

APPENDIX D
MORE STATISTICS ON PA-HMDB51 DATASET

Action Distribution. The distribution of action labels in PA-HMDB51 (as discussed in Section 5.4.3 of the main paper) is depicted in Figure 8, showing a relatively balanced class distribution.

Fig. 8: Action distribution of PA-HMDB51. Each column shows the number of videos with a certain action. For example, the first column shows that there are 25 “brush hair” videos in the PA-HMDB51 dataset.

Action-Attribute Correlation. We show the correlation between privacy attributes and actions (as discussed in Section 5.4.3 of the main paper) in Figure 9.
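As a minimal sketch of how such a per-action ratio can be computed from frame-level annotations, the snippet below assumes a hypothetical tabular layout (one row per frame and attribute); it is not the released PA-HMDB51 annotation format.

```python
# Illustrative sketch: for each (action, attribute value) pair, compute the fraction
# of that action's frames carrying the value, i.e., the quantity shown in Figure 9.
import pandas as pd

# Hypothetical frame-level records: one row per (frame, attribute).
frames = pd.DataFrame({
    "action":    ["kiss", "kiss", "kiss", "golf"],
    "frame_id":  [0, 1, 2, 0],
    "attribute": ["relationship", "relationship", "relationship", "relationship"],
    "value":     ["identifiable", "identifiable", "unidentifiable", "unidentifiable"],
})

counts = (frames.groupby(["action", "attribute", "value"])["frame_id"]
                .count().rename("n_frames").reset_index())
totals = frames.groupby(["action", "attribute"])["frame_id"].count().rename("n_total")
ratio = counts.join(totals, on=["action", "attribute"])
ratio["ratio"] = ratio["n_frames"] / ratio["n_total"]
print(ratio)  # e.g., ("kiss", "relationship", "identifiable") -> 2/3 of the kiss frames
```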
APPENDIX E
HUMAN STUDY ON OUR LEARNED ANONYMIZATION TRANSFORM

We use a human study to evaluate the privacy-utility trade-off achieved by our learned anonymization transform f_A^*. We take both privacy protection and action recognition into account in the study. We emphasize that both privacy protection and action recognition are evaluated at the video level.

Experiment Setting. There are 515 videos distributed over the 51 actions in PA-HMDB51. For each action, we randomly pick one video for the human study. Among the 51 selected videos, we keep only 30 to reduce the human evaluation cost; the center frames of these 30 videos are shown in Figure 10. Forty volunteers participated in the study. They were asked to label all the privacy attributes and the action type on both the raw videos and the anonymized videos. The guideline for labeling the privacy attributes is listed below:
• gender: the person(s)’ gender can be told;
• nudity: the person(s) is/are in semi-nudity (wearing shorts/skirts or naked to the waist);
• relationship: relationships (such as friends, couples, etc.) between/among the actors/actresses can be told;
• face: more than 10% of the face is visible;
• skin color: the skin color of the person(s) can be told;
• no privacy attributes found: you cannot tell any privacy attribute.

TABLE 4: Human study on the raw videos and the anonymized videos. A random-guess baseline is also provided. “P,” “R,” and “F1” stand for precision, recall, and F1-score.

              |   Raw Videos     | Anonymized Videos |   Random Guess
              |  P     R     F1  |  P     R     F1   |  P     R     F1
Skin Color    | 0.98  1.00  0.99 | 0.98  0.12  0.21  | 0.51  0.94  0.66
Face          | 0.97  0.99  0.98 | 0.94  0.40  0.56  | 0.47  0.66  0.55
Gender        | 0.98  1.00  0.99 | 0.94  0.36  0.52  | 0.49  0.91  0.64
Nudity        | 0.99  0.99  0.99 | 0.61  0.09  0.16  | 0.48  0.51  0.49
Relationship  | 0.97  0.88  0.92 | 0.47  0.39  0.43  | 0.49  0.14  0.22
Micro-Avg     | 0.98  0.99  0.98 | 0.86  0.25  0.39  | 0.49  0.64  0.55
Macro-Avg     | 0.98  0.97  0.97 | 0.79  0.27  0.38  | 0.49  0.63  0.51
Weighted-Avg  | 0.98  0.99  0.98 | 0.88  0.25  0.37  | 0.49  0.64  0.52
Samples-Avg   | 0.98  0.99  0.98 | 0.45  0.24  0.30  | 0.49  0.61  0.51

All experimental results of multi-label privacy attribute prediction in the human study are shown in Table 4, which reports the results on the raw videos, on the anonymized videos, and for random guess. We use the same notations as in the main paper: A_T stands for the action recognition accuracy, and A_B stands for the macro-F1 score of multi-label privacy attribute prediction. We use A_T^r, A_T^a, and A_T^g to denote the action recognition accuracy on the raw videos, on the anonymized videos, and by random guess, and likewise A_B^r, A_B^a, and A_B^g for the macro-F1 score of privacy attribute prediction. A_B^a (0.38) is much lower than A_B^r (0.97) and close to A_B^g (0.51), justifying the strong privacy protection achieved by our learned f_A^*. A_T^a (0.5783) is comparable to A_T^r (0.9616) and significantly higher than A_T^g (0.0333), justifying the good target utility preservation of our learned f_A^*.
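For clarity on how the micro, macro, weighted, and samples averages in Table 4 can be obtained, the snippet below is a small scikit-learn sketch over dummy multi-label matrices; the labels shown are placeholders, not the actual human-study responses.

```python
# Illustrative sketch: multi-label precision/recall/F1 with the four averaging modes
# used in Table 4, computed over the five privacy attributes.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

attributes = ["skin color", "face", "gender", "nudity", "relationship"]
# Rows: videos; columns: binary presence of each attribute (dummy values).
y_true = np.array([[1, 1, 1, 0, 1],
                   [1, 1, 1, 1, 0],
                   [1, 0, 1, 0, 0]])
y_pred = np.array([[1, 1, 0, 0, 1],
                   [0, 1, 1, 1, 0],
                   [1, 0, 0, 0, 1]])

for avg in ["micro", "macro", "weighted", "samples"]:
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg:>8}-Avg  P={p:.2f}  R={r:.2f}  F1={f1:.2f}")
```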
Fig. 9: Action-attribute correlation in the PA-HMDB51 dataset (five subplots: face, nudity, gender, skin color, and relationship). The color represents the ratio of the number of frames of a given action containing a specific privacy attribute value to the total number of frames of that action. For example, in the “relationship” subplot, the intersection of the row “identifiable” and the column “kiss” shows the percentage of frames with an “identifiable relationship” label among all “kiss” frames.

Fig. 10: Example frames from PA-HMDB51 used in the human study.
Fig. 11: Privacy attribute prediction on selected frames from UCF101 and HMDB51 (panels: “ApplyLipStick,” “HandStand,” “Kiss,” “Pullup,” “Pushup,” “ShavingBeard,” “Sit,” and “Sit-up”). In each example, the overlaid red text lines denote the privacy attributes (as defined in the VISPR dataset [14]) predicted by the privacy prediction model pretrained on VISPR, showing a high risk of privacy leakage in videos recording daily activities. The common privacy attributes in daily activities include “approximate age,” “approximate weight,” “hair color,” “skin color,” “partial face,” “complete face,” “race,” “semi-nudity,” “gender,” “personal relationship,” and so on.
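As a rough sketch of the kind of per-frame, multi-label prediction visualized in Figure 11, the snippet below runs a generic, randomly initialized backbone with a sigmoid multi-label head over a single frame; it is only a stand-in for the VISPR-pretrained privacy predictor, and the attribute names and the 0.5 threshold are assumptions for illustration.

```python
# Illustrative sketch: per-frame multi-label privacy attribute prediction.
import torch
import torch.nn as nn
from torchvision import models

attributes = ["semi-nudity", "gender", "partial face", "skin color", "relationship"]

backbone = models.resnet18()  # randomly initialized generic backbone (not VISPR-pretrained)
backbone.fc = nn.Linear(backbone.fc.in_features, len(attributes))  # multi-label head
backbone.eval()

frame = torch.randn(1, 3, 224, 224)  # a dummy preprocessed video frame
with torch.no_grad():
    scores = torch.sigmoid(backbone(frame))[0]  # independent per-attribute scores

predicted = [a for a, s in zip(attributes, scores) if s > 0.5]
print("predicted privacy attributes:", predicted)
```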