LPFS: Learnable Polarizing Feature Selection for Click-Through Rate Prediction

Yi Guo∗ (yguocuhk@gmail.com), Zhaocheng Liu∗ (lio.h.zen@gmail.com), Jianchao Tan (jianchaotan@kuaishou.com), Chao Liao (liaochao@kuaishou.com), Sen Yang (senyang.nlpr@gmail.com), Lei Yuan (lyuan0388@gmail.com), Dongying Kong (kongdongying@kuaishou.com), Zhi Chen (chenzhi07@kuaishou.com), Ji Liu (ji.liu.uwisc@gmail.com)
KuaiShou Technology, Beijing, China
∗ Both authors contributed equally to this work.

arXiv:2206.00267v2 [cs.IR] 3 Sep 2022
ABSTRACT
In industry, feature selection is a standard but necessary step to search for an optimal set of informative feature fields for efficient and effective training of deep Click-Through Rate (CTR) models. Most previous works measure the importance of feature fields by using their corresponding continuous weights from the model, then remove the feature fields with small weight values. However, removing many features that correspond to small but not exactly zero weights will inevitably hurt model performance and is not friendly to hot-start model training. There is also no theoretical guarantee that the magnitude of weights can represent the importance, thus possibly leading to sub-optimal results if using these methods. To tackle this problem, we propose a novel Learnable Polarizing Feature Selection (LPFS) method using a smoothed-ℓ0 function from the literature. Furthermore, we extend LPFS to LPFS++ by our newly designed smoothed-ℓ0-liked function to select a more informative subset of features. LPFS and LPFS++ can be used as gates inserted at the input of the deep network to control the active and inactive state of each feature. When training is finished, some gates are exactly zero, while others are around one, which is particularly favored by the practical hot-start training in the industry, since there is no damage to the model performance before and after removing the features corresponding to exact-zero gates. Experiments show that our methods outperform others by a clear margin, and have achieved great A/B test results in KuaiShou Technology.

KEYWORDS
Click-Through Rate Prediction, Feature Selection, Smoothed-L0

ACM Reference Format:
Yi Guo, Zhaocheng Liu, Jianchao Tan, Chao Liao, Sen Yang, Lei Yuan, Dongying Kong, Zhi Chen, and Ji Liu. 2022. LPFS: Learnable Polarizing Feature Selection for Click-Through Rate Prediction. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Conference'17, July 2017, Washington, DC, USA
© 2022 Association for Computing Machinery. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
Click-Through Rate (CTR) prediction, which aims to estimate the probability of a user clicking on an item, has become a crucial task in industrial applications, such as personalized recommendations and online advertising [28, 38, 39]. In recent years, significant progress has been made due to the development of deep learning [8, 10, 16, 31, 35]; however, these deep models still require an effective set of feature fields as input. In industry, to accurately characterize user preferences, item characteristics, and contextual environments from different aspects, extensive feature engineering is generally essential for model training, which results in hundreds of feature fields in real-world datasets.
However, we can only feed a subset of feature fields instead of all feature fields into models for effective and efficient training. On the one hand, prior work [1, 30] points out that the subtle dependencies among relevant features and irrelevant features may inflate the error of parameter estimation, resulting in instability of prediction results. On the other hand, the online feature generation process consumes significant computing and storage resources. Therefore, how to search for an optimal set of effective feature fields from the real-world dataset is the core concern of both academia and industry, considering both effectiveness and efficiency. Many feature selection methods have emerged in the past few decades. Some methods [2, 9, 18, 19, 21, 36] are proposed to do feature selection for the Logistic Regression (LR) model, which is the classical CTR prediction model. For modern deep learning-based CTR prediction models, learning-based feature selection approaches have attracted much attention.

Specifically, COLD [33] applies the Squeeze-and-Excitation (SE) block [12] to get the importance weights of features and select the most suitable ones. Some prior work [15] proposes to select informative features via LASSO, while FSCD [20] does so via a Gumbel-Softmax-liked sampling method. More recently, UMEC [29] treats feature selection as a constrained optimization problem. Besides, permutation-based feature importance [25] measures the increase in the prediction error of the model after the feature's values are permuted, and has also been widely applied in real-world feature selection. Despite the significant progress made with these methods, some challenges demand further exploration:
(1) Getting rid of ad-hoc thresholding. In general, previous methods output a continuous distribution of importance weights for feature fields to represent the feature importance. However, permutation-based, gumbel-softmax-based, or SE-based approaches cannot output exact-zero importance weights. They all need to prune from a skewed yet continuous histogram of importance weights, which makes the selection of the pruning threshold critical yet ad-hoc, taking Figure 1 for example.

(a) Distribution of weights norm for LASSO   (b) Distribution of gate values for LPFS
Figure 1: The density distribution for group LASSO with proximal-SGD optimization (left) and for LPFS (right). The group LASSO method is implemented on the first layer of the network, and proximal SGD is applied. The LASSO method outputs a continuous distribution, and we need to choose a small threshold to determine whether to remove or keep the features. In contrast, our LPFS method outputs a polarized distribution: some gate values are exactly zero, while the others are distributed around 1. We can remove the feature fields with exact-zero gates, and absorb the non-zero gates into the embedding or the weight of the first fully connected layer, so as to cause no damage to the model and be friendly to hot-start training. Although there are also exact-zero gates when using LASSO, the values of many non-zero gates are very close to zero.

(2) Hot-start-friendly. Empirically, feature selection usually needs a large range of data to train and generate confident feature selection results. To save training cost, hot-start training is widely adopted in practice, which aims to inherit from the trained model and reduce the number of repetitions of training. The methods mentioned above remove many features with small but not exactly zero importance weights. This step inevitably causes damage to the model's performance. Even for LASSO with the proximal optimization method, which can output exact zero weights, the remaining weights are very small and very close to zero, which is unfriendly to hot-start. Last but not least, there is no theoretical guarantee that the magnitude of weights can represent the importance, leading to a suboptimal feature subset selected by these methods.
Inspired by the study of smoothed-ℓ0 [7, 23, 24, 34] in compressed sensing, we consider feature selection as an optimization problem under an ℓ0-norm constraint, and propose a novel Learnable Polarizing Feature Selection (LPFS) method to effectively select highly informative features. We insert such differentiable function-based gates between the embedding and the input layer of the network to control the active and inactive state of these features. When training is finished, some gate values are exactly zero, while others are distributed around one (see Figure 1). Then, we can remove feature fields with zero gates, and absorb the non-zero gates into the embeddings or the weights of the first fully connected layer in the network. In this way, we can get rid of threshold choosing, and there is no impact at all on the model performance before and after the physical removal, which is also very friendly to hot-start training.
Furthermore, each feature is unique, although features may sometimes be correlated; if the gate of a feature becomes zero accidentally, other features may not fully compensate for it. The model therefore needs the ability to bring this feature back to the active state. However, the derivative at x = 0 of all smoothed-ℓ0 functions proposed by previous works is zero, which makes it impossible for gradient-based optimization methods to make a zero gate become non-zero again. We are motivated to propose LPFS++ with a newly designed smoothed-ℓ0-liked function to alleviate this problem.
We conduct extensive experiments to verify the effectiveness of the two proposed approaches on both public datasets and large-scale industrial datasets. Experimental results demonstrate that LPFS++ outperforms LPFS, and both of them outperform previous approaches by a significant margin. Our approaches have also been deployed on the offline Kuaishou distributed training platform, have been exploited in different scenarios in Kuaishou to do feature selection for their CTR prediction models, and have achieved significant A/B testing results.
We summarize the contributions of the proposed methods below:
• We are pioneers in applying the idea of the smoothed-ℓ0 gate to feature selection for CTR prediction, and we propose a novel feature selection method by utilizing the closed-form proximal SGD update for gate parameters, which outputs a polarized distribution of feature importance scores and is naturally friendly to hot-start training.
• We extend the idea of smoothed-ℓ0 and propose LPFS++, which is based on a novel smoothed-ℓ0-liked function and can select features more robustly.
• Experiments show that both proposed methods outperform other feature selection methods by a clear margin, and they have achieved great benefits in KuaiShou Technology.

2 RELATED WORKS
2.1 Smoothed-ℓ0 Optimization
The idea of smoothed-ℓ0 optimization was first proposed in [22, 24] to obtain sparse solutions for under-determined systems of linear equations.
The original optimization objective is $\min_x \|x\|_0$ s.t. $Ax = y$, where $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$ and $m < n$. This is intractable, and they turn around to optimize $\min_x f_\epsilon(x) = 1 - \exp\left(-\frac{x^2}{2\epsilon^2}\right)$ s.t. $Ax = y$. When ε is large, $f_\epsilon(x)$ is smooth, while when ε is small enough, the function can approximate the ℓ0-norm arbitrarily well. So, they solve it iteratively, gradually decaying ε. Their experiments show that the method is much faster than ℓ1-based methods, while achieving the same or better accuracy. Follow-up work [23] studied the convergence properties of the above smoothed-ℓ0 and found that under some mild constraints, convergence is guaranteed. Afterwards, various smoothed-ℓ0 functions have been proposed and studied for compressed sensing, such as $f_\epsilon(x) = \sin\left(\arctan\left(\frac{|x|}{\epsilon}\right)\right)$ [32], $f_\epsilon(x) = \tanh\left(\frac{x^2}{2\epsilon^2}\right)$ [37] and $f_\epsilon(x) = \frac{x^2}{x^2+\epsilon^2}$ [34]. All these functions have the following important property:

$$\lim_{\epsilon \to 0} g_\epsilon(x) = \begin{cases} 0, & x = 0 \\ 1, & x \neq 0 \end{cases} \qquad (1)$$

and can approximate the ℓ0-norm well. In this paper, in addition to directly applying the existing smoothed-ℓ0 function as a gate to control feature selection, we also propose a newly designed smoothed-ℓ0-liked function for feature selection. We use the term "liked" here because what we propose is an odd function, not an even function like all the above functions.

2.2 Feature Selection
Feature selection is a key component of CTR prediction, and various methods have been proposed in this area. [2, 9, 18, 19, 21, 36] are proposed to do feature selection for the Logistic Regression (LR) model, which is the classical CTR prediction model. In the area of deep learning, COLD [33] uses the Squeeze-and-Excitation (SE) block [12] to measure the importance of features, while FSCD [20] uses gumbel-softmax/sigmoid [13]. These two works both output a continuous importance distribution via the softmax or sigmoid function, and the importance scores cannot become exactly zero. LASSO is also a common feature selection method in the industry [15]. With the proximal SGD [27] algorithm, the LASSO method can push a portion of parameters to exact zeros, but their distribution is also continuous. All these methods require choosing a threshold to truncate the distribution. Recently, [29] uses the alternating direction method of multipliers (ADMM) optimization method to select features, but we find it a little hard to compress a very large feature set into a very small one. In addition to these learning-based approaches, feature permutation [25] is a common method in the industry. It first trains a model with all the features, then keeps the other features unchanged, permutes the input data along each feature axis randomly one by one, and uses the magnitude of the drop in performance as the metric of the importance of the features. This greedy method is easy to implement and does not require parameter tuning, but it ignores the possible correlation between features, and some features may compensate for each other.
Besides direct feature selection, there are also some works on the interaction between features, such as AutoCross [36], Autofis [17], and Autoint [31]. In essence, we can also regard the cross feature as a common feature, add it to the feature superset, and perform a general feature selection to discover which features should be interacted with. We will study it through an experiment in Section 4.4.

[Figure 2 diagram: continuous features → fully connected layers → dense representation; categorical embeddings; gate element; multiplication.]
Figure 2: This figure takes DLRM [26] on the Terabyte dataset as an example to show where we insert the gate vector. In this experiment, the 13 continuous features are transformed by a dense-MLP to a dense representation; we treat the dense representation as a feature for the subsequent feature selection. Then the dense representation and the 26 categorical embeddings are concatenated as 27 features. We multiply the 27 features by a gate vector with 27 elements, where each element controls the liveness of each feature. The multiplication results are the new input to the top-MLP in DLRM.

3 METHODS
3.1 Problem Formulation
For most deep-learning-based CTR prediction models, the input data is usually collected in continuous and categorical forms. The continuous part can be fed into the network directly, while the categorical part is often pre-processed by mapping each category into a dense representation space [3], also known as embeddings. The embeddings and continuous features are then fed into the deep network. For simplicity of description, we only describe the feature selection of categorical features in this paper. Suppose there are $N$ feature fields concatenated as input $\boldsymbol{e} = [\boldsymbol{e}_1, \boldsymbol{e}_2, ..., \boldsymbol{e}_N]$, where each $\boldsymbol{e}_i$, $i = 1, 2, ..., N$ is an embedding vector for the $i$-th field, and we want to select a most informative subset from them. The network can be formulated as a function:

$$y = f(w; \boldsymbol{e}) \qquad (2)$$

where $w$ is all the network parameters and $y$ is the output of the network. To select features, we can insert a vector-valued function $\boldsymbol{g}(x) = [g(x_1), g(x_2), ..., g(x_N)]$ as a gate before the input of the network, where each $x_i$, $i = 1, 2, ..., N$ can be either a newly introduced learnable scalar parameter, or the weights norm from the first fully connected layer, or even embedding information in some works [15]. Then the input can be viewed as $\tilde{\boldsymbol{e}} = [g(x_1)\boldsymbol{e}_1, g(x_2)\boldsymbol{e}_2, ..., g(x_N)\boldsymbol{e}_N]$. In this way, the network becomes:

$$y = f(w; \boldsymbol{g}(x)\boldsymbol{e}) \qquad (3)$$

As shown in Figure 2, during training, the gates, embeddings, and all other network parameters are trained together, and we also impose some penalties on the gate parameters to implement feature selection. When training terminates, the features with zero gates can be viewed as irrelevant features and can be removed, while the others will be kept. Under this formulation, the key to feature selection becomes how to choose a good gate function.
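To make this formulation concrete, the following is a minimal PyTorch sketch (not the authors' production code; the field count, embedding size, and function names are illustrative) of how a gate vector is applied to the concatenated field embeddings as in Eq. (3) and Figure 2.

import torch

N, dim = 27, 16                               # illustrative number of fields and embedding size
x = torch.nn.Parameter(torch.ones(N))         # one learnable gate parameter per feature field

def gated_input(field_embeddings, gate_fn, x):
    # field_embeddings: list of N tensors of shape (batch, dim)
    # gate_fn: a scalar gate such as the smoothed-l0 function of Eq. (4)
    gates = gate_fn(x)                                        # shape (N,)
    gated = [g * e for g, e in zip(gates, field_embeddings)]  # e_i scaled by g(x_i)
    return torch.cat(gated, dim=1)                            # input to the top-MLP, shape (batch, N*dim)

# example usage with the LPFS gate of Eq. (4), epsilon = 0.1 as in Section 4.9:
emb = [torch.randn(8, dim) for _ in range(N)]
out = gated_input(emb, lambda t: t * t / (t * t + 0.1), x)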
Previous works [15, 20] use LASSO or the gumbel-softmax trick [13] as the gate function, but these kinds of methods often output a continuous distribution of gates and then remove many feature fields with small but not exactly zero gates. We do not think this is a good enough approach, and it is unfriendly to hot-start training.

3.2 LPFS
From the perspective of optimization, feature selection can be essentially formulated as an optimization problem under an ℓ0-norm constraint. Since it is NP-hard to solve directly, some previous works get around it by relaxing ℓ0 to ℓ1, which is the convex function closest to ℓ0. Inspired by the research field of smoothed-ℓ0 optimization, we use the following smoothed-ℓ0 function [34] as our gate for LPFS:

$$g_\epsilon(x) = \frac{x^2}{x^2+\epsilon} \qquad (4)$$

This function has the following property:

$$g_\epsilon(x) \begin{cases} = 0, & x = 0 \\ \approx 1, & x \neq 0 \end{cases} \qquad (5)$$

The derivative of $g_\epsilon(x)$ w.r.t. $x$ is:

$$g_\epsilon'(x) = \frac{2x\epsilon}{(x^2+\epsilon)^2} \qquad (6)$$

where we choose $x$ here to be newly introduced learnable parameters, and ε is a small positive number. This is a quasi-convex function between the ℓ0 and ℓ1 functions, as shown in Figure 3. From the figure, we can see that when ε is large, $g_\epsilon(x)$ is smooth and continuously differentiable, and that when ε decays small enough, $g_\epsilon(x)$ can be infinitely close to ℓ0. Different from Gumbel-softmax/sigmoid with decreasing temperature, function (4) can be exactly zero even when ε is not small enough. And different from LASSO with proximal algorithms, function (4) outputs a polarized distribution with some gate values being exactly zero, while others have large margins from zero. These are two very nice properties for a hot start, as when we remove features with exactly zero gates, the performance of the model remains unchanged.

(a) ℓ0, ℓ1 and smoothed-ℓ0 $g_\epsilon(x)$   (b) smoothed-ℓ0 $g_\epsilon'$ with different ε
Figure 3: The graph of ℓ0, ℓ1 and smoothed-ℓ0 $g_\epsilon(x)$ with different ε (left) and $g_\epsilon'$ (right). $g_\epsilon(x)$ is a quasi-convex function between the ℓ0 and ℓ1 functions. When ε is large, $g_\epsilon(x)$ is smooth and continuously differentiable; when ε is small enough, $g_\epsilon(x)$ can be infinitely close to ℓ0. As ε approaches zero, $g_\epsilon'(x)$ is zero only at the points where $g_\epsilon(x)$ reaches exact zero or is near one, otherwise it blows up. Note that $g_\epsilon'(x)|_{x=0}$ is exactly zero whatever ε is.

3.3 LPFS++
Although LPFS has achieved great performance both on open benchmarks and in KuaiShou's online A/B tests (as shown in the experiment section), there is still room for improvement in the task of feature selection. GDP [11] applies the idea of smoothed-ℓ0 to channel pruning in Computer Vision, but feature selection is quite different from channel pruning in vision in two ways:
1. Channel pruning is essentially searching for the combination of the number of neurons for each layer. It only cares about the number of neurons in a certain layer, rather than which neurons. However, feature selection is different. We need to obtain a subset of features. In addition to caring about how many features this subset contains, it is more important to care about which features.
2. In the industry, user behavior is changing slowly. While mainstream datasets, such as ImageNet [4], are static, user data flows dynamically. This phenomenon requires the model to be more robust in feature selection than in channel pruning.
Mathematically, what function (4) needs to improve for feature selection is that its derivative at $x = 0$ is 0, as is easy to see in Eq. (6). In this situation, the derivative of the output $y$ of the network w.r.t. $x$, from Eq. (3), is

$$\left.\frac{\partial y}{\partial x}\right|_{x=0} = f_2'(w; \boldsymbol{g}_\epsilon(x)\boldsymbol{e})\,\boldsymbol{e}\,g_\epsilon'(x)|_{x=0} \qquad (7)$$

where $f_2'(a; b) = \frac{\partial f}{\partial b}$, the subscript 2 denoting the partial derivative w.r.t. the second argument. The problem Eq. (7) leads to is that, whenever some outlier samples or a change in user behavior causes a gate value to go to zero accidentally, the corresponding feature can never be resurrected by gradient-based optimization methods, and it cannot be compensated by other features. One solution is to add some fading random noise, but we want to solve this from the gate function itself. We need to construct a gate function that satisfies properties similar to the smoothed-ℓ0 function, and whose derivative is not zero at $x = 0$. One heuristic optional solution is:

$$g_{\epsilon++}(x) = \begin{cases} \dfrac{x^2}{x^2+\epsilon} + \alpha\epsilon^{1/\tau}\arctan(x), & x \geq 0 \\[2mm] -\dfrac{x^2}{x^2+\epsilon} + \alpha\epsilon^{1/\tau}\arctan(x), & x < 0 \end{cases} \qquad (8)$$

Although it is a piece-wise function, it is also smooth and continuously differentiable. The derivative of $g_{\epsilon++}(x)$ w.r.t. $x$ is

$$g_{\epsilon++}'(x) = \frac{2|x|\epsilon}{(x^2+\epsilon)^2} + \frac{\alpha\epsilon^{1/\tau}}{x^2+1} \qquad (9)$$

where $x$ and ε have the same meaning as in Eq. (4), and α is a constant hyper-parameter balancing the two terms. The exponential factor $1/\tau$ controls the decay rate of $g_{\epsilon++}'(x=0)$ with respect to ε. This variable is pretty robust; thus we have taken τ = 2 in all of the previous private experiments in KuaiShou Technology, which worked very well. Also, the arctangent function is not essential, because what we need is just an odd function whose derivative at $x = 0$ is not zero, and whose value tends to a constant for large $x$. There are many other functions that can satisfy these properties, and might work better, which could be future work. The graph is shown in Figure 4.
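For reference, here is a small sketch of the LPFS gate of Eq. (4) and the LPFS++ gate of Eq. (8) (without the initial-value normalization described in Section 4.9), together with their gradients at x = 0; the numeric values of α, τ and ε are only illustrative.

import torch

def g_lpfs(x, eps):
    # smoothed-l0 gate of Eq. (4): exactly 0 at x = 0, close to 1 elsewhere
    return x * x / (x * x + eps)

def g_lpfs_pp(x, eps, alpha=100.0, tau=2.0):
    # LPFS++ gate of Eq. (8): odd function whose derivative at 0 is alpha * eps**(1/tau)
    g0 = g_lpfs(x, eps)
    return torch.where(x >= 0, g0, -g0) + alpha * eps ** (1.0 / tau) * torch.atan(x)

x = torch.zeros(1, requires_grad=True)
for g in (g_lpfs, g_lpfs_pp):
    y = g(x, eps=0.1).sum()
    y.backward()
    print(g.__name__, x.grad.item())   # 0.0 for g_lpfs; alpha*eps**(1/tau) ~ 31.6 for g_lpfs_pp
    x.grad = None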
(a) ℓ0 and $g_{\epsilon++}(x)$   (b) $g_{\epsilon++}'$ with different ε
Figure 4: The graph of ℓ0 and $g_{\epsilon++}(x)$ with different ε (left) and $g_{\epsilon++}'$ (right). Now $g_{\epsilon++}(x)$ is odd, not even like the smoothed-ℓ0 functions. Note that the gradient at $x = 0$ is no longer zero, but a small number that decays with ε.

Note that, unlike all smoothed-ℓ0 functions, function (8) is an odd function rather than an even function. This is a necessary consequence, because the derivative of a continuously differentiable even function at $x = 0$ must be 0. It is fine to use an odd function as the gate, because we can absorb the negative signs of the gate values into the corresponding embeddings or the first fully connected layer of the network. That is, in Eq. (3) the equality $\boldsymbol{g}(x)\boldsymbol{e} = \mathrm{abs}(\boldsymbol{g}(x))\,\mathrm{sign}(\boldsymbol{g}(x))\,\boldsymbol{e} = \mathrm{abs}(\boldsymbol{g}(x))\,\boldsymbol{e}'$ holds, where $\boldsymbol{e}' = \mathrm{sign}(\boldsymbol{g}(x))\boldsymbol{e}$ is the final embedding for downstream tasks. Now we have $g_{\epsilon++}'(x=0) = \alpha\epsilon^{1/\tau}$, which is a small number that decays with ε. Note that $g_{\epsilon++}'(x=0) \neq 0$ does not mean $\left.\frac{\partial y}{\partial x}\right|_{x=0} \neq 0$, because of the existence of $f_2'(w; \boldsymbol{g}_\epsilon(x)\boldsymbol{e})$ in Eq. (7). During offline feature selection, ε is initialized to a large value so that $g_{\epsilon++}'(x=0)$ is robust enough to outlier samples and the slow change of user behavior. As training goes on, $g_{\epsilon++}'(x=0)$ also decays as ε decays, so that the feature superset can be stably divided into non-informative and informative subsets when ε decays to be small enough.

3.4 Optimization
For both LPFS and LPFS++, we optimize the following objective function:

$$\min_{w,x} \mathcal{F}(w; x) = \mathcal{L}(\hat{y}; f(w; \boldsymbol{g}(x)\boldsymbol{e})) + \lambda\|x\|_1 \qquad (10)$$

where $f(w; \boldsymbol{g}(x)\boldsymbol{e})$ represents the network, the same as in Eq. (3); $\hat{y}$ is the ground truth (i.e., whether the user clicks the item or not); $\mathcal{L}(\cdot;\cdot)$ is the loss function between the ground truth and the model prediction; $\|\cdot\|_1$ is the ℓ1-norm; and λ is the balance factor. $w$ and $\boldsymbol{e}$ in $\mathcal{L}(\hat{y}; f(w; \boldsymbol{g}(x)\boldsymbol{e}))$ are updated by the Adam [14] or Adagrad [6] optimizer, the same as the baseline. And $x$ is updated by proximal-SGD, which is equivalent to solving $\min_x \{\frac{1}{2\eta}\|x - \tilde{x}_t\|^2 + \lambda\|x\|_1\}$, whose closed-form solution is:

$$x^{t+1} = \begin{cases} \tilde{x}_t - \lambda\eta, & \tilde{x}_t \geq \lambda\eta \\ 0, & -\lambda\eta < \tilde{x}_t < \lambda\eta \\ \tilde{x}_t + \lambda\eta, & \tilde{x}_t \leq -\lambda\eta \end{cases} \qquad (11)$$

where η is the learning rate for $x$, and $\tilde{x}_t$ is the value of $x$ after one step of updating $\mathcal{L}(\hat{y}; f(w; \boldsymbol{g}(x)\boldsymbol{e}))$ with the Momentum optimizer. The core code for updating $x$ is shown in the supplementary material.
It is worth pointing out that we use the ℓ1-norm instead of the ℓ0-norm to regularize $x$ in Eq. (10). In fact, the polarization effect (the ℓ0 property) that we are looking for is derived from the smoothed-ℓ0 gate in Eq. (4), not from the ℓ1 regularization on $x$; the ℓ1 regularization is only to penalize $x$ so that the smoothed-ℓ0 gate becomes polarized like ℓ0.
Moreover, for ℓ0 regularization, the problem $\min_x \{\frac{1}{2\eta}\|x - \tilde{x}_t\|^2 + \lambda\|x\|_0\}$ indeed has closed-form solutions:

$$x^{t+1} = \begin{cases} 0, & |\tilde{x}_t| < \sqrt{2\lambda\eta} \\ \tilde{x}_t, & |\tilde{x}_t| > \sqrt{2\lambda\eta} \\ 0 \text{ or } \tilde{x}_t, & |\tilde{x}_t| = \sqrt{2\lambda\eta} \end{cases} \qquad (12)$$

but we find it is very sensitive to the initial value of $x$ in Eq. (10) in our experiments. If we initialize $x$ to be greater than $\sqrt{2\lambda\eta}$ accidentally, it can hardly be penalized during the whole training process under the second case of Eq. (12); otherwise it becomes zero very quickly under the first case of Eq. (12). Additionally, in each SGD iteration, $\tilde{x}_t$ is not necessarily optimal, so it does not need to be directly pushed to zero by Eq. (12). In contrast, with Eq. (11), $x$ will be penalized by a small step λη whatever its initial value is.


4 EXPERIMENTS
In this section, we demonstrate the superiority of our method through three experiments. We mainly describe the core part of the experiments here, and training details such as hyper-parameter settings are left to the end of this paper.

4.1 Datasets
To show that our method can filter out the highly informative features, we conducted experiments on the large-scale Criteo AI Labs Ad Terabyte dataset (https://labs.criteo.com/2013/12/download-terabyte-click-logs/) and an industrial dataset in KuaiShou. The Terabyte dataset contains 26 categorical features and 13 continuous features, with about 4.4 billion click log samples over 24 days. Similar to DLRM [26], for negative samples, we randomly select 12.5% on each day. To be fair, for all the public methods and our method, we pre-train the model on "day 0 ∼ 17", and select features on "day 18 ∼ 22", during which all the model parameters, including the newly introduced parameters for FSCD [20] and our methods, are trained together for the learnable methods. When the feature subsets are selected by these methods, the models are then trained from scratch on "day 0 ∼ 22" using each subset, evaluated on "day 23", and the best AUC calculated by the official DLRM [26] code (https://github.com/facebookresearch/dlrm) is reported.
The industrial dataset is collected from the Mobile Kuaishou App; we select 9 days for offline feature selection, and take 10% of the negative samples at random. All the positive and negative click log samples add up to nearly 1.2 billion over the 9 days. This dataset contains 250 user categorical feature fields, 46 item categorical feature fields, 96 combine categorical feature fields, and 25 continuous features. Similar to the configuration for Terabyte, we pre-train the model on "day 1 ∼ 6", and select features on "day 7 ∼ 8", during which all the parameters are trained together. When the feature subsets are obtained, the models are trained by cold start on "day 1 ∼ 8" using these subsets, then evaluated on "day 9".

4.2 Network Settings
For the public Terabyte dataset, we made some minor modifications to DLRM [26] to serve as our baseline network. The 13 continuous features in DLRM are transformed by an MLP to a dense representation with the same length as the embeddings. Then we treat this dense representation as an embedding, with the same status as the other 26 embedding features. In this way, we have 27 features in the feature superset. Since 27 features are still very few and the subsets of features selected by various methods are similar, we also conducted another experiment with all the crossed features in addition to this direct selection, which will be explained in the following subsections. The same as [29], we set the hidden dimensions of the three-layer MLP prediction model as 256 and 128 for both the crossed and non-crossed versions.
For the industrial dataset, every categorical feature is 16-dimensional, while different dense features have different dimensions, ranging from 1 to 128. Every user feature field and every item feature field are crossed by element-wise multiplication. Then the crossed features, combine features, and continuous features are concatenated as the input of the network. The network contains a shared bottom layer, then some auxiliary branches for different auxiliary tasks. We focus on one main branch.

4.3 Terabyte Without Cross Features
In this experiment, we treat the 27 features (including a dense representation transformed from the 13 dense features, as described in the last subsection) as the feature superset for feature selection. Each feature is 16-dimensional. These 27 features are multiplied by the corresponding gates, then concatenated as a vector and fed into the three-layer MLP prediction branch (called the top-MLP in the original paper), as shown in Figure 2. We compared our method with FSCD [20], UMEC [29], feature permutation, and group LASSO. For UMEC, all the hyper-parameters are set as the paper reported; for FSCD, we set all the regularization weights to the same values; for group LASSO, we regularize the weights of the first fully connected layer in the top-MLP and train them by proximal-SGD; for feature permutation, we randomly shuffle the feature to be evaluated within a mini-batch.
The result is shown in Figure 5. When only a small number of features are supposed to be removed, we find in experiments that all these methods select almost the same subsets, so there is little difference in the AUC obtained by these methods (look at the upper right corner of Figure 5). But when we want to remove about a half, where the number of combinations becomes much larger ($C_{27}^{14} \approx 2 \times 10^7$), our method can cope with this challenge well, so as to select the most informative subset (look at the left half of Figure 5). The feature subset our method selects is attached in the supplementary material for future reference. Although we compare AUC rather than ACC here for UMEC, the ACC for UMEC in our experiment is still higher than the result in the original paper under the same number of remaining features. In the following experiments on larger feature sets, we do not compare with UMEC, partly because of its poor performance here and partly because it is difficult to compress to a very small subset.

Figure 5: Performance comparison of our method with other methods. Where the abscissa is large (i.e., only a few features are removed), there is little difference between all these methods. We find in experiments that the feature subsets selected by these methods are almost the same. However, where the abscissa is small (i.e., around half of the features are removed), our method has clear superiority over other methods. The baseline for 27 features is best AUC 0.797948, best ACC 0.811079, best Loss 0.423315.

4.4 Terabyte With Cross Features
In this experiment, we take the experiment a step further, partly because 27 features are so few, and partly because we want to show that our method can easily be applied to study feature interactions. Every two embeddings corresponding to two different feature fields are crossed by element-wise multiplication, then the original features and the crossed features are concatenated as the input of the top-MLP of DLRM. So, the total number of features can be viewed as $27 + C_{27}^2 = 378$, and we treat these 378 features equally to select from this much larger superset. Thus, if an original feature should be removed, the features that interact with it are not necessarily removed. Except for the cross features and the resulting larger number of input units for the first fully connected layer of the top-MLP of DLRM, the other experiment configurations are the same as in the last subsection. In this way, we can select both first- and second-order features. The result is shown in Figure 6. Compared with previous works [17, 36], our methods can also be used to select higher-order features effectively.
Figure 6: Performance comparison for Terabyte with cross features. Every two different feature embeddings are crossed by element-wise multiplication. Then the original first-order features and the crossed second-order features are concatenated as the input of the top-MLP of DLRM. So, the total number of features can be viewed as $27 + C_{27}^2 = 378$. We can see that our methods, LPFS and LPFS++, have many advantages over other methods, especially when the number of remaining features is small. It can be seen that our methods can also be used to mine higher-order features.

4.5 Industrial Dataset
In this experiment, we apply our methods to a large-scale industrial dataset. All the 392 categorical feature fields (250 user, 46 item, 96 combined) mentioned above are treated equally for the feature selection, although the network does not treat these features equally. Due to the existence of cross features between user and item, even if the same number of features is kept, the computational amount may not be the same. Still, we only care about the feature subsets, rather than the computation cost. Different from experiment 4.4, we insert gates immediately after the embeddings, rather than after the crossed features. So, when a feature is supposed to be removed, all cross features that interact with it will also be removed. The result is shown in Figure 7. As can be seen from the figure, our method has more obvious advantages on industrial large-scale datasets and challenging, complex network structures.

Figure 7: Performance comparison of our methods with other methods on the KuaiShou industrial dataset. The total number of feature fields is 392 (250 user, 46 item, 96 combine), and we treat all these features equally. In the network, every user feature and every item feature are interacted by element-wise multiplication, and there are five auxiliary tasks to help improve the performance of the main task. Our LPFS++ has more obvious advantages on industrial large-scale datasets and challenging network structures.

4.6 Ablation Studies and Experimental Analysis
The influence of ε's decay rate on performance has been studied by [11] for channel pruning; similarly, the decay rate of ε in (4) and (8) is not very important for offline feature selection, and our experiments found that a final value of ε between 1e−4 and 1e−8 has no significant effect on the performance of the model. The key component of the gate function (8) for LPFS++ is the second, arctangent part and its balancing factor α. For simplicity, we will focus on α.
By Function (9), since the derivative at $x = 0$ is $g_{\epsilon++}'(x=0) = \alpha\epsilon^{1/\tau}$, we should fix the schedule of ε first before studying α. In the industrial experiments, we decay ε by 0.986 every 500 steps, and the minimal value is 1.0e−4. As shown in Figure 8, in general, the larger the α value, the more informative the selected feature subset. This is to be expected: a larger value of α results in a larger value of the derivative $g_{\epsilon++}'(x=0) = \alpha\epsilon^{1/\tau}$ and thus a greater fault tolerance of the model. We also find in experiments that a larger α results in a smaller number of features being removed when the other hyper-parameters are kept unchanged. Therefore, in order to select roughly the same number of features, when increasing the value of α, the value of λ in Function (10) must also be increased accordingly, as shown in Table 1. This is also to be expected, because a larger value of α makes the model more resistant to compression.

Figure 8: Ablation study for different α. We can see that, in general, the larger the α value, the more informative the selected feature subset.

α        10      50      100     200
λ        0.004   0.0066  0.013   0.02
# feats  109     119     106     101
Table 1: The relationship between α and λ when about the same number of features is to be kept. "# feats" means the number of remaining features. We can see that when we increase α, λ should also be increased if we want to retain about the same number of features.
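As an illustration of this interplay, the short sketch below (illustrative values only) traces how the revival gradient g'_{ε++}(0) = αε^{1/τ} shrinks under the industrial schedule of decaying ε by 0.986 every 500 steps with a floor of 1e−4.

alpha, tau = 100.0, 2.0            # alpha from the range in Table 1, tau = 2 as in Section 3.3
eps, eps_min = 0.1, 1e-4           # initial epsilon and its floor

for step in range(1, 200001):      # the number of steps is illustrative
    if step % 500 == 0:
        eps = max(eps * 0.986, eps_min)
revival_grad = alpha * eps ** (1.0 / tau)   # g'_{eps++}(0): large early in training, ~1.0 once eps reaches 1e-4
print(eps, revival_grad)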
Figure 9 shows how the number of features with zero gates changes over the training steps, and it contains a wealth of interesting information. First of all, let us look at the beginning of training, where zero gates first appear (i.e., the ordinate of the curve becomes non-zero for the first time). In our experiments, all the $x$ in Functions (4) and (8) are initialized to 1.0. So, at the beginning of training, the first case (i.e., $\tilde{x}_t \geq \lambda\eta$) in (11) dominates, and $x$ is reduced by λη at each step. Then, as training goes on, $x$ gradually becomes zero, and the speed at which it becomes zero is related to the value of λη. In Figure 9, we tune α and λ so that the final number of removed features is about the same, and it shows that zero gates occur earlier in the model with larger λ and α. Second, in the middle of training, as the value of α increases, the oscillation of the curve also increases. This is also as expected, since increasing α increases $g_{\epsilon++}'(x=0) = \alpha\epsilon^{1/\tau}$, giving more features a chance to revive. Third, near the end of training, all the curves gradually stop oscillating, because $g_{\epsilon++}'(x=0) = \alpha\epsilon^{1/\tau}$ and the learning rate are decaying.

Figure 9: The number of features with zero gates over training steps. The X-axis represents training steps (we record every 100 steps, so the total number of training steps is about 250000), while the Y-axis represents the number of features with zero gates.

4.7 Online Experiments
The proposed LPFS and LPFS++ have been successfully deployed on the offline Kuaishou distributed training platform and exploited by dozens of advertising scenarios in Kuaishou. To show the effectiveness of our proposed methods, we choose our first launch as an example, in which nothing changed except for the input feature fields. The dataset is collected from the KuaiShou App, and the feature superset contains 332 feature fields due to the extensive feature engineering, from which we select 177 features for the subsequent experiments. We found that the feature subsets selected across different runs are highly overlapping, which shows that our method is robust. We first conduct an offline experiment on a dataset sampled from 14 days of online logs. Our model relatively outperforms permutation feature importance and the previous online feature set by 0.209% and 0.304% in offline AUC, respectively. We then conducted an online A/B test for two weeks in July 2021. Compared with the previous state-of-the-art method used online, our method increased the cumulative revenue [5] by 1.058% and the social welfare [5] by 1.046% during the A/B test with 10% traffic.

4.8 Complexity Analysis
As complexity analysis is important for industrial recommender systems, we give a simple complexity analysis in this section. Suppose there are $N$ features in total; we need to initialize an $N$-dimensional learnable vector $\boldsymbol{x}$ in Equation (3), which consumes $4N$ bytes of storage. The intermediate variables include $g(\boldsymbol{x})$, as well as the gradient and the momentum of $\boldsymbol{x}$ during the optimization process, approximately occupying a total of $4 \times 4N$ bytes of memory, which is negligible compared with the storage of the whole network and the embeddings. Moreover, we found in many experiments that the increase in training time in each iteration is also negligible.
In our experiments, if the training of the deep CTR model with the feature superset is cold-started, it needs to be pre-trained for a while until almost converged, using $T_1$ time, and we then load this pre-trained checkpoint to perform feature superset pruning to obtain the best feature subset, using $T_2$ time. We find that $T_2$ is usually about $\frac{1}{4}T_1$ in all our experiments. Of course, if we do feature selection based on an online model checkpoint, this pre-training step can be omitted.

4.9 Experiment Details
In order to facilitate readers to reproduce and use our method, we describe our experiments in detail.
In all the experiments, $x$ in Functions (4) and (8) is initialized to 1.0, while ε is initialized to 0.1. It can be calculated that the initial values of Function (4) and Function (8) are 0.909 and $0.909 + 0.785\alpha\epsilon^{1/\tau}$, respectively. When we get a pre-trained model, in order to ensure that there is no sudden change in the performance of the model before and after inserting the gates, we divide all gate functions by their initial value, so that the initial value of the gate function is 1.0. Besides, if we only do feature selection for categorical features, rather than continuous features, the magnitude of the categorical feature embeddings will change quickly compared with that of the continuous representations, and this phenomenon is especially evident for LPFS++. So, we find that it is better to divide the categorical feature embeddings by an overall number, which can be the root mean square of the gate values. This number does not participate in gradient backpropagation, and can be updated every 100 steps, for example, then kept fixed after enough training steps. We found that this step resulted in a slight improvement in LPFS++ performance.
For Terabyte, we did not shuffle the dataset, and we train in chronological order to sense changes in user behavior and show the robustness of LPFS++. We use the Adagrad optimizer for the model parameters (not including the gate parameters), the batch size is 512, and the learning rate is 0.01. We decay ε by 0.9978 every 100 steps, and the minimum value is 1e−5. We use proximal-SGD with momentum to train $x$ in Functions (4) and (8); the initial learning rate is 0.01 for LPFS++ and 0.005 for LPFS, decayed by 0.9991 every 100 steps, but not below 5e−4. Different α and λ values are combined to obtain feature subsets of different sizes.
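A configuration sketch of the Terabyte setup just described, assuming a stand-in model and a momentum value (which the paper does not specify); the proximal step applied after optimizer_gate.step() is the one listed in the appendix (Figure 10).

import torch

model = torch.nn.Linear(27 * 16, 1)            # stand-in for the modified DLRM top-MLP
gate_x = torch.nn.Parameter(torch.ones(27))    # gate parameters, initialized to 1.0

opt_model = torch.optim.Adagrad(model.parameters(), lr=0.01)    # batch size 512 in the paper
opt_gate = torch.optim.SGD([gate_x], lr=0.01, momentum=0.9)     # lr 0.01 for LPFS++, 0.005 for LPFS; momentum value assumed

def step_schedules(step, eps):
    # called every training step; decays epsilon and the gate learning rate every 100 steps
    if step % 100 == 0:
        eps = max(eps * 0.9978, 1e-5)                            # epsilon floor 1e-5
        for group in opt_gate.param_groups:
            group["lr"] = max(group["lr"] * 0.9991, 5e-4)        # gate learning-rate floor 5e-4
    return eps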
For the industrial dataset, we also use the Adagrad optimizer for the model parameters (not including the gate parameters); the batch size is 1024, and the learning rate is initialized to 0.01, then rises exponentially to 0.1 very slowly and stays constant. We decay ε by 0.986 every 500 steps, with a minimal value of 1e−4.

5 CONCLUSION, LIMITATION AND FUTURE WORK
In this paper, we adopted the idea of smoothed-ℓ0 for feature selection and proposed a new smoothed-ℓ0-liked function to select features more effectively and robustly. Both LPFS and LPFS++ show superiority over other methods, and LPFS++ achieves state-of-the-art performance. LPFS and LPFS++ have played an important role as efficient feature selection plugin tools for recommendation scenarios in Kuaishou Technology.
The main limitation of function (8), from the perspective of dimensional analysis, is that it is not scale-invariant. The $x$ in the arctangent function means that $x$ must be dimensionless. For function (4), ε has the dimension of $x$ squared. If we want to construct a gate function that is just a dimensionless coefficient, then the derivative must have a dimension of the form $\sim \frac{1}{x}$ or $\sim \frac{x}{\epsilon}$. When $x = 0$ and ε is small enough, $\sim \frac{1}{x}$ tends to infinity and $\sim \frac{x}{\epsilon}$ tends to be strictly zero, so it is impossible to have a finite non-zero derivative value at $x = 0$. Thus we have to assume that both $x$ and ε are dimensionless, and the consequence that function (8) is not scale-invariant is inevitable. Obviously, the solution to this problem is not unique, and there are many smoothed-ℓ0-liked functions whose derivative is non-zero and decays as ε decays. Other than function (8), we have not yet tried other functions with similar properties.
In fact, LPFS++ can be easily extended to online feature selection. What we want is that some features can become active or inactive dynamically as user behavior changes, and then to predict whether a user will click on an item using the active features in real time. Our LPFS++ can meet this requirement well. Specifically, we can maintain a superset of features to train and select features simultaneously, so that all the model parameters and gate parameters are trained together online. Then, the features in the active state and their corresponding sub-networks can become effective on a specialized inference platform in real time. Online feature selection enables the model to capture the temporal changes in user behaviors and select the optimal feature subset dynamically and adaptively in real time. This involves algorithmic, training, and inference engineering improvements, which are left as future work.

A APPENDIX
A.1 Feature mask for Terabyte dataset
We list the feature subsets selected by LPFS++ for reference. In Table 2, the "#" column is the number of remaining features, and the "mask" column indicates the feature subset selection. In the mask, "0" means the corresponding feature should be removed, while "1" means it is kept. The length of every mask is 27, where the first value represents the 13 continuous features, and the following 26 values represent the 26 categorical feature fields. The order of the feature fields is the same as in the official code of DLRM [26].

#   mask                          #   mask
26  111110111111111111111111111   17  111100010111110101010101011
25  111100111111111111111111111   16  111100010111110001010101011
24  111100111111111111011111111   15  111100010011110101000101011
23  111100011111111111011111111   14  111100010011110001000101011
22  111100011111111101011111111   13  111100010011110001000001011
21  111100110111110101011111111   12  111100010011110000000001011
20  111100010111110101011111111   11  111100010101110000000001001
19  111100010111110101011101111   10  101100010100110000010101000
18  111100010111110101011101011
Table 2: Feature mask for the Terabyte dataset.

A.2 Core code

import torch

def lpfs_pp(x, epsilon, alpha=100, tao=2, init_val=1.0):
    '''
    The gate function for LPFS++ (Equation 8 in the paper).
    '''
    g1 = x * x / (x * x + epsilon)
    g2 = alpha * epsilon**(1.0 / tao) * torch.atan(x)
    g = torch.where(x > 0, g2 + g1, g2 - g1) / init_val
    return g

def train(model, optimizer_model, optimizer_gate, lam, train_dataloader, criterion):
    '''
    model: the model we want to train and select features for (the lpfs_pp gate is applied to the input features)
    optimizer_model: the optimizer for all the original model parameters, not including gate parameters
    optimizer_gate: the optimizer for the gate parameters
    lam: the coefficient for the L1 regularization of x (Equation 10 in the paper)
    train_dataloader: dataloader for the training set
    criterion: original training loss function
    '''
    p = optimizer_gate.param_groups[0]["params"][0]   # get gate parameters, i.e. x in the paper
    lr = optimizer_gate.param_groups[0]["lr"]         # learning rate eta for the gate parameters
    for data, label in train_dataloader:
        optimizer_gate.zero_grad()        # clear grad
        optimizer_model.zero_grad()       # clear grad
        output = model(data)              # forward the model
        loss = criterion(output, label)   # compute the loss
        loss.backward()                   # back-propagation
        optimizer_model.step()
        optimizer_gate.step()
        # the following code is the proximal-L1 step (Equation 11 in the paper)
        thr = lam * lr
        in1 = p.data > thr
        in2 = p.data < -thr
        in3 = ~(in1 | in2)
        p.data[in1] -= thr
        p.data[in2] += thr
        p.data[in3] = 0.0

Figure 10: Our LPFS++ can be implemented within several lines of code using PyTorch. To demonstrate the reproducibility, we list the core code of LPFS++ and the proximal-SGD update here.
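As a usage note (a sketch, not part of the released code): once training has finished, the masks of Table 2 can be read directly from the gate parameters, since a field is removed exactly when its gate is zero, which for LPFS/LPFS++ happens exactly when the corresponding parameter is zero.

import torch

def selected_mask(gate_x):
    # gate_x: trained gate parameters; 1 = keep the field, 0 = remove it
    return (gate_x.detach() != 0).int()

# e.g. formatting the mask as the 0/1 strings of Table 2:
# "".join(str(v) for v in selected_mask(gate_x).tolist())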
REFERENCES
[1] Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. 2020. Learning de-biased representations with biased representations. In International Conference on Machine Learning. PMLR, 528–539.
[2] Olivier Chapelle, Eren Manavoglu, and Romer Rosales. 2014. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST) 5, 4 (2014), 1–34.
[3] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems. 191–198.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 248–255.
[5] Chao Du, Zhifeng Gao, Shuo Yuan, Lining Gao, Ziyan Li, Yifan Zeng, Xiaoqiang Zhu, Jian Xu, Kun Gai, and Kuang-Chih Lee. 2021. Exploration in Online Advertising Systems with Deep Uncertainty-Aware Learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2792–2801.
[6] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research 12, 7 (2011).
[7] Armin Eftekhari, Massoud Babaie-Zadeh, Christian Jutten, and Hamid Abrishami Moghaddam. 2009. Robust-SL0 for stable sparse representation in noisy settings. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 3433–3436.
[8] Hongliang Fei, Jingyuan Zhang, Xingxuan Zhou, Junhao Zhao, Xinyang Qi, and Ping Li. 2021. GemNN: gating-enhanced multi-task neural networks with feature interaction learning for CTR prediction. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2166–2171.
[9] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2010. A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736 (2010).
[10] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
[11] Yi Guo, Huan Yuan, Jianchao Tan, Zhangyang Wang, Sen Yang, and Ji Liu. 2021. GDP: Stabilized neural network pruning via gates with differentiable polarization. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5239–5250.
[12] Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7132–7141.
[13] Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016).
[14] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[15] Yifeng Li, Chih-Yu Chen, and Wyeth W Wasserman. 2016. Deep feature selection: theory and application to identify enhancers and promoters. Journal of Computational Biology 23, 5 (2016), 322–336.
[16] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1754–1763.
[17] Bin Liu, Chenxu Zhu, Guilin Li, Weinan Zhang, Jincai Lai, Ruiming Tang, Xiuqiang He, Zhenguo Li, and Yong Yu. 2020. AutoFIS: Automatic feature interaction selection in factorization models for click-through rate prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2636–2645.
[18] Qiang Liu, Zhaocheng Liu, Haoli Zhang, Yuntian Chen, and Jun Zhu. 2021. Mining Cross Features for Financial Credit Risk Assessment. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 1069–1078.
[19] Zhaocheng Liu, Qiang Liu, Haoli Zhang, and Yuntian Chen. 2020. DNN2LR: Interpretation-inspired Feature Crossing for Real-world Tabular Data. arXiv preprint arXiv:2008.09775 (2020).
[20] Xu Ma, Pengjie Wang, Hui Zhao, Shaoguo Liu, Chuhan Zhao, Wei Lin, Kuang-Chih Lee, Jian Xu, and Bo Zheng. 2021. Towards a Better Tradeoff between Effectiveness and Efficiency in Pre-Ranking: A Learnable Feature Selection based Approach. arXiv preprint arXiv:2105.07706 (2021).
[21] Lukas Meier, Sara Van De Geer, and Peter Bühlmann. 2008. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70, 1 (2008), 53–71.
[22] G Hosein Mohimani, Massoud Babaie-Zadeh, and Christian Jutten. 2007. Fast sparse representation based on smoothed ℓ0 norm. In International Conference on Independent Component Analysis and Signal Separation. Springer, 389–396.
[23] H Mohimani, M Babaie-Zadeh, M Gorodnitsky, and C Jutten. 2010. Sparse recovery using smoothed L0 norm (SL0): convergence analysis. arXiv preprint cs.IT/1001.5073 (2010).
[24] Hosein Mohimani, Massoud Babaie-Zadeh, and Christian Jutten. 2008. A fast approach for overcomplete sparse decomposition based on smoothed ℓ0 norm. IEEE Transactions on Signal Processing 57, 1 (2008), 289–301.
[25] Christoph Molnar. 2020. Interpretable machine learning. Lulu.com.
[26] Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G Azzolini, et al. 2019. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091 (2019).
[27] Atsushi Nitanda. 2014. Stochastic proximal gradient descent with acceleration techniques. Advances in Neural Information Processing Systems 27 (2014).
[28] Matthew Richardson, Ewa Dominowska, and Robert Ragno. 2007. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th international conference on World Wide Web. 521–530.
[29] Jiayi Shen, Haotao Wang, Shupeng Gui, Jianchao Tan, Zhangyang Wang, and Ji Liu. 2020. UMEC: Unified model and embedding compression for efficient recommendation systems. In International Conference on Learning Representations.
[30] Zheyan Shen, Peng Cui, Tong Zhang, and Kun Kunag. 2020. Stable learning via sample reweighting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 5692–5699.
[31] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1161–1170.
[32] Linyu Wang, Junyan Wang, Jianhong Xiang, and Huihui Yue. 2019. A re-weighted smoothed-norm regularized sparse reconstructed algorithm for linear inverse problems. Journal of Physics Communications 3, 7 (2019), 075004.
[33] Zhe Wang, Liqin Zhao, Biye Jiang, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai. 2020. COLD: Towards the next generation of pre-ranking system. arXiv preprint arXiv:2007.16122 (2020).
[34] Jianhong Xiang, Huihui Yue, Xiangjun Yin, and Linyu Wang. 2019. A New Smoothed L0 Regularization Approach for Sparse Signal Recovery. Mathematical Problems in Engineering 2019 (2019).
[35] Feng Yu, Zhaocheng Liu, Qiang Liu, Haoli Zhang, Shu Wu, and Liang Wang. 2020. Deep Interaction Machine: A Simple but Effective Model for High-order Feature Interactions. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2285–2288.
[36] Luo Yuanfei, Wang Mengshuo, Zhou Hao, Yao Quanming, Tu Weiwei, Chen Yuqiang, Yang Qiang, and Dai Wenyuan. 2019. AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications. ACM (2019).
[37] Ruizhen Zhao, Wanjuan Lin, Hao Li, and S Hu. 2012. Reconstruction algorithm for compressive sensing based on smoothed L0 norm and revised newton method. Journal of Computer-Aided Design and Computer Graphics 24, 4 (2012), 478–484.
[38] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 5941–5948.
[39] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1059–1068.
