Uploaded by

lu284918171

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

34 views

Entire Space Multi-Task Model

Uploaded by

lu284918171

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

ESMC: Entire Space Multi-Task Model for

Post-Click Conversion Rate via Parameter

Constraint
Zhenhao Jiang§‡∗ , Biao Zeng†∗ , Hao Feng† , Jin Liu† B , Jicong Fan§‡ B ,
Jie Zhang† , Jia Jia† , Ning Hu† , Xingyu Chen¶ , Xuguang Lan¶
§ School
of Data Science, The Chinese University of Hongkong, Shenzhen, China
† Alibaba Group, Shanghai&Hangzhou, China
‡ Shenzhen Research Institute of Big Data, Shenzhen, China
¶ Xi’an Jiaotong University, Xi’an, China
arXiv:2307.09193v2 [cs.AI] 29 Jul 2023

[email protected], {biaozeng.zb, zhisu.fh, nanjia.lj}@alibaba-inc.com, [email protected]

Abstract—Large-scale online recommender system spreads all role to handle the task timely and accurately with the help of
over the Internet being in charge of two basic tasks: Click- deep learning algorithms [33, 44]. Recommendation service
Through Rate (CTR) and Post-Click Conversion Rate (CVR) first recalls candidates from item pool and then feeds them
estimations. However, traditional CVR estimators suffer from
well-known Sample Selection Bias and Data Sparsity issues. into a recommender algorithm to predict several metrics such
Entire space models were proposed to address the two issues via as Click-Through Rate (CTR) and Post-Click Conversion Rate
tracing the decision-making path of “exposure click purchase”. (CVR) [18, 19, 29]. Next, items are ranked according to
Further, some researchers observed that there are purchase- CTR, CVR or other metrics and exposed on the terminal
related behaviors between click and purchase, which can better device of user. A user may click an item to enter the in-
draw the user’s decision-making intention and improve the
recommendation performance. Thus, the decision-making path shop page to add it into cart/wish-list and purchase, which
has been extended to “exposure click in-shop action purchase” can be described as a decision-making graph of “expo-
and can be modeled with conditional probability approach. sure click in-shop action purchase” [16, 30]. This feedback
Nevertheless, we observe that the chain rule of conditional will be recorded and used for updating recommender algorithm
probability does not always hold. We report Probability Space to ensure that system can capture the evolution of interest
Confusion (PSC) issue and give a derivation of difference between
ground-truth and estimation mathematically. We propose a novel and recent preference of the user. To provide users with
Entire Space Multi-Task Model for Post-Click Conversion Rate more accurate recommendation service, a high-quality CVR
via Parameter Constraint (ESMC) and two alternatives: Entire estimator is crucial in practice [45].
Space Multi-Task Model with Siamese Network (ESMS) and There are two basic issues in the CVR estimation task:
Entire Space Multi-Task Model in Global Domain (ESMG) to Sample Selection Bias (SSB) and Data Sparsity (DS) [26]
address the PSC issue. Specifically, we handle “exposure click in-
shop action” and “in-shop action purchase” separately in the shown in Fig. 1a. SSB refers to a gap between training sample
light of characteristics of in-shop action. The first path is still space and online inference sample space. Traditional CVR
treated with conditional probability while the second one is estimators are trained on clicked samples while implemented
treated with parameter constraint strategy. Experiments on both on exposed samples based on the schema of online recommen-
offline and online environments in a large-scale recommendation dation service. DS refers to the issue that the size of clicked
system illustrate the superiority of our proposed methods over
state-of-the-art models. The code and real-world datasets will be samples is too small to train a model that can fit conversion
released for further research. well. Consequently, the performance of recommender algo-
Index Terms—Recommender System, Entire Space Multi- rithm is dissatisfactory in online service [34]. SSB and DS
Task Learning, Conversion Rate Prediction, Probability Space are fundamental issues that we must overcome in industrial
Confusion recommender systems. Many researchers have proposed entire
space models to address SSB and DS [12, 23, 38]. Entire
I. I NTRODUCTION
Space Multi-Task Model (ESMM) is one of the representatives
Selecting best-suited products from floods of candidates to of entire space models that will be presented in Section
deliver them to users based on their appropriate preferences 3. Following ESMM, Entire Space Multi-Task Model via
has become a significant task on most of online platforms Behavior Decomposition (ESMM2 ) is proposed to introduce
such as online food booking, short video, e-commerce, etc in-shop behaviors to estimate CVR with the almost same
[22, 40, 43, 46]. Recommender system plays an important ideology of ESMM [39].
Several studies have claimed that user may purchase items
This paper was done when Zhenhao Jiang as an intern in Alibaba Group.
∗ Both authors contributed equally to this paper. from shopping cart or wish list and we observe this phe-
B Corresponding authors. nomenon in real business as well [27, 28, 42]. The action
(a) Illustration of SSB and DS issues in CVR estimation that model is (b) Demonstration of Probability Space Confusion issue. Cart and
trained over clicked samples while is used to infer on exposed samples. Purchase may not be in the same visit.
The size of samples diminishes from exposure to purchase.

(c) Demonstration of our approach for addressing the PSC issue. The orange line means sample space calibration and the blue line means
information injection.

Fig. 1: Illustration of sample space. (a) SSB and DS issues. (b) Key problem in this paper: PSC issue. (c) Key idea in this
paper to handle PSC issue.

of adding to cart/wish-list (in-shop action1 ) bridges click and also present the derivation of the gap between estimation and
purchase that is more conversion-related than click. Therefore, ground-truth mathematically under PSC issue and propose En-
extracting the functionality of in-shop action in decision- tire Space Multi-Task Model via Parameter Constraint (ESMC)
making path is meaningful. In ESMM2 , algorithm explicitly that mainly consists of three modules: 1) shared embedding
models the sequential behavior of “exposure click in-shop ac- (SE), 2) constrained twin towers (CTT), and 3) sequential
tion purchase” via conditional probability to leverage samples composition module (SCM) and one strategy: Sample Cal-
over the entire space to address SSB and DS issues more ibration to address PSC problem. Before training, Sample
efficiently. However, the probability-based approach doesn’t Calibration unifies the sample space. In the model, SE maps
always work. In-shop action may come from other sample feature vectors into low-dimensionally dense vectors at first.
space not contained in the current exposure space different Then, CCT fits Click-Through Conversion Rate (CTCVR) and
from click or purchase. Click-Through Cart Adding Rate (CTCAR) under a given
In this paper, we report Probability Space Confusion constraint. Finally, SCM combines CTR, CTCAR and CTCVR
(PSC) problem of ESMM2 -like models2 shown in Fig. 1b together to perform a multi-task estimation. Going further, we
which will be presented in detail in Section 4. Because present two alternatives (i.e. ESMS and ESMG) and discuss
ESMM2 -like model is widely used in industrial recom- their advantages and disadvantages to help practitioners choose
menders, it is critical and meaningful to improve it. We the most suitable solution for their own business.
The main contributions of this work are as follows:
1 In this paper, we focus on adding to cart.
2 In
• This is the first work that reports the PSC issue in
this paper, ESMM2 -like model is defined as the model considering both
decision-making graph of “exposure click in-shop action purchase” and the CVR estimation with in-shop behaviors. We demon-
probabilistic dependence among different behaviors. strate the problem from the perspective of sample space
TABLE I: Summarization of important abbreviations. TABLE II: Summarization of notations.
Abbreviation Description Notation Description
Cart the behavior of adding to cart u user
CTR Click-Through Rate V items exposed to user
CVR Post-Click Conversion Rate v item
CAR Cart Adding Rate C/cu,v clicked items by user/entry of C
CTCAR Click-Through Cart Adding Rate O/ou,v purchased items by user/entry of O
CTCVR Click-Through Conversion Rate A/au,v items added in cart by user/entry of A
PSC Probability Space Confusion X exposure space
SSB Sample Selection Bias C click space
DS Data Sparsity O conversion space
A Cart space
X/x exposure event/value of X
C/c click event/value of C
and emphasize the importance of distinguishing between A/a click & Cart event/value of A
click/purchase and in-shop actions. We also highlight the R/r conversion event/value of R
mathematical theory behind the PSC issue. letters with hat (e.g. r̂) the corresponding estimators given by algorithm
2
• We propose ESMC, the first work that enhances ESMM
with a novel parameter constraint approach. ESMC avoids
the PSC issue and improves the performance of ESMM2 . attention mechanism [3]. PLE is used to solve the problem of
Extensive experimental results verify our claims. negative transfer in multi-task learning. It can be considered
• We also propose two alternatives of ESMC (i.e. ESMS
as a stacked structure of basic modules of MMoE, introducing
and ESMG) and discuss their characteristics to help specific expert networks and common expert networks to
others identify the most suitable strategy to address the decouple different tasks [31].
PSC issue in their own business.
• To support future research, we construct real-world B. Conversion Rate Prediction
datasets collected from a large-scale online food platform, ESMM proposes a decision-making path of “expo-
which we will release publicly. sure click purchase” and draws CVR based on the chain rule
The important abbreviations in this paper are summarized of conditional probability [26]. ESMM2 extends the decision-
in Table I. making path to “exposure click in-shop action purchase”
II. R ELATED W ORKS with the similar idea of ESMM [39]. In [41], researchers find
that CVR estimation in basic ESMM is biased and address this
A. Multi-Task Learning problem with causal approach (ESCM). ESCM2 gives out a
Since it is necessary to estimate multiple tasks (i.e. CTR and more solid proof for the bias issue in ESMM and employs
CVR) simultaneously in recommendation system, it is critical a similar solution in ESCM [34]. Here, we first report a
to design a multi-task learning model. Deep recurrent neural novel PSC issue in ESMM2 and provide three solutions to
network is employed to encode the text sequence into a latent address it. There are also many studies that predict CVR from
vector, specifically gated recurrent units trained end-to-end on other perspectives. In [19], researchers model CVR at different
the collaborative filtering task [2]. MMoE consists of multiple hierarchical levels with separate binomial distributions and
expert networks and gate networks to learn the correlations and estimate the distribution parameters individually. ACN uses
differences among different tasks to fit multiple downstream Transformer to implement feature cross-over and employs
tasks [25]. The two basic tasks in recommendation (i.e. rank capsule networks with a modified dynamic routing algorithm
and rate) are traced simultaneously with a multi-task frame- integrating with an attention mechanism to capture multiple
work in [17]. NMTR considers the underlying relationship interests from user behavior sequence [20]. GCI counterfactu-
among different types of behaviors and performs a joint ally predicting the probability of each specific group of each
optimization with a multi-task learning strategy, where the unit belongs to for post-click conversion estimation [15]. Auto-
optimization on a behavior is treated as a task [11]. In [24], a HERI leverages the interplay across multi-tasks’ representation
multi-task recommendation model with matrix factorization is learning. It is designed to learn optimal connections between
proposed which jointly learns to give rating estimation and layer-wise representations of different tasks and can be easily
recommendation explanation. MTRec is designed based on extended to new scenarios with one-shot search algorithm [37].
heterogeneous information network equipped with a Bayesian Unlike the studies mentioned above, we focus on the
task weight learner that is able to balance two tasks during distinctiveness of Cart from a probability perspective. Our
optimization automatically and provide a good interpretability emphasis is on explaining the mathematical theory behind it
[21]. SoNeuMF is an extension of neural matrix factorization and proposing simple yet effective solutions.
and is able to simultaneously model the social domain and
item domain interactions via sharing user representation in III. P RELIMINARY
two tasks [9]. AMT-IRE is a multi-task framework which can
adaptively extract the inner relations between group members Since this paper aims to improve ESMM2 , we first introduce
and obtain consensus group preferences with the help of ESMM and ESMM2 in this Section.
(a) (b) (c)

Fig. 2: Illustration of three types of user decision graph from exposure to purchase. (a) Three real decision-making graphs
online. (b) Decision-making graph in ESMM. (c) Decision-making graph in ESMM2 .

A. Problem Formulation Thus, ESMM addresses SSB problem via training on the
Here, we state the Post-Click Conversion Rate estimation exposure space. Additionally, since the size of clicked samples
problem on the entire space with Cart. Let u denote a user is much larger than that of conversion samples, modeling
browsing item feeds and item set V = {v1 , v2 , . . . , vm } CVR on the exposure space allows for better utilization of
represent the items on exposure space X for u. Define C the available data to tackle DS problem.
as the click set that indicates which item in V is clicked C. Entire Space Multi-Task Model via Behavior Decomposi-
by u where each entry cu,v ∈ {0, 1}, O as the conversion tion
(purchase) set that indicates which item in V is conversed
This is an extension work of ESMM, known as ESMM2
finally where each entry ou,v ∈ {0, 1}. C and O indicates the
[39]. The basic ideology of them are significantly similar.
click space and conversion space, respectively. Especially, let
Compared with ESMM, the main improvement is that ESMM2
A denotes the collection of items added in cart where each
involves intermediate behaviors between click and purchase,
entry au,v ∈ {0, 1} and A is the Cart space. The notations
such as Cart and “adding to wish-list” and is more in line
used in this paper are summarized in Table II.
with the real decision-making process in online service for
In practice, online recommender server has to estimate CTR
user. Actually, different actions cannot be triggered at the
and CVR on the exposure space X. Consequently, we have
same time. To simplify the problem, ESMM2 considers that
to train a model in this manner to keep the online-offline
all in-shop actions can be triggered in parallel. For the sake
consistency (avoiding sample selection bias). Further, if O is
of description, we focus on Cart.
fully observed, the ideal loss function is formulated as:
Similar to ESMM, ESMM2 employs chain rule to model
L := Eu,v [δ(ou,v , ôu,v )], (1) CVR with behaviors:

where E means expectation of events, ôu,v is estimated result, P(CT CV R) = P(CT R) × P(CAR) × P(CV R), (5)
and δ is an error function such as the cross entropy loss: where CAR is Cart Adding Rate on C and CVR is Conversion
δ(ou,v , ôu,v ) := −ou,v log ôu,v − (1 − ou,v ) log(1 − ôu,v ). (2) Rate on A. Further, the probability of CTCVR of item v can
be defined as the conditional probability mathematically in
B. Entire Space Multi-Task Model accordance with “exposure click Cart purchase” process:
On online shopping platform, an item might experience P(ou,v = 1|cu,v = 1) =
“exposure click purchase” to converse. In the light of this (6)
P(ou,v = 1|cu,v = 1, au,v = 1)P(au,v = 1|cu,v = 1).
process, ESMM proposes a CVR estimation approach via
chain rule [26]: Undoubtedly, an item cannot be added to the cart without
being clicked, and it cannot be purchased without being added
P(CT CV R) = P(CT R) × P(CV R), (3) to the cart. Therefore, CTCVR can be modeled in a similar
way to other in-shop behaviors. The decision-making path is
CTCVR estimation is given out with the product of CTR
illustrated in Figure 2.
and CVR predicted by two full-connected towers. During the
However, does the chain rule always hold?
training process, ESMM minimizes the empirical risk of CTR
and CTCVR estimation over X: IV. D ISCUSSION ON ESMM 2
LCT R = Eu,v [δ(cu,v , ĉu,v )] In this section, we first explain the PSC issue. We then
(4)
LCT CV R = Eu,v [δ(cu,v × ou,v , ĉu,v × ôu,v )]. provide a mathematical derivation for quantifying the gap
between the ground-truth and estimated values. Finally, we In the anticipation of ESMM2 , in-shop action always sat-
discuss the implications of the gap. isfies the chain rule of probability. Thus the expectation of
estimator R̂ given by ESMM2 in the Bad Case is:
A. Probability Space Confusion Issue " # " #
⃝
3 R̂ Â
ESMM2 introduces in-shop actions to draw the fine-grained EX2 [R̂] = EX2 · EX2
decision-making process. It considers CVR on the exposure Â Ĉ
Z Z
space and C, A, and O are sub-space defined on X. When r̂ â
= P(r̂, â)d(r̂, â) P(â, ĉ)d(â, ĉ)
a user opens the online recommendation feeds, several items X2 â X2 ĉ (8)

are exposed where the user can see. Then, the user may click 1 1
one of the items to enter the detail page (in-shop page), add = EX2 · EX2 [R̂] · EX2 · EX2 [Â]
Ĉ Â
products to cart, and make the final payment. ESMM2 assumes
1

1
the actions in the path of “exposure click Cart purchase” = EX2 [R̂] · EX2 · EX2 [Â] · EX2 .
Â Ĉ
occur within the same visit or in the same sample space.
However, this assumption does not always hold true on real There are some explanations of the above derivation:
online platforms, as shown in Figure 3. The user may exit ⃝1 means that the user clicked on the item to enter the detail
the detail page without making an immediate purchase after page and added products to cart in the previous visit. In current
adding products to cart. Most online shopping platform records visit the user purchase products in cart. Because the shopping
users’ Cart information so that users can quickly find the cart generally exists independently on the online platform (i.e.
items they prefer. Therefore, the user may log on the online not in the recommendation feeds), the click behavior of the
platform again after a period of time and enter the shopping current visit is not considered and the behavior in the current
cart to buy. As a result, the paths of “exposure click Cart” visit is expressed in terms of conditional probability here.
and “Cart purchase” are in different visits or in different ⃝2 holds under the assumption A ⊥ C.
sample spaces. This raises a problem. Based on the assumption
⃝3 holds because ESMM2 considers that the entire path is
of ESMM2 , the entire space or exposure space is actually
in the same visit and satisfy the chain rule. It doesn’t take into
defined on the sample space of user’s current visit. Because
account the specificity of Cart.
the user’s information will be updated according to their
Consider the gap between group-truth and estimation in the
behavior in the next visit, the recommender could predict
Bad Case:
the recommendation lists only based on the current status of
users and items. Therefore, the exposure space for each visit is Gap = EX2 [R] − EX2 [R̂]
actually independent for a user. The calculation of probabilities
1
defined on different sample spaces leads to the PSC issue. = EX2 [R|A] · EX1 [A] · EX1
C (9)
Remark: To simplify the problem, we use Session 3 [36] to
determine whether actions occur within the same visit. 1 1
− EX2 [R̂] · EX2 · EX2 [Â] · EX2 .
Â Ĉ
B. Mathematical Derivation on PSC Issue h i
Here, we define EX2 [R|A] and EX2 [R̂] · EX2 Â1 as Left
Here we provide a mathematical derivation to evaluate the h i
Terms, and EX1 [A] · EX1 C1 and EX2 [Â] · EX2 Ĉ1 as Right

gap between the ground-truth and estimation of ESMM2 .
First, we give out the right expectation of R under the case Terms.
that the item has been already added to cart before. Based If this gap can be eliminated, then there must be an
on the discussion of the PSC issue, the entire path are not upper bound, so we derive a loose upper bound to prove the
in the same sample space in this case, defined as Bad Case. solvability of this problem.
Additionally, X1 is for the former exposure space while X2 Z Z
is for the current one. a
Gap = rP(r|a)dr P(a, c)d(a, c)
X1 c
⃝1

A Z X2 Z
EX2 [R] = EX2 [R|A] · EX1 r a
C − P(r, a)d(r, a) P(a, c)d(a, c) (10)
X2 a X2 c
⃝
Z
2 a
= EX2 [R|A] · P(a, c)d(a, c) ⃝
4 Z Z
a
c ≤ rdr d(a, c).
Z X1 Z (7) X2 X1 c
1
= EX2 [R|A] · aP(a)da · P(c)dc
X1 c ⃝4 holds because the range of arbitrary probability is [0, 1].
X1
1 It is evident that the integral domain is a finite interval
= EX2 [R|A] · EX1 [A] · EX1 .
C and the integrand is bounded on the integral domain. There
is always a finite upper bound to the gap, and therefore the
3 For a web address, one session is equivalent to one visit. problem is solvable.
Fig. 3: Illustration of Eleme APP, an Alibaba Group’s takeaway platform that serves hundreds of millions of users. The black
arrow represents the user’s behavior. Page 1 (shown at the top left corner) is the recommendation page, page 2 is the detail
page (in-shop page), and page 3 is the shopping cart page. A user may click on an item in the recommendation page to enter
the in-shop page and add a product to shopping cart. And then, the user exits the platform. After a while, the user logs onto
the platform again and goes directly to the shopping cart page to purchase the product. Therefore, the decision-making path
of this user does not occur within the same visit.

C. Discussion on The Difference sample spaces which leads to the PSC issue. This implies that
In (9), compared with the ground-truth, we can find that the in-shop actions does not necessarily satisfy the chain rule,
there are two differences in terms of formula form. so the strategy of conditional probability cannot be directly
employed to manipulate events defined on different sample
• Left Terms. The Cart information and the purchase infor-
spaces.
mation are decoupled in the current space for estimation.
However, ESMM2 does not take into account that Cart V. P ROPOSED M ETHOD
does not necessarily occur in the current exposure space,
In this Section, we propose three approaches to address
which results in a lack of Cart information related to the
the PSC issue and improve the performance of Post-Click
conversion in model estimation in the case we discussed
Conversion Rate estimation.4
above. Thus we have to inject Cart information into the
purchase space, as shown by the blue line in Fig 1c. A. Entire Space Multi-Task Model via Parameter Constraint
• Right Terms. The probability space in the estimation
Shared Embedding Layer First, we build a shared em-
one is incorrect. ESMM2 takes Cart into consideration
bedding layer to transfer all the sparse ID features and
over X2 , although it took place in the previous exposure
discretized numerical features into dense vectors. The features
space X1 , as discussed above. Thus we have to calibrate
mainly consist of user features (e.g. gender, age, consumption
the sample space, as shown by the orange line in Fig 1c.
frequency), item features (e.g. brand, category, geographic
The Bad Case for ESMM2 is discussed. What happens if location) and user-item cross features (e.g. the number of
the Bad Case doesn’t happen for the ground-truth (i.e., Good orders in a shop, age-brand). The entire model uses the same
Case). embedding, which can be expressed as follows.

⃝
5 R A
EX2 [R] = EX2 · EX2 f̄i = Wfi , (12)
A C
(11) where fi is the i-th one-hot feature and W denotes the
1 1
= EX2 [R] · EX2 · EX2 [A] · EX2 . embedding matrix.
A C
Constrained Twin Towers This structure focuses on the
⃝5 holds because the user’s decision-making path of “ex- decision-making path of “Cart purchase”. Since the chain rule
posure click Cart purchase” is in the same visit in the Good of conditional probability cannot describe this path well, we
Case which satisfies the assumption of ESMM2 . employ a pair of towers to learn the mapping automatically.
Thus, there is no gap in the Good Case. Specifically, there are one CTCVR tower and one CTCAR
In summary, there is a significant gap between Cart and
other actions (e.g. click, purchase). Cart may be related to two 4 The code will be released after publication.
Fig. 4: Illustration of ESMC. Loss1 is the CTR loss, loss2 is the CTCAR loss, loss3 is the CTCVR loss, and loss4 is the loss
of parameter constraint.

(Click-Through Cart Adding Rate) tower. To address the gap estimation. Once we obtain the output of the towers, SCM
in the Left Terms discussed in Section 4, we use the parameter composes the probability based on the following equation.
space of CTCAR tower to control that of CTCVR tower. There
are three reasons for this. P(CT CAR) = P(CT R) × P(CAR),
(14)
P(CT CV R) = P(CT R) × P(CV R).
• In this way, the information about Cart can be injected
into conversion which can couple two information to- SCM is a nonparameter structure that expresses conditional
gether to fill the gap in the Left Terms, and the proper probability like ESMM.
function can be automatically fitted out by the neural Sample Calibration To calibrate the probability space, we
network. manipulate sample directly. Since some Cart actions have
• The purchase space is covered by the Cart space, which occurred during the user’s previous visit, there may be no Cart
naturally has a subordinate relationship. action prior to the current purchase. We calibrate these samples
• According to our observation, Cart is strongly related to correlate them with current purchase behavior according to
to purchase, that is, most of the items in the cart will Session. In this way, the probability space of training samples
eventually be bought. is explicitly unified that can be represented as the one-to-one
Here, we employ KL-divergence to evaluate the distance mapping function Q:
between two parameter spaces.
Q : DX1 → DX2 , (15)

DKL (P(X)∥Q(X)) = EX∼P(X) log
P(X)
, (13) where DX1 is Cart sample set in the former visit and DX2 is
Q(X) Cart sample set in the current visit.
After calibration, the ground-truth in the Bad Case becomes:
where P(X) and Q(X) express two probability distributions.
Sequential Composition Module Besides CTCVR tower 1
EX2 [R] = EX2 [R|A] · EX2 [A] · EX2 . (16)
and CTCAR tower, there is also a CTR tower for CTR C
TABLE III: The basic statistics of datasets. & denotes the number. M refers to million and K refers to thousand.
Dataset &Users &Items &Clicks &Purchases Total Size Sparsity of Click Sparsity of Purchase
City 1 6M 110K 61M 10M 1,004M 0.06096 0.01093
City 2 3M 56K 26M 4M 427M 0.06096 0.01108
City 3 4M 85K 30M 5M 507M 0.06076 0.01067
City 4 3M 68K 17M 2M 281M 0.06069 0.00908
City 5 1M 30K 13M 2M 216M 0.06083 0.01044
City 6 1M 36K 11M 1M 184M 0.06103 0.01041

Discussion on ESMC Finally, the expectation of estimator later by the recommendation system. Therefore, in ESMG,
R̂ for ESMC is: we consider the global domain Cart sample. The structure

1 of model keeps unchanged in ESMG. ESMC2 and ESMS2
EX2 [R̂] = fX2 (R̂, Â) · EX2 [Â] · EX2 , (17) are trained with Cart sample in global domain that are taken
Ĉ
into account in later comparative experiments. The training
where fX2 is an unknown function of R̂ and Â defined on X2 . objective is formulated as follows.
It can be considered as a neural network mapping. Conditional L(Θ) = ω1 × LCT R + ω2 × LCT CV R + ω3 × LCT CAR|rec
expectation function EX2 [R|A] can be automatically fitted
+ω4 × LCT CAR|global + ω5 × DKL (θCT CAR ∥θCT CV R ),
with neural networks.
(20)
The formula is consistent with the ground-truth one and
ESMC can perform parameter estimation better in this manner. where LCT CAR|global and LCT CAR|rec means loss functions
For the Good Case in Section 4, there is no gap either, which on Cart over global domain and recommendation domain.
means that ESMC can work well. Hence, the final training Especially, ω5 = 0 for ESMS2 .
objective to be minimized to obtain parameter set Θ is as D. Difference among Three Approaches
follows.
ESMC v.s. ESMS The difference lies in the way they
L(Θ) = ω1 × LCT R + ω2 × LCT CV R + ω3 × LCT CAR handle the path of “Cart purchase”. In terms of model struc-
+ω4 × DKL (θCT CAR ∥θCT CV R ), ture, ESMS is a special case of ESMC. In terms of model
(18) performance, ESMS is suitable for scenarios where the Cart
space and the purchase space are strongly correlated while
where ω1 , ω2 , ω3 , ω4 are weights of corresponding items,
ESMC is more suitable for scenarios where the correlation
θCT CAR and θCT CV R are the parameters of CTCAR and
between the Cart space and the purchase space is not as strong.
CTCVR towers, respectively. Besides, three loss functions L
In the training stage, it takes a lot of time to adjust the
are cross entropy loss shown in (2) with the proper samples of
constraint coefficient. Besides, due to the dependence between
click, Cart and purchase. Fig. 4 shows the structure of ESMC.
the twin towers, it is difficult to train the two towers in parallel,
B. Entire Space Multi-Task Model with Siamese Network which increases the overhead. In the inference phase, as the
twin networks of ESMS share parameters, only parameters
In our practice, we have observed that the conversion
in one tower needs to be stored, which significantly reduces
rate under the Cart space is very high (more than 80%). If
the number of parameters and memory occupied by the model.
we adjust the parameter constraints of the twin towers to
This makes deployment of the model to online platforms more
infinity, it is approximately equivalent to a Siamese Network
efficient.
[4] with shared parameters. However, we also find that the
ESMC&ESMS v.s. ESMG (ESMC2 &ESMS2 ) The dif-
change of model performance is not stable along with the
ference is whether the Cart sample comes from the global
increasing of constraint coefficient, which may be due to
domain or the recommendation domain. Only using sample
the constraint conditions affecting the search space of the
from the recommendation domain absolutely conforms to the
main task. Therefore, we detach the parameter constraint and
basic assumptions of proposed methodology in Section 4.
directly use the absolute Siamese Network to model CTCVR
However, the use of global Cart samples may relax the basic
that is ESMS. Therefore, the constraint in (18) can be removed
assumptions that affect the performance of the model. But
and the model only focuses on the estimation task. The training
at the same time, considering Cart samples from the global
objective is expressed as follows.
domain can supplement the information that helps improve the
L(Θ) = ω1 × LCT R + ω2 × LCT CV R + ω3 × LCT CAR . (19) generalization of algorithm. The degree of information gain
and constraint relaxation is related to the correlation between
C. Entire Space Multi-Task Model in Global Domain the search domain and the recommendation domain.
We only employ the Cart samples in the recommendation In conclusion, we propose three approaches (four models)
domain in ESMC and ESMS. In fact, items in the shopping to address the PSC issue. Considering the online performance
cart do not only come from the recommendation domain, but and the cost of model deployment, we finally choose ESMS to
also from the search domain. After a user searches for an deploy on the online environment. Later, our proposed models
item and adds it to the cart, it may still be exposed to the user are collectively referred to as ESMC-family.
TABLE IV: Comparison with SOTA multi-task learning baselines. The best results are shown in Bold and the second best
results are shown in Italic. Improvement is calculated as the relative increase of our best result compared to the best result in
the baselines.
City 1 City 2 City 3
CTR-AUC CTCVR-AUC CVR-AUC CTR-AUC CTCVR-AUC CVR-AUC CTR-AUC CTCVR-AUC CVR-AUC
Shared Bottom 0.73025 0.82245 0.71242 0.72674 0.81406 0.70152 0.73103 0.82364 0.70824
ESMM 0.72894 0.82395 0.71949 0.72546 0.81649 0.71002 0.73010 0.82532 0.71606
MMOE 0.72963 0.82254 0.71327 0.72592 0.81450 0.70239 0.73065 0.82366 0.70894
ESMM2 0.72995 0.83306 0.76824 0.72681 0.82692 0.75955 0.73100 0.83343 0.76117
ESMS 0.73093 0.83563 0.77048 0.72697 0.82849 0.76069 0.73140 0.83599 0.76385
ESMC 0.73111 0.83594 0.76862 0.72774 0.82936 0.75987 0.73176 0.83637 0.76339
Improvement 0.118% 0.346% 0.292% 0.128% 0.295% 0.150% 0.100% 0.353% 0.352%
City 4 City 5 City 6
CTR-AUC CTCVR-AUC CVR-AUC CTR-AUC CTCVR-AUC CVR-AUC CTR-AUC CTCVR-AUC CVR-AUC
Shared Bottom 0.72465 0.81497 0.69693 0.72902 0.82326 0.71015 0.73349 0.83099 0.71435
ESMM 0.72367 0.81704 0.70419 0.72777 0.82608 0.71966 0.73262 0.83156 0.72050
MMOE 0.72419 0.81481 0.69664 0.72836 0.82460 0.71322 0.73349 0.83076 0.71518
ESMM2 0.72515 0.82627 0.74893 0.72866 0.83532 0.76531 0.73336 0.84168 0.76769
ESMS 0.72484 0.82782 0.75016 0.72959 0.83716 0.76552 0.73433 0.84305 0.76847
ESMC 0.72806 0.84098 0.75788 0.73009 0.83737 0.76762 0.73493 0.84334 0.76871
Improvement 0.401% 1.780% 1.195% 0.147% 0.245% 0.302% 0.196% 0.197% 0.133%

TABLE V: Results of comparison of ESMS&ESMC and A. Experimental Settings

ESMS2 &ESMC2 . Improvement is calculated as the relative
Datasets5 We collect six offline datasets by collecting the
increase of models in global domain compared to models in
users’ feedback logs from six different cities between April 21,
recommender domain.
2023 and May 10, 2023 from a large-scale online platform
City 1 named Eleme, Alibaba’s takeaway platform that produces
CTR-AUC CTCVR-AUC CVR-AUC
nearly one billion behavioral data per day. The statistics of
ESMS2 0.73116 0.83589 0.77066
Improvement 0.031% 0.031% 0.023% the offline datasets are listed in Table III.
ESMC2 0.74015 0.84166 0.76966 Metrics To evaluate the performance of our proposed
Improvement 1.236% 0.684% 0.135% method, we select three widely used metrics for offline
City 2 test, i.e., CTR-AUC, CTCVR-AUC and CVR-AUC for CTR,
CTR-AUC CTCVR-AUC CVR-AUC
ESMS2 0.72746 0.82933 0.76082 CTCVR and CVR estimation.
Improvement 0.067% 0.101% 0.017% 1
ESMC2 0.73518 0.83615 0.76039 AUC = Σp∈P Σn∈N I(Θ(p) > Θ(n)), (21)
Improvement 1.022% 0.818% 0.684%
|P ||N |
City 3 where P and N denote positive sample set and negative
CTR-AUC CTCVR-AUC CVR-AUC sample set, respectively. Θ is the estimator function and I
ESMS2 0.73166 0.83621 0.76388
Improvement 0.035% 0.026% 0.004%
is the indicator function.
ESMC2 0.73835 0.84272 0.76496 Baselines The representative state-of-the-art approaches are
Improvement 0.900% 0.759% 0.205% listed as follows. All models are equipped with DIN [46]
and ETA [5], sequential recommenders which propose inter-
est extraction from users’ historical behaviors with attention
VI. E XPERIMENTS mechanism [32], to extract user’s long/short-term interests.
We conduct extensive experiments to evaluate the perfor- • Shared Bottom To tackle multi-task learning, the output
mance of ESMC-family and the following research questions layer is replaced by two fully-connected towers for the
(RQs) are answered: corresponding two tasks. Moreover, the two tasks share
• RQ1 Do ESMC and ESMS outperform state-of-the-art the same bottom to learn common features [31].
multi-task estimators? • ESMM ESMM models conversion rate with conditional
• RQ2 What is the difference between the performance of probability on the user decision-making path of “ex-
ESMC and that of ESMS? How to choose the proper posure click purchase” without considering the in-shop
one? actions.
• RQ3 What is the difference between the performance of • MMoE MMoE employs several expert networks with
ESMC&ESMS and that of ESMG? gate controllers to leverage different downstream tasks
• RQ4 How do critical components affect the performance [25].
of ESMC-family?
5 To the best of our knowledge, there are no public datasets suited for this
• RQ5 Can ESMC-family address PSC issue?
task, and we believe that a huge amount of data can significantly verify the
• RQ6 Does ESMC-family work in real large-scale online performance of recommender. We will release our datasets for future research
recommendation scenarios? after desensitization and checking.
• ESMM2 ESMM2 models conversion rate with condi- recommenders. In our proposed methods, by solving ESMM2 ’s
tional probability on the user decision-making graph bad case: Probability Space Confusion issue, we further im-
of “exposure click in-shop action purchase” that is de- prove the performance of CVR estimation of ESMM2 , and
graded by PSC issue. achieve SOTA in all experiments!
Training Protocol All models in this paper are implemented Considering ESMS v.s. ESMC, ESMC is better than ESMS
with Tensorflow 1.12 in Python 2.7 environment. All models on average. Sometimes, ESMS performs better in CVR pre-
are trained in a chief-worker distributed structure with 1600 diction task, which may be related to the final Conversion Rate
CPUs [14]. AdagradDecay [1] is chosen as our optimizer for of goods in shopping carts for users in different cities. ESMC
model training and activation function is set to LeakyReLU can adjust the coefficient of parameter constraint to manipulate
[10]. The initial learning rate is set to 0.005, batch size is the model performance. Therefore, if you are looking for
set to 1024, and training epoch is set to 1 because of one- better results, ESMC is recommended. If you don’t want to
epoch phenomenon [44]. The top-k for behavior sequence in adjust hyper-parameters, ESMS is recommended. To facilitate
ETA is set to 50. All models are employed the same warm-up version iteration, we chose to deploy ESMS to serve users.
technique to maintain the training stability. In training stage,
C. RQ3: Recommender Domain v.s. Global Domain
the weights of CTR, CVR and CAR loss are equal to 1. The
coefficient of parameter constraint for ESMC is selected from Table V indicates that ESMG absolutely outperforms
{0.01, 0.05, 0.1, 0.5, 1.0} and the coefficient of CAR loss in ESMS&ESMC in our experiments. However, the Improvement
global domain is selected from {0.1, 0.3, 0.5, 0.7, 1.0}. of ESMS2 is quite small. On the contrary, ESMC2 achieve a
considerable improvement compared with ESMC. Especially,
B. RQ1&RQ2: Comparison with Baselines CTR-AUC raises over 1% and CTCVR-AUC raises over 0.7%
Table IV indicates that ESMS and ESMC outperform all on average! The performance of ESMS2 is not significant
SOTA baselines in terms of CTR-AUC, CTCVR-AUC and probably because CTCVR and CTCAR are trainined with the
CVR-AUC on six real-world datasets. All the best results shared parameters. After bringing global domain Cart samples,
come from our methods and all the second best results come the sample distribution becomes more complex, which is
from our methods except comparison on City 4 dataset in difficult to fit. However, ESMC can handle this problem. As
terms of CTR-AUC. Besides, the Improvement proves that a result, ESMC2 has made a huge improvement. Thus, on the
our proposed approaches achieve a further significant im- leaderboard of ESMC-family, ESMC2 is the best while ESMS
provement6 . Especially, on City 4 dataset, ESMC achieves a is the worst.
huge improvement compared to the best baseline ESMM2 , i.e., D. RQ4: Parameter Sensitivity and Ablation Study
CTR-AUC +0.401%, CTCVR-AUC +1.780%, and CVR-AUC
+1.195%. Fig. 5 illustrates the curves of model performance varying
with coefficient of parameter constraint in ESMC. It can be
Considering the baselines, Shared Bottom performs well on
seen that the performance of model shows a fluctuating curve,
the CTR estimation task. Because of the low coupling of the
and there is no obvious increasing/decreasing tendency. Es-
parameters of CTR and CVR estimation in Shared Bottom,
pecially, ESMC generally performs well when the coefficient
the prediction of the two tasks is more independent [6]. At the
of parameter constraint being set to 0.05 in our experiments.
same time, the number of click samples is more than that of
However, this does not mean that 0.05 is the best param-
conversion samples, so it can achieve relatively high accuracy
eter for ESMC to trace the underlying relationship in path
in the CTR task. However, this does not mean that other
of “exposure click Cart purchase”. Because the optimization
baselines are poor, quite the contrary, MMoE and ESMM have
objective of ESMC is a linear combination of multiple sub-
a wide range of applications in industry [13, 35]. On average,
objectives. Optimizing multiple sub-objectives at the same
Shared Bottom’s CVR prediction is the worst, and in industrial
time will affect the training of the model, and it is difficult to
practice, Conversion Rate is generally more important than
find the Pareto Optimal [7]. Therefore, we need to adjust this
Click-Through Rate. In our proposed approach, CTR and
parameter carefully when using ESMC. Fortunately, ESMC
CVR tasks are linked through their intrinsic relationships
may offer better returns.
and in-shop information, while maintaining a high degree of
independence between CTR and CVR tasks, thus achieving Fig. 6 shows the curves of model performance varying with
better results on CTR prediction. weight of all domain loss in ESMC2 . It can be seen that
the trend of curve is to rise first followed by a decreasing.
For CTCVR-AUC and CVR-AUC, ESMM2 gives the best
Especially, ESMC2 generally performs well when the weight
results among baselines. Especially, Post-Click Conversion
of global domain loss being set to 0.5 in our experiments.
Rate has improved significantly, with an average increase of
This result is in line with the distribution of traffic in our
more than 5%! This fully illustrates the importance of intro-
business. Therefore, we recommend to set the weight of the
ducing Cart behavior to CVR estimation, and also shows that
loss function directly according to the traffic distribution of
our work on enhancing ESMM2 is critical and meaningful for
different domains.
6 Note that the 0.1% AUC gain is already considerable in large-scale Fig. 7 presents the ablation study on Sample Calibration.
industrial recommender [8, 46]. We select ESMC as the research object because it is the initial
(a) CTR-AUC on City 6 dataset. (b) CTCVR-AUC on City 6 dataset. (c) CVR-AUC on City 6 dataset.

(d) CTR-AUC on City 4 dataset. (e) CTCVR-AUC on City 4 dataset. (f) CVR-AUC on City 4 dataset.

Fig. 5: Sensitivity of coefficient of parameter constraint in ESMC. The horizontal axis represents the parameter, and the vertical
axis represents AUC.

(a) CTR-AUC on City 1 dataset. (b) CTCVR-AUC on City 1 dataset. (c) CVR-AUC on City 1 dataset.

(d) CTR-AUC on City 2 dataset. (e) CTCVR-AUC on City 2 dataset. (f) CVR-AUC on City 2 dataset.
2
Fig. 6: Sensitivity of weight of global domain loss in ESMC . The horizontal axis represents the weight, and the vertical axis
represents AUC.

model in ESMC-family. ESMC detached Sample Calibration samples (purchase label = 1) to be divided into two groups:
is named ESMC-. The results show that Sample Calibration 1) Bad Case: Cart and purchase are not in the same exposure
can improve the performance significantly. We observe that space and 2) Good Case: Cart and purchase are in the same
when it is removed, CTCVR-AUC would fall by 0.11% on exposure space. Because all samples’ conversion labels are
average, which implies that Sample Calibration is a simple 1, which means that all samples’ click labels are also 1,
and efficient strategy to maintain the consistency of sample considering CTR-AUC doesn’t make sense. Besides, CTCVR-
selection and probability space. AUC is equal to CVR-AUC here. Therefore, we just consider
CVR-AUC in this experiment. Table VI shows that ESMS
E. RQ5: Case Study. absolutely outperforms ESMM2 on both the Good Case and
the Bad Case. There is no doubt that ESMC-family can address
In Section 4, we have discussed the Bad Case and the
PSC issue perfectly.
Good Case of ESMM2 . ESMC-family is tailored for PSC
issue to handle the Bad Case. Here, we do experiments to Especially, on average, ESMS improves CVR-AUC by over
prove that the model does solve PSC issue and also improves 6% on the Bad Case and near 5% on the Good Case!!! The
the performance on the Good Case. We select conversion reason why the model can achieve a huge improvement on
TABLE VI: The comparison results on the Bad Case and the Good Case in terms of CVR-AUC. Improvement is calculated
as the relative increase of ESMS compared to ESMM2 .
Bad Case
City 1 City 2 City 3 City 4 City 5 City 6
ESMM2 0.65848 0.66022 0.66176 0.67414 0.66887 0.67514
ESMS 0.70224 0.70220 0.70426 0.71108 0.71114 0.71213
Improvement 6.645% 6.358% 6.422% 5.479% 6.319% 5.478%
Good Case
City 1 City 2 City 3 City 4 City 5 City 6
ESMM2 0.68472 0.69251 0.68935 0.72147 0.69270 0.70194
ESMS 0.72341 0.72469 0.72448 0.74979 0.72545 0.72902
Improvement 5.650% 4.646% 5.096% 3.925% 4.727% 3.857%

TABLE VII: Online A/B performances for consecutive 7 days on Eleme.

Day 1 2 3 4 5 6 7 Avg.
NU +0.75% +1.05% +0.45% +1.14% +0.40% +1.06% +0.33% +0.74%
NO +1.42% +1.41% +0.67% +1.41% +0.54% +1.38% +0.56% +1.05%
OR +1.57% +1.45% +0.76% +1.44% +0.59% +1.33% +0.75% +1.12%
NG +1.33% +0.88% +0.39% +1.77% +0.62% +1.44% +0.63% +1.00%

homepage of Eleme. The online base model is a variation

of MMoE with long/short-term behavior sequence module
like DIN and ETA. Here, we select four business-related
metrics: Number of Paying Users (NU), Number of Orders
(NO), Order Rate (OR) and Net GMV (NG, a measure of
net profit) to evaluate the performance in online environment.
Results of strictly online A/B tests are shown in Table VII.
We can see that the proposed ESMS, an alternative of ESMC,
consistently outperforms the base model. On average, our
method improves NU by 0.74%, NO by 1.05%, OR by
1.12% and NG by 1.00% compared with the base model,
(a) CTR-AUC on City 3 (b) CTCVR-AUC on (c) CVR-AUC on City 3 which demonstrates the effectiveness of ESMC-families in
dataset. City 3 dataset. dataset.
large-scale online recommendation system and the proposed
approach has been deployed on the homepage of Eleme,
Alibaba’s online takeaway platform serving more than one
billion recommendation requests per day.

VII. C ONCLUSION AND F UTURE W ORK

In this paper, we report Probability Space Confusion issue in
the chain rule of conditional probability about in-shop actions
(e.g. Cart/Wish-list) and present a mathematical explanation
to illustrate the gap between ground-truth and estimation. We
show that the key points are inconsistent sample space and
(d) CTR-AUC on City 4 (e) CTCVR-AUC on (f) CVR-AUC on City 4 decoupling of purchase and Cart. Based on this, we propose a
dataset. City 4 dataset. dataset.
novel Entire Space Multi-Task Model via Parameter Constraint
Fig. 7: Ablation study on Sample Calibration in ESMC. The and two alternatives: Entire Space Multi-Task Model with
blue bar is for ESMC- while the orange bar is for ESMC. Siamese Network and Entire Space Multi-Task Model in
Global Domain (ESMC2 and ESMS2 ) to address PSC problem.
Extensive offline experiments prove the performance of our
the Good Case may be that the parameter constraint strategy proposed approaches and the seven-day online A/B test shows
introduces more information about Cart, which is helpful for that ESMS yields significant financial benefits to our business.
the fitting of the high-dimensional function of â in (17). To support the future research, we discuss the advantages
and disadvantages of three proposed approaches and decide
F. RQ6: Online A/B Test to release real-world datasets and code. The future work may
From June 22, 2023 to June 28, 2023, we conducted a include exploring a better parameter constraint strategy to
seven-day online experiment by deploying ESMS (ESMS in make ESMC’s training more stable and a better method to
recommender domain) to the recommendation scenario on the calibrate sample space.
R EFERENCES J. Qian, H. Liu, and C. Guo, “Tiresias: A {GPU}
cluster manager for distributed deep learning,” in 16th
[1] K. Antonakopoulos, P. Mertikopoulos, G. Piliouras, and USENIX Symposium on Networked Systems Design and
X. Wang, “Adagrad avoids saddle points,” in Interna- Implementation (NSDI 19), 2019, pp. 485–500.
tional Conference on Machine Learning. PMLR, 2022, [15] T. Gu, K. Kuang, H. Zhu, J. Li, Z. Dong, W. Hu, Z. Li,
pp. 731–771. X. He, and Y. Liu, “Estimating true post-click conversion
[2] T. Bansal, D. Belanger, and A. McCallum, “Ask the gru: via group-stratified counterfactual inference,” 2021.
Multi-task learning for deep text recommendations,” in [16] R. Gupte, S. Rege, S. Hawa, Y. Rao, and R. Sawant,
proceedings of the 10th ACM Conference on Recom- “Automated shopping cart using rfid with a collabora-
mender Systems, 2016, pp. 107–114. tive clustering driven recommendation system,” in 2020
[3] L. Chen, J. Cao, H. Chen, W. Liang, H. Tao, and Second International Conference on Inventive Research
G. Zhu, “Attentive multi-task learning for group itinerary in Computing Applications (ICIRCA). IEEE, 2020, pp.
recommendation,” Knowledge and Information Systems, 400–404.
vol. 63, no. 7, pp. 1687–1716, 2021. [17] G. Hadash, O. S. Shalom, and R. Osadchy, “Rank and
[4] L. Chen, Z. Li, T. Xu, H. Wu, Z. Wang, N. J. Yuan, rate: multi-task learning for recommender systems,” in
and E. Chen, “Multi-modal siamese network for entity Proceedings of the 12th ACM Conference on Recom-
alignment,” in Proceedings of the 28th ACM SIGKDD mender Systems, 2018, pp. 451–454.
conference on knowledge discovery and data mining, [18] R. Kumar, S. M. Naik, V. D. Naik, S. Shiralli, V. Sunil,
2022, pp. 118–126. and M. Husain, “Predicting clicks: Ctr estimation of
[5] Q. Chen, C. Pei, S. Lv, C. Li, J. Ge, and W. Ou, “End-to- advertisements using logistic regression classifier,” in
end user behavior retrieval in click-through rateprediction 2015 IEEE international advance computing conference
model,” arXiv preprint arXiv:2108.04468, 2021. (IACC). IEEE, 2015, pp. 1134–1138.
[6] X. Chen, Z. Cheng, S. Xiao, X. Zeng, and W. Huang, [19] K.-c. Lee, B. Orten, A. Dasdan, and W. Li, “Estimating
“Cross-domain augmentation networks for click-through conversion rate in display advertising from past erfor-
rate prediction,” arXiv preprint arXiv:2305.03953, 2023. mance data,” in Proceedings of the 18th ACM SIGKDD
[7] K. Deb and H. Gupta, “Searching for robust pareto- international conference on Knowledge discovery and
optimal solutions in multi-objective optimization,” in data mining, 2012, pp. 768–776.
International conference on evolutionary multi-criterion [20] D. Li, B. Hu, Q. Chen, X. Wang, Q. Qi, L. Wang,
optimization. Springer, 2005, pp. 150–164. and H. Liu, “Attentive capsule network for click-through
[8] B. Du, S. Lin, J. Gao, X. Ji, M. Wang, T. Zhou, rate and conversion rate prediction in online advertising,”
H. He, J. Jia, and N. Hu, “Basm: A bottom-up adaptive Knowledge-Based Systems, vol. 211, p. 106522, 2021.
spatiotemporal model for online food ordering service,” [21] H. Li, Y. Wang, Z. Lyu, and J. Shi, “Multi-task learn-
arXiv preprint arXiv:2211.12033, 2022. ing for recommendation over heterogeneous information
[9] X. Feng, Z. Liu, W. Wu, and W. Zuo, “Social rec- network,” IEEE Transactions on Knowledge and Data
ommendation via deep neural network-based multi-task Engineering, vol. 34, no. 2, pp. 789–802, 2020.
learning,” Expert Systems with Applications, vol. 206, p. [22] Z. Lin, H. Wang, J. Mao, W. X. Zhao, C. Wang, P. Jiang,
117755, 2022. and J.-R. Wen, “Feature-aware diversified re-ranking with
[10] S. Frei, G. Vardi, P. L. Bartlett, N. Srebro, and W. Hu, disentangled representations for relevant recommenda-
“Implicit bias in leaky relu networks trained on high- tion,” in Proceedings of the 28th ACM SIGKDD Confer-
dimensional data,” arXiv preprint arXiv:2210.07082, ence on Knowledge Discovery and Data Mining, 2022,
2022. pp. 3327–3335.
[11] C. Gao, X. He, D. Gan, X. Chen, F. Feng, Y. Li, T.- [23] H. Liu, D. Tang, J. Yang, X. Zhao, H. Liu, J. Tang, and
S. Chua, and D. Jin, “Neural multi-task recommendation Y. Cheng, “Rating distribution calibration for selection
from multi-behavior data,” in 2019 IEEE 35th interna- bias mitigation in recommendations,” in Proceedings of
tional conference on data engineering (ICDE). IEEE, the ACM Web Conference 2022, 2022, pp. 2048–2057.
2019, pp. 1554–1557. [24] Y. Lu, R. Dong, and B. Smyth, “Why i like it: multi-task
[12] J. Gao, S. Han, H. Zhu, S. Yang, Y. Jiang, J. Xu, and learning for recommendation and explanation,” in Pro-
B. Zheng, “Rec4ad: A free lunch to mitigate sample ceedings of the 12th ACM Conference on Recommender
selection bias for ads ctr prediction in taobao,” arXiv Systems, 2018, pp. 4–12.
preprint arXiv:2306.03527, 2023. [25] J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H.
[13] X. Gong, Q. Feng, Y. Zhang, J. Qin, W. Ding, B. Li, Chi, “Modeling task relationships in multi-task learning
P. Jiang, and K. Gai, “Real-time short video recom- with multi-gate mixture-of-experts,” in Proceedings of
mendation on mobile devices,” in Proceedings of the the 24th ACM SIGKDD international conference on
31st ACM International Conference on Information & knowledge discovery & data mining, 2018, pp. 1930–
Knowledge Management, 2022, pp. 3103–3112. 1939.
[14] J. Gu, M. Chowdhury, K. G. Shin, Y. Zhu, M. Jeon, [26] X. Ma, L. Zhao, G. Huang, Z. Wang, Z. Hu, X. Zhu,
and K. Gai, “Entire space multi-task model: An effective on Information & Knowledge Management, 2021, pp.
approach for estimating post-click conversion rate,” in 3528–3532.
The 41st International ACM SIGIR Conference on Re- [38] H. Wen, J. Zhang, F. Lv, W. Bao, T. Wang, and Z. Chen,
search & Development in Information Retrieval, 2018, “Hierarchically modeling micro and macro behaviors
pp. 1137–1140. via multi-task learning for conversion rate prediction,”
[27] L. Peska, A. Eckhardt, and P. Vojtas, “Upcomp-a in Proceedings of the 44th International ACM SIGIR
php component for recommendation based on user be- Conference on Research and Development in Information
haviour,” in 2011 IEEE/WIC/ACM International Confer- Retrieval, 2021, pp. 2187–2191.
ences on Web Intelligence and Intelligent Agent Technol- [39] H. Wen, J. Zhang, Y. Wang, F. Lv, W. Bao, Q. Lin,
ogy, vol. 3. IEEE, 2011, pp. 306–309. and K. Yang, “Entire space multi-task modeling via
[28] S. Pradhan, P. R. Krishna, S. S. Rout, and K. Jonna, post-click behavior decomposition for conversion rate
“Wish-list based shopping path discovery and profitable prediction,” in Proceedings of the 43rd International
path recommendations,” in 2012 Third International ACM SIGIR Conference on Research and Development
Conference on Services in Emerging Markets. IEEE, in Information Retrieval, ser. SIGIR ’20. New York,
2012, pp. 101–106. NY, USA: Association for Computing Machinery, 2020,
[29] M. Richardson, E. Dominowska, and R. Ragno, “Predict- p. 2377–2386.
ing clicks: estimating the click-through rate for new ads,” [40] Z. Xu, S. Wen, J. Wang, G. Liu, L. Wang, Z. Yang,
in Proceedings of the 16th international conference on L. Ding, Y. Zhang, D. Zhang, J. Xu, and B. Zheng,
World Wide Web, 2007, pp. 521–530. “Amcad: Adaptive mixed-curvature representation based
[30] R. Takada, K. Hoshimure, T. Iwamoto, and J. Baba, “Pop advertisement retrieval system,” in 2022 IEEE 38th Inter-
cart: Product recommendation system by an agent on a national Conference on Data Engineering (ICDE), 2022,
shopping cart,” in 2021 30th IEEE International Con- pp. 3439–3452.
ference on Robot & Human Interactive Communication [41] W. Zhang, W. Bao, X.-Y. Liu, K. Yang, Q. Lin, H. Wen,
(RO-MAN). IEEE, 2021, pp. 59–66. and R. Ramezani, “Large-scale causal approaches to de-
[31] H. Tang, J. Liu, M. Zhao, and X. Gong, “Progressive biasing post-click conversion rate estimation with multi-
layered extraction (ple): A novel multi-task learning (mtl) task learning,” in Proceedings of The Web Conference
model for personalized recommendations,” in Proceed- 2020, 2020, pp. 2775–2781.
ings of the 14th ACM Conference on Recommender [42] X. Zhang, B. Xu, L. Yang, C. Li, F. Ma, H. Liu,
Systems, 2020, pp. 269–278. and H. Lin, “Price does matter! modeling price and
[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, interest preferences in session-based recommendation,”
L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, in Proceedings of the 45th International ACM SIGIR
“Attention is all you need,” Advances in neural informa- Conference on Research and Development in Information
tion processing systems, vol. 30, 2017. Retrieval, 2022, pp. 1684–1693.
[33] C. Wang, W. Ma, C. Chen, M. Zhang, Y. Liu, and S. Ma, [43] Y. Zhang, L. Chen, S. Yang, M. Yuan, H. Yi, J. Zhang,
“Sequential recommendation with multiple contrast sig- J. Wang, J. Dong, Y. Xu, Y. Song, Y. Li, D. Zhang,
nals,” ACM Transactions on Information Systems, vol. 41, W. Lin, L. Qu, and B. Zheng, “Picasso: Unleashing
no. 1, pp. 1–27, 2023. the potential of gpu-centric training for wide-and-deep
[34] H. Wang, T.-W. Chang, T. Liu, J. Huang, Z. Chen, recommender systems,” in 2022 IEEE 38th International
C. Yu, R. Li, and W. Chu, “Escm2: Entire space coun- Conference on Data Engineering (ICDE), 2022, pp.
terfactual multi-task model for post-click conversion rate 3453–3466.
estimation,” in Proceedings of the 45th International [44] Z.-Y. Zhang, X.-R. Sheng, Y. Zhang, B. Jiang, S. Han,
ACM SIGIR Conference on Research and Development H. Deng, and B. Zheng, “Towards understanding the
in Information Retrieval, 2022, pp. 363–372. overfitting phenomenon of deep click-through rate mod-
[35] Q. Wang, H. Yin, T. Chen, Z. Huang, H. Wang, Y. Zhao, els,” in Proceedings of the 31st ACM International Con-
and N. Q. Viet Hung, “Next point-of-interest recom- ference on Information & Knowledge Management, 2022,
mendation on resource-constrained mobile devices,” in pp. 2671–2680.
Proceedings of the Web conference 2020, 2020, pp. 906– [45] G. Zhou, N. Mou, Y. Fan, Q. Pi, W. Bian, C. Zhou,
916. X. Zhu, and K. Gai, “Deep interest evolution network
[36] S. Wang, L. Cao, Y. Wang, Q. Z. Sheng, M. A. Orgun, for click-through rate prediction,” in Proceedings of the
and D. Lian, “A survey on session-based recommender AAAI conference on artificial intelligence, vol. 33, no. 01,
systems,” ACM Computing Surveys (CSUR), vol. 54, 2019, pp. 5941–5948.
no. 7, pp. 1–38, 2021. [46] G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan,
[37] P. Wei, W. Zhang, Z. Xu, S. Liu, K.-c. Lee, and B. Zheng, J. Jin, H. Li, and K. Gai, “Deep interest network for
“Autoheri: Automated hierarchical representation inte- click-through rate prediction,” in Proceedings of the 24th
gration for post-click conversion rate estimation,” in ACM SIGKDD international conference on knowledge
Proceedings of the 30th ACM International Conference discovery & data mining, 2018, pp. 1059–1068.

ESCM2- Entire Space Counterfactual Multi-Task Model for Post-Click Conversion Rate Estimation
No ratings yet
ESCM2- Entire Space Counterfactual Multi-Task Model for Post-Click Conversion Rate Estimation
10 pages
Entire Space Multi-Task Model-An Effective Approach for Estimating Post-Click Conversion Rate
No ratings yet
Entire Space Multi-Task Model-An Effective Approach for Estimating Post-Click Conversion Rate
4 pages
Graph Layout Support for Model-Driven Engineering
From Everand
Graph Layout Support for Model-Driven Engineering
Miro Spönemann
No ratings yet
Lexicon of Computer Science Terminology: Lexicon of Tech and Business, #16
From Everand
Lexicon of Computer Science Terminology: Lexicon of Tech and Business, #16
Mustafa Al-Dori
4/5 (1)
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
Cloud Brokering
From Everand
Cloud Brokering
Felipe Díaz-Sánchez
No ratings yet
Entropy 22 00643
No ratings yet
Entropy 22 00643
18 pages
Accelerated DevOps with AI, ML & RPA: Non-Programmer’s Guide to AIOPS & MLOPS
From Everand
Accelerated DevOps with AI, ML & RPA: Non-Programmer’s Guide to AIOPS & MLOPS
Stephen Fleming
5/5 (2)
Model-Driven Online Capacity Management for Component-Based Software Systems
From Everand
Model-Driven Online Capacity Management for Component-Based Software Systems
André van Hoorn
No ratings yet
Fluent Simulation and Modeling Techniques: Definitive Reference for Developers and Engineers
From Everand
Fluent Simulation and Modeling Techniques: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Big Data Analytics for Human-Computer Interactions: A New Era of Computation
From Everand
Big Data Analytics for Human-Computer Interactions: A New Era of Computation
Kuldeep Singh Kaswan
No ratings yet
Mastering Partial Least Squares Structural Equation Modeling (Pls-Sem) with Smartpls in 38 Hours
From Everand
Mastering Partial Least Squares Structural Equation Modeling (Pls-Sem) with Smartpls in 38 Hours
Ken Kwong-Kay Wong
3/5 (1)
Sussman Anomaly: Fundamentals and Applications
From Everand
Sussman Anomaly: Fundamentals and Applications
Fouad Sabry
No ratings yet
Edge Computing Applications in Supply Chain Management
From Everand
Edge Computing Applications in Supply Chain Management
Bo Li
No ratings yet
Computational Geometry: Exploring Geometric Insights for Computer Vision
From Everand
Computational Geometry: Exploring Geometric Insights for Computer Vision
Fouad Sabry
No ratings yet
Kernel Methods: Fundamentals and Applications
From Everand
Kernel Methods: Fundamentals and Applications
Fouad Sabry
No ratings yet
Computational Intelligence and its Applications
From Everand
Computational Intelligence and its Applications
Vikash Yadav
No ratings yet
Mivar NETs and logical inference with the linear complexity
From Everand
Mivar NETs and logical inference with the linear complexity
Varlamov, Oleg O.
No ratings yet
Hierarchical Control System: Fundamentals and Applications
From Everand
Hierarchical Control System: Fundamentals and Applications
Fouad Sabry
No ratings yet
Omni-Path Architecture and Implementation: Definitive Reference for Developers and Engineers
From Everand
Omni-Path Architecture and Implementation: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
CX Engineering and Practice: Definitive Reference for Developers and Engineers
From Everand
CX Engineering and Practice: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Machine Vision: Insights into the World of Computer Vision
From Everand
Machine Vision: Insights into the World of Computer Vision
Fouad Sabry
No ratings yet
Practical MXNet Applications: Definitive Reference for Developers and Engineers
From Everand
Practical MXNet Applications: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
CQRS Architecture and Implementation: Definitive Reference for Developers and Engineers
From Everand
CQRS Architecture and Implementation: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
System Design Basics
From Everand
System Design Basics
Kai Turing
No ratings yet
Process Performance Models: Statistical, Probabilistic & Simulation
From Everand
Process Performance Models: Statistical, Probabilistic & Simulation
Vishnuvarthanan Moorthy
No ratings yet
Foundational Models and Architectures S1: Generative AI, #1
From Everand
Foundational Models and Architectures S1: Generative AI, #1
Leaster Startx
No ratings yet
Defect Prediction in Software Development & Maintainence
From Everand
Defect Prediction in Software Development & Maintainence
Rudra Kumar
No ratings yet
Dancing on a Cloud: A Framework for Increasing Business Agility
From Everand
Dancing on a Cloud: A Framework for Increasing Business Agility
David Sterling
No ratings yet
The Decision Maker's Ultimate Guide to MBSE
From Everand
The Decision Maker's Ultimate Guide to MBSE
Ernest Stambouly
No ratings yet
Industrial Cases in Simulation Modeling
From Everand
Industrial Cases in Simulation Modeling
James A. Chisman PhD
No ratings yet
Building Scalable Systems with C: Optimizing Performance and Portability
From Everand
Building Scalable Systems with C: Optimizing Performance and Portability
Larry Jones
No ratings yet
Cloud vs Edge
From Everand
Cloud vs Edge
Isaac Berners-Lee
No ratings yet
View Synthesis: Exploring Perspectives in Computer Vision
From Everand
View Synthesis: Exploring Perspectives in Computer Vision
Fouad Sabry
No ratings yet
Essays on Infrastructure-as-code
From Everand
Essays on Infrastructure-as-code
Ravi Rajamani
No ratings yet
Visualised Systems Engineering on Railway Projects
From Everand
Visualised Systems Engineering on Railway Projects
Jong-Pil Nam
No ratings yet
Visual Sensor Network: Exploring the Power of Visual Sensor Networks in Computer Vision
From Everand
Visual Sensor Network: Exploring the Power of Visual Sensor Networks in Computer Vision
Fouad Sabry
No ratings yet
Artificial Intelligence and Knowledge Processing: Methods and Applications
From Everand
Artificial Intelligence and Knowledge Processing: Methods and Applications
Hemachandran K.
No ratings yet
Group Method of Data Handling: Fundamentals and Applications for Predictive Modeling and Data Analysis
From Everand
Group Method of Data Handling: Fundamentals and Applications for Predictive Modeling and Data Analysis
Fouad Sabry
No ratings yet
Activity Recognition: Fundamentals and Applications
From Everand
Activity Recognition: Fundamentals and Applications
Fouad Sabry
No ratings yet
Automatic Target Recognition: Fundamentals and Applications
From Everand
Automatic Target Recognition: Fundamentals and Applications
Fouad Sabry
No ratings yet
Statistics with Rust: 50+ Statistical Techniques Put into Action
From Everand
Statistics with Rust: 50+ Statistical Techniques Put into Action
Keiko Nakamura
No ratings yet
Introduction to Quantum Computing & Machine Learning Technologies: 1, #1
From Everand
Introduction to Quantum Computing & Machine Learning Technologies: 1, #1
M. Sreedevi
No ratings yet
Challenges and Opportunities for Deep Learning Applications in Industry 4.0
From Everand
Challenges and Opportunities for Deep Learning Applications in Industry 4.0
Vaishali Mehta
No ratings yet
Practical Replication Architectures and Protocols: Definitive Reference for Developers and Engineers
From Everand
Practical Replication Architectures and Protocols: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Advanced Dynamic-System Simulation: Model Replication and Monte Carlo Studies
From Everand
Advanced Dynamic-System Simulation: Model Replication and Monte Carlo Studies
Granino A. Korn
No ratings yet
Image Segmentation: Unlocking Insights through Pixel Precision
From Everand
Image Segmentation: Unlocking Insights through Pixel Precision
Fouad Sabry
No ratings yet
Compiler Frontiers Unveiled
From Everand
Compiler Frontiers Unveiled
Azhar ul Haque Sario
No ratings yet
Automatic Target Recognition: Advances in Computer Vision Techniques for Target Recognition
From Everand
Automatic Target Recognition: Advances in Computer Vision Techniques for Target Recognition
Fouad Sabry
No ratings yet
Edge Cloud Operations: A Systems Approach
From Everand
Edge Cloud Operations: A Systems Approach
Larry L Peterson
No ratings yet
Boost.Thread in Practice: Definitive Reference for Developers and Engineers
From Everand
Boost.Thread in Practice: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
[2019][Huawei][PAL] a Position-bias Aware Learning Framework for CTR Prediction in Live Recommender Systems
No ratings yet
[2019][Huawei][PAL] a Position-bias Aware Learning Framework for CTR Prediction in Live Recommender Systems
6 pages
Prediction of Consumer Purchase Intension
100% (1)
Prediction of Consumer Purchase Intension
4 pages
PPM-A Pre-Trained Plug-In Model For Click-Through Rate Prediction-2024
No ratings yet
PPM-A Pre-Trained Plug-In Model For Click-Through Rate Prediction-2024
8 pages
Sales Prediction and Product Recommendation Model Through
No ratings yet
Sales Prediction and Product Recommendation Model Through
20 pages
[21.12] Adversarial Gradient Driven Exploration for Deep Click-Through Rate Prediction
No ratings yet
[21.12] Adversarial Gradient Driven Exploration for Deep Click-Through Rate Prediction
10 pages
BigMart Sale Prediction Using Machine Learning
No ratings yet
BigMart Sale Prediction Using Machine Learning
2 pages
1 s2.0 S0957417420301676 Main
No ratings yet
1 s2.0 S0957417420301676 Main
16 pages
Main_merged
No ratings yet
Main_merged
76 pages
2CIKM2020DeepMultifacetedTransformersforMulti-objectiveRankinginLarge-ScaleE-commerceRecommenderSystems
No ratings yet
2CIKM2020DeepMultifacetedTransformersforMulti-objectiveRankinginLarge-ScaleE-commerceRecommenderSystems
8 pages
Classification of Models
No ratings yet
Classification of Models
3 pages
Rogoberto Facebook Invest
No ratings yet
Rogoberto Facebook Invest
4 pages
Implementing The Singleton Design Pattern
No ratings yet
Implementing The Singleton Design Pattern
12 pages
DSP System Toolbox™ Getting Started Guide
No ratings yet
DSP System Toolbox™ Getting Started Guide
91 pages
Sub Netting 2
No ratings yet
Sub Netting 2
10 pages
Complete Download You Don t Know JS ES6 Beyond Kyle Simpson PDF All Chapters
100% (8)
Complete Download You Don t Know JS ES6 Beyond Kyle Simpson PDF All Chapters
85 pages
Book SAE R-349 - FM
0% (1)
Book SAE R-349 - FM
11 pages
The Fifteen Puzzle
No ratings yet
The Fifteen Puzzle
6 pages
GSRTC Dhandhuka Bhuj
No ratings yet
GSRTC Dhandhuka Bhuj
1 page
Best 2 RTL
No ratings yet
Best 2 RTL
183 pages
LPBEI Project Synopsis Format
No ratings yet
LPBEI Project Synopsis Format
6 pages
MQ v9.0 Requirements
No ratings yet
MQ v9.0 Requirements
94 pages
The Rockbox Manual For Sansa Clip Zip: February 26, 2017
No ratings yet
The Rockbox Manual For Sansa Clip Zip: February 26, 2017
236 pages
Power ON/OFF X-Ray Exposure: FCR Prima T2 (Cr-Ir 392) Quick Guide
100% (1)
Power ON/OFF X-Ray Exposure: FCR Prima T2 (Cr-Ir 392) Quick Guide
4 pages
Creating An App Spec
No ratings yet
Creating An App Spec
17 pages
Bayes_2
No ratings yet
Bayes_2
3 pages
COSMAC VIP Instruction Manual 1978
No ratings yet
COSMAC VIP Instruction Manual 1978
129 pages
Madhusudhanan P K180111
No ratings yet
Madhusudhanan P K180111
4 pages
User Manual
No ratings yet
User Manual
21 pages
CD Model Set-3 Answer Key
No ratings yet
CD Model Set-3 Answer Key
29 pages
Exam 1: Test Taking Instructions. No Calculators, Laptops or Other Assisting Devices Are Allowed. Write
No ratings yet
Exam 1: Test Taking Instructions. No Calculators, Laptops or Other Assisting Devices Are Allowed. Write
8 pages
Barcode Readers: Fixed-Mount Handheld Mobile
No ratings yet
Barcode Readers: Fixed-Mount Handheld Mobile
20 pages
Answer Sheet
No ratings yet
Answer Sheet
1 page
Adaptive Filtering and Change Dectection
No ratings yet
Adaptive Filtering and Change Dectection
496 pages
Creating A Simple Web Application Using A MySQL Database - NetBeans IDE Tutorial
No ratings yet
Creating A Simple Web Application Using A MySQL Database - NetBeans IDE Tutorial
19 pages
7.1.3.8 Dulay
No ratings yet
7.1.3.8 Dulay
2 pages
SAP File Explorer
100% (4)
SAP File Explorer
151 pages
FSE Chap 1
No ratings yet
FSE Chap 1
25 pages
Quiz 3 Big Data
No ratings yet
Quiz 3 Big Data
2 pages
File-Fujifilm MV-1 Digital Camera-4663.jpg PDF
No ratings yet
File-Fujifilm MV-1 Digital Camera-4663.jpg PDF
3 pages

Uploaded by

Uploaded by

ESMC: Entire Space Multi-Task Model for

Post-Click Conversion Rate via Parameter

[email protected], {biaozeng.zb, zhisu.fh, nanjia.lj}@alibaba-inc.com, [email protected]

TABLE V: Results of comparison of ESMS&ESMC and A. Experimental Settings

TABLE VII: Online A/B performances for consecutive 7 days on Eleme.

homepage of Eleme. The online base model is a variation

VII. C ONCLUSION AND F UTURE W ORK

You might also like