Entire Space Multi-Task Model
Abstract—Large-scale online recommender systems spread all over the Internet and are in charge of two basic tasks: Click-Through Rate (CTR) and Post-Click Conversion Rate (CVR) estimation. However, traditional CVR estimators suffer from the well-known Sample Selection Bias and Data Sparsity issues. Entire space models were proposed to address these two issues by tracing the decision-making path of "exposure → click → purchase". Further, some researchers observed that there are purchase-related behaviors between click and purchase, which can better capture the user's decision-making intention and improve the recommendation performance. Thus, the decision-making path has been extended to "exposure → click → in-shop action → purchase" and can be modeled with a conditional probability approach. Nevertheless, we observe that the chain rule of conditional probability does not always hold. We report the Probability Space Confusion (PSC) issue and give a mathematical derivation of the difference between ground-truth and estimation. We propose a novel Entire Space Multi-Task Model for Post-Click Conversion Rate via Parameter Constraint (ESMC) and two alternatives, the Entire Space Multi-Task Model with Siamese Network (ESMS) and the Entire Space Multi-Task Model in Global Domain (ESMG), to address the PSC issue. Specifically, we handle "exposure → click → in-shop action" and "in-shop action → purchase" separately in light of the characteristics of in-shop actions. The first path is still treated with conditional probability, while the second one is treated with a parameter constraint strategy. Experiments in both offline and online environments of a large-scale recommendation system illustrate the superiority of our proposed methods over state-of-the-art models. The code and real-world datasets will be released for further research.

Index Terms—Recommender System, Entire Space Multi-Task Learning, Conversion Rate Prediction, Probability Space Confusion

I. INTRODUCTION

Selecting the best-suited products from floods of candidates and delivering them to users based on their preferences has become a significant task on most online platforms such as online food booking, short video, and e-commerce [22, 40, 43, 46]. Recommender systems play an important role in handling this task timely and accurately with the help of deep learning algorithms [33, 44]. A recommendation service first recalls candidates from the item pool and then feeds them into a recommender algorithm to predict several metrics such as Click-Through Rate (CTR) and Post-Click Conversion Rate (CVR) [18, 19, 29]. Next, items are ranked according to CTR, CVR or other metrics and exposed on the user's terminal device. A user may click an item to enter the in-shop page, add it to the cart/wish-list, and purchase it, which can be described as a decision-making graph of "exposure → click → in-shop action → purchase" [16, 30]. This feedback is recorded and used to update the recommender algorithm, so that the system can capture the evolution of the user's interest and recent preferences. To provide users with a more accurate recommendation service, a high-quality CVR estimator is crucial in practice [45].

There are two basic issues in the CVR estimation task: Sample Selection Bias (SSB) and Data Sparsity (DS) [26], shown in Fig. 1a. SSB refers to the gap between the training sample space and the online inference sample space: traditional CVR estimators are trained on clicked samples but applied to exposed samples under the schema of the online recommendation service. DS refers to the issue that the set of clicked samples is too small to train a model that fits conversion well. Consequently, the performance of the recommender algorithm is unsatisfactory in online service [34]. SSB and DS are fundamental issues that must be overcome in industrial recommender systems. Many researchers have proposed entire space models to address SSB and DS [12, 23, 38]. The Entire Space Multi-Task Model (ESMM) is one representative of entire space models and will be presented in Section 3. Following ESMM, the Entire Space Multi-Task Model via Behavior Decomposition (ESMM²) was proposed to introduce in-shop behaviors into CVR estimation with almost the same ideology as ESMM [39].

Several studies have claimed that users may purchase items from the shopping cart or wish list, and we observe this phenomenon in real business as well [27, 28, 42]. The action

This paper was done when Zhenhao Jiang was an intern at Alibaba Group.
∗ Both authors contributed equally to this paper.
✉ Corresponding authors.
(a) Illustration of the SSB and DS issues in CVR estimation: the model is trained over clicked samples but used for inference on exposed samples. The size of samples diminishes from exposure to purchase.
(b) Demonstration of the Probability Space Confusion issue: Cart and Purchase may not be in the same visit.
(c) Demonstration of our approach for addressing the PSC issue: the orange line means sample space calibration and the blue line means information injection.
Fig. 1: Illustration of sample space. (a) SSB and DS issues. (b) Key problem in this paper: the PSC issue. (c) Key idea in this paper to handle the PSC issue.
of adding to the cart/wish-list (in-shop action¹) bridges click and purchase and is more conversion-related than click. Therefore, extracting the functionality of in-shop actions in the decision-making path is meaningful. In ESMM², the algorithm explicitly models the sequential behavior of "exposure → click → in-shop action → purchase" via conditional probability to leverage samples over the entire space and address the SSB and DS issues more efficiently. However, the probability-based approach does not always work: an in-shop action may come from another sample space that is not contained in the current exposure space, unlike click or purchase.

In this paper, we report the Probability Space Confusion (PSC) problem of ESMM²-like models², shown in Fig. 1b and presented in detail in Section 4. Because ESMM²-like models are widely used in industrial recommenders, it is critical and meaningful to improve them. We also present a mathematical derivation of the gap between estimation and ground-truth under the PSC issue and propose the Entire Space Multi-Task Model via Parameter Constraint (ESMC), which mainly consists of three modules, 1) shared embedding (SE), 2) constrained twin towers (CTT), and 3) sequential composition module (SCM), plus one strategy, Sample Calibration, to address the PSC problem. Before training, Sample Calibration unifies the sample space. In the model, SE first maps feature vectors into low-dimensional dense vectors. Then, CTT fits the Click-Through Conversion Rate (CTCVR) and the Click-Through Cart Adding Rate (CTCAR) under a given constraint. Finally, SCM combines CTR, CTCAR and CTCVR together to perform a multi-task estimation. Going further, we present two alternatives (i.e. ESMS and ESMG) and discuss their advantages and disadvantages to help practitioners choose the most suitable solution for their own business.

¹ In this paper, we focus on adding to the cart.
² In this paper, an ESMM²-like model is defined as a model that considers both the decision-making graph of "exposure → click → in-shop action → purchase" and the probabilistic dependence among different behaviors.

The main contributions of this work are as follows:
• This is the first work that reports the PSC issue in CVR estimation with in-shop behaviors. We demonstrate the problem from the perspective of sample space
and emphasize the importance of distinguishing between click/purchase and in-shop actions. We also highlight the mathematical theory behind the PSC issue.

TABLE I: Summarization of important abbreviations.

Abbreviation | Description
Cart   | the behavior of adding to the cart
CTR    | Click-Through Rate
CVR    | Post-Click Conversion Rate
CAR    | Cart Adding Rate
CTCAR  | Click-Through Cart Adding Rate
CTCVR  | Click-Through Conversion Rate
PSC    | Probability Space Confusion
SSB    | Sample Selection Bias
DS     | Data Sparsity

TABLE II: Summarization of notations.

Notation | Description
u | user
V | items exposed to the user
v | item
C / c_{u,v} | clicked items by the user / entry of C
O / o_{u,v} | purchased items by the user / entry of O
A / a_{u,v} | items added to the cart by the user / entry of A
X | exposure space
C | click space
O | conversion space
A | Cart space
X / x | exposure event / value of X
C / c | click event / value of C
A / a | click & Cart event / value of A
R / r | conversion event / value of R
letters with a hat (e.g. r̂) | the corresponding estimators given by the algorithm
• We propose ESMC, the first work that enhances ESMM² with a novel parameter constraint approach. ESMC avoids the PSC issue and improves the performance of ESMM². Extensive experimental results verify our claims.
• We also propose two alternatives to ESMC (i.e. ESMS and ESMG) and discuss their characteristics to help others identify the most suitable strategy to address the PSC issue in their own business.
• To support future research, we construct real-world datasets collected from a large-scale online food platform, which we will release publicly.

The important abbreviations in this paper are summarized in Table I.

II. RELATED WORKS

A. Multi-Task Learning

Since it is necessary to estimate multiple tasks (i.e. CTR and CVR) simultaneously in a recommendation system, it is critical to design a multi-task learning model. In [2], a deep recurrent neural network is employed to encode a text sequence into a latent vector, specifically gated recurrent units trained end-to-end on the collaborative filtering task. MMoE consists of multiple expert networks and gate networks that learn the correlations and differences among different tasks to fit multiple downstream tasks [25]. The two basic tasks in recommendation (i.e. rank and rate) are traced simultaneously with a multi-task framework in [17]. NMTR considers the underlying relationship among different types of behaviors and performs a joint optimization with a multi-task learning strategy, where the optimization on each behavior is treated as a task [11]. In [24], a multi-task recommendation model with matrix factorization is proposed which jointly learns to give rating estimations and recommendation explanations. MTRec is designed on a heterogeneous information network and is equipped with a Bayesian task weight learner that automatically balances the two tasks during optimization and provides good interpretability [21]. SoNeuMF is an extension of neural matrix factorization that can simultaneously model social-domain and item-domain interactions by sharing the user representation across the two tasks [9]. AMT-IRE is a multi-task framework which can adaptively extract the inner relations between group members and obtain consensus group preferences with the help of an attention mechanism [3]. PLE is used to solve the problem of negative transfer in multi-task learning. It can be considered a stacked structure of the basic modules of MMoE, introducing task-specific expert networks and common expert networks to decouple different tasks [31].

B. Conversion Rate Prediction

ESMM proposes a decision-making path of "exposure → click → purchase" and draws CVR based on the chain rule of conditional probability [26]. ESMM² extends the decision-making path to "exposure → click → in-shop action → purchase" with a similar idea to ESMM [39]. In [41], researchers find that CVR estimation in the basic ESMM is biased and address this problem with a causal approach (ESCM). ESCM² gives a more solid proof of the bias issue in ESMM and employs a solution similar to ESCM [34]. Here, we first report a novel PSC issue in ESMM² and provide three solutions to address it. There are also many studies that predict CVR from other perspectives. In [19], researchers model CVR at different hierarchical levels with separate binomial distributions and estimate the distribution parameters individually. ACN uses a Transformer to implement feature cross-over and employs capsule networks with a modified dynamic routing algorithm integrated with an attention mechanism to capture multiple interests from the user behavior sequence [20]. GCI counterfactually predicts the probability of each specific group that each unit belongs to for post-click conversion estimation [15]. AutoHERI leverages the interplay across multi-tasks' representation learning. It is designed to learn optimal connections between layer-wise representations of different tasks and can be easily extended to new scenarios with a one-shot search algorithm [37].

Unlike the studies mentioned above, we focus on the distinctiveness of Cart from a probability perspective. Our emphasis is on explaining the mathematical theory behind it and proposing simple yet effective solutions.

III. PRELIMINARY

Since this paper aims to improve ESMM², we first introduce ESMM and ESMM² in this Section.
Fig. 2: Illustration of three types of user decision graphs from exposure to purchase. (a) Three real decision-making graphs online. (b) Decision-making graph in ESMM. (c) Decision-making graph in ESMM².
A. Problem Formulation

Here, we state the Post-Click Conversion Rate estimation problem on the entire space with Cart. Let u denote a user browsing item feeds, and let the item set V = {v_1, v_2, ..., v_m} represent the items on the exposure space X for u. Define C as the click set that indicates which items in V are clicked by u, where each entry c_{u,v} ∈ {0, 1}, and O as the conversion (purchase) set that indicates which items in V are finally converted, where each entry o_{u,v} ∈ {0, 1}. C and O indicate the click space and the conversion space, respectively. In particular, let A denote the collection of items added to the cart, where each entry a_{u,v} ∈ {0, 1}, and let A be the Cart space. The notations used in this paper are summarized in Table II.

In practice, the online recommender server has to estimate CTR and CVR on the exposure space X. Consequently, we have to train the model in this manner to keep online-offline consistency (avoiding sample selection bias). Further, if O is fully observed, the ideal loss function is formulated as:

L := E_{u,v}[δ(o_{u,v}, ô_{u,v})],  (1)

where E means the expectation over events, ô_{u,v} is the estimated result, and δ is an error function such as the cross entropy loss:

δ(o_{u,v}, ô_{u,v}) := −o_{u,v} log ô_{u,v} − (1 − o_{u,v}) log(1 − ô_{u,v}).  (2)

B. Entire Space Multi-Task Model

On an online shopping platform, an item might experience "exposure → click → purchase" to convert. In light of this process, ESMM proposes a CVR estimation approach via the chain rule [26]:

P(CTCVR) = P(CTR) × P(CVR),  (3)

i.e., the CTCVR estimate is given by the product of the CTR and CVR predicted by two fully-connected towers. During the training process, ESMM minimizes the empirical risk of the CTR and CTCVR estimations over X:

L_CTR = E_{u,v}[δ(c_{u,v}, ĉ_{u,v})],
L_CTCVR = E_{u,v}[δ(c_{u,v} × o_{u,v}, ĉ_{u,v} × ô_{u,v})].  (4)

Thus, ESMM addresses the SSB problem via training on the exposure space. Additionally, since the size of clicked samples is much larger than that of conversion samples, modeling CVR on the exposure space allows for better utilization of the available data to tackle the DS problem.

C. Entire Space Multi-Task Model via Behavior Decomposition

This is an extension of ESMM, known as ESMM² [39]. The basic ideology of the two models is significantly similar. Compared with ESMM, the main improvement is that ESMM² involves intermediate behaviors between click and purchase, such as Cart and "adding to wish-list", and is more in line with the real decision-making process of users in online service. Actually, different actions cannot be triggered at the same time; to simplify the problem, ESMM² considers that all in-shop actions can be triggered in parallel. For the sake of description, we focus on Cart.

Similar to ESMM, ESMM² employs the chain rule to model CVR with behaviors:

P(CTCVR) = P(CTR) × P(CAR) × P(CVR),  (5)

where CAR is the Cart Adding Rate on C and CVR is the Conversion Rate on A. Further, the probability of CTCVR of item v can be defined via conditional probability in accordance with the "exposure → click → Cart → purchase" process:

P(o_{u,v} = 1 | c_{u,v} = 1) = P(o_{u,v} = 1 | c_{u,v} = 1, a_{u,v} = 1) P(a_{u,v} = 1 | c_{u,v} = 1).  (6)

Undoubtedly, an item cannot be added to the cart without being clicked, and it cannot be purchased without being added to the cart. Therefore, CTCVR can be modeled in a similar way with other in-shop behaviors. The decision-making path is illustrated in Figure 2.

However, does the chain rule always hold?

IV. DISCUSSION ON ESMM²

In this section, we first explain the PSC issue. We then provide a mathematical derivation for quantifying the gap
between the ground-truth and estimated values. Finally, we discuss the implications of the gap.

A. Probability Space Confusion Issue

ESMM² introduces in-shop actions to draw the fine-grained decision-making process. It considers CVR on the exposure space, and C, A, and O are sub-spaces defined on X. When a user opens the online recommendation feed, several items are exposed for the user to see. Then, the user may click one of the items to enter the detail page (in-shop page), add products to the cart, and make the final payment. ESMM² assumes that the actions in the path of "exposure → click → Cart → purchase" occur within the same visit, i.e. in the same sample space. However, this assumption does not always hold on real online platforms, as shown in Figure 3. The user may exit the detail page without making an immediate purchase after adding products to the cart. Most online shopping platforms record users' Cart information so that users can quickly find the items they prefer. Therefore, the user may log on to the online platform again after a period of time and enter the shopping cart to buy. As a result, the paths of "exposure → click → Cart" and "Cart → purchase" are in different visits, i.e. in different sample spaces. This raises a problem. Based on the assumption of ESMM², the entire space or exposure space is actually defined on the sample space of the user's current visit. Because the user's information will be updated according to their behavior before the next visit, the recommender can predict the recommendation lists only based on the current status of users and items. Therefore, the exposure space of each visit is actually independent for a user. Calculating probabilities defined on different sample spaces leads to the PSC issue.

Remark: To simplify the problem, we use the Session³ [36] to determine whether actions occur within the same visit.

³ For a web address, one session is equivalent to one visit.

B. Mathematical Derivation on PSC Issue

Here we provide a mathematical derivation to evaluate the gap between the ground-truth and the estimation of ESMM². First, we give the correct expectation of R in the case that the item has already been added to the cart before. Based on the discussion of the PSC issue, the entire path is not in the same sample space in this case, which we define as the Bad Case. Additionally, X1 denotes the former exposure space while X2 denotes the current one.

E_{X2}[R] = E_{X2}[R|A] · E_{X1}[A/C]                           (step ①)
          = E_{X2}[R|A] · ∫_{X1} (a/c) P(a, c) d(a, c)          (step ②)
          = E_{X2}[R|A] · ∫_{X1} a P(a) da · ∫_{X1} (1/c) P(c) dc
          = E_{X2}[R|A] · E_{X1}[A] · E_{X1}[1/C].  (7)

In the anticipation of ESMM², the in-shop action always satisfies the chain rule of probability. Thus the expectation of the estimator R̂ given by ESMM² in the Bad Case is:

E_{X2}[R̂] = E_{X2}[R̂/Â] · E_{X2}[Â/Ĉ]                          (step ③)
           = ∫_{X2} (r̂/â) P(r̂, â) d(r̂, â) · ∫_{X2} (â/ĉ) P(â, ĉ) d(â, ĉ)
           = E_{X2}[1/Ĉ] · E_{X2}[R̂] · E_{X2}[1/Â] · E_{X2}[Â]
           = E_{X2}[R̂] · E_{X2}[1/Â] · E_{X2}[Â] · E_{X2}[1/Ĉ].  (8)

There are some explanations of the above derivation:
① means that the user clicked on the item to enter the detail page and added products to the cart in the previous visit, while in the current visit the user purchases the products in the cart. Because the shopping cart generally exists independently on the online platform (i.e. not in the recommendation feed), the click behavior of the current visit is not considered, and the behavior in the current visit is expressed in terms of conditional probability here.
② holds under the assumption A ⊥ C.
③ holds because ESMM² considers that the entire path is in the same visit and satisfies the chain rule. It does not take into account the specificity of Cart.

Consider the gap between the ground-truth and the estimation in the Bad Case:

Gap = E_{X2}[R] − E_{X2}[R̂]
    = E_{X2}[R|A] · E_{X1}[A] · E_{X1}[1/C] − E_{X2}[R̂] · E_{X2}[1/Â] · E_{X2}[Â] · E_{X2}[1/Ĉ].  (9)

Here, we define E_{X2}[R|A] and E_{X2}[R̂] · E_{X2}[1/Â] as the Left Terms, and E_{X1}[A] · E_{X1}[1/C] and E_{X2}[Â] · E_{X2}[1/Ĉ] as the Right Terms.

If this gap can be eliminated, then there must be an upper bound, so we derive a loose upper bound to prove the solvability of this problem:

Gap = ∫_{X2} r P(r|a) dr · ∫_{X1} (a/c) P(a, c) d(a, c) − ∫_{X2} (r/a) P(r, a) d(r, a) · ∫_{X2} (a/c) P(a, c) d(a, c)
    ≤ ∫_{X2} r dr · ∫_{X1} (a/c) d(a, c).                        (step ④)  (10)

④ holds because the range of any probability is [0, 1]. It is evident that the integral domain is a finite interval and the integrand is bounded on the integral domain. There is always a finite upper bound on the gap, and therefore the problem is solvable.
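The Bad Case can also be illustrated numerically. Below is a small, purely illustrative toy population (the numbers are invented, not taken from the paper's data) showing that a chain-rule estimate fitted only on current-visit events, as an ESMM²-like model would do, misses purchases whose Cart action happened in a previous visit:

```python
from fractions import Fraction

# Toy log: one row per user.
# cart_v1: added the item to the cart in a PREVIOUS visit (space X1)
# click_v2 / cart_v2 / buy_v2: events in the CURRENT visit (space X2)
users = [
    # 4 users carted in visit 1; 3 of them buy in visit 2 straight from
    # the cart page, without clicking the item again in visit 2.
    dict(cart_v1=1, click_v2=0, cart_v2=0, buy_v2=1),
    dict(cart_v1=1, click_v2=0, cart_v2=0, buy_v2=1),
    dict(cart_v1=1, click_v2=0, cart_v2=0, buy_v2=1),
    dict(cart_v1=1, click_v2=0, cart_v2=0, buy_v2=0),
    # 6 users follow the same-visit path entirely within visit 2.
    dict(cart_v1=0, click_v2=1, cart_v2=1, buy_v2=1),
    dict(cart_v1=0, click_v2=1, cart_v2=1, buy_v2=0),
    dict(cart_v1=0, click_v2=1, cart_v2=0, buy_v2=0),
    dict(cart_v1=0, click_v2=1, cart_v2=0, buy_v2=0),
    dict(cart_v1=0, click_v2=1, cart_v2=0, buy_v2=0),
    dict(cart_v1=0, click_v2=1, cart_v2=0, buy_v2=0),
]

n = len(users)
# Ground truth: purchase rate actually realized in the current visit.
true_rate = Fraction(sum(u["buy_v2"] for u in users), n)

# Chain rule of Eq. (5) evaluated ONLY on current-visit (X2) events:
# P(click) * P(cart | click) * P(buy | cart).
clicks = sum(u["click_v2"] for u in users)
carts = sum(u["cart_v2"] for u in users)
buys_after_cart = sum(u["buy_v2"] for u in users if u["cart_v2"])
chain_rate = (Fraction(clicks, n)
              * Fraction(carts, clicks)
              * Fraction(buys_after_cart, carts))

print(true_rate)   # 2/5
print(chain_rate)  # 1/10
```

The cross-visit cart purchases are invisible to the same-visit chain rule, so the estimate (1/10) falls well below the realized rate (2/5); this gap is exactly what Eq. (9) quantifies.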
Fig. 3: Illustration of the Eleme app, Alibaba Group's takeaway platform that serves hundreds of millions of users. The black arrow represents the user's behavior. Page 1 (shown at the top left corner) is the recommendation page, page 2 is the detail page (in-shop page), and page 3 is the shopping cart page. A user may click on an item on the recommendation page to enter the in-shop page and add a product to the shopping cart. The user then exits the platform. After a while, the user logs onto the platform again and goes directly to the shopping cart page to purchase the product. Therefore, the decision-making path of this user does not occur within the same visit.
C. Discussion on the Difference

Comparing the estimation in (9) with the ground-truth, we can find two differences in terms of formula form.

• Left Terms. The Cart information and the purchase information are decoupled in the current space for estimation. However, ESMM² does not take into account that Cart does not necessarily occur in the current exposure space, which results in a lack of conversion-related Cart information in the model estimation in the case discussed above. Thus we have to inject Cart information into the purchase space, as shown by the blue line in Fig. 1c.
• Right Terms. The probability space in the estimation is incorrect. ESMM² takes Cart into consideration over X2, although it took place in the previous exposure space X1, as discussed above. Thus we have to calibrate the sample space, as shown by the orange line in Fig. 1c.

We have discussed the Bad Case for ESMM². What happens if the Bad Case does not occur for the ground-truth (i.e., the Good Case)?

E_{X2}[R] = E_{X2}[R/A] · E_{X2}[A/C]                            (step ⑤)
          = E_{X2}[R] · E_{X2}[1/A] · E_{X2}[A] · E_{X2}[1/C].  (11)

⑤ holds because the user's decision-making path of "exposure → click → Cart → purchase" is in the same visit in the Good Case, which satisfies the assumption of ESMM². Thus, there is no gap in the Good Case.

In summary, there is a significant gap between Cart and other actions (e.g. click, purchase). Cart may be related to two sample spaces, which leads to the PSC issue. This implies that in-shop actions do not necessarily satisfy the chain rule, so the strategy of conditional probability cannot be directly employed to manipulate events defined on different sample spaces.

V. PROPOSED METHOD

In this Section, we propose three approaches to address the PSC issue and improve the performance of Post-Click Conversion Rate estimation.⁴

⁴ The code will be released after publication.

A. Entire Space Multi-Task Model via Parameter Constraint

Shared Embedding Layer First, we build a shared embedding layer to transfer all the sparse ID features and discretized numerical features into dense vectors. The features mainly consist of user features (e.g. gender, age, consumption frequency), item features (e.g. brand, category, geographic location) and user-item cross features (e.g. the number of orders in a shop, age-brand). The entire model uses the same embedding, which can be expressed as follows:

f̄_i = W f_i,  (12)

where f_i is the i-th one-hot feature and W denotes the embedding matrix.

Constrained Twin Towers This structure focuses on the decision-making path of "Cart → purchase". Since the chain rule of conditional probability cannot describe this path well, we employ a pair of towers to learn the mapping automatically. Specifically, there are one CTCVR tower and one CTCAR
Fig. 4: Illustration of ESMC. Loss 1 is the CTR loss, loss 2 is the CTCAR loss, loss 3 is the CTCVR loss, and loss 4 is the parameter constraint loss.
(Click-Through Cart Adding Rate) tower. To address the gap in the Left Terms discussed in Section 4, we use the parameter space of the CTCAR tower to constrain that of the CTCVR tower. There are three reasons for this.

• In this way, the Cart information can be injected into the conversion, coupling the two kinds of information together to fill the gap in the Left Terms, and the proper function can be automatically fitted by the neural network.
• The purchase space is covered by the Cart space, which naturally gives a subordinate relationship.
• According to our observation, Cart is strongly related to purchase; that is, most of the items in the cart will eventually be bought.

Here, we employ the KL-divergence to evaluate the distance between the two parameter spaces:

D_KL(P(X) ‖ Q(X)) = E_{X∼P(X)} log (P(X) / Q(X)),  (13)

where P(X) and Q(X) express two probability distributions.

Sequential Composition Module Besides the CTCVR tower and the CTCAR tower, there is also a CTR tower for CTR estimation. Once we obtain the output of the towers, SCM composes the probabilities based on the following equations:

P(CTCAR) = P(CTR) × P(CAR),
P(CTCVR) = P(CTR) × P(CVR).  (14)

SCM is a nonparametric structure that expresses conditional probability like ESMM.

Sample Calibration To calibrate the probability space, we manipulate the samples directly. Since some Cart actions occurred during the user's previous visit, there may be no Cart action prior to the current purchase. We calibrate these samples to correlate them with the current purchase behavior according to the Session. In this way, the probability space of the training samples is explicitly unified, which can be represented as the one-to-one mapping function Q:

Q : D_{X1} → D_{X2},  (15)

where D_{X1} is the Cart sample set in the former visit and D_{X2} is the Cart sample set in the current visit.

After calibration, the ground-truth in the Bad Case becomes:

E_{X2}[R] = E_{X2}[R|A] · E_{X2}[A] · E_{X2}[1/C].  (16)
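As a rough sketch of the Sample Calibration step (our own illustration, not the authors' released code; the field names and the session-matching heuristic are assumptions), a cross-visit Cart event can be re-associated with the session in which the matching purchase happens, so that Cart and purchase live in the same sample space before training:

```python
# Toy training log: one row per (user, item) event, tagged with the session
# (visit) in which it happened. Field names are illustrative only.
rows = [
    {"user": "u1", "item": "i9", "action": "cart",     "session": "s1"},
    {"user": "u1", "item": "i9", "action": "purchase", "session": "s2"},
    {"user": "u2", "item": "i7", "action": "cart",     "session": "s3"},
    {"user": "u2", "item": "i7", "action": "purchase", "session": "s3"},
]

def calibrate(rows):
    """Map each Cart row into the session of its matching purchase (Eq. 15).

    This realizes Q : D_X1 -> D_X2 — a Cart event from a former visit is
    moved into the current visit's sample space, so "cart" and "purchase"
    for the same (user, item) pair share one probability space.
    """
    purchase_session = {
        (r["user"], r["item"]): r["session"]
        for r in rows if r["action"] == "purchase"
    }
    calibrated = []
    for r in rows:
        r = dict(r)  # do not mutate the input log
        key = (r["user"], r["item"])
        if r["action"] == "cart" and key in purchase_session:
            r["session"] = purchase_session[key]  # sample space calibration
        calibrated.append(r)
    return calibrated

out = calibrate(rows)
# u1's cart row is moved from session s1 into s2, where the purchase is.
print(out[0]["session"])  # s2
```

After this relabeling, cart and purchase for u1/i9 fall in the same session, matching the unified space assumed by Eq. (16).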
TABLE III: The basic statistics of the datasets. # denotes the number; M refers to million and K refers to thousand.

Dataset | #Users | #Items | #Clicks | #Purchases | Total Size | Sparsity of Click | Sparsity of Purchase
City 1  | 6M | 110K | 61M | 10M | 1,004M | 0.06096 | 0.01093
City 2  | 3M |  56K | 26M |  4M |   427M | 0.06096 | 0.01108
City 3  | 4M |  85K | 30M |  5M |   507M | 0.06076 | 0.01067
City 4  | 3M |  68K | 17M |  2M |   281M | 0.06069 | 0.00908
City 5  | 1M |  30K | 13M |  2M |   216M | 0.06083 | 0.01044
City 6  | 1M |  36K | 11M |  1M |   184M | 0.06103 | 0.01041
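Stepping back, the probability compositions of Eqs. (3), (5) and (14) and the entire-space losses of Eqs. (2) and (4) can be sketched for a single impression as follows (a framework-free toy illustration with made-up tower outputs, not the authors' implementation):

```python
import math

def bce(y, p, eps=1e-7):
    """Cross entropy loss of Eq. (2) for one label/probability pair."""
    p = min(max(p, eps), 1.0 - eps)
    return -y * math.log(p) - (1.0 - y) * math.log(1.0 - p)

# Toy tower outputs for one exposed impression (values are illustrative).
ctr_hat = 0.10   # P(click | exposure), CTR tower
car_hat = 0.30   # P(cart | click), CTCAR-side tower
cvr_hat = 0.80   # P(purchase | cart), CTCVR-side tower

# Eq. (14): SCM composes the tower outputs over the entire exposure space.
ctcar_hat = ctr_hat * car_hat              # Click-Through Cart Adding Rate
ctcvr_hat = ctr_hat * car_hat * cvr_hat    # Click-Through Conversion Rate

# Ground-truth labels observed for this impression.
click, cart, purchase = 1, 1, 0

# Eq. (4): both losses are defined on exposure samples, so no SSB arises.
loss_ctr = bce(click, ctr_hat)
loss_ctcvr = bce(click * purchase, ctcvr_hat)

print(round(ctcvr_hat, 6))  # 0.024
print(round(loss_ctr, 4))   # 2.3026
```

Because every factor is predicted for exposed impressions, the composed CTCAR/CTCVR targets stay in the exposure space even though CAR and CVR are conditional rates.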
Discussion on ESMC Finally, the expectation of the estimator R̂ for ESMC is:

E_{X2}[R̂] = f_{X2}(R̂, Â) · E_{X2}[Â] · E_{X2}[1/Ĉ],  (17)

where f_{X2} is an unknown function of R̂ and Â defined on X2. It can be considered a neural network mapping. The conditional expectation function E_{X2}[R|A] can be automatically fitted with neural networks.

This formula is consistent with the ground-truth one, so ESMC can perform the parameter estimation better in this manner. For the Good Case in Section 4, there is no gap either, which means that ESMC works well there too. Hence, the final training objective to be minimized to obtain the parameter set Θ is as follows:

L(Θ) = ω1 · L_CTR + ω2 · L_CTCVR + ω3 · L_CTCAR + ω4 · D_KL(θ_CTCAR ‖ θ_CTCVR),  (18)

where ω1, ω2, ω3, ω4 are the weights of the corresponding terms, and θ_CTCAR and θ_CTCVR are the parameters of the CTCAR and CTCVR towers, respectively. Besides, the three loss functions L are the cross entropy loss shown in (2) with the proper samples of click, Cart and purchase. Fig. 4 shows the structure of ESMC.

B. Entire Space Multi-Task Model with Siamese Network

In our practice, we have observed that the conversion rate under the Cart space is very high (more than 80%). If we push the parameter constraint of the twin towers to infinity, the model is approximately equivalent to a Siamese Network [4] with shared parameters. However, we also find that the model performance does not change stably as the constraint coefficient increases, which may be due to the constraint conditions affecting the search space of the main task. Therefore, we detach the parameter constraint and directly use an exact Siamese Network to model CTCVR; this is ESMS. The constraint in (18) can thus be removed and the model only focuses on the estimation task. The training objective is expressed as follows:

L(Θ) = ω1 · L_CTR + ω2 · L_CTCVR + ω3 · L_CTCAR.  (19)

C. Entire Space Multi-Task Model in Global Domain

We only employ the Cart samples from the recommendation domain in ESMC and ESMS. In fact, items in the shopping cart do not only come from the recommendation domain, but also from the search domain. After a user searches for an item and adds it to the cart, it may still be exposed to the user later by the recommendation system. Therefore, in ESMG, we consider Cart samples from the global domain. The structure of the model remains unchanged in ESMG. ESMC² and ESMS² denote the variants trained with global-domain Cart samples; they are included in the later comparative experiments. The training objective is formulated as follows:

L(Θ) = ω1 · L_CTR + ω2 · L_CTCVR + ω3 · L_CTCAR|rec + ω4 · L_CTCAR|global + ω5 · D_KL(θ_CTCAR ‖ θ_CTCVR),  (20)

where L_CTCAR|global and L_CTCAR|rec mean the loss functions on Cart over the global domain and the recommendation domain, respectively. In particular, ω5 = 0 for ESMS².

D. Difference among Three Approaches

ESMC v.s. ESMS The difference lies in the way they handle the path of "Cart → purchase". In terms of model structure, ESMS is a special case of ESMC. In terms of model performance, ESMS is suitable for scenarios where the Cart space and the purchase space are strongly correlated, while ESMC is more suitable for scenarios where the correlation between the Cart space and the purchase space is not as strong. In the training stage, it takes a lot of time to tune the constraint coefficient of ESMC. Besides, due to the dependence between the twin towers, it is difficult to train the two towers in parallel, which increases the overhead. In the inference phase, as the twin networks of ESMS share parameters, only the parameters of one tower need to be stored, which significantly reduces the number of parameters and the memory occupied by the model. This makes deployment of the model to online platforms more efficient.

ESMC&ESMS v.s. ESMG (ESMC²&ESMS²) The difference is whether the Cart samples come from the global domain or the recommendation domain. Only using samples from the recommendation domain fully conforms to the basic assumptions of the proposed methodology in Section 4, while the use of global Cart samples may relax those basic assumptions and affect the performance of the model. At the same time, however, considering Cart samples from the global domain can supplement information that helps improve the generalization of the algorithm. The balance between information gain and constraint relaxation is related to the correlation between the search domain and the recommendation domain.

In conclusion, we propose three approaches (four models) to address the PSC issue. Considering the online performance and the cost of model deployment, we finally choose ESMS for deployment in the online environment. In the following, our proposed models are collectively referred to as the ESMC-family.
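The composition of the ESMC training objective in Eq. (18) can be sketched as follows. This is a toy, framework-free illustration with made-up numbers: in the paper the KL term acts on the parameter spaces of the two towers, which we stand in for here with softmax-normalized toy parameter vectors.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) of Eq. (13) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy per-task losses (stand-ins for the cross-entropy terms of Eq. (2)).
loss_ctr, loss_ctcvr, loss_ctcar = 0.52, 0.31, 0.40

# Toy tower parameters; the constraint compares their normalized spaces.
theta_ctcar = softmax([0.2, 0.5, 0.3])
theta_ctcvr = softmax([0.1, 0.6, 0.3])

# Eq. (18): weighted multi-task objective plus the parameter constraint.
w1, w2, w3, w4 = 1.0, 1.0, 1.0, 0.1
total = (w1 * loss_ctr + w2 * loss_ctcvr + w3 * loss_ctcar
         + w4 * kl_divergence(theta_ctcar, theta_ctcvr))

# Setting w4 = 0 recovers the ESMS objective of Eq. (19).
esms_total = w1 * loss_ctr + w2 * loss_ctcvr + w3 * loss_ctcar
print(round(esms_total, 2))  # 1.23
```

The ESMG objective of Eq. (20) has the same shape with the CTCAR loss split into a recommendation-domain term and a global-domain term.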
TABLE IV: Comparison with SOTA multi-task learning baselines. The best results are shown in bold and the second best in italic. Improvement is calculated as the relative increase of our best result over the best result among the baselines.

              City 1                       City 2                       City 3
Model         CTR-AUC CTCVR-AUC CVR-AUC   CTR-AUC CTCVR-AUC CVR-AUC   CTR-AUC CTCVR-AUC CVR-AUC
Shared Bottom 0.73025 0.82245   0.71242   0.72674 0.81406   0.70152   0.73103 0.82364   0.70824
ESMM          0.72894 0.82395   0.71949   0.72546 0.81649   0.71002   0.73010 0.82532   0.71606
MMOE          0.72963 0.82254   0.71327   0.72592 0.81450   0.70239   0.73065 0.82366   0.70894
ESMM²         0.72995 0.83306   0.76824   0.72681 0.82692   0.75955   0.73100 0.83343   0.76117
ESMS          0.73093 0.83563   0.77048   0.72697 0.82849   0.76069   0.73140 0.83599   0.76385
ESMC          0.73111 0.83594   0.76862   0.72774 0.82936   0.75987   0.73176 0.83637   0.76339
Improvement   0.118%  0.346%    0.292%    0.128%  0.295%    0.150%    0.100%  0.353%    0.352%

              City 4                       City 5                       City 6
Model         CTR-AUC CTCVR-AUC CVR-AUC   CTR-AUC CTCVR-AUC CVR-AUC   CTR-AUC CTCVR-AUC CVR-AUC
Shared Bottom 0.72465 0.81497   0.69693   0.72902 0.82326   0.71015   0.73349 0.83099   0.71435
ESMM          0.72367 0.81704   0.70419   0.72777 0.82608   0.71966   0.73262 0.83156   0.72050
MMOE          0.72419 0.81481   0.69664   0.72836 0.82460   0.71322   0.73349 0.83076   0.71518
ESMM²         0.72515 0.82627   0.74893   0.72866 0.83532   0.76531   0.73336 0.84168   0.76769
ESMS          0.72484 0.82782   0.75016   0.72959 0.83716   0.76552   0.73433 0.84305   0.76847
ESMC          0.72806 0.84098   0.75788   0.73009 0.83737   0.76762   0.73493 0.84334   0.76871
Improvement   0.401%  1.780%    1.195%    0.147%  0.245%    0.302%    0.196%  0.197%    0.133%
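The "Improvement" row can be reproduced directly from the table: it is the relative gain of our best model over the best baseline. For example, for CTR-AUC on City 1, the best baseline is Shared Bottom at 0.73025 and our best is ESMC at 0.73111:

```python
def relative_improvement(ours_best, baseline_best):
    """Relative increase (in percent) of our best result over the best baseline."""
    return (ours_best - baseline_best) / baseline_best * 100.0

# City 1, CTR-AUC column of Table IV.
imp = relative_improvement(0.73111, 0.73025)
print(f"{imp:.3f}%")  # 0.118%
```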
(d) CTR-AUC on the City 4 dataset. (e) CTCVR-AUC on the City 4 dataset. (f) CVR-AUC on the City 4 dataset.
Fig. 5: Sensitivity of the coefficient of the parameter constraint in ESMC. The horizontal axis represents the coefficient, and the vertical axis represents AUC.

(a) CTR-AUC on the City 1 dataset. (b) CTCVR-AUC on the City 1 dataset. (c) CVR-AUC on the City 1 dataset. (d) CTR-AUC on the City 2 dataset. (e) CTCVR-AUC on the City 2 dataset. (f) CVR-AUC on the City 2 dataset.
Fig. 6: Sensitivity of the weight of the global domain loss in ESMC². The horizontal axis represents the weight, and the vertical axis represents AUC.
model in the ESMC-family. ESMC with Sample Calibration detached is named ESMC-. The results show that Sample Calibration can improve the performance significantly: when it is removed, CTCVR-AUC falls by 0.11% on average, which implies that Sample Calibration is a simple and efficient strategy to maintain the consistency of sample selection and probability space.

E. RQ5: Case Study

In Section 4, we discussed the Bad Case and the Good Case of ESMM². The ESMC-family is tailored to the PSC issue to handle the Bad Case. Here, we run experiments to verify that the model does solve the PSC issue and also improves the performance on the Good Case. We select the conversion samples (purchase label = 1) and divide them into two groups: 1) Bad Case: Cart and purchase are not in the same exposure space; and 2) Good Case: Cart and purchase are in the same exposure space. Because all samples' conversion labels are 1, which means that all samples' click labels are also 1, considering CTR-AUC does not make sense. Besides, CTCVR-AUC is equal to CVR-AUC here. Therefore, we only consider CVR-AUC in this experiment. Table VI shows that ESMS clearly outperforms ESMM² on both the Good Case and the Bad Case, which demonstrates that the ESMC-family can address the PSC issue.

Notably, on average, ESMS improves CVR-AUC by over 6% on the Bad Case and nearly 5% on the Good Case. The reason why the model can achieve such a huge improvement on
TABLE VI: Comparison results on the Bad Case and the Good Case in terms of CVR-AUC. Improvement is calculated as the relative increase of ESMS over ESMM².

Bad Case
            City 1   City 2   City 3   City 4   City 5   City 6
ESMM²       0.65848  0.66022  0.66176  0.67414  0.66887  0.67514
ESMS        0.70224  0.70220  0.70426  0.71108  0.71114  0.71213
Improvement 6.645%   6.358%   6.422%   5.479%   6.319%   5.478%

Good Case
            City 1   City 2   City 3   City 4   City 5   City 6
ESMM²       0.68472  0.69251  0.68935  0.72147  0.69270  0.70194
ESMS        0.72341  0.72469  0.72448  0.74979  0.72545  0.72902
Improvement 5.650%   4.646%   5.096%   3.925%   4.727%   3.857%