Author: 남성욱
Presentation date: 2020.11.23
Criteo advertising
The users’ behavior consists of two parts: the sequence of items that they viewed without intervention (the organic part) and the sequences of items recommended to them and their outcome (the bandit part).
the organic signal
is not always relevant to the recommendation task
is typically strong and covers most items
the bandit signal
gives direct feedback on recommendation performance,
but the signal quality is very uneven
In order to leverage the organic signal to efficiently learn the bandit signal in a Bayesian model we identify three fundamental types of distances, namely action-history, action-action and history-history distances
We implement a scalable approximation of the full model using variational auto-encoders and the local re-parameterization trick
We show using extensive simulation studies that our method out-performs or matches the value of both state-of-the-art organic-based recommendation algorithms, and of bandit-based methods (both value and policy-based)
the organic part
the user’s interest is described by a K dimensional variable which can be interpreted as the user’s interest in K topics.
The organic embedding matrix Ψ is P × K and represents information about how items correlate organically within a user's session.
(We will show that using variational auto encoders with the re-parameterization trick is an effective way to train the organic model.)
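As a rough illustration of the organic part, the sketch below simulates one user's organic session. The notes do not spell out the generative form, so the standard normal prior on the user variable, the softmax (categorical) likelihood over the P items, and the use of ρ as a per-item bias are all assumptions made for this example. The function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def simulate_organic_session(psi, rho, session_length, rng):
    """Draw one user's organic session under the assumed generative form."""
    P, K = psi.shape
    omega = rng.standard_normal(K)        # assumed prior: interest in K topics ~ N(0, I)
    probs = softmax(psi @ omega + rho)    # assumed likelihood: view probabilities over P items
    views = rng.choice(P, size=session_length, p=probs)
    return omega, views

rng = np.random.default_rng(0)
P, K = 100, 5
psi = rng.standard_normal((P, K))   # illustrative organic embedding matrix (P x K)
rho = np.zeros(P)                   # illustrative per-item bias (assumption)
omega, views = simulate_organic_session(psi, rho, session_length=10, rng=rng)
```

Items whose rows of Ψ point in a similar direction receive high probability for the same users, which is one way to read "how items correlate in a user's session".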
the bandit part
Once the organic session has been generated, a recommendation (action) is made to the user, and a reward (click) is observed.
The bandit embedding matrix β is P × K and represents information about how to personalize recommendations to a user, given that user's latent representation.
The approach developed in this paper takes the organic model, estimates Ψ by maximum likelihood and ω by its posterior mean (denoted ω̂), and then treats Ψ and ω̂ as observed in the bandit model. The graphical model is shown in Figure 2. Within this probabilistic model, full Bayesian inference is developed for the bandit parameters, including β and κ.
We refer to the organic-only component of the model as the BLO (Bayesian Latent Organic) model (we apply maximum likelihood to Ψ and ρ, and integrate out ω). The full model is referred to as BLOB (Bayesian Latent Organic Bandit model).
three fundamental types of distances, namely action-history, action-action and history-history distances
Action-history (auto-completion assumption): a recommendation is expected to be similar to the items in the user's history.
The mean of the matrix normal prior on β, namely Ψ, embodies this assumption.
Action-action: if two actions (recommendations) are similar, then we expect the responses to these actions from the same (or similar) users to be correlated.
This distance is encoded with the first (low-rank) covariance ΨΨ^T in the matrix normal prior on β.
History-history: if two users are similar, then we expect their responses to the same (or similar) action to be correlated.
This distance is encoded with the second (low-rank) covariance Ψ^T Ψ in the matrix normal prior on β.
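To make the role of the three distances concrete, here is a minimal sketch of drawing β from a matrix normal prior with mean Ψ, row covariance ΨΨ^T, and column covariance Ψ^T Ψ. It assumes unscaled covariances; the paper's prior may include scale hyperparameters that these notes do not record. The function name is hypothetical.

```python
import numpy as np

def sample_beta_prior(psi, rng):
    """Sample beta ~ MN(mean=psi, row_cov=psi @ psi.T, col_cov=psi.T @ psi).

    Uses the fact that A @ Z @ C, with Z i.i.d. standard normal, is matrix
    normal with row covariance A @ A.T and column covariance C.T @ C.
    Here A = C = psi (unscaled covariances are an assumption).
    """
    P, K = psi.shape
    z = rng.standard_normal((K, P))   # i.i.d. standard normal noise
    return psi + psi @ z @ psi        # mean term + correlated perturbation

rng = np.random.default_rng(0)
psi = rng.standard_normal((100, 5))   # illustrative organic embeddings (P x K)
beta = sample_beta_prior(psi, rng)
```

Because the row covariance has rank at most K, every draw of β differs from Ψ only within the column space of Ψ: the prior lets the bandit embeddings move away from the organic ones, but only along directions the organic model already uses.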
(organic part only)
process (the usual VAE procedure)
we want to maximize the log likelihood
This is hard to do directly → find a lower bound and maximize it instead
Maximizing the bound is also hard → use the re-parameterization trick and obtain the optimal parameters by backpropagation
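The steps above correspond to the standard VAE evidence lower bound, written here for a session x and the user variable ω. The variational distribution q and its mean μ(x) and standard deviation σ(x) are notation assumed for this sketch, not symbols taken from the notes.

```latex
\begin{align}
\log p(x) &\ge \mathbb{E}_{q(\omega \mid x)}\big[\log p(x \mid \omega)\big]
            - \mathrm{KL}\big(q(\omega \mid x)\,\|\,p(\omega)\big) \\
\omega &= \mu(x) + \sigma(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I_K)
\end{align}
```

Writing ω as a deterministic function of μ, σ, and the noise ε moves the randomness out of the parameters, so a Monte Carlo estimate of the bound can be differentiated with respect to them by backpropagation.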
Standard recommendation evaluation metrics.
Pop : recommend the most popular items to everyone.
Bouch/AE : A linear variational auto-encoder using the Bouchard bound
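It appears that the problem can be reformulated so that it can be solved without re-parameterization; the derivation is given in the appendix, where a new lower bound is defined.
RT/AE : an auto-encoder trained with the re-parameterization trick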
The results are shown in Table 2. BLO is much better than the baselines at standard organic recommender systems metrics.
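Compared with the baseline models, the BLO model proposed in this paper performs much better.
However, it is unfortunate that a standard dataset was not used.
GRU4Rec is said to have been used as the RNN model, but the paper does not state why it performs so poorly here. Is it due to some characteristic of the dataset? Or is the problem that Recall@5 was used rather than Recall@20?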
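For reference on the Recall@5 versus Recall@20 question, the sketch below uses one common definition of Recall@K (the fraction of held-out items found in the top K of a ranking). The notes do not record the exact definition behind Table 2, so this definition, the function name, and the toy data are assumptions.

```python
def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of relevant items that appear in the top-k of the ranking.

    Assumes ranked_items contains no duplicates.
    """
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)

ranking = [7, 3, 9, 1, 4, 8, 2]    # toy ranking, best first
held_out = [4, 8]                  # toy held-out items
r5 = recall_at_k(ranking, held_out, k=5)     # only item 4 is in the top 5
r20 = recall_at_k(ranking, held_out, k=20)   # both items are in the top 20
```

Under this definition Recall@K can only stay the same or increase as K grows, so scores reported at K=5 are not directly comparable with Recall@20 figures reported elsewhere.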
click-through rate
Logistic Regression :
Contextual bandit : policy based method
BLO : The organic portion of the model developed here
BLOB-NQ : (organic and bandit combined) the complete model developed here,
with the normal variational approximation NQ
BLOB-MNQ : (organic and bandit combined) the complete model developed here,
with the matrix normal variational approximation MNQ.
The first experiment uses a catalog size of P=100 and 1000 user sessions; the simulated A/B test is run over 4000 users, and the logging policy is session popularity with epsilon-greedy exploration (epsilon=0.3).
Over 50 repetitions, the organic models can be seen to perform even worse than recommending at random.
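The logging policy used in this experiment can be sketched as follows. "Session popularity" is read here as recommending the item viewed most often in the user's own session, which is an assumption about the setup rather than something the notes state; the function name and toy session are hypothetical.

```python
import numpy as np

def log_action(session_views, num_items, epsilon, rng):
    """Epsilon-greedy logging policy over an assumed session-popularity rule."""
    if rng.random() < epsilon:
        # Explore: recommend an item chosen uniformly at random.
        return int(rng.integers(num_items))
    # Exploit: recommend the item viewed most often in this session.
    counts = np.bincount(session_views, minlength=num_items)
    return int(np.argmax(counts))

rng = np.random.default_rng(0)
session = np.array([3, 3, 17, 42, 3])   # toy organic session; item 3 is most viewed
actions = [log_action(session, num_items=100, epsilon=0.3, rng=rng)
           for _ in range(1000)]
greedy_share = actions.count(3) / len(actions)   # expected to be near 0.7
```

With epsilon=0.3, roughly 30% of the logged actions are uniform random, which is what gives the bandit signal coverage of items the greedy rule would never recommend.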
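The work is significant in that the model considers the organic signal and the bandit signal together.
It achieves higher performance than existing models on problems that take the bandit signal into account.
Questions remain about how appropriate the benchmark dataset used in this paper is.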