Deep Learning Recommendation Model for Personalization and Recommendation Systems

Injung Hwang
속성 1
Facebook paper 2019

Brief Introduction

Two primary perspective
Recommendation systems
Content filtering → collaborative filtering → Neighborhood method
Predictive analytics
Statistical models to classify or predict the probability of events
From simple regression to models with deep networks
With embedding,which transform one-hot vectors into dense representations
Propose a personalization model with such two perspectives
Uses embeddings to process sparse features into dense features
Interacts these features using the statistical techinques (Factorizaion Machine)
Finds the event probability with MLP

Background Concept


An embedding is a mapping of a discrete variable to a vector of continuous numbers
Embeddings map each category to a dense representation in an abstract space
One-hot or multi-hot vector to dense representation
Dot-product of two shows similiarity
Verification of embeddings

Matrix Factorization

MF is a simple embedding model — google developer page
This problem is different from low-rank approximation, which can be solved by SVD
Not all entries of matrix R are known

Factorization Machine

Regression method which can exploit several features as fields like polynomial regression

Polynomial regression


FMs are notably distinct from SVMs with polynomial kernels
Factorize the second-order interaction matrix into its latent factors (or embedding vectors) as in matrix factorization
Significantly reduces the complexity of the second-order interactions by only capturing interactions between pairs of distinct embedding vectors
Yielding linear computational complexity

Matrix Factorization & Multilayer Perceptrons

Neural Collaborative Filtering (NCF)
Matrix Factorization has the low-dimensional latent space
Simple and fixed inner product → Limitation to represent complex form
Find similarity between two using MLP
MLP is non-lineaer form with activation function unlike linear dot-product
As follows, MLP way can represent much more complex relationship between two
Uses MLP rather than dot product to compute interactions

Model Design and Architecture

Bottom-up way description
Users and products: continuous feature & categorical features
Categorical feature will be represented by an embedding vector (Matrix Factorization)
Continuous feature will be transformed by an MLP into a dense representation (by NCF)
Same length as the embedding vectors
Compute second-order interaction of different features explicitly
Dot product between all pairs of embedding vectors and processed dense features
These dot products are concatenated with the original processed dense features
Then processed with another MLP

Comparison with Prior Models

DLRM specifically interacts embeddings in a structured way that mimics factorization
Significantly reduce the dimensionality of the model
Only considering cross-term produced by the dot-product between paris of embeddings in the final MLP
A key difference is in how these networks treat embedded feature vectors and thier cross-terms
DLRM interprets each feature vector as a single unit representing a single category
Others produce cross-terms via the dot product
Not only between different features but also the same features → higher dimensionality


Embeddings have several tables each requiring in excess of multiple GBs of memory
Due to the size of the embeddings, it's hard to use data parallelism (replica)
The MLP parameters are smaller in memory but produce sizeable amounts of computation
Good for data-parallelism
Parallelized DLRM use a combination of model parallelism for the embedding & data parallelism for the MLPs
This type of combination is a unique requirement of DLRM
Design a custom implementation (not supported by Caffe2 or PyTorch)
Top MLP and the interaction operator require access to part of the mini-batch from the bottom MLP and all of the embeddings
Model parallemism: distribute the embeddings across devices
For the data parallel MLPs, the parameter updates in the backward pass are accumulated with allreduce and applied to the replicated parameters on each device asynchronously


Three types of data sets: random, synthetic and public data sets
First two are useful in experimenting from the systems perspective
Last is useful to measure the accuracy of the model


Inputs of DLRM are continuous and categorical features
Continuous feature can be modeled by generating random numbers using either a uniform or normal distributions
To generate categorical features, need to determine how many non-zero elements in a given multi-hot vector
This number to be either fixed or random within a range [1,k]
Then generate the corresponding number of integer indices within a range [1,m]


Express categorical features through distributions


Criteo AI Labs Ad Kaggle and Terabyte data sets
Open-soured data sets consisting of click logs for ad CTR prediction
13 contiuous and 26 categorical features
Criteo Ad Kaggle data set contains approximately 45 million samples over 7 days
First 6 days: Traninig
7th day: validation
Criteo Ad Terbyte data set is sampled over 24 days
First 23 days: Training
24th day: validation



Implemented in PyTorch and Caffe2
Uses fp32 floating point and int32(Caffe2)/int64(PyTorch)
On the Big Basin platform with Dual Socket Intel Xeon 6138 CPU@ 2.00GHz 8 Nvidia Tesla V100 16GB GPUs

Model Accuracy

Evaluate DLRM against a Deep and Cross Network (DCN)
DCN is one of the few models with comprehensive result on the same data set
The bottom MLP of DLRM consists of 3 hidden layers with 512, 256 and 64 nodes
The top MLP consists of 2 hidden layers with 512 and 256 nodes
DCN consists of 6 cross layers and a deep network with 512 and 256 nodes
Embedding dimension: 16 (both models with about 540M parameters)
With SGC and Adagrad optimizers
No regularization
Without extensive tuning of model hyperparameters

Model Performance on a Single Socket/Device

A sample model with 8 categorical features and 512 continuous features
Each categorical feature is processed through an embedding table with 1M vectors, with vector dimension 64
Bottom MLP 2 layers and Top MLP 4 layers
This model impl. in Caffe2 runs in around 256 seconds on the CPU and 62 seconds on the GPU
Majority of time is spent performing embedding lookups and fully connected layers
FC layers take a significant portion on CPU, but rare on GPU


Proposed and open-sourced a novel deep learning-based recommendation model that exploits categorical data