Notes on Facebook's DLRM paper (2019), "Deep Learning Recommendation Model for Personalization and Recommendation Systems"

## Brief Introduction

Two primary perspectives:

- Recommendation systems
  - Content filtering → collaborative filtering → neighborhood methods
- Predictive analytics
  - Statistical models to classify or predict the probability of events
    - From simple regression to models with deep networks
  - With embeddings, which transform one-hot vectors into dense representations

The paper proposes a personalization model that combines these two perspectives:

- Uses embeddings to process sparse features into dense features
- Interacts these features using statistical techniques (factorization machines)
- Finds the event probability with an MLP

## Background Concepts

### Embedding

- An embedding is a mapping of a discrete variable to a vector of continuous numbers
  - Embeddings map each category to a dense representation in an abstract space
  - i.e., from a one-hot or multi-hot vector to a dense representation
- The dot product of two embeddings shows their similarity
- Embeddings can be verified, e.g., by checking that similar categories map to nearby vectors (see the sketch below)
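
A minimal sketch of an embedding lookup in PyTorch (table sizes and names are hypothetical; `nn.EmbeddingBag` handles multi-hot inputs):

```python
import torch
import torch.nn as nn

num_categories, dim = 1000, 16          # hypothetical table size
emb = nn.EmbeddingBag(num_categories, dim, mode="sum")

# Multi-hot lookup: indices of the non-zero entries, plus per-sample offsets.
indices = torch.tensor([3, 17, 42, 7])  # two samples: {3, 17, 42} and {7}
offsets = torch.tensor([0, 3])
dense = emb(indices, offsets)           # shape: (2, 16)

# Dot product of the two dense vectors as a similarity score.
sim = (dense[0] * dense[1]).sum()
```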

### Matrix Factorization

"MF is a simple embedding model" (Google developers page)

- This problem is different from low-rank approximation, which can be solved by SVD
  - Not all entries of the matrix R are known
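
Concretely, the MF objective runs only over the observed entries (standard formulation; notation assumed):

$$
\min_{U,V}\ \sum_{(i,j)\,\in\,\mathrm{obs}} \left( r_{ij} - u_i^\top v_j \right)^2
$$

where $u_i$ and $v_j$ are the user and item embedding vectors whose dot product approximates the known entry $r_{ij}$ of R.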

### Factorization Machine

- A regression method which can exploit several features as fields, like polynomial regression

#### Polynomial regression
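
For reference, a second-order polynomial regression over $m$ features (standard formulation; notation mine):

$$
\hat{y}(x) = b + \sum_{i=1}^{m} w_i x_i + \sum_{i=1}^{m}\sum_{j=i+1}^{m} w_{ij}\, x_i x_j
$$

The pairwise weight matrix $w_{ij}$ costs $O(m^2)$ parameters, which FMs factorize away (next subsection).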

#### FMs

- FMs are notably distinct from SVMs with polynomial kernels
  - They factorize the second-order interaction matrix into its latent factors (or embedding vectors), as in matrix factorization
  - This significantly reduces the complexity of the second-order interactions by only capturing interactions between pairs of distinct embedding vectors
    - Yielding linear computational complexity (sketched below)
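
FMs replace each pairwise weight $w_{ij}$ with the dot product $\langle v_i, v_j \rangle$ of $k$-dimensional latent vectors. A minimal sketch of the resulting $O(km)$ evaluation of the second-order term, using the standard sum-of-squares identity (function and variable names are mine):

```python
import torch

def fm_second_order(x, V):
    """Second-order FM term sum_{i<j} <v_i, v_j> x_i x_j in O(k*m).

    x: (batch, m) input features; V: (m, k) latent factor matrix.
    Identity: 0.5 * sum_f [ (sum_i v_{if} x_i)^2 - sum_i (v_{if} x_i)^2 ]
    """
    xv = x @ V                      # (batch, k): sum_i v_{if} x_i for each factor f
    x2v2 = (x ** 2) @ (V ** 2)      # (batch, k): sum_i (v_{if} x_i)^2
    return 0.5 * (xv ** 2 - x2v2).sum(dim=1)
```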

### Matrix Factorization & Multilayer Perceptrons

Neural Collaborative Filtering (NCF)

- Matrix factorization operates in a low-dimensional latent space
  - Its simple, fixed inner product limits how complex a relationship it can represent
- NCF instead finds the similarity between the two using an MLP
  - Unlike the linear dot product, the MLP is non-linear thanks to its activation functions
  - Consequently, the MLP can represent a much more complex relationship between the two
- In short: it uses an MLP rather than a dot product to compute interactions (sketched below)
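
A minimal NCF-style scorer, assuming the user and item embeddings are concatenated and fed to an MLP (layer sizes and names are hypothetical):

```python
import torch
import torch.nn as nn

class NCFScorer(nn.Module):
    def __init__(self, num_users, num_items, dim=16):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)
        self.item_emb = nn.Embedding(num_items, dim)
        # The MLP replaces the fixed dot product of plain matrix factorization.
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, user_ids, item_ids):
        u = self.user_emb(user_ids)
        v = self.item_emb(item_ids)
        return self.mlp(torch.cat([u, v], dim=-1)).squeeze(-1)
```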

## Model Design and Architecture

The model is described bottom-up:

- Users and products are described by continuous and categorical features
  - Each categorical feature is represented by an embedding vector (as in matrix factorization)
  - The continuous features are transformed by an MLP into a dense representation (as in NCF)
    - Of the same length as the embedding vectors
- Second-order interactions of different features are computed explicitly
  - Dot products between all pairs of embedding vectors and processed dense features
  - These dot products are concatenated with the original processed dense features
- The result is then processed with another MLP (see the interaction sketch below)
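
A minimal sketch of the interaction step (names and shapes are mine; the open-source DLRM implements this with a batched matrix multiply):

```python
import torch

def interact(dense, emb_list):
    """dense: (batch, d) bottom-MLP output; emb_list: list of (batch, d) embeddings."""
    feats = torch.stack([dense] + emb_list, dim=1)  # (batch, n, d)
    Z = torch.bmm(feats, feats.transpose(1, 2))     # (batch, n, n) pairwise dot products
    n = Z.size(1)
    i, j = torch.tril_indices(n, n, offset=-1)      # strictly lower triangle: distinct pairs only
    pairs = Z[:, i, j]                              # (batch, n*(n-1)/2)
    return torch.cat([dense, pairs], dim=1)         # concatenated input for the top MLP
```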

### Comparison with Prior Models

- DLRM specifically interacts embeddings in a structured way that mimics factorization machines
  - This significantly reduces the dimensionality of the model
  - Only the cross-terms produced by the dot product between pairs of embeddings are fed to the final MLP
- A key difference is in how these networks treat embedded feature vectors and their cross-terms
  - DLRM interprets each feature vector as a single unit representing a single category
  - Other networks produce cross-terms via the dot product
    - Not only between different features but also within the same feature → higher dimensionality
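
A rough count makes the dimensionality argument concrete: with $n$ embedded feature vectors of dimension $d$, DLRM's interaction keeps only the dot products of distinct pairs,

$$
\binom{n}{2} = \frac{n(n-1)}{2} \quad \text{cross-terms, independent of } d,
$$

whereas crossing at the level of individual vector elements operates on a concatenation of length $nd$ and scales on the order of $(nd)^2$.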

## Parallelism

- The embeddings comprise several tables, each requiring in excess of multiple GBs of memory
  - Due to the size of the embeddings, it is hard to use data parallelism (which replicates them on every device)
- The MLP parameters are smaller in memory but produce a sizeable amount of computation
  - Well suited to data parallelism
- Parallelized DLRM therefore uses a combination of model parallelism for the embeddings and data parallelism for the MLPs
  - This type of combination is a unique requirement of DLRM
  - It requires a custom implementation (not supported out of the box by Caffe2 or PyTorch)
- The top MLP and the interaction operator require access to part of the mini-batch from the bottom MLP and to all of the embeddings
  - Model parallelism: distribute the embedding tables across devices, then exchange the looked-up vectors so that each device ends up with all embedding results for its slice of the mini-batch
- For the data-parallel MLPs, the parameter updates in the backward pass are accumulated with an allreduce and applied to the replicated parameters on each device asynchronously (see the sketch below)
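
A minimal sketch of the hybrid scheme, assuming one `torch.distributed` process per GPU (table assignment and function names are mine):

```python
import torch.distributed as dist

# Model parallelism: each rank owns a subset of the embedding tables.
# Assumes dist.init_process_group(...) has already been called.
def owned_tables(num_tables, rank, world_size):
    return [t for t in range(num_tables) if t % world_size == rank]

# Data parallelism: MLP gradients are averaged across replicas after backward().
def allreduce_mlp_grads(mlp):
    world_size = dist.get_world_size()
    for p in mlp.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size
```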

## Data

- Three types of data sets: random, synthetic and public
  - The first two are useful for experimenting from the systems perspective
  - The last is useful for measuring the accuracy of the model

### Random

- The inputs of DLRM are continuous and categorical features
  - Continuous features can be modeled by generating random numbers from either a uniform or a normal distribution
  - To generate a categorical feature, one needs to determine how many non-zero elements the given multi-hot vector has
    - This number can be either fixed or random within a range [1, k]
    - Then generate the corresponding number of integer indices within a range [1, m] (sketched below)
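
A minimal sketch of such a generator (distribution choices and names are mine; indices are 0-based here):

```python
import torch

def random_batch(batch, num_dense, k_max, num_rows):
    """Random DLRM-style inputs: dense features plus one multi-hot categorical feature."""
    dense = torch.randn(batch, num_dense)            # continuous: normal distribution
    lengths = torch.randint(1, k_max + 1, (batch,))  # non-zeros per multi-hot vector, in [1, k]
    indices = torch.cat([torch.randint(0, num_rows, (int(n),)) for n in lengths])
    offsets = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)[:-1]])
    return dense, indices, offsets                   # (indices, offsets) feed an nn.EmbeddingBag
```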

### Synthetic

???

- Express categorical features through distributions

### Public

- Criteo AI Labs Ad Kaggle and Terabyte data sets
  - Open-sourced data sets consisting of click logs for ad CTR prediction
  - 13 continuous and 26 categorical features
- The Criteo Ad Kaggle data set contains approximately 45 million samples over 7 days
  - First 6 days: training
  - 7th day: validation
- The Criteo Ad Terabyte data set is sampled over 24 days
  - First 23 days: training
  - 24th day: validation

## Experiments

### Environment

- Implemented in both PyTorch and Caffe2
  - Uses fp32 floating point, with int32 (Caffe2) / int64 (PyTorch) indices
  - Run on the Big Basin platform: Dual Socket Intel Xeon 6138 CPU @ 2.00GHz with 8 Nvidia Tesla V100 16GB GPUs

### Model Accuracy

- Evaluate DLRM against a Deep and Cross Network (DCN)
  - DCN is one of the few models with comprehensive results on the same data set
- The bottom MLP of DLRM consists of 3 hidden layers with 512, 256 and 64 nodes
- The top MLP consists of 2 hidden layers with 512 and 256 nodes
- DCN consists of 6 cross layers and a deep network with 512 and 256 nodes
- Embedding dimension: 16 (both models have about 540M parameters)
- Trained with SGD and Adagrad optimizers
- No regularization
- No extensive tuning of model hyperparameters (see the configuration sketch below)
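
A sketch of the two MLP stacks above, assuming the 13 continuous Criteo features as dense input and a final projection of the bottom MLP to the embedding dimension (the interaction-output size depends on the number of sparse features, so it is left symbolic):

```python
import torch.nn as nn

EMB_DIM = 16
NUM_DENSE = 13                 # Criteo's 13 continuous features

bottom_mlp = nn.Sequential(    # hidden layers 512-256-64, projected to the embedding dim
    nn.Linear(NUM_DENSE, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, EMB_DIM), nn.ReLU(),
)

def top_mlp(interaction_dim):  # hidden layers 512-256, then the click probability
    return nn.Sequential(
        nn.Linear(interaction_dim, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 1), nn.Sigmoid(),
    )
```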

### Model Performance on a Single Socket/Device

- A sample model with 8 categorical features and 512 continuous features
  - Each categorical feature is processed through an embedding table with 1M vectors, with vector dimension 64
- Bottom MLP: 2 layers; top MLP: 4 layers
- This model, implemented in Caffe2, runs in around 256 seconds on the CPU and 62 seconds on the GPU
- The majority of the time is spent performing embedding lookups and in the fully connected layers
  - The FC layers take a significant portion of the time on the CPU, but almost none on the GPU

## Conclusion

Proposed and open-sourced DLRM, a novel deep learning-based recommendation model that exploits categorical data