KDD'17
(SIGKDD Conference on Knowledge Discovery and Data Mining)
Abstract
•
word2vec → node2vec (homogeneous) → metapath2vec (heterogeneous)
•
Metapath-based word2vec method
◦
In a nutshell, it applies the node2vec approach to heterogeneous graphs
Background: Heterogeneous Graph
•
Mining Heterogeneous Information Networks: A Structural Analysis Approach (KDD'12)
Heterogeneous Graph?
A graph that has more than one type of node or more than one type of relation (link)
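The definition above is usually written formally as follows. This is a sketch of the standard notation, where the symbols $\phi$, $\psi$, $T_V$, and $T_E$ are assumed here for illustration and may differ from the notation in the cited papers:

```latex
G = (V, E, T), \qquad
\phi : V \to T_V, \qquad
\psi : E \to T_E, \qquad
|T_V| + |T_E| > 2
```

Here $T_V$ is the set of node types and $T_E$ the set of relation types. A homogeneous graph is the special case $|T_V| = |T_E| = 1$.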
Schema & Network
Metapath
•
Different meta-paths yield different outputs
Strength for link types (weighted Meta-path)
•
Different link types can be given different weights so that graph mining reflects their relative importance
Introduction
•
Applies the word2vec approach (latent-space representation learning) to heterogeneous graphs via meta-path-based random walks
•
Extend the skip-gram model to facilitate the modeling of geographically and semantically close nodes
•
Develop a heterogeneous negative sampling-based method
Metapath2vec & Metapath2vec++
Embedding
•
Homogeneous graph embedding
•
Heterogeneous graph embedding
•
Negative sampling
◦
Build the node frequency distribution by viewing different types of nodes homogeneously
◦
Draw negative nodes regardless of node types
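A minimal sketch of the type-agnostic negative sampling described above. It assumes the word2vec convention of raising node frequencies to the power 0.75, which these notes do not state. The helper name `build_negative_sampler` and the toy walks are hypothetical.

```python
import random
from collections import Counter

def build_negative_sampler(walks, power=0.75, seed=0):
    """Return a sampler over ALL nodes, ignoring node types.

    Frequencies are counted over the walk corpus. The 0.75 exponent is an
    assumption borrowed from word2vec, not a value taken from these notes.
    """
    counts = Counter(node for walk in walks for node in walk)
    nodes = sorted(counts)
    weights = [counts[n] ** power for n in nodes]
    rng = random.Random(seed)

    def sample(k, exclude=()):
        # Rejection-sample k negatives that are not in `exclude`.
        negatives = []
        while len(negatives) < k:
            candidate = rng.choices(nodes, weights=weights, k=1)[0]
            if candidate not in exclude:
                negatives.append(candidate)
        return negatives

    return sample

# Toy walks over authors (a*), papers (p*), and one venue (v1).
walks = [["a1", "p1", "v1", "p2", "a2"], ["a2", "p2", "v1", "p1", "a1"]]
sample = build_negative_sampler(walks)
negatives = sample(5, exclude={"a1"})
```

Because a single distribution covers every node, a venue context can receive author or paper negatives. That is the behavior metapath2vec++ changes.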
Walks
•
Neighborhood construction
◦
A random walk that ignores node types is biased toward two kinds of nodes
▪
Highly visible types: nodes with a dominant number of paths
▪
Concentrated nodes: nodes with a governing percentage of paths pointing to a small set of nodes
◦
The flow of the walker is conditioned on the pre-defined meta-path
▪
Meta-paths are commonly used in a symmetric way
(the first and last node types are the same)
•
APA, APVPA, OAPVPAO, ...
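The meta-path-conditioned walk can be sketched as below. This is a minimal illustration, not the authors' implementation. It assumes a symmetric meta-path given as a string of one-letter type codes and a uniform choice among neighbors of the required next type. The function name, the adjacency-dict representation, and the toy graph are all hypothetical.

```python
import random

def metapath_walk(adj, node_type, start, metapath, walk_length, rng):
    """Walk from `start`, stepping only to neighbors of the type the meta-path expects.

    `metapath` must be symmetric (e.g. "APVPA"), so it repeats with
    period len(metapath) - 1.
    """
    assert metapath[0] == metapath[-1], "expects a symmetric meta-path"
    assert node_type[start] == metapath[0], "start node must match the first type"
    period = len(metapath) - 1
    walk = [start]
    while len(walk) < walk_length:
        next_type = metapath[len(walk) % period]
        candidates = [n for n in adj[walk[-1]] if node_type[n] == next_type]
        if not candidates:  # dead end: no neighbor of the required type
            break
        walk.append(rng.choice(candidates))  # uniform over valid neighbors
    return walk

# Toy author-paper-venue graph (undirected, stored as adjacency lists).
adj = {
    "a1": ["p1"], "a2": ["p1", "p2"], "a3": ["p2"],
    "p1": ["a1", "a2", "v1"], "p2": ["a2", "a3", "v1"],
    "v1": ["p1", "p2"],
}
node_type = {n: n[0].upper() for n in adj}

walk = metapath_walk(adj, node_type, "a1", "APVPA", 9, random.Random(0))
types = "".join(node_type[n] for n in walk)
```

In this toy graph every step has a valid neighbor, so the type sequence of the walk follows the repeated pattern A, P, V, P, A, P, V, P, A.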
Metapath2vec++
•
Softmax normalized within each node type
◦
Metapath2vec++ specifies one set of multinomial distributions for each type of neighborhood in the output layer of the skip-gram model
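To make the difference concrete, the sketch below computes the probability of one venue context twice: normalized over all nodes (metapath2vec style) and normalized only over nodes of the same type (metapath2vec++ style). The vectors are made-up toy values and the function names are hypothetical. Real training would approximate these softmaxes with negative sampling rather than compute them exactly.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_prob(target, center_vec, context_vecs, candidates):
    """P(target | center), with the normalizer summed over `candidates` only."""
    scores = {c: dot(context_vecs[c], center_vec) for c in candidates}
    shift = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - shift) for c, s in scores.items()}
    return exps[target] / sum(exps.values())

# Toy, made-up 2-d vectors for illustration only.
node_type = {"a1": "A", "a2": "A", "p1": "P", "v1": "V", "v2": "V"}
context_vecs = {
    "a1": [0.2, 0.1], "a2": [0.0, 0.3], "p1": [0.5, 0.5],
    "v1": [1.0, 0.0], "v2": [0.0, 1.0],
}
center_vec = [0.6, 0.2]

all_nodes = list(context_vecs)
venues = [n for n in all_nodes if node_type[n] == "V"]

# One distribution over every node, regardless of type.
p_plain = softmax_prob("v1", center_vec, context_vecs, all_nodes)
# One distribution per type: only venues compete with a venue context.
p_typed = softmax_prob("v1", center_vec, context_vecs, venues)
venue_total = sum(softmax_prob(v, center_vec, context_vecs, venues) for v in venues)
```

With the type-specific normalizer, the probabilities over venues sum to one on their own, so authors and papers no longer compete with a venue context.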
Experiments
•
Dataset
◦
AMiner Computer Science (CS) dataset
▪
9,323,739 computer scientists and 3,194,405 papers
▪
from 3,883 computer science venues
◦
The Database and Information Systems (DBIS) dataset
▪
464 venues, their top-5000 authors, and corresponding 72,902 publications
•
Parameters
◦
The number of walks per node w: 1000;
◦
The walk length l : 100;
◦
The vector dimension d: 128 (LINE: 128 for each order);
◦
The neighborhood size k : 7;
◦
The size of negative samples: 5.
•
Meta-paths
◦
“APA” → the coauthor semantic
◦
“APVPA” → heterogeneous semantic of authors publishing papers at the same venues
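As an illustration of how the neighborhood size k turns walks into training data, the sketch below extracts (center, context) pairs from a single walk. It assumes k is a window radius on each side of the center node, as in word2vec, which is an interpretation rather than something these notes state. The function name and the toy walk are hypothetical.

```python
def context_pairs(walk, k):
    """Return (center, context) pairs for every node within k steps of the center."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - k)
        hi = min(len(walk), i + k + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

# A short APVPA-style walk, with k=2 instead of the experiments' k=7 for readability.
walk = ["a1", "p1", "v1", "p2", "a2"]
pairs = context_pairs(walk, k=2)
```

With the experimental settings (walk length 100, k = 7), an interior node of a walk would contribute 14 context pairs under this interpretation.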
Multi-class classification
Parameter Sensitivity
Node Clustering
Parameter Sensitivity
Case Study
Similarity Search
•
In most cases, the top three results cover venues with prestige similar to that of the query venue
◦
STOC to FOCS in theory
OSDI to SOSP in system
HPCA to ISCA in architecture
CCS to S&P in security
CSCW to CHI in human-computer interaction
EMNLP to ACL in NLP
ICML to NIPS in machine learning
WSDM to WWW in Web
AAAI to IJCAI in artificial intelligence
PVLDB to SIGMOD in database, etc.
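A similarity search of this kind can be sketched as a nearest-neighbor lookup over the learned embeddings. The sketch assumes cosine similarity as the ranking measure. The embedding values and the names `venue_a`, `venue_b`, and `venue_c` are made-up placeholders, not results from the paper.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(query, embeddings, k=3):
    """Return the k node names whose embeddings are closest to the query's."""
    q = embeddings[query]
    scored = [(cosine(q, vec), name) for name, vec in embeddings.items() if name != query]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Made-up 2-d embeddings for three hypothetical venues.
embeddings = {
    "venue_a": [1.0, 0.0],
    "venue_b": [0.9, 0.1],
    "venue_c": [0.0, 1.0],
}
nearest = top_k_similar("venue_a", embeddings, k=1)
ranking = top_k_similar("venue_a", embeddings, k=2)
```

Restricting `embeddings` to venue nodes only would give a venue-to-venue search like the examples listed above.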
Visualization
•
Refer to Figure 1
•
Instead of separating the two types of nodes into two columns, it is capable of grouping each pair of one venue and its corresponding author closely
◦
R. E. Tarjan and FOCS, H. Jensen and SIGGRAPH, H. Ishii and CHI, R. Agrawal and SIGMOD, etc.
•
Together, both models arrange nodes from similar fields close to each other and dissimilar ones distant from each other
◦
Such as the “Core CS” cluster of systems (OSDI), networking (SIGCOMM), security (S&P), and architecture (ISCA), as well as the “Big AI” cluster of data mining (KDD), information retrieval (SIGIR), artificial intelligence (AI), machine learning (NIPS), NLP (ACL), and vision (CVPR).
•
Notice that the heterogeneous embeddings are able to unveil the similarities across different domains
◦
including the “Core CS” sub-field cluster at the bottom right and the “Big AI” sub-field cluster at the top right
•
Demonstrate metapath2vec++’s novel capability to discover, model, and capture the underlying structural and semantic relationships between multiple types of nodes in heterogeneous networks.
Scalability
•
Environment
◦
Implemented in C and C++
◦
Run on a machine with quad 12-core (48 cores in total) 2.3 GHz Intel Xeon E7-4850 CPUs