Discriminative Embeddings of Latent Variable Models for Structured Data

Authors

Georgia Institute of Technology

Hanjun Dai
Bo Dai
Le Song

Abstract

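Kernel classifiers and regressors designed for structured data such as sequences, trees, and graphs have significantly advanced interdisciplinary areas such as computational biology and drug design. Typically, a kernel is designed beforehand for a given data type, either by exploiting statistics of the structures or by making use of probabilistic generative models, and a discriminative classifier is then learned on top of the kernel via convex optimization. However, this elegant two-stage approach also keeps kernel methods from scaling to millions of data points and from exploiting discriminative information to learn feature representations. We propose structure2vec, an effective and scalable approach to structured data representation based on the idea of embedding latent variable models into feature spaces and learning such feature spaces using discriminative information. Interestingly, structure2vec extracts features by performing a sequence of function mappings in a way similar to graphical model inference procedures such as mean field and belief propagation. In applications involving millions of data points, we show that structure2vec runs 2 times faster and produces models that are 10,000 times smaller, while at the same time achieving state-of-the-art predictive performance.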

Introduction

The first class of kernels is based on bag of structures (BOS):

spectrum kernel
subtree kernel
graphlet kernel
Weisfeiler-Lehman graph kernel

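The feature representations of these kernels are fixed in advance, with each dimension corresponding to one substructure.
The size of the datasets they can be applied to is also limited.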
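
To make the "one dimension per substructure" point concrete, here is a minimal sketch of a spectrum kernel for strings, where each feature dimension counts one length-k substring. The function names and the default `k=2` are illustrative assumptions, not taken from the paper or its code.

```python
from collections import Counter

def spectrum_features(seq, k):
    """Count every contiguous length-k substring (k-mer) of seq.

    The feature space is fixed in advance: one dimension per possible k-mer.
    """
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s, t, k=2):
    """Kernel value = inner product of the two k-mer count vectors."""
    fs = spectrum_features(s, k)
    ft = spectrum_features(t, k)
    # Only k-mers present in both strings contribute to the inner product.
    return sum(count * ft[kmer] for kmer, count in fs.items())
```

Because the feature map depends only on the choice of k, no label information can reshape it, which is the limitation the notes point out.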

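The second class of kernels is based on probabilistic graphical models, which are used to represent noisy and structured data.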

hidden Markov models for sequence data
pairwise Markov random fields for graph data

Representative kernels:

Fisher kernel
probability product kernel
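
For reference, the commonly used textbook definitions of these two kernels can be written as below. The symbols ($\theta$ for the generative model parameters, $U_x$ for the Fisher score, $I$ for the Fisher information matrix, $\rho$ for the exponent) are notation introduced here for illustration, not notation from these notes.

```latex
% Fisher kernel: compare data points through the gradient of the log-likelihood
U_x = \nabla_\theta \log p(x \mid \theta), \qquad
I = \mathbb{E}_x\!\left[ U_x U_x^\top \right], \qquad
k_{\mathrm{Fisher}}(x, x') = U_x^\top I^{-1} U_{x'}

% Probability product kernel: compare the models fitted to each data point
k_{\mathrm{PP}}(p, p') = \int p(x)^{\rho} \, p'(x)^{\rho} \, dx
```

In both cases the generative model is fitted first and the kernel is derived from it afterwards, so the feature space is again fixed before the discriminative classifier is trained.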

Idea of this paper

Our idea is to model each structured data point as a latent variable model, then embed the graphical model into feature spaces, and use inner product in the embedding space to define kernels.

Novelty:

Instead of fixing a feature or embedding space beforehand, we will also learn the feature space by directly minimizing the empirical loss defined by the label information.

Differences:

learning similarity measure for structured data
structure2vec learns the nonlinear mappings using the discriminative information
a variant of structure2vec can run in a mean field update fashion, different from message passing algorithms

Backgrounds

Kernel methods
Kernels for structured data
Hilbert space

Discriminative Embeddings of Latent Variable Models for Structured Data

Models

Embedding Mean Field
Embedding Loopy BP
Embedding Other Variational Inference
Discriminative Embedding
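
As an illustration of the first item, the following is a minimal pure-Python sketch of a mean-field-style embedding update: each node embedding is recomputed from the node's own features and the sum of its neighbors' embeddings, and the graph embedding is the sum over nodes. It assumes a ReLU nonlinearity, synchronous updates, and two weight matrices named `W1` and `W2`; all function and parameter names are hypothetical, and this is not the code from the graphnn repository.

```python
def relu(v):
    """Elementwise max(0, a)."""
    return [max(0.0, a) for a in v]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def vec_add(u, v):
    """Elementwise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

def mean_field_embedding(features, neighbors, W1, W2, num_iters=3):
    """Embed a graph with num_iters rounds of neighbor aggregation.

    features[i]  : feature vector x_i of node i (length p)
    neighbors[i] : list of node indices adjacent to node i
    W1 (d x p), W2 (d x d) : parameters that would be learned from labels
    """
    d = len(W1)
    mu = [[0.0] * d for _ in features]  # node embeddings start at zero
    for _ in range(num_iters):
        new_mu = []
        for i, x in enumerate(features):
            # Sum the previous-round embeddings of all neighbors of node i.
            agg = [0.0] * d
            for j in neighbors[i]:
                agg = vec_add(agg, mu[j])
            # mu_i <- relu(W1 x_i + W2 * sum_{j in N(i)} mu_j)
            new_mu.append(relu(vec_add(matvec(W1, x), matvec(W2, agg))))
        mu = new_mu  # synchronous update: swap in all new embeddings at once
    # Graph-level embedding: sum of the final node embeddings.
    graph = [0.0] * d
    for m in mu:
        graph = vec_add(graph, m)
    return graph

def embedding_kernel(g1, g2):
    """Inner product of two graph embeddings, usable as a similarity score."""
    return sum(a * b for a, b in zip(g1, g2))
```

In a discriminative setting, `W1` and `W2` would be trained end to end together with a prediction layer on top of the graph embedding, rather than being fixed beforehand as in this sketch.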

Experiments

Code: https://github.com/Hanjun-Dai/graphnn

Datasets

String dataset
- SCOP
- FC and RES dataset
Graph datasets
- MUTAG
- NCI1
- NCI109
- ENZYMES
- D&D
Harvard Clean Energy Project dataset

Thoughts

Combinations of graphical models, Hilbert space embeddings, and general deep learning methods will become increasingly common.