Newick tree representation to scipy.cluster.hierarchy linkage matrix format

【Question title】: Newick tree representation to scipy.cluster.hierarchy linkage matrix format
【Posted】: 2015-09-11 02:02:46
【Question description】:

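I have a set of genes that have been aligned and clustered based on their DNA sequences, and I have this set of genes in a Newick tree representation (https://en.wikipedia.org/wiki/Newick_format). Does anyone know how to convert this format to the scipy.cluster.hierarchy.linkage matrix format? From the scipy docs for the linkage matrix:

An (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i. A cluster with an index less than n corresponds to one of the n original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.

Going by the scipy documentation at least, the description of how this linkage matrix is structured is rather confusing. What do they mean by "iteration"? Also, how does this representation keep track of which original observations are in which cluster?

I would like to figure out how to do this conversion because the results of the other cluster analyses in my project are already in the scipy representation, and I have been using it for plotting.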

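【Question comments】:

The linkage matrix contains all the information about the process of building the tree. Hierarchical clustering is an iterative method. At each step (= iteration), two existing clusters are joined to create a larger one. This is repeated n-1 times, until only one cluster remains. Does that help you understand what the linkage matrix is telling you?
Yes, I know how the tree is generated; it's just that the language in the documentation is a bit confusing compared with the algorithm. I've got it now and will post an answer soon. Thanks for the clarification! :D

Tags: python scipy hierarchical-clustering phylogeny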

【Solution 1】:

I found this solution:

import logging

import numpy as np
import pandas as pd
from ete3 import ClusterTree
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform


def newick_to_linkage(newick: str, label_order: [str] = None) -> (np.ndarray, [str]):
    """
    Convert newick tree into scipy linkage matrix

    :param newick: newick string, e.g. '(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);'
    :param label_order: list of labels, e.g. ['A', 'B', 'C']
    :returns: linkage matrix and list of labels
    """
    # newick string -> cophenetic_matrix
    tree = ClusterTree(newick)
    cophenetic_matrix, newick_labels = tree.cophenetic_matrix()
    cophenetic_matrix = pd.DataFrame(cophenetic_matrix, columns=newick_labels, index=newick_labels)

    if label_order is not None:
        # sanity checks
        missing_labels = set(label_order).difference(set(newick_labels))
        superfluous_labels = set(newick_labels).difference(set(label_order))
        assert len(missing_labels) == 0, f'Some labels are not in the newick string: {missing_labels}'
        if len(superfluous_labels) > 0:
            logging.warning(f'Newick string contains unused labels: {superfluous_labels}')

        # reorder the cophenetic_matrix
        cophenetic_matrix = cophenetic_matrix.reindex(index=label_order, columns=label_order)

    # reduce the square distance matrix to a condensed distance vector;
    # squareform (not pdist) because the matrix already holds distances:
    # pdist would treat each row as a coordinate vector
    pairwise_distances = squareform(cophenetic_matrix.to_numpy())

    # return linkage matrix (scipy's default 'single' method) and labels
    return linkage(pairwise_distances), list(cophenetic_matrix.columns)

Basic usage:

>>> linkage_matrix, labels = newick_to_linkage(
...     newick='(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);'
... )
>>> print(linkage_matrix)
[[0.  1.  0.3 2. ]
 [2.  3.  0.7 2. ]
 [4.  5.  0.9 4. ]]
>>> print(labels)
['A', 'B', 'C', 'D']

What the cophenetic matrix looks like:

>>> print(cophenetic_matrix)
     A    B    C    D
A  0.0  0.3  0.9  1.0
B  0.3  0.0  1.0  1.1
C  0.9  1.0  0.0  0.7
D  1.0  1.1  0.7  0.0

Advanced usage:

>>> linkage_matrix, labels = newick_to_linkage(
...     newick='(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);',
...     label_order=['C', 'B', 'A']
... )
WARNING:root:Newick string contains unused labels: {'D'}
>>> print(linkage_matrix)
[[1.  2.  0.3 2. ]
 [0.  3.  0.9 3. ]]
>>> print(labels)
['C', 'B', 'A']
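
For anyone unsure what "condensed" means here: scipy's linkage takes the upper triangle of the square distance matrix flattened row by row, which is what scipy.spatial.distance.squareform returns when given a square matrix. Below is a minimal pure-Python sketch of that layout, using the cophenetic matrix shown above. The helper name to_condensed is made up for illustration and is not a scipy function.

```python
def to_condensed(square):
    """Flatten the strict upper triangle of a square distance matrix, row by row.

    Hypothetical helper for illustration; it mirrors the condensed layout
    that scipy's linkage expects, without depending on scipy.
    """
    n = len(square)
    return [square[i][j] for i in range(n) for j in range(i + 1, n)]


# Cophenetic matrix for '(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);' in the order A, B, C, D
square = [
    [0.0, 0.3, 0.9, 1.0],
    [0.3, 0.0, 1.0, 1.1],
    [0.9, 1.0, 0.0, 0.7],
    [1.0, 1.1, 0.7, 0.0],
]

# Order of entries: AB, AC, AD, BC, BD, CD
condensed = to_condensed(square)
print(condensed)
```

For n leaves the condensed vector has n*(n-1)/2 entries, so 6 for this four-leaf tree.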

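【Discussion】:

【Solution 2】:

I figured out how the linkage matrix is generated from the tree representation, thanks to @cel for the clarification. Let's take the example from the Newick wiki page (https://en.wikipedia.org/wiki/Newick_format).

The tree, in string format, is: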

(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);

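First, compute the distances between all leaves. For example, to get the distance between A and B, walk the tree from A to B through the nearest branch point. Since the Newick format gives the distance between each leaf and its branch point, the distance from A to B is simply 0.1 + 0.2 = 0.3. For A to D, we have to do 0.1 + (0.5 + 0.4) = 1.0, because the distance from D to its nearest branch point is 0.4, and the distance from D's branch point to A's is 0.5. So the distance matrix looks like this (with indices A=0, B=1, C=2, D=3):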

distance_matrix=
 [[0.0, 0.3, 0.9, 1.0],
  [0.3, 0.0, 1.0, 1.1],
  [0.9, 1.0, 0.0, 0.7],
  [1.0, 1.1, 0.7, 0.0]]
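
The same leaf-to-leaf distances can be reproduced without any tree library. What follows is only a rough sketch: the parser handles just the small Newick subset used in this example (nested parentheses, names, ':length' suffixes), and the function names and the '#0', '#1' internal node names are made up for illustration.

```python
def parse_newick(text):
    """Parse a simple Newick string into {node: (parent, branch_length)}.

    Assumption: only the subset used in this example is supported.
    Unnamed internal nodes get hypothetical names '#0', '#1', ...
    """
    text = text.strip().rstrip(';')
    parents, pos, counter = {}, [0], [0]

    def read_name():
        start = pos[0]
        while pos[0] < len(text) and text[pos[0]] not in ':,()':
            pos[0] += 1
        return text[start:pos[0]]

    def read_length():
        # Optional ':<number>' suffix; a missing length counts as 0.0
        if pos[0] < len(text) and text[pos[0]] == ':':
            pos[0] += 1
            start = pos[0]
            while pos[0] < len(text) and text[pos[0]] not in ',()':
                pos[0] += 1
            return float(text[start:pos[0]])
        return 0.0

    def read_node(parent):
        if text[pos[0]] == '(':
            name = f'#{counter[0]}'
            counter[0] += 1
            pos[0] += 1                 # skip '('
            while True:
                read_node(name)
                if text[pos[0]] == ',':
                    pos[0] += 1         # another child follows
                else:
                    pos[0] += 1         # skip ')'
                    break
            read_name()                 # ignore an optional internal label
        else:
            name = read_name()
        parents[name] = (parent, read_length())

    read_node(None)
    return parents


def path_to_root(parents, node):
    """Return {ancestor_or_self: distance climbed from node to reach it}."""
    climbed, out = 0.0, {}
    while node is not None:
        out[node] = climbed
        parent, length = parents[node]
        climbed += length
        node = parent
    return out


def leaf_distance(parents, a, b):
    """Path length between two leaves, going through their nearest common ancestor."""
    up_a, up_b = path_to_root(parents, a), path_to_root(parents, b)
    # With non-negative branch lengths, the nearest common ancestor
    # is the shared node with the smallest combined climb.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


parents = parse_newick('(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);')
leaves = sorted(set(parents) - {p for p, _ in parents.values()})
for a in leaves:
    print(a, [round(leaf_distance(parents, a, b), 10) for b in leaves])
```

Each printed row should agree with the corresponding row of the distance matrix above, up to floating-point rounding.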

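From here, the linkage matrix is easy to find. Since we already have n=4 clusters (A, B, C, D) as the original observations, we need to find the additional n-1 clusters of the tree. Each step simply combines two clusters into a new one, and we pick the two clusters that are closest to each other. In this case, A and B are closest, so the first row of the linkage matrix will look like this: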

[A,B,0.3,2]

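From now on we treat A and B as a single cluster, whose distance to the nearest branch point is the distance between A and B.

Now we have 3 clusters left: AB, C, and D. We can update the distance matrix to see which clusters are closest. Let AB have index 0 in the updated distance matrix.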

distance_matrix=
[[0.0, 1.1, 1.2],
 [1.1, 0.0, 0.7],
 [1.2, 0.7, 0.0]]

We can now see that C and D are closest to each other, so let's combine them into a new cluster. The second row of the linkage matrix will now be

[C,D,0.7,2]

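Now only two clusters remain, AB and CD. Their distances to the root branch point are 0.3 and 0.7 respectively, so the distance between them is 1.0. The last row of the linkage matrix will be: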

[AB,CD,1.0,4]
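
As a cross-check of these three merges, here is a small pure-Python sketch of the agglomeration, assuming average linkage (the method passed to scipy in the script further down). It uses the numeric indices A=0, B=1, C=2, D=3 and gives each merged cluster the next free index. The function name and data layout are my own for illustration, not scipy's implementation.

```python
from itertools import combinations


def average_linkage(dist, n):
    """Agglomerate n leaves with average linkage (UPGMA).

    dist maps frozenset({i, j}) -> distance between leaves i and j.
    Returns rows [left, right, height, size] in the scipy column layout.
    """
    members = {i: [i] for i in range(n)}   # active cluster id -> its leaves
    rows, next_id = [], n

    def avg(a, b):
        # Mean of all leaf-to-leaf distances between clusters a and b
        pairs = [(x, y) for x in members[a] for y in members[b]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(members) > 1:
        # Pick the closest pair of active clusters
        a, b = min(combinations(sorted(members), 2), key=lambda ab: avg(*ab))
        rows.append([a, b, avg(a, b), len(members[a]) + len(members[b])])
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    return rows


# Leaf distances from the distance matrix above (A=0, B=1, C=2, D=3)
dist = {
    frozenset({0, 1}): 0.3, frozenset({0, 2}): 0.9, frozenset({0, 3}): 1.0,
    frozenset({1, 2}): 1.0, frozenset({1, 3}): 1.1, frozenset({2, 3}): 0.7,
}
rows = average_linkage(dist, n=4)
for row in rows:
    print(row)
```

Under average linkage the intermediate AB-to-C and AB-to-D distances come out as 0.95 and 1.05 rather than the 1.1 and 1.2 in the hand-updated matrix, but the merge order and the three merge heights (0.3, 0.7, 1.0) are the same.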

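Now, the scipy matrix does not actually contain the strings I have shown here; it uses an indexing scheme instead. Since we combined A and B first, AB gets index 4 and CD gets index 5. So the actual result we should see in the scipy linkage matrix is: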

[[0,1,0.3,2],
 [2,3,0.7,2],
 [4,5,1.0,4]]
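
This also answers the question of how the matrix keeps track of which original observations are in which cluster: membership is not stored directly, but it can be rebuilt by replaying the rows in order. Here is a minimal sketch; cluster_members is a hypothetical helper name, not a scipy function.

```python
def cluster_members(Z, n):
    """Map every cluster index (0 .. 2n-2) to the original observations it contains."""
    members = {i: [i] for i in range(n)}        # indices below n are the leaves
    for i, (left, right, _height, size) in enumerate(Z):
        merged = sorted(members[int(left)] + members[int(right)])
        assert len(merged) == int(size)         # column 3 is the cluster size
        members[n + i] = merged                 # row i creates cluster n + i
    return members


Z = [[0, 1, 0.3, 2],
     [2, 3, 0.7, 2],
     [4, 5, 1.0, 4]]
labels = ['A', 'B', 'C', 'D']

members = cluster_members(Z, n=len(labels))
named = {k: [labels[j] for j in v] for k, v in members.items()}
print(named[4], named[5], named[6])
```

Row i always creates cluster n + i, which is why indices 4 and 5 in the last row refer to AB and CD.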

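This is the general way to get from the tree representation to the scipy linkage matrix representation. However, other Python packages already provide tools for reading Newick-format trees; from those we can get the distance matrix fairly easily and then pass it to scipy's linkage function. Below is a small script that does exactly that for this example.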

from itertools import combinations

import matplotlib.pyplot as plt
import numpy as np
import scipy.cluster.hierarchy as sch
import scipy.spatial.distance
# originally ete2 (Python 2 only); ete3 provides ClusterTree and TreeStyle for Python 3
from ete3 import ClusterTree, TreeStyle


tree = ClusterTree('(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);')
leaves = tree.get_leaf_names()
ts = TreeStyle()
ts.show_leaf_name = True
ts.show_branch_length = True
ts.show_branch_support = True

idx_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
# labels ordered by their matrix index
idx_labels = sorted(idx_dict, key=idx_dict.get)

# just going through the construction in my head, this is what we should get in the end
my_link = np.array([[0, 1, 0.3, 2],
                    [2, 3, 0.7, 2],
                    [4, 5, 1.0, 4]])

# fill the symmetric leaf-to-leaf distance matrix
dmat = np.zeros((4, 4))

for l1, l2 in combinations(leaves, 2):
    d = tree.get_distance(l1, l2)
    dmat[idx_dict[l1], idx_dict[l2]] = dmat[idx_dict[l2], idx_dict[l1]] = d

print('Distance:')
print(dmat)

# linkage expects the condensed form; metric is ignored for precomputed distances, so it is omitted
schlink = sch.linkage(scipy.spatial.distance.squareform(dmat), method='average')

print('Linkage from scipy:')
print(schlink)

print('My link:')
print(my_link)

# compare with a tolerance, since the distances are floating-point sums
print('Did it right?:', np.allclose(schlink, my_link))

dendro = sch.dendrogram(my_link, labels=idx_labels)
plt.show()

tree.show(tree_style=ts)

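【Discussion】:

This is just for others who may be confused about how the Z linkage matrix in scipy is created.
Isn't dist(A, B) = 0.3? So distance_matrix= [[0.0, 0.3*, 0.9, 1.0], [0.3*, 0.0, 1.0, 1.1], [0.9, 1.0, 0.0, 0.7], [1.0, 1.1, 0.1, 0.0]]?
@Sang - thanks for pointing out the typo! It looks like Maria fixed it first, so thanks to her as well!
np. Your code is perfect, so it was easy to confirm. Thanks for this post. It saved me a lot of digging.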