How to plot an MDS from a similarity matrix? (Answers)

【Question title】: How to plot a MDS from a similarity matrix?
【Posted】: 2021-01-21 21:17:16
【Question】:

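I am working with a similarity matrix whose values lie between 0 and 1 (1 means the elements are equal), and I am trying to plot an MDS using Python and scikit-learn.

I have found several examples, but I am not sure what to pass as the input to mds.fit().

At the moment, my data looks like this (file.csv):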

  ;  A  ;  B  ;  C  ;  D  ; E
A ; 1   ; 0.1 ; 0.2 ; 0.5 ; 0.2
B ; 0.1 ; 1   ; 0.3 ; 1   ; 0
C ; 0.2 ; 0.3 ; 1   ; 0.8 ; 0.6
D ; 0.5 ; 1   ; 0.8 ; 1   ; 0.2
E ; 0.2 ; 0   ; 0.6 ; 0.2 ; 1

I am currently using this code:

import pandas
from sklearn import manifold
import matplotlib.pyplot as plt

# Pass the separator by keyword; recent pandas versions reject it as a positional argument
data = pandas.read_table("file.csv", sep=";", header=0, index_col=0)

mds = manifold.MDS(n_components=2, random_state=1, dissimilarity="precomputed")
mds.fit(data)
points = mds.embedding_

# Prepare axes
ax = plt.axes([0,0,2,2])
ax.set_aspect(aspect='equal')

# Plot points
plt.scatter(points[:,0], points[:,1], color='silver', s=150)
# Add labels
for i in range(data.shape[0]):
    ax.annotate(data.index[i], (points[i,0], points[i,1]), color='blue')

#plt.show() # Open display and show at screen
plt.savefig('out.png', format='png', bbox_inches='tight') # PNG
#plt.savefig('out.jpg', format='jpg', bbox_inches='tight') # JPG

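I am not sure what sklearn is doing. I have read many examples where people use a "dissimilarity matrix" with 0 on the diagonal (instead of 1).

Should I transform my matrix or not? If so, which transformation should I apply? (I read that a simple subtraction is enough... but there are other methods... I am a bit lost :( )
Do sklearn and MDS understand the input automatically (as a similarity or a dissimilarity matrix, with 0 or 1 on the diagonal)? Or do they work on a distance matrix? (In that case, how do I get one from a similarity matrix?)
In another linked post, they say the similarity lies between 1 and -1... I am using similarities between 0 and 1... I suppose I should transform my data? Which transformation should I use?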

Tags: python scikit-learn multi-dimensional-scaling

【Answer 1】:

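I compared my results against XLSTAT (an Excel add-in) so that I could try many scenarios and compare the approaches.

First: my input matrix is a "similarity" matrix, because I can read it as "A and A are 100% equal". Since MDS takes a dissimilarity matrix as input, I have to apply a transformation.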
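The distinction above can be checked mechanically. The following is a minimal sketch, not part of the original answer: with dissimilarity="precomputed", scikit-learn uses the values as distances and does not detect that a matrix holds similarities, so a quick sanity check before fitting can help. The helper name looks_like_dissimilarity is hypothetical, and the matrix is the one from the question.

```python
import numpy as np

def looks_like_dissimilarity(matrix, tol=1e-9):
    """Return True if matrix is square, symmetric, non-negative, with a zero diagonal."""
    m = np.asarray(matrix, dtype=float)
    if m.ndim != 2 or m.shape[0] != m.shape[1]:
        return False
    return bool(
        np.allclose(m, m.T, atol=tol)                # symmetric
        and np.all(m >= -tol)                        # no negative entries
        and np.allclose(np.diag(m), 0.0, atol=tol)   # each item is at distance 0 from itself
    )

# Similarity matrix from the question (rows/columns A..E)
similarity = np.array([
    [1.0, 0.1, 0.2, 0.5, 0.2],
    [0.1, 1.0, 0.3, 1.0, 0.0],
    [0.2, 0.3, 1.0, 0.8, 0.6],
    [0.5, 1.0, 0.8, 1.0, 0.2],
    [0.2, 0.0, 0.6, 0.2, 1.0],
])

print(looks_like_dissimilarity(similarity))        # False: the diagonal holds 1, not 0
print(looks_like_dissimilarity(1.0 - similarity))  # True: after the "1 - cell" conversion
```

The check only looks at the shape of the data; it cannot tell whether the chosen conversion is the right one for a given problem.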

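According to Ricco Rakotomalala's French course on data science (p. 208-209), the simple method is to subtract each cell from the maximum value (that is, compute "1 - cell"). So you can easily write a Python program or (since I keep track of every matrix) an AWK preprocessing script: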

similarity-to-dissimilarity-simple.awk

# We keep the tags around the CSV matrix
# X ; Word1 ; Word2 ; ...
# Header
NR == 1 {
    # First column is just "X" (or space)
    printf("%s", "X");

    # For each column, print the word
    for (i = 2; i <= NF; i++)
    {
        col = $i;
        printf("%s%s", OFS, col);
    }

    # End of line
    printf("\n");
}

# Other lines are processed
# WordN ; 1 ; 0.5 ; 0.2 ; ...
NR != 1 {
    # First column is the word/tag
    col = $1;
    printf("%s", col);

    # For each column, process the number
    for (i = 2; i <= NF; i++)
    {
        # dissimilarity = (1 - similarity)
        NUM = $i;
        VAL = 1 - NUM;
        printf("%s%s", OFS, VAL);
    }

    printf("\n");
}

It can be invoked with this command:

awk -F ";" -v OFS=";" -f similarity-to-dissimilarity-simple.awk input.csv > output-simple.csv
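For readers who prefer to stay in Python, the same "1 - cell" conversion can be done in memory before fitting. This is a minimal sketch rather than the answer's own code: it hard-codes the matrix from the question instead of reading file.csv, and it reuses the MDS parameters from the question's script.

```python
import numpy as np
from sklearn import manifold

labels = ["A", "B", "C", "D", "E"]

# Similarity matrix from the question (1 = identical)
similarity = np.array([
    [1.0, 0.1, 0.2, 0.5, 0.2],
    [0.1, 1.0, 0.3, 1.0, 0.0],
    [0.2, 0.3, 1.0, 0.8, 0.6],
    [0.5, 1.0, 0.8, 1.0, 0.2],
    [0.2, 0.0, 0.6, 0.2, 1.0],
])

# Simple conversion: dissimilarity = 1 - similarity (the diagonal becomes 0)
dissimilarity = 1.0 - similarity

# "precomputed" tells MDS that the input is already a dissimilarity matrix
mds = manifold.MDS(n_components=2, random_state=1, dissimilarity="precomputed")
points = mds.fit_transform(dissimilarity)

# One 2-D coordinate pair per label, ready for plt.scatter / ax.annotate
for label, (x, y) in zip(labels, points):
    print(f"{label}: ({x:.3f}, {y:.3f})")
```

The resulting points array can be dropped into the plotting code from the question in place of mds.embedding_.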

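A more elaborate way to compute it (I could not find the reference, sorry :( ) is based on a different transformation applied to each cell.

This method seems well suited when the diagonal does not contain identical values (I saw a co-occurrence matrix in another post... it should work for that case). In my case, since the diagonal is always 1, I reduced it to:

dissimilarity = sqrt(2 - 2 * similarity)

So the AWK program that performs this transformation (given my data, I implemented the simplified version) is: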

similarity-to-dissimilarity-complex.awk

# Header
# X ; Word1 ; Word2 ; ...
NR == 1 {
    # First column is just "X" (or space)
    printf("%s", "X");

    # For each column, print the word
    for (i = 2; i <= NF; i++)
    {
        col = $i;
        printf("%s%s", OFS, col);
    }

    # End of line
    printf("\n");
}

# Other lines are processed
# WordN ; 1 ; 0.5 ; 0.2 ; ...
NR != 1 {
    # First column is the word
    col = $1;
    printf("%s", col);

    # For each column, process the number
    for (i = 2; i <= NF; i++)
    {
        # dissimilarity = sqrt(2 - 2 * similarity)
        NUM = $i;
        VAL = sqrt(2 - 2 * NUM);
        printf("%s%s", OFS, VAL);
    }

    printf("\n");
}

You can invoke it with this command:

awk -F ";" -v OFS=";" -f similarity-to-dissimilarity-complex.awk input.csv > output-complex.csv
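The general, non-simplified transformation is not spelled out in the text. One common conversion that reduces to sqrt(2 - 2 * similarity) when the diagonal is all 1 is d(i,j) = sqrt(s(i,i) + s(j,j) - 2 * s(i,j)). The sketch below assumes that this is the formula meant; that is a guess, not something the answer confirms. The function name similarity_to_distance is hypothetical.

```python
import numpy as np

def similarity_to_distance(similarity):
    """Assumed general form: d_ij = sqrt(s_ii + s_jj - 2 * s_ij)."""
    s = np.asarray(similarity, dtype=float)
    diag = np.diag(s)
    squared = diag[:, None] + diag[None, :] - 2.0 * s
    # Clip negative values (e.g. from rounding) to zero before the square root
    return np.sqrt(np.clip(squared, 0.0, None))

# With a unit diagonal the result matches the simplified sqrt(2 - 2 * s)
unit_diag = np.array([
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.8],
    [0.2, 0.8, 1.0],
])
print(np.allclose(similarity_to_distance(unit_diag), np.sqrt(2.0 - 2.0 * unit_diag)))

# With a non-constant diagonal (e.g. co-occurrence counts) the diagonal still maps to 0
counts = np.array([
    [4.0, 1.0],
    [1.0, 9.0],
])
print(similarity_to_distance(counts))
```

For the second matrix the off-diagonal distance is sqrt(4 + 9 - 2) = sqrt(11), which shows how items with different self-similarity are handled.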

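When I used Kruskal's stress to check which version was better, the simple similarity-to-dissimilarity conversion (1 - cell) turned out to be the best in my case: its stress stayed between 0.32 and 0.34, which is not good, whereas the complex version gave values above 0.34, which is worse.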
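To reproduce this kind of comparison in Python, the stress can be computed directly from the dissimilarity matrix and the embedded points. The sketch below uses Kruskal's stress-1 for metric MDS, with the embedded distances in the denominator. Whether this is exactly the normalization behind the figures quoted above is an assumption, and kruskal_stress1 is a hypothetical helper name.

```python
import numpy as np

def kruskal_stress1(dissimilarity, points):
    """Stress-1 = sqrt(sum (d_ij - delta_ij)^2 / sum d_ij^2) over pairs i < j.

    delta_ij are the input dissimilarities, d_ij the distances in the embedding.
    """
    delta = np.asarray(dissimilarity, dtype=float)
    pts = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances between embedded points
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Count each unordered pair once
    upper = np.triu_indices(delta.shape[0], k=1)
    numerator = ((dist[upper] - delta[upper]) ** 2).sum()
    denominator = (dist[upper] ** 2).sum()
    return float(np.sqrt(numerator / denominator))

# A 3-4-5 right triangle: the embedding reproduces these dissimilarities exactly
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
perfect = np.array([
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 5.0],
    [4.0, 5.0, 0.0],
])
print(kruskal_stress1(perfect, points))        # 0.0: perfect fit
print(kruskal_stress1(2.0 * perfect, points))  # 1.0: every distance is off by its own length
```

Calling the function once with the "1 - cell" matrix and once with the square-root matrix, each paired with its own embedding, gives two comparable numbers; lower is better.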
