根据匹配/不匹配排序算法答案

【问题标题】：Sorting Algorithm according to match/mismatch根据匹配/不匹配排序算法
【发布时间】：2021-12-25 01:39:42
【问题描述】：

我正在处理的数据代表 5 个栖息地中是否存在某些物种。我想根据它们之间的共享区域获得集群，基本上我想最大化每个物种的元素之间的匹配。
这是原始数据集

这就是我正在寻找的排序类型
通过简单地对占用区域的数量进行排序，然后手动修复明显的错误，我设法获得了 3 个组。
这种非常简化的算法之所以有效，是因为被占用的区域总是连续的，并且存在一些常见的模式。

起初我认为这个问题与Sequence Alignment 有点相似，但我真的想不出任何方法将这些算法应用于我的案例。此外，我还想找到一种自动方法将这些数据分成 3 组，但这并不是绝对必要的。

真正的挑战是对包含空白的new Dataset 进行排序。
我认为最好的方法应该是：

计算每对可能的物种之间的匹配和不匹配数量
找到与第一个匹配最多的物种并将它们彼此相邻设置
对第二个物种（刚刚重新定位的那个）重复，不包括对第一个物种的检查（以避免无限循环），依此类推...

## Original df
    Species Zones array
0   A   [0, 1, 1, 1, 1]
1   B   [0, 1, 1, 1, 1]
2   C   [0, 1, 1, 1, 1]
3   D   [0, 1, 1, 0, 0]
4   E   [0, 1, 1, 1, 1]
5   F   [0, 1, 1, 1, 1]
6   G   [0, 1, 1, 1, 1]
7   H   [0, 1, 1, 1, 1]
8   I   [1, 1, 1, 1, 1]
9   J   [0, 1, 1, 1, 1]
10  K   [1, 1, 1, 0, 0]
11  L   [1, 1, 1, 0, 0]
12  M   [1, 1, 1, 1, 1]
13  N   [0, 1, 1, 1, 1]
14  O   [0, 0, 1, 1, 1]
15  P   [0, 1, 1, 1, 1]
16  Q   [0, 0, 1, 1, 1]
17  R   [1, 1, 1, 0, 0]
18  S   [0, 1, 1, 1, 1]
19  T   [0, 1, 1, 1, 0]
20  U   [0, 1, 1, 1, 1]


## Sorted df
  Species   Zones array
0   A   [1, 1, 1, 1, 1]
1   B   [1, 1, 1, 1, 1]
2   C   [0, 1, 1, 1, 1]
3   D   [0, 1, 1, 1, 1]
4   E   [0, 1, 1, 1, 1]
5   F   [0, 1, 1, 1, 1]
6   G   [0, 1, 1, 1, 1]
7   H   [0, 1, 1, 1, 1]
8   I   [0, 1, 1, 1, 1]
9   J   [0, 1, 1, 1, 1]
10  K   [0, 1, 1, 1, 1]
11  L   [0, 1, 1, 1, 1]
12  M   [0, 1, 1, 1, 1]
13  N   [0, 1, 1, 1, 1]
14  O   [0, 0, 1, 1, 1]
15  P   [0, 0, 1, 1, 1]
16  Q   [0, 1, 1, 1, 0]
17  R   [0, 1, 1, 0, 0]
18  S   [1, 1, 1, 0, 0]
19  T   [1, 1, 1, 0, 0]
20  U   [1, 1, 1, 0, 0]

## New gapped df
    Species Prey
0   A   [1, 1, 1, 0, 1, 0, 1, 1, 1]
1   B   [1, 0, 1, 0, 1, 1, 1, 0, 1]
2   C   [1, 1, 1, 0, 1, 0, 1, 0, 1]
3   D   [1, 1, 1, 0, 1, 0, 1, 0, 0]
4   E   [1, 1, 1, 0, 1, 0, 0, 0, 0]
5   F   [1, 0, 1, 0, 1, 1, 0, 0, 0]
6   G   [1, 1, 1, 1, 0, 0, 0, 0, 0]
7   H   [1, 1, 1, 0, 0, 0, 0, 0, 0]
8   I   [1, 0, 1, 0, 0, 0, 0, 0, 0]
9   J   [0, 0, 1, 0, 0, 1, 0, 0, 0]
10  K   [1, 0, 1, 0, 0, 0, 0, 0, 0]
11  L   [1, 0, 1, 0, 0, 0, 0, 0, 0]
12  M   [1, 0, 0, 0, 0, 0, 0, 0, 0]
13  N   [0, 0, 0, 0, 0, 1, 0, 0, 0]
14  O   [1, 0, 0, 0, 0, 0, 0, 0, 0]
15  P   [1, 0, 0, 0, 0, 0, 0, 0, 0]
16  Q   [0, 0, 0, 0, 0, 1, 0, 0, 0]
17  R   [1, 0, 0, 0, 0, 0, 0, 0, 0]
18  S   [0, 0, 1, 0, 0, 0, 0, 0, 0]
19  T   [0, 0, 1, 0, 0, 0, 0, 0, 0]
20  U   [0, 0, 1, 0, 0, 0, 0, 0, 0]

【问题讨论】：

添加演示数据框而不是图片
当然，我正在创建它

标签： python pandas dataframe sorting cluster-analysis

【解决方案1】：

我认为我以一种简单而适当的方式解决了这个问题（但我确信可能会有更好的算法）。我检查每对布尔向量之间的匹配数：

n=len(diet_bool_df['Prey'])
match_mtx = [[0 for i in range(n)] for j in range(n)] #empty matrix nxn

for i in range(n):
    for j in range(n):
#this gives a vector containing 1 for a match and 0 for a mismatch position-wise
        match_vec = [int(diet_bool_df['Boolean Diet'][i][k]==diet_bool_df['Prey'][j][k]) for k in range(len(diet_bool_df['Prey'][i]))]
#the sum of all 1 gives the number of matches between each couple (i,j)
        match_mtx[i][j] = sum(match_vec)

由于有 9 个猎物，因此最多有 9 个匹配项，这将表示相同的饮食。这个新矩阵现在实际上可以用 kmeans 聚类，因为数据是各向同性的，应该满足所有假设。

def plot_match_cluster():
    fig= go.Figure()
    fig.add_trace(go.Heatmap(
        z=match_mtx,
    ))
    fig.show()

kmeans = sklearn.cluster.KMeans(n_clusters=3).fit_predict(match_mtx)

新的聚类数据如下所示
Clusterized Data

注意。有趣的是，我根据猎物数量决定的初始排序是在这种情况下创建集群的一个很好的指标。但这仅适用于 3 个集群。

【讨论】：

经过一些更深入的分析后，我注意到在相似矩阵上使用 kmean 仍然不正确。正确的方法应该是层次算法或者谱法，

【解决方案2】：

集群的方法有很多，除非您有特殊情况，否则最好使用现成的方法。这是一个使用 kmeans 聚类的示例，为您的用例设置 k=3。该解决方案使用了许多您可能需要安装的软件包。 sklearn 用于 kmeans-clustering，然后其余的导入用于绘图。

import pandas as pd
import sklearn.cluster

from matplotlib.colors import LinearSegmentedColormap
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({ 
    'A':[0, 1, 1, 1, 1], 
    'B':[0, 1, 1, 1, 1], 
    'C':[0, 1, 1, 1, 1], 
    'D':[0, 1, 1, 0, 0], 
    'E':[0, 1, 1, 1, 1], 
    'F':[0, 1, 1, 1, 1], 
    'G':[0, 1, 1, 1, 1], 
    'H':[0, 1, 1, 1, 1], 
    'I':[1, 1, 1, 1, 1], 
    'J':[0, 1, 1, 1, 1], 
    'K':[1, 1, 1, 0, 0], 
    'L':[1, 1, 1, 0, 0], 
    'M':[1, 1, 1, 1, 1], 
    'N':[0, 1, 1, 1, 1], 
    'O':[0, 0, 1, 1, 1], 
    'P':[0, 1, 1, 1, 1], 
    'Q':[0, 0, 1, 1, 1], 
    'R':[1, 1, 1, 0, 0], 
    'S':[0, 1, 1, 1, 1], 
    'T':[0, 1, 1, 1, 0], 
    'U':[0, 1, 1, 1, 1], 
}).T #transposing to be species as rows and zones as columns

#gapped species
df = pd.DataFrame({
    'A':[1, 1, 1, 0, 1, 0, 1, 1, 1], 
    'B':[1, 0, 1, 0, 1, 1, 1, 0, 1], 
    'C':[1, 1, 1, 0, 1, 0, 1, 0, 1], 
    'D':[1, 1, 1, 0, 1, 0, 1, 0, 0], 
    'E':[1, 1, 1, 0, 1, 0, 0, 0, 0], 
    'F':[1, 0, 1, 0, 1, 1, 0, 0, 0], 
    'G':[1, 1, 1, 1, 0, 0, 0, 0, 0], 
    'H':[1, 1, 1, 0, 0, 0, 0, 0, 0], 
    'I':[1, 0, 1, 0, 0, 0, 0, 0, 0], 
    'J':[0, 0, 1, 0, 0, 1, 0, 0, 0], 
    'K':[1, 0, 1, 0, 0, 0, 0, 0, 0], 
    'L':[1, 0, 1, 0, 0, 0, 0, 0, 0], 
    'M':[1, 0, 0, 0, 0, 0, 0, 0, 0], 
    'N':[0, 0, 0, 0, 0, 1, 0, 0, 0], 
    'O':[1, 0, 0, 0, 0, 0, 0, 0, 0], 
    'P':[1, 0, 0, 0, 0, 0, 0, 0, 0], 
    'Q':[0, 0, 0, 0, 0, 1, 0, 0, 0], 
    'R':[1, 0, 0, 0, 0, 0, 0, 0, 0], 
    'S':[0, 0, 1, 0, 0, 0, 0, 0, 0], 
    'T':[0, 0, 1, 0, 0, 0, 0, 0, 0], 
    'U':[0, 0, 1, 0, 0, 0, 0, 0, 0], 
}).T #transposing to be species as rows and zones as columns 

#Cluster by kmeans with k=3
kmeans = sklearn.cluster.KMeans(n_clusters=3).fit(df)
cluster_labels = pd.Series(kmeans.labels_+1, index=df.index)

#Prepare for plotting
plot_data = df.multiply(cluster_labels,axis=0)
sorted_plot_data = plot_data.loc[cluster_labels.sort_values().index]

myColors = (
    (1.0, 1.0, 1.0, 1.0), #0-value color
    (0.4, 0.0, 0.4, 1.0), #cluster 1 color
    (0.0, 0.4, 0.4, 1.0), #cluster 2 color
    (0.4, 0.4, 0.0, 1.0), #cluster 3 color
)
my_cmap = LinearSegmentedColormap.from_list('Custom', myColors, len(myColors))

#Unsorted Heatmap
fig = plt.figure(figsize=(8,6))
ax = sns.heatmap(
    
    data = plot_data.T,
    cmap=my_cmap,
    cbar=False,
)
ax.invert_yaxis()
plt.title('Species in order colored by k-means cluster')
plt.show()
plt.close()

#Sorted Heatmap
fig = plt.figure(figsize=(8,6))
ax = sns.heatmap(
    data = sorted_plot_data.T,
    cmap=my_cmap,
    cbar=False,
)
ax.invert_yaxis()
plt.title('Species sorted and colored by k-means cluster, sorted alphabetically within cluster')
plt.show()
plt.close()

输出图

【讨论】：

这是一个聪明的解决方案，但起初我不想使用 k-means，因为该算法仅适用于各向同性数据的假设，而且我认为沿 2 个轴的方差也相似。这里的两个维度之一是“ABCD ...”，它是一个常规轴。我很确定它违反了一些假设。