Data pre-processing for input data when clustering with CLUTO: answers

【Question title】: Data pre-processing for input data when clustering with CLUTO
【Posted】: 2017-01-13 23:13:06
【Question description】:

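I am trying to cluster some words based on the pairwise similarity between them. Part of my data is shown below (just a sample, "animal.txt", laid out like an adjacency matrix).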

        cat  dog  horse  ostrich
cat     5    4    3      2
dog     4    5    1      2
horse   3    1    5      4
ostrich 2    2    4      5

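The larger the number, the more similar the two words are. From data in this format I want to build clusters. (For example, if I ask for 2 clusters, the result should be (cat, dog), (horse, ostrich).)

I tried to use CLUTO to build the clusters.

First, I have to restructure the input file before clustering with CLUTO. So I used doc2mat (http://glaros.dtc.umn.edu/gkhome/files/fs/sw/cluto/doc2mat.html), but I don't know how to use it correctly to produce the CLUTO input files (such as the mat file and the label files). And once the CLUTO input files exist, how do I build clusters from the data above?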

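【Question discussion】:

What data do you expect to see in the output of the pre-processing script?
After pre-processing with doc2mat, I want the mat file plus the column and row label files. These are the inputs to CLUTO.

Tags: perl cluster-analysis hierarchical-clustering cluto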

【Solution 1】:

Since your data is an adjacency matrix, the corresponding CLUTO input file is a so-called GraphFile, not a MatrixFile, so doc2mat does not help here.
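
Because a graph file describes an n-by-n adjacency matrix, it may be worth checking that the similarity table really is square and symmetric before converting it. The sketch below is only an illustration: the helper name `is_symmetric` is hypothetical, and symmetry is an assumption about undirected similarity data rather than something stated in the answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the matrix (a list of row array refs) is square and symmetric.
# Hypothetical helper, not part of CLUTO or of the scripts in this answer.
sub is_symmetric {
    my @m = @_;
    my $n = @m;
    for my $i (0 .. $n - 1) {
        return 0 unless @{ $m[$i] } == $n;              # each row needs n entries
        for my $j (0 .. $i - 1) {
            return 0 unless $m[$i][$j] == $m[$j][$i];   # compare mirrored cells
        }
    }
    return 1;
}

# The similarity values from the question's animal.txt.
my @animal = (
    [5, 4, 3, 2],
    [4, 5, 1, 2],
    [3, 1, 5, 4],
    [2, 2, 4, 5],
);
print is_symmetric(@animal) ? "symmetric\n" : "not symmetric\n";
```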

This program, txt2graph.pl, converts a file like your sample "animal.txt" into a graph file and a row label file:

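#!/usr/bin/perl
use strict;
use warnings;

my @F = split ' ', <>;          # begin reading txt file, read column headers
(my $GraphFile = $ARGV) =~ s/(\.txt)?$/.graph/;    # animal.txt -> animal.graph
my $LabelFile = "$GraphFile.rlabel";
open my $label, '>', $LabelFile or die "Cannot write $LabelFile: $!";
open my $graph, '>', $GraphFile or die "Cannot write $GraphFile: $!";
print $graph scalar(@F), "\n";  # output number of vertices=objects=columns=rows
while (<>)
{                               # process each object row
    next unless /\S/;           # skip blank lines
    chomp;
    my ($name, $numbers) = split ' ', $_, 2;    # split into name, numbers
    print $label "$name\n";     # output name
    print $graph "$numbers\n";  # output numbers
}
close $label or die "Cannot close $LabelFile: $!";
close $graph or die "Cannot close $GraphFile: $!";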
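
To see what the conversion produces without touching any files, the same steps can be sketched in memory. This is a minimal illustration, not a replacement for the script above: the function name `table_to_graph` is hypothetical, and it simply returns the graph file and label file contents as strings for the question's sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn the text of a labelled similarity table into the contents of a dense
# graph file and a row label file, both returned as strings.
# Hypothetical helper for illustration only.
sub table_to_graph {
    my ($text) = @_;
    my @lines  = grep { /\S/ } split /\n/, $text;   # drop blank lines
    my @header = split ' ', shift @lines;           # column names
    my $graph  = scalar(@header) . "\n";            # first line: vertex count
    my $labels = '';
    for my $line (@lines) {
        my ($name, @values) = split ' ', $line;     # row name, then its numbers
        $labels .= "$name\n";
        $graph  .= join(' ', @values) . "\n";
    }
    return ($graph, $labels);
}

# The sample table from the question.
my $animal = <<'END';
        cat  dog  horse  ostrich
cat     5    4    3      2
dog     4    5    1      2
horse   3    1    5      4
ostrich 2    2    4      5
END

my ($graph, $labels) = table_to_graph($animal);
print "graph file:\n$graph\nlabel file:\n$labels";
```

For the sample data, the graph contents start with the vertex count 4 followed by the four rows of numbers, and the label contents list cat, dog, horse and ostrich, one per line.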

Once CLUTO has finished clustering, this program, pclusters.pl, prints the result in the output format you want:

#!/usr/bin/perl
use strict;
use warnings;

(my $LabelFile = $ARGV[0]) =~ s/(\.clustering\.\d+)?$/.rlabel/;
open my $fh, '<', $LabelFile or die "Cannot read $LabelFile: $!";
chomp(my @label = <$fh>);                       # read labels
close $fh;
my @cluster;
while (<>)
{
    chomp;
    next unless /^\d+$/;                        # skip blank lines and negative (non-cluster) IDs
    push @{ $cluster[$_] }, $label[$. - 1];     # add label to its cluster
}
foreach my $cluster (grep { defined } @cluster)
{
    print "(", join(', ', @$cluster), ")\n";    # print a cluster's labels
}
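
The grouping step can also be tried on its own, without running CLUTO. The sketch below assumes a clustering is given as one cluster ID per object, in the same order as the labels; the function name `group_labels` and the sample IDs are hypothetical and are not real CLUTO output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group labels by cluster ID: $ids->[$i] is the cluster of $labels->[$i].
# Returns one array ref of labels per cluster ID, in ID order.
sub group_labels {
    my ($labels, $ids) = @_;
    my @clusters;
    for my $i (0 .. $#$ids) {
        next if $ids->[$i] < 0;                     # negative IDs are not cluster numbers
        push @{ $clusters[ $ids->[$i] ] }, $labels->[$i];
    }
    return map { $_ // [] } @clusters;              # empty list for unused IDs
}

my @labels = qw(cat dog horse ostrich);
my @ids    = (0, 0, 1, 1);      # hypothetical clustering, not real CLUTO output
print "(", join(', ', @$_), ")\n" for group_labels(\@labels, \@ids);
```

With these made-up IDs the sketch prints (cat, dog) and (horse, ostrich), matching the output format asked for in the question.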

The whole procedure is then:

> txt2graph.pl animal.txt
> scluster animal.graph 2
> pclusters.pl animal.graph.clustering.2
