Implementation of Affinity Propagation in Java code

【Question title】: Implementation of Affinity Propagation in Java code
【Posted】: 2013-08-06 17:21:43
【Question】:

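For the past week I have been trying to implement Affinity Propagation in Java. I followed the original paper by Frey and Dueck exactly as described, but I am not getting good exemplars.

The research paper can be found here: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf

Here is the code I wrote for the similarity function (based on the paper's example of clustering sentences):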

public static void calculateSimilarity() {
    try {
        for (int i = 0; i < tweets.size(); i++) { // for each tweet
            for (int j = 0; j < tweets.size(); j++) { // pair it with every tweet, including itself
                // split both tweets into lowercase tokens
                String[] firstTokens = tweets.get(i).toLowerCase().split(" ");
                String[] secondTokens = tweets.get(j).toLowerCase().split(" ");

                // store the summed cost in the similarity matrix
                if (i == j) { // self-similarity
                    similarity[i][j] = firstTokens.length * NEGATIVE_LOG_OF_DICTIONARY + ADJUSTMENT_FACTOR;
                    System.out.println(similarity[i][j]);
                } else {
                    // the per-word costs, summed by compare()
                    double cost = compare(firstTokens, secondTokens);
                    similarity[i][j] = cost; // assign the similarity
                }
            } // end inner for
        } // end outer for
    } catch (Exception e) {
        System.out.println(temp);
        e.printStackTrace();
    }
} // end method


public static double compare(String[] firstString, String[] secondString) {
    double cost = 0;
    for (int k = 0; k < firstString.length; k++) { // for each token of the first tweet
        for (int l = 0; l < secondString.length; l++) { // compare with each token of the second tweet
            // only consider pairs in which both words have at least 5 characters
            if (firstString[k].length() >= 5 && secondString[l].length() >= 5) {
                if (firstString[k].contains(secondString[l])) {
                    // match: add the negative log of the second tweet's word count
                    cost += -Math.log10(secondString.length);
                } else {
                    // no match: add the dictionary cost
                    cost += NEGATIVE_LOG_OF_DICTIONARY;
                }
            } // end length check
        } // end l loop
    } // end k loop

    return cost;
}

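This is how they say they computed the similarity between two data points (sentences): the similarity of sentence i to sentence k was set to the information-theoretic cost (S5) of encoding every word in sentence i using the words in sentence k and a dictionary of all words in the manuscript. For each word in sentence i, if the word matched a word in sentence k, the coding cost for the word was set to the negative logarithm of the number of words in sentence k (the cost of coding the index of the matched word); otherwise it was set to the negative logarithm of the number of words in the manuscript dictionary (the cost of coding the index of the word in the manuscript dictionary). A word was considered to match another word if either word was a substring of the other.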
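For reference, here is a minimal sketch of that rule read literally. The class name `SentenceSimilarity`, the helper `matches`, and the `dictionarySize` parameter are hypothetical names chosen for illustration, and the natural logarithm is an assumption, since the quoted passage does not state a base. One cost is added per word of sentence i, not per pair of words.

```java
import java.util.List;

class SentenceSimilarity {
    /** A word matches another if either one is a substring of the other. */
    static boolean matches(String a, String b) {
        return a.contains(b) || b.contains(a);
    }

    /**
     * Similarity of sentence i to sentence k: one cost per word of sentence i.
     * Assumes both sentences are already tokenized and contain no empty tokens.
     */
    static double similarity(List<String> sentenceI, List<String> sentenceK, int dictionarySize) {
        double total = 0.0;
        for (String word : sentenceI) {
            boolean matched = false;
            for (String other : sentenceK) {
                if (matches(word, other)) {
                    matched = true;
                    break; // one match is enough; the cost is added once per word
                }
            }
            // matched word: -log(|sentence k|); unmatched word: -log(|dictionary|)
            total += matched ? -Math.log(sentenceK.size()) : -Math.log(dictionarySize);
        }
        return total;
    }
}
```

With this reading, a sentence of n words always contributes exactly n terms to the sum, regardless of how long sentence k is.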

I have also written the availability and responsibility functions.

Availability:

public static double updateAvalibility(int datapoint, int candidate, double[][] a, double[][] r, double aOld) {
    double availibity;
    //ArrayList temp=new ArrayList();
    double total = 0;

    // self-availability
    if (datapoint == candidate) {
        for (int j = 0; j < tweets.size(); j++) {
            if (j == datapoint)
                continue;
            else if (r[j][candidate] < 0) // skip negative terms
                continue;
            else
                total += r[j][candidate]; // sum up r over the rows
        } // end for
        availibity = total; // the total becomes the availability
        System.out.println("Availability: " + availibity);
    } else {
        for (int j = 0; j < tweets.size(); j++) {
            if (j == candidate || j == datapoint)
                continue;
            else if (r[j][candidate] < 0) // skip negative terms
                continue;
            else
                total += r[j][candidate]; // sum r over all other rows
        } // end for

        availibity = r[candidate][candidate] + total; // self-responsibility plus the sum

        if (availibity < 0) // negative values are set to 0
            availibity = 0;
    } // end else

    return (1 - LAM) * availibity + (LAM * aOld); // return the damped value
}

Responsibility:

// Updates the responsibility. Takes the two competing data points, s, a, and the old r;
// returns the responsibility of i to k.
public static double updateResponsibility(int datapoint, int candidate, double[][] s, double[][] a, double rOld) {
    double responsibility;

    // temporary list of competing values
    ArrayList<Double> temp = new ArrayList<Double>();
    double max; // the max over the other candidates k'

    //################################
    // SETTING THE SELF-RESPONSIBILITY
    if (datapoint == candidate) {
        for (int k = 0; k < tweets.size(); k++) {
            if (k == candidate)
                continue;
            else
                temp.add(s[datapoint][k]); // store the similarities between this point and the others
        }
        max = Collections.max(temp); // the max of the similarities

        responsibility = similarity[datapoint][candidate] - max;
        System.out.println("s:" + similarity[datapoint][candidate] + "- m:" + max + "= responsibility: " + responsibility);
    } else {
        for (int j = 0; j < tweets.size(); j++) {
            if (j == candidate)
                continue;
            else
                temp.add(a[datapoint][j] + s[datapoint][j]); // a(i,k') + s(i,k'); the max is taken below
        } // end for

        // max of a + s over the other candidates k'
        max = Collections.max(temp);

        responsibility = s[datapoint][candidate] - max; // the similarity minus the max
    } // end else
    return ((1 - LAM) * responsibility) - (LAM * rOld); // dampen the responsibility and return
} // end method

Why am I getting bad exemplars even though I use the adjustment factors listed in the paper? What am I doing wrong?

Any help would be appreciated.

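【Comments】:

Did you find the problem?
Yes, I did. The algorithm was wrong; I had not interpreted it correctly from the paper.
Great, could you share the working version?

Tags: java algorithm machine-learning cluster-computing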

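【Answer 1】:

I couldn't find your problem, but you could use the following library instead:

Official site, GitHub site, and Community site.

It contains a good implementation of the Affinity Propagation algorithm.

【Comments】: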
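For comparison with the code in the question, here is a minimal sketch of the message updates as I understand them from the Frey and Dueck paper. It is not the library's code. The class name `AffinityPropagationSketch`, the method `cluster`, and its parameters are hypothetical, and it assumes at least two data points with the preferences already placed on the diagonal of `s`. Three points seem to differ from the posted code: the self-responsibility uses the same rule as every other responsibility (the max is over a + s, not s alone), the off-diagonal availability is capped at 0 from above with `min(0, ...)`, and damping adds λ times the old value.

```java
class AffinityPropagationSketch {
    /**
     * Runs a fixed number of damped message-passing iterations.
     * s[i][k] is the similarity of point i to candidate k; s[k][k] is the preference.
     * Returns, for each point i, the index k that maximizes a(i,k) + r(i,k).
     */
    static int[] cluster(double[][] s, double lambda, int iterations) {
        int n = s.length; // assumed to be at least 2
        double[][] r = new double[n][n]; // responsibilities, start at 0
        double[][] a = new double[n][n]; // availabilities, start at 0

        for (int it = 0; it < iterations; it++) {
            // r(i,k) <- s(i,k) - max over k' != k of [a(i,k') + s(i,k')], also for i == k
            for (int i = 0; i < n; i++) {
                for (int k = 0; k < n; k++) {
                    double max = Double.NEGATIVE_INFINITY;
                    for (int kp = 0; kp < n; kp++) {
                        if (kp != k) max = Math.max(max, a[i][kp] + s[i][kp]);
                    }
                    r[i][k] = (1 - lambda) * (s[i][k] - max) + lambda * r[i][k];
                }
            }
            // a(i,k) <- min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
            // a(k,k) <- sum over i' != k of max(0, r(i',k))
            for (int i = 0; i < n; i++) {
                for (int k = 0; k < n; k++) {
                    double sum = 0;
                    for (int ip = 0; ip < n; ip++) {
                        if (ip != i && ip != k) sum += Math.max(0, r[ip][k]);
                    }
                    double update = (i == k) ? sum : Math.min(0, r[k][k] + sum);
                    a[i][k] = (1 - lambda) * update + lambda * a[i][k];
                }
            }
        }

        // each point picks the candidate with the largest a(i,k) + r(i,k)
        int[] exemplar = new int[n];
        for (int i = 0; i < n; i++) {
            int best = 0;
            for (int k = 1; k < n; k++) {
                if (a[i][k] + r[i][k] > a[i][best] + r[i][best]) best = k;
            }
            exemplar[i] = best;
        }
        return exemplar;
    }
}
```

A call such as `AffinityPropagationSketch.cluster(similarity, 0.5, 200)` would return one exemplar index per tweet; the damping factor and iteration count here are illustrative values, not recommendations from the paper. The sketch runs a fixed number of iterations and has no convergence check, which a real implementation would normally add.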