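【Question title】: Finding collocation patterns in Java
【Posted】: 2012-11-01 23:19:27
【Question】:

I am working on a project that needs collocations, and I wrote the code below to extract them. It takes a string and returns a list of the collocation patterns found in that string. I use the Stanford POS tagger for tagging.

I would like your advice on the code: I process large amounts of text, and it seems slow. Any suggestions for improving it would be highly appreciated.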

/**
*
*  A COLLOCATION is an expression consisting of two or more words that
*  correspond to some conventional way of saying things.
* 
*  I used the seven part-of-speech tag patterns for collocation filtering that
*  were suggested by Justeson and Katz (1995).
*  These patterns are:
* 
*  -----------------------------------------
*  |Tag |     Pattern Example              |
*  -----------------------------------------
*  |AN  | linear function                  |
*  |NN  | regression coefficients          |
*  |AAN | Gaussian random variable         |
*  |ANN | cumulative distribution function |
*  |NAN | mean squared error               |
*  |NNN | class probability function       |
*  |NPN | degrees of freedom               |
*  -----------------------------------------
*  Where A=adjective, P=preposition, & N=noun.
* 
*  The Stanford POS tagger was used for the extraction process.
*  See: http://nlp.stanford.edu/software/tagger.shtml#Download
* 
*  more on collocation:    http://nlp.stanford.edu/fsnlp/promo/colloc.pdf
*  more on POS:            http://acl.ldc.upenn.edu/J/J93/J93-2004.pdf
*  
*/

import java.io.IOException;
import java.util.ArrayList;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class GetCollocations {
    public static ArrayList<String> GetCollocations(String text) throws IOException, ClassNotFoundException {
        MaxentTagger tagger = new MaxentTagger("taggers/wsj-0-18-left3words.tagger");
        // Tokens are expected in word_TAG form, separated by whitespace.
        String[] tagged = tagger.tagString(text).split("\\s+");

        ArrayList<String> collocations = new ArrayList<String>();
        // Every pattern needs at least two tokens, so stop one short of the end.
        for (int i = 0; i + 1 < tagged.length; i++) {
            // Three-word patterns are only checked when a third token exists.
            boolean hasThird = i + 2 < tagged.length;

            String pot = tagged[i].substring(tagged[i].indexOf("_") + 1);
            if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {

                pot = tagged[i + 1].substring(tagged[i + 1].indexOf("_") + 1);
                if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                    // NN
                    collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]));

                    if (hasThird) {
                        pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                        if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                            // NNN
                            collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                        }
                    }

                } else if (hasThird && (pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS"))) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        // NAN
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }

                } else if (hasThird && pot.equals("IN")) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        // NPN
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }
                }

            } else if (pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS")) {
                pot = tagged[i + 1].substring(tagged[i + 1].indexOf("_") + 1);
                if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                    // AN
                    collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]));

                    if (hasThird) {
                        pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                        if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                            // ANN
                            collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                        }
                    }

                } else if (hasThird && (pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS"))) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        // AAN
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }
                }
            }
        }
        return collocations;
    }

    // Strips the "_TAG" suffix from a tagged token such as "function_NN".
    public static String GetWordWithoutTag(String wordWithTag) {
        return wordWithTag.substring(0, wordWithTag.indexOf("_"));
    }
}

Tags: java nlp stanford-nlp


    【Answer 1】:

    If you are processing close to 15,000 words per second, you have hit the limit of the POS tagger. According to the Stanford POS tagger FAQ:

    on a 2008 nothing-special Intel server, it tags about 15000 words per second
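
One cost that throughput figure does not cover: GetCollocations in the question constructs a MaxentTagger on every call, and the constructor reads the model file from disk. If the method is called many times, that load is probably a large share of the running time. Below is a minimal sketch of building an expensive object once and reusing it. LazyResource and LazyResourceDemo are hypothetical names, and a plain String stands in for the tagger so the example runs without the Stanford jar.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper: builds an expensive object once and reuses it.
 * For the question's code the supplier would be something like
 * () -> new MaxentTagger("taggers/wsj-0-18-left3words.tagger")
 * (wrap checked exceptions if your tagger version declares them).
 */
final class LazyResource<T> {
    private final Supplier<T> loader;
    private T value; // null until the first get()

    LazyResource(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Runs the loader on the first call only; later calls return the cached object. */
    synchronized T get() {
        if (value == null) {
            value = loader.get();
        }
        return value;
    }
}

final class LazyResourceDemo {
    static int loads = 0;

    public static void main(String[] args) {
        // Stand-in for the tagger: a String that is "expensive" to create.
        LazyResource<String> tagger = new LazyResource<>(() -> {
            loads++;
            return "model";
        });
        for (int i = 0; i < 1000; i++) {
            tagger.get();
        }
        System.out.println("loads = " + loads); // prints "loads = 1"
    }
}
```

A `private static final` tagger field would do the same job with less code; the lazy wrapper only matters if the model should not be loaded until it is first needed.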
    

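    The rest of your algorithm looks fine, but if you really want to squeeze more out of it, you could preallocate an array as a static class variable instead of using an ArrayList. That trades an upfront memory cost for not having to instantiate an ArrayList on every call or pay for growing it as elements are added (each add is amortized O(1), but a resize copies all n existing elements).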
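
A possible middle ground, sketched here as an assumption rather than as part of the original advice, is to keep the ArrayList but pass a capacity hint so the backing array never has to grow. PresizedListDemo and adjacentPairs are hypothetical names; in GetCollocations each position adds at most two entries, so 2 * tagged.length would be a safe upper bound there.

```java
import java.util.ArrayList;
import java.util.List;

final class PresizedListDemo {
    /**
     * Collects every adjacent word pair. The capacity hint is an upper bound
     * (n tokens yield at most n - 1 pairs), so the backing array never grows.
     */
    static List<String> adjacentPairs(String[] words) {
        int capacity = Math.max(0, words.length - 1);
        List<String> pairs = new ArrayList<>(capacity);
        for (int i = 0; i + 1 < words.length; i++) {
            pairs.add(words[i] + " " + words[i + 1]);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // prints "[mean squared, squared error]"
        System.out.println(adjacentPairs("mean squared error".split(" ")));
    }
}
```

Unlike a shared static array, a list created per call needs no separate element count and is safe to use from several threads at once.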

    Also, as a readability suggestion, you could use a few private methods to check which part of speech the pot variable holds:

    private static boolean _isNoun(String pot) {
        return pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS");
    }

    private static boolean _isAdjective(String pot) {
        return pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS");
    }
    

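    Also, if I'm not mistaken, you should be able to simplify what you are doing by combining some of the if statements. This won't really speed up your code, but it will make it easier to work with. Read it carefully; I have only tried to simplify your logic to illustrate the point. Keep in mind that the following code is untested: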

    public static ArrayList<String> GetCollocations(String text) throws IOException, ClassNotFoundException {
        MaxentTagger tagger = new MaxentTagger("taggers/wsj-0-18-left3words.tagger");
        String[] tagged = tagger.tagString(text).split("\\s+");
        ArrayList<String> collocations = new ArrayList<String>();

        // Every pattern needs at least two tokens, so stop one short of the end.
        for (int i = 0; i + 1 < tagged.length; i++) {
            // Three-word patterns are only checked when a third token exists.
            boolean hasThird = i + 2 < tagged.length;
            String first = tagged[i].substring(tagged[i].indexOf("_") + 1);

            if (_isNoun(first) || _isAdjective(first)) {
                String pot = tagged[i + 1].substring(tagged[i + 1].indexOf("_") + 1);

                if (_isNoun(pot) || _isAdjective(pot)) {
                    if (_isNoun(pot)) {
                        // AN, NN
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]));
                    }
                    if (hasThird) {
                        pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                        if (_isNoun(pot)) {
                            // AAN, ANN, NAN, NNN
                            collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                        }
                    }

                } else if (hasThird && _isNoun(first) && pot.equals("IN")) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (_isNoun(pot)) {
                        // NPN
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }
                }
            }
        }
        return collocations;
    }
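
Taking the simplification one step further, the seven patterns can be matched from a lookup table instead of nested if statements. The sketch below is an alternative under stated assumptions, not the code from the question: PatternCollocations, tagClass, and extract are hypothetical names, and the input is assumed to be already tagged in word_TAG form with Penn Treebank tags, so it runs without the tagger.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PatternCollocations {
    // The seven Justeson and Katz patterns from the question's table.
    private static final Set<String> PATTERNS = new HashSet<>(
            Arrays.asList("AN", "NN", "AAN", "ANN", "NAN", "NNN", "NPN"));

    /** Maps a Penn Treebank tag to A, N, P, or '-' for anything else. */
    static char tagClass(String tag) {
        if (tag.startsWith("NN")) return 'N'; // NN, NNS, NNP, NNPS
        if (tag.startsWith("JJ")) return 'A'; // JJ, JJR, JJS
        if (tag.equals("IN")) return 'P';
        return '-';
    }

    /** Expects tokens in word_TAG form, e.g. "linear_JJ". */
    static List<String> extract(String[] tagged) {
        String[] words = new String[tagged.length];
        StringBuilder classes = new StringBuilder();
        // Split each token once, keeping one class character per token.
        for (int i = 0; i < tagged.length; i++) {
            int sep = tagged[i].lastIndexOf('_');
            words[i] = sep < 0 ? tagged[i] : tagged[i].substring(0, sep);
            classes.append(sep < 0 ? '-' : tagClass(tagged[i].substring(sep + 1)));
        }
        List<String> collocations = new ArrayList<>();
        // Check every two- and three-token window against the pattern table.
        for (int i = 0; i < words.length; i++) {
            for (int len = 2; len <= 3 && i + len <= words.length; len++) {
                if (PATTERNS.contains(classes.substring(i, i + len))) {
                    collocations.add(String.join(" ", Arrays.copyOfRange(words, i, i + len)));
                }
            }
        }
        return collocations;
    }

    public static void main(String[] args) {
        String[] tagged = "the_DT mean_NN squared_JJ error_NN of_IN the_DT model_NN".split(" ");
        // prints "[mean squared error, squared error]"
        System.out.println(extract(tagged));
    }
}
```

Adding or removing a pattern then means editing the PATTERNS set only, and the window bounds check rules out reading past the end of the token array.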
    
