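【Question title】: Extract keyphrases from text (1-4 word ngrams)
【Posted】: 2021-11-13 17:56:02
【Question】:

What is the best way to extract keyphrases from a block of text? I'm writing a tool to do keyword extraction: something like this. I've found a few libraries for Python and Perl that extract n-grams, but I'm writing this in Node, so I need a JavaScript solution. If there aren't any existing JavaScript libraries, could someone explain how to do this so I can write it myself?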

【Comments】:

    Tags: javascript keyword n-gram


    【Answer 1】:

    I like the idea, so I've implemented it: see below (descriptive comments are included).
    Preview at: https://jsfiddle.net/WsKMx

    /*@author Rob W, created on 16-17 September 2011, on request for Stack Overflow (http://stackoverflow.com/q/7085454/938089)
     * Modified on 17 July 2012, fixed IE bug by replacing [,] with [null]
     * This script counts words and phrases. For simplicity and efficiency,
     * there's only one loop through the block of text.
     * 100% accuracy requires much more computing power, which is usually unnecessary
     **/
    
    
    var text = "A quick brown fox jumps over the lazy old bartender who said 'Hi!' as a response to the visitor who presumably assaulted the maid's brother, because he didn't pay his debts in time. In time in time does really mean in time. Too late is too early? Nonsense! 'Too late is too early' does not make any sense.";
    
    var atLeast = 2;       // Show results with at least .. occurrences
    var numWords = 5;      // Show statistics for one to .. words
    var ignoreCase = true; // Case-sensitivity
    var REallowedChars = /[^a-zA-Z'\-]+/g;
     // RE pattern to select valid characters. Invalid characters are replaced with a whitespace
    
    var i, j, k, textlen, len, s;
    // Prepare key hash
    var keys = [null]; //"keys[0] = null", a word boundary with length zero is empty
    var results = [];
    numWords++; //for human logic, we start counting at 1 instead of 0
    for (i=1; i<=numWords; i++) {
        keys.push({});
    }
    
    // Remove all irrelevant characters
    text = text.replace(REallowedChars, " ").replace(/^\s+/,"").replace(/\s+$/,"");
    
    // Create a hash
    if (ignoreCase) text = text.toLowerCase();
    text = text.split(/\s+/);
    for (i=0, textlen=text.length; i<textlen; i++) {
        s = text[i];
        keys[1][s] = (keys[1][s] || 0) + 1;
        for (j=2; j<=numWords; j++) {
            if(i+j <= textlen) {
                s += " " + text[i+j-1];
                keys[j][s] = (keys[j][s] || 0) + 1;
            } else break;
        }
    }
    
    // Prepares results for advanced analysis
    for (k=1; k<=numWords; k++) {
        results[k] = [];
        var key = keys[k];
        for (var phrase in key) {
            // Keep only phrases that occur at least `atLeast` times
            if(key[phrase] >= atLeast) results[k].push({"word":phrase, "count":key[phrase]});
        }
    }
    
    // Result parsing
    var outputHTML = []; // Buffer data. This data is used to create a table using `.innerHTML`
    
    var f_sortDescending = function(x,y) {return y.count - x.count;}; // highest count first
    for (k=1; k<numWords; k++) {
        results[k].sort(f_sortDescending);//sorts results
        
        // Customize your output. For example:
        var words = results[k];
        if (words.length) outputHTML.push('<td colSpan="3" class="num-words-header">'+k+' word'+(k==1?"":"s")+'</td>');
        for (i=0,len=words.length; i<len; i++) {
            
            //Characters have been validated. No fear for XSS
            outputHTML.push("<td>" + words[i].word + "</td><td>" +
               words[i].count + "</td><td>" +
               Math.round(words[i].count/textlen*10000)/100 + "%</td>");
               // textlen defined at the top
               // The relative occurrence has a precision of 2 digits.
        }
    }
    outputHTML = '<table id="wordAnalysis"><thead><tr>' +
                  '<td>Phrase</td><td>Count</td><td>Relativity</td></tr>' +
                  '</thead><tbody><tr>' +outputHTML.join("</tr><tr>")+
                   "</tr></tbody></table>";
    document.getElementById("RobW-sample").innerHTML = outputHTML;
    /*
    CSS:
    #wordAnalysis td{padding:1px 3px 1px 5px}
    .num-words-header{font-weight:bold;border-top:1px solid #000}
    
    HTML:
    <div id="RobW-sample"></div>
    */
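
Since the question targets Node, where `document` is not available, the same single-pass counting can be wrapped in a function that returns data instead of HTML. The following is a minimal sketch under that assumption, not Rob W's code: `countNgrams` and its parameters `maxWords` and `atLeast` are hypothetical names, and a `Map` replaces the plain object so that words such as "constructor" do not collide with inherited properties.

```javascript
// Hypothetical helper (not part of the original answer): counts every phrase of
// 1..maxWords consecutive words and returns those seen at least `atLeast` times.
function countNgrams(text, maxWords, atLeast) {
    // Keep letters, apostrophes and hyphens; everything else becomes a space.
    const words = text.toLowerCase().replace(/[^a-z'\-]+/g, " ").trim().split(/\s+/);
    if (words[0] === "") return []; // nothing left after cleaning

    const counts = new Map(); // phrase -> number of occurrences
    for (let i = 0; i < words.length; i++) {
        let phrase = "";
        // Grow the phrase one word at a time, starting at position i.
        for (let j = 0; j < maxWords && i + j < words.length; j++) {
            phrase += (j ? " " : "") + words[i + j];
            counts.set(phrase, (counts.get(phrase) || 0) + 1);
        }
    }

    return Array.from(counts, ([phrase, count]) => ({ phrase, count }))
        .filter((entry) => entry.count >= atLeast)
        .sort((a, b) => b.count - a.count); // most frequent first
}
```

For example, `countNgrams("in time in time", 2, 2)` yields entries for "in", "in time" and "time", each with a count of 2; "time in" occurs only once and is filtered out.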
    

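    【Comments】:

    • I've updated the code to fix a bug in IE8. The bug was reported by mail, and I've pasted the mail and my reply (which provides the fix and a detailed explanation) here: pastebin.com/7Edx88Gp
    • Nice, still helping people years later
    • It would be good to exclude so-called stop words, e.g. the, a, they, is, etc.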
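
One way to act on the stop-word suggestion is to filter the counted phrases afterwards. The sketch below assumes entries shaped like the `{word, count}` objects in the answer's `results[k]` arrays. `STOP_WORDS` and `dropStopPhrases` are hypothetical names, the word list is only illustrative, and the rule (drop a phrase that starts or ends with a stop word) is one possible choice rather than the thread's.

```javascript
// Small illustrative stop-word list; a real list would be much longer.
const STOP_WORDS = new Set(["the", "a", "an", "they", "is", "in", "to", "as", "of"]);

// Keep a phrase only if neither its first nor its last word is a stop word.
// Stop words in the middle ("king of spain") are allowed through.
function dropStopPhrases(entries) {
    return entries.filter(({ word }) => {
        const parts = word.split(" ");
        return !STOP_WORDS.has(parts[0]) && !STOP_WORDS.has(parts[parts.length - 1]);
    });
}
```

Filtering after counting keeps the single pass over the text unchanged, at the cost of counting some phrases that are later discarded.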
    【Answer 2】:

    I don't know of such a library in JavaScript, but the logic is:

    1. Split the text into an array
    2. Then sort and count
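
A minimal sketch of the sort-and-count steps above, for single words only. `countBySorting` is a hypothetical name, and the tokenizing regex is an assumption borrowed from the character set used in Answer 1.

```javascript
// Sort the words so equal words become adjacent, then count each run.
function countBySorting(text) {
    const words = text.toLowerCase().split(/[^a-z'\-]+/).filter(Boolean).sort();
    const result = [];
    for (const word of words) {
        const last = result[result.length - 1];
        if (last && last.word === word) {
            last.count++; // same word as the previous one: extend the run
        } else {
            result.push({ word, count: 1 }); // a new run starts here
        }
    }
    return result;
}
```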

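    Or

    1. Split the text into an array
    2. Create a secondary array
    3. Loop through each item of the first array
    4. Check whether the current item exists in the secondary array
    5. If it does not exist, add it with the item as the key
    6. Otherwise, increment the value stored under that item's key. HTH

    Ivo Stoykov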
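
The second list of steps can be sketched as follows, using a `Map` in the role of the "secondary array". `countWords` is a hypothetical name and the tokenizing regex is an assumption; like the steps themselves, it handles single words only.

```javascript
// Count how often each word occurs, following the six steps above.
function countWords(text) {
    const words = text.toLowerCase().split(/[^a-z'\-]+/).filter(Boolean); // step 1
    const counts = new Map();                                             // step 2
    for (const word of words) {                                           // step 3
        if (!counts.has(word)) {                                          // step 4
            counts.set(word, 1);                                          // step 5
        } else {
            counts.set(word, counts.get(word) + 1);                       // step 6
        }
    }
    return counts;
}
```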

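    【Comments】:

    • This doesn't do what I want b/c it doesn't extract multi-word ngrams... it only works for single words
    • Look here -> valuetype.wordpress.com/2011/08/24/… this is a sample that counts single words only, but it can easily be extended to 3 or 4 words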