The most frequent n-gram among all the words: answers

【Question title】: The n-gram that is the most frequent one among all the words
【Posted】: 2021-03-07 00:03:32
【Question description】:

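I came across the following programming interview problem:

Challenge 1: N-grams

An N-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot". For a given set of words and an n-gram length, your task is to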

• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)

Please note that your function will receive the following arguments:

• text
    ○ which is a string containing words separated by whitespaces
• ngramLength
    ○ which is an integer value giving the length of the n-gram

Data constraints

• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)

Efficiency constraints

• your function is expected to print the result in less than 2 seconds

Example

Input

text: "aaaab a0a baaab c"
ngramLength: 3

Output

aaa

Explanation

For the input presented above, the 3-grams sorted by frequency are:

• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1
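
The counts in this explanation can be checked with a few lines of Python. This is only a minimal sketch for verifying the example, not a reference solution; the function name most_frequent_ngram is made up here, and "lexicographically smallest" is assumed to mean plain character-code order (digits before uppercase before lowercase).

```python
from collections import Counter


def most_frequent_ngram(text, ngram_length):
    """Return (ngram, count) for the most frequent n-gram, or None if there is none.

    Ties are broken by picking the lexicographically smallest n-gram.
    """
    counts = Counter()
    for word in text.split():
        # A word of length L has L - n + 1 n-grams (none if L < n).
        for i in range(len(word) - ngram_length + 1):
            counts[word[i:i + ngram_length]] += 1
    if not counts:
        return None
    # Highest count first; among equal counts, the smallest string wins.
    return min(counts.items(), key=lambda item: (-item[1], item[0]))


print(most_frequent_ngram("aaaab a0a baaab c", 3))  # ('aaa', 3)
```

The sort key (-count, ngram) handles both requirements at once: the largest count gives the smallest first component, and ties fall through to string comparison.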

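If I have only one hour to solve this problem and I choose to solve it in C: is it a good idea to implement a hash table to count the frequency of the N-grams in that amount of time? After all, the C library has no hash table implementation...

If yes, I was thinking of implementing a hash table using separate chaining with ordered linked lists. Writing these implementations eats into the time you have to solve the problem...

Is that the fastest option?

Thank you!!!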

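【Question comments】:

Is this a real coding interview?
Are you sure a binary tree (e.g. AVL) cannot do the job?
Will you be asked for at most 3-grams? There are (26+26+10)^3 = 238328 possible 3-grams made of alphanumeric characters only, so a direct LUT looks feasible.
I would allocate the required number of buckets up front in one array (possible because you have an upper bound on the text length) and store only pointers to them in the hash table. Use a move-to-front / insert-at-back heuristic to make hash table retrieval faster, and sort the array at the end. Using a tree will be slower in practice.
Think about it. How many 3-grams are there in a text of 1000 characters?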
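
The direct lookup table suggested in the comments above could look roughly like this. It is a sketch under assumptions: the alphabet ordering, the dictionary CODE and the function name most_frequent_ngram_lut are chosen here for illustration, and the table is only practical for small n because it needs 62**n counters.

```python
import string

# 62 symbols: digits, then A-Z, then a-z. This is ASCII order, so a smaller
# table index always means a lexicographically smaller n-gram.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
CODE = {ch: i for i, ch in enumerate(ALPHABET)}
BASE = len(ALPHABET)  # 62


def most_frequent_ngram_lut(text, n=3):
    """Count n-grams in a flat table of BASE**n counters."""
    table = [0] * (BASE ** n)
    for word in text.split():
        for i in range(len(word) - n + 1):
            # Read the n-gram as an n-digit number in base 62.
            index = 0
            for ch in word[i:i + n]:
                index = index * BASE + CODE[ch]
            table[index] += 1
    best = max(table)
    if best == 0:
        return None
    # list.index returns the first (smallest) index holding the maximum,
    # which is the lexicographically smallest n-gram among the ties.
    index = table.index(best)
    chars = []
    for _ in range(n):
        index, digit = divmod(index, BASE)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars)), best


print(most_frequent_ngram_lut("aaaab a0a baaab c"))  # ('aaa', 3)
```

For n = 3 the table has 238328 entries, matching the figure in the comment. As for the last question: a text of L characters contains at most L - n + 1 n-grams, so 1000 characters give at most 998 3-grams.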

Tags: c algorithm n-gram

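【Solution 1】:

If implementation efficiency matters and you are using C, I would initialize an array of pointers to the starts of the n-grams in the string, use qsort to sort the pointers according to the n-gram each one points at, and then loop over the sorted array and work out the counts.

This should run fast enough, and there is no need to code any fancy data structures.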
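
The sort-then-count idea can be sketched as follows. This is an illustration in Python rather than the C the answer has in mind: integer offsets stand in for the pointers, list.sort with a key stands in for qsort with a comparator, and the function name most_frequent_ngram_sorted is an assumption made here.

```python
import re


def most_frequent_ngram_sorted(text, n):
    """Sort the start offsets of all n-grams, then count runs of equal n-grams."""
    # Collect the start offset of every n-gram that lies inside a single word.
    starts = []
    for match in re.finditer(r"\S+", text):
        starts.extend(range(match.start(), match.end() - n + 1))
    if not starts:
        return None
    # Plays the role of qsort with a comparator that compares n characters.
    starts.sort(key=lambda i: text[i:i + n])

    best, best_count = None, 0
    run, run_count = None, 0
    for i in starts:
        ngram = text[i:i + n]
        if ngram == run:
            run_count += 1
        else:
            run, run_count = ngram, 1
        # Strict '>' keeps the earlier, lexicographically smaller n-gram on ties.
        if run_count > best_count:
            best, best_count = run, run_count
    return best, best_count


print(most_frequent_ngram_sorted("aaaab a0a baaab c", 3))  # ('aaa', 3)
```

Because equal n-grams end up adjacent after sorting, the counting pass needs only the current run and the best run so far, and the tie-break comes for free from the sort order.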

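【Comments】:

The only thing I can think of that might be faster than this is to use offsets instead of actual pointers. Assuming 64-bit (native) pointers, you could halve the memory by using 4-byte offsets. A clever encoding might squeeze the required 18 bits into 2 bytes for even more.
@phs Sorting is O(n log(n)), while a hash-based solution is O(n). So this should not have the best performance. It is just a very simple approach.
At small scales, asymptotic limits matter less than hardware details. I suspect cache-line efficiency and the specifics of the chosen hash function would matter. Again, just guessing.
@phs I see no reason why jumping to a random bucket in a hash would use the cache any less efficiently than comparing two random pointers from an area. So hashing is always a win in that comparison. But if you create an array of n-grams and then sort it (rather than using pointers), you might beat hashing.
Nice answer. After sorting, you can do the counting in a single pass from the larger array index to the smaller one, using O(1) space.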

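【Solution 2】:

Sorry for posting Python, but this is what I would do; you might get some ideas for the algorithm from it. Note that this program handles an input an order of magnitude larger than the stated limit.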

from itertools import groupby

someText = "thibbbs is a test and aaa it may haaave some abbba reptetitions "
someText *= 40000
print(len(someText))
n = 3

# collect every n-gram of every word that has at least n characters
ngrams = []
for word in filter(lambda x: len(x) >= n, someText.split(" ")):
    for i in range(len(word)-n+1):
        ngrams.append(word[i:i+n])
        # you could inline all logic here
        # add to an ordered list for which the frequency is the key for ordering and the payload the actual n-gram

# sort so that equal n-grams are adjacent, then count each group
# (case-sensitive, so that the sort order and the grouping agree)
ngrams_freq = [[len(list(group)), key] for key, group in groupby(sorted(ngrams))]

ngrams_freq_sorted = sorted(ngrams_freq, reverse=True)

# keep every n-gram that shares the highest frequency
popular_ngrams = []

for freq in ngrams_freq_sorted:
    if freq[0] == ngrams_freq_sorted[0][0]:
        popular_ngrams.append(freq[1])
    else:
        break

print("Most popular ngram: " + sorted(popular_ngrams)[0])
# > 2560000
# > Most popular ngram: aaa
# > [Finished in 1.3s]

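【Comments】:

In my tests, maintaining an ngram/count dictionary was twice as fast as your solution. And it is simpler code.
I would expect as much, which is why I mentioned it in the inline comments. I have added an answer in C++ to my post. I think this is a fun little problem.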

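【Solution 3】:

So the basic approach to this problem is:

Find all n-grams in the string
Map all duplicate entries into a new structure that holds each n-gram and the number of times it occurs

You can find my C++ solution here: http://ideone.com/MNFSis

Given: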

const unsigned int MAX_STR_LEN = 250000;
const unsigned short NGRAM = 3;
const unsigned int NGRAMS = MAX_STR_LEN-NGRAM+1;
//a string of length L holds at most L-NGRAM+1 n-grams, so that is the maximum number
//of places we need to store them; each n-gram takes NGRAM+1 chars (one extra for '\0')
char ngrams[NGRAMS][NGRAM+1] = { 0 };

Then, for the first step, this is the code:

const char *ptr = str;
int idx = 0;
//notTerminated checks ptr[0] to ptr[NGRAM-1] are not '\0'
while (notTerminated(ptr)) { 
    //noSpace checks that ptr[0] to ptr[NGRAM-1] are letters (isalpha());
    //the task also allows digits, which would need isalnum() instead
    if (noSpace(ptr)) {
        //safely copy our current n-gram over to the ngrams array
        //we're iterating over ptr and because we're here we know ptr and the next NGRAM spaces
        //are valid letters
        for (int i=0; i<NGRAM; i++) {
            ngrams[idx][i] = ptr[i];
        }
        ngrams[idx][NGRAM] = '\0'; //important to zero-terminate
        idx++;
    }
    ptr++;
}

At this point we have a list of all the n-grams. Let's find the most popular one:

FreqNode head = { "HEAD", 0, 0, 0 }; //the start of our list

for (int i=0; i<NGRAMS; i++) {
    if (ngrams[i][0] == '\0') break;
    //insertFreqNode takes a start node; this is where we start to search for duplicates
    //the simplest description is like this:
    //  1 we search from head down each child; if we find a node whose text is equal to
    //    ngrams[i] then we update its frequency count
    //  2 if the freq is >= that of the current winner we place this node as head.next
    //  3 after the program is complete, our most popular nodes will be the first nodes
    //    I have not implemented sorting of these - it's an exercise for the reader ;)
    insertFreqNode(&head, ngrams[i]);
}

//as the list is ordered, head.next will always be the most popular n-gram
cout << "Winner is: " << head.next->str << " with " << head.next->freq << " occurrences" << endl;

Good luck!

【Comments】:

【Solution 4】:

Just for fun, I wrote a SQL version (SQL Server 2012):

if object_id('dbo.MaxNgram','IF') is not null
    drop function dbo.MaxNgram;
go

create function dbo.MaxNgram(
     @text      varchar(max)
    ,@length    int
) returns table with schemabinding as
return
    with 
    Delimiter(c) as ( select ' '),
    E1(N) as (
        select 1 from (values 
            (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
        )T(N)
    ),
    E2(N) as (
        select 1 from E1 a cross join E1 b
    ),
    E6(N) as (
        select 1 from E2 a cross join E2 b cross join E2 c
    ),
    tally(N) as (
        select top(isnull(datalength(@text),0))
             ROW_NUMBER() over (order by (select NULL))
        from E6
    ),
    cteStart(N1) as (
        select 1 union all
        select t.N+1 from tally t cross join delimiter 
            where substring(@text,t.N,1) = delimiter.c
    ),
    cteLen(N1,L1) as (
        select s.N1,
               isnull(nullif(charindex(delimiter.c,@text,s.N1),0) - s.N1,8000)
        from cteStart s
        cross join delimiter
    ),
    cteWords as (
        select ItemNumber = row_number() over (order by l.N1),
               Item       = substring(@text, l.N1, l.L1)
        from cteLen l
    ),
    mask(N) as ( 
        select top(@length) row_Number() over (order by (select NULL))
        from E6
    ),
    topItem as (
        select top 1
             substring(Item,m.N,@length) as Ngram
            ,count(*)                    as Length
        from cteWords   w
        cross join mask m
        where m.N     <= datalength(w.Item) + 1 - @length
          and @length <= datalength(w.Item) 
        group by 
            substring(Item,m.N,@length)
        order by 2 desc, 1 
    )
    select d.s
    from (
        select top 1 NGram,Length
        from topItem
    ) t
    cross apply (values (cast(NGram as varchar)),(cast(Length as varchar))) d(s)
;
go

When called with the sample input provided by the OP (here with one extra word, aab, appended)

set nocount on;
select s as [ ] from MaxNgram(
    'aaaab a0a baaab c aab'
   ,3
);
go

it yields, as desired,

------------------------------
aaa
3

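【Comments】:

【Solution 5】:

In case you are not bound to C, I wrote this Python script in about 10 minutes. It processes a 1.5 MB file containing more than 265,000 words, looking for 3-grams, in roughly 0.4 s (not counting the time to print the values on the screen).
The text used for the test is Ulysses by James Joyce; you can find it for free here: https://www.gutenberg.org/ebooks/4300

The word separators here are both the space and the newline \n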

import sys

# usage: script.py <text file> <n-gram length>
with open(sys.argv[1], 'r') as f:
    text = f.read()
ngram_len = int(sys.argv[2])
text = text.replace('\n', ' ')
# n-grams are counted case-insensitively here
words = [word.lower() for word in text.split(' ')]
ngrams = {}
for word in words:
    word_len = len(word)
    if word_len < ngram_len:
        continue
    for i in range(0, (word_len - ngram_len) + 1):
        ngram = word[i:i+ngram_len]
        if ngram in ngrams:
            ngrams[ngram] += 1
        else:
            ngrams[ngram] = 1
# group the n-grams by their frequency
ngrams_by_freq = {}
for key, val in ngrams.items():
    if val not in ngrams_by_freq:
        ngrams_by_freq[val] = [key]
    else:
        ngrams_by_freq[val].append(key)
# ascending order of frequency, so the most frequent n-grams are printed last
ngrams_by_freq = sorted(ngrams_by_freq.items())
for freq, grams in ngrams_by_freq:
    print('{} with frequency of {}'.format(grams, freq))

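【Comments】:

【Solution 6】:

You can convert the trigram into a RADIX50 code. See http://en.wikipedia.org/wiki/DEC_Radix-50

In radix50, the output value for a trigram fits into a 16-bit unsigned integer.

After that, you can use the radix-encoded trigram as an index into an array.

So your code would look like this: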

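#include <stdint.h>
#include <string.h>

// 64K counters; 32-bit so that a count cannot overflow on a 250,000-character text
uint32_t counters[1 << 16];

memset(counters, 0, sizeof(counters));

// radix50() is the encoding helper (not shown); txt must hold at least 2 characters
for(const char *p = txt; p[2] != 0; p++) 
  counters[radix50(p)]++;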

After that, just search the array for the maximum value and decode its index back into a trigram.
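
A rough Python sketch of the encode, count and decode steps described above. The 40-character ordering in RADIX50 is an assumption made for illustration and should be checked against the DEC table before relying on it; the function names are also made up here. Unlike the C loop, this version only counts trigrams inside words. Note two departures from the original task: case is folded, and ties are broken in RADIX50 order (letters before digits) rather than ASCII order.

```python
# 40 symbols, so three of them pack into a value below 40**3 = 64000 < 2**16.
# Assumed ordering: space, A-Z, '$', '.', one spare symbol, 0-9.
RADIX50 = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789"


def radix50_encode(trigram):
    """Pack three characters into one integer in range(64000); case is folded."""
    value = 0
    for ch in trigram.upper():
        value = value * 40 + RADIX50.index(ch)
    return value


def radix50_decode(value):
    """Inverse of radix50_encode (always returns upper-case text)."""
    chars = []
    for _ in range(3):
        value, digit = divmod(value, 40)
        chars.append(RADIX50[digit])
    return "".join(reversed(chars))


def most_frequent_trigram(text):
    """Use the encoded trigram as an index into a flat array of counters."""
    counters = [0] * (40 ** 3)
    for word in text.split():
        for i in range(len(word) - 2):
            counters[radix50_encode(word[i:i + 3])] += 1
    best = max(counters)
    if best == 0:
        return None
    # The first index holding the maximum is decoded back into a trigram.
    return radix50_decode(counters.index(best)), best


print(most_frequent_trigram("aaaab a0a baaab c"))  # ('AAA', 3)
```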

About 10 years ago, I used this trick while implementing the Wilbur-Khovayko algorithm for fuzzy search.

You can download the source code here: http://itman.narod.ru/source/jwilbur1.tar.gz.

【Comments】:

This is case-insensitive A-Z, 0-9, SPACE, DOLLAR, DOT and UNDEF. Enough for counting the trigrams of a text string.
But you do not know the ngramLength parameter in advance, and this relies heavily on the fact that n=3. Neat solution for trigrams, though.
Start with radix50, then: for each c in counters.sorteddesc find most frequent starting -> counters2 //now the only thing is to see if something could be bigger among the rest of the counters: if ( max(counters2) > next in counters and bigger than local max found before) RESULT else store local max, and proceed to the next item

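【Solution 7】:

You can solve this problem in O(nk) time, where n is the number of words and k is the average number of n-grams per word.

You are correct in thinking that a hash table is a good solution to the problem.

However, since you have limited time to code a solution, I would suggest using open addressing instead of linked lists. The implementation may be simpler: if you hit a collision, you just walk further along the table.

Also, be sure to allocate enough memory for your hash table: about twice the expected number of n-grams should be fine. Since the expected number of n-grams is ...

In terms of coding speed, the small input length (250,000) makes sorting and counting a feasible option. The quickest way is probably to generate an array of pointers to each n-gram, sort the array with an appropriate comparator, and then walk along it while keeping track of which n-gram appears the most.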
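
The open-addressing suggestion above can be sketched like this. It is an illustration under assumptions: linear probing is chosen as the probing scheme, Python's built-in hash() stands in for a hash function you would have to write yourself in C, and the function names are made up here. The table is sized at twice the number of n-grams, following the advice above.

```python
def count_ngrams_open_addressing(text, n):
    """Count n-grams in a fixed-size table using open addressing (linear probing)."""
    words = text.split()
    total = sum(max(len(word) - n + 1, 0) for word in words)
    # Twice the number of n-grams: the table can never fill up, so probing terminates.
    capacity = max(2 * total, 1)
    keys = [None] * capacity
    counts = [0] * capacity
    for word in words:
        for i in range(len(word) - n + 1):
            ngram = word[i:i + n]
            slot = hash(ngram) % capacity
            # On a collision, walk to the next slot until the key or a free slot is found.
            while keys[slot] is not None and keys[slot] != ngram:
                slot = (slot + 1) % capacity
            keys[slot] = ngram
            counts[slot] += 1
    return keys, counts


def most_frequent(keys, counts):
    """Scan the table once; prefer higher counts, then the smaller string."""
    best = None
    for key, count in zip(keys, counts):
        if key is None:
            continue
        if best is None or (-count, key) < (-best[1], best[0]):
            best = (key, count)
    return best


keys, counts = count_ngrams_open_addressing("aaaab a0a baaab c", 3)
print(most_frequent(keys, counts))  # ('aaa', 3)
```

Two parallel arrays and a probe loop are all the data structure there is, which is the appeal under time pressure compared with chained buckets.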

【Comments】:

【Solution 8】:

A simple Python solution to this problem

your_str = "aaaab a0a baaab c"
str_list = your_str.split(" ")
str_hash = {}
ngram_len = 3

for word in str_list:
    # slide a window of ngram_len characters across the word;
    # the range is empty for words shorter than ngram_len
    for start in range(len(word) - ngram_len + 1):
        ngram = word[start:start + ngram_len]
        str_hash[ngram] = str_hash.get(ngram, 0) + 1

# sort alphabetically first; the stable sort by frequency then keeps ties in dictionary order
keys_sorted = sorted(str_hash.items())
for ngram in sorted(keys_sorted, key=lambda x: x[1], reverse=True):
    print("\"%s\" with a frequency of %s" % (ngram[0], ngram[1]))

【Comments】: