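【Question title】: How to compute skipgrams in python?
【Posted】: 2015-08-06 05:44:34
【Question】:

A k-skipgram is an ngram which is a superset of all ngrams and of each (k-i)-skipgram until (k-i) == 0 (which includes 0-skip-grams). So how can these skipgrams be computed efficiently in Python?

Following is the code I tried, but it does not behave as expected: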

<pre>
input_list = ['all', 'this', 'happened', 'more', 'or', 'less']

def find_skipgrams(input_list, N, K):
    bigram_list = []
    nlist = []

    K = 1
    for k in range(K + 1):
        for i in range(len(input_list) - 1):
            if i + k + 1 < len(input_list):
                nlist = []
                for j in range(N + 1):
                    if i + k + j + 1 < len(input_list):
                        nlist.append(input_list[i + k + j + 1])

            bigram_list.append(nlist)
    return bigram_list

</pre>

Calling print(find_skipgrams(['all', 'this', 'happened', 'more', 'or', 'less'], 2, 1)) gives the following output:

[['this', 'happened', 'more'], ['happened', 'more', 'or'], ['more', 'or', 'less'], ['or', 'less'], ['less'], ['happened', 'more', 'or'], ['more', 'or', 'less'], ['or', 'less'], ['less'], ['less']]

The code listed here also does not give the correct output: https://github.com/heaven00/skipgram/blob/master/skipgram.py

print(skipgram_ndarray("What is your name")) gives: ['What,is', 'is,your', 'your,name', 'name,', 'What,your', 'is,name']

name is a unigram!

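【Question comments】:

What have you tried?
@msw Updated the question!!

Tags: python nlp n-gram language-model

【Solution 1】:

From the paper that the OP links, the following string:

Insurgents killed in ongoing fighting

Yields:

2-skip-bi-grams = {insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}

2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing, insurgents killed fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}.

With a slight modification to NLTK's ngrams code (https://github.com/nltk/nltk/blob/develop/nltk/util.py#L383):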

from itertools import chain, combinations

def pad_sequence(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    return sequence

def skipgrams(sequence, n, k, pad_left=False, pad_right=False, pad_symbol=None):
    sequence_length = len(sequence)
    sequence = iter(sequence)
    sequence = pad_sequence(sequence, n, pad_left, pad_right, pad_symbol)

    if sequence_length + pad_left + pad_right < k:
        raise Exception("The length of sentence + padding(s) < skip")

    if n < k:
        raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")    

    history = []
    nk = n+k

    # Return point for recursion.
    if nk < 1:
        return
    # If n+k is longer than the sequence, reduce k by 1 and recurse.
    elif nk > sequence_length:
        for ng in skipgrams(list(sequence), n, k-1):
            yield ng
        # The iterator is exhausted at this point, so stop here; falling
        # through to next() below raises inside a generator on Python 3.7+.
        return

    while nk > 1: # Collect the first n+k-1 items as the initial history.
        history.append(next(sequence))
        nk -= 1

    # Iteratively drop the first item in history and pick up the next one,
    # yielding the skipgrams for each iteration.
    for item in sequence:
        history.append(item)
        current_token = history.pop(0)
        # Iterate through the rest of the history and
        # pick out all combinations of n-1 items.
        for idx in list(combinations(range(len(history)), n-1)):
            ng = [current_token]
            for _id in idx:
                ng.append(history[_id])
            yield tuple(ng)

    # Recursively yield the skipgrams for the rest of the sequence, where
    # len(sequence) < n+k.
    for ng in list(skipgrams(history, n, k-1)):
        yield ng

Let's do some doctests to match the example in the paper:

>>> text = "Insurgents killed in ongoing fighting".split()
>>> two_skip_bigrams = list(skipgrams(text, n=2, k=2))
>>> two_skip_bigrams
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> two_skip_trigrams = list(skipgrams(text, n=3, k=2))
>>> two_skip_trigrams
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]

But note that if n+k > len(sequence), it yields the same result as skipgrams(sequence, n, k-1) (this is not a bug, it's a fail-safe feature), e.g.

>>> three_skip_trigrams = list(skipgrams(text, n=3, k=3))
>>> three_skip_fourgrams = list(skipgrams(text, n=4, k=3))
>>> four_skip_fourgrams  = list(skipgrams(text, n=4, k=4))
>>> four_skip_fivegrams  = list(skipgrams(text, n=5, k=4))
>>>
>>> print(len(three_skip_trigrams), three_skip_trigrams)
10 [('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
>>> print(len(three_skip_fourgrams), three_skip_fourgrams)
5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
>>> print(len(four_skip_fourgrams), four_skip_fourgrams)
5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
>>> print(len(four_skip_fivegrams), four_skip_fivegrams)
1 [('Insurgents', 'killed', 'in', 'ongoing', 'fighting')]

This allows n == k and n > k, but disallows n < k, as shown in the following lines:

if n < k:
        raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")

For the sake of understanding, let's look at the "mystical" line:

for idx in list(combinations(range(len(history)), n-1)):
    pass # Do something

Given a list of unique items, combinations produces this:

>>> from itertools import combinations
>>> x = [0,1,2,3,4,5]
>>> list(combinations(x,2))
[(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]

And since the indices of a list of tokens are always unique, e.g.

>>> sent = ['this', 'is', 'a', 'foo', 'bar']
>>> current_token = sent.pop(0) # i.e. 'this'
>>> list(range(len(sent)))
[0, 1, 2, 3]

it is possible to compute the possible combinations (without replacement) of that range:

>>> n = 3
>>> list(combinations(range(len(sent)), n-1))
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

If we map the indices back to the list of tokens:

>>> [tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)]
[('is', 'a'), ('is', 'foo'), ('is', 'bar'), ('a', 'foo'), ('a', 'bar'), ('foo', 'bar')]

Then we concatenate with current_token and get the skipgrams for the current token and its context + skip window:

>>> [tuple([current_token]) + tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)]
[('this', 'is', 'a'), ('this', 'is', 'foo'), ('this', 'is', 'bar'), ('this', 'a', 'foo'), ('this', 'a', 'bar'), ('this', 'foo', 'bar')]

After that, we move on to the next word.
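
The walk-through above (take the current token, then pick every combination of n-1 tokens from the window that follows it) can be condensed into a short, self-contained sketch. The name `window_skipgrams` is made up for illustration and is not part of NLTK; the sketch assumes n >= 1, k >= 0 and a finite sequence of tokens.

```python
from itertools import combinations


def window_skipgrams(tokens, n, k):
    """Yield k-skip-n-grams as tuples (illustrative helper, not an NLTK API)."""
    tokens = list(tokens)
    for i, head in enumerate(tokens):
        # The n-1 remaining tokens may come from the next n+k-1 positions.
        window = tokens[i + 1 : i + n + k]
        for tail in combinations(window, n - 1):
            yield (head,) + tail


sent = "Insurgents killed in ongoing fighting".split()
print(len(list(window_skipgrams(sent, 2, 2))))  # 9
print(len(list(window_skipgrams(sent, 3, 2))))  # 10
```

Because slicing simply truncates the window near the end of the sentence, an oversized k gives the same result as the largest k that fits, and k = 0 reduces to plain n-grams.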

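【Comments】:

Good work, but I expected it to return the sentence itself when the length is exceeded
Can you answer this question: stackoverflow.com/questions/31827756/…
@stackit that is a whole different NLP task, but I'll try it when I'm free =)
Regarding elif nk > sequence_length: for ng in skipgrams(list(sequence), n, k-1): yield ng;, it is basically the same as how ngrams are normally generated. I would keep it as it is rather than return a single list of strings.
Thanks for looking at that question, it's surprising that such a common problem hasn't been solved yet..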

【Solution 2】:

EDITED

The latest NLTK version, 3.2.5, has skipgrams implemented.

Here is a cleaner implementation by @jnothman from the NLTK repo: https://github.com/nltk/nltk/blob/develop/nltk/util.py#L538

def skipgrams(sequence, n, k, **kwargs):
    """
    Returns all possible skipgrams generated from a sequence of items, as an iterator.
    Skipgrams are ngrams that allows tokens to be skipped.
    Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

    :param sequence: the source data to be converted into trigrams
    :type sequence: sequence or iter
    :param n: the degree of the ngrams
    :type n: int
    :param k: the skip distance
    :type  k: int
    :rtype: iter(tuple)
    """

    # Pads the sequence as desired by **kwargs.
    if 'pad_left' in kwargs or 'pad_right' in kwargs:
        sequence = pad_sequence(sequence, n, **kwargs)

    # Note when iterating through the ngrams, the pad_right here is not
    # the **kwargs padding, it's for the algorithm to detect the SENTINEL
    # object on the right pad to stop inner loop.
    SENTINEL = object()
    for ngram in ngrams(sequence, n + k, pad_right=True, right_pad_symbol=SENTINEL):
        head = ngram[:1]
        tail = ngram[1:]
        for skip_tail in combinations(tail, n - 1):
            if skip_tail[-1] is SENTINEL:
                continue
            yield head + skip_tail

[out]:

>>> from nltk.util import skipgrams
>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> list(skipgrams(sent, 3, 2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
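
To see why the SENTINEL padding is enough to stop the inner loop, here is a dependency-free sketch of the same idea. It is not the NLTK source: the name `sentinel_skipgrams` is hypothetical, the sliding window is built with plain list slicing instead of `ngrams`, and it assumes n >= 2.

```python
from itertools import combinations

SENTINEL = object()  # unique marker that can never equal a real token


def sentinel_skipgrams(tokens, n, k):
    """Sketch of sentinel-padded skipgram extraction (assumes n >= 2)."""
    width = n + k
    # Right-pad so that every real token becomes the head of one window.
    padded = list(tokens) + [SENTINEL] * (width - 1)
    for start in range(len(padded) - width + 1):
        window = padded[start:start + width]
        head, tail = window[0], window[1:]
        for skip_tail in combinations(tail, n - 1):
            # Padding sits at the right end only, so a tail that does not
            # end in SENTINEL contains no padding at all.
            if skip_tail[-1] is SENTINEL:
                continue
            yield (head,) + skip_tail


sent = "Insurgents killed in ongoing fighting".split()
print(list(sentinel_skipgrams(sent, 2, 2))[:3])
```

Checking only the last element works because `combinations` preserves input order, so any padding in a tail is always at its end.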

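【Comments】:

Cool, where is the link?

【Solution 3】:

Although this departs entirely from your code and defers the work to an external library, you can use Colibri Core (https://proycon.github.io/colibri-core) for skipgram extraction. It is a library written specifically for efficient n-gram and skipgram extraction from big text corpora. The code base is in C++ (for speed/efficiency), but a Python binding is available.

You rightly mention efficiency, as skipgram extraction quickly shows exponential complexity. That may not be a big issue if you only pass a single sentence, as you did with input_list, but it becomes problematic if you release it on large corpus data. To mitigate this you can set parameters such as an occurrence threshold, or demand that each skip of a skipgram be fillable by at least x distinct n-grams.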

import colibricore

#Prepare corpus data (will be encoded for efficiency)
corpusfile_plaintext = "somecorpus.txt" #input, one sentence per line
encoder = colibricore.ClassEncoder()
encoder.build(corpusfile_plaintext)
corpusfile = "somecorpus.colibri.dat" #corpus output
classfile = "somecorpus.colibri.cls" #class encoding output
encoder.encodefile(corpusfile_plaintext,corpusfile)
encoder.save(classfile)

#Set options for skipgram extraction (mintokens is the occurrence threshold, maxlength maximum ngram/skipgram length)
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8,doskipgrams=True)

#Instantiate an empty pattern model 
model = colibricore.UnindexedPatternModel()

#Train the model on the encoded corpus file (this does the skipgram extraction)
model.train(corpusfile, options)

#Load a decoder so we can view the output
decoder = colibricore.ClassDecoder(classfile)

#Output all skipgrams
for pattern in model:
     if pattern.category() == colibricore.Category.SKIPGRAM:
         print(pattern.tostring(decoder))

There is a more extensive Python tutorial about all this on the website.

Disclaimer: I'm the author of Colibri Core
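
To make the growth and the occurrence-threshold idea concrete, here is a rough plain-Python sketch. It does not use Colibri Core and does not reflect its API or internals; the function names and the `min_count` parameter are assumptions made for illustration, and `math.comb` requires Python 3.8 or later.

```python
from collections import Counter
from itertools import combinations
from math import comb  # Python 3.8+


def max_skipgrams_per_token(n, k):
    """Upper bound on the k-skip-n-grams that start at a single token."""
    # n-1 tokens are chosen from the n+k-1 tokens that follow the head.
    return comb(n + k - 1, n - 1)


def count_skipgrams(sentences, n, k):
    """Count every k-skip-n-gram over an iterable of token lists."""
    counts = Counter()
    for tokens in sentences:
        for i, head in enumerate(tokens):
            window = tokens[i + 1 : i + n + k]
            for tail in combinations(window, n - 1):
                counts[(head,) + tail] += 1
    return counts


def frequent_skipgrams(sentences, n, k, min_count=2):
    """Keep only skipgrams seen at least min_count times (hypothetical threshold)."""
    counts = count_skipgrams(sentences, n, k)
    return {gram: c for gram, c in counts.items() if c >= min_count}


corpus = [["a", "b", "c"], ["a", "x", "c"]]
print(max_skipgrams_per_token(3, 2))     # 6
print(frequent_skipgrams(corpus, 2, 1))  # {('a', 'c'): 2}
```

The per-token bound grows quickly with n and k (70 for n=5, k=4), which is why a threshold applied over a whole corpus matters far more than it does for a single sentence.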

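【Comments】:

Yes, I tried it before writing this question, but I couldn't install colibri on ubuntu
I improved the installer and instructions last week, so hopefully it's less of a hassle to install now.
@proycon, would it be possible to create duck types in colibri's python wrapper so that the interface looks like NLTK, e.g. colibri.ngrams(text, n=3) or colibri.skipgram(text, n=3, k=2)? Or would it be easier to reimplement some bits of the colibri wrapper in the NLTK repo?
@alvas I fear the extra overhead would come at a great performance cost and might lead to inefficient code. Encoding and decoding Python strings to and from colibri's internal compressed representation should be done as early or as late as possible. It would only be beneficial if text is a really large block of text (in which case it's better to let colibri read it directly from file, as that is faster). As for implementing a wrapper in NLTK, I'm not sure whether they would want to depend on an external C++ library?
@proycon, thanks for the note! Possibly an NLTK wrapper that calls colibri outside of python and then reads the output file in python would be faster (like what they do for Stanford/MaltParser). The overhead would be reading/writing the text files, but that shouldn't be much of a problem. Thanks again!

【Solution 4】:

Refer to this for complete info.

The example below, which is already given there, shows its usage, and it works like a charm!

>>> from nltk.util import skipgrams
>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]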

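【Comments】:

The skipgrams function in nltk was created after this old question led to a request being raised in their forum

【Solution 5】:

How about using someone else's implementation, https://github.com/heaven00/skipgram/blob/master/skipgram.py, where k = skip_size and n = ngram_order: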

import numpy as np

def skipgram_ndarray(sent, k=1, n=2):
    """
    This is not exactly a vectorized version, because we are still
    using a for loop
    """
    tokens = sent.split()
    if len(tokens) < k + 2:
        raise Exception("REQ: length of sentence > skip + 2")
    matrix = np.zeros((len(tokens), k + 2), dtype=object)
    matrix[:, 0] = tokens
    matrix[:, 1] = tokens[1:] + ['']
    result = []
    for skip in range(1, k + 1):
        matrix[:, skip + 1] = tokens[skip + 1:] + [''] * (skip + 1)
    for index in range(1, k + 2):
        temp = matrix[:, 0] + ',' + matrix[:, index]
        # extend() rather than map(result.append, ...): map is lazy on Python 3
        result.extend(temp.tolist())
    # Integer division keeps the slice bound an int on Python 3
    limit = (((k + 1) * (k + 2)) // 6) * ((3 * n) - (2 * k) - 6)
    return result[:limit]

def skipgram_list(sent, k=1, n=2):
    """
    Form skipgram features using list comprehensions
    """
    tokens = sent.split()
    tokens_n = ['''tokens[index + j + {0}]'''.format(index)
                for index in range(n - 1)]
    x = '(tokens[index], ' + ', '.join(tokens_n) + ')'
    query_part1 = 'result = [' + x + ' for index in range(len(tokens))'
    query_part2 = ' for j in range(1, k+2) if index + j + n < len(tokens)]'
    # On Python 3, exec cannot create local variables, so run the generated
    # code in an explicit namespace and read the result back from it.
    namespace = {'tokens': tokens, 'k': k, 'n': n}
    exec(query_part1 + query_part2, namespace)
    return namespace['result']

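【Comments】:

No, it doesn't work: print(skipgram_ndarray("What is your name")) gives ['What,is', 'is,your', 'your,name', 'name,', 'What,your', 'is,name']. name is a unigram, and the other function is even further off
This implementation is hardcoded for k<3. It does work, it just doesn't scale (and there are quite a few hacks... exec(...) is funny).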