Extracting a word and its prior 10 word context to a dataframe in Python

【Question title】: Extracting a word and its prior 10 word context to a dataframe in Python
【Posted】: 2015-06-02 03:15:33
【Question】:

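I'm quite new to Python (2.7), so please forgive me if this is a very straightforward question. I'd like to (i) extract all words ending in -ing from a text that has been tokenized with the NLTK library, and (ii) extract the 10 words preceding each extracted word. I'd then like to (iii) save these to a file as a dataframe with two columns, which might look something like: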

Word        PreviousContext 
starting    stood a moment, as if in a troubled reverie; then
seeming     of it retraced our steps. But Elijah passed on, without
purchasing  a sharp look-out upon the hands: Bildad did all the

I know how to do (i), but I'm unsure how to go about (ii)-(iii). Any help would be greatly appreciated and acknowledged. So far I have:

>>> import bs4 
>>> import nltk
>>> from nltk import word_tokenize
>>> url = "http://www.gutenberg.org/files/766/766-h/766-h.htm"
>>> import urllib
>>> response = urllib.urlopen(url)
>>> raw = response.read().decode('utf8')
>>> tokens = word_tokenize(raw)
>>> for w in tokens:
...     if w.endswith("ing"):
...             print(w)
... 
padding
padding
encoding
having
heading
wearying
dismissing
going
nothing
reading etc etc etc..

【Comments】:

I've just added what I've done so far to achieve (i). :)
Hint: look into enumerate

Tags: python extract

【Solution 1】:

After the line:

>>> tokens = word_tokenize(raw)

use the following code to collect the words together with their context:

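>>> context = {}
>>> for i, w in enumerate(tokens):
...     if w.endswith("ing"):
...         # As written, this slice stores the word itself plus the 9 tokens that follow it.
...         # For the 10 tokens *before* the word, use tokens[max(0, i - 10):i] instead.
...         # Slicing never raises IndexError near the ends of the list, so no try/except is needed.
...         context[w] = tokens[i:i+10]
... 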
>>> fp=open('dataframes','w')   # save results in this file
>>> fp.write('Word'+'\t\t'+'PreviousContext\n')
>>> for word in context:
...    fp.write(word+'\t\t'+' '.join(context[word])+'\n')
... 
>>> fp.close()
>>> fp=open('dataframes','r')  
>>> for line in fp.readlines()[:10]: # first 10 lines of generated file
...    print line
... 
Word                PreviousContext
raining             raining , and I saw more fog and mud in
bidding             bidding him good night , if he were yet sitting
growling            growling old Scotch Croesus with great flaps of ears ?
bright-looking      bright-looking bride , I believe ( as I could not
hanging             hanging up in the shop&mdash ; went down to look
scheming            scheming and devising opportunities of being alone with her .
muffling            muffling her hands in it , in an unsettled and
bestowing           bestowing them on Mrs. Gummidge. She was with him all
adorning            adorning , the perfect simplicity of his manner , brought

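Two things to note:

nltk treats punctuation marks as separate tokens, so each punctuation mark counts as a word of its own.
I used a dictionary to store the words with their context, so the order of the words is not preserved and each distinct word appears once. If a word occurs several times, only the context of its last occurrence is kept.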
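
The two points above suggest a variation: collect (word, context) pairs in a list so that repeated words are kept in order, and skip punctuation-only tokens. Here is a minimal sketch of that idea, assuming `tokens` is the list returned by `word_tokenize`. The names `is_word` and `previous_context` are hypothetical, and counting a token as a word when it contains at least one letter or digit is my own assumption, not something nltk defines.

```python
def is_word(token):
    """Treat a token as a word if it has at least one letter or digit (an assumption)."""
    return any(ch.isalnum() for ch in token)


def previous_context(tokens, suffix="ing", window=10):
    """Return (word, preceding words) pairs for every token ending in `suffix`."""
    words = [t for t in tokens if is_word(t)]  # drop punctuation-only tokens
    pairs = []
    for i, w in enumerate(words):
        if w.endswith(suffix):
            start = max(0, i - window)  # fewer than `window` words near the start
            pairs.append((w, " ".join(words[start:i])))
    return pairs


# Small hand-made token list standing in for the real word_tokenize output.
sample = "He stood a moment , as if thinking ; then kept walking and walking .".split()
for word, ctx in previous_context(sample, window=3):
    print(word + "\t" + ctx)
```

Because the pairs live in a list, both occurrences of "walking" in the sample are reported, each with its own context.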

【Comments】:

【Solution 2】:

Suppose you have all of your words in a list called words:

>>> words
['abc', 'def', 'gdi', 'asd', 'ew', 'd', 'ew', 'fdsa', 'dsa', 'aing', 'e', 'f', 'dsa', 'fe', 'dfa', 'e', 'd', 'fe', 'asd', 'fe', 'ting']

I would put them into a Series and get the indices of the relevant words:

import pandas
words = pandas.Series(words)
# index positions of the words that end in 'ing'
idx = pandas.Series(words[words.apply(lambda x: x.endswith('ing'))].index)
>>> idx
0     9
1    20
dtype: int64

The values in idx are now the indices, in the original Series, of the words ending in 'ing'. Next we need to turn these values into ranges:

starts = idx - 10
ends = idx

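Now we can index the original Series with these ranges (first, though, clip the lower bound at 0, in case an 'ing' word appears fewer than 10 words into the list):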

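starts = starts.clip(0)
# ' '.join(...) replaces the original string.join(...), which needs `import string` and exists only in Python 2
df = pandas.DataFrame([{
    'Previous': ' '.join(words[s:e]),
    'word': words[e]} for s,e in zip(starts,ends)])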
>>> df
                           Previous  word
0  abc def gdi asd ew d ew fdsa dsa  aing
1      e f dsa fe dfa e d fe asd fe  ting

Not exactly a one-liner, but it works.

Note that the row for 'aing' has only 9 words of context because it appears too early in the made-up list.
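
The steps above stop once the frame is built. Step (iii) of the question, saving it, could be sketched as follows. This is one possible way rather than the answer's own method: the function name `ing_context_frame`, the column names (taken from the question's example table) and the file name in the comment are all assumptions. It uses a plain list of tokens and `DataFrame.to_csv` with a tab separator.

```python
import pandas as pd


def ing_context_frame(tokens, window=10, suffix="ing"):
    """Build a two-column frame: each matching word and the words before it."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.endswith(suffix):
            start = max(0, i - window)  # same idea as clipping the lower bound at 0
            rows.append({"Word": tok, "PreviousContext": " ".join(tokens[start:i])})
    return pd.DataFrame(rows, columns=["Word", "PreviousContext"])


df = ing_context_frame(["a", "b", "sing", "c", "ring"], window=2)
# With no path argument, to_csv returns the text instead of writing a file.
print(df.to_csv(sep="\t", index=False))
# To write a file instead (hypothetical file name):
# df.to_csv("ing_context.tsv", sep="\t", index=False)
```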

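【Comments】:

Thanks. I tried your sample code. Everything worked until I assigned the 'df' object, at which point I got the following error: >>> df = pandas.DataFrame([{ ... 'word': words[e], ... 'Previous': string.join(words[s:e])} for s,e in zip(starts,ends)]) Traceback (most recent call last): File "", line 3, in NameError: name 'string' is not defined
You just need an import string somewhere, or use the built-in method instead, as in ' '.join(words[s:e])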

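【Solution 3】:

If you are asking how to do this algorithmically, I would maintain a queue of the previous 10 words at all times, along with a dataframe whose first column holds the words ending in "ing" and whose second column holds the 10 words preceding the corresponding word in the first column.

So at the start of your program the queue is empty, and for the first 10 words you simply enqueue each word. After that, each time before advancing the loop, enqueue the current word and dequeue one word (so the queue stays at size 10).

This way, at each iteration you check whether the word ends in "ing". If it does, add a row to the dataframe in which the word is the first item and the current state of the queue is the second.

At the end you should have a dataframe whose first column contains the words ending in "ing" and whose second column contains the 10 words preceding each of them.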
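
The queue idea described above can be sketched with `collections.deque`, whose `maxlen` argument drops the oldest item automatically once the queue is full. This is only an illustration: the function name `rolling_context` is made up, and it returns plain tuples instead of dataframe rows to stay self-contained.

```python
from collections import deque


def rolling_context(tokens, suffix="ing", window=10):
    """Yield (word, preceding words) using a fixed-size queue of recent tokens."""
    recent = deque(maxlen=window)  # oldest token is discarded automatically when full
    for token in tokens:
        if token.endswith(suffix):
            yield token, " ".join(recent)
        # Enqueue after the check so a word never appears in its own context.
        recent.append(token)


rows = list(rolling_context("a b c ding e f ging".split(), window=3))
print(rows)
```

Since only the last `window` tokens are held at any time, this also works on a token stream that is too large to keep in memory as a list.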

【Comments】: