Map line breaks in sentence to another list (answers)

【Question title】: Map line breaks in sentence to another list
【Posted】: 2014-11-25 02:40:41
【Question description】:

In a file, I have text like this, with line breaks at arbitrary positions:

Spencer J. Volk, president and CEO of this company, was elected a director. 
Mr. Volk, 55 years old, succeeds Duncan Dwight, 
who retired in September.

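I am using nltk's sentence tokenizer to find the sentences, and then tagging the words in those sentences with part-of-speech tags. For example, after tagging I get output like this (one list per sentence, holding a (word, tag) tuple for each word):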

[('Spencer', u'NNP'), ('J.', u'NNP'), ('Volk', u'NNP'), ('president', u'NN'), ('and', u'CC'), ('CEO', u'NN'), ('of', u'IN'), ('this', u'DT'), ('company', u'NN'), ('was', u'VBD'), ('elected', u'VBN'), ('a', u'DT'), ('director', u'NN')]

[('Mr.', u'NNP'), ('Volk', u'NNP'), ('55', u'CD'), ('years', u'NNS'), ('old', u'JJ'), ('succeeds', u'VBZ'), ('Duncan', u'NNP'), ('Dwight', u'NNP'), ('who', u'WP'), ('retired', u'VBD'), ('in', u'IN'), ('September', u'NNP')]

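But now I want to write the tags to another file, using the same line breaks as in the original file the text was read from. For the example above, it would look like this: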

NNP NNP NNP NN CC NN IN DT NN VBD VBN DT NN
NNP NNP CD NNS JJ VBZ NNP NNP
WP VBD IN NNP

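I can get the tags and everything else in this form, but how do I relate the original line breaks to line breaks in the tag lists?

One way is to split each sentence, find the index of \n, hope that each split piece corresponds to one word in the sentence (which may not always be true), and then break the tag list at that index. That is more of a hack, and it fails in many cases. What would be a more robust way to achieve this?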

【Question discussion】:

Why are you removing the punctuation? It is very useful.
@alvas I'm not. The tokenizer I'm using does.

Tags: python nltk

【Solution 1】:

Interesting puzzle. First, note that nltk.sent_tokenize() preserves the line breaks that fall inside a sentence:

sents = nltk.sent_tokenize(text)
for s in sents:
    print(repr(s))

So, to interleave the POS tags with the line breaks, you can walk through the tokens one at a time and check for a newline between them:

def process_sent(sent):
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))

    for word, tag in tagged:
        pre, _, post = sent.partition(word)
        if "\n" in pre:
            print("\n", end="")
        print(tag, end=" ")
        sent = post # advance to the next word
    if "\n" in post:
        print("\n", end="")

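I'm not sure why, but nltk.sent_tokenize() discards the newlines that occur between sentences, at the sentence boundaries. So we need to look for those too. Fortunately, we can use exactly the same algorithm: walk through the full text one sentence at a time and check for newlines between the sentences.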

sents = nltk.sent_tokenize(text)
for s in sents:
    pre, _, post = text.partition(s)
    if "\n" in pre:
        print("\n", end="")
    process_sent(s)
    text = post  # Advance to the next sentence -- munges `text` so use another var if it matters.

if "\n" in post:
    print("\n", end="")

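PS. That should do it, except that you will only get one newline in the output wherever there are several adjacent newlines in the input. If you care about that, replace `if "\n" in pre: print("\n", end="")` with a call to this: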

def nlretain(txt):
    """Output as many newlines as there are in `txt`"""
    print("\n"*txt.count("\n"), end="")
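The same idea can be sketched as a single pass that returns a string instead of printing, which is convenient when the tags have to be written to another file. This is only an illustration, not the answer's own code: the function name `align_tags_to_lines` is hypothetical, the tagged tokens are passed in as a flat list (for example the concatenated `nltk.pos_tag()` results of all sentences), and it assumes each token appears verbatim in the text. Tokenizers that rewrite characters, such as quotation marks, would break that assumption, so tokens that cannot be found are simply kept on the current line.

```python
def align_tags_to_lines(text, tagged_tokens):
    """Return the tags as one string with the same line breaks as `text`.

    `tagged_tokens` is a flat list of (word, tag) pairs in reading order.
    """
    lines = [[]]   # one list of tags per output line
    cursor = 0     # position in `text` just past the previous token
    for word, tag in tagged_tokens:
        start = text.find(word, cursor)
        if start == -1:
            # Assumption violated: the tokenizer altered this token.
            # Keep the tag on the current line and do not move the cursor.
            lines[-1].append(tag)
            continue
        # Open one new output line per newline found before this token.
        for _ in range(text.count("\n", cursor, start)):
            lines.append([])
        lines[-1].append(tag)
        cursor = start + len(word)
    # Preserve newlines that follow the last token.
    for _ in range(text.count("\n", cursor)):
        lines.append([])
    return "\n".join(" ".join(tags) for tags in lines)


text = (
    "Spencer J. Volk, president and CEO of this company, was elected a director. \n"
    "Mr. Volk, 55 years old, succeeds Duncan Dwight, \n"
    "who retired in September."
)
tagged = [
    ("Spencer", "NNP"), ("J.", "NNP"), ("Volk", "NNP"), ("president", "NN"),
    ("and", "CC"), ("CEO", "NN"), ("of", "IN"), ("this", "DT"),
    ("company", "NN"), ("was", "VBD"), ("elected", "VBN"), ("a", "DT"),
    ("director", "NN"),
    ("Mr.", "NNP"), ("Volk", "NNP"), ("55", "CD"), ("years", "NNS"),
    ("old", "JJ"), ("succeeds", "VBZ"), ("Duncan", "NNP"), ("Dwight", "NNP"),
    ("who", "WP"), ("retired", "VBD"), ("in", "IN"), ("September", "NNP"),
]
result = align_tags_to_lines(text, tagged)
print(result)
```

Because the search always starts at the cursor, a repeated word such as "Volk" is matched at its next occurrence rather than its first, and runs of several newlines are reproduced one for one.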

【Discussion】:

【Solution 2】:

Ignoring the line breaks and using sent_tokenize:

>>> from nltk import word_tokenize, pos_tag, sent_tokenize
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director. 
... Mr. Volk, 55 years old, succeeds Duncan Dwight, 
... who retired in September. """
>>> 
>>> text = " ".join(i for i in text.split('\n'))
>>> tagged_text = [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]
>>> for sent in tagged_text:
...     poses = " ".join(pos for word, pos in sent)
...     print(poses)
... 
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN .
NNP NNP , CD NNS JJ , NNS NNP NNP , WP VBN IN NNP .

Paying attention to the line breaks:

>>> from nltk import word_tokenize, pos_tag
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director. 
... Mr. Volk, 55 years old, succeeds Duncan Dwight, 
... who retired in September. """
>>> 
>>> tagged_text = [pos_tag(word_tokenize(sent)) for sent in text.split('\n')]
>>> for sent in tagged_text:
...     poses = " ".join(pos for word, pos in sent)
...     print(poses)
... 
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN .
NNP NNP , CD NNS JJ , NNS NNP NNP ,
WP VBN IN NNP .

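As you can see, the sentence tokenizer plays no part in getting the tags right here. The POS tagger relies on contextual information that is weaker than each word's default tag, so it makes little difference whether you use sent_tokenize first and then split the result again into non-sentence lines.

If you want to sent_tokenize and then split the tags at \n as in the original text: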

>>> from itertools import chain
>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director. 
... Mr. Volk, 55 years old, succeeds Duncan Dwight, 
... who retired in September. """

>>> sent_lens = [len(word_tokenize(line)) for line in text.split('\n')]
>>> sent_lens
[16, 11, 5]
>>> tagged_text = [[pos for word,pos in pos_tag(word_tokenize(line))] for line in sent_tokenize(text)]
>>> flat_tags = list(chain(*tagged_text))
>>> start = 0
>>> for l in sent_lens:
...     print(" ".join(flat_tags[start:start+l]))
...     start += l
... 
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN .
NNP NNP , CD NNS JJ , NNS NNP NNP ,
WP VBN IN NNP .
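The slicing step of this last approach can also be sketched without NLTK, using `itertools.islice` to consume a flat tag list one line-length at a time. The helper name `split_tags_by_line_lengths` is hypothetical, and the tags and lengths below are copied from the transcripts above. The sketch assumes that tokenizing line by line yields the same number of tokens as tokenizing sentence by sentence, which may not hold for every text.

```python
from itertools import islice


def split_tags_by_line_lengths(flat_tags, line_lengths):
    """Cut `flat_tags` into consecutive chunks of the given lengths."""
    it = iter(flat_tags)
    # Each islice call continues where the previous one stopped.
    return [list(islice(it, n)) for n in line_lengths]


# Tags produced sentence by sentence, flattened into one list.
flat_tags = (
    "NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN . "
    "NNP NNP , CD NNS JJ , NNS NNP NNP , WP VBN IN NNP ."
).split()
# Number of tokens on each original line of the text.
line_lengths = [16, 11, 5]

tag_lines = split_tags_by_line_lengths(flat_tags, line_lengths)
for tags in tag_lines:
    print(" ".join(tags))
```

Using a single iterator avoids keeping a running offset by hand, which is where off-by-one and reset mistakes tend to creep in.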

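【Discussion】:

Well, in this case the tags don't differ. But that's not the question. I need to run the tagger on whole sentences. Splitting the text and running the tagger on each line is trivial. The real question is how you would get the second result if you run the tagger with sent_tokenize as in the first example.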