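Removing single quotation marks while preserving apostrophes (Python, NLTK)

【Question title】: Removing single quotation marks while preserving apostrophes Python, NLTK
【Posted】: 2014-03-12 12:01:15
【Question description】:

I am trying to create a frequency list for a corpus of poetry. The code reads a .txt file and uses the data to create a .csv.

The part I am struggling with is removing irrelevant punctuation from the text. The relevant code I have so far is: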

import nltk

raw = open('file_name.txt', 'r').read()
output = open('output_filename.csv','w')
txt = raw.lower()

pattern = r'''(?x)([A-Z]\.)+|\w+(-\w+)*|\.\.\.|[][.,;"'?():-_`]'''
tokenized = nltk.regexp_tokenize(txt,pattern)

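This is almost perfect, since it preserves the hyphen in words such as chimney-sweeper, but it also splits contractions into two separate words, which is not what I want.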
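A minimal sketch of why the contractions split, using plain `re` instead of NLTK and assuming Python 3. Both pattern constants are simplified, illustrative stand-ins rather than the exact pattern above: the word alternative `\w+(-\w+)*` only lets hyphens join word characters, so an apostrophe ends the token. Adding the apostrophe (straight or curly) to the set of joining characters appears to keep contractions whole.

```python
import re

# Simplified stand-in for the pattern above: only hyphens may join word parts.
HYPHEN_ONLY = r"\w+(?:-\w+)*|[^\w\s]"
# Assumed variant: straight or curly apostrophes may join word parts as well.
HYPHEN_OR_APOSTROPHE = r"\w+(?:[-'’]\w+)*|[^\w\s]"

line = "the chimney-sweeper's cry o'er the town"

split_tokens = re.findall(HYPHEN_ONLY, line)
kept_tokens = re.findall(HYPHEN_OR_APOSTROPHE, line)

print(split_tokens)  # the apostrophe becomes its own token
print(kept_tokens)   # contractions stay in one piece
```

The non-capturing groups `(?:...)` matter here: with capturing groups, `re.findall` returns the group contents instead of the whole match.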

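For example, my text file (a trial run on William Blake's Songs of Innocence) contains the line:

'Pipe a song about a Lamb!'

which I would like to turn into

pipe | a | song | about | a | lamb

The code I was using previously preserved contractions, but also left me with the single quotation marks: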

import re
import string
for punct in string.punctuation:
    txt = txt.replace(punct, ' ')
txt = re.sub(r'\r+', ' ', txt)  # re.sub returns a new string; assign it back

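So I would get

'pipe | a | song | about | a | lamb

I would like to find a middle ground between the two, because I need to keep apostrophes, as in O'er, and hyphens, but get rid of everything else.

I know this topic seems to have been exhausted on this forum, but I have spent the past four days trying every example offered and could not get any of them to work as advertised, so rather than tear all my hair out I thought I would try posting a question.

Edit:

It seems the reason the standard tokenizers could not handle my text is that some of the apostrophes slant right or left in odd places. I produced the result I wanted with a bunch of .replace() calls: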

# treat line breaks as spaces
txt = txt.replace("\n", " ")
# replace stray double quotation marks with a space
txt = txt.replace("”", " ")
txt = txt.replace("“", " ")
# replace a right-leaning apostrophe with a space if it follows a space (which now includes line breaks)
txt = txt.replace(" ’", " ")
# replace a left-leaning apostrophe with a space if it follows a space
txt = txt.replace(" ‘", " ")

I have no doubt there is a way to combine all of this into a single line of code, but I am just really glad it all works!
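One possible way to condense the chain of .replace() calls, offered as a rough sketch assuming Python 3. The function name `strip_stray_quotes` and the table `_TO_SPACE` are hypothetical, and the result may differ from the original chain in unusual cases such as two curly quotes in a row.

```python
import re

# Characters the .replace() chain turns into a space unconditionally.
_TO_SPACE = str.maketrans({"\n": " ", "“": " ", "”": " "})

def strip_stray_quotes(text):
    """Blank out line breaks, curly double quotes, and curly single
    quotes that directly follow a space; keep word-internal apostrophes."""
    text = text.translate(_TO_SPACE)
    # A curly single quote right after a space is treated as a stray quote.
    return re.sub(r" [‘’]", " ", text)

cleaned = strip_stray_quotes("he said\n‘o’er the hill”")
print(repr(cleaned))
```

Like the original chain, this leaves a curly quote at the very start of the text or at the end of a word untouched, so a further stripping step may still be needed.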

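【Question discussion】:

Tags: python python-2.7 nltk

【Solution 1】:

Rather than replacing the punctuation, you could split on spaces and then strip the punctuation from the start and end of each word: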

>>> import string
>>> phrase = "'This has punctuation, and it's hard to remove!'"
>>> [word.strip(string.punctuation) for word in phrase.split(" ")]
['This', 'has', 'punctuation', 'and', "it's", 'hard', 'to', 'remove']

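This keeps apostrophes and hyphens inside words while removing punctuation at the start or end of each word.

Note that standalone punctuation is replaced by the empty string "":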

>>> phrase = "This is - no doubt - punctuated"
>>> [word.strip(string.punctuation) for word in phrase.split(" ")]
['This', 'is', '', 'no', 'doubt', '', 'punctuated']

These are easy to filter out, since the empty string evaluates to False:

filtered = [f for f in txt if f and f.lower() not in stopwords]
                            # ^ excludes empty string
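Putting the split, strip, and filter steps together, a minimal sketch of a frequency count, assuming Python 3. The names `word_frequencies` and `EDGE_CHARS` are hypothetical, and adding the curly quotes to the stripped characters is an assumption based on the question's edit, since `string.punctuation` only covers ASCII.

```python
import string
from collections import Counter

# ASCII punctuation plus the curly quotes mentioned in the question's edit.
EDGE_CHARS = string.punctuation + "‘’“”"

def word_frequencies(text, stopwords=frozenset()):
    """Count words after stripping punctuation from both ends of each one."""
    words = (w.strip(EDGE_CHARS) for w in text.lower().split())
    # Empty strings (standalone punctuation) are falsy and get dropped.
    return Counter(w for w in words if w and w not in stopwords)

counts = word_frequencies(
    "‘Pipe a song about a Lamb!’ So I piped, o’er and o’er - chimney-sweeper."
)
print(counts.most_common(3))
```

Calling `split()` with no argument splits on any run of whitespace, so line breaks do not need to be replaced first. A trailing possessive apostrophe, as in astronauts', would still be stripped by this approach.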

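【Discussion】:

Could you provide more than "struggling to get it to work"? Errors (provide the full traceback)? Unexpected output (provide sample input along with the expected and actual output)?
Sorry, I submitted the comment before I had finished formatting it. I am working out how to format it now.
Right, what I have now is import string raw = open('file.txt', 'r').read() output = open('Output/result.csv','w') txt = raw.lower() [word.strip(string.punctuation) for word in txt.split(" ")] and now the result just gives me some random letters and how often they occur in the text. For example: e - 1635, t - 766, etc.
To embarrass myself further, I cannot format the comment properly. My sincerest apologies.
@wim Unlike its and it's, there is nothing to confuse twas with! Although astronauts' and astronauts could be a problem.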
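One plausible reading of the single-letter counts reported in the comments above, offered as a guess since the counting code is not shown: the list comprehension's result is never assigned, so `txt` is still a string, and iterating over a string yields characters. A minimal sketch, assuming Python 3 and using `collections.Counter` as a stand-in for whatever counting the asker did:

```python
import string
from collections import Counter

txt = "pipe a song about a lamb"

# The result of this expression is discarded; txt remains a plain string.
[word.strip(string.punctuation) for word in txt.split(" ")]
by_char = Counter(txt)    # iterating a string counts single characters

# Assigning the list to a name and counting that gives word frequencies.
words = [word.strip(string.punctuation) for word in txt.split(" ")]
by_word = Counter(words)

print(by_char.most_common(2))
print(by_word.most_common(2))
```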