Answers: Counting punctuation in text using Python and regex

【Question title】: Counting punctuation in text using Python and regex
【Posted】: 2013-04-23 21:47:24
【Question】:

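I'm trying to count how many times punctuation characters appear in a novel. For example, I want to find the occurrences of question marks and periods, as well as all other non-alphanumeric characters. Then I want to insert the counts into a CSV file. I'm not sure how to write the regex, since I don't have much experience with Python. Can someone help me out?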

texts=string.punctuation
counts=dict(Counter(w.lower() for w in re.findall(r"\w+", open(cwd+"/"+book).read())))
writer = csv.writer(open("author.csv", 'a'))
writer.writerow([counts.get(fieldname,0) for fieldname in texts])

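【Question comments】:

Don't use regex for frequency counting. Just loop over the text character by character, filter out letters, digits, and whitespace, and put the rest into a dictionary to count frequencies. Alternatively, strip out all letters, digits, and whitespace first, then loop over the remaining string (which is cleaner).
You don't need regex at all; just check whether each character is in the string module's punctuation string as you iterate over the novel.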
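
The two approaches suggested in these comments could be sketched roughly as follows. This is only an illustration, not the commenters' own code: the names `SKIP`, `count_by_loop`, and `count_by_strip` and the sample sentence are made up, and the sketch assumes that only ASCII letters, digits, and whitespace should be ignored.

```python
import string
from collections import Counter

# Characters that are NOT treated as punctuation in this sketch (ASCII only, an assumption).
SKIP = string.ascii_letters + string.digits + string.whitespace


def count_by_loop(text):
    """Walk the text character by character and tally everything not in SKIP."""
    counts = {}
    for ch in text:
        if ch in SKIP:
            continue
        counts[ch] = counts.get(ch, 0) + 1
    return counts


def count_by_strip(text):
    """Delete the SKIP characters first, then count whatever is left."""
    leftovers = text.translate(str.maketrans("", "", SKIP))
    return Counter(leftovers)


sample = "Really? Yes, really. Well... fine!"
print(count_by_loop(sample))
print(count_by_strip(sample))
```

Both functions should report the same counts; the second one simply lets Counter do the tallying.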

Tags: python regex text-mining

【Solution 1】:

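import re
def count_puncts(x):
  # Delete every character that is neither a word character nor whitespace
  # (a bare \W would also count spaces and newlines); re.subn returns the
  # new string together with the number of replacements made.
  new_str, count = re.subn(r'[^\w\s]', '', x)
  return count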
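
Since the question asks for a count per symbol rather than one total, the same regex idea can be combined with Counter. A minimal sketch, assuming "punctuation" means any character that is neither a word character nor whitespace; `count_each_punct` is a name invented for this example. Keep in mind that `\W` on its own also matches whitespace, and that the underscore is a word character, so this pattern does not count it.

```python
import re
from collections import Counter


def count_each_punct(text):
    # [^\w\s] matches a single character that is neither a word character
    # (letter, digit, underscore) nor whitespace.
    return Counter(re.findall(r"[^\w\s]", text))


counts = count_each_punct("real, and? or, and? what.")
print(counts[","], counts["?"], counts["."], counts["!"])
```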

【Comments】:

【Solution 2】:

Using curses:

import curses.ascii
str1 = "real, and? or, and? what."
t = (c for c in str1 if curses.ascii.ispunct(c))
d = dict()
for p in t:
    d[p] = d.get(p, 0) + 1  # start at 0 for a symbol not seen before

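【Comments】:

There's no need for the for loop; just use d = Counter(t). Also, you could use map instead of the generator expression, although that is probably not as clear.
Try to avoid using str as a variable name, because you may need str(1) later in your program, and at that point you can't use it.

【Solution 3】: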

from string import punctuation
from collections import Counter

with open('novel.txt') as f: # closes the file for you which is important!
    c = Counter(c for line in f for c in line if c in punctuation)

This also avoids loading the whole novel into memory at once.

By the way, this is what string.punctuation looks like:

>>> punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

You may want to add or remove symbols here depending on your needs.
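
For instance, string.punctuation only covers ASCII, so typographic quotes and dashes that often appear in typeset novels would not be counted. Below is a sketch of one way to adjust the set. Which symbols to add or drop is purely an assumption for illustration, and `EXTRA`, `DROP`, `SYMBOLS`, and `count_symbols` are hypothetical names.

```python
from collections import Counter
from string import punctuation

# Assumed additions: curly quotes, en/em dashes and the ellipsis character.
EXTRA = "\u2018\u2019\u201c\u201d\u2013\u2014\u2026"
# Assumed removals: symbols that rarely matter in prose.
DROP = "^`~|"

SYMBOLS = set(punctuation + EXTRA) - set(DROP)


def count_symbols(lines, symbols=SYMBOLS):
    """Count each character of `symbols` in an iterable of lines (e.g. an open file)."""
    return Counter(c for line in lines for c in line if c in symbols)


counts = count_symbols(["\u201cWait\u2014what?\u201d she said.\n", "a ^ b\n"])
print(counts)
```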

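Also, Counter defines a __missing__ method that returns 0. So don't convert it to a dict and then call .get(x, 0). Keep it as a Counter and access it like c[x]; if the key doesn't exist, its count is 0. I'm not sure why everyone suddenly wants to downgrade all their Counters to dicts just because the Counter([...]) you see when printing looks scary, when in fact Counters are dictionaries too and deserve respect.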

writer.writerow([counts.get(c, 0) for c in punctuation])

If you leave it as a Counter, you can do this instead:

writer.writerow([counts[c] for c in punctuation])

which is much simpler.
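
Putting the pieces together, an end-to-end version might look like the sketch below. The function name, the UTF-8 encoding, and the file paths are assumptions; the question's code appends a single row of counts to author.csv, and this imitates that.

```python
import csv
from collections import Counter
from string import punctuation


def append_punctuation_row(novel_path, csv_path):
    """Count punctuation in one novel and append the counts as a single CSV row."""
    # Assumes the novel is UTF-8 text; adjust the encoding if yours differs.
    with open(novel_path, encoding="utf-8") as f:
        counts = Counter(c for line in f for c in line if c in punctuation)
    # The csv module documentation recommends newline="" for files given to csv.writer.
    with open(csv_path, "a", newline="", encoding="utf-8") as out:
        # Missing symbols yield 0 because counts is still a Counter.
        csv.writer(out).writerow([counts[c] for c in punctuation])
    return counts
```

Calling it once per book, for example append_punctuation_row("novel.txt", "author.csv"), adds one row per book, with columns in the order of string.punctuation.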

【Comments】:

【Solution 4】:

In [1]: from string import punctuation

In [2]: from collections import Counter

In [3]: counts = Counter(open('novel.txt').read())

In [4]: punctuation_counts = {k:v for k, v in counts.items() if k in punctuation}

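【Comments】:

The only real problem I have with this is that you load the entire novel into memory at once with open('novel.txt').read(). I can imagine that making this a memory-intensive operation for any novel of average size.
@jamylak, the entire King James Bible is only a few megabytes (4.4 MB uncompressed).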
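
If memory does become a concern, or the file contains very long lines, the counts can also be accumulated from fixed-size chunks. A sketch, assuming a UTF-8 text file; `count_in_chunks` and the 64 KiB default chunk size are arbitrary choices made for illustration.

```python
from collections import Counter
from functools import partial
from string import punctuation


def count_in_chunks(path, chunk_size=64 * 1024):
    """Count punctuation while holding at most chunk_size characters in memory."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        # iter(callable, sentinel) keeps calling f.read(chunk_size)
        # until it returns the empty string at end of file.
        for chunk in iter(partial(f.read, chunk_size), ""):
            counts.update(c for c in chunk if c in punctuation)
    return counts
```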

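【Solution 5】:

The code you have is very close to what you would need if you were counting words. In that case, the only modification you would probably need is to change the last line to: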

writer.writerows(counts.items())

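Unfortunately, you're not counting words here. If you're looking for counts of individual characters, I would avoid regex and use the string's count method directly. Your code might look like this: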

book_text = open(cwd+"/"+book).read()
counts = {}
for character in texts:
    counts[character] = book_text.count(character)
writer.writerows(counts.items())

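As you can probably tell, this builds a dictionary with the characters as keys and the number of times each character appears in the text as values. We then write it out the same way we would for word counts.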
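
A self-contained version of this idea might look like the following sketch. The helper name `punctuation_rows` and the in-memory buffer standing in for the CSV file are assumptions. Note that str.count scans the whole text once per symbol, which is fine for the 32 characters in string.punctuation.

```python
import csv
import io
from string import punctuation


def punctuation_rows(book_text, symbols=punctuation):
    """Return (symbol, count) pairs computed with str.count, one pair per symbol."""
    return [(ch, book_text.count(ch)) for ch in symbols]


buffer = io.StringIO()  # stands in for an open CSV file
csv.writer(buffer).writerows(punctuation_rows("Who? Me? Yes.", symbols="?.!"))
print(buffer.getvalue())
```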

【Comments】: