How to find the count of a word in a string? - Answers

【Question title】: How to find the count of a word in a string?
【Posted】: 2012-07-03 06:19:21
【Question】:

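I have the string "Hello I am going to I with hello am". I want to find out how many times a word occurs in the string. For example, hello occurs 2 times. I tried this approach, but it only prints characters: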

def countWord(input_string):
    d = {}
    for word in input_string:
        try:
            d[word] += 1
        except:
            d[word] = 1

    for k in d.keys():
        print "%s: %d" % (k, d[k])
print countWord("Hello I am going to I with Hello am")

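I want to learn how to count the words.

【Comments】:

Are Hello and hello the same word?
Depending on your use case, there is one more thing you may need to consider: some words change meaning depending on their capitalization, for example Polish and polish. It may not matter to you, but it is worth keeping in mind.
Could you define the data set more precisely for us? Are you concerned about punctuation in words such as I'll and don't (some of this is raised in the comments below), and about the difference between upper and lower case?

Tags: python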

【Solution 1】:

If you only want the count of a single word, just use count:

input_string.count("Hello")

Use collections.Counter and split() to count all of the words:

from collections import Counter

words = input_string.split()
wordCount = Counter(words)
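
As a rough sketch of how the loop from the question could be repaired (written for Python 3; the name `count_words` is illustrative and not from the original post): iterating over a string yields single characters, so the string is split into words first. The last line checks that the hand-written loop agrees with Counter.

```python
from collections import Counter


def count_words(input_string):
    """Return a dict mapping each whitespace-separated word to its count."""
    counts = {}
    # split() breaks the string on whitespace; iterating over the string
    # itself would yield single characters instead of words.
    for word in input_string.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


sentence = "Hello I am going to I with Hello am"
manual = count_words(sentence)
print(manual["Hello"])  # 2

# Counter builds the same mapping in one call.
assert manual == dict(Counter(sentence.split()))
```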

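【Comments】:

Is the collections module part of the base Python installation?
I am copying part of a comment @DSM left for me, because I also used str.count() as my initial solution. It has a problem: "am ham".count("am") yields 2 rather than 1.
@Varun: I believe collections is in Python 2.4 and later.
@Levon: You are right. I believe using Counter together with a regular-expression word collector is probably the best option. Will edit the answer accordingly.
Well .. credit goes to @DSM, who first made me aware of this (since I had also used str.count())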
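
The substring problem raised in these comments can be illustrated with a small sketch (Python 3). The helper name `count_whole_word` is hypothetical, and the approach assumes the search word consists of ordinary word characters, so that the regex word boundary `\b` behaves as expected.

```python
import re


def count_whole_word(text, word):
    """Count occurrences of `word` that are not part of a longer word."""
    # \b matches a word boundary; re.escape guards against regex
    # metacharacters in the search word.
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text))


print("am ham".count("am"))              # 2: str.count matches substrings
print(count_whole_word("am ham", "am"))  # 1: whole-word matches only
```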

【Solution 2】:

Counter from collections is your friend:

>>> from collections import Counter
>>> counts = Counter(sentence.lower().split())

【Comments】:

【Solution 3】:

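from collections import Counter
import re

def countWords(text):
    # Lowercase the text, then collect runs of word characters and apostrophes
    return Counter(re.findall(r"[\w']+", text.lower()))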

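Using re.findall is more versatile than split: the pattern drops surrounding punctuation while keeping contractions such as "don't" and "I'll" as single words.

Demo (using your example):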

>>> countWords("Hello I am going to I with hello am")
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

If you expect to make many such queries, this does the O(N) work only once, rather than O(N * #queries) work.
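
A minimal sketch of that "count once, query many times" idea, in Python 3. The query words are made up for illustration.

```python
import re
from collections import Counter

text = "Hello I am going to I with hello am"

# One O(N) pass over the text builds the whole table.
counts = Counter(re.findall(r"[\w']+", text.lower()))

# Each later query is a dictionary lookup; a Counter returns 0 for
# words it has never seen instead of raising KeyError.
for query in ("hello", "am", "goodbye"):
    print(query, counts[query])
```

Calling input_string.count(word) for every query would instead rescan the whole text each time.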

【Comments】:

+1 for re. The split solution does not work for phrases that contain punctuation.
This is the best answer for me, +1

【Solution 4】:

A vector of word occurrence counts is called a bag-of-words.

Scikit-learn provides a nice module to compute it, sklearn.feature_extraction.text.CountVectorizer. Example:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             min_df=0,
                             max_features=50)

text = ["Hello I am going to I with hello am"]

# Count
train_data_features = vectorizer.fit_transform(text)
vocab = vectorizer.get_feature_names()

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features.toarray(), axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print count, tag

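Output (the one-letter word "I" is absent because CountVectorizer's default token pattern only keeps tokens of two or more characters):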

2 am
1 going
2 hello
1 to
1 with

Part of the code is taken from the Kaggle tutorial on bag-of-words.

FYI: How to use sklearn's CountVectorizerand() to get ngrams that include any punctuation as separate tokens?
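
For readers without scikit-learn at hand, the same counts can be approximated with the standard library alone. This is only a sketch in Python 3: the regular expression is an assumption chosen to imitate a tokenizer that keeps words of two or more characters, and it is not claimed to reproduce CountVectorizer in every case.

```python
import re
from collections import Counter

text = "Hello I am going to I with hello am"

# Assumption: keep only tokens of two or more word characters, so the
# one-letter word "I" is dropped.
tokens = re.findall(r"\b\w\w+\b", text.lower())
counts = Counter(tokens)

# The sorted vocabulary fixes the order of the vector's components.
vocab = sorted(counts)
vector = [counts[word] for word in vocab]

for word, count in zip(vocab, vector):
    print(count, word)
```

Here `vector` is the bag-of-words representation of the sentence with respect to `vocab`.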

【Comments】:

【Solution 5】:

To treat Hello and hello as the same word, regardless of case:

>>> from collections import Counter
>>> strs="Hello I am going to I with hello am"
>>> Counter(map(str.lower,strs.split()))
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

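【Comments】:

I would go with Counter(strs.lower().split()). It cuts some overhead and runs faster.
Isn't this just Martijn Pieters's solution?
@DSM Somehow I did not see his solution; I have rolled mine back to the original version. :)

【Solution 6】:

Here is another case-insensitive approach: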

sum(1 for w in s.lower().split() if w == 'Hello'.lower())
2

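It matches by converting both the string and the target word to lowercase.

ps: this handles the "am ham".count("am") == 2 problem of str.count(), which @DSM also pointed out below :)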
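
The one-liner above can be wrapped in a function. The following is a sketch in Python 3; the name `count_word_ci` and the punctuation stripping are additions for illustration and are not part of the original answer.

```python
import string


def count_word_ci(text, target):
    """Case-insensitive count of whole words equal to `target`."""
    target = target.lower()
    return sum(
        1
        for w in text.lower().split()
        # Strip leading/trailing punctuation so "hello," still matches.
        if w.strip(string.punctuation) == target
    )


print(count_word_ci("Hello, I am going to I with hello am", "Hello"))  # 2
print(count_word_ci("am ham", "am"))                                   # 1
```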

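【Comments】:

Using count on its own can give unexpected results: "am ham".count("am") == 2.
@DSM .. good point .. I was not happy with that solution anyway because it is case-sensitive, and I am now looking for an alternative...

【Solution 7】:

You can split the string into its elements and count them; this gives the total number of words in the string: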

count = len(my_string.split())
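
To make the distinction concrete, here is a short sketch (Python 3, using the sentence from the question) that contrasts the total number of words with the number of distinct words and with the count of one particular word.

```python
my_string = "Hello I am going to I with hello am"
words = my_string.split()

total_words = len(words)            # every token, repeats included
distinct_words = len(set(words))    # each spelling counted once
hello_count = words.count("hello")  # exact, case-sensitive matches

print(total_words, distinct_words, hello_count)  # 9 7 1
```

Note that list.count compares whole elements, so it does not suffer from the substring issue of str.count, but it remains case-sensitive.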

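【Comments】:

Code-only answers are considered low quality: make sure to provide an explanation of what your code does and how it solves the problem. It will help the asker and future readers if you can add more information to your post. See also Explaining entirely code-based answers: meta.stackexchange.com/questions/114762/…

【Solution 8】:

You can use Python's regular expression library, re, to find all matches of a substring and get them back as a list.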

import re

input_string = "Hello I am going to I with Hello am"

print(len(re.findall('hello', input_string.lower())))

Prints:

2

【Comments】:

【Solution 9】:

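def countSub(pat, string):
    # Count the positions where pat occurs in string (overlaps included)
    result = 0
    for i in range(len(string) - len(pat) + 1):
        for j in range(len(pat)):
            if string[i + j] != pat[j]:
                break  # mismatch: pat does not start at position i
        else:
            # inner loop finished without a break: full match at i
            result += 1
    return result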
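
The function above counts every position at which the pattern matches, so overlapping occurrences are included, and like str.count it matches substrings rather than whole words. A sketch with the same intent built on str.find (Python 3; the name `count_overlapping` and the handling of an empty pattern are assumptions made for illustration):

```python
def count_overlapping(pat, text):
    """Count occurrences of `pat` in `text`, allowing overlaps."""
    if not pat:
        return 0  # assumption: an empty pattern counts as no match
    count = 0
    start = text.find(pat)
    while start != -1:
        count += 1
        # Resume one character later so overlapping matches are found.
        start = text.find(pat, start + 1)
    return count


print("aaa".count("aa"))               # 1: str.count skips overlaps
print(count_overlapping("aa", "aaa"))  # 2: overlaps are counted
```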

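【Comments】:

Hello, and welcome to SO. Your answer contains only code. It would be better if you could also add some commentary explaining what it does and how. Could you please edit your answer and add that? Thanks!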