Count individual words in a Pandas data frame

【Question title】: Count individual words in Pandas data frame
【Posted】: 2015-10-20 16:16:56
【Question】:

I am trying to count the individual words in a column of my data frame. It looks like this. In reality, the texts are tweets.

text
this is some text that I want to count
That's all I wan't
It is unicode text

From other Stack Overflow questions I found that I could use the following:

Count most frequent 100 words from sentences in Dataframe Pandas

Count distinct words from a Pandas Data Frame

My df is called result, and this is my code:

from collections import Counter
result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
result2

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-6-2f018a9f912d> in <module>()
      1 from collections import Counter
----> 2 result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
      3 result2
TypeError: sequence item 25831: expected str instance, float found

The dtype of text is object, which as I understand it is correct for unicode text data.

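【Comments on the question】:

Apparently your data frame contains float values. What do you want to do with them? Do you want to count them as well?
Since these texts should all be tweets, I want to count them too. If this column also contains float values, does that mean some of the tweets I collected are just numbers? (Which makes me curious about which ones are floats.)
Yes, that is possible.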
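One way to see which rows are not strings is to test each cell's Python type. The sketch below is only an illustration: the sample frame is a hypothetical stand-in for `result`, since the real tweets are not shown. Note that pandas stores missing values as NaN, which is itself a float, so empty tweets are a plausible (but unconfirmed) source of the floats in the error message.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's `result` frame; the real data is not shown.
result = pd.DataFrame(
    {"text": ["this is some text", np.nan, "It is unicode text", 10.1]}
)

# True where the cell holds something other than a Python str
is_not_str = result["text"].map(lambda v: not isinstance(v, str))

# The offending rows, keeping their original index labels
non_strings = result.loc[is_not_str, "text"]
print(non_strings)

# How many non-string cells there are of each Python type
print(non_strings.map(type).value_counts())
```

The index labels in `non_strings` correspond to the positions reported by the traceback (for example "sequence item 25831") when the frame has a default integer index.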

Tags: python pandas ipython

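【Answer 1】:

The problem occurs because some of the values in your Series (result['text']) are of type float. If you want them included in the ' '.join() as well, you need to convert the floats to strings before passing them to str.join().

You can use Series.astype() to convert all values to strings. Also, you don't really need .tolist(); you can pass the Series directly to str.join(). Example: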

result2 = Counter(" ".join(result['text'].astype(str)).split(" ")).items()

Demo:

In [60]: df = pd.DataFrame([['blah'],['asd'],[10.1]],columns=['A'])

In [61]: df
Out[61]:
      A
0  blah
1   asd
2  10.1

In [62]: ' '.join(df['A'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-62-77e78c2ee142> in <module>()
----> 1 ' '.join(df['A'])

TypeError: sequence item 2: expected str instance, float found

In [63]: ' '.join(df['A'].astype(str))
Out[63]: 'blah asd 10.1'

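【Comments】:

Thanks, that seems to work. The output is now in a dict. Does it make sense to move it back into a pandas data frame, or to somehow keep working in the df?
It depends on what you plan to do with it. But my guess is that if you intend to do some kind of analysis, a dataframe would be faster.
A generic answer to a generic question :D I'll ask a new question when I have a specific one. Thanks for your help!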
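For the question of moving the counts back into pandas, one possible approach is to build a two-column data frame from `Counter.most_common()`. This is a sketch rather than anything proposed in the thread: the sample Series and the column names `word` and `count` are assumptions made for illustration.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample; in the thread this would be result['text']
text = pd.Series(["this is some text", "this is more text", 10.1])

# Same idea as the answer: cast to str, join, split, count
counts = Counter(" ".join(text.astype(str)).split())

# most_common() returns (word, count) pairs sorted by descending count
counts_df = pd.DataFrame(counts.most_common(), columns=["word", "count"])
print(counts_df.head())
```

From here the usual data frame tools apply, such as filtering on `count` or merging with other per-word data.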

【Answer 2】:

In the end I used the following code:

pd.set_option('display.max_rows', 100)
words = pd.Series(' '.join(result['text'].astype(str)).lower().split(" ")).value_counts()[:100]
words

However, it was Anand S Kumar who solved the problem.
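A variation on the same idea keeps everything inside pandas string methods. It is a sketch under a few assumptions: the sample frame is hypothetical, missing rows are dropped with `dropna()` instead of being cast to strings (a departure from the accepted answer), and `Series.explode()` requires pandas 0.25 or later. Calling `str.split()` without an argument splits on any run of whitespace, so repeated spaces do not produce empty "words" the way `split(" ")` does.

```python
import pandas as pd

# Hypothetical sample with a doubled space and a missing tweet
result = pd.DataFrame(
    {"text": ["This is some text", "this  is more text", None]}
)

words = (
    result["text"]
    .dropna()        # skip missing tweets instead of converting them to text
    .astype(str)     # any remaining non-string values become strings
    .str.lower()
    .str.split()     # no argument: split on any run of whitespace
    .explode()       # one row per word
    .value_counts()
    .head(100)       # keep the 100 most frequent words
)
print(words)
```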
