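Count number of words per row: answers

【Question title】: Count number of words per row
【Posted】: 2018-10-03 17:23:43
【Question description】:

I'm trying to create a new column in a DataFrame that contains the word count of the corresponding row. I want the total number of words, not the frequency of each distinct word. I assumed there would be a simple, quick way to do this common task, but after googling and reading a handful of SO posts (1, 2, 3, 4) I'm stuck. I have tried the solutions proposed in the linked SO posts, but I get a lot of attribute errors.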

words = df['col'].split()
df['totalwords'] = len(words)

results in

AttributeError: 'Series' object has no attribute 'split'

and

f = lambda x: len(x["col"].split()) -1
df['totalwords'] = df.apply(f, axis=1)

results in

AttributeError: ("'list' object has no attribute 'split'", 'occurred at index 0')

【Question comments】:

Tags: python string python-3.x pandas dataframe

【Solution 1】:

`str.split` + `str.len`

str.len works on any non-numeric column.

df['totalwords'] = df['col'].str.split().str.len()
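To make the behavior of this one-liner concrete, here is a minimal sketch on made-up data (the sample sentences and the `tokens` Series are assumptions for illustration, not the asker's data). It also covers the case hinted at by the second error in the question, where the cells already hold lists rather than strings.

```python
import pandas as pd

# Hypothetical sample data; "col" mirrors the column name used in the question.
df = pd.DataFrame({"col": ["This is one sentence", "and  another", None]})

# Split each string on runs of whitespace, then take the length of each list.
df["totalwords"] = df["col"].str.split().str.len()

# Missing values propagate as NaN, so the result column is float here.
print(df)

# If the cells already hold lists of tokens (which the
# "'list' object has no attribute 'split'" error in the question suggests),
# str.len() alone gives the count and no split is needed.
tokens = pd.Series([["This", "is", "one", "sentence"], ["and", "another"]])
token_counts = tokens.str.len()
print(token_counts.tolist())
```

Because `str.split()` with no arguments splits on any run of whitespace, the double space in the second row does not inflate the count.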

`str.count`

If your words are separated by single spaces, you can simply count the spaces and add 1.

df['totalwords'] = df['col'].str.count(' ') + 1
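The single-space condition matters. The sketch below, on made-up strings, compares the space count with the split-based count. The third variant, `str.count(r"\S+")`, is not from the answer: it is a swapped-in alternative that counts runs of non-whitespace and so tolerates irregular spacing.

```python
import pandas as pd

# Hypothetical inputs: clean, double-spaced, padded, and empty strings.
s = pd.Series(["one two", "one  two", " one two ", ""])

space_based = s.str.count(" ") + 1     # counts separators, not words
split_based = s.str.split().str.len()  # splits on runs of whitespace
regex_based = s.str.count(r"\S+")      # counts runs of non-whitespace

print(pd.DataFrame({
    "text": s,
    "space_based": space_based,
    "split_based": split_based,
    "regex_based": regex_based,
}))
```

Only the first row agrees across all three columns; the space-based count overshoots whenever there are extra, leading, or trailing spaces, and it reports 1 for an empty string.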

List comprehension

This is faster than you'd think!

df['totalwords'] = [len(x.split()) for x in df['col'].tolist()]
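If you want to check the speed claim on your own data, a rough comparison could look like the sketch below. The sample frame, the helper function names, and the repeat count are all assumptions; timings vary with the machine, the pandas version, and the data, so no numbers are shown.

```python
import timeit

import pandas as pd

# Hypothetical data: 2,000 short strings, all non-missing.
df = pd.DataFrame({"col": ["This is one sentence", "and another"] * 1000})


def with_str_methods():
    return df["col"].str.split().str.len().tolist()


def with_list_comprehension():
    return [len(x.split()) for x in df["col"].tolist()]


# Both approaches should agree on all-string data.
assert with_str_methods() == with_list_comprehension()

# Print the total time for 20 runs of each approach.
for fn in (with_str_methods, with_list_comprehension):
    print(fn.__name__, timeit.timeit(fn, number=20))
```

One trade-off to keep in mind: the list comprehension calls `x.split()` on every value, so it raises AttributeError on missing values (NaN is a float), whereas the `str` methods pass NaN through.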

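【Comments】:

@lucid_dreamer It splits each string on whitespace into a list of words, then returns the length of each list.
But why not just df['totalwords'] = df['col'].str.split().len()?
@lucid_dreamer Because that isn't correct? There is no len() method defined on a Series.
@lucid_dreamer Take a look at Working with Text Data; pandas defines a suite of str methods on object dtype columns, some of which (e.g. str.len() and str.count()) work on arbitrary containers.
The list comprehension is much faster. It took me 1 minute, whereas the split+len approach took 3 minutes.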

【Solution 2】:

Here is one way using .apply():

df['number_of_words'] = df.col.apply(lambda x: len(x.split()))

Example

Given this df:

>>> df
                    col
0  This is one sentence
1           and another

After applying .apply():

df['number_of_words'] = df.col.apply(lambda x: len(x.split()))

>>> df
                    col  number_of_words
0  This is one sentence                4
1           and another                2

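Note: as pointed out in the comments and in this answer, .apply is not necessarily the fastest method. If speed matters, it is better to use one of @cᴏʟᴅsᴘᴇᴇᴅ's methods.

【Comments】:

apply = a slower version of a loop. This is a good idea, but something like [len(x.split()) for x in df['col']] would be noticeably faster. It is in my answer, but feel free to add it to yours as well.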

【Solution 3】:

Here is one way using pd.Series.str.split and pd.Series.map:

df['word_count'] = df['col'].str.split().map(len)

The above assumes df['col'] is a series of strings.

Example:

df = pd.DataFrame({'col': ['This is an example', 'This is another', 'A third']})

df['word_count'] = df['col'].str.split().map(len)

print(df)

#                   col  word_count
# 0  This is an example           4
# 1     This is another           3
# 2             A third           2
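The strings-only assumption matters because `str.split()` leaves a missing value as NaN, and `len()` then fails on that float. One possible workaround, sketched here on made-up data and not part of the original answer, is the `na_action="ignore"` option of `Series.map`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in the middle.
df = pd.DataFrame({"col": ["This is an example", np.nan, "A third"]})

# str.split() leaves the missing value as NaN; na_action="ignore" tells
# map() to skip it instead of calling len() on a float.
df["word_count"] = df["col"].str.split().map(len, na_action="ignore")

print(df)
```

The missing row stays NaN in `word_count`, which also makes the column float rather than integer.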

【Comments】:

【Solution 4】:

Using list and map, with the data from cold:

list(map(lambda x : len(x.split()),df.col))
Out[343]: [4, 3, 2]

【Comments】:
