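Find length of longest string in Pandas dataframe column: answers

【Question title】: Find length of longest string in Pandas dataframe column
【Posted】: 2014-01-22 22:26:49
【Question】:

Is there a faster way to find the length of the longest string in a Pandas DataFrame than the one shown in the example below?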

import numpy as np
import pandas as pd

x = ['ab', 'bcd', 'dfe', 'efghik']
x = np.repeat(x, int(1e7))  # the repeat count must be an integer
df = pd.DataFrame(x, columns=['col1'])

print(df.col1.map(lambda x: len(x)).max())
# result --> 6

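When timed with IPython's %timeit, df.col1.map(lambda x: len(x)).max() takes about 10 seconds.

【Comments】:

You can save some time by simply using map(len); the lambda only wastes time here. My guess is around 25%.

Tags: python pandas

【Solution 1】:

DSM's suggestion seems to be about the best you will get without doing some manual micro-optimization: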

%timeit -n 100 df.col1.str.len().max()
100 loops, best of 3: 11.7 ms per loop

%timeit -n 100 df.col1.map(lambda x: len(x)).max()
100 loops, best of 3: 16.4 ms per loop

%timeit -n 100 df.col1.map(len).max()
100 loops, best of 3: 10.1 ms per loop

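Note that explicitly using the str.len() method does not seem to be much of an improvement. If you are not familiar with IPython, which is where that very convenient %timeit syntax comes from, I would definitely recommend giving it a try for quickly testing things like this.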
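
For anyone without IPython at hand, a rough equivalent of the comparison above can be sketched with the standard-library timeit module. The sample size, the repeat counts and the names candidates and results are assumptions chosen for illustration, and the timings will differ from the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller sample than in the question so the comparison finishes quickly.
values = np.repeat(['ab', 'bcd', 'dfe', 'efghik'], 10_000)
df = pd.DataFrame(values, columns=['col1'])

# The three approaches discussed in this thread.
candidates = {
    'str.len': lambda: df.col1.str.len().max(),
    'map(lambda)': lambda: df.col1.map(lambda s: len(s)).max(),
    'map(len)': lambda: df.col1.map(len).max(),
}

results = {}
for name, func in candidates.items():
    results[name] = func()
    # Best of 3 repeats, 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f'{name}: {best * 1e3:.2f} ms per call, result {results[name]}')
```

All three variants should report the same maximum length; only the per-call time is expected to vary between them and between machines.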

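【Comments】:

I came to the same conclusion when DSM commented about map(len). That is a reduction of about 40% compared with the map(lambda x: len(x)) approach.
One point worth making is that the str.len method copes with NaN and similar missing values.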
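
To illustrate the point about missing values, here is a minimal sketch on a made-up three-element Series (the data and variable names are assumptions, not from the question). It relies on str.len() returning NaN for a missing entry and on max() skipping NaN by default.

```python
import numpy as np
import pandas as pd

# Object-dtype column with one missing value.
s = pd.Series(['ab', np.nan, 'efghik'], dtype=object)

# str.len() propagates the missing value as NaN, and max() skips it.
longest = s.str.len().max()
print(longest)

# map(len) calls len() on the float NaN, which raises TypeError.
try:
    s.map(len).max()
    map_len_failed = False
except TypeError as exc:
    map_len_failed = True
    print('map(len) failed:', exc)
```

Because of the NaN, the result of str.len().max() is a float here; wrap it in int() if an integer is needed.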

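【Solution 2】:

Sometimes you want the length of the longest string in bytes. This matters for strings that contain fancy Unicode characters, in which case the length in bytes is greater than the regular length. It can be very relevant in specific situations, e.g. for database writes.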

col_bytes_len = int(df[col_name].astype(bytes).str.len().max())

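Notes:

Using astype(bytes) is more reliable than using str.encode(encoding='utf-8'). This is because astype(bytes) also works correctly with a column of mixed dtypes.
The output is wrapped in int() because it is otherwise a numpy object.
If you get an encoding error, then instead of df[col_name].astype(bytes), consider:
- df[col_name].str.encode('utf-8')
- df[col_name].str.encode('ascii', errors='backslashreplace') (last choice)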
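
As a small illustration of the difference between character length and byte length, the sketch below uses the str.encode('utf-8') fallback from the notes, since the sample contains non-ASCII text. The sample strings and the names char_len and byte_len are assumptions made up for this example.

```python
import pandas as pd

# 'häß' is 3 characters but 5 bytes in UTF-8; '日本語' is 3 characters but 9 bytes.
s = pd.Series(['abc', 'häß', '日本語'])

# Longest string measured in characters.
char_len = int(s.str.len().max())

# Longest string measured in UTF-8 encoded bytes.
byte_len = int(s.str.encode('utf-8').str.len().max())

print(char_len, byte_len)
```

When sizing a database column defined in bytes, the second number is the one that matters.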

【Comments】:

I got this error: "UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 126: ordinal not in range(128)"
@BlueClouds I have now added extra notes to the answer. Please give them a try.

【Solution 3】:

import pandas as pd
import numpy as np

x = ['ab', 'bcd', 'dfe', 'efghik']
x = np.repeat(x, 10)
df = pd.DataFrame(x, columns=['col1'])

# get longest string index from column
indx = df["col1"].str.len().idxmax()

# get longest string value
df["col1"][indx] # <---------------------

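【Comments】:

【Solution 4】:

Excellent answers, especially those from Marius and Ricky, which were very helpful.

Given that most of us are optimizing for coding time, here is a quick extension of those answers that returns the maximum item length of every column as a Series, sorted by the maximum item length per column: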

mx_dct = {c: df[c].map(lambda x: len(str(x))).max() for c in df.columns}
pd.Series(mx_dct).sort_values(ascending =False)

Or as a one-liner:

pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df}).sort_values(ascending =False)

Adapting the original sample, this can be demonstrated as:

import pandas as pd

x = [['ab', 'bcd'], ['dfe', 'efghik']]
df = pd.DataFrame(x, columns=['col1','col2'])

print(pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df}).sort_values(ascending =False))

Output:

col2    6
col1    3
dtype: int64

【Comments】:

【Solution 5】:

As a minor addition, you might want to loop through all object columns in the dataframe:

for c in df:
    if df[c].dtype == 'object':
        print('Max length of column %s: %s\n' %  (c, df[c].map(len).max()))

This prevents errors from being raised by bool, int and similar types.

It could be extended to other non-numeric types such as 'string_' and 'unicode_', i.e.

if df[c].dtype in ('object', 'string_', 'unicode_'):

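【Comments】:

However, if the dataframe contains null values represented as NaN, you will get the following error: object of type 'float' has no len(). A-B-B's answer above converts the values to str to accommodate this.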
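
One possible way around that error, sketched here under assumptions, is to combine the per-column loop with str.len(), which returns NaN for missing entries instead of raising. The sample frame and the name max_lengths are made up for illustration, and select_dtypes(include='object') stands in for the explicit dtype check.

```python
import numpy as np
import pandas as pd

# Made-up frame: one object column containing NaN, plus int and bool columns.
df = pd.DataFrame({
    'name': pd.Series(['ab', np.nan, 'efghik'], dtype=object),
    'count': [1, 2, 3],
    'flag': [True, False, True],
})

max_lengths = {}
# Only object columns are visited, so int and bool columns are skipped.
for c in df.select_dtypes(include='object'):
    # str.len() yields NaN for missing values and max() ignores them.
    max_lengths[c] = int(df[c].str.len().max())

print(max_lengths)
```

Note that int() would fail on a column that is entirely NaN, so such columns would need separate handling.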