python - pandas select by partial string ValueError

【Question title】: python - pandas select by partial string ValueError
【Posted】: 2015-11-04 15:13:05
【Question】:

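I have a csv that I load into a DataFrame. I only need the rows whose Organization column contains the target string affiliation.

When I try to use str.contains(), I get ValueError: cannot index with vector containing NA / NaN values.

I looked at "Value Error when Slicing in Pandas" and "pandas + dataframe - select by partial string", and both of the following solutions work for me: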

df = df[df['Organization'].str.contains(affiliation)==True]

or

df = df[df['Organization'].str.contains(affiliation).fillna(False)]

However, as a test, I did this:

print(len(df))  # 99228
df = df[pd.notnull(df['Organization'])]  # or df = df.dropna(subset=['Organization'])
print(len(df))  # 99228
df = df[df['Organization'].str.contains(affiliation).fillna(False)]
print(len(df))  # 1605

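My question is: the ValueError raised when neither ==True nor fillna(False) is appended to str.contains() seems to imply that the Organization column has NaNs. But then why do I get a df of the same size after keeping only the rows with a non-null Organization? What am I missing here?

Thanks!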

【Comments】:

Tags: python pandas

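【Answer 1】:

Check the contents of your Organization column. It most likely contains strings mixed with other data types. For the values that are not strings, df['Organization'].str.contains(affiliation) returns NaN, even though those values are not null themselves. You cannot index with NaN, so it has to be converted to False.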
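To illustrate the point above, here is a minimal sketch. The sample data and the value of affiliation are made up for illustration, and it assumes the column has object dtype, as a column read from a csv with mixed content typically would. It shows that a non-null, non-string value passes the notnull() filter yet still produces NaN from str.contains(), and one possible way to locate such values.

```python
import pandas as pd

affiliation = "affiliation"  # hypothetical target string

# Hypothetical sample: one value in the column is an int, not a string.
df = pd.DataFrame(
    {"Organization": pd.Series(["Acme affiliation", 12345, "Other Org"], dtype=object)}
)

# No value is null, so filtering on notnull() keeps every row.
non_null = df[pd.notnull(df["Organization"])]

# str.contains() cannot test the non-string value and yields NaN for it.
mask = df["Organization"].str.contains(affiliation)

# Locate the values that are not strings; these rows are the source of the NaN.
not_str = df[~df["Organization"].map(lambda v: isinstance(v, str))]

print(len(df), len(non_null))
print(mask.isna().sum())
print(not_str)
```

If the non-string values should also be searched, converting the column with astype(str) before calling str.contains() is one option, though that would also turn real missing values into the text 'nan'.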

【Comments】:

【Answer 2】:

You need to specify str.contains('affiliation', na=False). [docs]

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: s = pd.Series(['foo','bar',np.nan,'this contains affiliation','baz',np.nan])

In [4]: s.str.contains('affiliation')
Out[4]:
0    False
1    False
2      NaN
3     True
4    False
5      NaN
dtype: object

In [5]: s.str.contains('affiliation', na=False)
Out[5]:
0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool

You can then use that boolean array to index your DataFrame.
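As a rough sketch of that last step, assuming a small made-up DataFrame (the id column and the sample values are hypothetical, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: Organization holds strings, a missing value, and an int.
df = pd.DataFrame(
    {
        "Organization": pd.Series(
            ["foo", np.nan, "this contains affiliation", 42], dtype=object
        ),
        "id": [1, 2, 3, 4],
    }
)

# na=False replaces every NaN in the result with False, giving a usable mask.
mask = df["Organization"].str.contains("affiliation", na=False)

# Boolean indexing keeps only the rows where the mask is True.
matches = df[mask]
print(matches)
```

Both the missing value and the non-string value end up as False in the mask, so neither row is selected and no ValueError is raised.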

【Comments】:

Thanks, but I already knew I needed to do that. What I don't understand is why this happens if it is not because of NaN.