How to avoid NaN when using the np.where function in Python?

【Question title】: How to avoid NaN when using np.where function in python?
【Posted】: 2020-01-17 16:07:48
【Question】:

I have a dataframe like this:

col1    col2   col3
1       apple   a,b 
2       car      c
3       dog     a,c
4       dog     NaN

I am trying to create three new columns, a, b and c, each set to 1 if col3 contains that letter and 0 otherwise.

df['a']= np.where(df['col3'].str.contains('a'),1,0)
df['b']= np.where(df['col3'].str.contains('b'),1,0)
df['c']= np.where(df['col3'].str.contains('c'),1,0)

But the NaN value does not seem to be handled correctly. This is the result I get:

col1  col2  col3    a   b   c
1    apple   a,b    1   1   0
2     car     c     0   0   1
3     dog    a,c    1   0   1
4     dog    NaN    1   1   1

The fourth row should be all 0. How can I change my code to get the correct result?

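【Comments on the question】:

Why not drop the NaNs (e.g. df = df.dropna()) before using np.where?
@Kapil That is one possibility, but it seems the OP wants to keep the frame structure and attach the parsed columns back to it, which would not work if dropna were done first.
Use df.join(df['col3'].str.get_dummies(','))
You clearly want get_dummies, but as for the question itself: NaN values are truthy, so don't rely on numpy's judgment. Fill explicitly at the end to avoid ambiguity: df.col3.str.contains('a').fillna(False)
The reason NaN counts as True can be found in the docs: only a very limited set of objects evaluate to False, and everything else is True.

Tags: python pandas numpy dataframe nan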
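
The truthiness point made in the last two comments can be checked directly. This is a minimal sketch, not part of the original thread; the object array `mask` is a hand-built stand-in for what str.contains yields on a column with one missing value.

```python
import numpy as np

# NaN is a non-zero float, so Python treats it as truthy
nan_is_truthy = bool(np.nan)
print(nan_is_truthy)

# Hand-built stand-in for a str.contains result: True/False for strings, NaN for the missing row
mask = np.array([True, False, True, np.nan], dtype=object)

# np.where converts the condition to booleans, so the NaN entry selects the "true" branch
naive = np.where(mask, 1, 0)
print(naive)
```

The last element comes out as 1, which is the unexpected row in the question.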

【Solution 1】:

You can use fillna(False). np.where treats its condition as a boolean mask, so once the NaN entries are filled with False, the rows that were NaN always get 0.

df['a']= np.where(df['col3'].str.contains('a').fillna(False),1,0)
df['b']= np.where(df['col3'].str.contains('b').fillna(False),1,0)
df['c']= np.where(df['col3'].str.contains('c').fillna(False),1,0)

Output:

   col1   col2 col3  a  b  c
0     1  apple  a,b  1  1  0
1     2    car    c  0  0  1
2     3    dog  a,c  1  0  1
3     4    dog  NaN  0  0  0
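
A possible variation, not taken from the answer: str.contains also accepts an `na` argument, so the missing row can be mapped to False at the source instead of being filled afterwards. The sketch below rebuilds the question's dataframe so that it runs on its own; the loop over letters is only a shorthand for the three assignments.

```python
import numpy as np
import pandas as pd

# The dataframe from the question, rebuilt by hand
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["apple", "car", "dog", "dog"],
    "col3": ["a,b", "c", "a,c", np.nan],
})

for letter in ["a", "b", "c"]:
    # na=False makes the missing row evaluate to False rather than NaN
    df[letter] = np.where(df["col3"].str.contains(letter, na=False), 1, 0)

print(df)
```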

【Discussion】:

【Solution 2】:

Here is what I would do:

s = df['col3'].str.get_dummies(sep=',')
s
Out[29]: 
   a  b  c
0  1  1  0
1  0  0  1
2  1  0  1
3  0  0  0
df = pd.concat([df, s], axis=1)
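
For anyone who wants to run this end to end, here is a self-contained sketch of the same idea. It rebuilds the question's dataframe by hand and uses df.join, as suggested in the comments on the question, in place of pd.concat; the name `out` is just illustrative.

```python
import numpy as np
import pandas as pd

# The dataframe from the question, rebuilt by hand
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["apple", "car", "dog", "dog"],
    "col3": ["a,b", "c", "a,c", np.nan],
})

# Split col3 on commas and build one 0/1 indicator column per distinct token;
# the row with a missing value gets 0 in every indicator column
dummies = df["col3"].str.get_dummies(sep=",")

# Attach the indicator columns to the original frame by index
out = df.join(dummies)
print(out)
```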

【Discussion】:

Both this answer and user3483203's comment are great, but if the column is not delimited and str.contains is really what's needed, this won't work :(
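
To illustrate the limitation raised in this comment, here is a small sketch on hypothetical data: the same letters as in the question, but written without a separator. With nothing to split on, get_dummies treats each whole string as one token, whereas a substring test with str.contains still recovers the per-letter flags.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of col3 with no separator between the letters
df = pd.DataFrame({"col3": ["ab", "c", "ac", np.nan]})

# There is no comma to split on, so each whole string becomes its own column
dummies = df["col3"].str.get_dummies(sep=",")
print(list(dummies.columns))

for letter in ["a", "b", "c"]:
    # regex=False does a literal substring test; na=False maps the missing row to False
    df[letter] = df["col3"].str.contains(letter, na=False, regex=False).astype(int)

print(df)
```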