警告:小心使用 NaN
请务必小心,如果您的数据有缺失值,np.where 可能难以使用,并且可能会无意中给您错误的结果。
考虑这种情况:
df['cons_ener_cat'] = np.where(df.consumption_energy > 400, 'high',
(np.where(df.consumption_energy < 200, 'low', 'medium')))
# if we do not use this second line, then
# if consumption energy is missing it would be shown medium, which is WRONG.
df.loc[df.consumption_energy.isnull(), 'cons_ener_cat'] = np.nan
或者,您可以使用一个嵌套的np.where 用于中等而不是难看的 nan。
恕我直言,最好的方法是pd.cut。它处理 NaN 并且易于使用。
示例:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')
# pd.cut
df['age_cat'] = pd.cut(df.age, [0, 20, 60, np.inf], labels=['child','medium','old'])
# manually add another line for nans
df['age_cat2'] = np.where(df.age > 60, 'old', (np.where(df.age <20, 'child', 'medium')))
df.loc[df.age.isnull(), 'age_cat'] = np.nan
# multiple nested where
df['age_cat3'] = np.where(df.age > 60, 'old',
(np.where(df.age <20, 'child',
np.where(df.age.isnull(), np.nan, 'medium'))))
# outptus
print(df[['age','age_cat','age_cat2','age_cat3']].head(7))
age age_cat age_cat2 age_cat3
0 22.0 medium medium medium
1 38.0 medium medium medium
2 26.0 medium medium medium
3 35.0 medium medium medium
4 35.0 medium medium medium
5 NaN NaN medium nan
6 54.0 medium medium medium