Replacing values outside column-specific ranges of multiple columns in a pandas data frame: answers

【Question title】: Replacing values outside column-specific ranges of multiple columns in a pandas data frame
【Posted】: 2020-08-08 01:07:45
【Question】:

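I'm new to pandas and I want to clean a data frame that has a large number of columns.

I want to keep only the values that fall within a column-specific range. For example, for a column named 'Age', I want to keep values greater than 5 and less than 25. If a value falls outside that range, I want to replace it with NaN; for instance, the 'Age' column contains the value 918, which I want to replace.

In my attempt I used a dictionary because, as I said, I have a lot of columns. The code doesn't work: it doesn't actually change any values in my original data frame (and there is no error message).

Thanks for your help!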

# PACKAGES 
import pandas as pd
import numpy as np 


# STARTING DATA 
data = [[1.0, 10, 0], [0.0, 12, 0.4], [2.0, 918, 0.9]]   
df = pd.DataFrame(data, columns = ['TriGly', 'Age', 'Chol']) 

dict = {
    'Age': (5, 25),
    'Chol': (0.2, 1.2),
    'TriGly': (0.0, 1.0)
}


# CLEAN 
for column_name in df.columns:                                             
    if column_name in dict:                                                
        for row in df[column_name]:                                        
            if dict[column_name][0] < row < dict[column_name][1]:       
                row = row                                                   
            else:
                row = np.nan                                               

# DESIRED DATA 
data2 = [[1.0, 10, np.nan], [0.0, 12, 0.4], [np.nan, np.nan, 0.9]]   
df2 = pd.DataFrame(data2, columns = ['TriGly', 'Age', 'Chol'])

【Question comments】:

Tags: python pandas data-science data-cleaning

【Solution 1】:

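For each column, you can use .between(min_val, max_val) to identify the valid values; note that it includes both endpoints by default. You can then use .where to mask all other values as NaN. Finally, run this over the columns with a quick apply. Here my_dict is the question's range dictionary under a name that does not shadow the built-in dict, and every column of df is assumed to have an entry in it: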

df.apply(lambda x: x.where(x.between(*my_dict[x.name])))

Output:

   TriGly   Age  Chol
0     1.0  10.0   NaN
1     0.0  12.0   0.4
2     NaN   NaN   0.9
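
If some columns have no entry in the dictionary, the one-liner above would raise a KeyError. A minimal sketch of one way around that is to loop over the dictionary instead of using apply, and to assign each masked column back (the missing assignment is also why the loop in the question changes nothing: rebinding `row` never writes to the data frame). The `ID` column and the name `cleaned` are made up for illustration and are not part of the question:

```python
import numpy as np
import pandas as pd

# Data from the question, plus a hypothetical "ID" column that has no range.
df = pd.DataFrame(
    [[1.0, 10, 0], [0.0, 12, 0.4], [2.0, 918, 0.9]],
    columns=["TriGly", "Age", "Chol"],
)
df["ID"] = [101, 102, 103]

# (min, max) per column; named my_dict to avoid shadowing the built-in dict.
my_dict = {
    "Age": (5, 25),
    "Chol": (0.2, 1.2),
    "TriGly": (0.0, 1.0),
}

# Work on a copy so the original frame stays available for comparison.
cleaned = df.copy()
for name, (low, high) in my_dict.items():
    if name in cleaned.columns:
        in_range = cleaned[name].between(low, high)  # both endpoints included
        # Keep in-range values, turn the rest into NaN, and assign the result back.
        cleaned[name] = cleaned[name].where(in_range)

print(cleaned)
```

Columns without a range, such as `ID` here, are left untouched. If the bounds should be strict, as in the question's `<` comparisons, recent pandas versions (1.3 and later) accept `between(low, high, inclusive="neither")`; note that strict bounds would also mask the TriGly values 0.0 and 1.0, which the desired output keeps.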

【Comments】: