Delete row based on nulls in certain columns (pandas): answers

【Question title】: Delete row based on nulls in certain columns (pandas)
【Posted】: 2017-06-26 18:42:56
【Question】:

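I know how to drop a row from a DataFrame when it contains all nulls or a single null, but can you drop a row based on the nulls in a specified set of columns?

For example, say I am working with data that contains geographical information (city, latitude, and longitude) along with many other fields. I want to keep the rows that contain at least a value for city, or values for both latitude and longitude, but drop the rows that have null values for all three.

I could not find this functionality in the pandas documentation. Any guidance would be appreciated.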

【Question comments】:

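Dude, it's in the documentation. Look at the help for the dropna function.
@GeneBurinsky, no, dropna() won't work properly in this case. Check the row with index 4 in my example: df.dropna(subset=['city','latitude','longitude'], how='all') will not drop it...
@MaxU, that's a fair point. But, at least for your example, this will work: df.dropna(axis=0, subset=['city', 'longitude', 'latitude'], thresh=2). In general, though, you are right: an explicit logical statement of the requirement is better than a dropna solution.
@GeneBurinsky, wow! I completely missed this parameter... Could you write it up as an answer?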

Tags: python pandas

【Solution 1】:

Try this:

In [25]: df
Out[25]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
2  NaN       NaN        NaN  3  4
3  NaN   11.1111    33.3330  1  2
4  NaN       NaN    44.4440  1  1

In [26]: df.query("city == city or (latitude == latitude and longitude == longitude)")
Out[26]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2

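If I understood the OP correctly, the row with index 4 must be dropped, because its two coordinates are not both non-null. So dropna() won't work "properly" in this case: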

In [62]: df.dropna(subset=['city','latitude','longitude'], how='all')
Out[62]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2
4  NaN       NaN    44.4440  1  1   # this row should be dropped...
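
The query in In [26] works because NaN is never equal to itself, so `col == col` is True only where the column holds a value. As a minimal sketch (the frame is rebuilt by hand from the printed sample, and the names `via_query` and `via_mask` are just illustrative), the same condition can also be written with explicit null checks:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt by hand from the printed output in this answer.
df = pd.DataFrame({
    "city": ["aaa", "bbb", np.nan, np.nan, np.nan],
    "latitude": [11.1111, np.nan, np.nan, 11.1111, np.nan],
    "longitude": [np.nan, 22.2222, np.nan, 33.333, 44.444],
    "a": [1, 5, 3, 1, 1],
    "b": [2, 6, 4, 2, 1],
})

# NaN != NaN, so `col == col` acts as a "not null" test inside query().
via_query = df.query(
    "city == city or (latitude == latitude and longitude == longitude)"
)

# The same rule spelled out with notna(): city present, or both coordinates present.
mask = df["city"].notna() | (df["latitude"].notna() & df["longitude"].notna())
via_mask = df[mask]

print(via_query.index.tolist())  # [0, 1, 3]
print(via_mask.index.tolist())   # [0, 1, 3]
```

Both forms keep rows 0, 1 and 3; the explicit mask may be easier to read for people who do not know the self-comparison trick.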

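【Discussion】:

That's right, index 4 needs to be dropped. This seems to be what I was looking for. I didn't know you could use boolean expressions this way in query(). Thanks!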

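【Solution 2】:

dropna has a parameter that applies the test to only a subset of the columns:

dropna(axis=0, how='all', subset=[your three columns in this list])

【Discussion】:

Note that, as MaxU pointed out in the comments, this does not quite work on the sample data: the row with index 4 is kept.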

【Solution 3】:

You can perform the selection by making use of bitwise operators.

## create example data
import pandas as pd
df = pd.DataFrame({'City': ['Gothenburg', None, None], 'Lat': [1, None, 1], 'Long': [None, 1, 1]})

## bitwise/logical operators
~df.City.isnull() | (~df.Lat.isnull() & ~df.Long.isnull())
0     True
1    False
2     True
dtype: bool

## subset using above statement
df[~df.City.isnull() | (~df.Lat.isnull() & ~df.Long.isnull())]
         City  Lat  Long
0  Gothenburg  1.0   NaN
2        None  1.0   1.0
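
If the same rule is needed in several places, the mask can be wrapped in a small function. The following is only a sketch: `has_location` and its parameters are hypothetical names introduced here, not part of pandas, and `notna()` is used in place of `~isnull()`, which gives the same result.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Gothenburg", None, None],
    "Lat": [1, None, 1],
    "Long": [None, 1, 1],
})

def has_location(frame, name_col="City", coord_cols=("Lat", "Long")):
    """Return a boolean Series: True where the name or ALL coordinates are present."""
    has_name = frame[name_col].notna()
    # all(axis=1) requires every coordinate column to be non-null in that row.
    has_coords = frame[list(coord_cols)].notna().all(axis=1)
    return has_name | has_coords

kept = df[has_location(df)]
print(kept.index.tolist())  # [0, 2]
```

Using `all(axis=1)` over a list of columns means more coordinate columns can be added without chaining more `&` terms.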

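【Discussion】:

【Solution 4】:

You can use DataFrame.dropna, but instead of how='all' you combine subset with the thresh parameter, which sets the minimum number of non-NA values a row must have in order to be kept. In the city, long/lat example, thresh=2 works because every row that should be kept has at least two non-null values among the three columns. This holds for the sample data; as noted in the question comments, an explicit logical condition is the more general solution. Using the great data example set up by MaxU, we would do: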

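## get the data (copy the sample table to the clipboard first)
import pandas as pd
df = pd.read_clipboard()

## remove undesired rows: keep rows with at least 2 non-null values in the subset
df.dropna(axis=0, subset=['city', 'longitude', 'latitude'], thresh=2)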

This yields:

In [5]: df.dropna(axis=0, subset=['city', 'longitude', 'latitude'], thresh=2)
Out[5]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2
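
One caveat, hinted at in the question comments, is that counting non-null values is not quite the same rule as "city, or both coordinates". The sketch below uses a made-up row (`'ccc'`, a city with no coordinates, which is not in the original sample) to show where the two approaches may differ:

```python
import numpy as np
import pandas as pd

cols = ["city", "latitude", "longitude"]

# Row 1 ('ccc') is a hypothetical case: a city with no coordinates at all.
df = pd.DataFrame({
    "city": ["aaa", "ccc", np.nan],
    "latitude": [11.1111, np.nan, 11.1111],
    "longitude": [np.nan, np.nan, 33.333],
})

# thresh=2 keeps only rows with at least two non-null values among `cols`.
by_thresh = df.dropna(subset=cols, thresh=2)

# The explicit rule from the question: city present, or both coordinates present.
mask = df["city"].notna() | df[["latitude", "longitude"]].notna().all(axis=1)
by_mask = df[mask]

print(by_thresh.index.tolist())  # [0, 2]
print(by_mask.index.tolist())    # [0, 1, 2]
```

So thresh=2 is a concise fit when every city row also carries at least one coordinate, as in the sample data, while the explicit mask follows the stated requirement in all cases.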

【Discussion】:

Thanks! A clean and concise solution.

【Solution 5】:

Using a boolean mask and a clever dot product (this one is for @Boud)

subset = ['city', 'latitude', 'longitude']
df[df[subset].notnull().dot([2, 1, 1]).ge(2)]

  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2
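
To see why the weights [2, 1, 1] encode the rule, it may help to look at the intermediate score: a present city contributes 2, each present coordinate contributes 1, and a row is kept when the total reaches 2. A minimal sketch, with the sample frame rebuilt by hand and `score` as an illustrative variable name:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt by hand from the printed output used in the answers.
df = pd.DataFrame({
    "city": ["aaa", "bbb", np.nan, np.nan, np.nan],
    "latitude": [11.1111, np.nan, np.nan, 11.1111, np.nan],
    "longitude": [np.nan, 22.2222, np.nan, 33.333, 44.444],
    "a": [1, 5, 3, 1, 1],
    "b": [2, 6, 4, 2, 1],
})

subset = ["city", "latitude", "longitude"]

# Boolean "is present" table times the weights: city counts 2, each coordinate 1.
score = df[subset].notna().dot([2, 1, 1])
print(score.tolist())  # [3, 3, 0, 2, 1]

# A score of at least 2 means: city present, or both coordinates present.
kept = df[score.ge(2)]
print(kept.index.tolist())  # [0, 1, 3]
```

Row 4 scores 1 (longitude only) and is dropped, while a row with only a city would score 2 and be kept.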

【Discussion】: