Filter pandas rows where the first letter in a column is/is not a certain value (answers)

【Question title】: Filter pandas row where 1st letter in a column is/is-not a certain value
【Posted】: 2019-03-06 08:55:45
【Question】:

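How do I filter out the rows of a Series (in a pandas DataFrame) whose first letter is 'Z', or any other character I don't want?

I have the following pandas DataFrame, df (which has > 25,000 rows).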

   TIME_STAMP               Activity    Action  Quantity  EPIC  Price   Sub-activity  Venue
0  2017-08-30 08:00:05.000  Allocation  BUY     50        RRS   77.6    CPTY          066
1  2017-08-30 08:00:05.000  Allocation  BUY     50        RRS   77.6    CPTY          066
3  2017-08-30 08:00:09.000  Allocation  BUY     91        BATS  47.875  CPTY          PXINLN
4  2017-08-30 08:00:10.000  Allocation  BUY     43        PNN   8.07    CPTY          WCAPD
5  2017-08-30 08:00:10.000  Allocation  BUY     270       SGE   6.93    CPTY          PROBDMAD

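I am trying to remove all rows where the first letter of Venue is 'Z'.

For example, my usual filter code looks something like this (filtering out all rows where Venue = '066'):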

df = df[df.Venue != '066']

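I can see that the following line picks out what I need when applied to the column as a list, but I'm not sure how to express it as a DataFrame filter.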

[k for k in df.Venue if 'Z' not in k]

【Comments】:

Tags: python python-3.x pandas dataframe filter

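【Solution 1】:

If you have no NaN values, you can convert the NumPy representation of the series to type '<U1' (which keeps only the first character of each string) and test for equality. The column is named 'A' here, as in the benchmark: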

df1 = df[df['A'].values.astype('<U1') != 'Z']
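A minimal sketch of why this works, using a small made-up frame with a Venue column (the sample values are illustrative and not taken from the question's data). Casting to '<U1' truncates every string to its first character, so the comparison produces a boolean mask. The sketch uses to_numpy(dtype=object) instead of .values, an assumption made here so that a plain NumPy object array is obtained regardless of the column's pandas dtype.

```python
import pandas as pd

# Hypothetical sample data; the venue names are made up for illustration.
df = pd.DataFrame({"Venue": ["066", "ZPXINLN", "WCAPD", "ZEBRA", "PROBDMAD"]})

# Get a plain NumPy object array, then truncate each string to one character.
first_chars = df["Venue"].to_numpy(dtype=object).astype("<U1")

# Boolean NumPy array: True where the row should be kept.
mask = first_chars != "Z"
df1 = df[mask]

print(first_chars.tolist())
print(df1["Venue"].tolist())
```

Note that a missing value would be converted to a string before truncation, which is why this approach is only suggested for columns without NaN.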

Performance benchmarking

from string import ascii_uppercase
from random import choice
import pandas as pd

L = [''.join(choice(ascii_uppercase) for _ in range(10)) for i in range(100000)]
df = pd.DataFrame({'A': L})

%timeit df['A'].values.astype('<U1') != 'Z'       # 4.05 ms per loop
%timeit [x[0] != 'Z' for x in df['A']]            # 11.9 ms per loop
%timeit [not x.startswith('Z') for x in df['A']]  # 23.7 ms per loop
%timeit ~df['A'].str.startswith('Z')              # 53.6 ms per loop
%timeit df['A'].str[0] != 'Z'                     # 53.7 ms per loop
%timeit ~df['A'].str.contains('^Z')               # 127 ms per loop

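【Comments】:

【Solution 2】:

Use str[0] to select the first character, or use startswith, or contains with the regex anchor ^ for the start of the string. To invert the boolean mask, use ~: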

df1 = df[df.Venue.str[0] != 'Z']

df1 = df[~df.Venue.str.startswith('Z')]

df1 = df[~df.Venue.str.contains('^Z')]
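If the column may contain missing values, or if several first letters should be excluded at once, the same string methods can be adapted. The following is a sketch under assumptions: the sample data is made up, and treating missing values as "does not start with Z" (so that those rows are kept) is a choice made here, not something the question specifies.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value.
df = pd.DataFrame({"Venue": ["066", "ZPXINLN", np.nan, "XCAPD", "PROBDMAD"]})

# na=False makes startswith return False for missing values,
# so after inverting with ~ those rows are kept.
not_z = ~df["Venue"].str.startswith("Z", na=False)
df_not_z = df[not_z]

# Excluding several first letters at once: isin on the first character.
# A missing first character is not in the list, so those rows are kept too.
not_zx = ~df["Venue"].str[0].isin(["Z", "X"])
df_not_zx = df[not_zx]

print(df_not_z.index.tolist())
print(df_not_zx.index.tolist())
```

Without na=False, missing values can propagate into the mask, in which case the ~ inversion or the boolean indexing may fail.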

If there are no NaN values, a list comprehension is faster:

df1 = df[[x[0] != 'Z' for x in df.Venue]]

df1 = df[[not x.startswith('Z') for x in df.Venue]]
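The list comprehensions above assume every value is a non-empty string: x[0] fails on NaN (a float) and on an empty string, and x.startswith fails on NaN. A guarded variant might look like the sketch below. The sample data is made up, and keeping rows with missing or empty values is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a missing value and an empty string.
df = pd.DataFrame({"Venue": ["066", "ZPXINLN", np.nan, "", "WCAPD"]})

# Check the type first so non-string values never reach startswith;
# startswith is also safe on the empty string, unlike x[0].
mask = [not (isinstance(x, str) and x.startswith("Z")) for x in df["Venue"]]
df1 = df[mask]

print(df1.index.tolist())
```

The extra isinstance check adds some per-element cost, so the timing advantage reported in the benchmark may not carry over unchanged.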

【Comments】: