How to select rows that do not start with some str in pandas? Answers

【Question title】: How to select rows that do not start with some str in pandas?
【Posted】: 2017-06-01 01:53:10
【Question description】:

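I want to select the rows whose values do not start with certain strings. For example, I have a pandas df and I want to select the data that starts with neither t nor c. In this example, the output should be mext1 and okl1.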

import pandas as pd

df=pd.DataFrame({'col':['text1','mext1','cext1','okl1']})
df

    col
0   text1
1   mext1
2   cext1
3   okl1

I want this:

    col
0   mext1
1   okl1

【Question discussion】:

Tags: python pandas numpy

【Solution 1】:

You can use the apply method.

Taking your question as an example, the code looks like this:

df[df['col'].apply(lambda x: x[0] not in ['t', 'c'])]

I think apply is a more general and flexible approach.
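
One caveat with indexing `x[0]` inside `apply` is that it raises on an empty string or a missing value. The sketch below shows one way to guard against that. The helper name `keep` and the decision to retain empty and missing values are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data extended with an empty string and a missing value (assumed for illustration).
df = pd.DataFrame({'col': ['text1', 'mext1', 'cext1', 'okl1', '', None]})

def keep(value):
    # Hypothetical helper: missing or empty values cannot start with 't' or 'c', so keep them.
    if not isinstance(value, str) or not value:
        return True
    return value[0] not in ('t', 'c')

result = df[df['col'].apply(keep)]
print(result)
```

Whether empty or missing values should be kept or dropped depends on the data; flipping the early `return True` to `return False` drops them instead.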

【Discussion】:

【Solution 2】:

Option 1
Use str.match with a negative lookahead

df[df.col.str.match('^(?![tc])')]
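
To see what the negative lookahead does, here is a small sketch using only the standard `re` module on a plain list (the list and variable names are assumptions for illustration). The pattern consumes no characters; it only asserts that the first character is not `t` or `c`.

```python
import re

# ^ anchors at the start; (?![tc]) succeeds only if the next character is not 't' or 'c'.
pattern = re.compile(r'^(?![tc])')

values = ['text1', 'mext1', 'cext1', 'okl1']
kept = [v for v in values if pattern.match(v)]

# The match is zero-width: it starts and ends at position 0.
span = pattern.match('mext1').span()

print(kept)
print(span)
```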

Option 2
Inside query

df.query('col.str[0] not in list("tc")')

Option 3
numpy broadcasting

df[~(df.col.str[0].to_numpy()[:, None] == ['t', 'c']).any(axis=1)]

         col
1  mext1
3   okl1
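
The broadcasting idea in Option 3 can be illustrated with plain NumPy arrays. This is a minimal sketch; the array names `first` and `excluded` are assumptions for illustration.

```python
import numpy as np

first = np.array(['t', 'm', 'c', 'o'])   # first character of each value
excluded = np.array(['t', 'c'])          # characters to exclude

# first[:, None] has shape (4, 1); comparing it with shape (2,) broadcasts to (4, 2).
hits = first[:, None] == excluded

# A row is kept when it matches none of the excluded characters.
mask = ~hits.any(axis=1)

print(hits.shape)
print(mask)
```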

Timing test

def ted(df):
    return df[~df.col.str.get(0).isin(['t', 'c'])]

def adele(df):
    return df[~df['col'].str.startswith(('t','c'))]

def yohanes(df):
    return df[df.col.str.contains('^[^tc]')]

def pir1(df):
    return df[df.col.str.match('^(?![tc])')]

def pir2(df):
    return df.query('col.str[0] not in list("tc")')

def pir3(df):
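    # Negate the match and return the result so this function does the same work as the others.
    return df[~(df.col.str[0].to_numpy()[:, None] == ['t', 'c']).any(axis=1)]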

functions = pd.Index(['ted', 'adele', 'yohanes', 'pir1', 'pir2', 'pir3'], name='Method')
lengths = pd.Index([10, 100, 1000, 5000, 10000], name='Length')
results = pd.DataFrame(index=lengths, columns=functions)

from string import ascii_lowercase
from timeit import timeit

import matplotlib.pyplot as plt
import numpy as np

for i in lengths:
    a = np.random.choice(list(ascii_lowercase), i)
    df = pd.DataFrame(dict(col=a))
    for j in functions:
        # DataFrame.set_value was removed in pandas 1.0; .at sets a single cell.
        results.at[i, j] = timeit(
            '{}(df)'.format(j),
            'from __main__ import df, {}'.format(j),
            number=1000
        )

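# The results frame was created empty, so it has object dtype; convert to float before plotting.
results = results.astype(float)
fig, axes = plt.subplots(3, 1, figsize=(8, 12))
results.plot(ax=axes[0], title='All Methods')
results.drop(columns='pir2').plot(ax=axes[1], title='Drop `pir2`')
results[['ted', 'adele', 'pir3']].plot(ax=axes[2], title='Just the fast ones')
fig.tight_layout()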
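
A benchmark is only meaningful if the candidates return the same rows, so it may be worth checking that first. The sketch below does so for three of the approaches on the sample frame from the question; the dictionary keys and the `agree` variable are names assumed for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col': ['text1', 'mext1', 'cext1', 'okl1']})

# Three of the approaches from this thread, wrapped so they can be compared uniformly.
candidates = {
    'isin': lambda d: d[~d['col'].str.get(0).isin(['t', 'c'])],
    'startswith': lambda d: d[~d['col'].str.startswith(('t', 'c'))],
    'contains': lambda d: d[d['col'].str.contains('^[^tc]')],
}

# Rows 1 and 3 ('mext1' and 'okl1') are the expected result.
expected = df.iloc[[1, 3]]
agree = {name: func(df).equals(expected) for name, func in candidates.items()}

print(agree)
```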

【Discussion】:

Nice to see the plot output, @piRSquared. Your posts are always fun to read, sir. +1

【Solution 3】:

You can use str.startswith and negate it.

    df[~df['col'].str.startswith('t') & 
       ~df['col'].str.startswith('c')]

    col
1   mext1
3   okl1

Or, the better option per @Ted Petrou, pass the characters together in a tuple:

df[~df['col'].str.startswith(('t','c'))]

    col
1   mext1
3   okl1
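
The tuple form mirrors Python's built-in `str.startswith`, which accepts a string or a tuple of strings but not a list. The sketch below demonstrates this on plain strings; the variable names are assumptions for illustration.

```python
values = ['text1', 'mext1', 'cext1', 'okl1']
prefixes = ('t', 'c')

# A tuple of prefixes is accepted: True if the string starts with any of them.
kept = [v for v in values if not v.startswith(prefixes)]

# A list of prefixes is rejected by the built-in method with a TypeError.
try:
    'text1'.startswith(['t', 'c'])
    list_error = None
except TypeError as exc:
    list_error = str(exc)

print(kept)
print(list_error)
```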

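【Discussion】:

It looks like you can use startswith with a tuple of multiple values, but not with a list. Not sure why a tuple works and a list does not.
Nice, I had tried a list rather than a tuple, so thanks @TedPetrou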

【Solution 4】:

You can use the str accessor to get string functionality. Its get method returns the character at a given index of each string.

df[~df.col.str.get(0).isin(['t', 'c'])]

     col
1  mext1
3   okl1

It looks like you can also use startswith with a tuple (not a list) of the values you want to exclude.

df[~df.col.str.startswith(('t', 'c'))]

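【Discussion】:

This solution scales better. +1
Thanks. What if I want to add a new column to fill in the values? @Ted Petrou
Thanks. Understanding the ~ symbol (bitwise NOT) seems crucial.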
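
As a small illustration of the `~` operator mentioned above: on a boolean pandas Series it flips each element, while on a plain Python integer it is the usual bitwise NOT. The variable names are assumptions for illustration.

```python
import pandas as pd

starts_with_t_or_c = pd.Series([True, False, True, False])

# On a boolean Series, ~ inverts element-wise, turning "matches" into "does not match".
keep = ~starts_with_t_or_c

# On a plain integer, ~ is bitwise NOT: ~x equals -x - 1.
int_example = ~1

print(keep.tolist())
print(int_example)
```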

【Solution 5】:

If you prefer regular expressions, here is just another option:

df[df.col.str.contains('^[^tc]')]
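
One difference worth knowing: the character class `[^tc]` must consume a character, so an empty string does not match, whereas a negative lookahead such as `^(?![tc])` matches the empty string too. The sketch below shows this with the standard `re` module on a plain list that includes an empty string (an assumption added for illustration).

```python
import re

values = ['text1', 'mext1', 'cext1', 'okl1', '']

# [^tc] needs one actual character that is neither 't' nor 'c'.
char_class = [v for v in values if re.search(r'^[^tc]', v)]

# (?![tc]) only asserts what does not come next, so '' passes as well.
lookahead = [v for v in values if re.match(r'^(?![tc])', v)]

print(char_class)
print(lookahead)
```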

【Discussion】:

No lambda is needed here.