Filtering a dataframe on values in a list: answers

【Question title】: Filtering a dataframe on values in a list
【Posted】: 2018-11-02 13:21:18
【Question】:

I have the following dataframe:

I want to keep the rows where claim_status contains 11 and claim_status_reason contains aa1.

I am trying the code below, but it just returns all the rows:

my_list = 'aa1'

df[df['claim_status_reason'].str.contains( "|".join(my_list), regex=True)].reset_index(drop=True)

Expected output:

1.) rows where there is 11 in claim_status
2.) rows where there is aa1 in claim_status_reason
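One likely contributor to the attempt above matching too much: `my_list` is a plain string, so `"|".join(my_list)` joins its characters instead of its items. The sketch below uses only the standard library; the test value `"bb1"` is hypothetical, and it assumes the column holds plain strings, which the question does not confirm.

```python
import re

my_list = 'aa1'                 # a str, exactly as in the question
pattern = "|".join(my_list)     # str.join iterates over the characters
print(pattern)                  # a|a|1

fixed = "|".join(['aa1'])       # a one-item list keeps the token whole
print(fixed)                    # aa1

# Hypothetical cell value that should NOT match 'aa1'.
value = "bb1"
print(bool(re.search(pattern, value)))  # True: the alternative '1' matches
print(bool(re.search(fixed, value)))    # False
```

With the character-level pattern, any value containing an `a` or a `1` matches, which would explain getting nearly every row back.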

【Comments on the question】:

【Solution 1】:

You can use apply to build the filters you want, for example:

# Keep rows whose claim_staus list contains 11 and whose claim_status_reason list contains 'aa1'.
df[(df['claim_staus'].apply(lambda x: 11 in x)) & (df['claim_status_reason'].apply(lambda x: 'aa1' in x))]
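For anyone who wants to try this, here is a minimal runnable sketch of the `apply` approach. The sample rows are made up, and it assumes each cell holds a Python list, which is what the answers imply but the question never shows.

```python
import pandas as pd

# Hypothetical sample data: every cell is a list.
df = pd.DataFrame({
    "claim_staus": [[11, 12], [13], [11]],
    "claim_status_reason": [["aa1", "bb2"], ["aa1"], ["cc3"]],
})

# One boolean per row, computed by a Python-level membership test.
has_11 = df["claim_staus"].apply(lambda codes: 11 in codes)
has_aa1 = df["claim_status_reason"].apply(lambda reasons: "aa1" in reasons)

filtered = df[has_11 & has_aa1].reset_index(drop=True)
print(filtered)
```

Only the first row satisfies both conditions, so `filtered` has a single row.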

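【Comments】:

【Solution 2】:

Don't use string operations on lists in a series. You can use a list comprehension instead. Your choice of data structure is anti-pandas: you should try to avoid putting lists in a series in the first place, because these operations cannot be vectorised.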

import numpy as np

# Build one boolean per row with a membership test on each list.
mask1 = np.array([11 in x for x in df['claim_staus']])
mask2 = np.array(['aa1' in x for x in df['claim_status_reason']])

df = df[mask1 & mask2]
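The same idea can be wrapped in a small reusable function. This is only a sketch: `filter_claims` is a hypothetical helper name, the sample rows are invented, and the column names are copied from the code above.

```python
import numpy as np
import pandas as pd

def filter_claims(frame, status, reason):
    """Hypothetical helper: keep rows whose list cells contain both values."""
    # dtype=bool keeps the masks boolean even when the frame is empty.
    has_status = np.array(
        [status in codes for codes in frame["claim_staus"]], dtype=bool
    )
    has_reason = np.array(
        [reason in reasons for reasons in frame["claim_status_reason"]], dtype=bool
    )
    return frame[has_status & has_reason]

# Hypothetical sample data: every cell is a list.
df = pd.DataFrame({
    "claim_staus": [[11, 12], [13], [11]],
    "claim_status_reason": [["aa1", "bb2"], ["aa1"], ["cc3"]],
})

result = filter_claims(df, 11, "aa1")
print(result.index.tolist())  # [0]
```

Because the masks are plain NumPy arrays, they are applied by position, so the frame's index labels do not matter here.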

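【Comments】:

Nice one @jpp, I read just today that lists in a pandas DataFrame are a bad idea!
@jpp Is this approach better than an element-wise apply?
This is faster than (or, in the worst case, comparable to) apply + lambda, which is just another Python-level loop.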
pd.DataFrame(df['claim_staus'].tolist()).eq('aa1').any()
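@W-B, I wouldn't recommend that with jagged list lengths. You can end up with a large number of NaN values (i.e. it is memory inefficient). Also, it is only superficially vectorised, since the expensive bit (testing strings) can't be vectorised with object dtype.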
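To make the padding concern concrete, here is a small sketch of the expand-to-columns idea on lists of unequal length. The data is invented, and `any(axis=1)` is used here (an assumption about the intent) so that the result is one boolean per row.

```python
import pandas as pd

# Hypothetical lists of unequal ("jagged") length.
reasons = pd.Series([["aa1"], ["bb2", "cc3", "dd4", "aa1"], ["ee5", "ff6"]])

# Expanding pads every shorter list out to the longest length.
expanded = pd.DataFrame(reasons.tolist())
print(expanded.shape)        # (3, 4)

n_missing = int(expanded.isna().sum().sum())
print(n_missing)             # 5 of the 12 cells are padding

# Row-wise mask: does any cell in the row equal 'aa1'?
mask = expanded.eq("aa1").any(axis=1).to_numpy()
print(mask.tolist())         # [True, True, False]
```

Here 5 of the 12 cells are padding; with one very long list among many short ones, the padded share grows accordingly.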