Conditional string matching in pandas: answers

【Question title】: Conditional String matching in pandas
【Posted】: 2018-02-14 23:14:06
【Question description】:

I have the following DataFrame a:

  a=pd.DataFrame([[1,'bayern'],[2,'bayern_leverkusen'],[3,'Chelsea'],
                  [4,'manunited'],[5,'westhamunited'],[6,'mancity']]
                  ,columns=['no','club'])

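I want to compare every value in the club column against every other value in that column, and select only the rows whose value shares four or more consecutive characters with another value.

For example, bayern and bayern_leverkusen should be selected because they share the substring bayern. Likewise, manunited and westhamunited should be selected because they share the substring united.

mancity should not be selected, because its only matching substring, man, is just 3 characters long.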
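The rule described above can be checked for a single pair of names by looking at their longest shared run of characters. The following is only a sketch using the standard library's difflib; the helper name longest_common_substring is made up for illustration and is not part of the original question.

```python
from difflib import SequenceMatcher


def longest_common_substring(s1: str, s2: str) -> str:
    """Return the longest run of consecutive characters shared by s1 and s2."""
    # autojunk=False disables difflib's "popular element" heuristic,
    # which only matters for long sequences but is switched off to be safe
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    match = matcher.find_longest_match(0, len(s1), 0, len(s2))
    return s1[match.a:match.a + match.size]


pairs = [("bayern", "bayern_leverkusen"),
         ("manunited", "westhamunited"),
         ("manunited", "mancity")]
for left, right in pairs:
    common = longest_common_substring(left, right)
    # A pair qualifies only when the shared run has 4 or more characters
    print(left, right, repr(common), len(common) >= 4)
```

The first two pairs share bayern and united (6 characters each) and qualify; manunited and mancity share only man (3 characters) and do not.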

Expected output:

     no    club
 0   1    bayern    
 1   2    bayern_leverkusen
 3   4    manunited
 4   5    westhamunited

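【Question discussion】:

How far did you get with your own attempt?
For a start, I can't work out how to dynamically generate the substrings of length four or more for each club value.
Also, what is your expected output?
Side note: the German football club is bayer_leverkusen, which has nothing to do with Bayern (= Bavaria) :)
Added the expected output to the question.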
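On the point raised in the comments about generating substrings of length four or more: it is enough to generate the substrings of exactly four characters, because any longer shared substring necessarily contains a shared four-character one. A minimal sketch, where the helper name windows is hypothetical:

```python
def windows(text: str, size: int = 4) -> set:
    """Return every substring of text that is exactly size characters long."""
    # Strings shorter than size produce an empty range, hence an empty set
    return {text[i:i + size] for i in range(len(text) - size + 1)}


# All 4-character windows of one club name
print(sorted(windows("manunited")))

# Two names match when their window sets overlap
print(sorted(windows("manunited") & windows("westhamunited")))
print(sorted(windows("manunited") & windows("mancity")))
```

The second print shows a non-empty overlap for manunited and westhamunited, while the third is empty because manunited and mancity share only three consecutive characters.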

Tags: python regex pandas dataframe

【Solution 1】:

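import itertools
import pandas as pd

# Boolean mask over the rows of a (the DataFrame from the question); True marks rows to keep
selector = pd.Series(False, index=a.index)
for first_index, second_index in itertools.combinations(a.index, 2):
    club1 = a.loc[first_index, 'club']
    club2 = a.loc[second_index, 'club']
    # Slide a 4-character window over club1 and look for it inside club2
    # (a 3-character window would also match 'man' and wrongly keep mancity)
    for start in range(len(club1) - 3):
        if club1[start:start+4] in club2:
            selector[first_index] = True
            selector[second_index] = True
            break
new_df = a.loc[selector]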

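【Discussion】:

I see that the result includes mancity. Also, I think you need to replace first with first_index and second with second_index.
Well, if I change the statement if club1[start:start+3] in club2 to if club1[start:start+4] in club2:, I get the desired output. Thank you both.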
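Since the whole difference between the wrong and the right result is the window length, one option is to make that length a parameter. The sketch below is a variation on the answer, not the answerer's own code: the function name select_similar and its parameters are made up for illustration, it assumes the DataFrame index is unique, and the comparison is case-sensitive.

```python
import itertools
import pandas as pd


def select_similar(df, column="club", min_len=4):
    """Keep rows whose value shares at least min_len consecutive characters with another row."""
    # Precompute, per row, the set of substrings of exactly min_len characters
    # (assumes a unique index, since index labels are used as dict keys)
    grams = {
        idx: {value[i:i + min_len] for i in range(len(value) - min_len + 1)}
        for idx, value in df[column].items()
    }
    keep = pd.Series(False, index=df.index)
    for first, second in itertools.combinations(df.index, 2):
        # A shared substring of length >= min_len implies a shared window
        if grams[first] & grams[second]:
            keep[first] = True
            keep[second] = True
    return df.loc[keep]


a = pd.DataFrame([[1, 'bayern'], [2, 'bayern_leverkusen'], [3, 'Chelsea'],
                  [4, 'manunited'], [5, 'westhamunited'], [6, 'mancity']],
                 columns=['no', 'club'])

print(select_similar(a))             # threshold of 4: mancity is left out
print(select_similar(a, min_len=3))  # threshold of 3: mancity is included
```

Building the window sets once per row avoids re-slicing the same string for every pair, which may matter on longer columns; for six rows either version is fine.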