How to apply a regex to each sublist of a list?

【Question title】: How to apply a regex to each sublist of a list?
【Posted】: 2015-05-19 05:08:30
【Question description】:

Suppose I have a list like this:

lis_ = [['"Fun is the enjoyment of pleasure"\t\t',
         '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t','Report by @username  - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware https://t.co/k9sOEpKjbg\t\t'],
        ['I just became the mayor of Porta Romana on @username! http://4sq.com/9QROVv\t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated http://t.co/heyOhpb1\t\t", "@username Don't use my family surname for your app ???? http://t.co/1yYLXIO9\t\t"]
        ]

I want to remove the links from each sublist, so I tried this regex:

new_list = re.sub(r'^https?:\/\/.*[\r\n]*', '', tweets, flags=re.MULTILINE)

I used the MULTILINE flag because when I print list_, it looks like:

[]
[]
[]
...
[]

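The problem with the approach above is that I get a TypeError: expected string or buffer; apparently I can't pass a sublist to the regex like this. How can I apply the regex above to the set of sublists in list_, to get something like the following (i.e. sublists without any kind of link):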

[['"Fun is the enjoyment of pleasure"\t\t',
         '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t','Report by @username  - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware'],
        ['I just became the mayor of Porta Romana on @username! \t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated \t\t", "@username Don't use my family surname for your app ????\t\t"]
        ]

Can this be done with map, or is there another efficient way?

Thanks in advance, everyone.

【Question comments】:

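You should fix your list_ example, because right now it isn't valid Python, which makes it hard to know exactly what it is. I'd guess it's a list containing lists of strings, but we shouldn't have to guess.
What is your expected output?
@AvinashRaj I edited it, thanks everyone for the help!

Tags: python regex list python-2.7 parsing

【Solution 1】:

You need to use \b instead of the start-of-line anchor. ^ matches only at the start of the string (or of each line with MULTILINE), whereas your links appear in the middle of the strings.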

>>> import re
>>> lis_ = [['"Fun is the enjoyment of pleasure"\t\t',
         '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t','Report by @username  - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware https://t.co/k9sOEpKjbg\t\t'],
        ['I just became the mayor of Porta Romana on @username! http://4sq.com/9QROVv\t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated http://t.co/heyOhpb1\t\t", "@username Don't use my family surname for your app ???? http://t.co/1yYLXIO9\t\t"]
        ]
>>> [[re.sub(r'\bhttps?:\/\/.*[\r\n]*', '', i)] for x in lis_ for i in x]
[['"Fun is the enjoyment of pleasure"\t\t'], ['@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t'], ['Report by @username  - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware '], ['I just became the mayor of Porta Romana on @username! '], ["RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated "], ["@username Don't use my family surname for your app ???? "]]

Or, with explicit loops that keep the sublist structure:

>>> l = []
>>> for i in lis_:
        m = []
        for j in i:
            m.append(re.sub(r'\bhttps?:\/\/.*[\r\n]*', '', j))
        l.append(m)


>>> l
[['"Fun is the enjoyment of pleasure"\t\t', '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t', 'Report by @username  - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware '], ['I just became the mayor of Porta Romana on @username! ', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated ", "@username Don't use my family surname for your app ???? "]]
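To see why the anchor matters, here is a minimal sketch. The shortened `tweets` data and the names `anchored` and `url_only` are illustrative assumptions, not part of the original answer. It also swaps `.*` for `\S+`, a variation that removes only the URL itself and so keeps the trailing tabs, which is closer to the output the question asks for.

```python
import re

# Shortened sample data (assumed for illustration).
tweets = [
    ['"Fun is the enjoyment of pleasure"\t\t',
     'Report by @username https://t.co/k9sOEpKjbg\t\t'],
    ['I just became the mayor of Porta Romana on @username! http://4sq.com/9QROVv\t\t'],
]

# '^' (even with re.MULTILINE) only matches at the start of a line,
# so a URL in the middle of a string is left untouched.
anchored = re.compile(r'^https?://\S+', flags=re.MULTILINE)

# '\b' matches at the word boundary right before "http", wherever it occurs.
# '\S+' stops at the first whitespace, so the trailing tabs survive.
url_only = re.compile(r'\bhttps?://\S+')

# Nested comprehensions keep the list-of-lists shape.
unchanged = [[anchored.sub('', s) for s in sub] for sub in tweets]
cleaned = [[url_only.sub('', s) for s in sub] for sub in tweets]

print(unchanged == tweets)  # the '^' pattern changed nothing
print(cleaned)
```

Whether to use `.*` or `\S+` depends on whether the text after a link should be discarded too; the original answer's pattern discards it.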

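【Discussion】:

Thanks for the help. I tried this approach and I got this: [[], []]
It works for me. I used the exact input you provided.
Thanks for the help. On the other hand, what if I get lists like this: ['"Fun is the enjoyment of pleasure"\t\t', ' fanns ett utvik med "a stitch". @username\t\t'] ['I just ! http://4sq.com/9QROVv\t\t', 'Torino on @username! http://4sq.com/9iydG3\t\t'] ... [another sentence in a list]. I mean just a bunch of lists that are not inside a list?
How would you assign a bunch of lists to a single variable?
Of course, they have to be in a list. I was just curious, by the way. Thanks.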

【Solution 2】:

You seem to have a list of lists of strings.

In that case, you just need to iterate over those lists in the right way:

import re

list_ = [['blablablalba', 'blabalbablbla', 'blablala', 'http://t.co/xSnsnlNyq5'], ['blababllba', 'blabalbla', 'blabalbal'],['http://t.co/xScsklNyq5'], ['blablabla', 'http://t.co/xScsnlNyq3']]

def remove_links(sublist):
    # Keep only the strings that contain no link at all
    return [s for s in sublist if not re.search(r'https?:\/\/.*[\r\n]*', s)]

# list() makes the result a list on Python 3 as well (map already returns one on Python 2)
final_list = list(map(remove_links, list_))
# [['blablablalba', 'blabalbablbla', 'blablala'], ['blababllba', 'blabalbla', 'blabalbal'], [], ['blablabla']]

If you want to remove any empty sublists afterwards:

final_final_list = [l for l in final_list if l]
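Note that remove_links above drops every string that contains a link. If the goal is instead to keep each string and delete only the link text, a substitution-based variant might look like the sketch below. `strip_links`, `URL`, and the sample `data` are hypothetical names introduced for illustration, and the pattern assumes a URL ends at the first whitespace character.

```python
import re

# Assumed pattern: a URL runs from "http://" or "https://" to the next whitespace.
URL = re.compile(r'https?://\S+')

def strip_links(sublist):
    # Hypothetical helper: keep every string, delete only the URL text inside it.
    return [URL.sub('', s) for s in sublist]

# Small sample input (assumed for illustration).
data = [['hello', 'see http://t.co/xSnsnlNyq5 now'], ['http://t.co/xScsklNyq5']]

# map applies the helper to each sublist; list() keeps this working on Python 3.
result = list(map(strip_links, data))

# Optionally drop strings that became empty once the link was removed.
non_empty = [[s for s in sub if s.strip()] for sub in result]

print(result)
print(non_empty)
```

With this variant a string such as 'see http://t.co/xSnsnlNyq5 now' survives with only the link removed, while a string that consisted of nothing but a link becomes empty and can be filtered out afterwards.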

【Discussion】:

Thanks for the help. The problem is that each of my sublists holds a string like this: [blablablalba blabalbablbla blablala] instead of ['blablablalba', 'blabalbablbla',' blablala']; in each sublist I have one long comment.
[blablablalba blabalbablbla blablala] is not valid Python code. Could you be clearer?
Sorry for using blabla, I was trying to explain it in a simple way. I edited it.
If you run the code on your new input, it returns [['"Fun is the enjoyment of pleasure"\t\t', '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t']], which seems correct?
I got this: [['"Fun is the enjoyment of pleasure"\t\t', '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t'], []]. In the last sublist it removed everything, instead of just removing the links.