【Question title】: How to apply a regex to each sublist of a list?
【Posted】: 2015-05-19 05:08:30
【Question】:

Suppose I have a list like this:

lis_ = [['"Fun is the enjoyment of pleasure"\t\t',
         '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t','Report by @username  - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware https://t.co/k9sOEpKjbg\t\t'],
        ['I just became the mayor of Porta Romana on @username! http://4sq.com/9QROVv\t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated http://t.co/heyOhpb1\t\t", "@username Don't use my family surname for your app ???? http://t.co/1yYLXIO9\t\t"]
        ]

I want to remove the links from every sublist, so I tried this regex:

new_list = re.sub(r'^https?:\/\/.*[\r\n]*', '', tweets, flags=re.MULTILINE)

I used the MULTILINE flag because when I print list_, it looks like this:

[]
[]
[]
...
[]

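The problem with this approach is that I get TypeError: expected string or buffer; apparently I cannot pass the sublists to the regex like this. How can I apply the regex above to every sublist in list_, so that I get something like the following (i.e. sublists without any kind of link)?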

[['"Fun is the enjoyment of pleasure"\t\t',
         '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t','Report by @username  - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware'],
        ['I just became the mayor of Porta Romana on @username! \t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated \t\t", "@username Don't use my family surname for your app ????\t\t"]
        ]

Can this be done with map, or is there another efficient way?

Thanks in advance, everyone.
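
The TypeError described above comes from handing re.sub a list where it expects a single string. The following is a minimal sketch of that behaviour; the pattern and the names `sublist`, `error_message`, and `cleaned` are illustrative assumptions, not taken from the question.

```python
import re

# Illustrative pattern: a scheme followed by any run of non-whitespace characters.
pattern = r'https?://\S+'
sublist = ['see http://4sq.com/9QROVv\t\t', 'no link here\t\t']

# Passing the list itself fails, because re.sub works on one string at a time.
error_message = None
try:
    re.sub(pattern, '', sublist)
except TypeError as exc:
    error_message = str(exc)

# Passing each string in turn works.
cleaned = [re.sub(pattern, '', s) for s in sublist]
```

The exact wording of the error differs between Python versions, but the exception type is TypeError in both Python 2.7 and Python 3.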

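【Comments】:

  • You should fix your list_ example, because right now it is not valid Python, so it is hard to know exactly what it is. I am guessing it is a list containing lists of strings, but we should not have to guess.
  • What is your expected output?
  • @AvinashRaj I have edited the question. Thanks, everyone, for the help!

Tags: python regex list python-2.7 parsing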


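【Solution 1】:

You need to use \b instead of the start-of-line anchor (^), because the links appear in the middle of the strings rather than at the start of a line.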
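
To see why the anchor matters, here is a small sketch comparing the two patterns on a single string. The variable names `tweet`, `anchored`, and `bounded` are hypothetical and chosen only for this illustration.

```python
import re

tweet = 'evacuated http://t.co/heyOhpb1\t\t'

# '^' only matches at the start of the string (or of a line with re.MULTILINE),
# so a link in the middle of the text is left untouched.
anchored = re.sub(r'^https?://.*', '', tweet, flags=re.MULTILINE)

# '\b' matches at any word boundary, so the link is found wherever it starts.
# Note that '.*' also consumes the trailing tabs after the link.
bounded = re.sub(r'\bhttps?://.*', '', tweet)
```

With the anchored pattern the string comes back unchanged, while the \b pattern removes the link and everything after it up to the end of the line.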

>>> lis_ = [['"Fun is the enjoyment of pleasure"\t\t',
         '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t','Report by @username  - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware https://t.co/k9sOEpKjbg\t\t'],
        ['I just became the mayor of Porta Romana on @username! http://4sq.com/9QROVv\t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated http://t.co/heyOhpb1\t\t", "@username Don't use my family surname for your app ???? http://t.co/1yYLXIO9\t\t"]
        ]
>>> [[re.sub(r'\bhttps?:\/\/.*[\r\n]*', '', i)] for x in lis_ for i in x]
[['"Fun is the enjoyment of pleasure"\t\t'], ['@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t'], ['Report by @username  - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware '], ['I just became the mayor of Porta Romana on @username! '], ["RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated "], ["@username Don't use my family surname for your app ???? "]]

>>> l = []
>>> for i in lis_:
        m = []
        for j in i:
            m.append(re.sub(r'\bhttps?:\/\/.*[\r\n]*', '', j))
        l.append(m)


>>> l
[['"Fun is the enjoyment of pleasure"\t\t', '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t', 'Report by @username  - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware '], ['I just became the mayor of Porta Romana on @username! ', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated ", "@username Don't use my family surname for your app ???? "]]
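
The loop above can also be written as a nested list comprehension that keeps the sublist structure. The sketch below is one possible variant, not the answerer's code: it assumes a link ends at the first whitespace character, so it uses `\S+` instead of `.*`, which leaves the trailing tabs in place as in the question's expected output. The names `LINK`, `strip_links`, and `sample` are hypothetical.

```python
import re

# Assumption: a link is a scheme followed by non-whitespace characters only.
LINK = re.compile(r'https?://\S+')

def strip_links(nested):
    """Return a new list of lists with links removed from every string."""
    return [[LINK.sub('', s) for s in sub] for sub in nested]

sample = [['"Fun is the enjoyment of pleasure"\t\t'],
          ['mayor of Porta Romana on @username! http://4sq.com/9QROVv\t\t']]
result = strip_links(sample)
```

Compiling the pattern once is optional; it mainly avoids repeating the pattern literal when many strings are processed.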

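【Comments】:

  • Thanks for your help. I tried this approach and got: [[], []]
  • It works for me. I used the exact input you provided.
  • Thanks for the help. On the other hand, what if I have lists like this: ['"Fun is the enjoyment of pleasure"\t\t', ' fanns ett utvik med "a stitch". @username\t\t'] ['I just ! http://4sq.com/9QROVv\t\t', 'Torino on @username! http://4sq.com/9iydG3\t\t'] ... [another sentence in a list]? I mean just a bunch of lists that are not inside a list.
  • How would you assign a bunch of lists to a single variable?
  • Of course, they have to be inside a list. I was just curious, thanks.

【Solution 2】: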

It looks like you have a list of lists of strings.

In that case, you just need to iterate over those lists in the right way:

import re

list_ = [['blablablalba', 'blabalbablbla', 'blablala', 'http://t.co/xSnsnlNyq5'], ['blababllba', 'blabalbla', 'blabalbal'],['http://t.co/xScsklNyq5'], ['blablabla', 'http://t.co/xScsnlNyq3']]

def remove_links(sublist):
    # Keep only the strings that contain no link at all.
    return [s for s in sublist if not re.search(r'https?:\/\/.*[\r\n]*', s)]

# list() makes the result a list on Python 3 as well, where map is lazy.
final_list = list(map(remove_links, list_))
# [['blablablalba', 'blabalbablbla', 'blablala'], ['blababllba', 'blabalbla', 'blabalbal'], [], ['blablabla']]

If you want to remove any empty sublists afterwards:

final_final_list = [l for l in final_list if l]
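
Note that remove_links discards every string that contains a link, rather than cutting the link out of the string. The sketch below contrasts the two behaviours with map; it is an illustration under the assumption that a link ends at the first whitespace character, and the names `LINK`, `drop_strings_with_links`, `strip_links_in_strings`, and `data` are hypothetical.

```python
import re

# Assumption: a link is a scheme followed by non-whitespace characters only.
LINK = re.compile(r'https?://\S+')

def drop_strings_with_links(sublist):
    # Filtering: a string containing a link is removed entirely.
    return [s for s in sublist if not LINK.search(s)]

def strip_links_in_strings(sublist):
    # Substituting: only the link itself is removed from each string.
    return [LINK.sub('', s) for s in sublist]

data = [['plain text', 'text http://t.co/xSnsnlNyq5'], ['http://t.co/xScsklNyq5']]

dropped = list(map(drop_strings_with_links, data))
stripped = list(map(strip_links_in_strings, data))
```

Which of the two is appropriate depends on whether the strings are bare links, as in the example input of this answer, or full sentences that happen to contain a link, as in the question.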

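【Comments】:

  • Thanks for your help. The problem is that each of my sublists holds strings like this: [blablablalba blabalbablbla blablala] rather than ['blablablalba', 'blabalbablbla',' blablala']; each sublist contains one large comment.
  • [blablablalba blabalbablbla blablala] is not valid Python code. Could you be a bit clearer?
  • Sorry for using blabla, I was trying to explain it in a simple way. I have edited the question.
  • If you run the code on the new input, it returns [['"Fun is the enjoyment of pleasure"\t\t', '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t']], which seems correct?
  • I got this: [['"Fun is the enjoyment of pleasure"\t\t', '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t'], []]. In the last sublist it removed everything, instead of removing only the links.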