Answers: Python script to filter a list of strings based on ending

【Question title】: Python script to filter a list of strings based on ending
【Posted】: 2013-12-11 06:46:49
【Question】:

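I don't know any Python, but I need to customize a script a little. The script parses some strings and puts them into a list (I guess). These strings are then filtered based on whether they start with "http". What I want to add is a filter based on the file extension as well: the filter should keep only links that end in html or xml.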

This is the code that filters all hyperlinks:

links = filter (lambda x:x.startswith("http://") , links)

I don't know the proper syntax to put an OR operator for something like .endswith(".html") OR .endswith(".xml").

I know this would filter all links ending in .html, but I also need the .xml links.

links = filter (lambda x:x.startswith("http://") , links) 
links = filter (lambda x:x.endswith(".html") , links)

【Question comments】:

Strings that begin with an optional protocol specification are not hyperlinks; they are URLs.

Tags: python string filter

【Solution 1】:

If you are using Python 2.5 or later, you can pass a tuple of suffixes to endswith. Thanks to @hcwhsa for pointing this out:

links = filter(lambda x:x.endswith((".html", ".xml")), links)
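
As a minimal sketch of how the tuple form behaves, the snippet below runs it on a few made-up links (the sample URLs are assumptions, not taken from the original script). The list() call is only there because filter returns a lazy iterator on Python 3.

```python
# Hypothetical sample data, not from the original script.
links = [
    "http://example.com/index.html",
    "http://example.com/feed.xml",
    "http://example.com/logo.png",
    "ftp://example.com/archive.html",
]

# endswith accepts a tuple of suffixes and returns True if any one matches.
# list() materializes the result; on Python 3, filter returns an iterator.
matching = list(filter(lambda x: x.endswith((".html", ".xml")), links))
print(matching)
```

Note that the suffixes must be passed as a tuple; passing a list raises a TypeError.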

If you are using an earlier version, you can use the or operator:

links = filter(lambda x:x.endswith(".html") or x.endswith(".xml"), links)

If you are not sure whether x is already lowercase, you may want to lowercase it before comparing.
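
One possible way to do the case-insensitive check is to lowercase each string only for the comparison, so the original spelling of the link is preserved in the result. The sample links and the name suffixes are illustrative assumptions.

```python
# Hypothetical sample data with mixed-case extensions.
links = [
    "http://example.com/INDEX.HTML",
    "http://example.com/feed.Xml",
    "http://example.com/logo.PNG",
]

suffixes = (".html", ".xml")

# lower() is applied only inside the test, so the kept links are unchanged.
matching = [link for link in links if link.lower().endswith(suffixes)]
print(matching)
```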

I would probably use a list comprehension rather than filter for this, and I would definitely not call filter twice in a row:

links = [link for link in links if link.startswith('http://') and link.endswith(('.html', '.xml'))]
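
If the filtered links are only going to be iterated once, the same condition can also be written as a generator expression, which produces matches lazily instead of building a list up front. This is a sketch on made-up data; the names wanted and result are assumptions.

```python
# Hypothetical sample data for illustration.
links = [
    "http://example.com/index.html",
    "http://example.com/feed.xml",
    "http://example.com/logo.png",
    "ftp://example.com/archive.html",
]

# Parentheses instead of square brackets give a generator expression:
# nothing is filtered until the result is iterated.
wanted = (
    link for link in links
    if link.startswith("http://") and link.endswith((".html", ".xml"))
)

# Consuming the generator runs both checks in a single pass.
result = list(wanted)
print(result)
```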

【Comments】:

【Solution 2】:

I think the best way is to check this with a regular expression:

>>> import re
>>> c = r"^http://.+\.(html|xml)$"
>>> re.match(c, 'hello')
>>> re.match(c, 'http://data.com/word.html')
<_sre.SRE_Match object at 0x1d2a100>

So the answer is:

import re
regex = r"^http://.+\.(html|xml)$"
links = filter(lambda x: re.match(regex, x), links)
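
A short sketch of why the end anchor matters for this approach: re.match only anchors at the start of the string, so without a trailing "$" a link such as word.html.bak would still match. The sample links and the names anchored, unanchored, and matching are assumptions for illustration.

```python
import re

# Assumption: "$" requires the extension to sit at the very end of the string.
anchored = re.compile(r"^http://.+\.(html|xml)$")
# The same pattern without "$", shown only for comparison.
unanchored = re.compile(r"^http://.+\.(html|xml)")

# Hypothetical sample data.
links = [
    "http://data.com/word.html",
    "http://data.com/feed.xml",
    "http://data.com/word.html.bak",
    "hello",
]

# Compiling once avoids re-parsing the pattern for every link.
matching = [link for link in links if anchored.match(link)]
print(matching)

# The unanchored pattern also accepts the .bak link.
loose = [link for link in links if unanchored.match(link)]
print(loose)
```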

【Comments】:

【Solution 3】:

links = list(filter(lambda x: x.endswith(".html"), links))

【Comments】:

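While this code may answer the question, providing additional context about why and/or how it answers the question improves its long-term value.
The code in revisions 1 to 3 does not answer the question's request for the "proper syntax to put an OR operator for something like .endswith(".html") OR .endswith(".xml")". It is very close to a line already given in the question.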