【Posted】: 2021-07-27 00:11:27
【Problem description】:
Suppose you have the following list of HTML strings:
['Welcome: <br>Email: maxdenhil.com<br>Bedrijfsnaam: Dternational<br>KvK-nummer (8-cijfers): 88888888<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:42 pm<br>Page URL: https://yourubl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n', 'Welcome: <br>Email: maxdeil.com<br>Bedrijfsnaam: dd<br>KvK-nummer (8-cijfers): 9999999<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:49 pm<br>Page URL: https://yl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n']
I want to query this list so that the output looks like this:
https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf
https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf
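That way I can visit these URLs and download the files by iterating over the links.
To do this, I wrote the following regular expression and code: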
import re
r = re.compile(r'((?<=uploaden:\s).+)')
newlist = list(filter(r.match, mylist)) # Note 1
print(newlist)
However, this returns nothing (I think because the list contains HTML):
[]
When I change the regular expression to .*, every item matches. How is that possible?
So my question is: how do I create a new list of strings from the HTML?
【Discussion】:
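A likely explanation, offered as a sketch rather than a definitive answer: `re.match` only attempts a match at the start of each string, and the lookbehind `(?<=uploaden:\s)` cannot succeed at position 0 because nothing precedes it. `.*` matches at position 0 of any string, which is why every item passes the filter. The HTML itself is probably not the cause. The example below uses `search` with a capture group instead; the shortened sample strings and the names `url_pattern` and `urls` are assumptions made for illustration.

```python
import re

# Shortened stand-ins for the two entries in the question (assumed sample data).
mylist = [
    "Welcome: <br>Email: a.com<br>Factuur uploaden: <br>"
    "https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf"
    "<br><br>---<br>Page URL: https://yourubl.nl/Converter/<br>\r\n\r\n",
    "Welcome: <br>Email: b.com<br>Factuur uploaden: <br>"
    "https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf"
    "<br><br>---<br>Page URL: https://yl.nl/Converter/<br>\r\n\r\n",
]

# Pattern from the question: match() tries position 0 only, where the
# lookbehind has nothing to look back at, so no string passes the filter.
lookbehind = re.compile(r"(?<=uploaden:\s).+")
anchored = [s for s in mylist if lookbehind.match(s)]

# ".*" can match (even an empty string) at position 0, so everything passes.
everything = [s for s in mylist if re.match(r".*", s)]

# Instead, scan the whole string with search() and capture only the URL
# that follows "uploaden:" and the <br> tag, stopping at "<" or whitespace.
url_pattern = re.compile(r"uploaden:\s*<br>(https?://[^<\s]+)")
urls = []
for s in mylist:
    found = url_pattern.search(s)
    if found:
        urls.append(found.group(1))

print(anchored)
print(urls)
```

Because the capture group stops at the next `<`, the trailing `<br>` tags and the later `Page URL` line are left out of the result. For less regular HTML, a real HTML parser may be a safer choice than a regular expression.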
Tags: python html regex pandas list