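【Posted】: 2015-09-09 14:05:49
【Question】:
I have this code (probably very inefficient, but that's another matter) that extracts URLs from a blog's HTML. The HTML is stored in a .csv file; I load it into Python and then run a regular expression over it to get the URLs. Here is the code: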
import csv, re # required imports
infile = open('Book1.csv', 'rt') # open the csv file
reader = csv.reader(infile) # read the csv file
strings = [] # initialize a list to read the rows into
for row in reader: # loop over all the rows in the csv file
    strings += row # put them into the list
link_list = [] # initialize list that all the links will be put in
for i in strings: # loop over the list to access each string for regex (can't regex on lists)
    links = re.search(r'((https?|ftp)://|www\.)[^\s/$.?#].[^\s]*', i) # regex to find the links
    if links != None: # if it finds a link..
        link_list.append(links) # put it into the list!
for link in link_list: # iterate the links over a loop so we can have them in a nice column format
    print(link)
However, when I print the results, they come out in this form:
<_sre.SRE_Match object; span=(49, 80), match='http://buy.tableausoftware.com"'>
<_sre.SRE_Match object; span=(29, 115), match='https://c.velaro.com/visitor/requestchat.aspx?sit>
<_sre.SRE_Match object; span=(34, 117), match='https://www.tableau.com/about/blog/2015/6/become->
<_sre.SRE_Match object; span=(32, 115), match='https://www.tableau.com/about/blog/2015/6/become->
<_sre.SRE_Match object; span=(76, 166), match='https://www.tableau.com/about/blog/2015/6/become->
<_sre.SRE_Match object; span=(9, 34), match='http://twitter.com/share"'>
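Is there a way to pull just the link out of the rest of the junk wrapped around it? Also, is this simply how a regex search behaves? Thanks!
【Discussion】: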
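One possible way to look at it: `re.search` returns a match object rather than a string, so `print(link)` shows the object's repr. Calling `.group(0)` on the match gives the matched text itself. The sketch below is not the asker's code; it uses a few made-up HTML strings in place of `Book1.csv`, and the helper names `extract_first` and `extract_all` are hypothetical. It also assumes you do not want the trailing `"` that shows up in the output above, so the tail of the pattern excludes quotes and angle brackets. The groups are made non-capturing so that `findall` returns whole URLs instead of tuples of groups.

```python
import re

# Pattern from the question, with two assumed tweaks:
#  - non-capturing groups (?:...) so findall() returns the full match
#  - the tail excludes quotes and angle brackets so a closing `"` is not captured
URL_RE = re.compile(r'(?:(?:https?|ftp)://|www\.)[^\s/$.?#].[^\s"\'<>]*')

# Made-up sample rows standing in for the cells read from the CSV file.
rows = [
    '<a href="http://buy.tableausoftware.com">Buy</a>',
    '<p>no link here</p>',
    '<a href="http://twitter.com/share">Share</a> '
    '<a href="https://www.tableau.com/about/blog">Blog</a>',
]

def extract_first(text):
    """Return the first URL in text as a string, or None if there is none."""
    match = URL_RE.search(text)
    return match.group(0) if match is not None else None

def extract_all(text):
    """Return every URL in text as a list of strings."""
    return URL_RE.findall(text)

for row in rows:
    for url in extract_all(row):
        print(url)
```

If only the first link per cell is needed, `extract_first` mirrors the original `re.search` loop; `extract_all` also picks up cells that contain more than one link.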