【发布时间】:2021-08-09 07:34:53
【问题描述】:
我正在尝试从here 中抓取文章的日期和网址。虽然我确实获得了日期列表和文章的标题(以文本形式),但我无法获得相同的网址。 这就是我在文本和日期中获取 url 标题的方式。
def sb_rum():
websites = ['https://www.thespiritsbusiness.com/tag/rum/']
for spirits in websites:
browser.get(spirits)
time.sleep(1)
news_links = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/h3')
n_links = [ele.text for ele in news_links]
dates = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/small')
n_dates = [ele.text for ele in dates]
print(n_links)
print(n_dates)
这给了我这样的输出
['Harpalion Spirits expands UK distribution', 'Bacardí gets fruity with new tropical rum', 'The world’s biggest-selling rums', 'Havana Club releases Tributo 2021 rum', 'Ron Santiago de Cu
ba rum revamps range', 'Michael B Jordan to change rum name after backlash', 'WIRD recognised for sustainable sugarcane practices', 'Rockstar Spirits advocates for UK-Australia trade deal
', 'Rum Brand Champion 2021: Tanduay', 'Dictador and Niepoort partner on new rum', 'Rockstar Spirits secures £25,000 Dragons’ Den funding', 'SB meets… Lucia Alliegro, Ron Carúpano', 'Brun
o Mars debuts Selvarey Coconut rum', 'Diplomático launches Mixed Consciously cocktail comp', 'Foursquare Distillery backs rum history research', 'Ron Cabezon signs distribution with Gordo
n & MacPhail', 'Havana Club launches smoky rum finished in whisky casks', 'Ron Colón and Bacoo Rum expand distribution', 'Harpalion Spirits launches Pedro Ximénez cask-finished rum', 'Rum
’s journey to premiumisation']
['July 13th, 2021', 'July 8th, 2021', 'July 6th, 2021', 'June 30th, 2021', 'June 29th, 2021', 'June 24th, 2021', 'June 21st, 2021', 'June 21st, 2021', 'June 21st, 2021', 'June 18th, 2021'
, 'June 11th, 2021', 'June 7th, 2021', 'June 4th, 2021', 'June 2nd, 2021', 'May 28th, 2021', 'May 28th, 2021', 'May 26th, 2021', 'May 26th, 2021', 'May 24th, 2021', 'May 20th, 2021']
但我只是想获取相同的 url 链接。例如,我可以提取一个链接,但我无法提取所有链接。 为了获取所有链接,我尝试了类似
n_links = [ele.get_attribute('href') for ele in news_links.find_elements_by_tag_name('a')]
怎么做?请帮忙。
【问题讨论】:
-
可以使用beautifulsoup解析html。内置的 selenium 解析器很慢并且会带来奇怪的问题。
标签: python selenium selenium-webdriver web-scraping