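【Question title】: Unable to scrape multiple URLs from a website using Selenium in Python
【Posted】: 2021-08-09 07:34:53
【Question】:

I am trying to scrape the dates and URLs of articles from a website (the listing page used in the code below). I do get the list of dates and the article titles (as text), but I cannot get the URLs that go with them. This is how I get the titles and the dates as text: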

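import time
from selenium import webdriver

# Assumption: the original post does not show how `browser` was created.
browser = webdriver.Chrome()

def sb_rum():
    websites = ['https://www.thespiritsbusiness.com/tag/rum/']
    for spirits in websites:
        browser.get(spirits)
        time.sleep(1)  # give the page a moment to load

        # Each <h3> holds an article title; each <small> holds its date.
        news_links = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/h3')
        n_links = [ele.text for ele in news_links]
        dates = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/small')
        n_dates = [ele.text for ele in dates]
        print(n_links)
        print(n_dates)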

This gives me output like this:

['Harpalion Spirits expands UK distribution', 'Bacardí gets fruity with new tropical rum', 'The world’s biggest-selling rums', 'Havana Club releases Tributo 2021 rum', 'Ron Santiago de Cuba rum revamps range', 'Michael B Jordan to change rum name after backlash', 'WIRD recognised for sustainable sugarcane practices', 'Rockstar Spirits advocates for UK-Australia trade deal', 'Rum Brand Champion 2021: Tanduay', 'Dictador and Niepoort partner on new rum', 'Rockstar Spirits secures £25,000 Dragons’ Den funding', 'SB meets… Lucia Alliegro, Ron Carúpano', 'Bruno Mars debuts Selvarey Coconut rum', 'Diplomático launches Mixed Consciously cocktail comp', 'Foursquare Distillery backs rum history research', 'Ron Cabezon signs distribution with Gordon & MacPhail', 'Havana Club launches smoky rum finished in whisky casks', 'Ron Colón and Bacoo Rum expand distribution', 'Harpalion Spirits launches Pedro Ximénez cask-finished rum', 'Rum’s journey to premiumisation']
['July 13th, 2021', 'July 8th, 2021', 'July 6th, 2021', 'June 30th, 2021', 'June 29th, 2021', 'June 24th, 2021', 'June 21st, 2021', 'June 21st, 2021', 'June 21st, 2021', 'June 18th, 2021', 'June 11th, 2021', 'June 7th, 2021', 'June 4th, 2021', 'June 2nd, 2021', 'May 28th, 2021', 'May 28th, 2021', 'May 26th, 2021', 'May 26th, 2021', 'May 24th, 2021', 'May 20th, 2021']

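But I also want the URL of each of these articles. I can extract a single link, but I cannot extract all of them. To get all the links, I tried something like: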

n_links = [ele.get_attribute('href') for ele in news_links.find_elements_by_tag_name('a')]

How can I do this? Please help.

【Comments】:

  • You can parse the HTML with BeautifulSoup. Selenium's built-in parser is slow and can cause odd problems.

Tags: python selenium selenium-webdriver web-scraping


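【Solution 1】:

A working solution. `news_links` is a list of elements, so it has no `find_elements_by_tag_name` method of its own; look up the `a` tag on each element instead: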

n_links  = [ele.find_element_by_tag_name('a').get_attribute('href') for ele in news_links]

【Comments】:

    【Solution 2】:

    I don't think you need Selenium to scrape this page. I used BeautifulSoup to get the data you need.

    Here is the code:

    import bs4 as bs
    import requests

    url = 'https://www.thespiritsbusiness.com/tag/rum/'
    resp = requests.get(url)
    soup = bs.BeautifulSoup(resp.text, 'lxml')  # the 'lxml' parser needs the lxml package installed
    # Each article on the listing page sits in a <div class="archiveEntry">.
    divs = soup.find_all('div', class_='archiveEntry')
    urls = []
    titles = []
    dates = []
    for entry in divs:
        urls.append(entry.find('a')['href'].strip())
        titles.append(entry.find('h3').text.strip())
        dates.append(entry.find('small').text.strip())
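
The loop above assumes every entry contains a link, a heading, and a date; if one is missing, `find` returns `None` and the lookup raises an exception. A minimal sketch of a more defensive variant follows. It runs on an inline HTML snippet so it needs no network access. The snippet is hypothetical markup that only mirrors the tag and class names used in this answer, and `parse_entries` is an illustrative name, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the tag and class names used in the answer.
SAMPLE_HTML = """
<div id="archivewrapper">
  <div class="archiveEntry">
    <h3><a href="https://example.com/first-article/">First article</a></h3>
    <small>July 13th, 2021</small>
  </div>
  <div class="archiveEntry">
    <h3>Entry without a link</h3>
  </div>
</div>
"""


def parse_entries(html):
    """Return one dict per archiveEntry div, with None for any missing field."""
    # 'html.parser' ships with Python, so no extra parser package is required.
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for entry in soup.find_all("div", class_="archiveEntry"):
        link = entry.find("a")
        title = entry.find("h3")
        date = entry.find("small")
        has_href = link is not None and link.has_attr("href")
        records.append({
            "url": link["href"].strip() if has_href else None,
            "title": title.get_text(strip=True) if title is not None else None,
            "date": date.get_text(strip=True) if date is not None else None,
        })
    return records


records = parse_entries(SAMPLE_HTML)
```

To use it on the real page, pass `resp.text` from the request above instead of `SAMPLE_HTML`. Collecting one record per entry, rather than three parallel lists, keeps each URL matched to its own title and date even when an entry is incomplete.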
    

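    【Comments】:

    • Thanks a lot, Ram, for the quick help. I only used Selenium because I want to scrape similar websites, although I mentioned just one here. Your solution works well, though!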