【Question title】: How to scrape a page with BeautifulSoup and Python?
【Posted】: 2015-06-07 22:23:12
【Question】:

I'm trying to extract information from the BBC Good Food website, but I'm having trouble narrowing down the data I collect.

Here is what I have so far:

from bs4 import BeautifulSoup
import requests

webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato')
soup = BeautifulSoup(webpage.content)
links = soup.find_all("a")

for anchor in links:
    print(anchor.get('href')), anchor.text

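This returns every link on the page along with its link text, but I only want the links inside the page's "article" elements. Those are the links to the individual recipes.

With some experimenting I managed to return the text from the articles, but I can't seem to extract the links.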

【Comments】:

    Tags: python python-2.7 web-scraping


    【Answer 1】:

    The only two things I can see attached to the article tags are the href and the img src:

    from bs4 import BeautifulSoup
    import requests
    
    webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato')
    soup = BeautifulSoup(webpage.content)
    links = soup.find_all("article")
    
    for ele in links:
        print(ele.a["href"])
        print(ele.img["src"])
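
The loop above raises a TypeError if any article lacks a link or an image, because `ele.a` or `ele.img` is then None. Below is a minimal sketch of a guarded variant. The sample markup and the helper name `article_links` are assumptions for illustration, not taken from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the search results page.
SAMPLE_HTML = """
<article><h2 class="node-title"><a href="/recipes/2075/tomato-soup">Tomato soup</a></h2>
<img src="/img/soup.jpg"></article>
<article><p>Advert without a link</p></article>
"""

def article_links(html):
    """Return (href, img_src) pairs for every <article> that contains a link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for article in soup.find_all("article"):
        anchor = article.find("a", href=True)  # None if the article has no link
        if anchor is None:
            continue  # skip articles that are not recipe cards
        image = article.find("img", src=True)  # the image is optional
        results.append((anchor["href"], image["src"] if image else None))
    return results

print(article_links(SAMPLE_HTML))
```

Articles without an anchor are skipped, and a missing image yields None in place of the src.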
    

    The links are in the elements with class="node-title":

    from bs4 import BeautifulSoup
    import requests
    
    webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato')
    soup = BeautifulSoup(webpage.content)
    
    
    links = soup.find("div",{"class":"main row grid-padding"}).find_all("h2",{"class":"node-title"})
    
    for l in links:
        print(l.a["href"])
    
    /recipes/681646/tomato-tart
    /recipes/4468/stuffed-tomatoes
    /recipes/1641/charred-tomatoes
    /recipes/tomato-confit
    /recipes/1575635/roast-tomatoes
    /recipes/2536638/tomato-passata
    /recipes/2518/cherry-tomatoes
    /recipes/681653/stuffed-tomatoes
    /recipes/2852676/tomato-sauce
    /recipes/2075/tomato-soup
    /recipes/339605/tomato-sauce
    /recipes/2130/essence-of-tomatoes-
    /recipes/2942/tomato-tarts
    /recipes/741638/fried-green-tomatoes-with-ripe-tomato-salsa
    /recipes/3509/honey-and-thyme-tomatoes
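
The same lookup can probably be written as one CSS selector with `soup.select`. The sketch below runs against made-up sample markup that mirrors the class names used above; the real page may differ. The helper name `recipe_paths` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the class names from the answer above.
SAMPLE_HTML = """
<div class="main row grid-padding">
  <h2 class="node-title"><a href="/recipes/681646/tomato-tart">Tomato tart</a></h2>
  <h2 class="node-title"><a href="/recipes/2075/tomato-soup">Tomato soup</a></h2>
</div>
<h2 class="node-title"><a href="/elsewhere">Outside the results grid</a></h2>
"""

def recipe_paths(html):
    """Return the href of every node-title link inside the results grid."""
    soup = BeautifulSoup(html, "html.parser")
    # Chained class selectors match a div that carries all three classes.
    selector = "div.main.row.grid-padding h2.node-title a[href]"
    return [a["href"] for a in soup.select(selector)]

print(recipe_paths(SAMPLE_HTML))
```

Unlike the dictionary form, which compares against the exact string "main row grid-padding", the chained selector does not depend on the order of the classes in the attribute.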
    

    To fetch the pages, prepend http://www.bbcgoodfood.com to each path:

    for l in links:
        print(requests.get("http://www.bbcgoodfood.com{}".format(l.a["href"])).status_code)
    200
    200
    200
    200
    200
    200
    200
    200
    200
    200
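
As an alternative to string formatting, the standard library's `urljoin` can build the absolute URLs, and it also copes with hrefs that are already absolute. This is a small sketch; the helper name `absolute_urls` is an assumption, and the paths are taken from the output listed earlier.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.bbcgoodfood.com"

def absolute_urls(paths, base=BASE_URL):
    """Resolve each scraped href against the site's base URL."""
    return [urljoin(base, path) for path in paths]

# Two of the relative paths printed by the scraper above.
for url in absolute_urls(["/recipes/2075/tomato-soup", "/recipes/tomato-confit"]):
    print(url)
```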
    

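    【Comments】:

      【Answer 2】:

      The structure of the BBC Good Food pages has since changed.

      I managed to adapt the code as follows; it is not perfect, but it is something to build on: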

      import numpy as np
      import requests
      from bs4 import BeautifulSoup
      
      # Collect recipe URLs for each ingredient across result pages 1 to 9
      listofurls = []
      pages = np.arange(1, 10, 1)
      ingredientlist = ['milk', 'eggs', 'flour']
      for ingredient in ingredientlist:
          for page in pages:
              # Use a separate name so the loop variable is not overwritten
              response = requests.get('https://www.bbcgoodfood.com/search/recipes/page/' + str(page) + '/?q=' + ingredient + '&sort=-relevance')
              soup = BeautifulSoup(response.content, 'html.parser')
              for link in soup.find_all(class_="standard-card-new__article-title"):
                  listofurls.append("https://www.bbcgoodfood.com" + link.get('href'))
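
The search URLs can also be generated separately from the fetching, which makes them easy to inspect before any request is sent. The sketch below assumes the URL pattern shown in this answer is still valid; the helper name `build_search_urls` is hypothetical. `urlencode` escapes ingredients that contain spaces or other special characters.

```python
from urllib.parse import urlencode

SEARCH_ROOT = "https://www.bbcgoodfood.com/search/recipes/page/"

def build_search_urls(ingredients, first_page=1, last_page=9):
    """Return one search URL per ingredient and page, without fetching anything."""
    urls = []
    for ingredient in ingredients:
        # Encode the query once per ingredient; only the page number varies.
        query = urlencode({"q": ingredient, "sort": "-relevance"})
        for page in range(first_page, last_page + 1):
            urls.append(f"{SEARCH_ROOT}{page}/?{query}")
    return urls

urls = build_search_urls(["milk", "eggs", "flour"])
print(len(urls))
print(urls[0])
```

Each URL can then be passed to `requests.get` as in the loop above.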
      

      【Comments】:
