【Question title】: Iterating the same scraper code over various URLs
【Posted】: 2019-05-05 21:33:23
【Question description】:

I now need to repeat the same code across multiple subdomains. I have edited my code to better reflect my problem. This is my current code:

for base in urls:
    urls = ["https://www.pedidosya.com.ar/restaurantes/buenos-aires/recoleta/empanadas-delivery","https://www.pedidosya.com.ar/restaurantes/buenos-aires/almagro/empanadas-delivery","https://www.pedidosya.com.ar/restaurantes/buenos-aires/palermo/empanadas-delivery","https://www.pedidosya.com.ar/restaurantes/buenos-aires/villa-crespo/empanadas-delivery","https://www.pedidosya.com.ar/restaurantes/buenos-aires/balvanera/empanadas-delivery",]
    page = 1
    restaurants = []

while True:
    soup = bs(requests.get(base + str(page)).text, "html.parser")
    page += 1
    sections = soup.find_all("section", attrs={"class": "restaurantData"})

    if not sections: break

    for section in sections:
        for elem in section.find_all("a", href=True, attrs={"class": "arrivalName"}):
            restaurants.append({"name": elem.text, "url": elem["href"],})

I need a .CSV with the following columns:

[(url, name of all restaurants in each url, url for each restaurant)]

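【Comments】:

  • So you just want to loop over a list and, for each item in the list, append it to a string as the subdomain?
  • Right, I want to iterate the code above over a list of URLs and end up with three columns: subdomain - name - url. Example: pedidosya.com.ar/restaurantes/buenos-aires/monserrat/… - pedidosya.com.ar/restaurantes/buenos-aires/… - El Noble Galerías Pacífico
  • OK, so the first output of your code is {'name': 'El Noble Galerías Pacífico', 'url': 'https://www.pedidosya.com.ar/restaurantes/buenos-aires/el-noble-galerias-pacifico-menu'}. Do you want to convert that to CSV format? (output)
  • Into a .CSV. Each subdomain has 50-100 restaurants, and I want a single .CSV with three columns. Does that make sense?
  • Yes. But I want the "base" in the code above to change by going through a list of URLs.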
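
To make the request in the comments concrete, here is a minimal sketch of looping over a list of listing URLs and building one paginated URL per page. The helper name `build_page_url` and the fixed range of two pages are assumptions for illustration only; the `?bt=RESTAURANT&page=` query string is the one used in the solution in this thread. Nothing is fetched here, so it runs without network access.

```python
def build_page_url(listing_url, page):
    """Return the paginated URL for one listing page (hypothetical helper)."""
    return f"{listing_url}?bt=RESTAURANT&page={page}"


listing_urls = [
    "https://www.pedidosya.com.ar/restaurantes/buenos-aires/recoleta/empanadas-delivery",
    "https://www.pedidosya.com.ar/restaurantes/buenos-aires/almagro/empanadas-delivery",
]

# Outer loop: one listing URL at a time. Inner loop: its pages.
# A real scraper would keep requesting pages until one comes back empty
# instead of stopping at a fixed page count.
page_urls = [
    build_page_url(listing_url, page)
    for listing_url in listing_urls
    for page in range(1, 3)
]

for page_url in page_urls:
    print(page_url)
```

The point of the nesting is that `page` restarts at 1 for every listing URL, which is what the original single `while True` loop was missing.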

Tags: python loops web-scraping beautifulsoup


【Solution 1】:

I think this is what you're looking for:

from bs4 import BeautifulSoup as bs
import requests
import csv

urls = [
    "https://www.pedidosya.com.ar/restaurantes/buenos-aires/recoleta/empanadas-delivery",
    "https://www.pedidosya.com.ar/restaurantes/buenos-aires/almagro/empanadas-delivery",
    "https://www.pedidosya.com.ar/restaurantes/buenos-aires/palermo/empanadas-delivery",
    "https://www.pedidosya.com.ar/restaurantes/buenos-aires/villa-crespo/empanadas-delivery",
    "https://www.pedidosya.com.ar/restaurantes/buenos-aires/balvanera/empanadas-delivery",
]

# writing

# utf-8 keeps accented restaurant names intact regardless of the platform default
with open("output.csv", 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    writer.writerow(['subdomain', 'name', 'url'])  # delete this line if you don't want the header

    for url in urls:
        base = url + "?bt=RESTAURANT&page="
        page = 1  # restart pagination for every listing URL

        while True:
            soup = bs(requests.get(base + str(page)).text, "html.parser")
            sections = soup.find_all("section", attrs={"class": "restaurantData"})

            # a page without restaurant sections means we ran past the last page
            if not sections:
                break

            for section in sections:
                for elem in section.find_all("a", href=True, attrs={"class": "arrivalName"}):
                    writer.writerow([base + str(page), elem.text, elem["href"]])
            page += 1

# reading

with open("output.csv", 'r', newline='', encoding='utf-8') as file:
    reader = csv.reader(file)
    for row in reader:
        # each row is a list, which you can do what you want with
        print(row)

This is the output:

subdomain,name,url
https://www.pedidosya.com.ar/restaurantes/buenos-aires/recoleta/empanadas-delivery?bt=RESTAURANT&page=1,Cümen-Cümen Empanadas Palermo,https://www.pedidosya.com.ar/restaurantes/buenos-aires/cumen-cumen-empanadas-palermo-menu
https://www.pedidosya.com.ar/restaurantes/buenos-aires/recoleta/empanadas-delivery?bt=RESTAURANT&page=1,El Maitén Empanadas - Al horno o fritas,https://www.pedidosya.com.ar/restaurantes/buenos-aires/el-maiten-empanadas-al-horno-o-fritas-menu
https://www.pedidosya.com.ar/restaurantes/buenos-aires/recoleta/empanadas-delivery?bt=RESTAURANT&page=1,Cümen-Cümen Empanadas - Barrio Norte,https://www.pedidosya.com.ar/restaurantes/buenos-aires/cumen-cumen-empanadas-barrio-norte-menu
https://www.pedidosya.com.ar/restaurantes/buenos-aires/recoleta/empanadas-delivery?bt=RESTAURANT&page=1,La Carbonera,https://www.pedidosya.com.ar/restaurantes/buenos-aires/la-carbonera-menu
https://www.pedidosya.com.ar/restaurantes/buenos-aires/recoleta/empanadas-delivery?bt=RESTAURANT&page=1,Tatú Empanadas Salteñas Palermo,https://www.pedidosya.com.ar/restaurantes/buenos-aires/tatu-empanadas-saltenas-palermo-menu
https://www.pedidosya.com.ar/restaurantes/buenos-aires/recoleta/empanadas-delivery?bt=RESTAURANT&page=1,Morita Palermo,https://www.pedidosya.com.ar/restaurantes/buenos-aires/morita-palermo-menu
https://www.pedidosya.com.ar/restaurantes/buenos-aires/recoleta/empanadas-delivery?bt=RESTAURANT&page=1,Doña Eulogia,https://www.pedidosya.com.ar/restaurantes/buenos-aires/dona-eulogia-menu
...
...
...

Output when reading the csv with Python:

['subdomain', 'name', 'url']
['https://www.pedidosya.com.ar/restaurantes/buenos-aires/recoleta/empanadas-delivery?bt=RESTAURANT&page=1', 'Cümen-Cümen Empanadas Palermo', 'https://www.pedidosya.com.ar/restaurantes/buenos-aires/cumen-cumen-empanadas-palermo-menu']
['https://www.pedidosya.com.ar/restaurantes/buenos-aires/recoleta/empanadas-delivery?bt=RESTAURANT&page=1', 'El Maitén Empanadas - Al horno o fritas', 'https://www.pedidosya.com.ar/restaurantes/buenos-aires/el-maiten-empanadas-al-horno-o-fritas-menu']
...
...
...

So when you read the csv file you get the output above: one list per row, which you can iterate over.
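
If you would rather have the rows grouped by listing URL without the `?bt=RESTAURANT&page=…` suffix, something along these lines might work. It is only a sketch: the helper names `strip_query` and `group_by_listing` are made up for illustration, and an in-memory string built from two of the output rows above stands in for output.csv so the example runs without scraping anything.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

# Stand-in for output.csv (two rows copied from the output shown above).
sample_csv = io.StringIO(
    "subdomain,name,url\n"
    "https://www.pedidosya.com.ar/restaurantes/buenos-aires/recoleta/empanadas-delivery?bt=RESTAURANT&page=1,"
    "La Carbonera,"
    "https://www.pedidosya.com.ar/restaurantes/buenos-aires/la-carbonera-menu\n"
    "https://www.pedidosya.com.ar/restaurantes/buenos-aires/recoleta/empanadas-delivery?bt=RESTAURANT&page=1,"
    "Morita Palermo,"
    "https://www.pedidosya.com.ar/restaurantes/buenos-aires/morita-palermo-menu\n"
)


def strip_query(url):
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def group_by_listing(csvfile):
    """Map each listing URL to a list of (name, restaurant url) tuples."""
    grouped = defaultdict(list)
    # DictReader uses the header row, so columns can be addressed by name.
    for row in csv.DictReader(csvfile):
        grouped[strip_query(row["subdomain"])].append((row["name"], row["url"]))
    return dict(grouped)


grouped = group_by_listing(sample_csv)
for listing, restaurants in grouped.items():
    print(listing, len(restaurants))
```

With the real file, you would pass a handle opened with `open("output.csv", newline="", encoding="utf-8")` instead of `sample_csv`.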

Good luck!

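【Comments】:

  • You're an absolute captain! Thanks a lot! Next up: scraping each individual menu. Also, how do I save this as a .csv on my computer?
  • Well, the code saves it as output.csv automatically, but you can change the path to whatever you want in with open("output.csv", 'w', newline='') as csvfile:
  • So when you run it, it creates a file called output.csv in the folder you run the program from.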
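
As a rough illustration of "change the path to whatever you want", the sketch below writes the same three-column file into a folder of your choosing using `pathlib`. The helper `write_rows` and the `pedidosya_demo` folder name are assumptions for this example, and the system temp directory is used only as a stand-in for whichever folder you actually prefer.

```python
import csv
import tempfile
from pathlib import Path


def write_rows(path, rows):
    """Write the header plus rows to `path`, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["subdomain", "name", "url"])
        writer.writerows(rows)
    return path.resolve()


# Hypothetical destination: swap in any folder you like,
# e.g. Path.home() / "Documents" / "output.csv".
destination = Path(tempfile.gettempdir()) / "pedidosya_demo" / "output.csv"

saved = write_rows(
    destination,
    [
        (
            "https://www.pedidosya.com.ar/restaurantes/buenos-aires/recoleta/empanadas-delivery?bt=RESTAURANT&page=1",
            "La Carbonera",
            "https://www.pedidosya.com.ar/restaurantes/buenos-aires/la-carbonera-menu",
        ),
    ],
)
print(saved)  # absolute path of the file that was written
```

A relative path such as "output.csv" is resolved against the folder the program is run from, which is why the original code drops the file there.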