[Posted]: 2017-12-20 19:28:54
[Question]:
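I have been working on a restaurant food hygiene scraper. I have been able to get the scraper to collect the name, address and hygiene rating of restaurants based on a postcode. Because the food hygiene rating is displayed on the site as an image, I set up the scraper to read the image's "alt=" attribute, which contains the numeric value of the food hygiene score.
The div containing the img alt attribute that I target for the food hygiene rating looks like this: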
<div class="rating-image" style="clear: right;">
<a href="/business/abbey-community-college-newtownabbey-antrim-992915.html" title="View Details">
<img src="https://images.scoresonthedoors.org.uk//schemes/735/on_small.png" alt="5 (Very Good)">
</a>
</div>
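For reference, the numeric part of an alt value like the one above can be separated from the description with a small helper. This is only a minimal sketch: parse_rating is a hypothetical name that is not part of the script, and it assumes the alt text begins with the score digits, as in "5 (Very Good)". Any alt text that does not start with a number yields None.

```python
import re
from typing import Optional


def parse_rating(alt_text: str) -> Optional[int]:
    """Return the leading integer of an alt string such as '5 (Very Good)'.

    Hypothetical helper: returns None when the text does not start with digits.
    """
    match = re.match(r"\s*(\d+)", alt_text)
    return int(match.group(1)) if match else None


print(parse_rating("5 (Very Good)"))   # 5
print(parse_rating("no number here"))  # None
```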
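I have been able to output the food hygiene score next to each restaurant.
My problem is that some restaurants have an incorrect reading displayed next to them, e.g. a food hygiene rating of 3 instead of 4 (the rating is stored in the img alt attribute).
The link the scraper initially connects to in order to scrape is
I think this may be related to the position of the rating for loop inside the "for item in g_data" loop.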
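I have found that if I move the line
appendhygiene(scrape=[name,address,bleh])
under the loop below
for rating in ratings:
    bleh = rating['alt']
the data is scraped with the correct hygiene scores. The only issue is that not all of the records are scraped; in this case it only outputs the first 9 restaurants.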
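The effect of the loop position can be reproduced without the website. The sketch below uses made-up stand-in data (the names and ratings lists are not real results) and two hypothetical functions: last_rating_for_all mimics a rating loop that runs to completion for every restaurant, and paired_by_position pairs each restaurant with the rating at the same index. The pairing assumes every result has exactly one rating image and that both lists are in the same page order, which may not hold on the real page.

```python
# Stand-in data: three results and their ratings in page order (made up).
names = ["Cafe A", "Cafe B", "Cafe C"]
ratings = [
    {"alt": "5 (Very Good)"},
    {"alt": "4 (Good)"},
    {"alt": "3 (Generally Satisfactory)"},
]


def last_rating_for_all(names, ratings):
    """Inner loop runs to the end for every name, so each row gets the last rating."""
    rows = []
    for name in names:
        for rating in ratings:
            bleh = rating["alt"]  # overwritten on every pass
        rows.append([name, bleh])
    return rows


def paired_by_position(names, ratings):
    """Pair each name with the rating at the same index (zip stops at the shorter list)."""
    return [[name, rating["alt"]] for name, rating in zip(names, ratings)]


print(last_rating_for_all(names, ratings))
print(paired_by_position(names, ratings))
```

In the first function every row ends up with the final rating on the page, so only restaurants that happen to share that rating look correct.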
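Thanks to anyone who can look at my code below and help solve the issue.
PS, I used the postcode BT367NG to scrape the restaurants (if you test the script, you can use it to see the restaurants that do not display the correct hygiene value, e.g. Lins Garden is a 4 on the site, but the scraped data shows 3).
My full code is below: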
import requests
import time
import csv
import sys
from bs4 import BeautifulSoup

hygiene = []

def deletelist():
    hygiene.clear()

def savefile():
    filename = input("Please input name of file to be saved")
    with open (filename + '.csv','w') as file:
        writer=csv.writer(file)
        writer.writerow(['Address','Town', 'Price', 'Period'])
        for row in hygiene:
            writer.writerow(row)
    print("File Saved Successfully")

def appendhygiene(scrape):
    hygiene.append(scrape)

def makesoup(url):
    page=requests.get(url)
    print(url + " scraped successfully")
    return BeautifulSoup(page.text,"lxml")

def hygienescrape(g_data, ratings):
    for item in g_data:
        try:
            name = (item.find_all("a", {"class": "name"})[0].text)
        except:
            pass
        try:
            address = (item.find_all("span", {"class": "address"})[0].text)
        except:
            pass
        try:
            for rating in ratings:
                bleh = rating['alt']
        except:
            pass
        appendhygiene(scrape=[name,address,bleh])

def hygieneratings():
    search = input("Please enter postcode")
    soup=makesoup(url = "https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=" + search + "&distance=1&search.x=16&search.y=21&gbt_id=0")
    hygienescrape(g_data = soup.findAll("div", {"class": "search-result"}), ratings = soup.select('div.rating-image img[alt]'))
    button_next = soup.find("a", {"rel": "next"}, href=True)
    while button_next:
        time.sleep(2)#delay time requests are sent so we don't get kicked by server
        soup=makesoup(url = "https://www.scoresonthedoors.org.uk/search.php{0}".format(button_next["href"]))
        hygienescrape(g_data = soup.findAll("div", {"class": "search-result"}), ratings = soup.select('div.rating-image img[alt]'))
        button_next = soup.find("a", {"rel" : "next"}, href=True)

def menu():
    strs = ('Enter 1 to search Food Hygiene ratings \n'
            'Enter 2 to Exit\n' )
    choice = input(strs)
    return int(choice)

while True: #use while True
    choice = menu()
    if choice == 1:
        hygieneratings()
        savefile()
        deletelist()
    elif choice == 2:
        break
    elif choice == 3:
        break
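One way to keep each rating tied to its own restaurant is to look for the image inside each result, rather than in a separate page-wide list. What follows is a minimal sketch under a stated assumption: that each div.search-result contains its own div.rating-image, which is not confirmed against the live site's markup. SAMPLE_HTML and scrape_results are made-up names for illustration, and the built-in "html.parser" backend is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML: assumes each div.search-result holds its own rating image.
SAMPLE_HTML = """
<div class="search-result">
  <a class="name" href="#">Cafe A</a>
  <span class="address">1 Example Street</span>
  <div class="rating-image"><a href="#"><img src="a.png" alt="5 (Very Good)"></a></div>
</div>
<div class="search-result">
  <a class="name" href="#">Cafe B</a>
  <span class="address">2 Example Street</span>
</div>
<div class="search-result">
  <a class="name" href="#">Cafe C</a>
  <span class="address">3 Example Street</span>
  <div class="rating-image"><a href="#"><img src="c.png" alt="4 (Good)"></a></div>
</div>
"""


def scrape_results(soup):
    """Build one [name, address, rating] row per result, searching inside each item."""
    rows = []
    for item in soup.find_all("div", {"class": "search-result"}):
        name = item.find("a", {"class": "name"})
        address = item.find("span", {"class": "address"})
        # Scoped to this item, so a missing image cannot shift later ratings.
        img = item.select_one("div.rating-image img[alt]")
        rows.append([
            name.get_text(strip=True) if name else "",
            address.get_text(strip=True) if address else "",
            img["alt"] if img else "",
        ])
    return rows


rows = scrape_results(BeautifulSoup(SAMPLE_HTML, "html.parser"))
for row in rows:
    print(row)
```

A result with no rating image gets an empty string instead of inheriting a value from a neighbouring restaurant.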
[Discussion]:
Tags: python beautifulsoup screen-scraping python-3.6 scrape