【Posted】: 2020-07-29 08:44:13
【Question】:
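I am trying to use Nominatim to geocode a set of addresses scraped from the web. Nominatim works for "standard" addresses, e.g. 123 StreetName St., ExampleSuburb, but some of the scraped addresses contain "non-standard" elements, e.g. Warehouse 3, 123 StreetName., ExampleSuburb.
Is there a way to strip the "non-standard" elements so that Nominatim can find these addresses more easily? Or is there a way to make Nominatim attempt to geocode an address despite the non-standard elements?
For example, the code below raises a TypeError when it runs, and I don't know how to reformat the addresses to stop this from happening, because they are scraped straight from the website without any intervention on my part.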
import requests
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter


def scrapecafes(city, area):
    # e.g. https://www.broadsheet.com.au/melbourne/guides/best-cafes-thornbury
    url = f"https://www.broadsheet.com.au/{city}/guides/best-cafes-{area}"
    response = requests.get(url, timeout=5)
    soup = BeautifulSoup(response.content, "html.parser")

    # cafe names: scrape the elements, then clean them
    cafeNames = soup.find_all("h2", attrs={"class": "venue-title"})
    cafeNamesClean = [cafe.text.strip() for cafe in cafeNames]
    #print(cafeNamesClean)

    # addresses
    cafeAddresses = soup.find_all(attrs={"class": "address-content"})
    cafeAddressesClean = [address.text for address in cafeAddresses]
    #print(cafeAddressesClean)

    # geocode addresses, at most one request per second
    locator = Nominatim(user_agent="myGeocoder")
    geocode = RateLimiter(locator.geocode, min_delay_seconds=1)
    try:
        for item in cafeAddressesClean:
            location = geocode(item)
            # geocode() returns None when Nominatim finds no match, so
            # iterating over `location` here is where the TypeError comes from
            lat = [location.latitude for item in location]
            long = [location.longitude for item in location]
            print(location)
    except:
        # swallows the error and ends the loop at the first failing address
        pass

    # zip up for table
    fortable = zip(cafeNamesClean, cafeAddressesClean, lat, long)
    print(fortable)


# city and area are URL path segments, so they are passed as strings
scrapecafes("melbourne", "fitzroy")
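To make the first idea concrete, here is a minimal sketch of one way the "non-standard" prefix could be stripped and the shorter query tried as a fallback. Everything in it is an assumption rather than something Nominatim or geopy provides: the helper names `candidate_queries` and `geocode_with_fallback` are hypothetical, and the list of unit words (warehouse, unit, shop, ...) is only a guess at what the scraped data contains. The `geocode` argument is any callable that returns `None` on a miss, which is how `Nominatim.geocode` behaves.

```python
import re

# Hypothetical pattern: a leading unit word plus its number, e.g. "Warehouse 3,"
# or "Unit 5/". The word list is an assumption; extend it to match your data.
_UNIT_PREFIX = re.compile(
    r"^\s*(?:warehouse|unit|shop|suite|level|lot)\s+\d+[a-z]?\s*[,/]?\s*",
    flags=re.IGNORECASE,
)


def candidate_queries(address):
    """Return the address as scraped, then a version without the unit prefix."""
    cleaned = " ".join(address.split())  # collapse newlines and repeated spaces
    stripped = _UNIT_PREFIX.sub("", cleaned)
    queries = []
    for query in (cleaned, stripped):
        if query and query not in queries:
            queries.append(query)
    return queries


def geocode_with_fallback(address, geocode):
    """Try each candidate query in turn; return the first hit, or None."""
    for query in candidate_queries(address):
        location = geocode(query)
        if location is not None:
            return location
    return None
```

In the scraper, `geocode` could be the `RateLimiter(locator.geocode, min_delay_seconds=1)` object, and a `None` result would be recorded as a missing latitude/longitude instead of being iterated over. A fixed word list will not cover every odd address, so some misses should still be expected.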
【Discussion】:
Tags: python web-scraping geocoding geopy nominatim