【Posted】: 2021-12-30 16:52:36
【Question】:
I'm developing a scraper that collects property information.
The original code works perfectly.
import scrapy

URL = "https://orion.lancaster.ne.gov/Property-Detail/PropertyQuickRefID/{}"

class huntsmanCSS(scrapy.Spider):
    name = "huntsman"
    allowed_domains = ["orion.lancaster.ne.gov"]
    # One parcel ID per line; each ID is substituted into URL
    f = open('parcel_ids.txt')
    start_urls = [URL.format(pid.strip()) for pid in f.readlines()]

    def parse(self, response):
        # One item per property detail page
        yield {
            'propId': response.css('#dnn_ctr388_View_tdPropertyID::text').extract_first(),
            'address': response.css('#dnn_ctr388_View_tdPropertyAddress::text').extract_first(),
            'owner': response.css('#dnn_ctr388_View_divOwnersLabel::text').extract_first(),
            'propertyClass': response.css('#dnn_ctr388_View_tdGIPropertyClass::text').extract_first(),
            'hood': response.css('#dnn_ctr388_View_tdGINeighborhood::text').extract_first(),
            'buildType': response.css('#resImprovementTable0 > tr:nth-child(2) > td:nth-child(3)::text').extract_first(),
            'improveType': response.css('#resImprovementTable0 > tr:nth-child(2) > td:nth-child(4)::text').extract_first(),
            'yrBuilt': response.css('#resImprovementTable0 > tr:nth-child(2) > td:nth-child(5)::text').extract_first(),
            'saleDate': response.css('#dnn_ctr388_View_tblSalesHistoryData tr:nth-child(2) > td:nth-child(1)::text').extract_first(),
            'TAV': response.css('#dnn_ctr388_View_tdPropertyValueHeader::text').extract_first(),
            'price': response.css('#dnn_ctr388_View_tblSalesHistoryData > tr:nth-child(2) > td:nth-child(5)::text').extract_first(),
            'sqFt': response.css('#resImprovementTable0 > tr:nth-child(2) > td:nth-child(6)::text').extract_first()
        }
It takes a list of all the parcel IDs and adjusts the URL to go to the next page.
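A minimal sketch of how that URL list gets built, assuming one parcel ID per line in the input file. The helper name `build_start_urls` and the ID `R000001` are placeholders for illustration, and an in-memory `io.StringIO` stands in for the real file. Skipping blank lines is an added assumption that avoids a URL with an empty ID.

```python
import io

URL = "https://orion.lancaster.ne.gov/Property-Detail/PropertyQuickRefID/{}"


def build_start_urls(lines):
    """Turn an iterable of parcel-ID lines into detail-page URLs, skipping blank lines."""
    return [URL.format(pid.strip()) for pid in lines if pid.strip()]


# Stand-in for open('parcel_ids.txt'); R000001 is a made-up placeholder ID
sample_file = io.StringIO("R402438\n\nR000001\n")
start_urls = build_start_urls(sample_file)

for url in start_urls:
    print(url)
```

With a real file, `with open('parcel_ids.txt') as f: start_urls = build_start_urls(f)` would also close the file handle once the list is built.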
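The broken code:
There is a PDF link embedded in a JavaScript button. The PDF contains more information that I want to scrape.
It retrieves the first link, but then throws an error.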
import time

import scrapy
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

URL = "https://orion.lancaster.ne.gov/Property-Detail/PropertyQuickRefID/{}"

class resDatasheetLink(scrapy.Spider):
    name = "resDatasheetLink"
    allowed_domains = ["orion.lancaster.ne.gov"]
    f = open('residential.txt')
    start_urls = [URL.format(pid.strip()) for pid in f.readlines()]

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)
        while True:
            try:
                btn = WebDriverWait(self.driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="btnDataSheet"]')))
                btn.click()
            except TimeoutException:
                break
        time.sleep(5)
        link = self.driver.current_url
        self.driver.close()
        yield {
            'datasheet': link
        }
The error:
2021-12-30 10:40:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://orion.lancaster.ne.gov/Property-Detail/PropertyQuickRefID/R402438> (referer: None)
2021-12-30 10:40:36 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://localhost:19113/session/5acb1d8f4ebdb13482ab40a67f846d1d/url {"url": "https://orion.lancaster.ne.gov/Property-Detail/PropertyQuickRefID/R402438"}
2021-12-30 10:40:36 [urllib3.connectionpool] DEBUG: http://localhost:19113 "POST /session/5acb1d8f4ebdb13482ab40a67f846d1d/url HTTP/1.1" 404 878
2021-12-30 10:40:36 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-12-30 10:40:36 [scrapy.core.scraper] ERROR: Spider error processing <GET https://orion.lancaster.ne.gov/Property-Detail/PropertyQuickRefID/R402438> (referer: None)
Traceback (most recent call last):
selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id
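Selenium reports `invalid session id` when a command is sent to a browser session that has already ended. One plausible reading of the log is that the driver created once in `__init__` is shut down during the first `parse()` call, so the `POST .../url` for the next page has no session to talk to. The sketch below illustrates that lifecycle without Selenium or Scrapy: `FakeDriver`, `ClosesInParse`, `ClosesAtEnd`, and `crawl` are hypothetical stand-ins, not real library classes, and the fake driver assumes that closing its only window ends the session.

```python
class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver (not a real Selenium class)."""

    def __init__(self):
        self.session_open = True
        self.current_url = None

    def get(self, url):
        if not self.session_open:
            # Mirrors the failure in the log: the session no longer exists
            raise RuntimeError("invalid session id")
        self.current_url = url

    def close(self):
        # Assumption: only one window is open, so closing it ends the session
        self.session_open = False

    quit = close


class ClosesInParse:
    """Same shape as the spider above: the driver is shut down inside parse()."""

    def __init__(self):
        self.driver = FakeDriver()

    def parse(self, url):
        self.driver.get(url)
        link = self.driver.current_url
        self.driver.close()  # the session ends after the first page
        return {"datasheet": link}


class ClosesAtEnd:
    """Variant that keeps one session alive for every page."""

    def __init__(self):
        self.driver = FakeDriver()

    def parse(self, url):
        self.driver.get(url)
        return {"datasheet": self.driver.current_url}

    def closed(self, reason):
        # In Scrapy, a spider's closed(reason) method is called when the spider closes
        self.driver.quit()


def crawl(spider, urls):
    """Call parse() for each URL and collect items until the first error."""
    items, error = [], None
    for url in urls:
        try:
            items.append(spider.parse(url))
        except RuntimeError as exc:
            error = str(exc)
            break
    if hasattr(spider, "closed"):
        spider.closed("finished")
    return items, error


urls = ["page-1", "page-2"]
print(crawl(ClosesInParse(), urls))
print(crawl(ClosesAtEnd(), urls))
```

In the sketch, the first variant returns one item and then fails on the second page, which matches the behaviour described above, while the second variant processes both pages and shuts the driver down once at the end.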
【Comments】: