【Question title】: Scrapy Selenium: extracts first link then throws error
【Posted】: 2021-12-30 16:52:36
【Question】:

I am developing a scraper that collects property information.

The original code works perfectly.

import scrapy

URL = "https://orion.lancaster.ne.gov/Property-Detail/PropertyQuickRefID/{}"

class huntsmanCSS(scrapy.Spider):

    name = "huntsman"
    allowed_domains = ["orion.lancaster.ne.gov"]
    # One parcel ID per line; each ID becomes a property detail URL.
    f = open('parcel_ids.txt')
    start_urls = [URL.format(pid.strip()) for pid in f.readlines()]

    def parse(self, response):
        # Each field is the text of a single element on the detail page.
        yield {
            'propId': response.css('#dnn_ctr388_View_tdPropertyID::text').extract_first(),
            'address': response.css('#dnn_ctr388_View_tdPropertyAddress::text').extract_first(),
            'owner': response.css('#dnn_ctr388_View_divOwnersLabel::text').extract_first(),
            'propertyClass': response.css('#dnn_ctr388_View_tdGIPropertyClass::text').extract_first(),
            'hood': response.css('#dnn_ctr388_View_tdGINeighborhood::text').extract_first(),
            'buildType': response.css('#resImprovementTable0 > tr:nth-child(2) > td:nth-child(3)::text').extract_first(),
            'improveType': response.css('#resImprovementTable0 > tr:nth-child(2) > td:nth-child(4)::text').extract_first(),
            'yrBuilt': response.css('#resImprovementTable0 > tr:nth-child(2) > td:nth-child(5)::text').extract_first(),
            'saleDate': response.css('#dnn_ctr388_View_tblSalesHistoryData tr:nth-child(2) > td:nth-child(1)::text').extract_first(),
            'TAV': response.css('#dnn_ctr388_View_tdPropertyValueHeader::text').extract_first(),
            'price': response.css('#dnn_ctr388_View_tblSalesHistoryData > tr:nth-child(2) > td:nth-child(5)::text').extract_first(),
            'sqFt': response.css('#resImprovementTable0 > tr:nth-child(2) > td:nth-child(6)::text').extract_first(),
        }

Using a list of all parcel IDs, the URL is adjusted to go to the next page.
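As a rough sketch of how the start URLs can be built from the parcel list: the helper below strips whitespace, skips blank lines, and formats each ID into the URL template. The name `build_start_urls` and the ID `R123456` are made up for illustration and are not part of the original spider; `R402438` is the ID that appears in the log further down.

```python
URL = "https://orion.lancaster.ne.gov/Property-Detail/PropertyQuickRefID/{}"


def build_start_urls(lines, template=URL):
    """Turn an iterable of parcel-ID lines into a list of detail-page URLs."""
    urls = []
    for line in lines:
        pid = line.strip()
        if not pid:  # ignore blank lines, e.g. a trailing newline at end of file
            continue
        urls.append(template.format(pid))
    return urls


# In-memory lines stand in for the file here. With a real file one could write:
#   with open("parcel_ids.txt") as f:
#       start_urls = build_start_urls(f)
example = build_start_urls(["R402438\n", "\n", "  R123456  \n"])
print(example)
```

Because the helper accepts any iterable of lines, an open file handle can be passed directly, and the `with` block closes the file once the list is built.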

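The broken code:

There is a PDF link embedded in a JavaScript button. The PDF contains more information that I want to scrape.

The spider retrieves the first link, but then throws an error.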

import time

import scrapy
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

URL = "https://orion.lancaster.ne.gov/Property-Detail/PropertyQuickRefID/{}"

class resDatasheetLink(scrapy.Spider):

    name = "resDatasheetLink"
    allowed_domains = ["orion.lancaster.ne.gov"]
    f = open('residential.txt')
    start_urls = [URL.format(pid.strip()) for pid in f.readlines()]

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                btn = WebDriverWait(self.driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="btnDataSheet"]')))
                btn.click()
            except TimeoutException:
                break
            time.sleep(5)
            link = self.driver.current_url
            self.driver.close()

            yield {
                'datasheet': link
            }

Error:

2021-12-30 10:40:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://orion.lancaster.ne.gov/Property-Detail/PropertyQuickRefID/R402438> (referer: None)
2021-12-30 10:40:36 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://localhost:19113/session/5acb1d8f4ebdb13482ab40a67f846d1d/url {"url": "https://orion.lancaster.ne.gov/Property-Detail/PropertyQuickRefID/R402438"}
2021-12-30 10:40:36 [urllib3.connectionpool] DEBUG: http://localhost:19113 "POST /session/5acb1d8f4ebdb13482ab40a67f846d1d/url HTTP/1.1" 404 878
2021-12-30 10:40:36 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-12-30 10:40:36 [scrapy.core.scraper] ERROR: Spider error processing <GET https://orion.lancaster.ne.gov/Property-Detail/PropertyQuickRefID/R402438> (referer: None)
Traceback (most recent call last):

selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id

【Question comments】:

    Tags: python selenium scrapy


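    【Solution 1】:

    break takes you out of the while loop. You need to un-indent the last few lines directly below the try-except block, and call self.driver.close() (preferably self.driver.quit()) at the end of parse, like this: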

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                btn = WebDriverWait(self.driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="btnDataSheet"]')))
                btn.click()
            except TimeoutException:
                break
        # Everything below runs once, after the loop has exited via break.
        time.sleep(5)
        link = self.driver.current_url

        yield {
            'datasheet': link
        }

        self.driver.close()
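The effect of the indentation can be reproduced without a browser. The sketch below is only a model: `StandInDriver`, `click_button`, `parse_original`, `parse_unindented`, and `collect` are hypothetical names, and the stand-in raises plain `RuntimeError` and `TimeoutError` in place of Selenium's exceptions. It assumes that clicking the button navigates away from the page, so a second wait for the button times out.

```python
class StandInDriver:
    """Minimal stand-in for a browser session (not Selenium's API)."""

    def __init__(self):
        self.open = True
        self.button_present = True

    def click_button(self):
        if not self.open:
            raise RuntimeError("invalid session id")
        if not self.button_present:
            raise TimeoutError("button never became clickable")
        self.button_present = False  # the click navigates away from the page

    def close(self):
        self.open = False


def parse_original(driver):
    """Question's shape: close() and yield sit inside the loop."""
    while True:
        try:
            driver.click_button()
        except TimeoutError:
            break
        driver.close()
        yield "datasheet-url"


def parse_unindented(driver):
    """Suggested shape: the loop only clicks; the rest runs once after it."""
    while True:
        try:
            driver.click_button()
        except TimeoutError:
            break
    yield "datasheet-url"
    driver.close()


def collect(parse, driver):
    """Drain a parse generator, recording the items and any session error."""
    items, error = [], None
    try:
        for item in parse(driver):
            items.append(item)
    except RuntimeError as exc:
        error = str(exc)
    return items, error


original_result = collect(parse_original, StandInDriver())
fixed_result = collect(parse_unindented, StandInDriver())
print(original_result)
print(fixed_result)
```

In this model the original shape yields one item and then fails on the next loop iteration, because the session was closed before the loop came around again. The un-indented shape yields one item and finishes cleanly.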
    

    【Comments】:

    • You haven't told us your use case, but I tried to help anyway. Good luck.
    【Solution 2】:

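    Given how the spider is configured, the problem is the loop. parse is called once per start URL, so a single click per page is enough: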

    class rDataLink(scrapy.Spider):

        name = "rDataLink"
        allowed_domains = ["orion.lancaster.ne.gov"]
        f = open('residential.txt')
        start_urls = [URL.format(pid.strip()) for pid in f.readlines()]

        def __init__(self):
            self.driver = webdriver.Chrome()

        def parse(self, response):
            self.driver.get(response.url)
            btn = WebDriverWait(self.driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="btnDataSheet"]')))
            btn.click()
            # Wait until the click has navigated away, instead of sleeping for a fixed time.
            WebDriverWait(self.driver, 7).until(EC.url_changes(response.url))
            link = self.driver.current_url

            yield {
                'datasheet': link
            }
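Since the spider creates one driver in `__init__` and reuses it for every `parse` call, the session has to outlive each page. Scrapy calls a spider's `closed(reason)` method when the crawl finishes, which is one possible place to release the browser. The sketch below uses stand-in classes so it runs without Scrapy or Selenium: `StandInDriver` and `SketchSpider` are hypothetical, `parse` takes a plain URL string rather than a response, and the page names are placeholders. In a real spider, `closed` would presumably call `self.driver.quit()` on the Chrome driver.

```python
class StandInDriver:
    """Stand-in for a browser session; not Selenium's API."""

    def __init__(self):
        self.open = True
        self.current_url = None

    def get(self, url):
        if not self.open:
            raise RuntimeError("invalid session id")
        self.current_url = url

    def quit(self):
        self.open = False


class SketchSpider:
    """Mirrors the spider's shape: one driver shared by every parse call."""

    def __init__(self):
        self.driver = StandInDriver()

    def parse(self, url):
        # The driver stays open here so the next page can reuse the session.
        self.driver.get(url)
        yield {"datasheet": self.driver.current_url}

    def closed(self, reason):
        # Scrapy invokes closed(reason) once, when the spider finishes.
        self.driver.quit()


spider = SketchSpider()
items = []
for url in ["page-1", "page-2", "page-3"]:  # placeholder page names
    items.extend(spider.parse(url))
spider.closed("finished")

print(items)
print(spider.driver.open)
```

Keeping the teardown out of `parse` means no page can invalidate the session that a later page still needs.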
    

    【Comments】:
