【Question title】: SCRAPY FORM REQUEST doesn't return any data
【Posted】: 2020-09-05 17:42:23
【Question description】:

I am submitting a form request to a website. The request succeeds, but no data is returned.

Log:

2020-09-05 22:37:57 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:57 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:59 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:59 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-05 22:37:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

My code:

# -*- coding: utf-8 -*-
import scrapy

codes = open('codes.txt').read().split('\n')

class MainSpider(scrapy.Spider):
    name = 'main'
    form_url = 'https://safer.fmcsa.dot.gov/query.asp'
    start_urls = ['https://safer.fmcsa.dot.gov/CompanySnapshot.aspx']

    def parse(self, response):

        for code in codes:
        
            data = {
                'searchtype': 'ANY',
                'query_type': 'queryCarrierSnapshot',
                'query_param': 'USDOT',
                'query_string': code,
            }

            yield scrapy.FormRequest(url=self.form_url, formdata=data, callback=self.parse_form)

    def parse_form(self, response):
        cargo = response.xpath('(//table[@summary="Cargo Carried"]/tbody/tr)[2]')
        for each in cargo:
            each_x = each.xpath('.//td[contains(text(), "X")]/following-sibling::td/font/text()').get()

            yield {
                "X Values": each_x if each_x else "N/A",
            }

Here are some of the sample codes (USDOT numbers) I use for the POST request.

2146709

273286

120670

2036998

690147

【Discussion】:

    Tags: python web-scraping scrapy http-post


    【Solution 1】:

    I believe you just need to remove tbody from your XPath here:

        cargo = response.xpath('(//table[@summary="Cargo Carried"]/tbody/tr)[2]')
    

    Use this instead:

        cargo = response.xpath('//table[@summary="Cargo Carried"]/tr[2]') 
        # I also removed the () inside the path because you don't need it, but that didn't cause the problem.
    

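    The reason is that Scrapy parses the raw source of the page, whereas your browser may insert a tbody element when the source does not contain one. So an XPath copied from the browser's inspector can include a tbody step that does not exist in the response Scrapy receives, and the selector then matches nothing. More information here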
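
To illustrate the mismatch, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors. The markup in RAW_HTML is a made-up, simplified table, not the real response from the site. It only assumes, as the answer suggests, that the raw source has no tbody element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the raw source has no <tbody>,
# even though a browser's inspector would typically display one.
RAW_HTML = """
<html><body>
<table summary="Cargo Carried">
  <tr><td>header row</td></tr>
  <tr><td>X</td><td>General Freight</td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(RAW_HTML)

# Path as copied from the browser, including the tbody step.
with_tbody = root.findall(".//table[@summary='Cargo Carried']/tbody/tr")

# Same path without tbody, matching the rows as they appear in the source.
without_tbody = root.findall(".//table[@summary='Cargo Carried']/tr")

print(len(with_tbody))     # 0
print(len(without_tbody))  # 2
```

In a spider, the quickest way to check the real page is to inspect response.text (or open the page with "view source" rather than the element inspector) and confirm whether tbody actually appears before using it in a selector.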

    【Discussion】:
