Scrapy FormRequest doesn't return any data

【Question title】: Scrapy FormRequest doesn't return any data
【Posted】: 2020-09-05 17:42:23
【Question description】:

I am making a form request to a website. The request succeeds, but no data is returned.

Log:

2020-09-05 22:37:57 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:57 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:59 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:59 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-05 22:37:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

My code:

# -*- coding: utf-8 -*-
import scrapy

codes = open('codes.txt').read().split('\n')

class MainSpider(scrapy.Spider):
    name = 'main'
    form_url = 'https://safer.fmcsa.dot.gov/query.asp'
    start_urls = ['https://safer.fmcsa.dot.gov/CompanySnapshot.aspx']

    def parse(self, response):

        for code in codes:
        
            data = {
                'searchtype': 'ANY',
                'query_type': 'queryCarrierSnapshot',
                'query_param': 'USDOT',
                'query_string': code,
            }

            yield scrapy.FormRequest(url=self.form_url, formdata=data, callback=self.parse_form)

    def parse_form(self, response):
        cargo = response.xpath('(//table[@summary="Cargo Carried"]/tbody/tr)[2]')
        for each in cargo:
            each_x = each.xpath('.//td[contains(text(), "X")]/following-sibling::td/font/text()').get()

            yield {
                "X Values": each_x if each_x else "N/A",
            }

Here are some sample codes I use for the POST request.

2146709

273286

120670

2036998

690147

【Comments】:

Tags: python web-scraping scrapy http-post

【Solution 1】:

I believe you just need to remove tbody from your XPath here:

    cargo = response.xpath('(//table[@summary="Cargo Carried"]/tbody/tr)[2]')

Use it like this:

    cargo = response.xpath('//table[@summary="Cargo Carried"]/tr[2]') 
    # I also removed the () inside the path because you don't need it, but that didn't cause the problem.

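This works because Scrapy parses the raw page source, whereas your browser may add a tbody element when rendering the table even though it is not in the source. More information here.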
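
The difference can be reproduced offline with a small sketch. This is only an illustration: it uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selector, and the table markup and its cell contents are made up rather than taken from the real page. The point is that a path containing `tbody` matches nothing when the source has no `tbody`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: like the raw page source, it has no <tbody>.
RAW_HTML = """
<html><body>
<table summary="Cargo Carried">
  <tr><td>header row</td></tr>
  <tr><td>X</td><td>General Freight</td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(RAW_HTML)

# Path as copied from browser dev tools: expects a <tbody> that is not in the source.
with_tbody = root.findall(".//table[@summary='Cargo Carried']/tbody/tr")

# Path written against the raw source: rows are direct children of <table>.
without_tbody = root.findall(".//table[@summary='Cargo Carried']/tr")

print(len(with_tbody))     # 0 rows matched
print(len(without_tbody))  # 2 rows matched
```

In the spider itself, a quick way to check which form the server sends is to look at `response.text` (or open the page with "view source" rather than the element inspector) before writing the XPath.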

【Comments】: