【Question title】: Scrapy not returning elements
【Posted】: 2020-10-14 14:16:02
【Question description】:

I have been trying in vain to retrieve data from https://www.etoro.com/discover/people/results. Say I want to get the nickname elements first. In the HTML source they appear in this form: <div _ngcontent-bqd-c27="" automation-id="trade-item-name" class="symbol">markaungier</div>

I tried the following three approaches:

  1. A CSS selector: nickname = response.css("[automation-id=trade-item-name]")
  2. A relative XPath: nickname = response.xpath("//div[@automation-id='trade-item-name']")
  3. The full XPath: response.xpath("/html/body/ui-layout/div/div/div[2]/et-discovery-people-results/div/div/et-discovery-people-results-grid/div/div/div/et-user-card[1]/div/header/et-card-avatar/a/div[2]/div[1]")

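Strangely, none of them returns anything. What is going on here? Is this the problem described as "Some webpages show the desired data when you load them in a web browser. However, when you download them using Scrapy, you cannot reach the desired data using selectors"?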

My full code is below:

import scrapy
from scrapy.crawler import CrawlerProcess

class EtoroSpider(scrapy.Spider):
    name = "traders"
    start_urls = [
         "https://www.etoro.com/discover/people/results",
    ]

    def parse(self, response):

        nickname = response.xpath("//div[@automation-id='trade-item-name']")
        print(nickname)

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})

process.crawl(EtoroSpider)
process.start()

Here is the Scrapy output:

2020-10-14 16:29:08 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2020-10-14 16:29:08 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.1, Platform Windows-10-10.0.18362-SP0
2020-10-14 16:29:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-14 16:29:08 [scrapy.crawler] INFO: Overridden settings:
{}
2020-10-14 16:29:08 [scrapy.extensions.telnet] INFO: Telnet Password: adf8b7868ee25c32
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-14 16:29:08 [scrapy.core.engine] INFO: Spider opened
2020-10-14 16:29:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-14 16:29:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-14 16:29:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.etoro.com/discover/people/results> (referer: None)
[]
2020-10-14 16:29:09 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-14 16:29:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 236,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 23288,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.353381,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 10, 14, 14, 29, 9, 150136),
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 10, 14, 14, 29, 8, 796755)}
2020-10-14 16:29:09 [scrapy.core.engine] INFO: Spider closed (finished)

Edit: I fetched the source code that Scrapy sees with scrapy fetch --nolog https://www.etoro.com/discover/people/results > response.html and found that it contains injected JavaScript and no trace of the <div> tag above.
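The check described in the edit can also be done programmatically. The following is a minimal sketch using only the Python standard library; the names AttrCounter and inspect_html are made up for illustration. It counts how many tags carry the automation-id="trade-item-name" attribute and how many <script> tags the document contains, which makes it easy to see that the saved response holds scripts but none of the rendered elements.

```python
from html.parser import HTMLParser


class AttrCounter(HTMLParser):
    """Counts tags carrying a given attribute/value pair, plus <script> tags."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr = attr
        self.value = value
        self.count = 0    # tags that carry attr="value"
        self.scripts = 0  # <script> tags seen

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        # attrs is a list of (name, value) tuples
        if (self.attr, self.value) in attrs:
            self.count += 1


def inspect_html(html_text, attr="automation-id", value="trade-item-name"):
    """Return (matching_tags, script_tags) for the given HTML text."""
    parser = AttrCounter(attr, value)
    parser.feed(html_text)
    return parser.count, parser.scripts
```

To run it against the file saved by scrapy fetch, read response.html into a string and pass it to inspect_html. A result with zero matching tags and at least one script tag suggests the elements are created by JavaScript after the page loads, so selectors on the raw response cannot find them.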

【Question discussion】:

    Tags: python web-scraping scrapy


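    【Solution 1】:

    You can inspect the AJAX data requests in the Network tab of your browser's developer tools. In this case there are several fairly large responses, and they most likely contain the data you need. So you can get the data through the API without parsing the main page at all.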
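As a rough illustration of this approach, here is a minimal sketch using only the Python standard library. Everything site-specific in it is an assumption: API_URL is a placeholder to be replaced with the request URL actually shown in the Network tab, and the "Items" and "UserName" field names are hypothetical and must be adapted to the real JSON. The answer does not name the endpoint, so none is guessed here.

```python
import json
import urllib.request

# Placeholder (assumption): replace with the XHR URL seen in the Network tab.
API_URL = "https://example.com/api/people/results?page=1"


def build_request(url):
    """Build a request that resembles the browser's own XHR call."""
    headers = {
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0",
    }
    return urllib.request.Request(url, headers=headers)


def extract_usernames(payload):
    """Collect nicknames from the decoded JSON.

    Assumed (hypothetical) shape: {"Items": [{"UserName": "..."}, ...]}.
    """
    return [
        item["UserName"]
        for item in payload.get("Items", [])
        if "UserName" in item
    ]


def fetch_usernames(url=API_URL):
    """Download the JSON document and return the nicknames it contains."""
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        return extract_usernames(json.load(resp))
```

Inside a Scrapy spider the same idea would be to put the API URL in start_urls and read the decoded body with response.json() in parse (available in Scrapy 2.2 and later) instead of using selectors. If the server rejects the request, copying further headers from the browser request is usually the next thing to try.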

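    【Discussion】:

    • Thanks! But what if the site's API is in alpha and they currently don't provide users with any API keys?
    • Most of the time you don't need a key. You can use the same data source the frontend uses directly. Just simulate the frontend request and get plain JSON from the API.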