【发布时间】:2020-10-14 14:16:02
【问题描述】:
我徒劳地试图从这里检索数据:https://www.etoro.com/discover/people/results。假设我想先获取昵称元素。在 HTML 源代码中以如下格式出现:<div _ngcontent-bqd-c27="" automation-id="trade-item-name" class="symbol">markaungier</div>
我尝试了以下三种方法:
- 使用 CSS 选择器
nickname = response.css("[automation-id=trade-item-name]") - 使用 XPATH 相对路径
nickname = response.xpath("//div[@automation-id='trade-item-name']") - 使用完整的 XPATH
response.xpath("/html/body/ui-layout/div/div/div[2]/et-discovery-people-results/div/div/et-discovery-people-results-grid/div/div/div/et-user-card[1]/div/header/et-card-avatar/a/div[2]/div[1]")
奇怪的是,他们都没有返回任何东西。这里发生了什么?问题是否因为this 而出现,即“某些网页在您在网络浏览器中加载它们时会显示所需的数据。但是,当您使用 Scrapy 下载它们时,您无法使用选择器获得所需的数据” em> ?
我的完整代码如下:
import scrapy
import requests
from lxml import html
from scrapy.crawler import CrawlerProcess
class EtoroSpider(scrapy.Spider):
name = "traders"
start_urls = [
"https://www.etoro.com/discover/people/results",
]
def parse(self, response):
nickname = response.xpath("//div[@automation-id='trade-item-name']")
print(nickname)
process = CrawlerProcess(settings={
"FEEDS": {
"items.json": {"format": "json"},
},
})
process.crawl(EtoroSpider)
process.start()
这里是scrapy的输出:
2020-10-14 16:29:08 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2020-10-14 16:29:08 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.1, Platform Windows-10-10.0.18362-SP0
2020-10-14 16:29:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-14 16:29:08 [scrapy.crawler] INFO: Overridden settings:
{}
2020-10-14 16:29:08 [scrapy.extensions.telnet] INFO: Telnet Password: adf8b7868ee25c32
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-14 16:29:08 [scrapy.core.engine] INFO: Spider opened
2020-10-14 16:29:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-14 16:29:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-14 16:29:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.etoro.com/discover/people/results> (referer: None)
[]
2020-10-14 16:29:09 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-14 16:29:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 236,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 23288,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.353381,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 14, 14, 29, 9, 150136),
'log_count/DEBUG': 1,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 10, 14, 14, 29, 8, 796755)}
2020-10-14 16:29:09 [scrapy.core.engine] INFO: Spider closed (finished)
编辑
我使用 scrapy fetch --nolog https://www.etoro.com/discover/people/results > response.html 获取了 Scrapy 看到的源代码,发现它包含一个注入的 JavaScript,并且没有上面的 <div> 标签的痕迹。
【问题讨论】:
标签: python web-scraping scrapy