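【Posted】: 2021-12-29 02:52:41
【Question】:
I am trying to build a crawler with this rule that follows the link to each property's page and scrapes the details. However, the URLs are relative and cannot be used in a Scrapy crawler Rule, because it only accepts absolute URLs. Below is the workaround I came up with using process_value, but it does not work. Can anyone suggest another way to solve this? Thanks!
Here is the current code: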
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class EdgepropSpider(CrawlSpider):
    name = 'edgeprop'
    allowed_domains = ['edgeprop.my']
    start_urls = ['https://www.edgeprop.my/buy/malaysia/all-residential']

    rules = (
        Rule(LinkExtractor(restrict_xpaths=("//div[@class='card tep-listing-card']/a/@href"), process_value=lambda x: 'https://edgeprop.my'+x), callback='parse_item', follow=True),
        #Rule(LinkExtractor(restrict_xpaths=("//nav[@aria-label='Listing Page navigation']//li[position() = last()]/a")), follow=True)
    )

    def parse_item(self, response):
        yield {
            'Name': response.xpath("//div[@class='save-share']/following-sibling::h1/text()").get()
        }
Here is the output:
2021-12-29 10:42:18 [scrapy.core.engine] INFO: Spider opened
2021-12-29 10:42:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-29 10:42:18 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-12-29 10:42:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.edgeprop.my/buy/malaysia/all-residential> (referer: None)
2021-12-29 10:42:18 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-29 10:42:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 328,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 4126,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.237148,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 29, 2, 42, 18, 936521),
'httpcompression/response_bytes': 10918,
'httpcompression/response_count': 1,
'log_count/DEBUG': 1,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2021, 12, 29, 2, 42, 18, 699373)}
2021-12-29 10:42:18 [scrapy.core.engine] INFO: Spider closed (finished)
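On the relative-URL concern in the question: Scrapy's documentation describes the Link objects returned by LinkExtractor as carrying absolute URLs, so relative hrefs are normally resolved against the page URL without a manual prefix. Separately, restrict_xpaths is documented as selecting regions (elements) in which to look for links, so an expression ending in /@href is unlikely to give the extractor anything to work with. The sketch below uses only the standard library's urljoin to show that resolution step in isolation. The helper name resolve and both hrefs are hypothetical, chosen for illustration and not taken from the real site.

```python
from urllib.parse import urljoin

# Start page of the spider, as given in the question.
PAGE_URL = "https://www.edgeprop.my/buy/malaysia/all-residential"


def resolve(href: str, base: str = PAGE_URL) -> str:
    """Resolve an href found on a page against that page's URL."""
    return urljoin(base, href)


# Hypothetical hrefs, for illustration only.
print(resolve("/listing/sale/123"))
# -> https://www.edgeprop.my/listing/sale/123
print(resolve("https://www.edgeprop.my/listing/sale/456"))
# -> https://www.edgeprop.my/listing/sale/456 (already absolute, unchanged)
```

If the extractor already hands over absolute URLs, prepending 'https://edgeprop.my' to each value would only corrupt them, so dropping process_value and pointing restrict_xpaths at the a elements is probably the first thing to try.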
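【Comments】:
- Even if process_value were correct, it would not solve the problem: the listings are dynamically loaded content. Try loading the page with JavaScript disabled and see what is left.
- Ah yes, you are right. Any suggestions on how to scrape this: scrapy-splash or scrapy-selenium?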
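One way to confirm the point made in the comments is to look at the HTML the server actually returns, before any JavaScript runs, and count the listing links in it. The following is a minimal sketch using only the standard library. The class name CardLinkCollector, the helper card_links, and both HTML samples are hypothetical and not copied from the real site. It is also looser than the XPath in the question: it accepts any a nested inside a div whose class list contains both card and tep-listing-card.

```python
from html.parser import HTMLParser


class CardLinkCollector(HTMLParser):
    """Collect hrefs of <a> tags nested inside listing-card <div> elements."""

    TARGET_CLASSES = {"card", "tep-listing-card"}

    def __init__(self):
        super().__init__()
        self.div_stack = []  # one bool per open <div>: is it a listing card?
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = set((attrs.get("class") or "").split())
            self.div_stack.append(self.TARGET_CLASSES <= classes)
        elif tag == "a" and any(self.div_stack) and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()


def card_links(html: str) -> list:
    """Return the hrefs of all links found inside listing cards."""
    parser = CardLinkCollector()
    parser.feed(html)
    return parser.hrefs


# Hypothetical raw response: an empty shell that JavaScript fills in later.
SERVER_HTML = '<html><body><div id="app"></div><script src="/bundle.js"></script></body></html>'
# Hypothetical DOM after the browser has run the page's JavaScript.
RENDERED_HTML = (
    '<html><body><div id="app">'
    '<div class="card tep-listing-card"><a href="/listing/sale/123">A</a></div>'
    '<div class="card tep-listing-card"><a href="/listing/sale/456">B</a></div>'
    '</div></body></html>'
)

print(card_links(SERVER_HTML))    # -> []
print(card_links(RENDERED_HTML))  # -> ['/listing/sale/123', '/listing/sale/456']
```

To run this against the real page, the raw response can be saved with scrapy fetch --nolog <url> > page.html and its contents passed to card_links. An empty result would be consistent with the crawl log above, where only the start page was requested. In that case the usual options are to render the page first (scrapy-splash and scrapy-selenium are both candidates) or to check the browser's network tab for the request that delivers the listing data and call that directly.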
Tags: python html web-scraping scrapy web-crawler