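[Posted]: 2017-11-28 16:38:12
[Question]:
I have just started using XPath for HTML scraping, so I am a bit confused by the syntax. I am trying to extract the URL from the following snippet of the page source: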
<a href="/realestateandhomes-detail/15645-SW-74th-Circle-Dr-Apt-5_Miami_FL_33193_M69309-37779">
<img alt="15645 Sw 74th Circle Dr Apt 5, Miami, FL 33193" title="15645 Sw 74th Circle Dr Apt 5, Miami, FL 33193" class="js-srp-listing-photos" itemprop="image" data-src="https://ap.rdcpix.com/1980533383/49e7a93da461352c04b8e7146a8d2ceel-m0xd-w480_h480_q80.jpg" data-omtag="srp-listMap:result:photo" src="https://ap.rdcpix.com/1980533383/49e7a93da461352c04b8e7146a8d2ceel-m0xd-w480_h480_q80.jpg" />
</a>
The HTML path to the element is as follows:
<body>
<li>
<div>
<a></a>
I am using Scrapy to parse the HTML page, and this is my code so far:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from realtor.items import RealtorItem
class RealtorSpider(BaseSpider):
    name = "realtor"
    allowed_domains = ["realtor.com"]
    start_urls = [
        "http://www.realtor.com/realestateandhomes-search/Miami_FL"
    ]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//li/div/a/@href')
        items = []
        for site in sites:
            item = RealtorItem()
            item['link'] = site.select('div/a/@href').extract()
            items.append(item)
        return items
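When I run the code, it raises an error at line 16, i.e. item[] = site.select().extract(). I am not sure whether the syntax is wrong or whether there is another underlying problem that I am missing.
The error is:
KeyError: 'RealtorItem does not support field: link'
My items.py code is as follows: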
from scrapy.item import Item, Field
class RealtorItem(Item):
    link = scrapy.Field()
[Comments]:
-
Which version of Scrapy are you using?
-
It is Scrapy v1.4.0.
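-
The path described in the question (body > li > div > a) can be checked offline before involving a spider. The following is only a minimal sketch and is not Scrapy code: it assumes the fragment is well-formed XML and uses the standard library's xml.etree.ElementTree, whose limited XPath subset cannot end in @href, so it selects the a elements and then reads the attribute. FRAGMENT and extract_links are hypothetical names introduced for illustration, and the img tag is shortened.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the page structure described in the question:
# body > li > div > a (the <img> attributes are shortened for brevity).
FRAGMENT = """
<body>
  <li>
    <div>
      <a href="/realestateandhomes-detail/15645-SW-74th-Circle-Dr-Apt-5_Miami_FL_33193_M69309-37779">
        <img alt="15645 Sw 74th Circle Dr Apt 5, Miami, FL 33193" />
      </a>
    </div>
  </li>
</body>
"""


def extract_links(markup):
    """Return the href of every <a> reachable via li/div/a (hypothetical helper)."""
    root = ET.fromstring(markup)
    # ElementTree cannot select attributes directly, so find the <a> elements
    # first and read href from each one; anchors without href are skipped.
    return [a.get("href") for a in root.findall(".//li/div/a") if a.get("href")]


links = extract_links(FRAGMENT)
print(links)
```

One thing this makes visible: the path only needs to be walked once. In the spider above, the first query //li/div/a/@href already selects the attribute itself, and an attribute node has no children, so a second relative div/a/@href applied to each result would match nothing.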
Tags: python xpath web-scraping scrapy