【Question title】: Web scraping with scrapy
【Posted】: 2014-06-10 15:56:50
【Question】:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from botg.items import BotgItem


URL = "http://store.tcgplayer.com/magic/born-of-the-gods?PageNumber=%d"

class MySpider(BaseSpider):
    name = "tcg"
    allowed_domains = ["tcgplayer.com"]
    start_urls = [URL % 1]

    def __init__(self):
        self.page_number = 1

    def parse(self, response):
        print self.page_number
        print "--------------------BREAK-------------------------"

        sel = Selector(response)
        titles = sel.xpath("//div[@class='magicCard']")
        if not titles:
            raise CloseSpider('No more pages')

        for title in titles:
            item = BotgItem()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]
            item["rarity"] = title.xpath(".//li[@href='/magic/born-of-the-gods']/text()").extract()

            vendor = title.xpath(".//tr[@class='vendor ']")
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
            item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
            item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()

            yield item

        self.page_number += 1
        yield Request(URL % self.page_number)

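I am using this code to scrape the page, but I cannot get the "rarity" field to scrape. Everything else seems to work. Any help would be appreciated. Also, can anyone tell me what the "[0]" does after .extract() on the line that sets the cardname item?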

【Comments】:

  • Can you elaborate on "but am not able to get the "rarity" to scrape"? Do you see an error? extract() returns a list, so [0] returns the first element of extract()'s output.

Tags: web-scraping scrapy


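【Solution 1】:

For the rarity field, I suggest that you:

  • get the text representation of the <ul> that contains the <li class="cardName">,
  • use a regular expression to extract whatever comes after "Rarity:".

Something like this: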

for title in titles:
    item = BotgItem()
    item["rarity"] = title.xpath('string(.//ul[li[@class="cardName"]])').re(r'Rarity:\s*(\w+)')
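
To see the two steps in isolation, here is a minimal sketch that uses only the Python standard library instead of Scrapy. The sample markup, the card name, and the helper rarity_from_card are assumptions made up for illustration; the real page structure may differ. Joining itertext() stands in for XPath string(), and re.findall stands in for the selector's .re() method, which likewise returns a list of matches.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup: the real page structure is an assumption here.
SAMPLE = """
<div class="magicCard">
  <ul>
    <li class="cardName"><a href="/magic/born-of-the-gods/example-card">Example Card</a></li>
    <li>Rarity: Uncommon</li>
  </ul>
</div>
""".strip()


def rarity_from_card(card):
    """Hypothetical helper: return the regex matches for the rarity as a list."""
    for ul in card.iter("ul"):
        # Keep only the <ul> that has a <li class="cardName"> child.
        if ul.find("li[@class='cardName']") is not None:
            # Joining itertext() approximates XPath string(): all descendant text.
            text = "".join(ul.itertext())
            return re.findall(r"Rarity:\s*(\w+)", text)
    return []


card = ET.fromstring(SAMPLE)
matches = rarity_from_card(card)
print(matches)
```

Because the result is a list, item["rarity"] in the spider would also hold a list; index it only after checking that it is not empty.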

Regarding your second question: .extract() returns a list of strings, so [0] simply selects the first element of that list.
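
One consequence worth noting is that [0] fails when nothing matched, because the list is empty. The sketch below uses plain Python lists as stand-ins for what .extract() returns; the values and the helper first_or_default are hypothetical and not part of Scrapy.

```python
# Plain lists stand in for .extract() results (illustrative values only).
extracted = ["Example Card", "Another Card"]
nothing_matched = []


def first_or_default(values, default=None):
    """Hypothetical helper: first element of a list, or a default if it is empty."""
    return values[0] if values else default


# Indexing a non-empty list returns its first string.
print(extracted[0])

# The helper avoids an exception when the XPath matched nothing.
print(first_or_default(nothing_matched, "unknown"))

try:
    nothing_matched[0]
except IndexError:
    print("[0] on an empty result raises IndexError")
```

Later Scrapy releases also provide extract_first() on selector lists, which returns None rather than raising when there is no match.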
