【Question title】: Web scraping stock details from Business Insider using Scrapy
【Posted】: 2020-05-26 04:13:46
【Question description】:

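I am trying to extract the "Name", "Latest Price", and "%" fields for each stock from the following site: https://markets.businessinsider.com/index/components/s&p_500

However, no data is scraped, even though I have confirmed in the Chrome console that my XPaths match these fields.

For reference, I have been following this guide: https://realpython.com/web-scraping-with-scrapy-and-mongodb/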

items.py

from scrapy.item import Item, Field

class InvestmentItem(Item):
    ticker = Field()
    name = Field()
    px = Field()
    pct = Field()

investment_spider.py

from scrapy import Spider
from scrapy.selector import Selector
from investment.items import InvestmentItem

class InvestmentSpider(Spider):
    name = "investment"
    allowed_domains = ["markets.businessinsider.com"]
    start_urls = [
            "https://markets.businessinsider.com/index/components/s&p_500",
            ]

    def parse(self, response):
        stocks = Selector(response).xpath('//*[@id="index-list-container"]/div[2]/table/tbody/tr')

        for stock in stocks:
            item = InvestmentItem()
            item['name'] = stock.xpath('td[1]/a/text()').extract()[0]
            item['px'] = stock.xpath('td[2]/text()[1]').extract()[0]
            item['pct'] = stock.xpath('td[5]/span[2]').extract()[0]

            yield item

Console output:

...
2020-05-26 00:08:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/robots.txt> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/index/components/s&p_500> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-26 00:08:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
...
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Spider closed (finished)

【Comments】:

    Tags: javascript python reactjs web-scraping scrapy


    【Solution 1】:

    Your row-level XPath expressions are missing the leading "./". I have also simplified your XPath, dropping the "tbody" step:

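    def parse(self, response):
        # Select the rows directly under the table (no tbody step).
        stocks = response.xpath('//table[@class="table table-small"]/tr')

        # Skip the first row (presumably the column headers).
        for stock in stocks[1:]:
            item = InvestmentItem()
            item['name'] = stock.xpath('./td[1]/a/text()').get()
            # default='' avoids an AttributeError when the cell has no text node.
            item['px'] = stock.xpath('./td[2]/text()[1]').get(default='').strip()
            item['pct'] = stock.xpath('./td[5]/span[2]/text()').get()

            yield item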
    

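    【Discussion】:

    • Thank you very much! Could you explain why my XPath did not work (I tried inserting "./" at the beginning, but it still returned nothing)? Also, why can the "tbody" before "/tr" be left out of '//table[@class="table table-small"]/tr'?
    • You can find a good explanation of the dot notation and XPath searching at the link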
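
On the follow-up question in the comment above: a likely explanation (an assumption, since the raw page source is not shown in this thread) is that Chrome adds a `tbody` element when it builds the DOM, while the HTML that Scrapy downloads has `tr` directly under `table`. An XPath copied from the Chrome console then matches nothing in the raw response. The sketch below illustrates this with the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors; the `RAW_HTML` snippet is a hypothetical, heavily simplified version of the page, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the raw HTML sent by the server.
# Note that there is no <tbody> between <table> and <tr>.
RAW_HTML = """
<div id="index-list-container">
  <div>pager</div>
  <div>
    <table class="table table-small">
      <tr><th>Name</th><th>Latest Price</th></tr>
      <tr><td><a href="#">3M</a></td><td>146.44</td></tr>
      <tr><td><a href="#">AO Smith</a></td><td>42.22</td></tr>
    </table>
  </div>
</div>
"""

root = ET.fromstring(RAW_HTML)

# A path copied from the browser's DOM includes tbody and matches nothing here.
with_tbody = root.findall(".//table/tbody/tr")

# Dropping the tbody step matches the header row plus the two data rows.
without_tbody = root.findall(".//table/tr")


def names(rows, path):
    """Collect link texts from the rows; the header row has no <td>, so it adds nothing."""
    return [a.text for row in rows for a in row.findall(path)]


# Evaluated against a row, "td[1]/a" and "./td[1]/a" select the same nodes.
implicit = names(without_tbody, "td[1]/a")
explicit = names(without_tbody, "./td[1]/a")

print(len(with_tbody), len(without_tbody), implicit, explicit)
```

This also suggests why adding "./" alone made no difference: a relative path with or without the leading "./" is evaluated from the current row either way, so the rows have to be matched first.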
    【Solution 2】:

    XPath version

        def parse(self, response):
            rows = response.xpath('//*[@id="index-list-container"]/div[2]/table/tr')
            for row in rows:
                # extract() returns a list of all matches for each field.
                yield {
                    'name': row.xpath('td[1]/a/text()').extract(),
                    'price': row.xpath('td[2]/text()[1]').extract(),
                    'pct': row.xpath('td[5]/span[2]/text()').extract(),
                    'datetime': row.xpath('td[7]/span[2]/text()').extract(),
                }
    

    CSS version

        def parse(self, response):
    
            table = response.css('div#index-list-container table.table-small') 
            rows = table.css('tr') 
    
            for row in rows:
                name = row.css("a::text").get()
                high_low = row.css('td:nth-child(2)::text').get()
                date_time = row.css('td:nth-child(7) span:nth-child(2) ::text').get()
    
                yield {      
                    'name' : name, 
                    'high_low': high_low,
                    'date_time' : date_time                
                }
    

    Result

    {"high_low": "\r\n146.44", "name": "3M", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
    {"high_low": "\r\n42.22", "name": "AO Smith", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
    {"high_low": "\r\n91.47", "name": "Abbott Laboratories", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
    {"high_low": "\r\n92.10", "name": "AbbVie", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
    {"high_low": "\r\n193.71", "name": "Accenture", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
    {"high_low": "\r\n73.08", "name": "Activision Blizzard", "date_time": "05/25/2020 08:00:00 PM UTC-0400"},
    {"high_low": "\r\n385.26", "name": "Adobe", "date_time": "05/25/2020 08:00:00 PM UTC-0400"},
    {"high_low": "\r\n133.48", "name": "Advance Auto Parts", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
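
The values above still carry a leading "\r\n", and the timestamp is a plain string. A minimal post-processing sketch follows, assuming the field names from the CSS version and a timestamp layout inferred from this sample output; `clean_price`, `clean_row`, and `TIMESTAMP_FORMAT` are hypothetical helper names, not part of Scrapy.

```python
from datetime import datetime

# Layout inferred from the sample output, e.g. "05/26/2020 04:15:11 PM UTC-0400".
# This is an assumption, not a documented format of the site.
TIMESTAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p UTC%z"


def clean_price(raw):
    """Strip surrounding whitespace (including CR/LF) and convert to float."""
    if raw is None:
        return None
    # Removing commas assumes they are thousands separators.
    text = raw.strip().replace(",", "")
    return float(text) if text else None


def clean_row(row):
    """Return a tidied copy of one scraped row, or None for rows without a name."""
    if not row.get("name"):
        return None  # e.g. a header row, where no link text was found
    stamp = row.get("date_time")
    return {
        "name": row["name"].strip(),
        "price": clean_price(row.get("high_low")),
        "date_time": datetime.strptime(stamp.strip(), TIMESTAMP_FORMAT) if stamp else None,
    }


sample = {
    "high_low": "\r\n146.44",
    "name": "3M",
    "date_time": "05/26/2020 04:15:11 PM UTC-0400",
}
cleaned = clean_row(sample)
print(cleaned["name"], cleaned["price"], cleaned["date_time"].isoformat())
```

In a Scrapy project, this kind of cleanup would more commonly live in an item pipeline or an item loader; the plain functions here only keep the sketch self-contained.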
    

    【Discussion】:

    • Glad I could help.