Web Scraping - McKinsey Articles (Answers)

【Question title】: Web Scraping - McKinsey Articles
【Posted】: 2019-02-13 16:03:25
【Question】:

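I am trying to scrape article titles, but I can't figure out how to extract the title text. Could you look at my code below and suggest a solution?

I am new to Scrapy. Thanks for your help!

Screenshot of the page in the browser's developer view: https://imgur.com/a/O1lLquY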

import scrapy



class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['https://www.mckinsey.com/search?q=Agile&start=1']

    def parse(self, response):
        for quote in response.css('div.text-wrapper'):
            item = {
                'text': quote.css('h3.headline::text').extract(),
            }
            print(item)
            yield item

【Question comments】:

Tags: python web web-scraping scrapy

【Solution 1】:

That looks good for a beginner! I only changed the selector in your parse function:

for quote in response.css('div.block-list div.item'):
    yield {
        'text': quote.css('h3.headline::text').get(),
    }

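UPD: Hmm, it looks like the site loads its data through an additional request.

Open the developer tools and inspect the request to https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search, which is sent with the parameters {"q":"Agile","page":1,"app":"","sort":"default","ignoreSpellSuggestion":false}. You can build a scrapy.Request with these parameters and the appropriate headers and get back JSON containing the data. It is easy to parse with the json library.

UPD2: As this curl command shows:

curl 'https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search' -H 'content-type: application/json' --data-binary '{"q":"Agile","page":1,"app":"","sort":"default","ignoreSpellSuggestion":false}' --compressed

we need to make the request this way: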

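from scrapy import Request
import json

# The lines below belong inside a spider method such as start_requests();
# `yield` is only valid inside a function.
data = {"q": "Agile", "page": 1, "app": "", "sort": "default", "ignoreSpellSuggestion": False}
headers = {"content-type": "application/json"}
url = "https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search"
# curl's --data-binary sends a POST, so set the method explicitly (Request defaults to GET).
yield Request(url, method='POST', headers=headers, body=json.dumps(data), callback=self.parse_api)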

Then simply parse the response in the parse_api function:

def parse_api(self, response):
    data = json.loads(response.body)
    # and then extract what you need
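
For the "extract what you need" step, here is a minimal sketch that uses only the json module. It assumes the response nests results under data -> results and that each result has a title key, which are the keys used in the working solution in this answer rather than a documented API contract. The extract_titles helper and the sample body are hypothetical, made up for illustration.

```python
import json


def extract_titles(raw_body):
    """Return the title of every search result in a JSON response body."""
    payload = json.loads(raw_body)
    # Assumed layout: {"data": {"results": [{"title": ...}, ...]}}.
    # Missing or null keys fall back to an empty list instead of raising.
    results = (payload.get("data") or {}).get("results") or []
    return [row.get("title") for row in results if row.get("title")]


# Hypothetical sample body, not real API output.
sample = b'{"data": {"results": [{"title": "First"}, {"title": "Second"}, {}]}}'
print(extract_titles(sample))
```

Inside a spider you would pass response.body (or response.text) to a helper like this from parse_api.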

This way you can iterate over the page parameter in the request and fetch every page.
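
As a rough sketch of iterating over page, the request bodies can be built up front without touching the network. The page_bodies helper and the fixed page range are assumptions for illustration; in practice you may not know the last page in advance and would stop when a page comes back empty.

```python
import json

# Same parameters as in the answer; "page" is overridden per request.
BASE_PAYLOAD = {"q": "Agile", "page": 1, "app": "", "sort": "default",
                "ignoreSpellSuggestion": False}


def page_bodies(first_page, last_page):
    """Yield one JSON request body per page, leaving BASE_PAYLOAD untouched."""
    for page in range(first_page, last_page + 1):
        # Build a fresh dict each time so requests never share mutable state.
        yield json.dumps({**BASE_PAYLOAD, "page": page})


for body in page_bodies(1, 3):
    print(body)
```

Each body could then be passed as the body argument of a POST Request, one request per page.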

UPD3: Working solution:

from scrapy import Spider, Request
import json


class BrickSetSpider(Spider):
    name = "brickset_spider"

    data = {"q": "Agile", "page": 1, "app": "", "sort": "default", "ignoreSpellSuggestion": False}
    headers = {"content-type": "application/json"}
    url = "https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search"

    def start_requests(self):
        # The search API expects a POST with a JSON body; start at page 1.
        yield Request(self.url, headers=self.headers, method='POST',
                      body=json.dumps(self.data), meta={'page': 1})

    def parse(self, response):
        data = json.loads(response.body)
        results = data.get('data', {}).get('results')
        if not results:
            # An empty page means there is nothing left to fetch.
            return

        for row in results:
            yield {'title': row.get('title')}

        # Request the next page; parse() is the default callback.
        page = response.meta['page'] + 1
        self.data['page'] = page
        yield Request(self.url, headers=self.headers, method='POST', body=json.dumps(self.data), meta={'page': page})

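【Comments】:

Could you provide an example of how I would make a request to the URL above with the parameters you suggested?
@jwalman I have updated the post with all the details. Hope it helps!
I copied your code into my scraper.py. I now get the following error: SyntaxError: 'yield' outside function. I copied your code exactly and ran it from the command line with scrapy runspider scrapy.py. Can you give me some advice? Thank you so much for your help, I really appreciate it! -- @vezunchik
Did you put the yield inside the parse function?
@robots.txt Since we want to call the parse method here, we can omit the callback, because it is the default: docs.scrapy.org/en/latest/topics/… "If a Request doesn't specify a callback, the spider's parse() method will be used."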

【Solution 2】:

If you only want to select the text of the h1 tags, all you have to do is:

[tag.css('::text').extract_first(default='') for tag in response.css('.attr')]

Here is the same thing using XPath, which may be easier:

 //h1[@class='state']/text()
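
To see what that expression matches without running a spider, here is a small sketch using the standard library's xml.etree.ElementTree. This is a swapped-in tool, not Scrapy's selector engine: ElementTree supports only a subset of XPath and has no text() step, so the text is read from the .text attribute instead. The markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; real pages usually need an HTML parser.
doc = ET.fromstring(
    "<html><body>"
    "<h1 class='state'>Agile at scale</h1>"
    "<h1 class='other'>Ignored heading</h1>"
    "</body></html>"
)

# .//h1[@class='state'] selects every h1 whose class attribute equals "state".
titles = [h1.text for h1 in doc.findall(".//h1[@class='state']")]
print(titles)
```

In Scrapy itself the original expression can be passed to response.xpath(...), followed by .get() or .getall().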

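Also, I would suggest taking a look at BeautifulSoup for Python. It is simple and effective for reading a whole page's HTML and extracting text. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

A very simple example looks like this: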

from bs4 import BeautifulSoup

text = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''
# Name the parser explicitly so bs4 does not have to guess one (and warn about it).
soup = BeautifulSoup(text, "html.parser")

print(soup.get_text())

【Comments】: