【Question title】: Scrapy is returning content from a different webpage
【Posted】: 2021-03-04 01:26:22
【Question】:

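I'm trying to scrape fight data from Tapology.com, but the content I pull through Scrapy comes from a completely different webpage. For example, I want to extract the fighter names from the following link: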

https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii

So I open the Scrapy shell:

scrapy shell 'https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii'

Then I try to extract the fighter names with the following code:

response.css('.fighterNames ::text').getall()

This is the result I get:

['\n', '\n', '\n', 'Billy Ayash', '\n', '\n', '\n', 'Dennis Reed', '\n', '\n', '\n', '\n', '"The Punisher"', '\n', '\n', '\n']

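As you can see if you inspect the HTML on the webpage, the names returned should be "Robbie Lawler" and "Rory MacDonald". Stranger still, Scrapy returns different content every time I test this webpage in the shell; it does not always return the content of the Billy Ayash vs. Dennis Reed bout page.

Is something wrong with Scrapy? Is something wrong with Tapology.com? Any help would be appreciated! I have used Scrapy on ufcstats.com both before and after this test without any problems.

Here is the complete shell session: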

(base) davidwismer@Davids-MacBook-Pro ~ % scrapy shell 'https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii'
2021-03-03 17:18:03 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2021-03-03 17:18:03 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Sep  4 2020, 02:22:02) - [Clang 10.0.0 ], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform macOS-10.15.7-x86_64-i386-64bit
2021-03-03 17:18:03 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-03-03 17:18:03 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0}
2021-03-03 17:18:03 [scrapy.extensions.telnet] INFO: Telnet Password: b44d20b5d1bbeb73
2021-03-03 17:18:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2021-03-03 17:18:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-03-03 17:18:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-03-03 17:18:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-03-03 17:18:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-03-03 17:18:04 [scrapy.core.engine] INFO: Spider opened
2021-03-03 17:18:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii> (referer: None)
2021-03-03 17:18:05 [asyncio] DEBUG: Using selector: KqueueSelector
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fc4d97c5730>
[s]   item       {}
[s]   request    <GET https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii>
[s]   response   <200 https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii>
[s]   settings   <scrapy.settings.Settings object at 0x7fc4d97c5e50>
[s]   spider     <DefaultSpider 'default' at 0x7fc4d9e26100>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
2021-03-03 17:18:05 [asyncio] DEBUG: Using selector: KqueueSelector
In [1]: response.css('.fighterNames ::text').getall()
Out[1]: 
['\n',
 '\n',
 '\n',
 'Billy Ayash',
 '\n',
 '\n',
 '\n',
 'Dennis Reed',
 '\n',
 '\n',
 '\n',
 '\n',
 '"The Punisher"',
 '\n',
 '\n',
 '\n']

【Comments】:

    Tags: python python-3.x web-scraping scrapy


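    【Solution 1】:

    I tested it with requests + BeautifulSoup4 and got the same result.

    However, when I set the User-Agent header to something else (in the example below, a value taken from my web browser), I got valid results. Here is the code: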

    from requests import get
    from bs4 import BeautifulSoup
    
    
    def get_names(with_user_agent: bool):
        if with_user_agent:
            headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0'}
        else:
            headers = {}
    
        r = get('https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii', headers=headers)
        r.raise_for_status()
    
        soup = BeautifulSoup(r.text, features='html.parser')
        names = soup.select('.fighterNames span')
    
        print('Names:')
        for n in names:
            print(n.text.strip())
        print('---')
    
    
    if __name__ == '__main__':
        print('Without user agent:')
        for i in range(3):
            get_names(False)
    
        print('\nWith user agent:')
        for i in range(3):
            get_names(True)
    

    Output:

    Without user agent:
    Names:
    Jared Downing
    Danny Tims
    "Demon Eyes"
    
    ---
    Names:
    Allen Hope
    Mike Kent
    "Bunzy"
    
    ---
    Names:
    Paweł Sikora
    Patryk Domke
    "Ponczek"
    "Patrykos"
    ---
    
    With user agent:
    Names:
    Robbie Lawler
    Rory MacDonald
    "Ruthless"
    "Red King"
    ---
    Names:
    Robbie Lawler
    Rory MacDonald
    "Ruthless"
    "Red King"
    ---
    Names:
    Robbie Lawler
    Rory MacDonald
    "Ruthless"
    "Red King"
    ---
    

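    【Comments】:

    • Thank you! That solved my problem. In the Scrapy shell I wasn't providing any user-agent information, but in my actual spider code I do. I ran it through the spider and boom, the correct content came up. I suspect this particular site has some anti-scraping measures in place.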