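【Question Title】: WebScraping / Identical sites not working?
【Posted】: 2021-06-10 18:17:45
【Question Description】:

I want to scrape the header element from these two links. To me the two pages look exactly the same (see the images below).

Why does scraping work only for the second link and not for the first?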

import time
import requests
from bs4 import BeautifulSoup

# not working
link = "https://apps.apple.com/us/app/bingo-story-live-bingo-games/id1179108009?uo=4"
page = requests.get(link)
time.sleep(1)
soup = BeautifulSoup(page.content, "html.parser")
erg = soup.find("header")
print(f"First Link: {erg}")

# working
link = "https://apps.apple.com/us/app/jackpot-boom-casino-slots/id1554995201?uo=4"
page = requests.get(link)
time.sleep(1)
soup = BeautifulSoup(page.content, "html.parser")
erg = soup.find("header")
print(f"Second Link: {len(erg)}")

Working:

Not working:

【Question Discussion】:

    Tags: web web-scraping beautifulsoup


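    【Solution 1】:

    The page is sometimes rendered by JavaScript, which requests does not execute, so the response does not always contain the header element.

    You can use a while loop that re-requests the page, checks whether header is present in soup, and breaks once it is.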

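    import requests
    from bs4 import BeautifulSoup
    
    
    # Defined but never sent: per the discussion below, the User-Agent does not seem to be needed
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36"
    }
    link = "https://apps.apple.com/us/app/bingo-story-live-bingo-games/id1179108009?uo=4"
    
    # Re-request the page until the response contains a <header> element.
    # Note: this loops forever if the element never appears.
    while True:
        soup = BeautifulSoup(requests.get(link).content, "html.parser")
        header = soup.find("header")
        if header:
            break
    
    print(header)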
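
The loop above retries with no upper limit and no pause between requests. A minimal sketch of a bounded variant follows, assuming a small number of attempts with a short delay is acceptable. The names `retry_until_found` and `fetch_header`, the default of 5 attempts, the 1-second delay, and the 10-second timeout are illustrative choices, not part of the original answer.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_until_found(
    attempt: Callable[[], Optional[T]],
    max_attempts: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call attempt() until it returns something other than None, or give up."""
    for i in range(max_attempts):
        result = attempt()
        if result is not None:
            return result
        if i < max_attempts - 1:
            sleep(delay)  # pause before asking the server again
    return None


def fetch_header(link: str):
    """Fetch the page once and return its <header> tag, or None if missing."""
    # Third-party imports are kept local so the retry helper has no dependencies.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(link, timeout=10)
    return BeautifulSoup(response.content, "html.parser").find("header")


if __name__ == "__main__":
    link = "https://apps.apple.com/us/app/bingo-story-live-bingo-games/id1179108009?uo=4"
    header = retry_until_found(lambda: fetch_header(link))
    print(header if header is not None else "No <header> found after retries")
```

The `sleep` parameter is injectable only so the helper can be exercised without real waiting; in normal use the default `time.sleep` applies.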
    

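    【Discussion】:

    • Not 100% - sometimes the server returns a JS-only page. You need to repeat the request until the server sends the right version.
    • @AndrejKesely Interesting, because I tested it several times in a browser with JS disabled. You can post an answer instead of me. I'll upvote it.
    • Never mind - just mention it in your answer to the OP. +1 :)
    • The first thing I did was set the User-Agent... but once I had written the answer and tried to copy the output, I ran the script again and it failed :/ So no magic :)
    • @AndrejKesely I updated my answer to check the response. It seems the user-agent isn't needed at all.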
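    【Solution 2】:

    Try this to fetch any field you want from these links. At the moment it fetches the title. You can change res.json()['data'][0]['attributes']['name'] to get whichever field you are interested in. Make sure to put the URLs in the urls_to_scrape list.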

    import json
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import unquote
    
    # App pages to look up
    urls_to_scrape = [
        'https://apps.apple.com/us/app/bingo-story-live-bingo-games/id1179108009?uo=4',
        'https://apps.apple.com/us/app/jackpot-boom-casino-slots/id1554995201?uo=4'
    ]
    
    # Page fetched once to read the API token embedded in its HTML
    base_url = 'https://apps.apple.com/us/app/bingo-story-live-bingo-games/id1179108009?uo=4'
    # API endpoint; {} is filled with the numeric app ID
    link = 'https://amp-api.apps.apple.com/v1/catalog/US/apps/{}'
    
    params = {
        'platform': 'web',
        'additionalPlatforms': 'appletv,ipad,iphone,mac',
        'extend': 'customPromotionalText,customScreenshotsByType,description,developerInfo,distributionKind,editorialVideo,fileSizeByDevice,messagesScreenshots,privacy,privacyPolicyText,privacyPolicyUrl,requirementsByDeviceFamily,supportURLForLanguage,versionHistory,websiteUrl',
        'include': 'genres,developer,reviews,merchandised-in-apps,customers-also-bought-apps,developer-other-apps,app-bundles,top-in-apps,related-editorial-items',
        'l': 'en-us',
        'limit[merchandised-in-apps]': '20',
        'omit[resource]': 'autos',
        'sparseLimit[apps:related-editorial-items]': '5'
    }
    
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36'
        res = s.get(base_url)
        soup = BeautifulSoup(res.text, "lxml")
        # The config <meta> tag holds URL-encoded JSON that contains the bearer token
        token_raw = soup.select_one("[name='web-experience-app/config/environment']").get("content")
        token = json.loads(unquote(token_raw))['MEDIA_API']['token']
        s.headers['Accept'] = 'application/json'
        s.headers['Referer'] = 'https://apps.apple.com/'
        s.headers['Authorization'] = f'Bearer {token}'
    
        for url in urls_to_scrape:
            # ".../id1179108009?uo=4" -> "1179108009"
            id_ = url.split("/")[-1].strip("id").split("?")[0]
            res = s.get(link.format(id_), params=params)
            title = res.json()['data'][0]['attributes']['name']
            print(title)
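
Two steps in the script above are easy to get wrong when adapting it: pulling the numeric ID out of the URL, and reading a field from the JSON response. The sketch below shows one possible way to make both more defensive. It runs offline on a made-up payload; `extract_app_id`, `get_attribute`, the `sample` dict, and the `subtitle` field name are hypothetical and only mirror the `data[0]['attributes']` shape that the answer relies on.

```python
from typing import Any, Optional
from urllib.parse import urlparse


def extract_app_id(url: str) -> Optional[str]:
    """Return the numeric ID from an App Store URL, or None if it is absent."""
    # urlparse drops the query string, so "?uo=4" never reaches the ID.
    last_segment = urlparse(url).path.rstrip("/").split("/")[-1]
    if last_segment.startswith("id") and last_segment[2:].isdigit():
        return last_segment[2:]
    return None


def get_attribute(payload: dict, name: str, default: Any = None) -> Any:
    """Read data[0]['attributes'][name] from a response-shaped dict without raising."""
    data = payload.get("data") or []
    if not data:
        return default
    return data[0].get("attributes", {}).get(name, default)


# Hypothetical payload that mimics only the part of the response used above.
sample = {"data": [{"attributes": {"name": "Example App"}}]}

print(extract_app_id("https://apps.apple.com/us/app/jackpot-boom-casino-slots/id1554995201?uo=4"))
print(get_attribute(sample, "name"))
print(get_attribute(sample, "subtitle", default="n/a"))
```

In the original loop, `get_attribute(res.json(), "name")` could stand in for the chained indexing, so a missing field yields a default instead of a `KeyError`.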
    

    【Discussion】:
