Python Screen Scraping Forbes.com: Answers

【Question title】: Python Screen Scraping Forbes.com
【Posted】: 2019-01-17 08:23:29
【Question description】:

I am writing a Python program to extract and store metadata from interesting online tech articles: "og:title", "og:description", "og:image", "og:url", and "og:site_name".

Here is the code I am using...

import urllib3
from bs4 import BeautifulSoup

# url is assumed to hold the address of the article to scrape

# Setup Headers
headers = {}
headers['Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
headers['Accept-Charset'] = 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
headers['Accept-Encoding'] = 'none'
headers['Accept-Language'] = "en-US,en;q=0.8"
headers['Connection'] = 'keep-alive'
headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"

# Create the Request
http = urllib3.PoolManager()

# Create the Response (headers are passed by keyword)
response = http.request('GET', url, headers=headers)

# BeautifulSoup - Construct
soup = BeautifulSoup(response.data, 'html.parser')

# Scrape <meta property="og:title" content=" x x x ">, keeping the longest title found
title = ""
for tag in soup.find_all("meta"):
    if tag.get("property") == "og:title":
        content = tag.get("content") or ""
        if len(content) > len(title):
            title = content
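
The same scan can be extended to all five properties listed in the question. The following is only a minimal sketch: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without third-party packages, and the names `OG_PROPERTIES`, `OpenGraphParser`, `extract_open_graph`, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

# The five Open Graph properties named in the question.
OG_PROPERTIES = ("og:title", "og:description", "og:image", "og:url", "og:site_name")


class OpenGraphParser(HTMLParser):
    """Collects the content of selected <meta property="og:..."> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property")
        content = (attributes.get("content") or "").strip()
        # Keep the longest value per property, like the length check in the question.
        if prop in OG_PROPERTIES and len(content) > len(self.metadata.get(prop, "")):
            self.metadata[prop] = content


def extract_open_graph(html):
    """Return a dict mapping each og: property found in html to its content."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.metadata


# Hypothetical sample page, used only to exercise the parser offline.
sample_html = """
<html><head>
<meta property="og:title" content="Example title">
<meta property="og:site_name" content="Example Site" />
<meta name="viewport" content="width=device-width">
</head></html>
"""
metadata = extract_open_graph(sample_html)
print(metadata)
```

With a real page, `response.data.decode("utf-8", errors="replace")` from the code above could be passed to `extract_open_graph` instead of the sample string.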

The program works fine on every site except one. On "forbes.com", I cannot reach the articles with Python:

URL = https://www.forbes.com/consent/?toURL=https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/#72c3b4e21086

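I cannot get past this consent page, which appears to be the "Cookie Consent Manager" solution from "TrustArc". On a computer, you essentially give your consent once, and on every subsequent visit you can access the articles.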

If I request the "toURL" URL directly: https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/#72c3b4e21086

to bypass the "https://www.forbes.com/consent/" page, I am redirected back to that page.
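
One way to notice this redirect in code is to inspect the final URL of a response and check whether it points at the consent page. The sketch below uses only the standard library and assumes the URL shape quoted in the question (a `/consent/` path with a `toURL` query parameter); the names `CONSENT_PATH` and `consent_target` are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Path seen in the redirect URL quoted in the question (an assumption about the site).
CONSENT_PATH = "/consent/"


def consent_target(url):
    """Return the toURL target if url is a consent-page URL, otherwise None."""
    parts = urlsplit(url)
    if parts.path != CONSENT_PATH:
        return None
    targets = parse_qs(parts.query).get("toURL")
    return targets[0] if targets else None


redirected = (
    "https://www.forbes.com/consent/?toURL=https://www.forbes.com/sites/"
    "shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/"
    "#72c3b4e21086"
)
print(consent_target(redirected))
```

The `#72c3b4e21086` fragment is not part of the query string, so it is dropped from the returned target. This only detects the redirect; it does not get past the consent page.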

I tried to find a cookie that I could set in the headers, but could not find the magic key.

Can anyone help me?

【Question comments】:

Tags: python redirect web-scraping

【Solution 1】:

There is a required cookie, notice_gdpr_prefs, that must be sent in order to view the data:

import requests
from bs4 import BeautifulSoup

src = requests.get(
    "https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/",
    headers={
        "cookie": "notice_gdpr_prefs"
    })

soup = BeautifulSoup(src.content, 'html.parser')
title = soup.find("meta", property="og:title")
print(title["content"])

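【Comments】:

Thanks. What should the value of that cookie be? Isn't the convention usually "cookie": "notice_gdpr_prefs=SOMEVALUE"?
I ran some tests and found that the content looks like "0,1,2:", with a second cookie (same name) containing "2:". If I send these in the header (even with the uuid I copied and pasted from the browser cookie), I still cannot access the site. :(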
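
For reference, the conventional `name=value` form asked about above can be built as in the sketch below. It uses the standard library's `urllib.request` rather than `requests` so that it runs without extra packages, and `build_request` is a hypothetical helper. The value "0,1,2:" is simply the one reported in the comments and should be treated as a placeholder; nothing in this thread confirms that it gets a client past the consent page.

```python
import urllib.request


def build_request(url, cookies):
    """Build a GET request whose Cookie header uses the name=value form."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return urllib.request.Request(url, headers={"Cookie": cookie_header})


article_url = (
    "https://www.forbes.com/sites/shermanlee/2018/07/31/"
    "privacy-revolution-how-blockchain-is-reshaping-our-economy/"
)
# Placeholder value taken from the comments; the correct value is not known here.
request = build_request(article_url, {"notice_gdpr_prefs": "0,1,2:"})
print(request.get_header("Cookie"))

# Sending is left commented out so the sketch runs offline:
# with urllib.request.urlopen(request) as response:
#     html = response.read()
```

Because the helper takes a dict, it cannot express the two same-named cookies mentioned in the comments; that case would need the header string to be assembled by hand.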