Scraping Google News results with Python and Beautiful Soup retrieves only the first page without headlines: answers

【Question title】: Scraping Google News results with Python and Beautiful Soup retrieves only the first page without headlines
【Posted】: 2019-11-17 11:25:27
【Question】:

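I want to scrape the headlines and paragraph text from the Google News search page for a given search term, and I want to do this for the first n pages.

I wrote code that scrapes only the first page, but I don't know how to modify my URL so that I can go to the other pages (2, 3, ...). That is my first problem.

The second problem is that I don't know how to scrape the headlines. I tried several approaches, but I always get an empty list back. (I don't think the page is dynamic.)

On the other hand, scraping the paragraph text below each headline works fine. Can you tell me how to solve these two problems?

Here is my code: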

from bs4 import BeautifulSoup
import requests

term = 'cocacola'

# this is only for page 1, how to go to page 2?
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# I think this does not depend on JavaScript; the page is not dynamic
headline_results = soup.find_all('a', class_="l lLrAF")
#headline_results = soup.find_all('h3', class_="r dO0Ag") # also does not work
print(headline_results) # empty list, I don't know why

paragraph_results = soup.find_all('div', class_='st')
print(paragraph_results) # works

【Comments】:

Is it a good idea to assume that the Google News class names stay the same?
They are always the same.

Tags: python web-scraping beautifulsoup

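【Solution 1】:

Problem 1: moving to the next page.

To move to the next page, include the start parameter in the URL format string. It is the zero-based offset of the first result, so with ten results per page, page 2 starts at 10: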

term = 'cocacola'
page = 2
url = 'https://www.google.com/search?q={}&source=lnms&tbm=nws&start={}'.format(
    term, (page - 1) * 10
)
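
The same URL can also be built with the standard library's urlencode, which escapes search terms that contain spaces or other special characters. This is a minimal sketch, not the answerer's code: the function name build_news_url and the constant RESULTS_PER_PAGE are hypothetical, and the value of ten results per page is taken from the arithmetic in the snippet above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.com/search"
RESULTS_PER_PAGE = 10  # assumption: ten results per page, as in the snippet above


def build_news_url(term, page):
    """Return the news search URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    query = {
        "q": term,
        "source": "lnms",
        "tbm": "nws",
        "start": (page - 1) * RESULTS_PER_PAGE,  # zero-based result offset
    }
    # urlencode escapes the term, e.g. a space becomes "+"
    return BASE_URL + "?" + urlencode(query)


# URLs for the first three pages of a two-word search term
urls = [build_news_url("coca cola", page) for page in range(1, 4)]
for url in urls:
    print(url)
```

Passing the values as positional arguments or as a dictionary, as here, also avoids mixing named placeholders such as {page} with positional format() arguments.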

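Problem 2: scraping the headlines.

Google regenerates the class names, IDs, and other attributes of its DOM elements, so your approach may fail whenever you retrieve new, uncached content.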
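
One way to depend less on generated class names is to select elements by their structure instead. The sketch below assumes, purely for illustration, that each headline is an h3 nested inside a link. The sample markup and the function name extract_headlines are hypothetical, and the real Google HTML may look different and can change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a results page. The class names are
# deliberately meaningless, because the selector below never looks at them.
SAMPLE_HTML = """
<div id="main">
  <div class="x1">
    <a href="https://example.com/a"><h3 class="zz9">First headline</h3></a>
    <div class="st">First snippet</div>
  </div>
  <div class="x2">
    <a href="https://example.com/b"><h3 class="qq7">Second headline</h3></a>
    <div class="st">Second snippet</div>
  </div>
  <a href="https://example.com/nav">Not a headline</a>
</div>
"""


def extract_headlines(html):
    """Return (title, link) pairs for every h3 that sits inside a link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for heading in soup.select("a[href] h3"):
        link = heading.find_parent("a")  # the enclosing anchor carries the URL
        results.append((heading.get_text(strip=True), link["href"]))
    return results


for title, link in extract_headlines(SAMPLE_HTML):
    print(title, link)
```

A structural selector can still break when the page layout changes, so it is worth checking the HTML that requests actually receives before settling on one.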

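【Comments】:

Thanks. Could you tell me how to get at the headlines?
For some reason (I don't know why) I get KeyError: 'page'... this is strange.
Thanks, but I think the class names are always the same. The class name for the paragraphs (the one that works in my code) has been the same for the past year.
OK, so there is no way to get the headlines?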

【Solution 2】:

Just add the parameter "start=10" to the search URL, like this: https://www.google.com/search?q=beatifulsoup&ie=utf-8&oe=utf-8&aq=t&start=10

To loop over the result pages dynamically, use something like the following:

from bs4 import BeautifulSoup
from requests import get

term = "beautifulsoup"
page_max = 5

# loop over pages; start is the zero-based offset of the first result on each page
for page in range(0, page_max):
    url = "https://www.google.com/search?q={}&ie=utf-8&oe=utf-8&aq=t&start={}".format(term, 10*page)

    r = get(url) # you can also add headers here
    html_soup = BeautifulSoup(r.text, 'html.parser')
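
The loop above can be wrapped in a function that takes the search term and the page range as arguments. The following is a sketch under stated assumptions rather than a definitive implementation: the names fetch_pages and fetch_with_requests are hypothetical, the User-Agent value is only a placeholder (Google may return different HTML depending on it), and the fetcher is passed in as a parameter so the loop can be tried offline.

```python
import time
from urllib.parse import quote_plus

URL_TEMPLATE = "https://www.google.com/search?q={}&ie=utf-8&oe=utf-8&aq=t&start={}"


def fetch_with_requests(url):
    """Download one page; requests is imported lazily so the loop runs without it."""
    import requests
    headers = {"User-Agent": "Mozilla/5.0"}  # assumption: placeholder header value
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.text


def fetch_pages(term, pages, fetch=fetch_with_requests, delay=0.0):
    """Return a list of (page, html) pairs for the given 1-based page numbers."""
    collected = []
    for page in pages:
        # page 1 -> start=0, page 2 -> start=10, and so on
        url = URL_TEMPLATE.format(quote_plus(term), 10 * (page - 1))
        collected.append((page, fetch(url)))
        time.sleep(delay)  # optional pause between requests
    return collected


# Offline demonstration: the stand-in fetcher simply returns the URL it was given.
demo = fetch_pages("beautifulsoup", range(1, 5), fetch=lambda url: url)
for page, body in demo:
    print(page, body)
```

For a real run, calling fetch_pages(term, range(1, 5), delay=2.0) would use the requests-based fetcher, and each returned HTML string can then be handed to BeautifulSoup as in the loop above.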

【Comments】:

But how do I make it dynamic, with a variable term and a for loop over pages range(1,5)?

【Solution 3】:

Link to my earlier answer to a partly identical question.

Alternatively, you can use the Google News Result API from SerpApi. It is a paid API with a free trial.

Part of the JSON output:

"news_results": [
  {
    "position": 1,
    "link": "https://www.stltoday.com/lifestyles/food-and-cooking/best-bites-pepperidge-farms-caramel-macchiato-flavored-milano-cookies/article_d43e59a0-b362-5cb0-bdef-6b7563d9fed3.html",
    "title": "Best Bites: Pepperidge Farms Caramel Macchiato flavored Milano cookies",
    "source": "St. Louis Post-Dispatch",
    "date": "1 week ago",
    "snippet": "Coffee-flavored food items are usually very hit or miss. But we have found \nthe cookie that has accomplished the absolute best coffee flavoring I ...",
    "thumbnail": "https://serpapi.com/searches/608ffbbcef7ddabfb2982432/images/45d252f31c08b743573f629544c119f07e8c422143bff0265f31c8c08086393a.jpeg"
  }
]
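
A response shaped like the excerpt above can be reduced to just the fields of interest with plain dictionary access. This is a small sketch: the function name summarize_news is hypothetical, and the sample dictionary is a trimmed copy of the excerpt, not a live API response.

```python
# Trimmed copy of the excerpt above; only a few of its fields are kept.
SAMPLE_RESPONSE = {
    "news_results": [
        {
            "position": 1,
            "title": "Best Bites: Pepperidge Farms Caramel Macchiato flavored Milano cookies",
            "source": "St. Louis Post-Dispatch",
            "date": "1 week ago",
        }
    ]
}


def summarize_news(response):
    """Return (position, title, source) tuples, skipping entries without a title."""
    rows = []
    for item in response.get("news_results", []):
        title = item.get("title")
        if title:
            rows.append((item.get("position"), title, item.get("source")))
    return rows


for position, title, source in summarize_news(SAMPLE_RESPONSE):
    print(position, title, source)
```

Using .get() with defaults keeps the loop from raising KeyError when a response has no news_results key or an entry lacks a field.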

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "best cookies",
  "tbm": "nws",
  "start": "10",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
  print(f"Title: {news_result['title']}\n")

Output:

Title: 10 Of The Absolute Best Cookies In Sydney
    
Title: This Cookie Quiz Will Reveal Your Best And Worst Quality

Title: Family cookies by Taimur Ali Khan is the best thing on internet

Title: Gibson Dunn Ranked Among Top Three Firms for Client ...

Title: Livingston CARES: Saying thank you to one cookie at a time

Title: Google's plan to replace cookies is the web's best hope for a more private internet

Title: The 12 Best Cookies in NYC

Title: 18 Places to Find the Best Cookies in the Champaign-Urbana ...

Title: Best Cookie Delivery Services - Where to Order Cookies Online

Title: How to make the best cookies for the holidays

Disclaimer: I work for SerpApi.

【Comments】: