【Question title】: Scraping Google News results with Python and Beautiful Soup retrieves only the first page without headlines
【Posted】: 2019-11-17 11:25:27
【Question】:

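I want to scrape the headlines and the paragraph text from the Google News search page for a given search term. I want to do this for the first n pages.

I wrote code that scrapes only the first page, but I don't know how to modify my URL so that I can go to the other pages (2, 3, ...). This is my first problem.

The second problem is that I don't know how to scrape the headlines. I tried several solutions, but the result is always an empty list. (I don't think the page is dynamic.)

On the other hand, scraping the paragraph text below each headline works fine. Can you tell me how to fix these two problems?

Here is my code: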

from bs4 import BeautifulSoup
import requests

term = 'cocacola'

# this is only for page 1, how to go to page 2?
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# I think that this is not javascipt sensitive, its not dynamic            
headline_results = soup.find_all('a', class_="l lLrAF")
#headline_results = soup.find_all('h3', class_="r dO0Ag") # also does not work
print(headline_results) #empty list, IDK why?

paragraph_results = soup.find_all('div', class_='st')
print(paragraph_results) # works

【Comments on the question】:

  • Is it a good idea to assume that the Google News class names stay the same?
  • They are always the same.

Tags: python web-scraping beautifulsoup


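【Solution 1】:

Question 1: Pagination.

To move to the next page, include the start parameter in the URL format string. Its value is a result offset, so page 2 corresponds to start=10: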

term = 'cocacola'
page = 2
url = 'https://www.google.com/search?q={}&source=lnms&tbm=nws&start={}'.format(
    term, (page - 1) * 10
)
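As a minimal sketch of the same idea for the first n pages, the helper below builds the URL with `urlencode` from the standard library, so a search term containing spaces or special characters is escaped correctly. The function name `build_news_url` and the constant `RESULTS_PER_PAGE` are hypothetical; the value 10 is taken from the `(page - 1) * 10` formula above, not from any Google documentation.

```python
from urllib.parse import urlencode

# Assumption: 10 results per page, as implied by the (page - 1) * 10 formula above.
RESULTS_PER_PAGE = 10


def build_news_url(term, page):
    """Return the Google News search URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    query = urlencode({
        "q": term,
        "source": "lnms",
        "tbm": "nws",
        "start": (page - 1) * RESULTS_PER_PAGE,
    })
    return "https://www.google.com/search?" + query


# Print the URLs for the first three pages.
for page in range(1, 4):
    print(build_news_url("cocacola", page))
```

Each URL can then be passed to `requests.get` exactly as in the question.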

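Question 2: Scraping the headlines.

Google regenerates the class names, IDs, and other attributes of its DOM elements, so an approach that relies on fixed class names may fail whenever you retrieve new, uncached content.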
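One way to depend less on generated class names is to select elements by document structure instead. The sketch below runs against a made-up HTML snippet, not real Google markup: it assumes each headline is a heading tag nested inside a link, which may or may not match what Google actually serves. The function name `extract_headlines` and all class names in the sample are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the class names are invented
# and the structure is an assumption, not a copy of Google's HTML.
SAMPLE_HTML = """
<div id="main">
  <div class="x1">
    <a href="https://example.com/a"><h3 class="zz9">First headline</h3></a>
    <div class="q7">First snippet</div>
  </div>
  <div class="x1">
    <a href="https://example.com/b"><h3 class="kk2">Second headline</h3></a>
    <div class="q7">Second snippet</div>
  </div>
  <a href="/preferences">Settings</a>
</div>
"""


def extract_headlines(html):
    """Return (title, href) for every link that wraps an <h3> heading."""
    soup = BeautifulSoup(html, "html.parser")
    headlines = []
    for link in soup.find_all("a", href=True):
        heading = link.find("h3")
        if heading is not None:  # skip navigation links without a heading
            headlines.append((heading.get_text(strip=True), link["href"]))
    return headlines


for title, href in extract_headlines(SAMPLE_HTML):
    print(title, href)
```

Before adapting this, it is worth printing `response.text` and checking which tags the headlines really sit in, since the HTML returned to a script can differ from what the browser's inspector shows.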

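【Comments】:

  • Thanks, can you tell me how to access the headlines?
  • For some reason (I don't know why) I get KeyError: 'page'... which is weird.
  • Thanks, but I think the class names are always the same. The name of the paragraph class (the one that works in my code) has been the same for the past year.
  • OK, so there is no way to get the headlines?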
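Regarding the KeyError: 'page' mentioned in the comments: the commenter's exact code is not shown, so this is only a guess at the cause. `str.format` raises that error when the template contains a named placeholder such as `{page}` but the value is passed positionally. The shortened template strings below are illustrative, not the full URL.

```python
term = "cocacola"
page = 2

# A named placeholder with only positional arguments: format() looks for a
# keyword argument called "page", does not find one, and raises KeyError.
error_name = None
try:
    "q={}&start={page}".format(term, (page - 1) * 10)
except KeyError as exc:
    error_name = exc.args[0]
print("KeyError for placeholder:", error_name)

# Fix 1: pass every named placeholder as a keyword argument.
url_a = "q={term}&start={start}".format(term=term, start=(page - 1) * 10)

# Fix 2: use an f-string, which reads the variables directly.
url_b = f"q={term}&start={(page - 1) * 10}"

print(url_a)
print(url_b)
```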
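【Solution 2】:

Just add the parameter "start=10" to the search URL. For example: https://www.google.com/search?q=beatifulsoup&ie=utf-8&oe=utf-8&aq=t&start=10

To make this dynamic and loop over the result pages, use something like the following: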

from bs4 import BeautifulSoup
from requests import get  # the package is named "requests", not "request"

term = "beautifulsoup"
page_max = 5

# loop over pages; start is a result offset (0, 10, 20, ...)
for page in range(0, page_max):
    url = "https://www.google.com/search?q={}&ie=utf-8&oe=utf-8&aq=t&start={}".format(term, 10 * page)

    r = get(url)  # you can also add headers here
    html_soup = BeautifulSoup(r.text, 'html.parser')
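The loop above can also be wrapped in a reusable generator that takes the term and page count as arguments. The sketch below is one possible shape, with several assumptions: it uses `urllib.request` from the standard library in place of `requests` so that it has no third-party dependency, it uses the `tbm=nws` parameter from the question rather than the parameters in this answer's URL, the User-Agent string is an arbitrary placeholder, and the names `fetch_html` and `iter_pages` are hypothetical. The `fetch` argument can be replaced, for example with a stub for testing or with a wrapper around `requests.get`.

```python
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_html(url):
    """Download one page. The User-Agent value is an arbitrary placeholder."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def iter_pages(term, page_max, fetch=fetch_html, delay=1.0):
    """Yield (page_number, html) for pages 1..page_max."""
    for page in range(page_max):
        query = urlencode({"q": term, "tbm": "nws", "start": 10 * page})
        html = fetch("https://www.google.com/search?" + query)
        yield page + 1, html
        if page + 1 < page_max:
            time.sleep(delay)  # pause between requests to avoid hammering the server
```

A caller would iterate with `for number, html in iter_pages("beautifulsoup", 5):` and pass each `html` string to `BeautifulSoup(html, 'html.parser')` as in the loop above.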

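【Comments】:

  • But how do I make it dynamic, with a variable term and a for loop over the pages, range(1,5)?
【Solution 3】:

Link to my earlier answer to a partly similar question.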


Alternatively, you can use the Google News Result API from SerpApi. It is a paid API with a free trial.

Part of the JSON output:

"news_results": [
  {
    "position": 1,
    "link": "https://www.stltoday.com/lifestyles/food-and-cooking/best-bites-pepperidge-farms-caramel-macchiato-flavored-milano-cookies/article_d43e59a0-b362-5cb0-bdef-6b7563d9fed3.html",
    "title": "Best Bites: Pepperidge Farms Caramel Macchiato flavored Milano cookies",
    "source": "St. Louis Post-Dispatch",
    "date": "1 week ago",
    "snippet": "Coffee-flavored food items are usually very hit or miss. But we have found \nthe cookie that has accomplished the absolute best coffee flavoring I ...",
    "thumbnail": "https://serpapi.com/searches/608ffbbcef7ddabfb2982432/images/45d252f31c08b743573f629544c119f07e8c422143bff0265f31c8c08086393a.jpeg"
  }
]

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "best cookies",
  "tbm": "nws",
  "start": "10",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
  print(f"Title: {news_result['title']}\n")

Output:

Title: 10 Of The Absolute Best Cookies In Sydney
    
Title: This Cookie Quiz Will Reveal Your Best And Worst Quality

Title: Family cookies by Taimur Ali Khan is the best thing on internet

Title: Gibson Dunn Ranked Among Top Three Firms for Client ...

Title: Livingston CARES: Saying thank you to one cookie at a time

Title: Google's plan to replace cookies is the web's best hope for a more private internet

Title: The 12 Best Cookies in NYC

Title: 18 Places to Find the Best Cookies in the Champaign-Urbana ...

Title: Best Cookie Delivery Services - Where to Order Cookies Online

Title: How to make the best cookies for the holidays
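The question asked for both the headline and the paragraph text. Assuming a response shaped like the partial JSON output above, both are available as the `title` and `snippet` fields. The sketch below works on a hard-coded sample dictionary (values shortened from that output) rather than a live API call, and the function name `headline_and_snippet` is hypothetical. It uses `.get` with defaults because it is not known from this answer whether every result always carries both keys.

```python
# Sample shaped like the partial JSON output above, with shortened values.
sample_results = {
    "news_results": [
        {
            "position": 1,
            "title": "Best Bites: Pepperidge Farms Caramel Macchiato flavored Milano cookies",
            "source": "St. Louis Post-Dispatch",
            "snippet": "Coffee-flavored food items are usually very hit or miss.",
        }
    ]
}


def headline_and_snippet(results):
    """Return (title, snippet) pairs, tolerating missing keys."""
    pairs = []
    for item in results.get("news_results", []):
        pairs.append((item.get("title", ""), item.get("snippet", "")))
    return pairs


for title, snippet in headline_and_snippet(sample_results):
    print(f"Title: {title}")
    print(f"Snippet: {snippet}")
```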

Disclaimer: I work for SerpApi.

【Comments】:
