Scrape URLs from <cite> tags using BeautifulSoup

【Question title】: Scrape URLs from <cite> tags using BeautifulSoup
【Posted】: 2018-02-02 22:07:37
【Question】:

I am trying to scrape URLs from Google using the Requests and Beautiful Soup libraries.

for URL in soup.find_all('cite'):
    print(URL.text)

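Previously I tried to get the URLs by searching for links and reading each link's href. The problem with that approach seems to be that those URLs are cached by Google, and the links are often broken when I try to visit them.

I noticed that Google uses cite tags to hold the URLs. While this works for the vast majority of URLs, sometimes other bits of text on the page are also inside cite tags.

Most of the tags have class="_Rm" or class="Rm bc". How do I tell Beautiful Soup to search for tags whose class contains the substring "Rm"?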
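
One way to express that substring match is sketched below. It relies on the fact that Beautiful Soup accepts a compiled regular expression for class_, and that CSS attribute selectors support a "contains" test. The sample markup is hypothetical and only imitates the class names mentioned above; the function names are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the class names from the question.
SAMPLE_HTML = """
<cite class="_Rm">https://www.java.com/</cite>
<cite class="Rm bc">https://www.oracle.com/java/</cite>
<cite class="other">unrelated text</cite>
<cite>no class at all</cite>
"""


def cite_texts_regex(html):
    """Text of <cite> tags with at least one class name containing 'Rm'."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex given as class_ is searched against each class name.
    tags = soup.find_all("cite", class_=re.compile("Rm"))
    return [tag.get_text(strip=True) for tag in tags]


def cite_texts_css(html):
    """Same idea with a CSS selector: [class*="Rm"] means 'class contains Rm'."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select('cite[class*="Rm"]')]


print(cite_texts_regex(SAMPLE_HTML))
print(cite_texts_css(SAMPLE_HTML))
```

Both variants skip cite tags that have no class or an unrelated one. Whether "Rm" still appears in Google's markup is a separate matter, since those class names are not stable.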

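I know there may be better ways to do all of this. Does anyone know how I can do this, or another way to return a site's actual URL?

Here is the code I used previously to get the URLs: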

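import re

# Match Google redirect links of the form /url?q=http...
for URL in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    print("\n" + URL.text + "\n")
    # Strip the redirect prefix, then split where a second URL is appended
    print(re.split(":(?=http)", URL["href"].replace("/url?q=", "")))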
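
As an alternative to the string replacement above, a redirect-style href can be unpacked with the standard library. This is a minimal sketch, assuming the href has the /url?q=<target>&... shape that the regex above expects; unwrap_google_href and the example href are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def unwrap_google_href(href):
    """Return the target URL from a '/url?q=...' href, or None if it is not one."""
    parts = urlsplit(href)
    if parts.path != "/url":
        return None
    # parse_qs also percent-decodes the value of q.
    targets = parse_qs(parts.query).get("q")
    return targets[0] if targets else None


# Hypothetical href in the shape the question describes.
example = "/url?q=https://www.java.com/en/&sa=U&ved=abc123"
print(unwrap_google_href(example))
```

Parsing the query string drops the trailing tracking parameters (sa, ved, and so on) without needing a hand-written split.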

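【Comments】:

I guess it is loaded via JS, so BeautifulSoup can't find it.
Use selenium instead of requests.
Ah yes, I think I will have to use selenium to scrape the dynamically generated content. Thanks for the reply.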

Tags: python beautifulsoup python-requests bs4

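【Solution 1】:

You can go to the parent container and read its .text attribute, because in this case there is no unwanted text underneath it. That returns all of the "cite" links.
Or use a third-party API, SerpApi (see below).

Code and full example: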

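from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent; without one Google may return different markup.
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=java',
                    headers=headers).text

# The 'lxml' parser requires the lxml package to be installed.
soup = BeautifulSoup(html, 'lxml')

# 'TbwUpd NJjxre' is the parent container of the <cite> tag in the markup
# this answer was written against; Google's class names are not stable.
for container in soup.find_all('div', class_='TbwUpd NJjxre'):
    link = container.text
    print(link)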

Output:

https://www.java.com
https://www.oracle.com › java › technologies
https://www.oracle.com › java › technologies › javase-d...
https://en.wikipedia.org › wiki › Java_(programming_l...
https://en.wikipedia.org › wiki › Java
https://www.supremecourt.gov › opinions
https://openjdk.java.net

Alternatively, you can use the Google Search Engine Results API from SerpApi. It is a paid API with a free trial.

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "java",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Link: {result['displayed_link']}")

Output:

Link: https://www.java.com
Link: https://www.oracle.com › java › technologies
Link: https://www.oracle.com › java › technologies › javase-d...
Link: https://en.wikipedia.org › wiki › Java_(programming_l...
Link: https://en.wikipedia.org › wiki › Java
Link: https://www.supremecourt.gov › opinions
Link: https://openjdk.java.net
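
Note that the displayed links printed above are breadcrumbs ("host › part › part"), not URLs that can be requested as-is, and some are truncated with "...". The sketch below joins the parts with slashes and gives up on truncated entries. It assumes the breadcrumb mirrors the real path, which is not guaranteed; breadcrumb_to_url is a hypothetical helper, not part of any library.

```python
def breadcrumb_to_url(displayed):
    """Join a 'host › part › part' breadcrumb with slashes.

    Returns None for truncated entries, because the missing part of the
    path cannot be recovered from the displayed text.
    """
    text = displayed.strip()
    if text.endswith(("...", "…")):
        return None
    # Assumption: each breadcrumb segment corresponds to one path segment.
    parts = [part.strip() for part in text.split("›")]
    return "/".join(parts)


for shown in [
    "https://www.java.com",
    "https://www.oracle.com › java › technologies",
    "https://en.wikipedia.org › wiki › Java_(programming_l...",
]:
    print(shown, "->", breadcrumb_to_url(shown))
```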

Disclaimer: I work for SerpApi.

【Comments】: