Exact website links from Google through BeautifulSoup: answers

【Question title】: Exact website links from google through BeautifulSoup
【Posted】: 2017-12-12 19:51:48
【Question】:

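I want to use BeautifulSoup to search Google and open the first link. But when I open the link, it shows an error. I think the reason is that Google does not give the exact link to the website; it appends several parameters to the URL. How can I get the exact URL?

When I tried using the cite tag it worked, but it causes problems for long URLs.

The first link I get using soup.h3.a['href'][7:] is: 'http://www.wikipedia.com/wiki/White_holes&sa=U&ved=0ahUKEwi_oYLLm_rUAhWJNI8KHa5SClsQFggbMAI&usg=AFQjCNGN-vlBvbJ9OPrnq40d0_b8M0KFJQ'

This is my code: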

import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.google.com/search?q=site:wikipedia.com+Black+hole&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw')
soup = BeautifulSoup(r.text, "html.parser")
print(soup.h3.a['href'][7:])

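【Comments】:

There may be a better solution, but if the problem is that the parameters Google appends always start with "&", and the link itself never contains an "&" before them, you could try slicing the string at the first "&".

Tags: python beautifulsoup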

【Solution 1】:

You can split the returned string:

url = soup.h3.a['href'][7:].split('&')
print(url[0])
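
Splitting on '&' works for the link in the question, but it relies on the fixed 7-character prefix and on the target URL containing no '&' of its own. A minimal sketch of an alternative, assuming the raw href has the form /url?q=<target>&sa=... (which is what the [7:] slice in the question suggests): parse the query string with the standard library and read the q parameter. The helper name extract_target is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def extract_target(href):
    """Return the "q" parameter of a Google redirect href.

    Hypothetical helper: assumes href looks like "/url?q=<target>&sa=...".
    Returns href unchanged when it has no "q" parameter.
    """
    query = parse_qs(urlsplit(href).query)
    targets = query.get("q")
    return targets[0] if targets else href


# The href from the question, with the "/url?q=" prefix restored (assumed form).
href = (
    "/url?q=http://www.wikipedia.com/wiki/White_holes"
    "&sa=U&ved=0ahUKEwi_oYLLm_rUAhWJNI8KHa5SClsQFggbMAI"
    "&usg=AFQjCNGN-vlBvbJ9OPrnq40d0_b8M0KFJQ"
)
print(extract_target(href))
# http://www.wikipedia.com/wiki/White_holes
```

Because parse_qs also percent-decodes values, a target whose own '&' characters are encoded as %26 comes back intact, which the plain split cannot guarantee.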

【Discussion】:

【Solution 2】:

Combining the answers given above, your code would hopefully look something like this:

from bs4 import BeautifulSoup
import requests
import csv

url = "https://www.google.co.in/search?q=site:wikipedia.com+Black+hole&dcr=0&gbv=2&sei=Nr3rWfLXMIuGvQT9xZOgCA"
r = requests.get(url)
data = r.text

url1 = "https://www.google.co.in"

soup = BeautifulSoup(data, "html.parser")
# Each organic result is expected inside a <div class="g">
get_details = soup.find_all("div", attrs={"class": "g"})
final_data = []
for details in get_details:
    for heading in details.find_all("h3"):
        for lnk in heading.find_all("a"):
            # Drop the 7-character "/url?q=" prefix and Google's trailing "&..." parameters
            href = (lnk.get("href") or "")[7:].split("&")[0]
            if href:
                # One CSV row per link
                final_data.append([href])

filename = "Google.csv"
# newline="" stops the csv module from emitting blank lines between rows on Windows
with open(filename, "w", newline="") as csvfile:
    writer = csv.writer(csvfile, delimiter=",")
    for row in final_data:
        writer.writerow(row)

【Discussion】:

【Solution 3】:

It is much simpler than that. You are looking for this:

# instead of this:
soup.h3.a['href'][7:].split('&')

# use this:
soup.select_one('.yuRUbf a')['href']

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests
import lxml  # not used directly; the 'lxml' parser passed to BeautifulSoup needs this package installed

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "site:wikipedia.com black hole",     # query
  "gl": "us",                               # country to search from
  "hl": "en"                                # language    
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

first_link = soup.select_one('.yuRUbf a')['href']
print(first_link)

# https://en.wikipedia.com/wiki/Primordial_black_hole
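
One caveat with the selector approach: select_one returns None when nothing matches, so indexing ['href'] on the result raises a TypeError if Google changes the class name or serves a different page. A defensive sketch, using inline sample HTML instead of a live request; the helper name first_result_link and the sample markup are assumptions made for illustration, not Google's actual markup.

```python
from bs4 import BeautifulSoup


def first_result_link(html, selector=".yuRUbf a"):
    """Return the href of the first element matching selector, or None.

    Hypothetical helper: the default selector is the one used in the answer
    above and may stop matching if Google changes its markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.select_one(selector)
    if anchor is None:
        return None
    return anchor.get("href")


# Made-up markup that mimics the structure the selector expects.
sample = (
    '<div class="yuRUbf">'
    '<a href="https://en.wikipedia.org/wiki/Black_hole"><h3>Black hole</h3></a>'
    "</div>"
)
print(first_result_link(sample))
print(first_result_link("<p>no results here</p>"))
```

Returning None lets the caller decide whether to retry, fall back to another selector, or report that no result was found.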

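Alternatively, you can use the Google Organic Results API from SerpApi to achieve the same thing. It is a paid API with a free plan.

The difference in your case is that you only need to extract the data from structured JSON, rather than figuring out why things do not work and then maintaining the scraper over time as selectors change.

Code to integrate: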

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "site:wikipedia.com black hole",
  "hl": "en",
  "gl": "us",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# [0] - first index of search results
first_link = results['organic_results'][0]['link']
print(first_link)

# https://en.wikipedia.com/wiki/Primordial_black_hole
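
The direct lookup results['organic_results'][0]['link'] raises a KeyError or IndexError when the response contains no organic results. A small sketch of a guarded lookup that works on a plain dict shaped like the one above; the helper name first_organic_link and the sample dicts are hypothetical, and only the 'organic_results' and 'link' keys are taken from the code above.

```python
def first_organic_link(results):
    """Return the link of the first organic result, or None if there is none.

    Hypothetical helper: expects a dict with an optional "organic_results"
    list whose items may carry a "link" key.
    """
    organic = results.get("organic_results") or []
    if not organic:
        return None
    return organic[0].get("link")


# Made-up dicts standing in for search.get_dict() output.
with_results = {"organic_results": [{"link": "https://example.com/first"}]}
without_results = {}

print(first_organic_link(with_results))
print(first_organic_link(without_results))
```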

Disclaimer: I work for SerpApi.

【Discussion】: