Unable to scrape the link from DuckDuckGo search results

【Question title】: Unable to scrape the link from DuckDuckGo search result
【Posted】: 2021-06-28 17:38:55
【Question description】:

I want to scrape the first link from a DuckDuckGo search result. I wrote the code below:

import requests
from bs4 import BeautifulSoup
class Bse:
      def currentPrice(self,symbol):
            headers = {
                  "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:84.0) Gecko/20100101 Firefox/84.0"
            }
            duckDuckUrl=f'https://duckduckgo.com/?q=bse+{symbol}+stock+price'
            response=requests.get(duckDuckUrl,headers=headers)
            soup=BeautifulSoup(response.text,"html.parser")
            bseIndiaLink=soup.find_all('a')
            # bseIndiaLink=soup.find_all('a',class_="result__a")  #giving empty list
            print(bseIndiaLink)


bse=Bse()
bse.currentPrice('reliance')

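First I used find_all() in BeautifulSoup without the class_ argument. It returned a list of seemingly random anchor tags that were of no use to me. I also tried find_all() with the class_ argument, but it returned an empty list.

I tried printing the soup object. It prints the page's HTML, but not the divs that contain the results. I don't know why BeautifulSoup is not picking up those result divs. In the screenshot, the highlighted HTML is what I want to scrape.

I found an answer saying that DuckDuckGo uses JavaScript to render its search results and that BeautifulSoup cannot scrape JavaScript-rendered content, but in other Stack Overflow posts I found that people were able to scrape links from its results.
However, if I use Google instead of DuckDuckGo, I can scrape the link I need.

I would like to know why I cannot scrape DuckDuckGo when the same code works for Google. I'm curious.

If anyone knows what I have overlooked or missed, please let me know. It will help my learning journey.

Thanks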

【Question comments】:

Try this URL: html.duckduckgo.com/html/?q=nse%20reliance%20stock%20price
@artanik it shows me this error: requests.exceptions.MissingSchema: Invalid URL 'html.duckduckgo.com/html/?q=nse%20reliance%20stock%20price': No schema supplied. Perhaps you meant http://html.duckduckgo.com/html/?q=nse%20reliance%20stock%20price?
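
The MissingSchema error in the comment above appears because the suggested URL was passed to requests without a leading `https://`. A minimal sketch of guarding against that, using plain string handling only; the helper name `ensure_scheme` is hypothetical and is not part of requests:

```python
def ensure_scheme(url, default="https"):
    """Prepend a scheme when the URL does not start with one (hypothetical helper)."""
    # Assumption: only http and https URLs are of interest here
    if url.startswith(("http://", "https://")):
        return url
    # Also covers protocol-relative URLs such as //duckduckgo.com/...
    return f"{default}://{url.lstrip('/')}"


url = ensure_scheme("html.duckduckgo.com/html/?q=nse%20reliance%20stock%20price")
print(url)
# https://html.duckduckgo.com/html/?q=nse%20reliance%20stock%20price
```

The returned string can then be passed to requests.get() as usual.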

Tags: python html web-scraping beautifulsoup

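【Solution 1】:

"First I used find_all() in BeautifulSoup without the class_ argument. It returned a list of seemingly random anchor tags that were of no use to me."

This is the expected behavior: you asked bs4 for every <a> tag, so it returned every <a> tag it found.

You can change your URL to scrape the non-JavaScript version: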

from this (JS): https://duckduckgo.com/?q=bse+reliance+stock+price&t=hx&va=g&ia=web
to this (non-JS): https://html.duckduckgo.com/html/?q=bse%20reliance%20stock%20price
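
As a rough sketch of how the non-JS URL above could be built for any symbol, the standard library's urllib.parse can handle the percent-encoding. The function name `build_search_url` and the constant `HTML_ENDPOINT` are hypothetical names chosen for illustration:

```python
from urllib.parse import quote, urlencode

# Non-JS endpoint taken from the URL shown above
HTML_ENDPOINT = "https://html.duckduckgo.com/html/"


def build_search_url(symbol):
    """Return the non-JS search URL for a stock symbol (hypothetical helper)."""
    query = f"bse {symbol} stock price"
    # quote_via=quote encodes spaces as %20 instead of the default +
    return HTML_ENDPOINT + "?" + urlencode({"q": query}, quote_via=quote)


print(build_search_url("reliance"))
# https://html.duckduckgo.com/html/?q=bse%20reliance%20stock%20price
```

Passing the query through the params argument of requests.get() achieves the same thing without building the string by hand.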

If you only need to extract the first link each time, you can do it like this:

>>> first_url = soup.select_one('.result__url')['href'].replace('//', '')
"duckduckgo.com/l/?uddg=https%3A%2F%2Fwww.bseindia.com%2Fstock%2Dshare%2Dprice%2Freliance%2Dindustries%2Dltd%2Freliance%2F500325%2F&rut=b13b3c373de61ffd03dee7ad51f9fb9274dac16d098f25920d7946dbd9a73cc7"

Code and full example in the online IDE:

import requests
import lxml  # not used directly; bs4 needs it installed for the 'lxml' parser
from bs4 import BeautifulSoup

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "bse reliance stock price",
  "kl": "us-en" # language
}

html = requests.get('https://html.duckduckgo.com/html', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

first_url = soup.select_one('.result__url')['href'].replace('//', '')
print(first_url)

# duckduckgo.com/l/?uddg=https%3A%2F%2Fwww.bseindia.com%2Fstock%2Dshare%2Dprice%2Freliance%2Dindustries%2Dltd%2Freliance%2F500325%2F&rut=b13b3c373de61ffd03dee7ad51f9fb9274dac16d098f25920d7946dbd9a73cc7
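
The printed value is a DuckDuckGo redirect link whose `uddg` query parameter carries the percent-encoded target. A minimal sketch of recovering the target URL with the standard library, assuming the redirect keeps the `uddg` parameter seen in the output above; `resolve_redirect` is a hypothetical helper name:

```python
from urllib.parse import parse_qs, urlsplit


def resolve_redirect(href):
    """Return the uddg target of a redirect href, or href itself if there is none."""
    query = urlsplit(href).query
    # parse_qs percent-decodes values, so %3A%2F%2F becomes ://
    targets = parse_qs(query).get("uddg")
    return targets[0] if targets else href


redirect = (
    "duckduckgo.com/l/?uddg=https%3A%2F%2Fwww.bseindia.com%2Fstock%2Dshare%2Dprice"
    "%2Freliance%2Dindustries%2Dltd%2Freliance%2F500325%2F"
    "&rut=b13b3c373de61ffd03dee7ad51f9fb9274dac16d098f25920d7946dbd9a73cc7"
)
print(resolve_redirect(redirect))
# https://www.bseindia.com/stock-share-price/reliance-industries-ltd/reliance/500325/
```

A link without a `uddg` parameter is returned unchanged, so the helper can be applied to every href.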

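Alternatively, you can use the DuckDuckGo Organic Results API from SerpApi. It is a paid API with a free plan. Check out the playground.

The difference is that it scrapes the JavaScript version of DuckDuckGo; the only thing left to do is iterate over the returned JSON and extract what you are looking for.

Code to integrate: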

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "duckduckgo",
  "q": "bse reliance stock price",
  "kl": "us-en"
}

search = GoogleSearch(params)
results = search.get_dict()

# [0] - index of the first organic result
first_link = results['organic_results'][0]['link']
print(first_link)

# https://www.bseindia.com/stock-share-price/reliance-industries-ltd/reliance/500325/
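
Indexing `results['organic_results'][0]` raises an error when a search returns no organic results. A small defensive sketch, assuming only the `organic_results` and `link` keys used in the code above; `first_organic_link` is a hypothetical helper and the sample dictionaries are made up for illustration:

```python
def first_organic_link(results):
    """Return the first organic result's link, or None when there is none."""
    # Treat a missing key and an empty list the same way
    organic = results.get("organic_results") or []
    if not organic:
        return None
    return organic[0].get("link")


# Illustrative data only, shaped like the dictionary used above
sample = {
    "organic_results": [
        {"link": "https://www.bseindia.com/stock-share-price/reliance-industries-ltd/reliance/500325/"}
    ]
}
print(first_organic_link(sample))
print(first_organic_link({}))  # None
```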

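Disclaimer: I work for SerpApi.

【Comments】:

【Solution 2】:

This should produce results based on your current search keyword. You need to send a POST HTTP request with the appropriate parameters to access the content. To stay close to your current attempt, I used string formatting in the payload.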

import requests
from bs4 import BeautifulSoup

class Bse:
    def __init__(self):
        self.duckDuckUrl = 'https://html.duckduckgo.com/html/'
        # Query template; left unformatted so currentPrice() can be called repeatedly
        self.query = 'bse {} stock price'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:84.0) Gecko/20100101 Firefox/84.0'}

    def currentPrice(self,symbol):
        # Build a fresh payload per call instead of overwriting the template
        payload = {'q': self.query.format(symbol),'b': ''}
        res = requests.post(self.duckDuckUrl,data=payload,headers=self.headers)
        soup = BeautifulSoup(res.text,'html.parser')
        return soup.find('a',class_='result__a').get("href")

if __name__ == '__main__':
    bse = Bse()
    print(bse.currentPrice('reliance'))

Using a GET request:

import requests
from bs4 import BeautifulSoup

link = "https://html.duckduckgo.com/html/"
query = 'nse {} stock price'

def fetch_first_link(s,symbol):
    # Format the query per call so the template is never overwritten
    res = s.get(link,params={'q': query.format(symbol)})
    soup = BeautifulSoup(res.text,"lxml")
    item = soup.select_one(".result__title > a.result__a").get("href")
    return item

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
        print(fetch_first_link(s,'reliance'))

【Comments】:

It works now. But why did you use a POST request rather than a GET request?