beautifulsoup: how to scrape multiple URLs that end differently (answers)

【Question title】: beautifulsoup: how to scrape multiple urls that end differently
【Posted】: 2022-01-04 11:58:54
【Question description】:

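I want to scrape this dictionary for its different verbs. Each verb appears at "https://www.spanishdict.com/conjugate/" plus the verb. So, for example, for the verb "hacer" we have: https://www.spanishdict.com/conjugate/hacer

I want to scrape all the possible links that contain a verb conjugation and return them as a list of strings. So I did the following: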

import requests
from bs4 import BeautifulSoup
url = 'https://www.spanishdict.com/conjugate/' 

for i in url:
    reqs = requests.get(url + str())
    soup = BeautifulSoup(reqs.text, 'html.parser')

    urls = []
    for link in soup.find_all('a'):
        urls.append(link.get('href'))

    print(urls)

But when I print urls, I only get some empty lists.

Sample of the expected output:

['https://www.spanishdict.com/conjugate/hacer', 'https://www.spanishdict.com/conjugate/tener',...etc]

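【Question discussion】:

So you want to pull the verbs out of the JavaScript variable? Check the edit. -- The question is not very clear and could be improved.

Tags: python web-scraping beautifulsoup request

【Solution 1】:

When you iterate over "url", you are iterating over a string. Look at this code: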

url = 'https://www.spanishdict.com/conjugate/' 

for i in url:
    print(i)

This yields each character of the URL:

h
t
t
p
s
:
/
/
w
w
w
<truncated>

You are also making a mistake here:

reqs = requests.get(url + str())

I am not sure what you were trying to do, but 'url + str()' is just the URL plus an empty string, which is the URL itself.
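To make the two points above concrete, here is a small sketch that runs without any network access. The `verbs` list is a hypothetical stand-in for whatever source of verbs you end up using; it does not come from the question.

```python
base = 'https://www.spanishdict.com/conjugate/'

# str() called with no argument returns the empty string,
# so appending it leaves the URL unchanged.
assert base + str() == base

# Iterating over a string yields single characters, not URLs.
first_chars = [ch for ch in base][:5]
print(first_chars)

# Iterating over a list of verbs (hypothetical sample) builds one URL per verb.
verbs = ['hacer', 'tener']
urls = [base + verb for verb in verbs]
print(urls)
```

The loop only makes sense once there is a collection of verbs to iterate over; the base URL alone does not contain them.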

If you remove the for loop and the unnecessary empty string, you get what I think you were trying to get:

import requests
from bs4 import BeautifulSoup
url = 'https://www.spanishdict.com/conjugate/' 

reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    urls.append(link.get('href'))

print(urls)

This yields:

['/', '/learn', '/translation', '/conjugation', '/vocabulary', '#', '/translation', '/conjugation', '/vocabulary', '/guide', '/pronunciation', '/wordoftheday', '/learn', '/guide/spanish-present-tense-forms', '/guide/spanish-present-progressive-forms', '/guide/spanish-preterite-tense-forms', '/guide/spanish-imperfect-tense-forms', '/guide/simple-future-regular-forms-and-tenses', '/guide/spanish-present-subjunctive', '/guide/commands', '/guide/spanish-imperfect-subjunctive', '/guide', '/drill?drill_start_source=conjugation%20hubpage', 'https://play.google.com/store/apps/details?id=com.spanishdict.spanishdict&referrer=utm_campaign%3Dadhesion', '/wordoftheday', '/translate/patinar', '/', 'https://www.ingles.com/verbos', 'https://www.curiositymedia.com/', 'https://help.spanishdict.com/', '/company/privacy', '/company/tos', '/sitemap', '/', 'https://www.ingles.com/verbos', '/translation', '/conjugation', '/vocabulary', '/learn', '/guide', '/wordoftheday', 'https://www.curiositymedia.com/', '/company/privacy', '/company/tos', '/sitemap', 'https://help.spanishdict.com/', 'https://help.spanishdict.com/contact', 'https://www.facebook.com/pages/SpanishDict/92805940179', 'https://twitter.com/spanishdict', 'https://www.instagram.com/spanishdict/', 'https://itunes.apple.com/us/app/spanishdict/id332510494', 'https://play.google.com/store/apps/details?id=com.spanishdict.spanishdict&referrer=utm_source%3Dsd-footer']

Is this list of links what you are aiming for?
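Many of the hrefs in that list are relative (for example '/translate/patinar'), while the expected output in the question uses full URLs. One possible way to normalize them is `urllib.parse.urljoin` from the standard library. This is only a sketch: the `hrefs` list below is a small hand-picked sample from the output above, with a `None` added to stand for an `<a>` tag that has no href.

```python
from urllib.parse import urljoin

page_url = 'https://www.spanishdict.com/conjugate/'

# A few hrefs copied from the output above, plus None for an <a> without href.
hrefs = ['/', '/translate/patinar', 'https://help.spanishdict.com/', '#', None]

# Skip missing hrefs and pure fragment links, then resolve the rest
# against the URL of the page they were found on.
absolute = [urljoin(page_url, h) for h in hrefs if h and not h.startswith('#')]
print(absolute)
```

Absolute hrefs pass through `urljoin` unchanged, so the same call handles both kinds of link.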

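【Discussion】:

Thank you very much for the explanation, but as for the output, I was expecting something like: ['spanishdict.com/conjugate/hacer', 'spanishdict.com/conjugate/tener', ...]
These pages are not linked in the source of spanishdict.com/conjugate, so you will not get them with requests + BeautifulSoup. They appear via JavaScript when you click the search box, so you would have to use a library like Selenium. Alternatively, you could start from a page like "spanishdict.com/conjugate/hacer" and follow the links you find that start with /conjugate, e.g. urls = [] for link in soup.find_all('a'): if link.get('href').startswith("/translate"): urls.append(link.get('href')) but you might miss pages.
Thanks for the tip, I'll try selenium then.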
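The link-filtering idea from the comment above can be sketched as follows. The HTML snippet is a hypothetical stand-in for a downloaded page, and the `prefix` value is an assumption; which prefix is right depends on how the real pages link to each other. Passing `href=True` to `find_all` skips `<a>` tags without an href, which would otherwise make `.startswith` fail on `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = '''
<a href="/conjugate/tener">tener</a>
<a href="/translate/patinar">patinar</a>
<a>no href here</a>
<a href="/conjugate/ser">ser</a>
'''

soup = BeautifulSoup(html, 'html.parser')
prefix = '/conjugate/'  # assumed prefix; adjust to the links you actually see

urls = []
for link in soup.find_all('a', href=True):  # href=True skips <a> tags without href
    href = link['href']
    if href.startswith(prefix):
        urls.append('https://www.spanishdict.com' + href)

print(urls)
```

As the comment warns, this only finds verbs that happen to be linked from the pages you visit, so it may miss some.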

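【Solution 2】:

EDIT

I hope I understand your point now; if so, the question should be improved. To get the information out of the JavaScript, you can parse the response with a regular expression: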

import requests
import json
import re

r = requests.get('https://www.spanishdict.com/conjugation')
m = re.search(r'window.SD_COMPONENT_DATA = ({.*})', r.text)
['https://www.spanishdict.com/conjugate/'+w for x in json.loads(m.group(1))['searchQuickLinkSections'] for w in x['words']]

Output

['https://www.spanishdict.com/conjugate/tener',
 'https://www.spanishdict.com/conjugate/hacer',
 'https://www.spanishdict.com/conjugate/ser',
 'https://www.spanishdict.com/conjugate/estar',
 'https://www.spanishdict.com/conjugate/haber',
 'https://www.spanishdict.com/conjugate/ir',
 'https://www.spanishdict.com/conjugate/poder',
 'https://www.spanishdict.com/conjugate/decir',
 'https://www.spanishdict.com/conjugate/cerrar',
 'https://www.spanishdict.com/conjugate/mentir',
 'https://www.spanishdict.com/conjugate/dormir',
 'https://www.spanishdict.com/conjugate/recordar',
 'https://www.spanishdict.com/conjugate/seguir',
 'https://www.spanishdict.com/conjugate/medir',
 'https://www.spanishdict.com/conjugate/adquirir',
 'https://www.spanishdict.com/conjugate/jugar',
 'https://www.spanishdict.com/conjugate/vestirse',
 'https://www.spanishdict.com/conjugate/divertirse',
 'https://www.spanishdict.com/conjugate/acostarse',
 'https://www.spanishdict.com/conjugate/ponerse',
 'https://www.spanishdict.com/conjugate/despertarse',
 'https://www.spanishdict.com/conjugate/sentirse',
 'https://www.spanishdict.com/conjugate/levantarse',
 'https://www.spanishdict.com/conjugate/sentarse',
 'https://www.spanishdict.com/conjugate/gustar',
 'https://www.spanishdict.com/conjugate/alegrar',
 'https://www.spanishdict.com/conjugate/quedar',
 'https://www.spanishdict.com/conjugate/encantar',
 'https://www.spanishdict.com/conjugate/parecer',
 'https://www.spanishdict.com/conjugate/faltar',
 'https://www.spanishdict.com/conjugate/doler',
 'https://www.spanishdict.com/conjugate/interesar']
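A slightly more defensive variant of the same idea, shown on an inline sample so it runs without a network request. It is a sketch, not the code used above: the regex only locates the assignment, and `json.JSONDecoder().raw_decode` (swapped in here for the greedy `{.*}` capture) parses exactly one JSON value from that position. The `sample` string and the helper name `extract_verb_urls` are hypothetical; the real page's payload is much larger and may change shape.

```python
import json
import re

BASE = 'https://www.spanishdict.com/conjugate/'


def extract_verb_urls(html):
    """Return conjugation URLs from the embedded SD_COMPONENT_DATA JSON, or []."""
    m = re.search(r'window\.SD_COMPONENT_DATA\s*=\s*', html)
    if m is None:
        # The variable is not on this page; avoid calling .group() on None.
        return []
    # raw_decode parses one JSON value starting at the given index
    # and ignores whatever text follows it.
    data, _ = json.JSONDecoder().raw_decode(html, m.end())
    return [
        BASE + word
        for section in data.get('searchQuickLinkSections', [])
        for word in section.get('words', [])
    ]


# Hypothetical, heavily shortened stand-in for the real page source.
sample = ('<script>window.SD_COMPONENT_DATA = {"searchQuickLinkSections": '
          '[{"words": ["tener", "hacer"]}, {"words": ["ser"]}]};</script>')

urls = extract_verb_urls(sample)
print(urls)
```

With a real response you would pass `r.text` instead of `sample`. Returning an empty list when the variable is missing makes a layout change on the site show up as "no results" rather than as an exception.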

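To get your expected output you need a list of verbs. Your question does not name a source, so as a starting point for generating this information I used the verbs-top-500 list and a list comprehension.

For every <a> whose href contains translate, the following concatenates your URL with the verb text from its direct child <div>: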

['https://www.spanishdict.com/conjugate/'+a.div.text for a in soup.select('a[href*="translate"]')]

Example

import requests
from bs4 import BeautifulSoup

url = 'https://www.spanishdict.com/lists/1690101/verbs-top-500'
# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# Each matching <a> wraps a <div> whose text is the verb
urls = ['https://www.spanishdict.com/conjugate/' + a.div.text for a in soup.select('a[href*="translate/"]')]

Output

['https://www.spanishdict.com/conjugate/procurar', 'https://www.spanishdict.com/conjugate/podar', 'https://www.spanishdict.com/conjugate/pillar', 'https://www.spanishdict.com/conjugate/perrear', 'https://www.spanishdict.com/conjugate/perfeccionar', 'https://www.spanishdict.com/conjugate/perdonar', 'https://www.spanishdict.com/conjugate/pegar', 'https://www.spanishdict.com/conjugate/pasear', 'https://www.spanishdict.com/conjugate/ordenar', 'https://www.spanishdict.com/conjugate/ondear', 'https://www.spanishdict.com/conjugate/ojalar', 'https://www.spanishdict.com/conjugate/ocultar', 'https://www.spanishdict.com/conjugate/nombrar',...]
