【Question title】: beautifulsoup: how to scrape multiple URLs that end differently
【Posted】: 2022-01-04 11:58:54
【Question】:

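I want to scrape this dictionary for its different verbs. Each verb's page is at "https://www.spanishdict.com/conjugate/" plus the verb. For example, for the verb "hacer" we have: https://www.spanishdict.com/conjugate/hacer

I want to scrape all possible links that contain each verb's conjugation and return them as a list of strings. So I did the following: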

import requests
from bs4 import BeautifulSoup
url = 'https://www.spanishdict.com/conjugate/' 

for i in url:
    reqs = requests.get(url + str())
    soup = BeautifulSoup(reqs.text, 'html.parser')

    urls = []
    for link in soup.find_all('a'):
        urls.append(link.get('href'))

    print(urls)

But when I print urls, I only get some empty lists.

Sample of the expected output:

['https://www.spanishdict.com/conjugate/hacer', 'https://www.spanishdict.com/conjugate/tener',...etc]

【Comments】:

  • So you want to pull the verbs out of the JavaScript variable? Check the edit -- the question is not very clear and could be improved.

Tags: python web-scraping beautifulsoup request


【Solution 1】:

When you iterate over 'url', you are iterating over a string. Look at this code:

url = 'https://www.spanishdict.com/conjugate/' 

for i in url:
    print(i)

This yields the URL one character at a time:

h
t
t
p
s
:
/
/
w
w
w
<truncated>

You are also making a mistake here:

reqs = requests.get(url + str())

I'm not sure what you were trying to do, but 'url + str()' is just the URL plus an empty string, i.e. the URL itself.

If you remove the for loop and the unnecessary empty string, you get what I think you were trying to get:

import requests
from bs4 import BeautifulSoup
url = 'https://www.spanishdict.com/conjugate/' 

reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    urls.append(link.get('href'))

print(urls)

This produces:

['/', '/learn', '/translation', '/conjugation', '/vocabulary', '#', '/translation', '/conjugation', '/vocabulary', '/guide', '/pronunciation', '/wordoftheday', '/learn', '/guide/spanish-present-tense-forms', '/guide/spanish-present-progressive-forms', '/guide/spanish-preterite-tense-forms', '/guide/spanish-imperfect-tense-forms', '/guide/simple-future-regular-forms-and-tenses', '/guide/spanish-present-subjunctive', '/guide/commands', '/guide/spanish-imperfect-subjunctive', '/guide', '/drill?drill_start_source=conjugation%20hubpage', 'https://play.google.com/store/apps/details?id=com.spanishdict.spanishdict&referrer=utm_campaign%3Dadhesion', '/wordoftheday', '/translate/patinar', '/', 'https://www.ingles.com/verbos', 'https://www.curiositymedia.com/', 'https://help.spanishdict.com/', '/company/privacy', '/company/tos', '/sitemap', '/', 'https://www.ingles.com/verbos', '/translation', '/conjugation', '/vocabulary', '/learn', '/guide', '/wordoftheday', 'https://www.curiositymedia.com/', '/company/privacy', '/company/tos', '/sitemap', 'https://help.spanishdict.com/', 'https://help.spanishdict.com/contact', 'https://www.facebook.com/pages/SpanishDict/92805940179', 'https://twitter.com/spanishdict', 'https://www.instagram.com/spanishdict/', 'https://itunes.apple.com/us/app/spanishdict/id332510494', 'https://play.google.com/store/apps/details?id=com.spanishdict.spanishdict&referrer=utm_source%3Dsd-footer']

Is this list of links what you are after?
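
If the goal was instead one URL per verb, the loop would have to iterate over a list of verbs rather than over the characters of the base URL. A minimal sketch follows; the helper name build_conjugation_urls and the two-verb list are assumptions for illustration only, and the real verb list still has to come from somewhere.

```python
BASE_URL = 'https://www.spanishdict.com/conjugate/'

def build_conjugation_urls(verbs):
    """Return one conjugation URL per verb, in the order given."""
    return [BASE_URL + verb for verb in verbs]

# Hypothetical verb list, taken from the examples in the question.
verbs = ['hacer', 'tener']
print(build_conjugation_urls(verbs))
```

This only builds the strings; it does not check that each page actually exists.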

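【Comments】:

  • Thank you very much for the explanation, but as for the output, I was expecting something like: ['spanishdict.com/conjugate/hacer', 'spanishdict.com/conjugate/tener', ...]
  • Those pages are not linked in the source of spanishdict.com/conjugate, so you will not get them with requests + BeautifulSoup. They appear via JavaScript when you click the search box, so you would have to use a library like Selenium. Alternatively, you could start from a page like "spanishdict.com/conjugate/hacer" and follow the links you find that start with /conjugate, e.g. urls = [] for link in soup.find_all('a'): if link.get('href').startswith("/translate"): urls.append(link.get('href')) but you may miss pages.
  • Thanks for the tip, I'll try Selenium then.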
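
The filtering idea from the comment above can be sketched on a plain list of href values, such as the one printed in Solution 1. This is only a sketch under assumptions: the function name conjugation_urls_from_hrefs is made up, and it assumes that every /translate/<word> link points at a verb that also has a /conjugate/<word> page. Note that link.get('href') returns None for anchors without an href, so the values are checked before calling startswith.

```python
from urllib.parse import urljoin

SITE = 'https://www.spanishdict.com'

def conjugation_urls_from_hrefs(hrefs, prefix='/translate/'):
    """Map hrefs such as '/translate/patinar' to absolute conjugation URLs."""
    urls = []
    for href in hrefs:
        # Skip missing hrefs (None) and links that do not match the prefix.
        if href and href.startswith(prefix):
            verb = href[len(prefix):]
            urls.append(urljoin(SITE, '/conjugate/' + verb))
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(urls))

# Sample hrefs in the style of the Solution 1 output, plus a None entry.
sample = ['/', None, '/learn', '/translate/patinar', '/translate/patinar', '#']
print(conjugation_urls_from_hrefs(sample))
```

As the comment warns, crawling this way only finds verbs that happen to be linked from the pages you visit.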
【Solution 2】:

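EDIT

I hope I understood what you mean; if so, the question should be improved. To get the information out of the JavaScript, you can parse the response with a regular expression: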

import requests
import json
import re

r = requests.get('https://www.spanishdict.com/conjugation')
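# The page assigns its data to window.SD_COMPONENT_DATA as a JSON object
m = re.search(r'window\.SD_COMPONENT_DATA = ({.*})', r.text)
data = json.loads(m.group(1))
urls = ['https://www.spanishdict.com/conjugate/' + w for x in data['searchQuickLinkSections'] for w in x['words']]
print(urls)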

Output

['https://www.spanishdict.com/conjugate/tener',
 'https://www.spanishdict.com/conjugate/hacer',
 'https://www.spanishdict.com/conjugate/ser',
 'https://www.spanishdict.com/conjugate/estar',
 'https://www.spanishdict.com/conjugate/haber',
 'https://www.spanishdict.com/conjugate/ir',
 'https://www.spanishdict.com/conjugate/poder',
 'https://www.spanishdict.com/conjugate/decir',
 'https://www.spanishdict.com/conjugate/cerrar',
 'https://www.spanishdict.com/conjugate/mentir',
 'https://www.spanishdict.com/conjugate/dormir',
 'https://www.spanishdict.com/conjugate/recordar',
 'https://www.spanishdict.com/conjugate/seguir',
 'https://www.spanishdict.com/conjugate/medir',
 'https://www.spanishdict.com/conjugate/adquirir',
 'https://www.spanishdict.com/conjugate/jugar',
 'https://www.spanishdict.com/conjugate/vestirse',
 'https://www.spanishdict.com/conjugate/divertirse',
 'https://www.spanishdict.com/conjugate/acostarse',
 'https://www.spanishdict.com/conjugate/ponerse',
 'https://www.spanishdict.com/conjugate/despertarse',
 'https://www.spanishdict.com/conjugate/sentirse',
 'https://www.spanishdict.com/conjugate/levantarse',
 'https://www.spanishdict.com/conjugate/sentarse',
 'https://www.spanishdict.com/conjugate/gustar',
 'https://www.spanishdict.com/conjugate/alegrar',
 'https://www.spanishdict.com/conjugate/quedar',
 'https://www.spanishdict.com/conjugate/encantar',
 'https://www.spanishdict.com/conjugate/parecer',
 'https://www.spanishdict.com/conjugate/faltar',
 'https://www.spanishdict.com/conjugate/doler',
 'https://www.spanishdict.com/conjugate/interesar']
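
The greedy pattern above relies on the object being the last thing ending in a brace on its line. A slightly more defensive variant, sketched here on a made-up HTML string rather than the real page, locates the assignment with a regex and lets json.JSONDecoder.raw_decode parse exactly one JSON value from that position. The helper name extract_js_object and the sample markup are assumptions for illustration; only the variable name and the keys come from the answer.

```python
import json
import re

def extract_js_object(html, var_name):
    """Return the JSON value assigned to var_name in html, or None if absent."""
    m = re.search(re.escape(var_name) + r'\s*=\s*', html)
    if m is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (for example a trailing semicolon).
    obj, _end = json.JSONDecoder().raw_decode(html, m.end())
    return obj

# Made-up sample standing in for the real page source.
sample_html = (
    '<script>window.SD_COMPONENT_DATA = '
    '{"searchQuickLinkSections": [{"words": ["tener", "hacer"]}, {"words": ["ser"]}]};'
    ' window.OTHER = {"a": 1};</script>'
)

data = extract_js_object(sample_html, 'window.SD_COMPONENT_DATA')
urls = ['https://www.spanishdict.com/conjugate/' + w
        for section in data['searchQuickLinkSections']
        for w in section['words']]
print(urls)
```

If the page stops embedding its data this way, the helper returns None (or raw_decode raises json.JSONDecodeError), so real code should handle both cases.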

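To get your expected output, you need a list of verbs. Your question does not provide a source for one, although that would be a good starting point for generating this kind of information, so I used the verbs-top-500 list and a list comprehension.

For every <a> whose href contains translate, it concatenates your URL with the verb text from the <div> that is a direct child of that <a>:

['https://www.spanishdict.com/conjugate/'+a.div.text for a in soup.select('a[href*="translate"]')]

Example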

import requests
from bs4 import BeautifulSoup

url = 'https://www.spanishdict.com/lists/1690101/verbs-top-500'
# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# Each matching <a> wraps a <div> whose text is the verb
urls = ['https://www.spanishdict.com/conjugate/' + a.div.text for a in soup.select('a[href*="translate/"]')]

Output

['https://www.spanishdict.com/conjugate/procurar', 'https://www.spanishdict.com/conjugate/podar', 'https://www.spanishdict.com/conjugate/pillar', 'https://www.spanishdict.com/conjugate/perrear', 'https://www.spanishdict.com/conjugate/perfeccionar', 'https://www.spanishdict.com/conjugate/perdonar', 'https://www.spanishdict.com/conjugate/pegar', 'https://www.spanishdict.com/conjugate/pasear', 'https://www.spanishdict.com/conjugate/ordenar', 'https://www.spanishdict.com/conjugate/ondear', 'https://www.spanishdict.com/conjugate/ojalar', 'https://www.spanishdict.com/conjugate/ocultar', 'https://www.spanishdict.com/conjugate/nombrar',...]
