How to get the full link using BeautifulSoup

【Question title】: How to get the full link using BeautifulSoup
【Posted】: 2019-06-02 03:56:45
【Question】:

The function get("href") does not return the full link. The HTML file contains the link:

However, link.get("href") returns:

"navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO"

import urllib.request
from bs4 import BeautifulSoup

sub_site = "https://www.fotoregistro.com.br/navhome.php?vitrine-produto-slim"
response = urllib.request.urlopen(sub_site)
data = response.read()

# Print the href of every <a> tag on the page
soup = BeautifulSoup(data, 'lxml')
for link in soup.find_all('a'):
    url = link.get("href")
    print(url)

【Comments】:

I don't see any link like that on the page you are trying to scrape.

Tags: python html python-3.x web-scraping beautifulsoup

【Solution 1】:

Using select, the links seem to print fine:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.fotoregistro.com.br/fotolivros/180-slim?cpmdsc=MOZAO')
soup = bs(r.content, 'lxml')
print([item['href'] for item in soup.select('.warp_lightbox')])

To get all links, use:

print([item['href'] for item in soup.select('[href]')])

【Comments】:

I'm looking for a generic solution that lets me collect all available links without knowing the class.
Then use print([item['href'] for item in soup.select('a')])
With that I get: return self.attrs[key] KeyError: 'href'
Now see the edit above: print([item['href'] for item in soup.select('[href]')])
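
The KeyError in this exchange arises because item['href'] indexes the tag's attributes directly, and not every <a> tag carries an href (a named anchor, for example). A minimal sketch of the usual ways around it, using a made-up HTML snippet rather than the real page and the built-in html.parser instead of lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the second <a> has no href.
html = """
<a class="warp_lightbox" href="//example.com/a">first</a>
<a name="top">anchor without href</a>
<link rel="stylesheet" href="/style.css">
"""

soup = BeautifulSoup(html, "html.parser")

# item['href'] raises KeyError for the tag that lacks the attribute.
try:
    [item["href"] for item in soup.select("a")]
    raised = False
except KeyError:
    raised = True

# '[href]' matches any tag carrying an href, including <link>.
any_href = [item["href"] for item in soup.select("[href]")]

# 'a[href]' restricts the match to anchors only.
anchor_href = [item["href"] for item in soup.select("a[href]")]

# get() never raises; it returns None when the attribute is missing.
via_get = [a.get("href") for a in soup.find_all("a") if a.get("href")]

print(raised, any_href, anchor_href, via_get)
```

Which variant fits depends on whether non-anchor tags such as <link> should be included in the result.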

【Solution 2】:

Let me focus on the specific part of the HTML that your question is about:

<a class='warp_lightbox' title='Comprar' href='//www.fotoregistro.com.br/
navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'><img src='
//sh.digipix.com.br/subhomes/_lojas_consumer/paginas/fotolivro/img/180slim/vitrine/classic_01_tb.jpg' alt='slim' />
                              </a>

You can do:

for link in soup.find_all('a', {'class':'warp_lightbox'}):
    url = link.get("href")
    break

You will find that url is:

'//www.fotoregistro.com.br/\rnavhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'

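You can see two important patterns at the start of the string:

// is a way to keep the current protocol (a protocol-relative URL);
\r is the ASCII carriage return (CR).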

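When you print it, the terminal treats \r as "return to the start of the line", so the text after it overwrites this part and it seems to be lost: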

//www.fotoregistro.com.br/\r

If you need the raw string, you can use repr inside the for loop:

print(repr(url))

You will get:

'//www.fotoregistro.com.br/\rnavhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'
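
The difference between print and repr is easy to reproduce without the website. A small sketch with a shortened, made-up string: print() sends the carriage return to the terminal as a control character, while repr() shows it as the two-character escape \r.

```python
# Shortened stand-in for the href; '\r' is a single carriage-return character.
url = "//www.fotoregistro.com.br/\rnavhome.php?lightbox"

# On most terminals the cursor jumps back to column 0 at '\r', so the
# text after it overwrites the host part and the link looks truncated.
print(url)

# repr() escapes control characters, so the whole value stays visible.
escaped = repr(url)
print(escaped)

# The data itself is intact: the CR is still part of the string.
print("\r" in url)
```

How the first print appears depends on the terminal; writing the output to a file or piping it elsewhere may show the full text.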

If you need the path, you can replace the leading part:

base = 'www.fotoregistro.com.br/'

for link in soup.find_all('a', {'class':'warp_lightbox'}):
    url = link.get("href").replace('//www.fotoregistro.com.br/\r',base)
    print(url)

You will get:

www.fotoregistro.com.br/navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
www.fotoregistro.com.br/navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/preview=true/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
.
.
.
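
Hard-coding the host in replace() works for this one site. A more general option, sketched here on the assumption that the page URL from the question is the base, is urllib.parse.urljoin from the standard library, which resolves a protocol-relative (//...) or relative href into an absolute URL. The href below is a shortened stand-in, and the carriage return is stripped explicitly because URL parsers differ across Python versions in how they treat it.

```python
from urllib.parse import urljoin

# Page the links were scraped from (taken from the question).
page_url = "https://www.fotoregistro.com.br/navhome.php?vitrine-produto-slim"

# Shortened stand-in for the href in the answer, including the stray CR.
raw_href = "//www.fotoregistro.com.br/\rnavhome.php?lightbox&cpmdsc=MOZAO"

# Drop CR/LF characters that leaked in from the HTML source.
clean_href = raw_href.replace("\r", "").replace("\n", "")

# For a '//' href, urljoin fills in the scheme from page_url.
full_url = urljoin(page_url, clean_href)
print(full_url)

# A plain relative href is resolved against the page's directory instead.
relative_url = urljoin(page_url, "navhome.php?lightbox")
print(relative_url)
```

Unlike the replace() version, the result includes the https scheme, so it can be passed straight to a downloader.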

Without specifying the class:

for link in soup.find_all('a'):
    url = link.get("href")
    print(repr(url))
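
For a class-independent version that also yields complete links, the loop above can be wrapped in a small helper. This is a sketch, not a tested scraper: collect_links is a hypothetical name, the HTML is a made-up fragment standing in for the downloaded page, and html.parser is used so that lxml is not required.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_links(html, page_url):
    """Return absolute URLs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):  # skip anchors without href
        # Remove CR/LF that leaked into the attribute value.
        href = a["href"].replace("\r", "").replace("\n", "").strip()
        if href:
            links.append(urljoin(page_url, href))
    return links


# Made-up fragment: protocol-relative, relative, and href-less anchors.
sample = (
    "<a class='warp_lightbox' href='//www.fotoregistro.com.br/\r"
    "navhome.php?lightbox'>buy</a>"
    "<a href='fotolivros/180-slim'>slim</a>"
    "<a name='top'>no href</a>"
)
page = "https://www.fotoregistro.com.br/navhome.php?vitrine-produto-slim"
links = collect_links(sample, page)
for link in links:
    print(link)
```

Passing href=True to find_all is what avoids both the None values and the KeyError, since only tags that actually have the attribute are returned.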

【Comments】:

I'm looking for a generic solution that lets me collect all available links without knowing the class.