How to extract links from a page using Beautiful Soup

【Question title】: How to extract links from a page using Beautiful soup
【Posted】: 2019-06-03 04:26:56
【Question description】:

I have an HTML page that contains multiple divs, for example:

<div class="post-info-wrap">
  <h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" title="Example of 1st post &#8211; Example 1 Post" rel="bookmark">sample post &#8211; example 1 post</a></h2>
  <div class="post-meta clearfix">

    <div class="post-info-wrap">
      <h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" title="Example of 2nd post &#8211; Example 2 Post" rel="bookmark">sample post &#8211; example 2 post</a></h2>
      <div class="post-meta clearfix">

I need to get the values from all divs with the class post-info-wrap. I am new to BeautifulSoup.

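So I need these URLs:

https://www.example.com/blog/111/this-is-1st-post/
https://www.example.com/blog/111/this-is-2nd-post/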

I tried:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.example.com/blog/author/abc")
data = r.content  # Content of response

soup = BeautifulSoup(data, "html.parser")
for link in soup.select('.post-info-wrap'):
    print(link.find('a').attrs['href'])

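This code does not seem to work. I am not familiar with Beautiful Soup. How can I extract the links?

【Question comments】:

Tags: python beautifulsoup

【Solution 1】:

You can use soup.find_all: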

from bs4 import BeautifulSoup as soup

# html is the page markup as a string, e.g. the snippet from the question
r = [i.a['href'] for i in soup(html, 'html.parser').find_all('div', {'class': 'post-info-wrap'})]

Output:

['https://www.example.com/blog/111/this-is-1st-post/', 'https://www.example.com/blog/111/this-is-2nd-post/']
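
For reference, here is a self-contained sketch of the same idea. The markup is a tidied copy of the question's snippet (closing tags added, title attributes dropped), and a CSS selector is swapped in for find_all as an alternative; it is not the answer's own code.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question (tidied so that it is well-formed).
html = """
<div class="post-info-wrap">
  <h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/">post 1</a></h2>
</div>
<div class="post-info-wrap">
  <h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/">post 2</a></h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select every <a> that has an href and sits inside an element
# with the class post-info-wrap.
links = [a["href"] for a in soup.select(".post-info-wrap a[href]")]
print(links)
```

Because the selector only matches anchors that actually carry an href, a div without a link is simply skipped instead of raising an error.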

【Comments】:

The code does not work. I guess this is because you forgot that there is a tag before the href.

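【Solution 2】:

link = i.find('a', href=True) does not always return an anchor tag (a); it may return None. So check whether link is None: if it is, continue with the next iteration of the for loop, otherwise read the link's href value.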
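
A minimal sketch of the None case, using hypothetical markup in which the second div has no anchor at all:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second div contains no <a> tag.
html = """
<div class="post-info-wrap"><h2><a href="https://www.example.com/blog/111/this-is-1st-post/">post 1</a></h2></div>
<div class="post-info-wrap"><h2>no link here</h2></div>
"""

soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", {"class": "post-info-wrap"})

first = divs[0].find("a", href=True)   # a Tag object
second = divs[1].find("a", href=True)  # None, because nothing matched

print(first["href"])
print(second)

# Indexing the missing result, second["href"], would raise
# TypeError: 'NoneType' object is not subscriptable.
```

This is why the loops in this answer test for None before reading href.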

Scraping the links from a URL:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.example.com/blog/author/abc")
data = r.content  # Content of response
soup = BeautifulSoup(data, "html.parser")

for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue  # this div has no anchor with an href
    print(link['href'])

Scraping the links from an HTML string:

from bs4 import BeautifulSoup
html = '''<div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" title="Example of 1st post &#8211; Example 1 Post" rel="bookmark">sample post &#8211; example 1 post</a></h2><div class="post-meta clearfix">
<div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" title="Example of 2nd post &#8211; Example 2 Post" rel="bookmark">sample post &#8211; example 2 post</a></h2><div class="post-meta clearfix">'''

soup = BeautifulSoup(html, "html.parser")

for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])

Update: if the page is rendered dynamically with JavaScript, load it with Selenium first:

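from bs4 import BeautifulSoup
from selenium import webdriver

# Path to the chromedriver binary. Passing it positionally is the older
# Selenium API; Selenium 4 takes the path through
# selenium.webdriver.chrome.service.Service instead.
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get("https://www.example.com/blog/author/abc")

# page_source is the HTML as rendered by the browser
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()  # close the browser once the HTML has been captured

for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])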

Output:

https://www.example.com/blog/911/article-1/
https://www.example.com/blog/911/article-2/
https://www.example.com/blog/911/article-3/
https://www.example.com/blog/911/article-4/
https://www.example.com/blog/random-blog/article-5/

ChromeDriver downloads for the Chrome browser:

http://chromedriver.chromium.org/downloads

Installing the web driver for the Chrome browser:

https://christopher.su/2015/selenium-chromedriver-ubuntu/

Selenium tutorial:

https://selenium-python.readthedocs.io/

Here '/usr/bin/chromedriver' is the path to the Chrome webdriver.

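【Comments】:

@seeker2345 The page https://www.example.com/blog/author/abc has no div with the .post-info-wrap class.
@seeker2345 You should use a valid website URL, i.e. one whose divs contain the .post-info-wrap class.
Look at the output for the page in the example. I guess the code does not work because you forgot that there is a tag before the href.
@seeker2345 You should try the second solution; it will give output like this: example.com/blog/111/this-is-1st-post example.com/blog/111/this-is-2nd-post
@seeker2345 Your "getastra.com/blog/author/vikas" URL is rendered dynamically (JS or AJAX requests), and you are using the plain requests module to download the page. The requests module only downloads the static page data. You should use the selenium automation library to scrape dynamic data.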
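
One way to check which case applies is to count selector matches in the raw HTML before reaching for Selenium. The sketch below is only illustrative: count_matches is a hypothetical helper, and both HTML strings are made-up stand-ins for a static response and a JavaScript-rendered shell.

```python
from bs4 import BeautifulSoup


def count_matches(html, selector=".post-info-wrap"):
    """Return how many elements in the given HTML match the CSS selector."""
    return len(BeautifulSoup(html, "html.parser").select(selector))


# Made-up responses: one with the posts already in the HTML,
# one that is an empty shell filled in later by JavaScript.
static_html = '<div class="post-info-wrap"><a href="/blog/1/">post</a></div>'
dynamic_shell = '<div id="app"></div><script src="/bundle.js"></script>'

print(count_matches(static_html))    # the class is present in the raw HTML
print(count_matches(dynamic_shell))  # no match: content is added client-side
```

With a real page, the body of the requests response can be passed as html. A count of zero suggests that the class is either absent from that URL or only added after the page's scripts run.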