Extracting a list of URLs from a URL using BeautifulSoup: answers

【Question title】: Extracting list of urls from url using BeautifulSoup
【Posted】: 2021-01-21 02:07:20
【Question description】:

I would like to extract information about website similarity from this link:

https://www.alexa.com/siteinfo/amazon.com

I am looking at class='site' and trying to extract the information from elements such as

<a href="/siteinfo/ebay.com" class="truncation">ebay.com</a>

but I can only see one value. Is it possible to extract all 4 values together with their overlap scores?

What I would like to end up with is a table containing this information:

W                      amazon.com              
eBay.com                   70.1
pinterest.com              54.7
wikipedia.org              51.3
facebook.com               50.4

I tried

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
print([item.get_text(strip=True) for item in soup.select("span.site")])

but this does not seem to be enough to get the information, probably because some of the parameters in the code are wrong.

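【Comments on the question】:

It looks like you want span.truncation, a.truncation, or div.site
Thanks for your comment, OneCricketeer. In the inspect tool in Google Chrome I can only see spans for the overlap score and the site. I can't see the tags you mentioned
This page uses JavaScript to add elements, but BeautifulSoup and requests cannot run JavaScript. You may need Selenium to control a real web browser that can run JavaScript
That's not true, @furas. The page does use JS for some features, but the table the OP refers to loads fine and can be read without a headless browser
a.truncation is the element you showed in the question. The scores look like <span class="truncation">38.0</span>, so span.truncation. As for the site class, it only appears on div elements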

Tags: python web-scraping beautifulsoup

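【Solution 1】:

Your CSS selector is a good start, but it is too narrow. The CSS selectors you should use are:

Sites: #card_mini_audience .site>a
Scores: #card_mini_audience .overlap>.truncation

These selectors first narrow the focus to the div that holds the table, then use the class names to pull out the information you want.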
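To see how the two selectors behave without touching the network, here is a minimal offline sketch. The HTML fragment is hand-written for illustration and only imitates the structure described in this thread (a div.site wrapping a.truncation, and a span.truncation for the score); the `Row` class, the div.overlap wrapper and the `other.com` entry are assumptions, not copies of the real page.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the structure described in this thread.
# It is an assumption for illustration, not a copy of the real page.
html = """
<div id="card_mini_audience">
  <div class="Row">
    <div class="site"><a href="/siteinfo/ebay.com" class="truncation">ebay.com</a></div>
    <div class="overlap"><span class="truncation">70.1</span></div>
  </div>
  <div class="Row">
    <div class="site"><a href="/siteinfo/pinterest.com" class="truncation">pinterest.com</a></div>
    <div class="overlap"><span class="truncation">54.7</span></div>
  </div>
</div>
<div class="site"><a href="/siteinfo/other.com" class="truncation">other.com</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# The id prefix limits matching to the audience card; ">" requires a direct child.
sites = [a.get_text(strip=True) for a in soup.select("#card_mini_audience .site>a")]
scores = [
    float(span.get_text(strip=True))
    for span in soup.select("#card_mini_audience .overlap>.truncation")
]

# The selector from the question looks for <span class="site">,
# which does not exist in this markup, so it matches nothing.
original_attempt = soup.select("span.site")

print(sites)
print(scores)
print(original_attempt)
```

In this fragment the `other.com` link sits outside `#card_mini_audience`, so it is skipped, which is the point of anchoring the selector to the card's id.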

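I have attached some sample code below that solves your problem. It just prints the results to the screen, but it can easily be changed to do whatever you want with the values.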

from bs4 import BeautifulSoup
import requests

#Getting the website and processing it
url = "https://www.alexa.com/siteinfo/amazon.com"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

#Using CSS Selectors to grab content
websites = soup.select("#card_mini_audience .site>a")   #Selects the websites in the table
scores = soup.select("#card_mini_audience .overlap>.truncation")    #Selects the corresponding scores

#Goes through the list and extracts just the text
websites = [website.text.strip() for website in websites]
scores = [float(score.text.strip()) for score in scores]    #Converts the scores to floats

#Ordinary print to screen. You can change this to add to a dataframe or whatever else you want for your project
for pair in zip(websites, scores):
    print(pair)

The output looks like this:

('ebay.com', 70.1)
('pinterest.com', 54.7)
('wikipedia.org', 51.3)
('facebook.com', 50.4)
('reddit.com', 49.6)
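The question asked for a table rather than printed tuples. One possible way to get there with only the standard library is sketched below. The helper name `format_overlap_table` and the "Site" header are hypothetical, and the two lists are simply copied from the output above, standing in for the `websites` and `scores` lists built by the scraping code.

```python
# Values copied from the output above; in practice these would be the
# `websites` and `scores` lists produced by the scraping code.
websites = ["ebay.com", "pinterest.com", "wikipedia.org", "facebook.com", "reddit.com"]
scores = [70.1, 54.7, 51.3, 50.4, 49.6]


def format_overlap_table(target, websites, scores):
    """Return a plain-text table of (site, overlap score) rows for `target`.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    if len(websites) != len(scores):
        # A mismatch usually means one of the selectors missed some rows.
        raise ValueError("websites and scores must have the same length")
    # Pad the first column to the longest site name plus two spaces.
    width = max(len(name) for name in [*websites, "Site"]) + 2
    lines = [f"{'Site':<{width}}{target}"]
    for site, score in zip(websites, scores):
        lines.append(f"{site:<{width}}{score:>5.1f}")
    return "\n".join(lines)


table = format_overlap_table("amazon.com", websites, scores)
print(table)
```

The length check is there because `zip` silently truncates to the shorter list, which would otherwise hide a selector that matched fewer elements than expected.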

【Comments】: