Scraping selected hrefs with BeautifulSoup: answers

[Question title]: Web scraping selected href with BeautifulSoup
[Posted]: 2022-10-22 15:25:22
[Question description]:

I want to scrape a website with Python/BeautifulSoup, including this article:
https://www.electrive.com/2022/02/20/byd-planning-model-3-like-800-volt-sedan-called-seal/

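At the end of every article, the sources are always listed (for the link above, xchuxing.com and cnevpost.com).

Some articles on this site give only one source, but others give two or three different ones, so the code needs to account for that.

Ideally, I would like the following output format: "text (href)"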

xchuxing.com (https://xchuxing.com/article/45850)
cnevpost.com (https://cnevpost.com/2022/02/18/byd-seal-set-to-become-new-tesla-model-3-challenger/)

This is my first attempt:

from bs4 import BeautifulSoup
import requests
import csv

URL = 'https://www.electrive.com/2022/02/20/byd-planning-model-3-like-800-volt-sedan-called-seal/'
(response := requests.get(URL)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
article = soup.find()

source = [c for c in article.find('section', class_='content').find_all('a')]
for link in source[3:]:
        link.get('href')
print (link)

Output so far:

<a href="https://cnevpost.com/2022/02/18/byd-seal-set-to-become-new-tesla-model-3-challenger/" rel="noopener" target="_blank">cnevpost.com</a>
[Finished in 345ms]

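[Comments]:

The line link.get('href') effectively does nothing: you retrieve the href and throw it away. Store it (bind it to a name) or print it. You also iterate over all the links (tags) and then print only the last tag, not its href.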
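
A minimal sketch of what this comment suggests, producing the "text (href)" format the question asks for. It assumes a made-up inline HTML fragment as a stand-in for the article's content section instead of a live request, and uses the built-in html.parser; the helper name format_links and the slice index are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the article's content section.
HTML = """
<section class="content">
  <p>Article text with an <a href="https://example.com/internal">internal link</a>.</p>
  <p><a href="https://xchuxing.com/article/45850">xchuxing.com</a>,
     <a href="https://cnevpost.com/2022/02/18/byd-seal-set-to-become-new-tesla-model-3-challenger/">cnevpost.com</a></p>
</section>
"""


def format_links(links):
    """Return one 'text (href)' string per <a> tag that has an href."""
    return [f"{a.get_text(strip=True)} ({a['href']})" for a in links if a.get("href")]


soup = BeautifulSoup(HTML, "html.parser")
links = soup.find("section", class_="content").find_all("a")

# Skip the leading non-source link; the right index depends on the page,
# just like source[3:] in the question.
formatted = format_links(links[1:])

# Print inside the loop so every link is shown, not only the last one.
for line in formatted:
    print(line)
```

The key differences from the question's code are that the href is kept rather than discarded, and the print happens once per link instead of once after the loop.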

Tags: python html web-scraping beautifulsoup

[Solution 1]:

I assume the sources are always in the last paragraph of the article, so you can extract them as follows:

from bs4 import BeautifulSoup
import requests
import csv

URL = 'https://www.electrive.com/2022/02/20/byd-planning-model-3-like-800-volt-sedan-called-seal/'
(response := requests.get(URL)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')


paragraphs = soup.find('section', class_='content').find_all('p')
# the sources are in the last paragraph
sources = paragraphs[-1].find_all('a')
# collect each source's name and link as a (text, href) tuple
sources_links = []
for source in sources:
    sources_links.append((source.text, source['href']))

for l in sources_links:
    print(l)

# write to csv (newline='' avoids blank lines between rows on Windows)
with open('electrive_scrape_source.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Source', 'Link'])
    csv_writer.writerows(sources_links)

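This saves the data to a CSV file.

[Discussion]:

This is great. One question: how do I write multiple entries to the CSV? Please see my follow-up question.
I updated it to save the data to a CSV file @webscrapeartist
Thank you so much. One last question: is there a way to store the results in 1 row and 1 cell (e.g. source.com (link), source2.com (link2), ...)?
I don't think that is a good idea, because the data should sit under columns. If you want to tell the sources of different articles apart, add a new column and put the article's name or link in it @webscrapeartist
But I need the data in a single cell, separated by commas, because this is part of a larger web scraping script (multiple articles). Is there any chance of adding that to the code snippet?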
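
The thread ends without an answer to this last request. One possible sketch, assuming the (text, href) tuples have already been collected per article as in the solution above: join them into a single string and write it as one cell, next to a column holding the article URL. The helper name sources_cell, the articles dict, and the column names are hypothetical, and an in-memory buffer stands in for the real file.

```python
import csv
import io


def sources_cell(sources_links):
    """Join (name, href) pairs into one 'name (href), name2 (href2)' string."""
    return ", ".join(f"{name} ({href})" for name, href in sources_links)


# Hypothetical data: article URL mapped to the (text, href) tuples scraped from it.
articles = {
    "https://www.electrive.com/2022/02/20/byd-planning-model-3-like-800-volt-sedan-called-seal/": [
        ("xchuxing.com", "https://xchuxing.com/article/45850"),
        ("cnevpost.com", "https://cnevpost.com/2022/02/18/byd-seal-set-to-become-new-tesla-model-3-challenger/"),
    ],
}

# In-memory buffer for illustration; a file opened with newline='' behaves the same.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Article", "Sources"])
for article_url, links in articles.items():
    # One row per article; all of its sources go into a single cell.
    writer.writerow([article_url, sources_cell(links)])

print(buffer.getvalue())
```

Because the joined string contains commas, csv.writer wraps the cell in quotes, so it still reads back as a single field. Keeping the article URL in its own column also covers the answerer's suggestion for telling articles apart.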