【Posted】: 2020-10-20 15:25:09
【Question】:
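I wrote a scraper to pull information from a local newspaper's website. The current code has two problems.
- When it retrieves the paragraph text and saves it to CSV, every "," in the text is treated as a field separator, so the rest of the text spills into the adjacent cells. How do I stop this from happening?
- I want the scraped information laid out in rows, one article per row, i.e. paragraph, title, link.
The code is as follows: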
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

page_url = "https://neweralive.na/today/"
ne_url = "https://neweralive.na/posts/"

uClient = uReq(page_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

containers = page_soup.findAll("article", {"class": "post-item"})

filename = "newera.csv"
headers = "paragraph,title,link\n"
f = open(filename, "w")
f.write(headers)

for container in containers:
    paragraph_container = container.findAll("p", {"class": "post-excerpt"})
    paragraph = paragraph_container[0].text
    title_container = container.findAll("h3", {"class": "post-title"})
    title = title_container[0].text
    weblink = ne_url + title_container[0].a["href"]
    f.write(paragraph + "," + title + "," + weblink + "\n")

f.close()
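
For reference, the comma problem seems to come from joining the fields by hand with "," and writing the raw string. Here is a minimal sketch of one possible alternative, using the standard library's `csv.writer`, which quotes any field that contains a comma. The helper name `write_articles`, the file name `newera_sample.csv`, and the sample row (including its link) are hypothetical stand-ins for the scraped values, not real data from the site.

```python
import csv


def write_articles(rows, filename="newera.csv"):
    """Write (paragraph, title, link) tuples to a CSV file, one article per row."""
    # newline="" lets the csv module control line endings itself.
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["paragraph", "title", "link"])
        for paragraph, title, link in rows:
            # strip() removes the leading/trailing whitespace that .text often carries.
            # csv.writer quotes fields containing commas, so they stay in one cell.
            writer.writerow([paragraph.strip(), title.strip(), link])


# Hypothetical sample data standing in for the scraped values.
sample = [
    (
        "Rain fell in the north, south, and east.\n",
        " Weather update ",
        "https://neweralive.na/posts/weather-update",
    ),
]
write_articles(sample, "newera_sample.csv")
```

In the loop above, this would amount to collecting `(paragraph, title, weblink)` for each container and passing the list to the helper, or calling `writer.writerow([paragraph, title, weblink])` in place of the `f.write(...)` line. Each article then lands on its own row with three columns.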
【Discussion】:
Tags: python html csv web-scraping beautifulsoup