How do I ensure that BeautifulSoup does not look at commas as tabs (answers)

【Question title】: How do I ensure that BeautifulSoup does not look at commas as tabs
【Posted】: 2020-10-20 15:25:09
【Question】:

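I wrote a scraper to collect information from a local newspaper's website. The current code has two problems.

When it retrieves the paragraph data and saves it to a CSV, it treats "," as a break and puts the rest of the text in the adjacent cell. How do I stop this from happening?
I want the scraped information to end up in rows, i.e. paragraph, title, link.

The code is below: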

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

page_url = "https://neweralive.na/today/"

ne_url = "https://neweralive.na/posts/"

uClient = uReq(page_url)

page_soup = soup(uClient.read(), "html.parser")
uClient.close()


containers = page_soup.findAll("article", {"class": "post-item"})

filename = "newera.csv"
headers = "paragraph,title,link\n"

f = open(filename, "w")
f.write(headers)

for container in containers:
    paragraph_container = container.findAll("p", {"class": "post-excerpt"})
    paragraph = paragraph_container[0].text

    title_container = container.findAll("h3", {"class": "post-title"})
    title = title_container[0].text
    weblink = ne_url + title_container[0].a["href"]

    f.write(paragraph + "," + title + "," + weblink + "\n")

f.close()

【Comments】:

Tags: python html csv web-scraping beautifulsoup

【Solution 1】:

You can use the built-in csv module to write well-formed CSV, with quotes around the strings that need them (e.g. those that contain commas).
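
As a minimal sketch of that quoting behaviour, the snippet below writes one row whose first field contains commas. The row values and the example.com link are made up for illustration, and io.StringIO stands in for a real file.

```python
import csv
import io

# Hypothetical sample row; the paragraph text contains commas.
row = [
    "The mayor, Elias Nghipangelwa, spoke.",
    "Relief food stolen",
    "https://example.com/a",
]

buffer = io.StringIO()  # in-memory stand-in for an open file
writer = csv.writer(buffer)
writer.writerow(["paragraph", "title", "link"])
writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```

With the default settings, only the field that contains commas is wrapped in quotes, so a spreadsheet program should keep the whole paragraph in one cell.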

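While at it, I refactored your code into reusable functions:

get_soup_from_url() downloads a URL and builds a BeautifulSoup object from it
parse_today_page() is a generator function that walks through that soup and yields a dict for each article
The main code now just uses csv.DictWriter on the opened file; the parsed dicts are printed to the console for easier debugging and fed to the CSV writer for output.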

from bs4 import BeautifulSoup
from urllib.request import urlopen
import csv


base_url = "https://neweralive.na/posts/"


def get_soup_from_url(url):
    resp = urlopen(url)
    page_soup = BeautifulSoup(resp.read(), "html.parser")
    resp.close()
    return page_soup


def parse_today_page(page_soup):
    for container in page_soup.findAll("article", {"class": "post-item"}):
        paragraph_container = container.findAll(
            "p", {"class": "post-excerpt"}
        )
        paragraph = paragraph_container[0].text
        title_container = container.findAll("h3", {"class": "post-title"})
        title = title_container[0].text
        weblink = base_url + title_container[0].a["href"]
        yield {
            "paragraph": paragraph,
            "title": title.strip(),
            "link": weblink,
        }


print("Downloading...")
page_soup = get_soup_from_url("https://neweralive.na/today/")

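# newline="" lets the csv module control line endings (avoids blank rows on Windows)
with open("newera.csv", "w", newline="", encoding="utf-8") as f: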
    writer = csv.DictWriter(f, ["paragraph", "title", "link"])
    writer.writeheader()
    for entry in parse_today_page(page_soup):
        print(entry)
        writer.writerow(entry)

The generated CSV ends up looking like this, for example:

paragraph,title,link
"The mayor of Helao Nafidi, Elias Nghipangelwa, has expressed disappointment after Covid-19 relief food was stolen and sold by two security officers entrusted to guard the warehouse where the food was stored.","Guards arrested for theft of relief food",https://neweralive.na/posts/posts/guards-arrested-for-theft-of-relief-food
"Government has decided to construct 1 200 affordable homes, starting Thursday this week.","Govt to construct  1 200 low-cost houses",https://neweralive.na/posts/posts/govt-to-construct-1-200-low-cost-houses
...
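
To check that quoted commas no longer split a field, the file can be read back with csv.DictReader. This is only a sketch: the CSV text below is a shortened, made-up sample in the same shape as the output above, held in memory instead of being read from newera.csv.

```python
import csv
import io

# Made-up sample in the same column layout: paragraph, title, link.
csv_text = (
    "paragraph,title,link\r\n"
    '"Food was stolen, then sold, by two guards.",'
    "Guards arrested,https://example.com/posts/guards\r\n"
)

# DictReader uses the header line as the keys for each row.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for r in rows:
    print(r["paragraph"])
```

The paragraph comes back as a single value with its commas intact, which is what the spreadsheet view should show as well.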

【Comments】:

There are heroes in this world. Thank you very much.
Hello - wow, this is really great, I love this solution - it is a clear csv example in Python. Many thanks!

【Solution 2】:

You can use the pandas module and easily convert a DataFrame to CSV.

import pandas as pd
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

page_url = "https://neweralive.na/today/"

ne_url = "https://neweralive.na/posts/"

uClient = uReq(page_url)

page_soup = soup(uClient.read(), "html.parser")

uClient.close()

containers = page_soup.findAll("article", {"class": "post-item"})

filename = "newera.csv"

rows = []  # Initialize list of list which is converted to dataframe.

for container in containers:

    paragraph_container = container.findAll("p", {"class": "post-excerpt"})
    paragraph = paragraph_container[0].text

    title_container = container.findAll("h3", {"class": "post-title"})
    title = title_container[0].text
    weblink = ne_url + title_container[0].a["href"]
    
    rows.append([paragraph, title, weblink])  # each row is appended 

df = pd.DataFrame(rows, columns = ["paragraph","title","link"])  # col-name is headers 

df.to_csv(filename, index=False)  # index=False leaves out the row-number column

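【Comments】:

This is a task that absolutely does not need Pandas.
I agree. Just an option.
Dear Shimo - many thanks for showing the options and possibilities we have with Pandas - I like it - the idea of having various ways and paths available. This is what makes Python so powerful! So thank you for sharing your idea.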