【Question title】: How do I ensure that BeautifulSoup does not look at commas as tabs
【Posted】: 2020-10-20 15:25:09
【Question description】:

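I wrote a scraper to pull information from a local newspaper's website. The current code has two problems.

  1. When it retrieves the paragraph data and saves it to the CSV, it treats "," as a break and saves the rest of the text in the adjacent cells. How do I stop this from happening?

  2. I would like the scraped information laid out in rows, i.e. paragraph, title, link.

The code is below: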

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

page_url = "https://neweralive.na/today/"

ne_url = "https://neweralive.na/posts/"

uClient = uReq(page_url)

page_soup = soup(uClient.read(), "html.parser")
uClient.close()


containers = page_soup.findAll("article", {"class": "post-item"})

filename = "newera.csv"
headers = "paragraph,title,link\n"

f = open(filename, "w")
f.write(headers)

for container in containers:
    paragraph_container = container.findAll("p", {"class": "post-excerpt"})
    paragraph = paragraph_container[0].text

    title_container = container.findAll("h3", {"class": "post-title"})
    title = title_container[0].text
    weblink = ne_url + title_container[0].a["href"]

    f.write(paragraph + "," + title + "," + weblink + "\n")

f.close()

【Comments】:

    Tags: python html csv web-scraping beautifulsoup


    【Solution 1】:

    You can use the built-in csv module to write well-formed CSV, with quotes around the strings that need them (for example, strings containing commas).
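
To see why the quoting matters, here is a minimal sketch that compares a hand-joined line with one produced by csv.writer. The sample row and the example.com link are made up for illustration; nothing here is fetched from the real site.

```python
import csv
import io

# Hypothetical sample row; the paragraph contains commas on purpose.
row = [
    "The mayor of Helao Nafidi, Elias Nghipangelwa, spoke.",
    "Guards arrested",
    "https://example.com/a",
]

# Naive approach: joining with "," lets the commas inside the paragraph
# act as extra field separators.
naive_line = ",".join(row)

# csv.writer wraps any field that contains the delimiter in double quotes.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
quoted_line = buffer.getvalue()

# Parsing both lines again shows the difference in field count.
naive_fields = next(csv.reader([naive_line]))
quoted_fields = next(csv.reader([quoted_line]))
print(len(naive_fields), len(quoted_fields))
```

The hand-joined line splits into more fields than were written, while the csv.writer line comes back as the original three values.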

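    While at it, I refactored your code into reusable functions:

    • get_soup_from_url() downloads a URL and builds a BeautifulSoup object from it
    • parse_today_page() is a generator function that walks through that soup and yields a dict for each article
    • the main code now just uses csv.DictWriter on the opened file; the parsed dicts are printed to the console for easier debugging and fed to the CSV writer for output.
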
    from bs4 import BeautifulSoup
    from urllib.request import urlopen
    import csv
    
    
    base_url = "https://neweralive.na/posts/"
    
    
    def get_soup_from_url(url):
        resp = urlopen(url)
        page_soup = BeautifulSoup(resp.read(), "html.parser")
        resp.close()
        return page_soup
    
    
    def parse_today_page(page_soup):
        for container in page_soup.findAll("article", {"class": "post-item"}):
            paragraph_container = container.findAll(
                "p", {"class": "post-excerpt"}
            )
            paragraph = paragraph_container[0].text
            title_container = container.findAll("h3", {"class": "post-title"})
            title = title_container[0].text
            weblink = base_url + title_container[0].a["href"]
            yield {
                "paragraph": paragraph,
                "title": title.strip(),
                "link": weblink,
            }
    
    
    print("Downloading...")
    page_soup = get_soup_from_url("https://neweralive.na/today/")
    
    # newline="" lets the csv module control line endings (avoids blank rows on Windows)
    with open("newera.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, ["paragraph", "title", "link"])
        writer.writeheader()
        for entry in parse_today_page(page_soup):
            print(entry)
            writer.writerow(entry)
    

    The resulting CSV ends up looking like this, for example:

    paragraph,title,link
    "The mayor of Helao Nafidi, Elias Nghipangelwa, has expressed disappointment after Covid-19 relief food was stolen and sold by two security officers entrusted to guard the warehouse where the food was stored.","Guards arrested for theft of relief food",https://neweralive.na/posts/posts/guards-arrested-for-theft-of-relief-food
    "Government has decided to construct 1 200 affordable homes, starting Thursday this week.","Govt to construct  1 200 low-cost houses",https://neweralive.na/posts/posts/govt-to-construct-1-200-low-cost-houses
    ...
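
One way to check that the commas survive is to read the data back with csv.DictReader. The sketch below assumes an entry shaped like the dicts yielded by parse_today_page(); the text and link are invented, and an in-memory buffer stands in for newera.csv so nothing is written to disk.

```python
import csv
import io

# Hypothetical entry shaped like the dicts yielded by parse_today_page().
entries = [
    {
        "paragraph": "Relief food was stolen, then sold, by two guards.",
        "title": "Guards arrested for theft of relief food",
        "link": "https://example.com/posts/guards-arrested",
    }
]

# Write to an in-memory buffer instead of newera.csv.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["paragraph", "title", "link"])
writer.writeheader()
writer.writerows(entries)

# Rewind and parse the CSV text again; the header row supplies the keys.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[0]["paragraph"])
```

If the quoting worked, the paragraph comes back as a single value with its commas intact, rather than being spread over several columns.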
    

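    【Comments】:

    • There are heroes in this world. Thank you very much.
    • Hello - wow, this is just awesome, I love this solution - it is a clear example of csv in Python. Many thanks!
    【Solution 2】:

    You can use the pandas module and easily convert a DataFrame table to CSV.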

    import pandas as pd
    from bs4 import BeautifulSoup as soup
    from urllib.request import urlopen as uReq
    
    page_url = "https://neweralive.na/today/"
    
    ne_url = "https://neweralive.na/posts/"
    
    uClient = uReq(page_url)
    
    page_soup = soup(uClient.read(), "html.parser")
    
    uClient.close()
    
    containers = page_soup.findAll("article", {"class": "post-item"})
    
    filename = "newera.csv"
    
    rows = []  # list of rows (one list per article), later converted to a DataFrame
    
    for container in containers:
    
        paragraph_container = container.findAll("p", {"class": "post-excerpt"})
        paragraph = paragraph_container[0].text
    
        title_container = container.findAll("h3", {"class": "post-title"})
        title = title_container[0].text
        weblink = ne_url + title_container[0].a["href"]
        
        rows.append([paragraph, title, weblink])  # one row per article
    
    df = pd.DataFrame(rows, columns=["paragraph", "title", "link"])  # column names become the CSV header
    
    df.to_csv(filename, index=False)  # index=False leaves out the DataFrame's row index
    

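    【Comments】:

    • This is a task that absolutely does not need Pandas.
    • I agree. Just an option.
    • Dear Shimo - many thanks for showing the options and possibilities we have in Pandas - I like it - the idea of having the various ways and paths available to us. This is what makes Python so powerful! So thank you for contributing your idea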