Some fields from a dictionary are not written to the csv file: answers

【Question title】: some fields from dictionary are not getting written in csv file
【Posted】: 2019-12-05 08:57:45
【Question description】:

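I scraped results from duckduckgo.com and stored them as title, link, and description. The link and description appear in the output, but the title does not.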

I have printed the title with print(title), and it does produce output.

class DuckduckgoScraper(web_scraping):
    def scrape(self,search_Term):
        self.filename = search_Term
        self.url = 'https://duckduckgo.com/html?q='+search_Term
        r = requests.get(self.url,headers=USER_AGENT)
        soup = BeautifulSoup(r.content,'html5lib')
        result_block = soup.find_all(class_ = 'result__body')
        for result in result_block:
            link = result.find('a', attrs={'class':'result__a'}, href=True)
            title = result.find('h2')
            description = result.find(attrs={'class':'result__snippet'})
            if link and title:
                link = link['href']
                title = title.get_text()
                if description:
                    description = description.get_text()
                    with open(self.filename+'.csv', 'a', encoding='utf-8',newline='') as csv_file:
                        file_is_empty = os.stat(self.filename+'.csv').st_size==0
                        fieldname = ['title','link','description']
                        writer = csv.DictWriter(csv_file,fieldnames=fieldname)
                        if file_is_empty:
                            writer.writeheader()
                        writer.writerow({'title':title,'link':link,'description':description})

It does not raise any error.

【Discussion】:

Tags: python csv web-scraping beautifulsoup duckduckgo

【Solution 1】:

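You are opening the csv file and writing to it on every iteration of the loop. Instead, store the rows in a list and write them all at once at the end with the .writerows() function.

Note: it is useful to call .strip() on every item of a row; otherwise Excel/LibreOffice/... may get confused when opening the file.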

import os
import csv
import requests
from bs4 import BeautifulSoup

USER_AGENT = {'User-Agent':'Mozilla/5.0'}

def scrape(search_Term):
    filename = search_Term
    url = 'https://duckduckgo.com/html?q='+search_Term
    r = requests.get(url,headers=USER_AGENT)
    soup = BeautifulSoup(r.content,'html5lib')
    result_block = soup.find_all(class_ = 'result__body')
    rows = []  # collect every result first, then write once after the loop
    for result in result_block:
        link = result.find('a', attrs={'class':'result__a'}, href=True)
        title = result.find('h2')
        description = result.find(attrs={'class':'result__snippet'})

        if link and title:
            link = link['href']
            title = title.get_text()
            if description:
                description = description.get_text()
                rows.append({'title':title.strip(), 'link':link.strip(), 'description':description.strip()})
                # print(title.strip(), link.strip())
                # print(description.strip())
                # print('*'* 80)

    # open the file a single time, after the loop, and write all rows together
    with open(filename+'.csv', 'a', encoding='utf-8',newline='') as csv_file:
        file_is_empty = os.stat(filename+'.csv').st_size==0
        fieldname = ['title','link','description']
        writer = csv.DictWriter(csv_file,fieldnames=fieldname)
        if file_is_empty:
            writer.writeheader()
        writer.writerows(rows)

scrape('tree')

This creates tree.csv. [Screenshot: tree.csv opened in LibreOffice]
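
To see why the .strip() note matters, here is a small offline sketch. The sample title is a made-up value that imitates what get_text() can return for an h2 whose text is surrounded by newlines and indentation; the helper name to_csv is hypothetical. It only illustrates the likely cause, it is not taken from the asker's actual data.

```python
import csv
import io

# Hypothetical sample values: the title imitates get_text() output for an
# <h2> whose anchor text is wrapped in newlines and indentation.
raw_title = "\n            Welcome to Python.org\n          "
link = "https://www.python.org/"
description = "The official home of the Python Programming Language."


def to_csv(title):
    """Write one row to an in-memory CSV and return the resulting text."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["title", "link", "description"])
    writer.writeheader()
    writer.writerow({"title": title, "link": link, "description": description})
    return buffer.getvalue()


unstripped = to_csv(raw_title)
stripped = to_csv(raw_title.strip())

# With the raw title, the single record is spread over several physical
# lines because the quoted title field contains embedded newlines.
print(len(unstripped.splitlines()))
# With the stripped title, there is one header line and one record line.
print(len(stripped.splitlines()))
```

The title is in the file in both cases, but in the unstripped case the cell starts with an empty line, so a spreadsheet viewer may display it as if it were blank.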

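【Discussion】:

He is writing in append mode. That should produce the same result and is more memory-efficient.
@MisterMiyagi I suspect that he is not calling .strip() when writing the rows to the file, so Excel gets confused when the file is imported.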
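
The two comments can be combined in a small sketch: open the file once in append mode, write each row as it is produced (so no list has to be kept in memory), and strip every value on the way out. The names append_rows and fake_results are hypothetical, and the rows are made-up stand-ins for scraped results so the example runs without a network connection.

```python
import csv
import os
import tempfile

FIELDNAMES = ["title", "link", "description"]


def append_rows(path, rows):
    """Open the file once in append mode and write rows as they arrive."""
    count = 0
    with open(path, "a", encoding="utf-8", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=FIELDNAMES)
        # Write the header only when the file has no content yet.
        if os.stat(path).st_size == 0:
            writer.writeheader()
        for row in rows:
            # Strip each value so stray whitespace never reaches the file.
            writer.writerow({key: row[key].strip() for key in FIELDNAMES})
            count += 1
    return count


def fake_results():
    """Hypothetical stand-in for the scraping loop; yields one dict per result."""
    yield {"title": "\n  Welcome to Python.org\n",
           "link": "https://www.python.org/",
           "description": " The official home of Python. "}
    yield {"title": "\n  Python Tutorial\n",
           "link": "https://www.w3schools.com/python/",
           "description": " Python is a programming language. "}


path = os.path.join(tempfile.mkdtemp(), "demo.csv")
append_rows(path, fake_results())
append_rows(path, fake_results())  # a second run appends without a new header

with open(path, encoding="utf-8", newline="") as f:
    lines = f.read().splitlines()
print(lines[0])
print(len(lines))
```

Passing a generator keeps memory use flat regardless of the number of results, while the file is still opened only once per call.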

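【Solution 2】:

You can issue an HTTP POST request with the appropriate payload to fetch the required content and write it to a csv file. I used python as the search keyword, and this is what it produced: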

import csv
import requests
from bs4 import BeautifulSoup

URL = "https://duckduckgo.com/html/"

payload = {
    'q': 'python',
    'b': '',
    'kl': 'us-en'
}

r = requests.post(URL,data=payload,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(r.text,"lxml")

with open("output.csv","w",newline="",encoding="UTF-8") as outfile:
    writer = csv.writer(outfile)
    for item in soup.select(".result__body"):
        title = item.select_one(".result__a").text
        link = item.select_one(".result__a").get("href")
        desc = item.select_one(".result__snippet").text
        desc_link = item.select_one(".result__snippet").get("href")
        print(f'{title}\n{link}\n{desc}\n{desc_link}\n')
        writer.writerow([title,link,desc,desc_link])

The results look like this:

Welcome to Python.org
https://www.python.org/
The official home of the Python Programming Language. Compound Data Types. Lists (known as arrays in other languages) are one of the compound data types that Python understands.
https://www.python.org/

Python (programming language) - Wikipedia
https://en.wikipedia.org/wiki/Python_%28programming_language%29
Python is an interpreted, high-level, general-purpose programming language.Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
https://en.wikipedia.org/wiki/Python_%28programming_language%29

Python Tutorial - w3schools.com
https://www.w3schools.com/python/
Python is a programming language. Python can be used on a server to create web applications. Start learning Python now »
https://www.w3schools.com/python/
