Web scraping data to a CSV file in Python, and the code to scrape a link: answers

【Question title】: Web scraping data to a CSV file in Python, and the code to scrape a link
【Posted】: 2022-01-06 20:55:48
【Question description】:

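1 - When I check the CSV file, I only find the data from the last link (Tugende). But when I print the data, I get everything I want. How can I get all of the data into the CSV file?

2 - For the 'source' variable, how can I extract only the article link from it and add it to the CSV file?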

import requests
from bs4 import BeautifulSoup as bs
import csv

url = "https://digestafrica.com/companies/{}"
startups = ['OBM-Education','Crafty-Workshop','Planet42','Paylend','Tugende']
for startup in startups:
    u = url.format(startup)
    html_text = requests.get(u).text
    soup = bs(html_text, 'lxml')
    
    list1 = soup.find_all('div', class_='d-flex flex-wrap content mt-24 border p-2 border-dark')
    source1 =soup.find_all('div',class_='col-md-2 mt-3 mt-lg-0')
    file = open('funding.csv', 'w',newline='')
    writer = csv.writer(file)
    mama = (['Name', 'Type', 'date','amount','investors'])
    writer.writerow(mama)



    for L in list1:      
        name1 = L.find('span', class_="line-height-1").text
        amount1 = L.find('div', class_='p-0').text.replace('Amount','').strip()
        date1 = L.find('span', class_="pt-0").text
        funding_type1 = L.find('div', class_="col-md-2 mt-2 mt-lg-0").text.replace('Funding Round','')
        investor1 = L.find('div',class_='col-md-3 mt-3 mt-lg-0').text.replace('investors','')
        source =L.find('div',class_="col-md-2 mt-3 mt-lg-0")
        
        print(name1, funding_type1, date1,amount1, investor1)

        writer.writerow([name1, funding_type1, date1,amount1, investor1])
    file.close()

【Discussion】:

Tags: python csv web-scraping scrape write

【Solution 1】:

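1: You should use a context manager to handle the CSV file when writing to it. I have fixed your code below: first I write the header in 'w' mode (so the file is overwritten with a fresh header each time you run the code), then I append each row in 'a' mode as each page is scraped.

2: You need to find the 'a' tag that holds the source link, then read its href attribute like this: find('a')['href']. See below.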

import requests
from bs4 import BeautifulSoup as bs
import csv

#write header
with open('funding.csv','w',newline='') as file:
    writer = csv.writer(file)
    mama = (['Name', 'Type', 'date','amount','investors','source'])
    writer.writerow(mama)

url = "https://digestafrica.com/companies/{}"
startups = ['OBM-Education','Crafty-Workshop','Planet42','Paylend','Tugende']

for startup in startups:

    html_text = requests.get(url.format(startup))
    soup = bs(html_text.text,'lxml')

    for list1 in soup.find_all('div', class_='d-flex flex-wrap content mt-24 border p-2 border-dark'):
        name1 = list1.find('span', class_="line-height-1").text
        amount1 = list1.find('div', class_='p-0').text.replace('Amount','').strip()
        date1 = list1.find('span', class_="pt-0").text
        funding_type1 = list1.find('div', class_="col-md-2 mt-2 mt-lg-0").text.replace('Funding Round','')
        investor1 = list1.find('div',class_='col-md-3 mt-3 mt-lg-0').text.replace('investors','')
        source = list1.find('div',class_="col-md-2 mt-3 mt-lg-0").find('a')['href']

        print(name1, funding_type1, date1,amount1, investor1, source)

        with open('funding.csv','a',newline='') as file:
            writer = csv.writer(file)
            writer.writerow([name1, funding_type1, date1,amount1, investor1, source])
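
One caveat with the find('a')['href'] lookup used above: find() returns None when nothing matches, so a row without a source link would raise an error. A minimal sketch of a more defensive lookup follows; the markup, the class names "row" and "src", and the helper extract_href are hypothetical and only stand in for the real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one row with a source link and one without.
html = """
<div class="row"><div class="src"><a href="https://example.com/article">Source</a></div></div>
<div class="row"><div class="src"></div></div>
"""

def extract_href(row):
    """Return the href of the first link in the row's source cell, or ''."""
    cell = row.find("div", class_="src")
    link = cell.find("a") if cell is not None else None
    # Tag.get() returns the default instead of raising when the attribute is absent.
    return link.get("href", "") if link is not None else ""

soup = BeautifulSoup(html, "html.parser")
hrefs = [extract_href(row) for row in soup.find_all("div", class_="row")]
print(hrefs)
```

With this approach a missing link becomes an empty CSV cell rather than an exception that stops the whole scrape.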

【Discussion】:

【Solution 2】:

The reason you only get the data for the final startup is the way you open the output file:

    file = open('funding.csv', 'w',newline='')

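This opens the file for writing as required, but 'w' mode truncates the file and starts writing again from the beginning. That is fine the first time through the loop, but every later iteration wipes out the rows written before it.

If you really want to open the file inside the loop, you need to use 'a' mode (which means "open for writing, but append if the file already exists").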
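
A minimal sketch of the difference between the two modes. The rows are made-up placeholders for scraped data, and temporary files are used instead of funding.csv; write_rows is a hypothetical helper that reopens the file once per page, as the loop in the question does.

```python
import csv
import os
import tempfile

# Placeholder rows standing in for the data scraped from two pages.
pages = [["OBM-Education", "Seed"], ["Tugende", "Series A"]]

def write_rows(path, mode):
    """Reopen the file once per row in the given mode, then read it back."""
    for row in pages:
        with open(path, mode, newline="") as f:
            csv.writer(f).writerow(row)
    with open(path, newline="") as f:
        return list(csv.reader(f))

with tempfile.TemporaryDirectory() as tmp:
    rows_w = write_rows(os.path.join(tmp, "w.csv"), "w")
    rows_a = write_rows(os.path.join(tmp, "a.csv"), "a")

print(rows_w)  # only the last row survives, because 'w' truncates on every open
print(rows_a)  # both rows are kept, because 'a' appends
```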

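However, doing this inside the loop is inefficient. I suggest opening the file for writing before the for loop starts, and creating the writer object there too:

file = open('funding.csv', 'w', newline='')
writer = csv.writer(file)
for startup in startups:
    ...  # do loop operations, calling writer.writerow(...) for each row
file.close()  # close the file itself; csv.writer has no close() method

and call close() on the file after the loop ends.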
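
The same idea can also be written with a context manager, so the file is closed even if something fails partway through. This is only a sketch: scrape() is a hypothetical stand-in that returns placeholder rows, where a real version would fetch the page with requests.get() and parse it with BeautifulSoup, and save_all is an illustrative name.

```python
import csv

HEADER = ["Name", "Type", "date", "amount", "investors"]

def scrape(startup):
    """Hypothetical stand-in for the scraping step; returns placeholder rows."""
    return [[startup, "type", "date", "amount", "investors"]]

def save_all(startups, path):
    """Open the file once, write the header, then write every startup's rows."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        for startup in startups:
            writer.writerows(scrape(startup))
```

Calling save_all(startups, 'funding.csv') would then produce one header row followed by the rows for every startup, since the file is opened in 'w' mode only once.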

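【Discussion】:

【Solution 3】:

The result differs between printing element.find() and saving the element.
In fact, element.find() returns a bs4.element.Tag, not a str.
You do not notice this in your case because Python applies str(element.find()) when printing something.
You need to do the conversion explicitly, otherwise you get unwanted results.
Example: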

from bs4 import BeautifulSoup

element = BeautifulSoup('<div></div>', 'html.parser')
print(type(element.find('div')))       # <class 'bs4.element.Tag'>
print(type(str(element.find('div'))))  # <class 'str'>
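
To make the "unwanted results" concrete: csv.writer converts non-string values with str(), so writing a Tag stores its HTML markup rather than its text. A small sketch with made-up markup, writing to an in-memory buffer instead of a real file:

```python
import csv
import io
from bs4 import BeautifulSoup

# Made-up markup standing in for one cell of the scraped page.
soup = BeautifulSoup("<span>Seed</span>", "html.parser")
tag = soup.find("span")

buffer = io.StringIO()
# First column: the Tag itself (stringified to markup); second column: its text.
csv.writer(buffer).writerow([tag, tag.text])
line = buffer.getvalue().strip()
print(line)
```

The code in the question already reads .text from each element, so it writes plain strings; the issue would only appear if a Tag such as the 'source' element were written directly.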

【Discussion】: