How to read the lines of a CSV or text file, loop over each line, and save a new file for each line read

[Question title]: How To Read Lines of CSV or Text File, Loop Over Each Line and Save To a New File For Each Line Read
[Posted]: 2018-08-24 09:26:58
[Question description]:

I have an unusual problem that I thought I had solved, until I used a while loop to control the flow of this program.

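Summary:

I have a flat file (CSV or text) containing some URLs that I want to scrape. For each page I use BeautifulSoup to add a new tag to the HTML (this part works), and then I want to save each scraped page under its own new file name.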

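What I need:

Loop over each line
Get the URL
Scrape the page at each URL
Add the new HTML tag
Save the file, using the name of the HTML page where possible
Start the same procedure again, moving on to the next line

I'm fairly sure this comes down to my not understanding the basics, which I am still working on. My code is shown further down.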

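What happens:

With Python 3 the code does run. I used Jupyter to step through it line by line, with a series of print statements to see what was returned as the while loop ran.

The problem is that only one file is saved, and it contains only the page for the last URL in the file. The other URLs are scraped but not saved.

How can I make each line be iterated over, scraped, and saved to its own file before moving on to the next line? Am I using these constructs incorrectly?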

URLs:

https://www.imgacademy.com/media/headline/img-academy-alumna-jacqueline-bendrick-ready-tee-against-men-golfbc-championship

https://www.imgacademy.com/media/headline/img-academy-u19-girls-win-fysa-state-cup-u19-championship

https://www.imgacademy.com/media/headline/img-academy-celebrates-largest-commencement-ceremony-date-200-ascenders-earn

Code:

import csv
import requests
from bs4 import BeautifulSoup as BS

filename = 'urls.csv'

with open(filename, 'r+') as file:


    while True:

        line = file.readline()

        user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'

        headers = {'User-Agent':user_agent}

        response = requests.get(line, headers)

        print(response)

        soup = BS(response.content, 'html.parser')

        html = soup

        title = soup.find('title')
        meta = soup.new_tag('meta')
        meta['name'] = "robots"
        meta['content'] = "noindex, nofollow"
        title.insert_after(meta)

        for i 
        with open('{}'".txt".format("line"), 'w', encoding='utf-8') as f:
            outf.write(str(html))

            if (line) == 0:
                break
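
One possible reading of the symptom described above, offered as an assumption about the posted code rather than a confirmed diagnosis: in `'{}'".txt".format("line")`, Python joins the two adjacent string literals into `'{}.txt'`, and the quoted `"line"` is a fixed piece of text rather than the variable `line`. Every iteration would then write to the same name, `line.txt`, and each write would overwrite the previous one. The sketch below uses a hypothetical `url` value to contrast the two forms.

```python
# Hypothetical value standing in for one line read from urls.csv.
url = "https://www.example.com/media/headline/some-article\n"

# Adjacent string literals are concatenated first, so this is
# '{}.txt'.format("line"). The quoted "line" is a fixed string,
# so the result is identical on every iteration.
fixed_name = '{}'".txt".format("line")

# Formatting with a value derived from the variable gives a different
# name per URL. strip() drops the trailing newline, and the text after
# the final '/' becomes the file stem.
per_url_name = '{}.txt'.format(url.strip().rsplit('/', 1)[-1])

print(fixed_name)    # line.txt
print(per_url_name)  # some-article.txt
```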

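[Question comments]:

OK, I solved this myself. What I needed was a for loop, using an index and the enumerate function, so that the loop runs the scraping code for every line. I then adjusted the open call by slicing the variable (dropping the http part and keeping the real file name) before saving the file.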
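
The slicing step mentioned above depends on every URL sharing a prefix of exactly the same length. A minimal sketch of an alternative, assuming the last path segment of the URL is an acceptable file name. The helper name `filename_from_url` and the `index` fallback are hypothetical and do not come from the original post.

```python
from urllib.parse import urlparse


def filename_from_url(url, extension=".html"):
    """Build a file name from the last path segment of a URL."""
    # strip() removes the trailing newline left over from reading the file.
    path = urlparse(url.strip()).path
    # Drop a trailing slash, then keep only the text after the final '/'.
    stem = path.rstrip("/").rsplit("/", 1)[-1]
    # Fall back to a fixed stem for URLs with an empty path (an assumption).
    return (stem or "index") + extension


print(filename_from_url(
    "https://www.imgacademy.com/media/headline/"
    "img-academy-u19-girls-win-fysa-state-cup-u19-championship\n"
))
```

Because the name is derived from the parsed path, it does not change if the scheme or host differs in length between URLs.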

Tags: python web-scraping beautifulsoup readlines

[Solution 1]:

filename = 'urls.csv'

with open(filename, 'r+') as file:

    #line = line.replace('\n', '')

    print(line)

    for index, line  in enumerate(file):

        user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'

        headers = {'User-Agent':user_agent}

        print(headers)

        response = requests.get(line, headers)

        print(response)

        soup = BS(response.content, 'html.parser')

        html = soup

        title = soup.find('title')
        meta = soup.new_tag('meta')
        meta['name'] = "robots"
        meta['content'] = "noindex, nofollow"
        title.insert_after(meta)

        with open('{}.html'.format(line[41:]), 'w', encoding='utf-8') as f:
            f.write(str(html))

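[Comments]:

'r+' is the wrong file mode. The print(line) on line 7 will raise a NameError, because line is not defined at that point. enumerate() serves no purpose, because index is never used anywhere. Using the names soup and html for the same object makes the code harder to follow.
The program does run for me and actually works quite well; still, I will try your suggestions.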
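
The review comments above can be sketched as follows, under some assumptions: the file is opened read-only with `'r'`, the loop iterates over the file directly without `enumerate()`, and each object has a single name. To keep the sketch runnable without network access, the download-and-modify step is passed in as a callable. The names `read_urls`, `save_pages`, `fetch_html`, and `out_dir` are hypothetical and not part of the original answer.

```python
import os
import tempfile
from urllib.parse import urlparse


def read_urls(path):
    """Yield one URL per non-empty line of a text file."""
    # 'r' is enough here because the file is only read, never written.
    with open(path, "r", encoding="utf-8") as url_file:
        for line in url_file:
            url = line.strip()  # remove the trailing newline
            if url:
                yield url


def save_pages(urls, fetch_html, out_dir):
    """Write the HTML for each URL to its own file and return the paths."""
    saved = []
    for url in urls:
        html = fetch_html(url)
        # Name the file after the last path segment of the URL.
        stem = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1] or "index"
        out_path = os.path.join(out_dir, stem + ".html")
        with open(out_path, "w", encoding="utf-8") as out_file:
            out_file.write(html)
        saved.append(out_path)
    return saved


# Demonstration with a stand-in fetcher, so no request is actually sent.
with tempfile.TemporaryDirectory() as work_dir:
    url_list = os.path.join(work_dir, "urls.csv")
    with open(url_list, "w", encoding="utf-8") as f:
        f.write("https://example.com/news/first\n\nhttps://example.com/news/second\n")

    def fake_fetch(url):
        return "<html><title>{}</title></html>".format(url)

    written = save_pages(read_urls(url_list), fake_fetch, work_dir)
    names = sorted(os.path.basename(p) for p in written)
    print(names)
```

In real use, `fetch_html` would wrap the steps from the post: `requests.get(url, headers=headers)`, parsing with `BeautifulSoup`, inserting the `meta` tag, and returning `str(soup)`. Passing the dictionary as `headers=` matters because the second positional parameter of `requests.get` is `params`, so `requests.get(line, headers)` sends it as query-string parameters rather than as request headers.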