How do I save a txt file for each set of links? (Answers)

【Question title】: How do I save a txt file for each set of links?
【Posted】: 2016-07-01 14:21:23
【Question】:

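I'm trying to scrape multiple pages of Yellow Pages and store the printed output in txt files. I know that getting the data on these pages doesn't require logging in; I just want to practice logging in with requests.Session().

I want to store the title of every url in set_1 in one txt file, YP_Set_1.txt, and do the same for the urls in set_2 with YP_Set_2.txt.

Here is my code.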

import requests
from bs4 import BeautifulSoup
import requests.cookies
import time



s = requests.Session()

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
           'Referer': "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_login&vrid=63dbd394-afff-4794-aeb0-51dd19957ebc&merge_history=true"}

url = "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_register&vrid=cc9cb936-50d8-493b-83c6-842ec2f068ed&register=true"
r = s.get(url).content
page = s.get(url)
soup = BeautifulSoup(page.content, "lxml")
soup.prettify()

csrf = soup.find("input", value=True)["value"]

USERNAME = '****.*****@*****.***'
PASSWORD = '*******'

cj = s.cookies
requests.utils.dict_from_cookiejar(cj)

login_data = dict(email=USERNAME, password=PASSWORD, _csrf=csrf)

s.post(url, data=login_data, headers=headers)

set_1 = "This is the first set."

targeted_pages = ['https://www.yellowpages.com/brookfield-wi/business',
                  'https://www.yellowpages.com/bronx-ny/cheap-party-halls',
                  'https://www.yellowpages.com/bronx-ny/24-hour-liquor-store',
                  'https://www.yellowpages.com/bronx-ny/24-hour-oil-change',
                  'https://www.yellowpages.com/bronx-ny/auto-insurance',
                  'https://www.yellowpages.com/bronx-ny/awnings-canopies',
                  'https://www.yellowpages.com/bronx-ny/golden-corral',
                  'https://www.yellowpages.com/bronx-ny/concrete-contractors',
                  'https://www.yellowpages.com/bronx-ny/automobile-salvage',
                  'https://www.yellowpages.com/bronx-ny/24-hour-daycare-centers',
                  'https://www.yellowpages.com/bronx-ny/movers',
                  'https://www.yellowpages.com/bronx-ny/nursing-homes',
                  'https://www.yellowpages.com/bronx-ny/signs'
                  ]
for target_urls in targeted_pages:
    targeted_page = s.get(target_urls, headers=headers, cookies=cj)
    targeted_soup = BeautifulSoup(targeted_page.content, "lxml")

    for record in targeted_soup.findAll('title'):
        with open("YP_Set_1.txt", "w") as text_file:
            print(set_1 + '\n' + record.text, file=text_file)
time.sleep(5)

set_2 = "This is the second set."

targeted_pages_2 = ['https://www.yellowpages.com/north-miami-beach-fl/attorneys',
                    'https://www.yellowpages.com/north-miami-beach-fl/employment-agencies',
                    'https://www.yellowpages.com/north-miami-beach-fl/dentists',
                    'https://www.yellowpages.com/north-miami-beach-fl/general-contractors',
                    'https://www.yellowpages.com/north-miami-beach-fl/electricians',
                    'https://www.yellowpages.com/north-miami-beach-fl/pawnbrokers',
                    'https://www.yellowpages.com/north-miami-beach-fl/lighting-fixtures',
                    'https://www.yellowpages.com/north-miami-beach-fl/towing'
                    ]
for target_urls_2 in targeted_pages_2:
    targeted_page_2 = s.get(target_urls_2, headers=headers, cookies=cj)
    targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml")

    for record in targeted_soup_2.findAll('title'):
        with open("YP_Set_2.txt", "w") as text_file:
            print(set_2 + '\n' + record.text, file=text_file)

When I run the code, this is the content of YP_Set_1.txt.

This is the first set.
Signs in Bronx, New York with Reviews & Ratings - YP.com

And this is the content of YP_Set_2.txt.

This is the second set.
Towing in North Miami Beach, Florida with Reviews & Ratings - YP.com

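Is there a quick fix that would let me store the titles of all the urls in a set in its text file, instead of only the title of the last url in the set? Thanks for any input.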

【Question comments】:

Tags: python-3.x web-scraping beautifulsoup python-requests

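【Solution 1】:

You keep reopening the file with "w" inside the loop, and "w" truncates the file every time it is opened, so each title overwrites the previous one and only the last survives. You could open with "a" to append instead of "w", but it is simpler to open the file once, outside the loop: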

with open("YP_Set_2.txt", "w") as text_file:
    for target_urls_2 in targeted_pages_2:
        targeted_page_2 = s.get(target_urls_2, headers=headers, cookies=cj)
        targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml")

        for record in targeted_soup_2.find_all('title'):
            # write() does not append a newline the way print() does
            text_file.write(set_2 + '\n' + record.text + '\n')

Do the same for both blocks.
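
Since both blocks differ only in the file name, the header line and the url list, the open-once pattern could also be pulled into a small helper. The following is only a sketch under assumptions: `save_titles` and its `fetch_title` parameter are hypothetical names that do not appear in the question, and the header is written once at the top of the file rather than before every title.

```python
from typing import Callable, Iterable


def save_titles(path: str, header: str, urls: Iterable[str],
                fetch_title: Callable[[str], str]) -> int:
    """Write the header once, then one title per line; return the title count.

    fetch_title is a hypothetical callable (not part of the question's code)
    that maps a url to the text of that page's <title>.
    """
    count = 0
    # Open the file a single time, so "w" truncates it only once.
    with open(path, "w", encoding="utf-8") as text_file:
        text_file.write(header + "\n")
        for url in urls:
            # write() adds no newline on its own, so append one per title.
            text_file.write(fetch_title(url) + "\n")
            count += 1
    return count
```

With the session from the question, `fetch_title` might be a small function that calls `s.get(url, headers=headers)`, parses the response with `BeautifulSoup(..., "lxml")` and returns the title text. The helper would then be called once as `save_titles("YP_Set_1.txt", set_1, targeted_pages, fetch_title)` and once more with `"YP_Set_2.txt"`, `set_2` and `targeted_pages_2`.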

【Comments】:

Thanks again for your help.