【Question title】: How do I save a txt file for each set of links?
【Posted】: 2016-07-01 14:21:23
【Question】:

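I am trying to scrape multiple pages of Yellow Pages and store the printed output in txt files. I know that no login is needed to get the data on these pages; I just wanted to practice logging in with requests.Session().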

I want to store the title of each url in set_1 in one txt file, YP_Set_1.txt, and likewise for the urls in set_2.

Here is my code.

import requests
from bs4 import BeautifulSoup
import requests.cookies
import time



s = requests.Session()

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
           'Referer': "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_login&vrid=63dbd394-afff-4794-aeb0-51dd19957ebc&merge_history=true"}

url = "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_register&vrid=cc9cb936-50d8-493b-83c6-842ec2f068ed&register=true"
r = s.get(url).content
page = s.get(url)
soup = BeautifulSoup(page.content, "lxml")
soup.prettify()

csrf = soup.find("input", value=True)["value"]

USERNAME = '****.*****@*****.***'
PASSWORD = '*******'

cj = s.cookies
requests.utils.dict_from_cookiejar(cj)

login_data = dict(email=USERNAME, password=PASSWORD, _csrf=csrf)

s.post(url, data=login_data, headers=headers)

set_1 = "This is the first set."

targeted_pages = ['https://www.yellowpages.com/brookfield-wi/business',
                  'https://www.yellowpages.com/bronx-ny/cheap-party-halls',
                  'https://www.yellowpages.com/bronx-ny/24-hour-liquor-store',
                  'https://www.yellowpages.com/bronx-ny/24-hour-oil-change',
                  'https://www.yellowpages.com/bronx-ny/auto-insurance',
                  'https://www.yellowpages.com/bronx-ny/awnings-canopies',
                  'https://www.yellowpages.com/bronx-ny/golden-corral',
                  'https://www.yellowpages.com/bronx-ny/concrete-contractors',
                  'https://www.yellowpages.com/bronx-ny/automobile-salvage',
                  'https://www.yellowpages.com/bronx-ny/24-hour-daycare-centers',
                  'https://www.yellowpages.com/bronx-ny/movers',
                  'https://www.yellowpages.com/bronx-ny/nursing-homes',
                  'https://www.yellowpages.com/bronx-ny/signs'
                  ]
for target_urls in targeted_pages:
    targeted_page = s.get(target_urls, headers=headers, cookies=cj)
    targeted_soup = BeautifulSoup(targeted_page.content, "lxml")

    for record in targeted_soup.findAll('title'):
        with open("YP_Set_1.txt", "w") as text_file:
            print(set_1 + '\n' + record.text, file=text_file)
time.sleep(5)

set_2 = "This is the second set."

targeted_pages_2 = ['https://www.yellowpages.com/north-miami-beach-fl/attorneys',
                    'https://www.yellowpages.com/north-miami-beach-fl/employment-agencies',
                    'https://www.yellowpages.com/north-miami-beach-fl/dentists',
                    'https://www.yellowpages.com/north-miami-beach-fl/general-contractors',
                    'https://www.yellowpages.com/north-miami-beach-fl/electricians',
                    'https://www.yellowpages.com/north-miami-beach-fl/pawnbrokers',
                    'https://www.yellowpages.com/north-miami-beach-fl/lighting-fixtures',
                    'https://www.yellowpages.com/north-miami-beach-fl/towing'
                    ]
for target_urls_2 in targeted_pages_2:
    targeted_page_2 = s.get(target_urls_2, headers=headers, cookies=cj)
    targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml")

    for record in targeted_soup_2.findAll('title'):
        with open("YP_Set_2.txt", "w") as text_file:
            print(set_2 + '\n' + record.text, file=text_file)

When I run the code, this is the content of YP_Set_1.txt.

This is the first set.
Signs in Bronx, New York with Reviews & Ratings - YP.com

And the content of YP_Set_2.txt.

This is the second set.
Towing in North Miami Beach, Florida with Reviews & Ratings - YP.com

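Is there a quick fix that would let me store the titles of all the urls in a set in the text file, instead of only getting the title of the last url in the set? Thanks for any input.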

【Question comments】:

    Tags: python-3.x web-scraping beautifulsoup python-requests


    【Solution 1】:

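    You keep opening the file inside the loop, and mode "w" truncates the file every time it is opened, so each title overwrites the previous one. You could open with "a" to append instead of "w", but it is easier to open the file once, outside the loops: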

    # Open the file once, outside the loop, so earlier titles are not overwritten.
    with open("YP_Set_2.txt", "w") as text_file:
        for target_urls_2 in targeted_pages_2:
            targeted_page_2 = s.get(target_urls_2, headers=headers, cookies=cj)
            targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml")

            for record in targeted_soup_2.find_all('title'):
                # End each entry with a newline so titles do not run together.
                text_file.write(set_2 + '\n' + record.text + '\n')
    

    Do the same for both blocks.
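
To see the difference without hitting the network, here is a minimal sketch that contrasts the two patterns on a temporary directory. The helper names `save_titles` and `save_titles_reopening` and the sample titles are made up for illustration; they stand in for the scraped `<title>` texts. Unlike the snippet above, this sketch writes the set header only once at the top of the file, which is an assumption about the desired layout.

```python
import os
import tempfile


def save_titles(path, header, titles):
    """Open the file once, then write the header and every title on its own line."""
    with open(path, "w", encoding="utf-8") as text_file:
        text_file.write(header + "\n")
        for title in titles:
            text_file.write(title + "\n")


def save_titles_reopening(path, header, titles):
    """Reproduce the bug: reopening with "w" truncates the file on every pass."""
    for title in titles:
        with open(path, "w", encoding="utf-8") as text_file:
            text_file.write(header + "\n" + title + "\n")


# Hypothetical stand-ins for the titles scraped from each url in a set.
sample_titles = ["Movers in Bronx", "Nursing Homes in Bronx", "Signs in Bronx"]

workdir = tempfile.mkdtemp()
good_path = os.path.join(workdir, "YP_Set_1.txt")
bad_path = os.path.join(workdir, "YP_Set_1_overwritten.txt")

save_titles(good_path, "This is the first set.", sample_titles)
save_titles_reopening(bad_path, "This is the first set.", sample_titles)

with open(good_path, encoding="utf-8") as f:
    good_lines = f.read().splitlines()
with open(bad_path, encoding="utf-8") as f:
    bad_lines = f.read().splitlines()

print(good_lines)  # header followed by all three titles
print(bad_lines)   # header followed by the last title only
```

With one helper like this, each set only needs its own file name, header, and list of titles, so the same call can be reused for both sets.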

    【Comments】:

    • Thanks again for your help.