Python Web Scrape - For Loop Issue (Answers)

【Question title】: Python Web Scrape - For Loop Issue
【Posted】: 2017-12-14 16:15:22
【Question】:

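I have a small project to scrape a website. I already finished a scraper for another site (a competitor's), but I'm stuck on the current one.

What the code currently does is create a csv file (which is what I want). The csv file contains the headers, but there is no data below them.

Can someone help me with my for loop? I believe it isn't capturing the data to write to the csv file.

Thanks for your help.

Here is the Python script: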

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# setting urls to the websites to scrape
urls = ['https://www.extraspace.com/Storage/Facilities/US/North_Carolina/Charlotte/1000000398/Facility.aspx'
    , 'https://www.extraspace.com/Storage/Facilities/US/North_Carolina/Charlotte/1000000404/Facility.aspx']

#https://www.extraspace.com/Storage/Facilities/US/North_Carolina/Charlotte/1000000398/Facility.aspx?cid=org::maps&utm_source=google&utm_medium=organic&utm_campaign=org::maps



filename = "extraspace.csv"
open(filename, 'w').close()
f = open(filename, "a")
num = 0

headers = "unit_size, size_dim1, unit_type, online_price, reg_price, street_address, store_city, store_postalcode\n"

f.write(headers)

for my_url in urls:
    # Opening up connection, grabbing the page
    uClient = uReq(my_url)

    # naming uClient to page_html
    page_html = uClient.read()

    # closing uClient
    uClient.close()

    # this does my html parsing
    page_soup = soup(page_html, "html.parser")

    # setting container to capture where the actual info is using inspect element

    #-----   <div class="right-col-unit-listings" class="unit-listings"> ==$0    -------this is body of each unit container
    #-----   <div class="results"> ==$0    -------this is body of each unit container
    #grabs each product
    containers = page_soup.findAll("div", {"itemprop": "makesOffer"})

    #-----   <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress"> ==$0     -----this is body of address
    #grabs address
    store_locator = page_soup.findAll("div", {"itemprop": "address"})

    f.write("website " + str(num) + ": \n")
    for container in containers:
        for store_location in store_locator:
            street_address = store_location.findAll("span", {"itemprop": "streetAddress"})
            store_city = store_location.findAll("span", {"itemprop": "addressLocality"})
            store_postalcode = store_location.findAll("spand", {"itemprop": "postalCode"})
            title_container = container.findAll("div", {"class": "size RamaGothicSemiBold"})
            size_dim = container.findAll("div", {"itemprop": "description"})
            #unit_type = container.findAll("ul", {"itemprop": "description"})
            unit_container = container.ul.li
            unit_type = container.text
            online_price = container.findAll("div", {"itemprop": "price"})
            reg_price = container.findAll("div", {"class": "rate strikeout"})

        for item in zip(title_container, size_dim, unit_type, online_price, reg_price, street_address, store_city, store_postalcode):
            csv = item[0].text + "," + item[1].text + "," + item[2] + "," + item[3].text + "," + item[4].text + "," + item[5].text + "," + item[6].text + "," + item[7].text + "\n"
            f.write(csv)
    num += 1

Here is the HTML of the container:

<div itemprop="makesOffer" itemscope="" itemtype="http://schema.org/Offer">
    <div itemprop="itemOffered" itemscope="" itemtype="http://schema.org/Product">
        <div class="guide">
            <div class="size-help-lnk size-guide hidden" data-locker="False" data-square-feet="25">Size Help</div>
            <div alt="5x5" class="video-btn-5x5 video-link" onclick="trackSC('UnitListingVideo');"></div>
        </div>
        <div class="size RamaGothicSemiBold">
            <div itemprop="description">5' x 5'</div>
            <div>SMALL</div>
        </div>
        <div class="features">
            <ul itemprop="description">
                <li><i class="check-icon"></i>Enclosed Storage</li>
                <li><i class="check-icon"></i>Indoor</li>
                <li><i class="check-icon"></i>1st Floor Access</li>
            </ul>
        </div>
    </div>
    <div>
        <div class="rate strikeout">
            <div><span style="width:100%;"></span>$57</div><span class="StreetRate">IN-STORE</span></div>
        <div class="rate">
            <div content="35.00" itemprop="price">$35
                <meta content="USD" itemprop="priceCurrency" />
            </div><span class="WebRate">WEB RATE</span></div>
        <div class="promo"><span style="color:#000;">Act fast:<br/>Limited units</span></div>
    </div>
    <a class="btn btn-orange cta-test is-vehicle" href="https://www.extraspace.com/Storage/ReserveOrHold.aspx?uid=a0GC000000tUNupMAG" id="ctl00_mContent_UnitListPopular_ctrl0_hlReserveLink" onclick="upDown('unitRows|8506|1;05X05|NDN|57|35| | ; | | | | | ; | | | | | ;05X05|CDN|71|48| | ;05X07|CDN|74|50| | ');">RESERVE</a>
    <div class="clear"></div>
    <link href="http://schema.org/OnlineOnly" itemprop="availability">
    </link>
</div>

And finally, the HTML of the address:

<div itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
    <span id="ctl00_mContent_lbAddress" itemprop="streetAddress">3304 Eastway Dr<br/>Ste D</span><br/>
    <span id="ctl00_mContent_lbCity" itemprop="addressLocality">Charlotte</span>,
    <span id="ctl00_mContent_lbState" itemprop="addressRegion">NC</span>
    <span id="ctl00_mContent_lbPostalCode" itemprop="postalCode">28205</span>
</div>

【Comments】:

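Close the file at the end - if you don't close it, the data may never be written to the file. Also, open(filename, 'w').close() may not do what you expect: it opens a new file handle and closes only that new handle, not an earlier one you forgot to close.
@furas - Should I put the close() call at the end of the script?
What is your expected output for a single lead? Do you expect readers to walk through your code and figure it out? Try to be explicit about what you are asking.
@D-Ru Yes, use f.close() at the end. It tells Python or the system to write to disk any data that may still be sitting in an in-memory buffer.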
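
The advice in these comments can be sketched with a with block, which closes (and therefore flushes) the file even if an error occurs partway through. This is only a minimal sketch: the file name extraspace_demo.csv and the two rows are made up for illustration and are not taken from the real site.

```python
import os
import tempfile

# Hypothetical file name and rows, used only for illustration.
path = os.path.join(tempfile.gettempdir(), "extraspace_demo.csv")

# Mode "w" already truncates an existing file, so a separate
# open(path, "w").close() call beforehand is not needed.
with open(path, "w") as f:
    f.write("unit_size,online_price\n")
    f.write("5' x 5',$35\n")
# Leaving the with block closes f, which flushes buffered data to disk.

# Read the file back to confirm both lines were written.
with open(path) as f:
    lines = f.read().splitlines()

print(lines)
```

In the scraper itself, the whole loop over urls would sit inside the with block, so no explicit f.close() is required.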

Tags: python html python-3.x web-scraping beautifulsoup

【Solution 1】:

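Each web page has only one address, so you only need to fetch it once. That means you can get rid of one of the for loops. You also have a typo:

store_postalcode = store_location.findAll("spand", {"itemprop": "postalCode"})

The tag name "spand" matches nothing, so findAll returns an empty result instead of the postal code.

From the zip documentation (https://docs.python.org/3/library/functions.html#zip):

The iterator stops when the shortest input iterable is exhausted.

So zip stops immediately on the empty postal code result and you get no output. Even without the typo, it would stop after one iteration, when the address details run out.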
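
The zip behavior described here can be reproduced without any scraping. The sketch below uses plain lists as stand-ins for the findAll results; every value except the 5' x 5' unit and the 28205 postal code is invented for illustration.

```python
# Made-up stand-ins for the lists that findAll returns in the question.
sizes = ["5' x 5'", "5' x 10'", "10' x 10'"]
prices = ["$35", "$50", "$89"]
postal_codes_typo = []          # a search for the misspelled tag "spand" finds nothing
postal_codes_fixed = ["28205"]  # one address per page

# An empty input means zip produces no tuples at all.
rows_with_typo = list(zip(sizes, prices, postal_codes_typo))

# With the typo fixed, zip still stops after the single postal code.
rows_with_fix = list(zip(sizes, prices, postal_codes_fixed))

# Attaching the page-level address to every unit avoids the truncation.
rows_per_unit = [
    (size, price, postal_codes_fixed[0]) for size, price in zip(sizes, prices)
]

print(rows_with_typo)
print(len(rows_with_fix), len(rows_per_unit))
```

This is why the address fields are better read once per page and appended to each row, rather than passed to zip alongside the per-unit lists.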

With these fixes the code works:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import re

urls = ['https://www.extraspace.com/Storage/Facilities/US/North_Carolina/Charlotte/1000000398/Facility.aspx'
    , 'https://www.extraspace.com/Storage/Facilities/US/North_Carolina/Charlotte/1000000404/Facility.aspx']

filename = "extraspace.csv"
open(filename, 'w').close()
f = open(filename, "a")
num = 0
headers = "unit_size_0, unit_size_1, size_dim1, unit_type, online_price, reg_price, street_address, store_city, store_postalcode\n"
f.write(headers)

for my_url in urls:
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")

    street_address = page_soup.find("span", {"itemprop": "streetAddress"}).text
    store_city = page_soup.find("span", {"itemprop": "addressLocality"}).text
    store_postalcode = page_soup.find("span", {"itemprop": "postalCode"}).text

    containers = page_soup.findAll("div", {"itemprop": "makesOffer"})
    for container in containers:
        title_container = container.findAll("div", {"class": "size RamaGothicSemiBold"})
        size_dim = container.findAll("div", {"itemprop": "description"})
        unit_type = container.findAll("ul", {"itemprop": "description"})
        online_price = container.findAll("div", {"itemprop": "price"})
        reg_price = container.findAll("div", {"class": "rate strikeout"})

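        for item in zip(title_container, size_dim, unit_type, online_price, reg_price):
            # group(1): text before the first capital letter (the dimensions); group(2): the capitals (e.g. SMALL)
            i = re.match(r"([^A-Z]*)([A-Z]*)", item[0].text.replace('\n', '').strip("\""))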
            csv = i.group(1) + "," + i.group(2) + "," + item[1].text + "," + item[2].text + "," + item[3].text + "," + item[4].text + "," \
                  + street_address + "," + store_city + "," + store_postalcode + "\n"
            f.write(csv)
    num += 1

f.close()

If you prefer, you can update it to use the Python csv module: https://docs.python.org/3/library/csv.html
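
As a rough sketch of what the csv module version might look like, the example below writes a header and two rows with csv.writer. The rows are invented, the column list is shortened, and io.StringIO stands in for the real output file so the example stays self-contained.

```python
import csv
import io

header = ["unit_size", "unit_type", "online_price", "street_address"]
# Made-up rows; note the comma inside the first unit_type value.
rows = [
    ["5' x 5'", "Enclosed Storage, Indoor", "$35", "3304 Eastway Dr"],
    ["5' x 10'", "Enclosed Storage", "$50", "3304 Eastway Dr"],
]

# io.StringIO stands in for a real file; with a file on disk you would use
# open(filename, "w", newline="") as the csv documentation recommends.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

output = buffer.getvalue()
print(output)

# Reading it back shows the embedded comma survived as a single field.
parsed = list(csv.reader(io.StringIO(output)))
print(parsed[1])
```

The main benefit over joining strings with "," by hand is that csv.writer quotes fields containing commas, so a value such as a multi-feature unit type does not spill into extra columns.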

【Comments】:

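@Dan-Dev - Thank you, it works. However, I'm not sure what you mean by "each web page has only one address, so you only need to fetch it once", since I'm pulling from multiple pages (same company, different locations). Is it possible to take unit_size and split the dimensions (5' x 5') from "SMALL" into separate columns?
I mean there is only one address on each web page (one per URL). Assuming the text you want is the uppercase text, you can use the regex I've added to my updated answer.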
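
To illustrate how the regex from the answer splits the size block, here is a small standalone sketch. The input string is an assumption about what the element's text looks like, based on the container HTML in the question; the real text may carry extra whitespace.

```python
import re

# Assumed text of the "size RamaGothicSemiBold" element.
raw = "\n5' x 5'\nSMALL\n"
flattened = raw.replace("\n", "")

# group(1) takes everything up to the first capital letter,
# group(2) takes the run of capital letters that follows.
match = re.match(r"([^A-Z]*)([A-Z]*)", flattened)
dimensions = match.group(1).strip()
label = match.group(2)

print(dimensions)
print(label)
```

The pattern relies on the label being the only uppercase text: a size written with a capital X (for example 5' X 5') would be cut at the X, so the lowercase x in the page markup matters here.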