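【Question title】: python webscrape cleaning
【Posted】: 2018-04-18 17:31:56
【Question description】:

I am using Python and BeautifulSoup to capture and print the following: Small 5' x 10' Outside unit/Drive-up access $56/mo. $70 In-store

I managed to get it to print the unit size (Small) and the unit type (Outside unit/Drive-up access) correctly. The other fields, however, print the correct data along with the "div class" text around it.

Does anyone know how I can capture these fields cleanly? My code is below: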

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup


#setting my_url to the website
my_url = 'https://www.publicstorage.com/north-carolina/self-storage-charlotte-nc/28206-self-storage/2334?lat=35.23552&lng=-80.83296&clp=1&sp=Charlotte|35.2270869|-80.8431267&ismi=1'

#Opening up connection, grabbing the page
uClient = uReq(my_url)

#naming uClient to page_html
page_html = uClient.read()

#closing uClient
uClient.close()

#this does my html parsing
page_soup = soup(page_html, "html.parser")

#setting container to capture where the actual info is using inspect element
#grabs each product
containers = page_soup.findAll("li",{"class":"srp_res_row plp"})

filename = "product.csv"
f = open(filename, "w")

headers = "unit_size, size_dim, unit_type, online_price, reg_price\n"

f.write(headers)

for container in containers:
    title_container = container.div.div
    unit_size = title_container.text
    size_dim = container.findAll("div", {"class":"srp_label srp_font_14"})
    unit_container = container.li
    unit_type = unit_container.text
    online_price = container.findAll("div", {"class":"srp_label alt-price"})
    reg_price = container.findAll("div", {"class":"reg-price"})


    print("unit_size: " + str(unit_size))
    print("size_dim: " + str(size_dim))
    print("unit_type: " + str(unit_type))
    print("online_price: " + str(online_price))
    print("reg_price: " + str(reg_price))

    f.write(str(unit_size) + "," +str(size_dim) + "," +str(unit_type) + "," 
    +str(online_price) + "," +str(reg_price) + "\n")
f.close()

<li class="srp_res_row plp">
    <div class="srp_res_clm srp_clm160">
        <div class="srp_label plp">Small</div>
        <div class="srp_v-space_3"></div>
        <div class="srp_label srp_font_14" style="padding-left: 5px;">5' x 10'</div>
        <div class="srp_v-space_3"></div>
    </div>
    <div class="srp_res_clm srp_clm120">
        <ul class="srp_list">
            <li>Outside unit/Drive-up access</li>
        </ul>
    </div>
    <div class="srp_res_clm srp_clm90">
        <div class="srp_label">$1<span class="srp_label_symbol">†</span></div>
        <div class="srp_v-space_10">1st Month</div>
    </div>
    <div class="srp_res_clm srp_clm90">
        <div class="srp_label alt-price">$56/mo.</div>
        <div class="online-special">Online Special<span class="srp_label_symbol">†</span></div>
        <div class="srp_v-space_15"></div>
        <div class="reg-price">$70 In-store</div>
    </div>
    <div class="srp_res_clm srp_clm100 srp_vcenter"><a class="srp_continue unit-no-deposit" data-deposit-amount="0" data-deposit-days="0" data-features="Outside unit/Drive-up access" data-marketing-size="5x10" data-ppk="altproduct_price" data-promotionid="132" data-siteid="2334" data-size-description="5' x 10'" data-sizeid="613573" data-wc2-unit="false" href="/ReservationDetails.aspx?st=2334&amp;sz=613573&amp;key=[rnd]&amp;location=&amp;plp=1&amp;rk=&amp;ismi=1&amp;sp=Charlotte%7c35.2270869%7c-80.8431267&amp;clp=1"><img alt="Continue" src="/images/srp-cont-new-80.png" style="width: 80px; height: 32px"/></a></div>
</li>

【Discussion】:

    Tags: python beautifulsoup code-formatting


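    【Solution 1】:

    find_all returns a ResultSet object, which you can iterate over with a for loop. Calling str() on the ResultSet prints the tags themselves, which is why the markup shows up in your output.

    Try this code: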

    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup
    
    
    #setting my_url to the website
    my_url = 'https://www.publicstorage.com/north-carolina/self-storage-charlotte-nc/28206-self-storage/2334?lat=35.23552&lng=-80.83296&clp=1&sp=Charlotte|35.2270869|-80.8431267&ismi=1'
    
    #Opening up connection, grabbing the page
    uClient = uReq(my_url)
    
    #naming uClient to page_html
    page_html = uClient.read()
    
    #closing uClient
    uClient.close()
    
    #this does my html parsing
    page_soup = soup(page_html, "html.parser")
    
    #setting container to capture where the actual info is using inspect element
    #grabs each product
    containers = page_soup.findAll("li",{"class":"srp_res_row plp"})
    
    filename = "product.csv"
    f = open(filename, "w")
    
    headers = "unit_size, size_dim, unit_type, online_price, reg_price\n"
    
    f.write(headers)
    
    for container in containers:
        title_container = container.div.div
        unit_size = title_container.text
        size_dim = container.findAll("div", {"class":"srp_label srp_font_14"})
        unit_container = container.li
        unit_type = unit_container.text
        online_price = container.findAll("div", {"class":"srp_label alt-price"})
        reg_price = container.findAll("div", {"class":"reg-price"})
    
    
    
        print("unit_size: " + str(unit_size))
        print("size_dim: {}".format("".join(list(i.text for i in size_dim))))  #edited line
        print("unit_type: " + str(unit_type))
        print("Online Price: {}".format("".join(list(i.text for i in online_price)))) #edited line
        print("Online Price: {}".format("".join(list(i.text for i in reg_price)))) #edited line
    
    
        f.write(str(unit_size) + "," +str(size_dim) + "," +str(unit_type) + ","
        +str(online_price) + "," +str(reg_price) + "\n")
    f.close()
    

    Output:

    unit_size: Small
    size_dim: 5' x 10'
    unit_type: Outside unit/Drive-up access
    Online Price: $52/mo.
    Online Price: $64 In-store
    unit_size: Medium
    size_dim: 5' x 15'
    unit_type: Outside unit/Drive-up access
    Online Price: $80/mo.
    Online Price: $100 In-store
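
The difference between the raw ResultSet and its text can be reproduced offline. The sketch below is only an illustration: it parses a trimmed copy of the `<li>` row quoted in the question instead of fetching the live page, and the names `SAMPLE_ROW`, `as_markup`, and `reg_price` are made up for the example. It assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the <li> row quoted in the question (not fetched live).
SAMPLE_ROW = """
<li class="srp_res_row plp">
    <div class="srp_res_clm srp_clm160">
        <div class="srp_label plp">Small</div>
        <div class="srp_label srp_font_14" style="padding-left: 5px;">5' x 10'</div>
    </div>
    <div class="srp_res_clm srp_clm90">
        <div class="srp_label alt-price">$56/mo.</div>
        <div class="reg-price">$70 In-store</div>
    </div>
</li>
"""

row = BeautifulSoup(SAMPLE_ROW, "html.parser").li

# find_all returns a ResultSet (a list of Tag objects), so str() includes the markup.
matches = row.find_all("div", {"class": "reg-price"})
as_markup = str(matches)

# find returns the first matching Tag (or None); get_text drops the markup.
reg_tag = row.find("div", {"class": "reg-price"})
reg_price = reg_tag.get_text(strip=True) if reg_tag is not None else ""

print(as_markup)  # the tag and its text, wrapped in list brackets
print(reg_price)  # only the text
```

When each row holds exactly one element of a given class, `find` plus `get_text` may be simpler than joining the text of every item in a ResultSet.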
    

    Updated code based on the comments:

    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup
    
    
    #setting my_url to the website
    my_url = 'https://www.publicstorage.com/north-carolina/self-storage-charlotte-nc/28206-self-storage/2334?lat=35.23552&lng=-80.83296&clp=1&sp=Charlotte|35.2270869|-80.8431267&ismi=1'
    
    #Opening up connection, grabbing the page
    uClient = uReq(my_url)
    
    #naming uClient to page_html
    page_html = uClient.read()
    
    #closing uClient
    uClient.close()
    
    #this does my html parsing
    page_soup = soup(page_html, "html.parser")
    
    #setting container to capture where the actual info is using inspect element
    #grabs each product
    containers = page_soup.findAll("li",{"class":"srp_res_row plp"})
    
    filename = "product.csv"
    f = open(filename, "w")
    
    headers = "unit_size,size_dim,unit_type,online_price,reg_price\n"
    
    f.write(headers)
    
    for container in containers:
        title_container = container.div.div
        unit_size = title_container.text
        size_dim = container.findAll("div", {"class":"srp_label srp_font_14"})
        unit_container = container.li
        unit_type = unit_container.text
        online_price = container.findAll("div", {"class":"srp_label alt-price"})
        reg_price = container.findAll("div", {"class":"reg-price"})
    
    
        # unit_size and unit_type are already strings; zipping a string would yield single characters
        for dim, online, reg in zip(size_dim, online_price, reg_price):
            row = unit_size + "," + dim.text + "," + unit_type + "," + online.text + "," + reg.text + "\n"
            f.write(row)

    f.close()
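
Joining fields with "," by hand breaks as soon as a field contains a comma or a quote. As a hedged alternative, the standard-library `csv` module can handle the quoting. The sketch below is not the answerer's code: `write_units` and `HEADER` are hypothetical names, and the rows are copied from the printed output shown earlier in this answer instead of being scraped.

```python
import csv
import io

# Column names taken from the header line used in the scripts above.
HEADER = ["unit_size", "size_dim", "unit_type", "online_price", "reg_price"]


def write_units(rows, fileobj):
    """Write the header and one CSV line per unit to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(HEADER)
    writer.writerows(rows)


# Values copied from the printed output in this answer (illustrative only).
rows = [
    ["Small", "5' x 10'", "Outside unit/Drive-up access", "$52/mo.", "$64 In-store"],
    ["Medium", "5' x 15'", "Outside unit/Drive-up access", "$80/mo.", "$100 In-store"],
]

# An in-memory buffer keeps the example free of side effects.
buffer = io.StringIO()
write_units(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass `open("product.csv", "w", newline="")` inside a `with` block instead of the `StringIO` buffer; `newline=""` is what the `csv` documentation recommends so that no blank lines appear between rows on Windows.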
    

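    【Comments】:

    • Paul, this prints it correctly, but the Excel file it creates does not show the same thing. What could be the reason?
    • @D-Ru Because you are not storing the output from my edited code; you are still storing the old, incorrect output.
    • @Paul, sorry, I am very new to this, but how would I store the new output? I assume the loop is capturing the old output.
    • @D-Ru Do you want to both print and store it? If you only need to store the output, why print it to the terminal at all?
    • @Paul, storing it in Excel is the main goal. I print it to the terminal to quickly check that it captured the correct data.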