【Question title】: Issue with scraping data using Beautiful Soup
【Posted】: 2013-01-15 01:39:35
【Question】:

I am using the following code to scrape data from a website.

# -*- coding: cp1252 -*-
import urllib2
import sys
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.att.com/shop/wireless/plans-new.html#fbid=U-XD_DHOGEp').read()
soup = BeautifulSoup(page)
plans = soup.findAll('div', {"class": "planTitle"})
for plan in plans:
    planname = u' '.join(plan.stripped_strings)
    plantypes = soup.findAll('div', {"class":"top"})
    prices = soup.findAll('div', {"class":"bottom"})
    for plantype, price in zip(plantypes, prices):
        plantype1 = u' '.join(plantype.stripped_strings)
        price1 = u' '.join(price.stripped_strings)
        print planname, plantype1, price1

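Problem: If you go through the web page mentioned in this code, there are 4-5 plans, each with 3 voice options and 2-3 data options. I want to scrape the data so that, for each plan, I get its own voice options followed by the monthly price of each option.

The code I am running now returns every possible combination of plan name and voice option. For each plan name I get about 20-30 entries, because an entry is created even for wrong plan name + voice option combinations. For example: Individual plan - 550 minutes - $59.99. In this combination, 550 minutes and $59.99 are actually part of the Family plan.

I want the loop to run so that only the correct plan + voice option combinations are extracted.

Web page snippet: For each plan, the page has a box containing the voice options and the prices corresponding to them. I want to run the loop once per box, but the element + class combination for the voice options and their prices is not unique. That is why a plan name also picks up values from the other boxes.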

<div class="innerContainer"> 

    <div class="planTitle"> 
        <h2><a href="http://www.att.com/shop/wireless/plans/individualplans.html" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010">AT&amp;T Individual Plans</a></h2> 
    </div> 
    <div class="planSubTitle"> 
        <img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/tiny-clock.jpg" alt=""> 
        <p>Voice plan options:</p> 
    </div> 
    <!-- Begin three white boxes --> 
    <!-- Note, extra boxes can be added to the row with the following method  --> 
    <!-- 1. Add more div containers inside .whiteBox  --> 
    <!-- 2. Modify class names to boxes_one, boxes_two, boxes_three etc... (max six) --> 
    <div class="whiteBox"> 
        <div class="boxes_three"> 
            <a class="lnk-help tooltips fullBoxLink" href="#smartphone_individual_voice_450" onclick="window.location.href = 'http://www.att.com/shop/wireless/plans/voice/sku3830290.html?source=IC95ATPLP00PSP00L&amp;wtExtndSource=spindvoice450';return false;" aria-describedby="smartphone_individual_voice_450" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010" title=""></a> 
            <span id="smartphone_individual_voice_450" class="tips" role="tooltip">$0.45/min. for additional minutes</span> 
            <div class="top"> 
                <p class="stat">450</p> 
                <p class="statText">Minutes</p> 
            </div> 
            <div class="bottom"> 
                <p>$39.99/mo.</p> 
            </div> 
        </div> 
        <div class="boxes_three"> 
            <a class="lnk-help tooltips fullBoxLink" href="#smartphone_individual_voice_900" onclick="window.location.href = 'http://www.att.com/shop/wireless/plans/voice/sku3830292.html?source=IC95ATPLP00PSP00L&amp;wtExtndSource=spindvoice900';return false;" aria-describedby="smartphone_individual_voice_900" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010" title=""></a> 
            <span id="smartphone_individual_voice_900" class="tips" role="tooltip">$0.40/min. for additional minutes</span> 
            <div class="top"> 
                <p class="stat">900</p> 
                <p class="statText">Minutes</p> 
            </div> 
            <div class="bottom"> 
                <p>$59.99/mo.</p> 
            </div> 
        </div> 
        <div class="boxes_three borderNone"> 
            <a class="fullBoxLink" href="http://www.att.com/shop/wireless/plans/voice/sku3830293.html?source=IC95ATPLP00PSP00L&amp;wtExtndSource=spindvoiceunlim" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a> 
            <div class="top"> 
                <p class="stat">Unlimited</p> 
                <p class="statText">Minutes</p> 
            </div> 
            <div class="bottom"> 
                <p>$69.99/mo.</p> 
            </div> 
        </div> 
    </div> 
    <!-- End three white boxes --> 
    <!-- Begin left gray container --> 
    <div class="containerTwoThirds"> 
        <div class="planSubTitle"> 
            <img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/tiny-globe.jpg" alt=""> 
            <p>Data plan options:</p> 
        </div> 
        <div class="grayTwoThirds"> 
            <div class="grayBox"> 
                <a class="fullBoxLink" href="http://www.att.com/shop/wireless/services/dataplus300mb-smartphone4glte-sku5380269.html?source=IC95ATPLP00PSP00L&amp;wtExtndSource=spinddata300mb" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a> 
                <p class="stat"><strong>300MB</strong></p> 
                <p class="statText">$20.00/mo.</p> 
            </div> 
            <div class="grayBoxBreak"></div> 
            <div class="grayBox"> 
                <a class="fullBoxLink" href="http://www.att.com/shop/wireless/services/datapro3gb-smartphone4glte-sku5470232.html?source=IC95ATPLP00PSP00L&amp;wtExtndSource=spinddata3gb" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a> 
                <p class="stat"><strong>3GB</strong></p> 
                <p class="statText">$30.00/mo.</p> 
            </div> 
            <div class="grayBoxBreak"></div> 
            <div class="grayBox"> 
                <a class="fullBoxLink" href="http://www.att.com/shop/wireless/services/datapro5gb-smartphone4glte-sku5480228.html?source=IC95ATPLP00PSP00L&amp;wtExtndSource=spinddata5gb" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a> 
                <p class="stat"><strong>5GB</strong></p> 
                <p class="statText">$50.00/mo.</p> 
            </div> 
        </div> 
    </div> 
    <!-- End left gray container --> 
    <!-- Begin right gray container --> 
    <div class="containerThird"> 
        <div class="planSubTitle"> 
            <img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/tiny-phone.jpg" alt=""> 
            <p>Messaging plan options: <span class="fix"></span></p> 
        </div> 
        <div class="grayThird"> 
            <div class="grayBox">  
                <a data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2012325" href="http://www.att.com/shop/wireless/services/messagingunlimited-sku1160055.html?source=IC95ATPLP00PSP00L&amp;wtExtndSource=spindmessunlim" class="fullBoxLink"></a>  
                <p class="stat"><strong>ULTD</strong> MSGS</p>  
                <p class="statText">$20.00/mo.</p>  
            </div> 
            <div class="grayBoxBreak"></div> 
            <div class="grayBox last"> 
                <p class="stat"><strong>PAY PER USE</strong></p> 
                <p class="statText">20¢/text <span class="lightGray">|</span> 30¢/pic/video</p> 
            </div> 
        </div> 
    </div> 
    <!-- End right gray container --> 
    <!-- Begin sub footer --> 
    <div class="bottomLinks">  
        <div class="links"> 
            <a href="http://www.att.com/shop/wireless/plans/individualplans.html?taxoPlan=POSTPAID-INDIVIDUAL-CANADA&amp;source=IC95ATPLP00PSP00L&amp;wtExtndSource=spindcanada" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010">Nation with Canada Plans</a> | <a href="http://www.att.com/shop/wireless/plans/voice/sku5740279.html?source=IC95ATPLP00PSP00L&amp;wtExtndSource=spindhomephone" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010">Unlimited Home Phone</a> | <a href="http://www.att.com/shop/wireless/plans/voice/sku3830294.html?source=IC95ATPLP00PSP00L&amp;wtExtndSource=spindsenior" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010">Senior Plans</a> 
        </div> 
        <a class="shop_button" href="http://www.att.com/shop/wireless/devices/smartphones.html?source=IC95ATPLP00PSP00L&amp;wtExtndSource=indshopsp" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"><img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/buttons/shop_smartphones.png" alt="Shop Smartphones" width="158" height="29"></a> 
    </div> 
    <!-- End sub footer --> 
</div>

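As I am new to programming, please help me solve this problem.

【Comments】:

  • Could you please provide an HTML sample? As it stands, this question is really too localized (as soon as the ATT website changes, your question will be of no use to future visitors).
  • I have added an HTML snippet of a sample box for one plan.
  • @MartijnPieters Do I need to add or mention anything else to make this question more general?
  • No, the HTML snippet is great. I don't have time to answer in detail right now, but that won't stop others from helping.
  • @atams Completely rewrote my answer; I hope it works for you.

Tags: python html python-2.7 beautifulsoup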


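【Solution 1】:

Rewritten from scratch. There are no comments, but it is fairly self-explanatory. The lambda in the dictionary is used to find tags whose class attribute starts with a certain string. I referred to this answer for that: https://stackoverflow.com/a/2830550/541208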
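To show the lambda-style filter in isolation, here is a minimal sketch that runs offline. It is not part of the original answer: it assumes Python 3 and a small hand-written HTML fragment modeled on the question's snippet, uses `find_all` (the current bs4 spelling of `findAll`), and the helper name `starts_with_boxes` is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the question's snippet (not the live page).
HTML = """
<div class="whiteBox">
  <div class="boxes_three"><p>450</p></div>
  <div class="boxes_three borderNone"><p>Unlimited</p></div>
  <div class="other"><p>ignored</p></div>
  <div><p>no class attribute</p></div>
</div>
"""


def starts_with_boxes(value):
    # The filter may be called with None for tags that have no class
    # attribute, so guard against None before calling startswith.
    return value is not None and value.startswith("boxes_")


soup = BeautifulSoup(HTML, "html.parser")

# For a multi-valued attribute such as class, each class name is tested,
# so "boxes_three borderNone" matches through its first class.
boxes = soup.find_all("div", {"class": starts_with_boxes})
labels = [box.get_text(strip=True) for box in boxes]
print(labels)
```

The `x and x.startswith(...)` form used in the answer performs the same None guard in a single expression.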

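I originally thought the problem was that you were calling findAll on soup when you should have been calling plan.findAll, but that did not help, so I just rewrote the whole thing.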

import urllib2
import sys
from bs4 import BeautifulSoup


page = urllib2.urlopen('http://www.att.com/shop/wireless/plans-new.html#fbid=U-XD_DHOGEp').read()
soup = BeautifulSoup(page)

# find the container for all the plans
tabcontent = soup.find('div', {"id": "smartphonePlans", "class": "tabcontent"})
containers = tabcontent.findAll('div', {"class": "innerContainer"})

for plan in containers:
    planTitle = plan.find("div", {"class": "planTitle"})
    if planTitle:
        title = planTitle.find("a").text
        print title

    voiceBoxes = plan.find("div", {"class": "whiteBox"})
    if voiceBoxes:
        box3 = voiceBoxes.findAll("div", {"class": lambda x: x and x.startswith("boxes_")})
        if box3:
            for box in box3:
                top = box.findAll("p")
                minutes = u" ".join([tag.text for tag in top])
                print "\t", minutes

Which outputs:

AT&T Individual Plans
    450 Minutes $39.99/mo.
    900 Minutes $59.99/mo.
    Unlimited Minutes $69.99/mo.
AT&T Family Plans
    550 Minutes $59.99/mo.
    700 Minutes $69.99/mo.
    1,400 Minutes $89.99/mo.
    2,100 Minutes $109.99/mo.
    Unlimited Minutes $119.99/mo.
AT&T Mobile Share Plans
    1GB $40/mo. + $45/smartphone
    4GB $70/mo. + $40/smartphone
    6GB $90/mo. + $35/smartphone
    10GB $120/mo.
    15GB $160/mo. + $30/smartphone
    20GB $200/mo.

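【Comments】:

  • The code above gives the output for the voice plans, but what if I also want the output for the data and messaging plans shown in the gray boxes? I tried a similar approach, but got no output.
  • Try opening the page in Chrome/Firefox with the developer tools enabled so that you can "Inspect Element", then trace the path to each grayBox from the outer container and try to recreate it in code. Put plenty of breakpoints after each findAll to see what data is returned and confirm that you have selected the right part.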
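
Following the path-tracing suggestion above, here is one possible sketch for the gray boxes. It is an assumption-laden illustration rather than a tested solution for the live page: it assumes Python 3, the class names visible in the question's snippet (containerTwoThirds, containerThird, planSubTitle, grayBox), a trimmed hand-written fragment, and a hypothetical helper named `gray_box_options`.

```python
from bs4 import BeautifulSoup

# Trimmed, hand-written fragment based on the question's snippet.
HTML = """
<div class="innerContainer">
  <div class="containerTwoThirds">
    <div class="planSubTitle"><p>Data plan options:</p></div>
    <div class="grayTwoThirds">
      <div class="grayBox"><p class="stat"><strong>300MB</strong></p><p class="statText">$20.00/mo.</p></div>
      <div class="grayBoxBreak"></div>
      <div class="grayBox"><p class="stat"><strong>3GB</strong></p><p class="statText">$30.00/mo.</p></div>
    </div>
  </div>
  <div class="containerThird">
    <div class="planSubTitle"><p>Messaging plan options: <span class="fix"></span></p></div>
    <div class="grayThird">
      <div class="grayBox"><p class="stat"><strong>ULTD</strong> MSGS</p><p class="statText">$20.00/mo.</p></div>
      <div class="grayBoxBreak"></div>
      <div class="grayBox last"><p class="stat"><strong>PAY PER USE</strong></p><p class="statText">20¢/text</p></div>
    </div>
  </div>
</div>
"""


def gray_box_options(container):
    """Map each sub-section heading to the text of its gray boxes."""
    result = {}
    # A list passed to class_ matches a tag that has any of these classes.
    sections = container.find_all("div", class_=["containerTwoThirds", "containerThird"])
    for section in sections:
        heading_div = section.find("div", class_="planSubTitle")
        if heading_div is None:
            continue
        heading = " ".join(heading_div.stripped_strings)
        # class_="grayBox" matches "grayBox" and "grayBox last",
        # but not the separate class name "grayBoxBreak".
        boxes = section.find_all("div", class_="grayBox")
        result[heading] = [" ".join(box.stripped_strings) for box in boxes]
    return result


soup = BeautifulSoup(HTML, "html.parser")
container = soup.find("div", class_="innerContainer")
options = gray_box_options(container)
for heading, values in options.items():
    print(heading, values)
```

In a full script the helper would presumably be called once per innerContainer, alongside the voice-plan loop. If it returns nothing on the real page, printing the intermediate `sections` list is a quick way to see which step of the path fails.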