【Question title】: Python: How to access and iterate over a list of div class element using (BeautifulSoup)
【Posted】: 2018-06-05 23:32:03
【Question】:

I am using BeautifulSoup to parse data about car production (see also my first question):

from bs4 import BeautifulSoup
import string

html = """
<h4>Production Capacity (year)</h4>
    <div class="profile-area">
      Vehicle 1,140,000 units /year
    </div>
<h4>Output</h4>
    <div class="profile-area">
      Vehicle 809,000 units ( 2016 ) 
    </div>
    <div class="profile-area">
      Vehicle 815,000 units ( 2015 ) 
    </div>
    <div class="profile-area">
      Vehicle 836,000 units ( 2014 ) 
    </div>
    <div class="profile-area">
      Vehicle 807,000 units ( 2013 ) 
    </div>
    <div class="profile-area">
      Vehicle 760,000 units ( 2012 ) 
    </div>
    <div class="profile-area">
      Vehicle 805,000 units ( 2011 ) 
    </div>
"""
soup = BeautifulSoup(html, 'lxml')

for item in soup.select("div.profile-area"):
  produkz = item.text.strip()
  produkz = produkz.replace('\n',':')

  prev_h4 = str(item.find_previous_sibling('h4'))
  if "Models" in prev_h4:
    models=produkz
  else:
    models=""

  if "Capacity" in prev_h4:
    capacity=produkz
  else:
    capacity=""

  if "( 2015 )" in produkz:
    prod15=produkz
  else:
    prod15=""
  if "( 2016 )" in produkz:
    prod16=produkz
  else:
    prod16=""
  if "( 2017 )" in produkz:
    prod17=produkz
  else:
    prod17=""

  print(models+';'+capacity+';'+prod15+';'+prod16+';'+prod17)

My problem is that each pass of the loop over the matching HTML elements ("div.profile-area") overwrites my results:

;Vehicle 1,140,000 units /year;;;;;;
;;;;;;Vehicle 809,000 units ( 2016 );
;;;;;Vehicle 815,000 units ( 2015 );;
;;;;Vehicle 836,000 units ( 2014 );;;
;;;Vehicle 807,000 units ( 2013 );;;;
;;Vehicle 760,000 units ( 2012 );;;;;
;;;;;;;

The result I want is:

;Vehicle 1,140,000 units /year;Vehicle 760,000 units ( 2012 );Vehicle 807,000 units ( 2013 );Vehicle 836,000 units ( 2014 );Vehicle 815,000 units ( 2015 );Vehicle 809,000 units ( 2016 );

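I would be glad if you could show me a better way to structure my code. Thanks in advance.

【Comments】:

  • What is your desired result?
  • I have updated my question.
  • Try using XPath? I had the same problem yesterday, but I used selenium and XPath. To solve it, first grab the h4 elements, then iterate over each //h4 and, inside the for loop, query //h4/div[@class="profile-area"]
  • @eddwinpaz Could you please link to your example (if it doesn't fit here)?
  • By the way, using pyQuery is easier than BeautifulSoup

Tags: python loops beautifulsoup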


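【Solution 1】:

Here is my solution. You need to handle each element tag and parse it accordingly. I took your problem a step further and provided a more flexible way to access each data value. Hope it helps.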

import re

from bs4 import BeautifulSoup

html_doc = """
<h4>Production Capacity (year)</h4>
    <div class="profile-area">
    Vehicle 1,140,000 units /year
    </div>
<h4>Output</h4>
    <div class="profile-area">
    Vehicle 809,000 units ( 2016 ) 
    </div>
    <div class="profile-area">
    Vehicle 815,000 units ( 2015 ) 
    </div>
    <div class="profile-area">
    Vehicle 836,000 units ( 2014 ) 
    </div>
    <div class="profile-area">
    Vehicle 807,000 units ( 2013 ) 
    </div>
    <div class="profile-area">
    Vehicle 760,000 units ( 2012 ) 
    </div>
    <div class="profile-area">
    Vehicle 805,000 units ( 2011 ) 
    </div>"""

soup = BeautifulSoup(html_doc, 'html.parser')
h4_elements = soup.find_all('h4')
profile_areas = soup.find_all('div', attrs={'class': 'profile-area'})
print('\n')
print("++++++++++++++++++++++++++++++++++++")
print("Element counts")
print("++++++++++++++++++++++++++++++++++++")
print("Total H4: {}".format(len(h4_elements)))
print("++++++++++++++++++++++++++++++++++++")
print("Total profile-area: {}".format(len(profile_areas)))
print("++++++++++++++++++++++++++++++++++++")
print('\n')

for i in h4_elements:
    print("++++++++++++++++++++++++++++++++++++")
    print(i.text.strip())
    print("++++++++++++++++++++++++++++++++++++")
    # Drop the first remaining div, then list every div that is left.
    del profile_areas[0]
    for j in profile_areas:
        # Remove thousands separators, then collapse punctuation into single spaces.
        raw = re.sub('[^A-Za-z0-9]+', ' ', j.text.replace(',', '').strip())
        raw = raw.rstrip()
        el = raw.split(' ')

        print('Type: {} '.format(el[0]))
        print('Sold: {} {} '.format(el[1], el[2]))
        print('Year: {} '.format(el[3]))
        print("++++++++++++++++++++++++++++++++++++")

The output looks like this:

++++++++++++++++++++++++++++++++++++
Production Capacity (year)
++++++++++++++++++++++++++++++++++++
Type: Vehicle 
Sold: 809000 units 
Year: 2016 
++++++++++++++++++++++++++++++++++++
Type: Vehicle 
Sold: 815000 units 
Year: 2015 
++++++++++++++++++++++++++++++++++++
Type: Vehicle 
Sold: 836000 units 
Year: 2014 
++++++++++++++++++++++++++++++++++++
Type: Vehicle 
Sold: 807000 units 
Year: 2013 
++++++++++++++++++++++++++++++++++++
Type: Vehicle 
Sold: 760000 units 
Year: 2012 
++++++++++++++++++++++++++++++++++++
Type: Vehicle 
Sold: 805000 units 
Year: 2011 
++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++
Output
++++++++++++++++++++++++++++++++++++
Type: Vehicle 
Sold: 815000 units 
Year: 2015 
++++++++++++++++++++++++++++++++++++
Type: Vehicle 
Sold: 836000 units 
Year: 2014 
++++++++++++++++++++++++++++++++++++
Type: Vehicle 
Sold: 807000 units 
Year: 2013 
++++++++++++++++++++++++++++++++++++
Type: Vehicle 
Sold: 760000 units 
Year: 2012 
++++++++++++++++++++++++++++++++++++
Type: Vehicle 
Sold: 805000 units 
Year: 2011 
++++++++++++++++++++++++++++++++++++
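
In the listing above, the yearly figures appear under both headings, because the inner loop walks every remaining div regardless of which h4 it follows. A minimal sketch of one way to attach each div to its nearest preceding h4 instead, using find_previous_sibling as in the question. The `groups` dict and the shortened HTML are illustrative assumptions, not part of the answer:

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (assumption: same flat layout).
html_doc = """
<h4>Production Capacity (year)</h4>
    <div class="profile-area">
    Vehicle 1,140,000 units /year
    </div>
<h4>Output</h4>
    <div class="profile-area">
    Vehicle 809,000 units ( 2016 )
    </div>
    <div class="profile-area">
    Vehicle 815,000 units ( 2015 )
    </div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Map each heading text to the list of entries that follow it.
groups = {}
for div in soup.find_all('div', class_='profile-area'):
    heading = div.find_previous_sibling('h4')
    title = heading.get_text(strip=True) if heading else ''
    groups.setdefault(title, []).append(div.get_text(strip=True))

for title, entries in groups.items():
    print(title)
    for entry in entries:
        print('  ' + entry)
```

This relies on the h4 and div elements being siblings, as they are in the sample; a differently nested page would need a different lookup.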

【Comments】:

    【Solution 2】:

    I suggest storing each entry in a dictionary; you can then easily pull out the fields you need at the end (you don't seem to want 2011?):

    from bs4 import BeautifulSoup
    import re
    
    html = """
    <h4>Production Capacity (year)</h4>
        <div class="profile-area">
          Vehicle 1,140,000 units /year
        </div>
    <h4>Output</h4>
        <div class="profile-area">
          Vehicle 809,000 units ( 2016 ) 
        </div>
        <div class="profile-area">
          Vehicle 815,000 units ( 2015 ) 
        </div>
        <div class="profile-area">
          Vehicle 836,000 units ( 2014 ) 
        </div>
        <div class="profile-area">
          Vehicle 807,000 units ( 2013 ) 
        </div>
        <div class="profile-area">
          Vehicle 760,000 units ( 2012 ) 
        </div>
        <div class="profile-area">
          Vehicle 805,000 units ( 2011 ) 
        </div>
    """
    
    soup = BeautifulSoup(html, 'lxml')
    units = {}
    
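    for item in soup.find_all(['h4', 'div']):
        if item.name == 'h4':
            # Remember the current section: h4 keeps the first matching keyword
            # (and is left as 'models' when none of them matches).
            for h4 in ['capacity', 'output', 'models']:
                if h4 in item.text.lower():
                    break
        elif item.get('class', [''])[0] == 'profile-area':
            vehicle = item.get_text(strip=True)
    
            if h4 == 'output':
                # Key output rows by their year, e.g. "( 2016 )" -> '2016'.
                re_year = re.search(r'\( (\d+) \)', vehicle)
    
                if re_year:
                    year = re_year.group(1)
                else:
                    year = 'unknown'
    
                units[year] = vehicle
            else:
                # Other sections are keyed by the section keyword itself.
                units[h4] = vehicle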
    
    req_fields = ['models', 'capacity', '2012', '2013', '2014', '2015', '2016']            
    print(';'.join([units.get(field, '') for field in req_fields]))
    

    This displays:

    ;Vehicle 1,140,000 units /year;Vehicle 760,000 units ( 2012 );Vehicle 807,000 units ( 2013 );Vehicle 836,000 units ( 2014 );Vehicle 815,000 units ( 2015 );Vehicle 809,000 units ( 2016 )
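
If the set of years is not known in advance, the field list could be derived from the collected keys rather than hard-coded. A minimal sketch, assuming a `units` dict shaped like the one built above; the sample values are copied from the question, and `years`, `fields` and `line` are names chosen for illustration:

```python
# Assumed shape of the dictionary produced by the loop above.
units = {
    'capacity': 'Vehicle 1,140,000 units /year',
    '2016': 'Vehicle 809,000 units ( 2016 )',
    '2015': 'Vehicle 815,000 units ( 2015 )',
    '2014': 'Vehicle 836,000 units ( 2014 )',
}

# Every all-digit key is treated as a year and sorted ascending.
years = sorted(key for key in units if key.isdigit())
fields = ['models', 'capacity'] + years

# Missing fields (here: 'models') become empty columns.
line = ';'.join(units.get(field, '') for field in fields)
print(line)
```

Unlike the fixed list, this would also include 2011 whenever it is present, so it only fits if every year should be exported.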
    

    The regular expression extracts the year from each vehicle entry. The year is then used as the key in the dictionary.
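
As a small illustration of that step, the sketch below applies the same pattern to a few entries and falls back to 'unknown' when no year is present. The helper name `extract_year` is hypothetical and not part of the answer:

```python
import re


def extract_year(entry):
    """Return the year written as '( YYYY )' in entry, or 'unknown'."""
    match = re.search(r'\( (\d+) \)', entry)
    return match.group(1) if match else 'unknown'


entries = [
    'Vehicle 809,000 units ( 2016 )',
    'Vehicle 815,000 units ( 2015 )',
    'Vehicle 1,140,000 units /year',  # no year in parentheses
]

# Build the year -> entry mapping the same way the answer does.
units = {extract_year(entry): entry for entry in entries}
print(units['2016'])
print(extract_year(entries[2]))
```

Note that the pattern expects exactly one space inside the parentheses, as in the sample data; a looser pattern such as `\(\s*(\d+)\s*\)` may be safer if the spacing varies.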

    For the HTML in the pastebin, it gives:

    Volkswagen Golf, Golf Variant(Estate), Golf Plus, CrossGolf (2006-), e-Golf (2014-)Volkswagen Touran, CrossTouran (2007-), Tiguan (2007-);I.D. electric vehicles based on MEB (planning);SEAT new SUV MQB-A2 platform (2018- planning);Components:press shop, chassis, plastics technology;Vehicle 1,140,000 units /year;Vehicle 760,000 units ( 2012 );Vehicle 807,000 units ( 2013 );Vehicle 836,000 units ( 2014 );Vehicle 815,000 units ( 2015 );Vehicle 809,000 units ( 2016 )
    

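    【Comments】:

    • This is cool. But to be honest, my HTML source is a bit more complex. (['h4', 'div']) finds a lot of content like "MENU". Here is my full HTML: pastebin.com/fB0s7eCF
    • The lookup can still work, you just need to check the class manually. Unfortunately, I don't think it is possible to specify multiple filters directly in a single find_all. I have added a class test.
    • The class test doesn't work (sadly, I was not able to fix it). item.get('class', [''])[0] is always "panel heading".
    • Added a solution as an answer, @MartinEvans.
    • @MartinEvans please +1 if it is correct. Thanks.. a little karma is good for my time.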
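
On the class test discussed in these comments: an element can carry several classes, so comparing only the first entry of item.get('class') may miss a match, which could be one reason the test fails on the full page. One possible alternative is to pass a function to find_all and test membership in the class list. A minimal sketch; the `wanted` helper, the extra `panel` class and the MENU div in the sample HTML are made up for illustration and are not taken from the pastebin:

```python
from bs4 import BeautifulSoup

# Illustrative HTML: one unrelated div and one div with two classes.
html = """
<div class="menu">MENU</div>
<h4>Output</h4>
<div class="panel profile-area">Vehicle 809,000 units ( 2016 )</div>
<div class="profile-area">Vehicle 815,000 units ( 2015 )</div>
"""


def wanted(tag):
    """Match every h4, and only divs whose class list contains 'profile-area'."""
    if tag.name == 'h4':
        return True
    return tag.name == 'div' and 'profile-area' in tag.get('class', [])


soup = BeautifulSoup(html, 'html.parser')

# find_all accepts a function and returns matches in document order.
matches = soup.find_all(wanted)
for tag in matches:
    print(tag.name, tag.get_text(strip=True))
```

The rest of the loop from the answer could then run over `matches` without a separate class check.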