【Question title】:How to get each value inside <li> with <span> tag BeautifulSoup
【Posted】:2017-12-02 14:59:28
【Question description】:

I have an HTML document like the one below, and self.soup is a BeautifulSoup object. I am trying to scrape the data inside the list elements, which look like this:

 <ul class="list-group">
        <li class="list-group-item">
           <span class="strong">Name</span>
           <span class="pull-right">Piter</span>
        </li>
        <li class="list-group-item">
           <span class="strong">Year</span>
           <span class="pull-right">2017</span>
        </li>
 </ul>

Python file scrape.py

  # person is a dict (described as an "array" in the question)
  need = { 'Name' : 'name',
           'Year' : 'year'
  }

First attempt

  specs = self.soup.select("ul.list-group li.list-group-item") 
  if  len(specs) > 0 :
        for data in specs :
            text = data.get_text()
            if need.has_key( data[0].strip()) : 
                 if need[ data[0].strip() ] not in person or person[ need[ data[0].strip() ] ] == '':
                    person[ need[ text[0].strip() ] ] = text[1].strip()

First error

 File "scraper.py", line 68, in scrape
    if need.has_key( data[0].strip()) : 
 File "build/bdist.linux-x86_64/egg/bs4/element.py", line 1011, in __getitem__
 KeyError: 0

Second attempt

  specs = self.soup.select("ul.list-group li.list-group-item")
  if  len(specs) > 0 :
        for data in specs :
            text = data.get_text()
            if need.has_key( data[0].strip()) : 
                 if need[ data[0].strip() ] not in person or person[ need[ data[0].strip() ] ] == '':
                    person[ need[ text[0].strip() ] ] = text[1].strip() 

Second error

  File "site_scrapers/v12software.scraper.py", line 66, in scrape
    text = [ data.contents[0].get_text(), data.contents[1].get_text() ] 
  File "build/bdist.linux-x86_64/egg/bs4/element.py", line 737, in __getattr__
  AttributeError: 'NavigableString' object has no attribute 'get_text'

I am trying to put the strings from the elements above into the person array.

I need a result like this:

  print person['Name']
  #output Piter
  print person['Year']
  #output 2017

【Question discussion】:

    Tags: python html web-scraping beautifulsoup


    【Solution 1】:
    from bs4 import BeautifulSoup
    
    html = """<ul class="list-group">
            <li class="list-group-item">
               <span class="strong">Name</span>
               <span class="pull-right">Piter</span>
            </li>
            <li class="list-group-item">
               <span class="strong">Year</span>
               <span class="pull-right">2017</span>
            </li>
     </ul>"""
    
    soup = BeautifulSoup(html, 'html.parser')
    
    need = {}
    
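    # Note: despite its name, li_tag is each <ul class="list-group"> element.
    for li_tag in soup.find_all('ul', {'class':'list-group'}):
        # Likewise, span_tag is each <li class="list-group-item"> inside it.
        for span_tag in li_tag.find_all('li', {'class':'list-group-item'}):
            # Read the label span and the value span of this list item.
            field = span_tag.find('span', {'class':'strong'}).text
            value = span_tag.find('span', {'class':'pull-right'}).text
            need[field] = value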
    
    print(need)
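
The two errors in the question can probably be explained by how BeautifulSoup models a tag. Indexing a `Tag` (as in `data[0]`) looks up an HTML attribute rather than a child, which matches the `KeyError: 0` in the first traceback. `.contents` also includes the whitespace between tags as `NavigableString` nodes, which in the bs4 version shown in the second traceback had no `get_text`. The sketch below only illustrates these two points; the names `li`, `index_error`, `first_is_text` and `spans` are made up for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """<ul class="list-group">
  <li class="list-group-item">
    <span class="strong">Name</span>
    <span class="pull-right">Piter</span>
  </li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")
li = soup.select_one("ul.list-group li.list-group-item")

# Indexing a Tag reads an HTML attribute, so li[0] looks for an
# attribute named 0 and raises KeyError.
try:
    li[0]
    index_error = None
except KeyError as exc:
    index_error = exc

# The first child of <li> is the newline and indentation before the
# first <span>, stored as a NavigableString rather than a Tag.
first_is_text = isinstance(li.contents[0], NavigableString)

# Keeping only Tag children skips those whitespace text nodes.
spans = [child for child in li.contents if isinstance(child, Tag)]
label, value = (span.get_text(strip=True) for span in spans)

print(label, value)
```

Searching with `find` or `find_all`, as the answer above does, avoids both problems because those methods only return tags.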
    

    【Discussion】:

    • Thanks for the reply. I tested it against my page and only got the last one, only "Year". I guess we need to put another for loop after span_tag?
    • I printed span_tag. It looks like <li class="list-group-item"> <span class="strong">Name</span> <span class="pull-right">Piter</span> </li><li class="list-group-item"> <span class="strong">Year</span> <span class="pull-right">2017</span> </li>
    • I tried print field but got no output, yet printing span_tag.find('span', {'class':'strong'}).text shows the output NewYear
    • Please check this, I am facing another problem when I try the same approach: hastebin.com/acimalamin.cs
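
For the goal stated in the question (filling `person` through the `need` mapping), a minimal Python 3 sketch might look like the following. It assumes `person` starts as an empty dict, which the question does not show, and it stores values under the mapped keys (`'name'`, `'year'`) rather than the page labels. `dict.has_key` no longer exists in Python 3, so the lookup uses `dict.get` instead.

```python
from bs4 import BeautifulSoup

html = """<ul class="list-group">
  <li class="list-group-item">
    <span class="strong">Name</span>
    <span class="pull-right">Piter</span>
  </li>
  <li class="list-group-item">
    <span class="strong">Year</span>
    <span class="pull-right">2017</span>
  </li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")

# Label shown on the page -> key wanted in the result (from the question).
need = {"Name": "name", "Year": "year"}
person = {}  # assumption: starts empty; the question does not show its creation

for item in soup.select("ul.list-group li.list-group-item"):
    label_tag = item.find("span", class_="strong")
    value_tag = item.find("span", class_="pull-right")
    if label_tag is None or value_tag is None:
        continue  # skip list items that do not contain both spans

    key = need.get(label_tag.get_text(strip=True))
    if key is None:
        continue  # label is not one of the wanted fields

    # Fill the field only when it is missing or empty, like the question's check.
    if not person.get(key):
        person[key] = value_tag.get_text(strip=True)

print(person)
```

The assignment sits inside the loop body, so every matching list item contributes one entry instead of only the last one.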