【Title】: Python BeautifulSoup, iterating through tags and attributes
【Posted】: 2017-11-27 04:02:24
【Question】:

I want to iterate over all the tags in a certain part of an HTML page. I am using BeautifulSoup, but I could do without it and use only the Selenium library. Suppose I have the following HTML:

<table id="myBSTable">   
    <tr>
        <th>Column A1</th>
        <th>Column B1</th>
        <th>Column C1</th>
        <th>Column D1</th>
        <th>Column E1</th>
    </tr>
    <tr>
        <td data="First Column Data"></td>
        <td data="Second Column Data"></td>
        <td title="Title of the First Row">Value of Row 1</td>
        <td>Beautiful 1</td>
        <td>Soup 1</td>
    </tr>
    <tr>
        <td></td>
        <td data-g="Second Column Data"></td>
        <td title="Title of the Second Row">Value of Row 2</td>
        <td>Selenium 1</td>
        <td>Rocks 1</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td title="Title of the Third Row">Value of Row 3</td>
        <td>Pyhon 1</td>
        <td>Boulder 1</td>
    </tr>
    <tr>
        <th>Column A2</th>
        <th>Column B2</th>
        <th>Column C2</th>
        <th>Column D2</th>
        <th>Column E2</th>
    </tr>
    <tr>
        <td data="First Column Data"></td>
        <td data="Second Column Data"></td>
        <td title="Title of the First Row">Value of Row 1</td>
        <td>Beautiful 2</td>
        <td>Soup 2</td>
    </tr>
    <tr>
        <td></td>
        <td data-g="Second Column Data"></td>
        <td title="Title of the Second Row">Value of Row 2</td>
        <td>Selenium 2</td>
        <td>Rocks 2</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td title="Title of the Third Row">Value of Row 3 2</td>
        <td>Pyhon 2</td>
        <td>Boulder 2</td>
    </tr>
</table>  

This part of my code works perfectly:

#Selenium libraries
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

#BeautifulSoup
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://urltoget.com')   

table = browser.find_element_by_id('myBSTable')
bs_table = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')
#So far so good
rows = bs_table.findAll('tr')
for tr in rows:
    #Here is where I need help
    #I want to iterate through all the tags in the row,
    #but I don't know whether each one is going to be a th or a td
    #At the same time I need to do something different
    #depending on whether it is a td or a th

This is what I want to accomplish:

    #The following is a pseudo code
    for col in tr.tags:
        print col.name, col.value
        for attribute in col.attrs:
            print "    ", attribute.name, attribute.value
    #End pseudo code

Thanks.

【Question comments】:

    Tags: python html selenium beautifulsoup tags


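    【Solution 1】:

    You can find both the td and th cells by passing a list of tag names to find_all(). To get all of an element's attributes, use its .attrs attribute: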

    rows = bs_table.find_all('tr')
    for row in rows:
        cells = row.find_all(['td', 'th'])
        for cell in cells:
            print(cell.name, cell.attrs)
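
As a minimal sketch of the approach above, the snippet below parses a shortened copy of the question's table with the built-in html.parser and collects each cell's tag name, text, and attributes. The SAMPLE_HTML string and the describe_cells helper are illustrative names, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (assumption: same structure).
SAMPLE_HTML = """
<table id="myBSTable">
    <tr>
        <th>Column A1</th>
        <th>Column B1</th>
    </tr>
    <tr>
        <td data="First Column Data"></td>
        <td title="Title of the First Row">Value of Row 1</td>
    </tr>
</table>
"""


def describe_cells(html):
    """Return (tag name, text, attributes) for every th/td cell, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    cells = []
    for row in soup.find_all("tr"):
        # A list of names matches any of them, so th and td are handled together.
        for cell in row.find_all(["td", "th"]):
            cells.append((cell.name, cell.get_text(strip=True), dict(cell.attrs)))
    return cells


if __name__ == "__main__":
    for name, text, attrs in describe_cells(SAMPLE_HTML):
        print(name, repr(text), attrs)
```

Because cell.name is always either "th" or "td" here, a caller can branch on it to treat header and data cells differently.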
    

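    【Comments】:

    • Thanks, your solution almost worked. I say almost because cell.attrs did not work for me on its own. After some research I found the following way to iterate over the attributes: for attr, value in cell.attrs.iteritems(): print " attribute", attr, value. I had to call iteritems() on attrs, because using cell.attrs by itself did not work.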
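
The iteritems() method used in the comment above exists only on Python 2 dictionaries; on Python 3 the equivalent is items(). A minimal sketch, assuming Python 3 and a made-up one-cell fragment (iter_attributes is a hypothetical helper name):

```python
from bs4 import BeautifulSoup


def iter_attributes(cell):
    """Yield (attribute name, value) pairs for one tag."""
    # cell.attrs is a plain dict, so items() works on Python 3.
    for attr, value in cell.attrs.items():
        # Multi-valued attributes such as class come back as a list of strings.
        if isinstance(value, list):
            value = " ".join(value)
        yield attr, value


# Hypothetical fragment, used only to exercise the helper.
soup = BeautifulSoup(
    '<td title="Title of the First Row" class="a b">Value of Row 1</td>',
    "html.parser",
)
cell = soup.find("td")

if __name__ == "__main__":
    for attr, value in iter_attributes(cell):
        print("    attribute", attr, value)
```

Joining list values is a design choice made for printing; code that needs the individual class names can keep the list as it is.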
    【Solution 2】:

    An alternative loop (the relevant part is at the bottom):

    html='''<table id="myBSTable">   
        <tr>
            <th>Column A1</th>
            <th>Column B1</th>
            <th>Column C1</th>
            <th>Column D1</th>
            <th>Column E1</th>
        </tr>
        <tr>
            <td data="First Column Data"></td>
            <td data="Second Column Data"></td>
            <td title="Title of the First Row">Value of Row 1</td>
            <td>Beautiful 1</td>
            <td>Soup 1</td>
        </tr>
        <tr>
            <td></td>
            <td data-g="Second Column Data"></td>
            <td title="Title of the Second Row">Value of Row 2</td>
            <td>Selenium 1</td>
            <td>Rocks 1</td>
        </tr>
        <tr>
            <td></td>
            <td></td>
            <td title="Title of the Third Row">Value of Row 3</td>
            <td>Pyhon 1</td>
            <td>Boulder 1</td>
        </tr>
        <tr>
            <th>Column A2</th>
            <th>Column B2</th>
            <th>Column C2</th>
            <th>Column D2</th>
            <th>Column E2</th>
        </tr>
        <tr>
            <td data="First Column Data"></td>
            <td data="Second Column Data"></td>
            <td title="Title of the First Row">Value of Row 1</td>
            <td>Beautiful 2</td>
            <td>Soup 2</td>
        </tr>
        <tr>
            <td></td>
            <td data-g="Second Column Data"></td>
            <td title="Title of the Second Row">Value of Row 2</td>
            <td>Selenium 2</td>
            <td>Rocks 2</td>
        </tr>
        <tr>
            <td></td>
            <td></td>
            <td title="Title of the Third Row">Value of Row 3 2</td>
            <td>Pyhon 2</td>
            <td>Boulder 2</td>
        </tr>
    </table>'''
    
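    from bs4 import BeautifulSoup

    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(html, 'html.parser')

    rows = soup.find_all('tr')
    for tr in rows:
        # tr.children also yields the whitespace text between cells;
        # those nodes match neither branch below
        for z in tr.children:
            if z.name == 'td':
                pass  # do stuff1 (handle a data cell)
            elif z.name == 'th':
                pass  # do stuff2 (handle a header cell)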
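
Iterating over tr.children also visits the whitespace text between cells. One possible variation, sketched here under the assumption that only each row's direct th/td children matter, is to call find_all() with recursive=False and branch on the tag name. The split_cells name and the sample markup are illustrative, not taken from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table, modelled on the question's markup.
SAMPLE_HTML = """
<table>
    <tr><th>Column A1</th><th>Column B1</th></tr>
    <tr><td>Beautiful 1</td><td>Soup 1</td></tr>
</table>
"""


def split_cells(html):
    """Return (header texts, data texts) gathered from the direct children of each row."""
    soup = BeautifulSoup(html, "html.parser")
    headers, data = [], []
    for tr in soup.find_all("tr"):
        # recursive=False limits the search to direct children and skips text nodes.
        for cell in tr.find_all(["td", "th"], recursive=False):
            if cell.name == "th":
                headers.append(cell.get_text(strip=True))
            else:
                data.append(cell.get_text(strip=True))
    return headers, data


if __name__ == "__main__":
    print(split_cells(SAMPLE_HTML))
```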
    

    【Comments】:
