Scraping a table with no id with python/beautifulsoup, how can I use the literal html string? (Answers)

【Question title】: Scraping a table with no id with python/beautifulsoup, how can I use the literal html string?
【Posted】: 2018-02-25 15:11:27
【Question】:

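The table I am trying to scrape has no specific table id, and its height/width class matches other tables on the same page, but its literal HTML string is unique: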

<table border="10%" cellpadding="10%" cellspacing="10%" width="100%">

So what is the format for looking up this literal string in soup.find()?

【Comments】:

An example showing what you mean would help!

Tags: python html html-table beautifulsoup screen-scraping

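【Solution 1】:

You can find every table on the page with the findAll('table') method (spelled find_all in current versions of bs4), then pass each table object to the string constructor to get its literal HTML code. (The string constructor essentially calls __str__() on the table object.)

Example: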

import bs4

page = """

<html>
    <head> </head>

    <body>
        <table border="10%" cellpadding="10%" cellspacing="10%" width="100%">
          <tr>
            <th>Firstname</th>
            <th>Lastname</th> 
            <th>Age</th>
          </tr>
          <tr>
            <td>Altair</td>
            <td>Ibn La Ahad</td> 
            <td>939</td>
          </tr>
          <tr>
            <td>Ezio </td>
            <td>Auditore</td> 
            <td>604</td>
          </tr>
        </table>
    </body>
</html>

"""

bs = bs4.BeautifulSoup(page, 'lxml')

tables = bs.findAll('table')  # find all tables

# for each table
for table in tables:
    table_html_code = str(table)  # get the HTML code of this table

    first_line = table_html_code.split('\n')[0]  # first line of the table's HTML code
    print(first_line)
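
The loop above only prints each opening tag. A minimal sketch of the next step, selecting the table whose serialized HTML starts with the literal string from the question, might look like the following. The sample page, the `TARGET` constant and the helper `find_by_opening_tag` are made up for illustration, and `find_all` is the current spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: two tables, only the first has the target opening tag.
page = """
<html><body>
<table border="10%" cellpadding="10%" cellspacing="10%" width="100%">
<tr><td>Altair</td></tr>
</table>
<table border="1" width="100%">
<tr><td>Ezio</td></tr>
</table>
</body></html>
"""

# The literal opening tag from the question.
TARGET = '<table border="10%" cellpadding="10%" cellspacing="10%" width="100%">'

soup = BeautifulSoup(page, "html.parser")


def find_by_opening_tag(soup, target):
    """Return the first table whose serialized HTML starts with `target`, else None."""
    for table in soup.find_all("table"):
        if str(table).startswith(target):
            return table
    return None


match = find_by_opening_tag(soup, TARGET)
print(match.td.get_text() if match is not None else "no match")
```

Note that str(table) is BeautifulSoup's own serialization of the parsed tag, not the raw bytes of the page, so the comparison only works if the target string has the same attribute order and quoting as that serialization.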

Another thing you can try is to rely on the order of the tables. If you want the fourth table on the page, you can access it like this:

beautifulsoup_obj.findAll('table')[3]
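
Indexing the list directly raises IndexError when the page has fewer tables than expected. A small sketch that guards against this is shown below; the helper `nth_table` and the `data-n` attribute are hypothetical and exist only to make the example checkable.

```python
from bs4 import BeautifulSoup

# Hypothetical page with four empty tables, labelled via a made-up data-n attribute.
page = "".join(f'<table data-n="{i}"></table>' for i in range(1, 5))
soup = BeautifulSoup(page, "html.parser")


def nth_table(soup, n):
    """Return the n-th table (1-based), or None if the page has fewer tables."""
    tables = soup.find_all("table")
    return tables[n - 1] if 1 <= n <= len(tables) else None


fourth = nth_table(soup, 4)
print(fourth["data-n"] if fourth is not None else "not found")
```

Positional access is fragile if the site adds or removes a table, so it is probably best treated as a fallback when no attribute distinguishes the table.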

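【Solution 2】:

From looking at the documentation, it seems you can do the following with the find() method: you can pass in a dictionary of HTML attributes. It looks like this link has a similar question/solution.

BeautifulSoup.find(self, name=None, attrs={}, recursive=True, text=None, **kwargs)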

from bs4 import BeautifulSoup

html = """
<html>
<head>
</head>

 <body>
    <table border="10%" cellpadding="10%" cellspacing="10%" width="100%"></table>

    <table></table>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

results = soup.find("table", {"border": "10%", "cellpadding": "10%", "cellspacing": "10%", "width": "100%"})

print(results)
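
As an alternative to the attribute dictionary, the same match can probably be expressed as a CSS attribute selector through `select_one`. The sketch below builds the selector from the question's attribute values and compares it with the `find()` approach; the sample HTML and the variable names `wanted` and `selector` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first table carries the attributes from the question.
html = """
<table border="10%" cellpadding="10%" cellspacing="10%" width="100%"><tr><td>target</td></tr></table>
<table width="100%"><tr><td>other</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute values copied from the question's opening tag.
wanted = {"border": "10%", "cellpadding": "10%", "cellspacing": "10%", "width": "100%"}

# Build an attribute selector of the form table[border="10%"][cellpadding="10%"]...
selector = "table" + "".join(f'[{name}="{value}"]' for name, value in wanted.items())

by_selector = soup.select_one(selector)       # CSS selector lookup
by_attrs = soup.find("table", attrs=wanted)   # attribute-dictionary lookup

print(selector)
print(by_selector.td.get_text())
```

Both lookups match on parsed attribute values rather than on the raw opening tag, so they do not depend on attribute order or quoting in the page source.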
