【Question Title】: Get the content of tr in tbody
【Posted】: 2021-05-28 22:52:11
【Question】:

I have the following table:

<table class="table table-bordered adoption-status-table">
        <thead>
            <tr>
                <th>Extent of IFRS application</th>
                <th>Status</th>
                <th>Additional Information</th>
            </tr>
        </thead>
        <tbody>
                    <tr>
                        <td>IFRS Standards are required for domestic public companies</td>
                        <td>
                        </td>
                        <td></td>
                    </tr>
                    <tr>
                        <td>IFRS Standards are permitted but not required for domestic public companies</td>
                        <td>
                                <img src="/images/icons/tick.png" alt="tick">
                        </td>
                        <td>Permitted, but very few companies use IFRS Standards.</td>
                    </tr>
                    <tr>
                        <td>IFRS Standards are required or permitted for listings by foreign companies</td>
                        <td>
                        </td>
                        <td></td>
                    </tr>
                    <tr>
                        <td>The IFRS for SMEs Standard is required or permitted</td>
                        <td>
                                <img src="/images/icons/tick.png" alt="tick">
                        </td>
                        <td>The IFRS for SMEs Standard is permitted, but very few companies use it. Nearly all SMEs use Paraguayan national accounting standards.</td>
                    </tr>
                    <tr>
                        <td>The IFRS for SMEs Standard is under consideration</td>
                        <td>
                        </td>
                        <td></td>
                    </tr>
        </tbody>
    </table>

I am trying to extract the data from the original source.

Here is my attempt:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
# Site URL
url = "https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
gdp = soup.find_all("table", attrs={"class": "adoption-status-table"})
print("Number of tables on site: ",len(gdp))

table1 = gdp[0]
body = table1.find_all("tr")
head = body[0] 
body_rows = body[1:] 

headings = []
for item in head.find_all("th"):
    item = (item.text).rstrip("\n")
    headings.append(item)
print(headings)

all_rows = [] 
for row_num in range(len(body_rows)): 
    row = [] 
    for row_item in body_rows[row_num].find_all("td"):
        aa = re.sub("(\xa0)|(\n)|,","",row_item.text)
        row.append(aa)
    all_rows.append(row)

df = pd.DataFrame(data=all_rows,columns=headings)

This is the only output I get:

Number of tables on site:  1
['Extent of IFRS application', 'Status', 'Additional Information']

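I want to replace the empty cells with False, and the cells that contain the tick image with True.

【Comments】:

  • It looks like it is working. df.head() only prints out the header rows of the data frame.
  • @WombatPM Even when I remove df.head(), it gives me the same result.

Tags: python html python-3.x web-scraping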


【Solution 1】:

You need to look for the img element inside the td. Here is an example:

# body_rows is the list of <tr> elements built in the question's code
data = []
for tr in body_rows:
    cells = tr.find_all('td')
    # The Status cell is the second <td>; it holds an <img> only when ticked
    img = cells[1].find('img')
    status = img is not None and img['src'] == '/images/icons/tick.png'

    data.append({
        'Extent of IFRS application': cells[0].string,
        'Status': status,
        'Additional Information': cells[2].string,
    })

print(pd.DataFrame(data).head())
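
The img lookup can also be tried offline, without requesting the page. The following is a minimal sketch, assuming a two-row sample copied from the table in the question. `SAMPLE_HTML` and `parse_status_rows` are hypothetical names introduced here. It only tests whether an `img` is present instead of comparing its `src`, and it uses `get_text(strip=True)` so that empty cells become empty strings rather than `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample that mirrors the structure of the table in the question.
SAMPLE_HTML = """
<table class="table table-bordered adoption-status-table">
  <thead>
    <tr>
      <th>Extent of IFRS application</th>
      <th>Status</th>
      <th>Additional Information</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>IFRS Standards are required for domestic public companies</td>
      <td>
      </td>
      <td></td>
    </tr>
    <tr>
      <td>IFRS Standards are permitted but not required for domestic public companies</td>
      <td>
        <img src="/images/icons/tick.png" alt="tick">
      </td>
      <td>Permitted, but very few companies use IFRS Standards.</td>
    </tr>
  </tbody>
</table>
"""


def parse_status_rows(html):
    """Return one dict per body row; Status is True when the cell holds an <img>."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="adoption-status-table")
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        cells = tr.find_all("td")
        rows.append({
            "Extent of IFRS application": cells[0].get_text(strip=True),
            "Status": cells[1].find("img") is not None,
            "Additional Information": cells[2].get_text(strip=True),
        })
    return rows


rows = parse_status_rows(SAMPLE_HTML)
for row in rows:
    print(row["Status"], "-", row["Extent of IFRS application"])
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`, as in the answer above.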

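【Comments】:

    【Solution 2】:

    The answer above works well. Another option is to use pandas.read_html to read the table into a DataFrame, and then use an lxml XPath query to fill in the missing Status column (you could also use BeautifulSoup for this, but it is more verbose):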

    import pandas as pd
    import requests
    from lxml import html
    
    r = requests.get("https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay")
    # read_html returns a list of DataFrames; the first one is the status table
    table = pd.read_html(r.content)[0]
    tree = html.fromstring(r.content)
    # A Status cell (second <td> of each row) is True only when it contains an <img>
    table["Status"] = [bool(t.xpath("img")) for t in tree.xpath('//table/tbody/tr/td[2]')]
    print(table)
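
The read_html and XPath combination can be checked offline as well. The following is a minimal sketch, assuming a two-row sample copied from the table in the question. `SAMPLE_HTML` and `status_cells` are hypothetical names introduced here. Wrapping the markup in `StringIO` is a precaution, since as far as I know recent pandas versions warn when given a literal HTML string, and the final `fillna("")` is an optional extra that is not part of the answer.

```python
from io import StringIO

import pandas as pd
from lxml import html

# Hypothetical two-row sample that mirrors the structure of the table in the question.
SAMPLE_HTML = """
<table class="table table-bordered adoption-status-table">
  <thead>
    <tr>
      <th>Extent of IFRS application</th>
      <th>Status</th>
      <th>Additional Information</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>IFRS Standards are required for domestic public companies</td>
      <td>
      </td>
      <td></td>
    </tr>
    <tr>
      <td>IFRS Standards are permitted but not required for domestic public companies</td>
      <td>
        <img src="/images/icons/tick.png" alt="tick">
      </td>
      <td>Permitted, but very few companies use IFRS Standards.</td>
    </tr>
  </tbody>
</table>
"""

# read_html sees no text in the Status cells, so that column starts out empty (NaN).
table = pd.read_html(StringIO(SAMPLE_HTML))[0]

# Select the second <td> of every body row of the status table.
tree = html.fromstring(SAMPLE_HTML)
status_cells = tree.xpath(
    '//table[contains(@class, "adoption-status-table")]/tbody/tr/td[2]'
)

# An empty list is falsy, so bool() is True only when the cell contains an <img>.
table["Status"] = [bool(td.xpath("img")) for td in status_cells]

# Optional: show empty notes as "" instead of NaN.
table["Additional Information"] = table["Additional Information"].fillna("")

print(table)
```

The XPath here is narrowed to the table's class, which is one way to avoid picking up cells from other tables if a page has more than one.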
    


    【Comments】:

    • Perfect solution. +karma ;)