【Question title】: Python - BeautifulSoup scrape non-standard web table
【Posted】: 2016-08-31 19:59:23
【Question】:

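I'm trying to scrape data from multiple web pages to build a CSV. The data is just nutritional information for products. I've written code that accesses the site, but I can't quite get it to iterate correctly. The problem is that the site uses DIV tags for the product names, and inside the DIV is either a <strong> or a <b> tag, which varies from page to page. When I iterate, the product names all come out at once in a single list, tags included, and then I get the contents of the column I asked for, without tags. I'm trying to figure out what I'm doing wrong.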

Sample source:

<div><strong>Product 1 Name</strong></div>

<table>
    <tbody>
        <tr>
            <td>Serving Size</td>
            <td>8 (fl. Oz.)</td>
        </tr>
        <tr>
            <td>Calories</td>
            <td>122 Calories</td>
        </tr>
        <tr>
            <td>Fat</td>
            <td>0 (g)</td>
        </tr>
        <tr>
            <td>Sodium</td>
            <td>0.2 (mg)</td>
        </tr>
        <tr>
            <td>Carbs</td>
            <td>8.8 (mg)</td>
        </tr>
        <tr>
            <td>Dietary Fiber</td>
            <td>0 (g)</td>
        </tr>
        <tr>
            <td>Sugar</td>
            <td>8.8 (g)<br />
            &nbsp;</td>
        </tr>
    </tbody>
</table>
&nbsp;

<div><strong>Product 2 Name</strong></div>

<table>
    <tbody>
        <tr>
            <td>Serving Size</td>
            <td>8 (fl. Oz.)</td>
        </tr>
        <tr>
            <td>Calories</td>
            <td>134 Calories</td>
        </tr>
        <tr>
            <td>Fat</td>
            <td>0 (g)</td>
        </tr>
        <tr>
            <td>Sodium</td>
            <td>0.0 (mg)</td>
        </tr>
        <tr>
            <td>Carbs</td>
            <td>8.4 (mg)</td>
        </tr>
        <tr>
            <td>Dietary Fiber</td>
            <td>0 (g)</td>
        </tr>
        <tr>
            <td>Sugar</td>
            <td>8.4 (g)<br />
            &nbsp;</td>
        </tr>
    </tbody>
</table>
&nbsp;

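Ideally, I'd like to output a CSV whose header row contains "Product Name" followed by the labels from column 1, since those are the same for every table. A data row would then look like this: "Product 1 Name, 8, 122, 0, 0.2, 8.8, 0, 8.8"

I know the data needs some manipulation to get to that point (to strip the unit information).

Here is what I have so far, and it is driving me crazy: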

import requests, bs4, urllib2, csv
from bs4 import BeautifulSoup
from collections import defaultdict


#Loop on URLs to get Nutritional Information from each one.
with open('NutritionalURLs.txt') as f:
    for line in f:
        r = requests.get('website' + line)
        soup=BeautifulSoup(r.text.encode('ascii','ignore'),"html.parser")

#TESTING
        with open('output.txt', 'w') as o:
            product_list = soup.find_all('b')
            product_list = soup.find_all('strong')
            print(product_list)
            table_list = soup.find_all('table')
            for tables in table_list:
                trs = tables.find_all('tr')
                for tr in trs:
                    tds = tr.find_all('td')[1:]
                    if tds:
                        facts = tds[0].find(text=True)
                        print(facts)
#                        o.write("Serving Size: %s, Calories: %s, Fat: %s, Sodium: %s, Carbs: %s, Dietary Fiber: %s, Sugar: %s\n" % \
#                            (facts[0].text, facts[1].text, facts[2].text, facts[3].text, facts[4].text, facts[5].text, facts[6].text)) 

This gives me output like this:

[<strong>Product 1 Name</strong>, <strong>Product 2 Name</strong>]
8 (fl. Oz.)
101 Calories
0 (g)
0.0 (mg)
0 (mg)
0 (g)
0 (g)
8 (fl. Oz.)
101 Calories
0 (g)
0.0 (mg)
0 (mg)
0 (g)
0 (g)
[]

【Comments】:

    Tags: python html web-scraping beautifulsoup


    【Answer 1】:

    Find each table, then pull the text from the preceding strong tag and take the second td from each tr, splitting the text once to drop the (g) etc.:

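    from bs4 import BeautifulSoup
    
    # html holds the page source, e.g. the sample markup from the question
    soup = BeautifulSoup(html, "html.parser")
    
    for table in soup.find_all("table"):
        # The product name sits in the nearest <strong> before the table
        name = [table.find_previous("strong").text]
        # Second cell of each row, keeping only the leading number
        amounts = [td.text.split(None, 1)[0] for td in table.select("tr td + td")]
        print(name + amounts)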
    

    This gives you:

    ['Product 1 Name', '8', '122', '0', '0.2', '8.8', '0', '8.8']
    ['Product 2 Name', '8', '134', '0', '0.0', '8.4', '0', '8.4']
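
The "split once" step can be tried on its own. This is a minimal sketch using cell strings copied from the question's sample table; the helper name leading_value is hypothetical, and the guard for empty cells is an addition that the answer's code does not have.

```python
def leading_value(cell_text):
    """Return the first whitespace-separated token of a cell's text.

    split(None, 1) splits on any run of whitespace, at most once,
    so '8 (fl. Oz.)' becomes ['8', '(fl. Oz.)'] and we keep '8'.
    """
    parts = cell_text.split(None, 1)
    # An empty or whitespace-only cell yields [], so guard before indexing
    return parts[0] if parts else ""


# Cell texts taken from the sample table; the last one mimics the
# "<br />&nbsp;" tail of the Sugar row (newline, spaces, non-breaking space)
samples = ["8 (fl. Oz.)", "122 Calories", "0.2 (mg)", "8.8 (g)\n            \xa0"]
values = [leading_value(text) for text in samples]
print(values)
```

Without the guard, indexing [0] on an empty cell would raise IndexError, which may or may not matter depending on how clean the real pages are.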
    

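    select("tr td + td") uses a CSS selector to get the second td from each tr (row): td + td matches any td that directly follows another td, which in these two-column rows is the second cell.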
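
To see what the selector matches, here is a small sketch on inline markup modelled on the question's table. The three-column row is an invented case, included only to show that td + td matches every cell that follows another cell, not strictly "the second one".

```python
from bs4 import BeautifulSoup

# Two-column rows, as in the question's nutrition tables
html = """
<table>
  <tr><td>Calories</td><td>122 Calories</td></tr>
  <tr><td>Fat</td><td>0 (g)</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
second_cells = [td.get_text(strip=True) for td in soup.select("tr td + td")]
print(second_cells)

# Hypothetical three-column row: both the 2nd and 3rd cells follow a td
wide = BeautifulSoup(
    "<table><tr><td>a</td><td>b</td><td>c</td></tr></table>", "html.parser"
)
matched = [td.get_text() for td in wide.select("tr td + td")]
print(matched)
```

If a page ever had more than two columns, the find_all-and-index variant would be the safer way to pick exactly the second cell.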

    Or, using find_all and slicing, it would look like:

    for table in soup.find_all("table"):
        name = [table.find_previous("strong").text]
        amounts = [tr.find_all("td")[1].text.split(None, 1)[0] for tr in table.find_all("tr")]
        print(name + amounts)
    

    Since the name is not always in a strong tag but sometimes in a bold tag, just look for the strong tag first and fall back to bold:

    from bs4 import BeautifulSoup
    import requests
    html = requests.get("http://beamsuntory.desk.com/customer/en/portal/articles/1676001-nutrition-information-cruzan").content
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.select("div.article-content table"):
        name = table.find_previous("strong") or table.find_previous("b")
        amounts = [td.text.split(None, 1)[0] for  td in table.select("tr td + td")]
        print([name.text] + amounts)
    

    If table.find_previous("strong") finds nothing it returns None, so the or falls through and name is set to table.find_previous("b").
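
The fallback can be checked against tiny inline pages. This is a sketch under assumptions: the helper product_name and the three test documents are made up for illustration, and the explicit None check is an addition to cover pages that have neither tag (the case that raised AttributeError in the comments).

```python
from bs4 import BeautifulSoup


def product_name(table):
    """Text of the nearest preceding <strong>, else <b>, else None."""
    # find_previous returns None when nothing matches, so `or` moves on to <b>
    tag = table.find_previous("strong") or table.find_previous("b")
    return tag.get_text(strip=True) if tag is not None else None


strong_page = BeautifulSoup(
    "<div><strong>Product 1 Name</strong></div><table></table>", "html.parser"
)
bold_page = BeautifulSoup(
    "<div><b>Product 2 Name</b></div><table></table>", "html.parser"
)
bare_page = BeautifulSoup("<table></table>", "html.parser")

names = [
    product_name(page.find("table"))
    for page in (strong_page, bold_page, bare_page)
]
print(names)
```

One caveat: because strong is tried first, a page that mixes both tags would pick an earlier strong even when a b sits closer to the table.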

    Now it works for both:

    In [12]: html = requests.get("http://beamsuntory.desk.com/customer/en/portal/articles/1676001-nutrition-information-cruzan").content
    
    In [13]: soup = BeautifulSoup(html, "html.parser")
    
    In [14]: for table in soup.select("div.article-content table"):
       ....:         name = table.find_previous("strong") or table.find_previous("b")
       ....:         amounts = [td.text.split(None, 1)[0] for  td in table.select("tr td + td")]
       ....:         print([name.text] + amounts)
       ....:     
    [u'Cruzan Banana Flavored Rum 42 proof', u'1.5', u'79', u'0', u'0.0', u'6.5', u'0', u'6.5']
    [u'Cruzan Banana Flavored Rum 55 proof', u'1.5', u'95', u'0', u'0.0', u'6.5', u'0', u'6.5']
    [u'Cruzan Black Cherry Flavored Rum 42 proof', u'1.5', u'80', u'0', u'0.0', u'6.9', u'0', u'6.9']
    [u'Cruzan Citrus Flavored Rum 42 proof', u'1.5', u'99', u'0', u'0.0', u'2.8', u'0', u'2.6']
    [u'Cruzan Coconut Flavored Rum 42 proof', u'1.5', u'78', u'0', u'0.1', u'6.9', u'0', u'6.5']
    [u'Cruzan Coconut Flavored Rum 55 proof', u'1.5', u'95', u'0', u'0.1', u'6.1', u'0', u'0']
    [u'Cruzan Guaza Flavored Rum 42 proof', u'1.5', u'78', u'0', u'0.1', u'6.5', u'0', u'6.5']
    [u'Cruzan Key Lime Flavored Rum 42 proof', u'1.5', u'81', u'0', u'0.0', u'8.1', u'0', u'6']
    [u'Cruzan Mango Flavored Rum 42 proof', u'1.5', u'85', u'0', u'0.0', u'8.5', u'0', u'8.5']
    [u'Cruzan Mango Flavored Rum 55 proof', u'1.5', u'101', u'0', u'0.0', u'8.5', u'0', u'8.5']
    [u'Cruzan Orange Flavored Rum 42 proof', u'1.5', u'76.77', u'0', u'0', u'6.4', u'0', u'6.4']
    [u'Cruzan Passion Fruit Flavored Rum 42 proof', u'1.5', u'77', u'0', u'0.0', u'6.3', u'0', u'6.3']
    [u'Cruzan Pineapple Flavored Rum 42 proof', u'1.5', u'78', u'0', u'0.0', u'6.5', u'0', u'6.5']
    [u'Cruzan Pineapple Flavored Rum 55 proof', u'1.5', u'94', u'0', u'0.0', u'6.5', u'0', u'6.5']
    [u'Cruzan Raspberry Flavored Rum 42 proof', u'1.5', u'92', u'0', u'0.0', u'10.1', u'0', u'10.1']
    [u'Cruzan Raspberry Flavored Rum 55 proof', u'1.5', u'108', u'0', u'0.0', u'10.1', u'0', u'10.1']
    [u'Cruzan Strawberry Flavored Rum 42 proof', u'1.5', u'76', u'0', u'0.0', u'6.1', u'0', u'6']
    [u'Cruzan Vanilla Flavored Rum 42 proof', u'1.5', u'78', u'0', u'0.0', u'6.5', u'0', u'6.5']
    [u'Cruzan Vanilla Flavored Rum 55 proof', u'1.5', u'94', u'0', u'0.0', u'6.5', u'0', u'6.5']
    [u'Cruzan Estate Dark Rum 80 proof', u'1.5', u'101', u'0', u'0.0', u'0', u'0', u'0']
    [u'Cruzan Estate Light Rum 80 proof', u'1.5', u'101', u'0', u'0.0', u'0', u'0', u'0']
    [u'Cruzan Estate Single Barrel Rum 80 proof', u'1.5', u'99', u'0', u'0.0', u'0.9', u'0', u'0.9']
    

    And with the bold tag:

    In [20]: html = requests.get("http://beamsuntory.desk.com/customer/en/portal/articles/1790163-midori-nutrition-information").content
    
    In [21]: soup = BeautifulSoup(html, "html.parser")
    
    In [22]: for table in soup.select("div.article-content table"):
       ....:         name = table.find_previous("strong") or table.find_previous("b")
       ....:         amounts = [td.text.split(None, 1)[0] for  td in table.select("tr td + td")]
       ....:         print([name.text] + amounts)
       ....:     
    [u'Midori', u'1.0', u'62.1', u'0', u'0.3', u'7.5', u'0', u'7.0']
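
To get from these lists to the CSV the question asks for, something like the following might do. It is only a sketch: the header labels are copied from the first column of the question's sample table, the function name rows_to_csv is hypothetical, and it writes to an in-memory buffer so the result is easy to inspect; a real run would pass an open file to csv.writer instead.

```python
import csv
import io

# "Product Name" plus the first-column labels from the sample table
HEADER = [
    "Product Name", "Serving Size", "Calories", "Fat",
    "Sodium", "Carbs", "Dietary Fiber", "Sugar",
]


def rows_to_csv(rows, header=HEADER):
    """Serialise one header row plus one row per product as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Rows shaped like the lists printed by the loop in the answer
rows = [
    ["Product 1 Name", "8", "122", "0", "0.2", "8.8", "0", "8.8"],
    ["Product 2 Name", "8", "134", "0", "0.0", "8.4", "0", "8.4"],
]
csv_text = rows_to_csv(rows)
print(csv_text)
```

Using csv.writer rather than joining with commas means a product name that itself contains a comma would be quoted correctly.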
    

    【Comments】:

    • I must be doing something wrong. I get this error: Traceback (most recent call last): File "htmlextraction.py", line 10, in <module> name = [table.find_previous("strong").text] AttributeError: 'NoneType' object has no attribute 'text'. I even tried adding html5lib.
    • @PDGill, can you share a link to the actual page?
    • Sure. [Here is an example of one of the pages.] (beamsuntory.desk.com/customer/en/portal/articles/…) Thanks for your help.
    • Your code works as expected on that page. It is the pages that use the bold tag that are throwing the error. Sometimes looking at it with fresh eyes makes all the difference.
    • Works great. Thank you, sir. You are a gentleman and a scholar.