【Question title】: Beautiful Soup: extracting a table built from multiple spans
【Posted】: 2020-06-18 17:16:24
【Question description】:

I am currently working on a class assignment. I have to extract the data from the SPECS table on this web page.

https://www.consumerreports.org/products/drip-coffee-maker/behmor-connected-alexa-enabled-temperature-control-396982/overview/

The data I need is stored as

<h2 class="crux-product-title">Specs</h2>
</div>
</div>
<div class="row">
<div class="col-xs-12">
<div class="product-model-features-specs-item">
<div class="row">
<div class='col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model-features-specs-item-key'>
<span class="crux-body-copy crux-body-copy--small--bold">
Programmable
<span class="product-model-tooltip">
<span class="crux-icons crux-icons-help-information" aria-hidden="true"></span>
<span class="product-model-tooltip-window">
<span class="crux-icons crux-icons-close" aria-hidden="true"></span>
<span class="crux-body-copy crux-body-copy--small--bold">Programmable</span>
<span class="crux-body-copy crux-body-copy--small">Programmable models have a clock and can be set to brew at a specified time.</span>
</span>
</span>
</span>
</div>
<div class="col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model-features-specs-item-value">
<span class='crux-body-copy crux-body-copy--small'>Yes</span>
</div>
</div>
</div>
</div>
</div>
<div class="row">
<div class="col-xs-12">
<div class="product-model-features-specs-item">
<div class="row">
<div class='col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model-features-specs-item-key'>
<span class="crux-body-copy crux-body-copy--small--bold">
Thermal carafe/mug
<span class="product-model-tooltip">
<span class="crux-icons crux-icons-help-information" aria-hidden="true"></span>
<span class="product-model-tooltip-window">
<span class="crux-icons crux-icons-close" aria-hidden="true"></span>
<span class="crux-body-copy crux-body-copy--small--bold">Thermal carafe/mug</span>
<span class="crux-body-copy crux-body-copy--small">Keeps coffee warm for about four hours; thermal mugs don&#039;t hold heat as well.</span>
</span>
</span>
</span>

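I need to build lists for three span classes:

crux-body-copy crux-body-copy--small--bold
crux-body-copy crux-body-copy--small
crux-body-copy crux-body-copy--small

The difficulty in extracting the table is that it is built from multiple nested spans.

I am using Beautiful Soup, calling find_all and find with the span tag name.

I always get only the first value.

How should I do this?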

【Comments】:

  • Can we see the code you have so far? In general, if you have a list of elements, you will have to specify an index, for example myelements[2].
  • 'url = "consumerreports.org/products/drip-coffee-maker/…" html_content = requests.get(url).text soup = BeautifulSoup(html_content, "lxml") print(soup.prettify()) table = soup.find_all( "span", attrs={"class": "crux-body-copy crux-body-copy--small"}) print(table)'
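
To illustrate the point about indexing raised in the comments, here is a minimal sketch of the difference between `find` and `find_all`. The two-span HTML string is a made-up stand-in, not the real page, and `html.parser` is used only so the example has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical two-span snippet standing in for the real page.
html = """
<div>
  <span class="crux-body-copy crux-body-copy--small">Yes</span>
  <span class="crux-body-copy crux-body-copy--small">No</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag (or None if nothing matches).
first = soup.find("span", class_="crux-body-copy")

# find_all() returns a list-like ResultSet of every matching tag.
all_spans = soup.find_all("span", class_="crux-body-copy")

# Loop over the result (or index into it) to reach the later values.
texts = [span.get_text(strip=True) for span in all_spans]

print(first.get_text(strip=True))         # Yes
print(texts)                              # ['Yes', 'No']
print(all_spans[1].get_text(strip=True))  # No
```

Passing a single class name to `class_` matches any tag that carries that class among its classes, which tends to be less brittle than matching the full class string.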

Tags: python html web-scraping html-table beautifulsoup


【Solution 1】:

I am not sure whether this will work for you.

from simplified_scrapy import SimplifiedDoc,req,utils
html = ''' ''' # Your html
doc = SimplifiedDoc(html)
spans = doc.selects('span.crux-body-copy crux-body-copy--small--bold')
for span in spans:
    # print (span.firstText())
    print (span.select('span.crux-body-copy crux-body-copy--small--bold').text)
    print (span.select('span.crux-body-copy crux-body-copy--small').unescape())

Result:

Programmable
Programmable models have a clock and can be set to brew at a specified time.
Thermal carafe/mug
Keeps coffee warm for about four hours; thermal mugs don't hold heat as well.
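
For comparison, the same extraction can be sketched with Beautiful Soup itself, the library used in the question. This is only a sketch under assumptions: the HTML string is a shortened copy of the markup quoted in the question (the second item has no value cell, as in the quoted snippet), and `extract_specs` and `text_or_none` are hypothetical helper names. It walks each `product-model-features-specs-item` block and reads the spec name, tooltip description, and value relative to that block, so the nested spans do not get mixed up.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (assumption: the live page
# serves the same structure in its static HTML).
html = """
<div class="product-model-features-specs-item">
  <div class="row">
    <div class="col-xs-12 product-model-features-specs-item-key">
      <span class="crux-body-copy crux-body-copy--small--bold">
        Programmable
        <span class="product-model-tooltip">
          <span class="product-model-tooltip-window">
            <span class="crux-body-copy crux-body-copy--small--bold">Programmable</span>
            <span class="crux-body-copy crux-body-copy--small">Programmable models have a clock and can be set to brew at a specified time.</span>
          </span>
        </span>
      </span>
    </div>
    <div class="col-xs-12 product-model-features-specs-item-value">
      <span class="crux-body-copy crux-body-copy--small">Yes</span>
    </div>
  </div>
</div>
<div class="product-model-features-specs-item">
  <div class="row">
    <div class="col-xs-12 product-model-features-specs-item-key">
      <span class="crux-body-copy crux-body-copy--small--bold">
        Thermal carafe/mug
        <span class="product-model-tooltip">
          <span class="product-model-tooltip-window">
            <span class="crux-body-copy crux-body-copy--small--bold">Thermal carafe/mug</span>
            <span class="crux-body-copy crux-body-copy--small">Keeps coffee warm for about four hours.</span>
          </span>
        </span>
      </span>
    </div>
  </div>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def extract_specs(html_text):
    """Return one dict per spec item with its name, description and value."""
    soup = BeautifulSoup(html_text, "html.parser")
    specs = []
    for item in soup.select("div.product-model-features-specs-item"):
        key = item.select_one("div.product-model-features-specs-item-key > span")
        if key is None:
            continue
        # The spec name is the first text inside the key span, ahead of the tooltip.
        name = next(key.stripped_strings, None)
        # .crux-body-copy--small matches that exact class, not the --small--bold one.
        description = item.select_one(
            "span.product-model-tooltip-window span.crux-body-copy--small"
        )
        value = item.select_one("div.product-model-features-specs-item-value span")
        specs.append({
            "name": name,
            "description": text_or_none(description),
            "value": text_or_none(value),
        })
    return specs


specs = extract_specs(html)
# Three parallel lists, as asked for in the question.
names = [s["name"] for s in specs]
descriptions = [s["description"] for s in specs]
values = [s["value"] for s in specs]
print(names)
print(values)
```

If the real page fills in the specs with JavaScript after loading, the HTML returned by `requests.get` may not contain these blocks at all, in which case the lists would simply come back empty.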

【Discussion】:
