【Question title】:Using SPLIT to create a list of HTML
【Posted】:2014-08-26 00:46:33
【Question】:

The search I am running returns a large chunk of HTML.

for i in deal_list:
    regex2 = '(?s)' + '<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="' + i + '"' + '.+?</figure>'
    pattern2 = re.compile(regex2)
    info2 = re.search(pattern2, htmltext)
    html_captured = info2.group(0).split('</figure>')
    print html_captured

Here is a sample of what is returned:

<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:giorgios-brick-oven-pizza-wine-bar" data-bhd="{&quot;accessType&quot;:&quot;extended&quot;}" data-bh-viewport="respect">
      <a href="//www" class="deal-tile-inner">
        <img>
      <figcaption>
                  <div class="deal-tile-content">
          <p class="deal-title should-truncate">Up to 73% Off Wine-Tasting Dinner at 1742 Wine Bar</p>
          <p class="merchant-name truncation ">1742 Wine Bar</p>
            <p class="deal-location truncate-others ">
              <span class="deal-location-name">Upper East Side</span> 
            </p>
  <div class="description should-truncate deal-tile-description"><p>Wine tasting includes three reds and three whites; dinner consists of one appetizer, two entrees, and a bottle of wine</p></div>
        </div>
        <div class="purchase-info clearfix ">
          <p class="deal-price">
              <s class="original-price">$178.90</s>
              <s class="discount-price">$49</s>

  </p>
          <div class="hide show-in-list-view">
            <p class="deal-tile-actions">
          <button class="btn-small btn-buy" data-bhw="ViewDealButton">
            View Deal
          </button>
</p>
  </div>
        </div>
      </figcaption>
      </a>
</figure>
<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:statler-grill-4" data-bhd="{&quot;accessType&quot;:&quot;extended&quot;}" data-bh-viewport="respect">
            <a href="//www" class="deal-tile-inner">
              <img>
      <figcaption>
                        <div class="deal-tile-content">
          <p class="deal-title should-truncate">Up to 59% Off Four-Course Dinner at Statler Grill</p>
          <p class="merchant-name truncation ">Statler Grill</p>
            <p class="deal-location truncate-others ">
              <span class="deal-location-name">Midtown</span> 
            </p>
  <div class="description should-truncate deal-tile-description"><p>Chefs sear marbled new york prime sirloin and dice fresh sashimi-grade tuna to satisfy appetites amid white tablecloths and chandeliers</p></div>
        </div>
        <div class="purchase-info clearfix ">
          <p class="deal-price">
              <s class="original-price">$213</s>
              <s class="discount-price">$89</s>

  </p>
          <div class="hide show-in-list-view">
            <p class="deal-tile-actions">
          <button class="btn-small btn-buy" data-bhw="ViewDealButton">
            View Deal
          </button>
</p>
  </div>
        </div>
      </figcaption>
            </a>
</figure>

I want to use html_captured = info2.group(0).split('</figure>') so that all the HTML between each pair of figure tags becomes one element of a list, in this case html_captured.

It sort of works, except that each block ends up in its own list with a trailing '' at the end. For example: ['<figure .... </figure>', ''] ['<figure .... </figure>', '']

What I want instead is ['<figure .... </figure>', '<figure .... </figure>', '<figure .... </figure>', ... etc]
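
To make the symptom concrete, here is a minimal sketch that reproduces it on made-up input. The shortened `htmltext`, the `deal_list` values, and the names `per_iteration` and `deal_id` are assumptions for illustration, not taken from the real page. It suggests two causes: each matched block ends with `</figure>`, so `split` leaves an empty string after the final separator, and a new list is created on every pass through the loop instead of one list being filled across passes.

```python
import re

# Hypothetical, heavily shortened stand-in for the real page.
htmltext = (
    '<figure data-bhc="deal:a"><p>A</p></figure>'
    '<figure data-bhc="deal:b"><p>B</p></figure>'
)
deal_list = ["deal:a", "deal:b"]  # assumed identifiers

per_iteration = []  # one list per deal, as the loop in the question produces
html_captured = []  # a single flat list, filled across iterations

for deal_id in deal_list:
    pattern = re.compile(
        r'(?s)<figure data-bhc="' + re.escape(deal_id) + r'".+?</figure>'
    )
    match = pattern.search(htmltext)
    if match is None:
        continue  # no block for this identifier
    block = match.group(0)
    # The match ends with the separator, so split() yields a trailing ''.
    per_iteration.append(block.split('</figure>'))
    # Appending the whole match keeps the closing tag and avoids the ''.
    html_captured.append(block)

print(per_iteration)
print(html_captured)
```

Appending `match.group(0)` to a list created before the loop gives the flat shape asked for above. It still depends on a regular expression matching HTML, so it is best read as an explanation of the symptom rather than a recommended approach.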

【Comments】:

  • Don't you know that the old ones awaken when you try to parse HTML with regex? Use BeautifulSoup or similar.
  • Can you show me how to use BeautifulSoup in this case, @timgeb?

Tags: python html regex for-loop web-scraping


【Solution 1】:

There are dedicated tools for parsing HTML: HTML parsers.

An example using BeautifulSoup:

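from bs4 import BeautifulSoup

data = """
your html here
"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")

# find_all() returns every <figure> element; str() turns each one back into markup.
html_captured = [str(figure) for figure in soup.find_all('figure')]
print(html_captured)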

It is also worth reading up on why regular expressions should not be used to parse HTML.
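
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's `html.parser` module. This is an illustration under assumptions rather than a drop-in replacement: the `FigureCollector` class and the short `data` string are hypothetical, HTML comments inside a figure are dropped, and the closing tags are rebuilt instead of copied from the source.

```python
from html.parser import HTMLParser


class FigureCollector(HTMLParser):
    """Collects the markup of each outermost <figure> element."""

    def __init__(self):
        # Keep entities such as &quot; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.figures = []  # finished '<figure>...</figure>' strings
        self._parts = []   # pieces of the figure currently being read
        self._depth = 0    # how many <figure> tags are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self._depth += 1
        if self._depth:
            # get_starttag_text() returns the tag exactly as it appeared.
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._depth:
            return
        self._parts.append("</%s>" % tag)
        if tag == "figure":
            self._depth -= 1
            if self._depth == 0:
                self.figures.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._depth:
            self._parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self._depth:
            self._parts.append("&#%s;" % name)


# Hypothetical input, much shorter than the real page.
data = (
    '<div><figure data-bhc="deal:a"><p>A &amp; B</p></figure>\n'
    '<figure data-bhc="deal:b"><p>C</p></figure></div>'
)

collector = FigureCollector()
collector.feed(data)
collector.close()
figures = collector.figures
print(figures)
```

Every figure lands in one flat list with no empty strings, and no pattern has to be built per deal identifier. To keep only certain deals, the `attrs` argument of `handle_starttag` could be checked for a `data-bhc` value.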

【Comments】:
