Using SPLIT to create a list of HTML: answers

【Question title】: Using SPLIT to create a list of HTML
【Posted】: 2014-08-26 00:46:33
【Question】:

I'm running a search that returns a large amount of HTML.

for i in deal_list:
    # Match one <figure> block whose data-bhc attribute equals the current deal id
    regex2 = '(?s)' + '<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="' + i + '"' + '.+?</figure>'
    pattern2 = re.compile(regex2)
    info2 = re.search(pattern2, htmltext)
    html_captured = info2.group(0).split('</figure>')
    print(html_captured)

Here is an example of what is returned:

<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:giorgios-brick-oven-pizza-wine-bar" data-bhd="{&quot;accessType&quot;:&quot;extended&quot;}" data-bh-viewport="respect">
      <a href="//www" class="deal-tile-inner">
        <img>
      <figcaption>
                  <div class="deal-tile-content">
          <p class="deal-title should-truncate">Up to 73% Off Wine-Tasting Dinner at 1742 Wine Bar</p>
          <p class="merchant-name truncation ">1742 Wine Bar</p>
            <p class="deal-location truncate-others ">
              <span class="deal-location-name">Upper East Side</span> 
            </p>
  <div class="description should-truncate deal-tile-description"><p>Wine tasting includes three reds and three whites; dinner consists of one appetizer, two entrees, and a bottle of wine</p></div>
        </div>
        <div class="purchase-info clearfix ">
          <p class="deal-price">
              <s class="original-price">$178.90</s>
              <s class="discount-price">$49</s>

  </p>
          <div class="hide show-in-list-view">
            <p class="deal-tile-actions">
          <button class="btn-small btn-buy" data-bhw="ViewDealButton">
            View Deal
          </button>
</p>
  </div>
        </div>
      </figcaption>
      </a>
</figure>
<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:statler-grill-4" data-bhd="{&quot;accessType&quot;:&quot;extended&quot;}" data-bh-viewport="respect">
            <a href="//www" class="deal-tile-inner">
              <img>
      <figcaption>
                        <div class="deal-tile-content">
          <p class="deal-title should-truncate">Up to 59% Off Four-Course Dinner at Statler Grill</p>
          <p class="merchant-name truncation ">Statler Grill</p>
            <p class="deal-location truncate-others ">
              <span class="deal-location-name">Midtown</span> 
            </p>
  <div class="description should-truncate deal-tile-description"><p>Chefs sear marbled new york prime sirloin and dice fresh sashimi-grade tuna to satisfy appetites amid white tablecloths and chandeliers</p></div>
        </div>
        <div class="purchase-info clearfix ">
          <p class="deal-price">
              <s class="original-price">$213</s>
              <s class="discount-price">$89</s>

  </p>
          <div class="hide show-in-list-view">
            <p class="deal-tile-actions">
          <button class="btn-small btn-buy" data-bhw="ViewDealButton">
            View Deal
          </button>
</p>
  </div>
        </div>
      </figcaption>
            </a>
</figure>

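I want to use html_captured = info2.group(0).split('</figure>') so that all the HTML between each new set of tags becomes one element of a list, in this case html_captured.

It sort of works, except that each match becomes its own list with a '' at the end. For example: ['<figure .... </figure>',''] ['<figure .... </figure>','']

What I want instead is ['<figure .... </figure>','<figure .... </figure>','<figure .... </figure>'...etc]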
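The trailing '' described above comes from how `str.split` handles a delimiter at the very end of a string. Here is a minimal sketch that uses a shortened stand-in string (not the real page) in place of one regex match:

```python
# Stand-in for one regex match; the real match is a full <figure> block.
match_text = '<figure data-bhc="deal:a">...</figure>'

# The match ends with the delimiter, so split() returns the text before it
# plus an empty string for the (empty) text after it.
parts = match_text.split('</figure>')
print(parts)
```

Because the split runs inside the loop, every iteration builds and prints a fresh two-element list rather than adding to a shared one. Note also that split() drops the delimiter itself from the pieces it returns.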

【Comments】:

Don't you know that the old ones awaken when you try to parse HTML with regex? Use BeautifulSoup or something similar.
Can you show me how to use BeautifulSoup in this case, @timgeb?

Tags: python html regex for-loop web-scraping

【Solution 1】:

There are dedicated tools for parsing HTML: HTML parsers.

An example using BeautifulSoup:

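from bs4 import BeautifulSoup

data = """
your html here
"""

# Name the parser explicitly; html.parser ships with Python
soup = BeautifulSoup(data, "html.parser")
# find_all returns every <figure> element; str() turns each one back into HTML text
print([str(figure) for figure in soup.find_all('figure')])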

Also see why you should not use regular expressions to parse HTML:

RegEx match open tags except XHTML self-contained tags
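
If the regex has to stay for now, the list shape asked for in the question can probably be had by creating one list outside the loop and appending each whole match, with no split at all. This is only a sketch: `htmltext` and `deal_list` below are small made-up stand-ins for the question's variables, the class attribute is shortened, and `re.escape` plus the `None` check are precautions the original code did not have. The caveat above about parsing HTML with regular expressions still applies.

```python
import re

# Made-up stand-ins for the question's htmltext and deal_list.
htmltext = (
    '<figure class="deal-card" data-bhc="deal:a"><p>A</p></figure>\n'
    '<figure class="deal-card" data-bhc="deal:b"><p>B</p></figure>\n'
)
deal_list = ['deal:a', 'deal:b', 'deal:missing']

html_captured = []  # one list shared by every iteration
for deal_id in deal_list:
    pattern = re.compile(
        r'(?s)<figure class="deal-card" data-bhc="'
        + re.escape(deal_id)
        + r'".+?</figure>'
    )
    match = pattern.search(htmltext)
    if match is None:  # re.search returns None when nothing matches
        continue
    # Keep the whole match, closing tag included, as one list element.
    html_captured.append(match.group(0))

print(html_captured)
```

Each element of `html_captured` is then one complete `<figure>...</figure>` string, and ids with no matching block are skipped instead of raising an AttributeError.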

【Comments】: