【Question title】: Python: Parse HTML to remove a tag and apply text transformation to all text after the tag
【Posted】: 2018-07-18 20:00:51
【Question】:

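I am trying to detect strings that contain the HTML tag <p><strong class="title"> </strong></p> with certain words ("shared" OR "amenities") inside that tag, and then append the word "shared" to every comma-separated substring that appears after the tag. Is there a simple way to do this?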

Example input:

</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">

Example output:

swimming pool, barbecue, beach shared, tennis courts shared

【Comments】:

  • Preemptive advice - don't parse HTML with regular expressions ;)
  • @liborm you beat me to that comment.....

Tags: python regex string beautifulsoup


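【Solution 1】:

You can use a few different libraries for this; the common choices are Beautiful Soup and lxml. I prefer lxml because implementations exist for most languages, much like regex, so I feel I get more return on the investment of learning it.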

from lxml import html

stuff = '</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">'
# Parse the fragment; lxml recovers from the unbalanced tags.
stuff = html.fromstring(stuff)
# Select the text of any child of a <p> whose text mentions AMENITIES or SHARED.
ptag  = stuff.xpath('//p/*[contains(text(),"AMENITIES") or contains(text(), "SHARED")]//text()')
print(ptag)
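
The XPath query above returns only the heading text, so the items that follow the heading still need to be tagged. One possible way to finish the job with lxml, offered as a sketch rather than the answerer's own method: walk every text node in document order, switch a flag on when a `<strong class="title">` heading mentions one of the keywords, and append " shared" to each comma-separated item seen while the flag is on. The function name `tag_shared_items` and its `keywords` parameter are made up for this illustration, and the sketch assumes that a later heading without a keyword should switch tagging off again.

```python
from lxml import html


def tag_shared_items(fragment, keywords=("SHARED", "AMENITIES")):
    """Hypothetical helper: append ' shared' to items that follow a matching heading."""
    tree = html.fromstring(fragment)
    items = []
    shared = False  # becomes True once a matching heading has been seen

    # '//text()' yields every text node in document order; lxml's string
    # results know their parent element and whether they are text or tail.
    for node in tree.xpath('//text()'):
        parent = node.getparent()
        is_heading = (
            node.is_text
            and parent.tag == 'strong'
            and parent.get('class') == 'title'
        )
        if is_heading:
            # Assumption: a heading without a keyword turns tagging off.
            shared = any(word in node.upper() for word in keywords)
            continue
        for part in node.split(','):
            part = part.strip()
            if part:
                items.append(part + ' shared' if shared else part)

    return ', '.join(items)


sample = ('</strong></p> swimming pool, barbecue <hr /> '
          '<p><strong class="title">SHARED CLUB AMENITIES</strong></p> '
          'beach, tennis courts <hr /> <p><strong class="title">')
print(tag_shared_items(sample))
```

Because the heading text is skipped rather than emitted, the heading is effectively removed from the output, which is what the question title asks for.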

【Comments】:

    【Solution 2】:

    I got it working with the code below. Any comments and suggestions are welcome!

    import pandas as pd
    from bs4 import BeautifulSoup

    html_to_parse = '</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">'

    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(html_to_parse, 'html.parser')

    # Text of the first <strong class="title"> heading.
    shared_indicator = soup.find('strong', 'title').get_text()

    # Everything before the heading text counts as non-shared.
    non_shared_amenities = html_to_parse.split(shared_indicator, 1)[0]
    non_shared_amenities = (BeautifulSoup(non_shared_amenities, 'html.parser')
                            .get_text()
                            .strip())

    # Everything after the heading text counts as shared.
    shared_amenities = html_to_parse.split(shared_indicator, 1)[1]

    # Split on commas, collapse stray punctuation and whitespace, then append " shared".
    shared_amenities_array = (pd.Series(BeautifulSoup(shared_amenities, 'html.parser')
                                        .get_text()
                                        .split(','))
                              .replace("[^A-Za-z0-9'`]+", " ", regex=True)
                              .str.strip()
                              .apply(lambda x: "{}{}".format(x, ' shared')))

    shared_amenities_tagged = ", ".join(shared_amenities_array)

    print(non_shared_amenities + ', ' + shared_amenities_tagged)
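
The code above depends on Beautiful Soup and pandas, and it uses the first `<strong class="title">` heading it finds without checking that the heading actually mentions "shared" or "amenities". As a rough alternative, here is a minimal sketch that uses only the standard library's `html.parser` and does check the heading text. The names `AmenityParser` and `tag_shared_amenities`, along with the `keywords` and `suffix` parameters, are hypothetical and exist only for this illustration; the sketch also assumes that a later heading without a keyword should switch tagging off.

```python
from html.parser import HTMLParser


class AmenityParser(HTMLParser):
    """Hypothetical parser: collects items and tags those after a matching heading."""

    def __init__(self, keywords=("shared", "amenities"), suffix=" shared"):
        super().__init__()
        self.keywords = keywords
        self.suffix = suffix
        self.in_heading = False  # True while inside <strong class="title">
        self.tagging = False     # True once a matching heading has been seen
        self.items = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "strong" and "title" in classes:
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            # Assumption: a heading without a keyword turns tagging off.
            heading = data.lower()
            self.tagging = any(word in heading for word in self.keywords)
            return
        for item in data.split(","):
            item = item.strip()
            if item:
                self.items.append(item + self.suffix if self.tagging else item)


def tag_shared_amenities(fragment):
    parser = AmenityParser()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered at the end of the input
    return ", ".join(parser.items)


sample = ('</strong></p> swimming pool, barbecue <hr /> '
          '<p><strong class="title">SHARED CLUB AMENITIES</strong></p> '
          'beach, tennis courts <hr /> <p><strong class="title">')
print(tag_shared_amenities(sample))
```

Working on parser events instead of splitting the raw string on the heading text avoids a second parsing pass and keeps the heading itself out of the result.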
    

    【Comments】:
