Python: Parse HTML to remove a tag and apply text transformation to all text after the tag

【Question title】: Python: Parse HTML to remove a tag and apply text transformation to all text after the tag
【Posted】: 2018-07-18 20:00:51
【Question】:

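I'm trying to detect strings that contain the HTML tag <p><strong class="title"> </strong></p> with certain words ("shared" OR "amenities") inside the tag, and then append the word "shared" to every comma-separated substring that appears after that tag. Is there a simple way to do this?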

Example input:

</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">

Example output:

swimming pool, barbecue, beach shared, tennis courts shared

【Comments】:

Advance warning: don't parse HTML with regular expressions ;)
@liborm you beat me to that comment...

Tags: python regex string beautifulsoup

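【Solution 1】:

There are several libraries you could use for this; common choices are Beautiful Soup and lxml. I prefer lxml because, much like regex, implementations exist for most languages, so I feel I get more out of the time invested in learning it.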

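from lxml import html

stuff = '</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">'
# Parse the fragment; lxml tolerates the stray and unclosed tags.
tree = html.fromstring(stuff)
# Select the text of any child of <p> whose text mentions AMENITIES or SHARED.
# Note that XPath contains() is case-sensitive.
ptag = tree.xpath('//p/*[contains(text(),"AMENITIES") or contains(text(), "SHARED")]//text()')
print(ptag)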
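
The XPath above only locates the heading text; it does not yet append "shared" to the items that follow it. A minimal sketch of one way to finish the job with lxml is shown below. The function name `tag_shared_lxml` and its `keywords` and `suffix` parameters are hypothetical, and the sketch assumes that a heading which does not mention a keyword switches the tagging off again (the question only shows a single heading, so that behavior is a guess).

```python
from lxml import html


def tag_shared_lxml(fragment, keywords=("shared", "amenities"), suffix=" shared"):
    """Return the comma-separated items, suffixing those after a matching heading."""
    root = html.fromstring(fragment)
    tagging = False  # becomes True once a matching heading has been seen
    items = []
    # //text() returns text nodes in document order; each node knows its parent.
    for node in root.xpath("//text()"):
        parent = node.getparent()
        is_heading = (
            node.is_text
            and parent.tag == "strong"
            and "title" in (parent.get("class") or "").split()
        )
        if is_heading:
            # The heading itself is not an item; it only sets the mode.
            tagging = any(word in node.lower() for word in keywords)
            continue
        for item in node.split(","):
            item = item.strip()
            if item:
                items.append(item + suffix if tagging else item)
    return ", ".join(items)


sample = ('</strong></p> swimming pool, barbecue <hr /> '
          '<p><strong class="title">SHARED CLUB AMENITIES</strong></p> '
          'beach, tennis courts <hr /> <p><strong class="title">')
print(tag_shared_lxml(sample))
```

Walking the text nodes instead of fixed element paths keeps the sketch independent of how the parser chooses to wrap the loose text around the headings.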

【Discussion】:

【Solution 2】:

I got it working with the code below. Any comments and suggestions are welcome!

import pandas as pd
from bs4 import BeautifulSoup

html_to_parse = '</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">'

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html_to_parse, 'html.parser')

# Text of the first <strong class="title"> heading, e.g. "SHARED CLUB AMENITIES".
shared_indicator = soup.find('strong', 'title').get_text()

# Split the raw HTML once around the heading text.
before, after = html_to_parse.split(shared_indicator, 1)

# Everything before the heading is kept as-is, minus the markup.
non_shared_amenities = BeautifulSoup(before, 'html.parser').get_text().strip()

# Everything after the heading: split on commas, clean up, append " shared".
shared_amenities_array = (
    pd.Series(BeautifulSoup(after, 'html.parser').get_text().split(','))
    .replace("[^A-Za-z0-9'`]+", " ", regex=True)
    .str.strip()
    .apply(lambda x: "{}{}".format(x, ' shared'))
)

shared_amenities_tagged = ", ".join(shared_amenities_array)

print(non_shared_amenities + ', ' + shared_amenities_tagged)
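
For comparison, the same single pass can be sketched without pandas or Beautiful Soup, using only the standard library's `html.parser.HTMLParser`. This swaps in a different technique from the answer above: an event-driven parser that tracks whether the most recent `<strong class="title">` heading mentioned a keyword. The names `AmenityParser` and `tag_shared` are hypothetical, and the sketch assumes that a heading without "shared" or "amenities" turns tagging off again, which the question does not specify.

```python
from html.parser import HTMLParser


class AmenityParser(HTMLParser):
    """Collect comma-separated items, tagging those under a matching heading."""

    def __init__(self, keywords=("shared", "amenities"), suffix=" shared"):
        super().__init__()
        self.keywords = keywords
        self.suffix = suffix
        self.in_title = False  # currently inside <strong class="title">
        self.tagging = False   # the latest heading mentioned a keyword
        self.items = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "strong" and "title" in classes:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            # Heading text is not an item; it decides how later items are treated.
            heading = data.lower()
            self.tagging = any(word in heading for word in self.keywords)
            return
        for item in data.split(","):
            item = item.strip()
            if item:
                self.items.append(item + self.suffix if self.tagging else item)


def tag_shared(html_text):
    parser = AmenityParser()
    parser.feed(html_text)
    parser.close()
    return ", ".join(parser.items)


sample = ('</strong></p> swimming pool, barbecue <hr /> '
          '<p><strong class="title">SHARED CLUB AMENITIES</strong></p> '
          'beach, tennis courts <hr /> <p><strong class="title">')
print(tag_shared(sample))
# swimming pool, barbecue, beach shared, tennis courts shared
```

Unlike the split-on-heading approach, this version never searches the raw HTML for the heading string, so it should also cope with input that contains more than one heading.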

【Discussion】: