Grouping text based on tags in HTML text

【Question title】: Grouping text based on tags in HTML text
【Posted】: 2019-07-03 14:12:52
【Question】:

My text has the following format (tags kept, text removed to make it easier to follow):

<h2>...</h2>
  <p>...</p>
   .      .
   .      .
  <p>...</p>
<h2>...</h2>
  <ul>...</ul>
     <li> .. </li>
  ...
<h2>...</h2>
   <li> ..</li>

I am trying to use Scrapy to separate/group the text by heading. So, as a first step, I need to get 3 groups of data from the markup above.

from scrapy import Selector
sentence = "above text in the format"
sel = Selector(text=sentence)
# item = sel.xpath("//h2//text()")
item = sel.xpath("//h2/following-sibling::li/ul/p//text()").extract()

I get an empty array. Any help is appreciated.

【Comments】:

Found the answer; I can do this with BeautifulSoup. stackoverflow.com/questions/14444732/…
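
For reference, a minimal sketch of what a BeautifulSoup version might look like. It is not taken from the linked answer. It assumes every <h2> sits at the same nesting level as the content that follows it. The sample markup and the helper name group_by_h2_bs4 are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup in the shape described in the question.
SAMPLE_HTML = """
<h2>First</h2>
<p>alpha</p>
<p>beta</p>
<h2>Second</h2>
<ul><li>gamma</li></ul>
<h2>Third</h2>
<ul><li>delta</li></ul>
"""


def group_by_h2_bs4(html_text):
    """Return a list of (heading, [texts]) pairs, one per <h2>."""
    soup = BeautifulSoup(html_text, "html.parser")
    groups = []
    for h2 in soup.find_all("h2"):
        texts = []
        # Walk the siblings that follow this heading, in document order.
        for sibling in h2.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip bare whitespace/text between tags
            if sibling.name == "h2":
                break  # the next heading starts a new group
            texts.extend(sibling.stripped_strings)
        groups.append((h2.get_text(strip=True), texts))
    return groups


if __name__ == "__main__":
    for heading, texts in group_by_h2_bs4(SAMPLE_HTML):
        print(heading, texts)
```

Stopping at the next <h2> sibling is what keeps each group separate; content nested inside a wrapper element around a heading would need a different walk.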

Tags: html python-3.x parsing scrapy

【Solution 1】:

I have this solution, done with Scrapy:

import scrapy
from lxml import etree, html


class TagsSpider(scrapy.Spider):
    name = 'tags'
    start_urls = [
        'https://support.litmos.com/hc/en-us/articles/227739047-Sample-HTML-Header-Code'
    ]

    def parse(self, response):
        for header in response.xpath('//header'):
            with open('test.html', 'a+') as file:
                file.write(
                    etree.tostring(
                        html.fromstring(header.extract()),
                        encoding='unicode',
                        pretty_print=True,
                    )
                )

I used it to get each <header> element and everything inside it.

【Comments】: