Scrape a <td> with optional <spans> with regex: answers

【Question title】: Scrape a <td> with optional <spans> with regex
【Posted】: 2018-09-24 18:52:56
【Question description】:

I'm trying to scrape a table like this:

<table><tr>
<td width="100"><p><span style=" font-family:'MS Shell Dlg 2'; font-size:8.25pt;">My title example:</span></p></td>
<td width="440"><p><span style=" font-family:'MS Shell Dlg 2'; font-size:8.25pt;">My text example.</span></p></td>
</tr>
<tr>
<td width="100"><p>My second title:</p></td>
<td width="440"><p>My <span style=" font-family:'MS Shell Dlg 2'; font-size:8.25pt; text-decoration: underline;">second</span> text example.</p></td>
</tr></table>

and present the output as a simple list of dictionaries, like this:

[
{"title": "My title example", "text": "My text example"},
{"title": "My other example", "text": "My <u>second</u> text example"},
{"title": "My title example", "text": "My new example"},
]

But I need to sanitize the markup and swap the underlined part for a <u> tag. This is my code so far:

from bs4 import BeautifulSoup
import re

# Assumes `html` is the BeautifulSoup object parsed from the table markup above.
data = []
# Find the rows in the table
for table_row in html.select("table tr"):
    cells = table_row.find_all('td')
    if len(cells) > 0:
        row_title = cells[0].text.strip()
        paragraphs = []
        # Find all spans in a row
        for run in cells[1].find_all('span'):
            print(run)
            if "text-decoration: underline" in str(run):
                paragraphs.append("{0}{1}{2}".format("<u>", run.text, "</u>"))
            else:
                paragraphs.append(run.text)
        # Build up a sanitized string with all the runs.
        row_text = "".join(paragraphs)
        row = {"title": row_title, "text": row_text}
        data.append(row)
print(data)

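Problem: as you may have noticed, this works perfectly for rows whose text sits entirely inside spans (the first example), but it fails on the second example and scrapes only the underlined part, because the rest of the text is not inside a span tag. So I was thinking that, instead of trying to find the spans, I could strip all the tags and use a regex to replace only the spans I need, like this: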

# Find all runs in a row and strip every tag from them
for paragraph in cells[1].find_all('p'):
    text = re.sub(r'<.*?>', '', str(paragraph))

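This produces the text without tags, but also without the underline formatting, and that is where I'm stuck.

I don't know how to add a condition like that to a regex. Any help is welcome.

Expected output: remove all tags from the paragraph, but replace the spans where text-decoration: underline is found with <u></u> tags.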
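
For reference, the regex-only idea described above can be sketched in two passes: first rewrite the underlined spans as <u>...</u>, then strip every remaining tag except <u> and </u>. This is only a minimal sketch and not code from the thread. The names UNDERLINE_SPAN, OTHER_TAGS and sanitize are made up for illustration, and it assumes spans are never nested and that attribute values never contain a ">" character. Regexes are fragile on real-world HTML, which is why the answers below use the parser instead.

```python
import re

# Pass 1: a <span> whose attributes mention "text-decoration: underline".
# [^>]* keeps the match inside a single opening tag.
UNDERLINE_SPAN = re.compile(
    r'<span\b[^>]*text-decoration:\s*underline[^>]*>(.*?)</span>',
    re.IGNORECASE | re.DOTALL,
)

# Pass 2: any tag except <u> and </u> (negative lookahead on the tag name).
OTHER_TAGS = re.compile(r'<(?!/?u>)[^>]*>')


def sanitize(fragment):
    """Keep underlined runs as <u>...</u> and drop all other tags."""
    marked = UNDERLINE_SPAN.sub(r'<u>\1</u>', fragment)
    return OTHER_TAGS.sub('', marked)


cell = ('<p>My <span style=" font-family:\'MS Shell Dlg 2\'; font-size:8.25pt; '
        'text-decoration: underline;">second</span> text example.</p>')
print(sanitize(cell))  # My <u>second</u> text example.
```

The order of the two passes matters: if the tags were stripped first, the style information needed to recognise an underlined run would already be gone.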

【Question comments】:

Tags: python regex python-3.x beautifulsoup

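【Solution 1】:

When you find a <span> tag with the underline style, you can change its text to add the <u>...</u> tags using span.string = '<u>{}</u>'.format(span.text). After modifying the text, you can remove the <span> tag with unwrap().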

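# Assumes `soup` is the BeautifulSoup object parsed from the table in the question.
result = []
for row in soup.select('table tr'):
    columns = row.find_all('td')
    title = columns[0]
    txt = columns[1]
    # `s` may be None for a span without a style attribute, so guard before using `in`.
    for span in txt.find_all('span', style=lambda s: s and 'text-decoration: underline' in s):
        # Put literal <u>...</u> markers into the text, then drop the span itself.
        span.string = '<u>{}</u>'.format(span.text)
        span.unwrap()

    result.append({'title': title.text, 'text': txt.text})

print(result)
# [{'title': 'My title example:', 'text': 'My text example.'}, {'title': 'My second title:', 'text': 'My <u>second</u> text example.'}]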

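Note: this approach does not actually turn the <span> into a <u> element. It modifies the string and removes the tag.

【Comments】:

What if you have many spans, where only some are underlined and others are not? I think your code would add <u> to every span, not just the underlined ones.
Yes, you're right. If there are multiple span tags, this will only modify the first one. I'll edit my answer to handle multiple span tags.
Take a look at the edit. It changes all the underlined span tags and does nothing to the others.
I'm using this to parse the output of a rich text editor where the only formatting I care about is underlined text, so yes. There will be many "runs" with large text, different text sizes, fonts, and so on. The parser should get only the text and the underline, regardless of what else is there. A column may contain multiple spans of different kinds.
It works. I'll accept your answer because your help with all my comments was invaluable. Nice work! Also, keeping the structure of the code this way makes it easier for me to edit it in the future.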
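
To make the mixed-span case from these comments concrete, here is a small self-contained check. The cell markup and the helper name is_underlined are hypothetical, invented for this sketch. It assumes the bs4 package is installed and uses the built-in html.parser backend.

```python
from bs4 import BeautifulSoup

# Hypothetical cell with three runs; only the middle one is underlined.
cell_html = (
    '<td><p>'
    '<span style="font-size:8pt;">Plain </span>'
    '<span style="font-size:8pt; text-decoration: underline;">underlined</span>'
    '<span style="font-weight:600;"> bold</span>'
    '</p></td>'
)


def is_underlined(style):
    # Guard against None in case a span has no style attribute.
    return style is not None and 'text-decoration: underline' in style


cell = BeautifulSoup(cell_html, 'html.parser').td
for span in cell.find_all('span', style=is_underlined):
    # Write the literal markers into the text, then remove the span wrapper.
    span.string = '<u>{}</u>'.format(span.text)
    span.unwrap()

cell_text = cell.text
print(cell_text)  # Plain <u>underlined</u> bold
```

The non-underlined spans are left in the tree, but .text only collects strings, so their tags never reach the output.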

【Solution 2】:

One idea is to use .replace_with() to replace the "underline" span elements with u elements, and then use .encode_contents() to get the inner HTML of the "text" cell:

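# Assumes `soup` is the BeautifulSoup object parsed from the table in the question.
result = []
for row in soup.select("table tr"):
    title_cell, data_cell = row('td')[:2]

    for span in data_cell('span'):
        if 'underline' in span.get('style', ''):
            u = soup.new_tag("u")
            u.string = span.get_text()
            span.replace_with(u)
        else:
            # replacing the "span" element with its contents
            span.unwrap()

    # replacing the "p" element with its contents (if the cell has one)
    if data_cell.p is not None:
        data_cell.p.unwrap()

    result.append({
        "title": title_cell.get_text(strip=True),
        # encode_contents() returns bytes; decode them to get a str instead of "b'...'"
        "text": data_cell.encode_contents().decode("utf-8").strip()
    })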

【Comments】:

I'm getting b'\n at the start of every row, and I'm not sure what is going wrong there.
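
A likely explanation for that prefix: in Python 3, .encode_contents() returns bytes, and calling str() on a bytes object yields its repr, including the b'...' wrapper and an escaped \n for any newline that follows the opening <td>. The snippet below reproduces the effect with a hard-coded bytes value standing in for the cell contents (an assumption for illustration), so it needs only the standard library.

```python
# Stand-in for what data_cell.encode_contents() might return for one cell.
inner_html = b"\nMy <u>second</u> text example."

# str() on bytes gives the repr, which is where the b'\n prefix comes from.
as_repr = str(inner_html)

# Decoding gives real text; strip() then removes the leading newline.
as_text = inner_html.decode("utf-8").strip()

print(as_repr)
print(as_text)
```

Decoding the bytes (or asking Beautiful Soup for a string in the first place) avoids the prefix.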