【Question title】:Find and insert tags using regex
【Posted】:2019-06-17 20:22:10
【Question description】:

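I am converting a book from PDF to epub with calibre. The chapter titles are not inside heading tags, so I am trying to fix that with a Python function that does a regex replacement.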

Sample text:

<p class="calibre1"><a id="p1"></a>Chapter 370: Slamming straight on</p>
<p class="softbreak"> </p>
<p class="calibre1">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>
<p class="calibre1"><a id="p7"></a>Chapter 372: Yan Zhaoge’s plan</p>
<p class="softbreak"> </p>
<p class="calibre1">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>

I tried the following regex-based functions:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    
    pattern = r"</a>(?i)chapter [0-9]+: [\w\s]+(.*)<br>"
    list = re.findall(pattern, match.group())
    
    for x in list:
        x = "</a>(?i)chapter [0-9]+: [\w\s]+(.?)<br>"
        x = s.split("</a>", 1)[0] + '</a><h2>' + s.split("a>",1)[1]
        x = s.split("<br>", 1)[0] + '</h2><br>' + s.split("<br>",1)[1]
    return match.group()


def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    pattern = r"</a>(?i)chapter [0-9]+: [\w\s]+(.*)<br>"
    s.replace(re.match(pattern, s), r'<h2>$0')

But I still do not get the expected result. What I want is:

Input

</a>Chapter 370: Slamming straight on</p>

Output

</a><h2>Chapter 370: Slamming straight on</h2></p>

That is, add h2 tags around every similar instance.

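【Discussion】:

  • You should probably use an XML parser instead. Don't parse XML with regex.
  • I'm not modifying anything; I'm converting to epub to read on a mobile device, and as a programmer I'm curious how this could be done.

Tags: python regex calibre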


【Solution 1】:

Regex should not be used to parse XML. See: "Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms" ("Why shouldn't you..." would be a better title).
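As a small illustration of that concern, here is a minimal sketch using only the standard `re` module. The second sample string, with inline `<i>` markup inside the title, is a hypothetical variant and does not come from the book in the question.

```python
import re

# A simple pattern: the text between </a> and </p>, with no tags allowed in between.
pattern = re.compile(r"(</a>)([^<]+)(</p>)")

# Title exactly as in the question's sample text.
plain = '<p class="calibre1"><a id="p1"></a>Chapter 370: Slamming straight on</p>'
# Hypothetical variant: the same title with inline markup inside it.
nested = '<p class="calibre1"><a id="p1"></a>Chapter 370: <i>Slamming</i> straight on</p>'

plain_match = pattern.search(plain)
nested_match = pattern.search(nested)

print(plain_match is not None)   # the plain title is found
print(nested_match is not None)  # the nested title is silently missed
```

The pattern works for flat text but finds nothing once a tag appears inside the title, which is the kind of case a parser handles without special treatment.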

However, you can use BeautifulSoup instead:

from bs4 import BeautifulSoup
data = """<p class="calibre1"><a id="p1"></a>Chapter 370: Slamming straight on</p>
<p class="softbreak"> </p>
<p class="calibre1">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>
<p class="calibre1"><a id="p7"></a>Chapter 372: Yan Zhaoge’s plan</p>
<p class="softbreak"> </p>
<p class="calibre1">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>
i t"""

soup = BeautifulSoup(data, 'lxml')


for x in soup.find_all('p', {'class':'calibre1'}):

    link = x.find('a')
    title = x.text

    # Only paragraphs that contain an anchor are treated as chapter titles.
    if link:
        x.string = ''  # clear the paragraph (this also detaches the anchor)
        corrected_title = soup.new_tag('h2')
        corrected_title.append(title)
        link.append(corrected_title)  # the h2 ends up nested inside the anchor
        x.append(link)

print(soup.body)

Output

<body>
    <p class="calibre1">
        <a id="p1">
            <h2>Chapter 370: Slamming straight on</h2>
        </a>
    </p>
    <p class="softbreak"> </p>
    <p class="calibre1">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>
    <p class="calibre1">
        <a id="p7">
            <h2>Chapter 372: Yan Zhaoge’s plan</h2>
        </a>
    </p>
    <p class="softbreak"> </p>
    <p class="calibre1">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>
    i t
</body>

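【Discussion】:

  • Thanks @Sebastien, but unfortunately only regex is allowed. Also, this doesn't work.
  • @Andruraj, my bad, I corrected it. Just curious, why is only regex allowed?
  • I'm using calibre to convert PDF to epub, and it only allows regex. Hope that clarifies it.
【Solution 2】: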

Jean-François's comment is the better advice, but if we have to do this with regex, I guess we would start with one of these expressions:

(<\/a>)([^<]+)?(<\/p>)
(<\/a>)(chapter\s+[0-9]+[^<]+)?(<\/p>)

replaced with:

\1<h2>\2</h2>\3

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(<\/a>)(chapter\s+[0-9]+[^<]+)?(<\/p>)"

test_str = ("<p class=\"calibre1\"><a id=\"p1\"></a>Chapter 370: Slamming straight on</p>\n"
    "<p class=\"softbreak\"> </p>\n"
    "<p class=\"calibre1\">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>\n"
    "<p class=\"calibre1\"><a id=\"p7\"></a>Chapter 372: Yan Zhaoge’s plan</p>\n"
    "<p class=\"softbreak\"> </p>\n"
    "<p class=\"calibre1\">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>")

subst = "\\1<h2>\\2</h2>\\3"

# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE | re.IGNORECASE)

if result:
    print (result)

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
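The same replacement can also be sketched in the shape of the `replace` function from the question. This assumes that calibre calls the function once per match of the search expression and uses the returned string as the replacement. The name `CHAPTER_PATTERN` and the stand-alone check with `None` arguments are illustrative only.

```python
import re

# Search expression (assumed to be entered in calibre's search field):
# a closing anchor tag followed by a "Chapter N" title that runs up to </p>.
CHAPTER_PATTERN = r"(</a>)(chapter\s+[0-9]+[^<]+)(</p>)"


def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Assumption: the returned string replaces the matched text.
    return match.group(1) + "<h2>" + match.group(2) + "</h2>" + match.group(3)


# Stand-alone check outside calibre; None stands in for the arguments calibre would supply.
sample = '<p class="calibre1"><a id="p1"></a>Chapter 370: Slamming straight on</p>'
converted = re.sub(
    CHAPTER_PATTERN,
    lambda m: replace(m, 1, None, None, None, None, None),
    sample,
    flags=re.IGNORECASE,
)
print(converted)
# <p class="calibre1"><a id="p1"></a><h2>Chapter 370: Slamming straight on</h2></p>
```

Unlike the attempts in the question, this version builds the new string from the match groups and returns it, rather than returning `match.group()` unchanged.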

【Discussion】:

  • Thanks @Emma. But it also detects tags (for both regexes).
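One possible source of unwanted matches, offered as a guess rather than a diagnosis of the comment above: the title group in both expressions is optional, so an anchor followed directly by `</p>` also matches and receives an empty `<h2></h2>`. A minimal sketch of a stricter variant follows. The second paragraph in `sample` is hypothetical, and the unmatched-group behaviour shown applies to Python 3.5 and later.

```python
import re

# Second expression from the answer: the title group is optional.
optional = r"(</a>)(chapter\s+[0-9]+[^<]+)?(</p>)"
# Stricter variant (an assumption, not from the answer): title and colon are required.
required = r"(</a>)(chapter\s+[0-9]+:[^<]+)(</p>)"

sample = (
    '<p class="calibre1"><a id="p1"></a>Chapter 370: Slamming straight on</p>\n'
    '<p class="calibre1"><a id="p2"></a></p>'  # hypothetical paragraph holding only an anchor
)

subst = r"\1<h2>\2</h2>\3"
loose = re.sub(optional, subst, sample, flags=re.IGNORECASE)
strict = re.sub(required, subst, sample, flags=re.IGNORECASE)

print(loose)   # the anchor-only paragraph also gains an empty <h2></h2>
print(strict)  # only the chapter title is wrapped
```

Requiring the title group means paragraphs without a chapter heading are left untouched.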