【Question title】: How to make BeautifulSoup 'replace_with' attribute work with a 'unicode' object?
【Posted】: 2019-03-06 12:30:48
【Question】:

This is my HTML:

<html>
<body>
<h2>Pizza</h2>
<p>This is some random paragraph without child tags.</p>
<p>Delicious homebaked pizza.<br><em>$8.99 pp</em></p>
<h2>Eggplant Parmesan</h2>
<p>Try the authentic <i>Italian flavor</i> of baked aubergine.<br><em>$6.99 pp</em></p>
<h2>Italian Ice Cream</h2>
<p>Our dessert specialty.<br><em>$3.99 pp</em></p>
</body>
</html>

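Using BeautifulSoup, I want to take the text displayed for the h2 and p tags, replace it in the tree with a prefixed version, and print it to the screen. For the h2 tags this works fine: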

from bs4 import BeautifulSoup

with open("/var/www/html/Test/index.html", "r") as f:
    soup = BeautifulSoup(f, "lxml")

f = open("/var/www/html/Test/I18N_index.html", "w+")

for h2 in soup.find_all('h2'):
    i18n_string = "I18N_"+h2.string
    h2.string.replace_with(i18n_string)
    print(h2.string)

f.write(str(soup))


###Output:##############################################
# $ python ./test.py
# I18N_Pizza
# I18N_Eggplant Parmesan
# I18N_Italian Ice Cream
########################################################

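In my I18N_index.html, all 3 strings appear correctly with the "I18N_" prefix.

However, my p tags contain child tags, and for those tags p.string returns None. As a result, the concatenation no longer works: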

    for p in soup.find_all('p'):
        i18n_string = "I18N_"+p.string
        p.string.replace_with(i18n_string)
        print(p.string)

    f.write(str(soup))

###Output:##################################################
# $ python ./test.py
# I18N_Pizza
# I18N_Eggplant Parmesan
# I18N_Italian Ice Cream
# I18N_This is some random paragraph without child tags.
# Traceback (most recent call last):
  # File "./test.py", line 15, in <module>
    # i18n_string = "I18N_"+p.string
# TypeError: cannot concatenate 'str' and 'NoneType' objects
############################################################

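From this thread I learned about the join function. It lets me do the concatenation and print the resulting string to the screen, but it does not let me do the replacement in the soup tree: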

for p in soup.find_all('p'):
    joined = ''.join(p.strings)
    i18n_string = "I18N_"+joined
    #joined.replace_with(i18n_string)
    print (i18n_string)

###Output with 'joined.replace_with(i18n_string)' DISABLED:###
# I18N_Pizza
# I18N_Eggplant Parmesan
# I18N_Italian Ice Cream
# I18N_This is some random paragraph without child tags.
# I18N_Delicious homebaked pizza.$8.99 pp
# I18N_Try the authentic Italian flavor of baked aubergine.$6.99 pp
# I18N_Our dessert specialty.$3.99 pp
############################################################

###Output with 'joined.replace_with(i18n_string)' ENABLED:#####
# I18N_Pizza
# I18N_Eggplant Parmesan
# I18N_Italian Ice Cream
# Traceback (most recent call last):
  # File "./test.py", line 41, in <module>
    # joined.replace_with(i18n_string)
# AttributeError: 'unicode' object has no attribute 'replace_with'
############################################################

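That thread also mentions another solution based on isinstance, but I could not get it to work.

If I understand correctly, the join function concatenates the strings but returns a 'unicode' object rather than a string object, and that is why the 'replace_with' attribute does not work. How can I solve this? Any help is much appreciated.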

【Comments】:

    Tags: python beautifulsoup


    【Solution 1】:

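    The replace_with() method fails not because joined is a unicode object, but because replace_with() is a method specific to bs4 objects (tags and the strings that live in the tree). The result of join() is a plain Python string that is not part of the tree. See: BeautifulSoup-replace_with

    By the way, in Python 3 the join() method returns a str; see: python3-join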
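The difference between the two kinds of string can be checked directly. The following is a minimal sketch, assuming Python 3 and the built-in "html.parser" (chosen here only to avoid the lxml dependency); the inline HTML snippet is a shortened stand-in for the page in the question.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Shortened stand-in for the question's page (assumption for illustration).
html = "<h2>Pizza</h2><p>Our dessert specialty.<br><em>$3.99 pp</em></p>"
soup = BeautifulSoup(html, "html.parser")

h2 = soup.find("h2")
p = soup.find("p")

# A tag with exactly one text child exposes it as a NavigableString,
# which is part of the tree and therefore has replace_with().
print(isinstance(h2.string, NavigableString))
print(hasattr(h2.string, "replace_with"))

# A tag with several children has no single string, so .string is None.
print(p.string)

# Joining .strings builds a plain Python string, detached from the tree.
joined = "".join(p.strings)
print(isinstance(joined, NavigableString))
print(hasattr(joined, "replace_with"))
```

So the AttributeError comes from calling a tree method on a value that was never in the tree, not from the string's encoding.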

    As for a solution, I would simply drop the .string after the p tag and call replace_with() on the tag itself:

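    from bs4 import BeautifulSoup

    # Parse the source page.
    with open("index.html", "r") as f:
        soup = BeautifulSoup(f, "lxml")

    # Each h2 holds a single text node, so h2.string is usable directly.
    for h2 in soup.find_all('h2'):
        i18n_string = "I18N_" + h2.string
        h2.string.replace_with(i18n_string)
        print(h2.string)

    # p.string is None when a p has child tags, so join all text pieces
    # and call replace_with() on the tag (this replaces the whole <p>).
    for p in soup.find_all('p'):
        joined = ''.join(p.strings)
        i18n_string = "I18N_" + joined
        p.replace_with(i18n_string)
        print(i18n_string)

    # Write the modified tree; the with block closes the file.
    with open("I18N_index.html", "w") as out:
        out.write(str(soup))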
    

    Output:

    I18N_Pizza
    I18N_Eggplant Parmesan
    I18N_Italian Ice Cream
    I18N_This is some random paragraph without child tags.
    I18N_Delicious homebaked pizza.$8.99 pp
    I18N_Try the authentic Italian flavor of baked aubergine.$6.99 pp
    I18N_Our dessert specialty.$3.99 pp
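One thing worth knowing about this approach: calling replace_with() on the tag swaps out the whole <p> element, so the written file contains the prefixed text without its <p> wrapper. If the wrapper should stay, assigning to the tag's .string attribute is one alternative. The sketch below compares the two; it assumes the built-in "html.parser" and a one-paragraph stand-in for the real page.

```python
from bs4 import BeautifulSoup

# One-paragraph stand-in for the question's page (assumption for illustration).
html = "<body><p>Our dessert specialty.<br><em>$3.99 pp</em></p></body>"

# Variant A: replace the tag itself, as in the solution's loop.
soup_a = BeautifulSoup(html, "html.parser")
p = soup_a.find("p")
p.replace_with("I18N_" + "".join(p.strings))
print(soup_a)  # the <p> element is gone, only the text remains in <body>

# Variant B: assign to .string, which replaces the tag's contents
# (including the <br> and <em> children) but keeps the <p> itself.
soup_b = BeautifulSoup(html, "html.parser")
p = soup_b.find("p")
p.string = "I18N_" + "".join(p.strings)
print(soup_b)
```

Both variants discard the child tags inside the paragraph; they differ only in whether the <p> element survives.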

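    【Comments】:

    • This solution works. Thanks a lot, and thanks for the additional information.
    【Solution 2】:

    Using a simplified version of your code (i.e. dealing only with the p tag problem), it looks like you have to replace p.string with p.text: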

    soup = BeautifulSoup([your html], "lxml")

    for p in soup.find_all('p'):
        # p.text concatenates all text inside the tag, child tags included
        print('before: ', p.text)
        i18n_string = "I18N_" + p.text
        print('after ', i18n_string)
    

    Output:

    before:  This is some random paragraph without child tags.
    after  I18N_This is some random paragraph without child tags.
    before:  Delicious homebaked pizza.$8.99 pp
    after  I18N_Delicious homebaked pizza.$8.99 pp
    before:  Try the authentic Italian flavor of baked aubergine.$6.99 pp
    after  I18N_Try the authentic Italian flavor of baked aubergine.$6.99 pp
    before:  Our dessert specialty.$3.99 pp
    after  I18N_Our dessert specialty.$3.99 pp
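Note that p.text only reads the text; it does not write anything back into the tree. If the goal is to put the prefix into the document while keeping child tags such as <i> and <em>, one possible approach is to prefix only the first text node inside each paragraph. This is a sketch under assumptions, not the method of either answer: it uses the built-in "html.parser" and a single paragraph from the question as input.

```python
from bs4 import BeautifulSoup

# Single paragraph taken from the question's page.
html = ("<p>Try the authentic <i>Italian flavor</i> of baked aubergine."
        "<br><em>$6.99 pp</em></p>")
soup = BeautifulSoup(html, "html.parser")

for p in soup.find_all("p"):
    # First text node anywhere inside the tag (None if the tag has no text).
    first = p.find(string=True)
    if first is not None:
        # The text node is part of the tree, so replace_with() is available.
        first.replace_with("I18N_" + first)

print(soup)
```

Because only a text node is replaced, the <i>, <br> and <em> elements stay where they were.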
    

    【Comments】:

    • Thanks for the reply. I had tried 'text' before, but it did not solve my problem of not being able to use 'replace_with'.