【Question title】: str.replace returns ValueError beautiful soup
【Posted】: 2020-05-21 23:04:30
【Question】:

I have the following HTML:

<body><h3>Full Results for race 376338</h3>"Category","Position","Name","Time","Team"<br>"A","1","James","20:20:00","5743"<br><br>"A","2","Matt","20:15:00"<br>

It goes on like <br> # some text <br> for hundreds of lines. I want to start a new line at each <br>, so that it ends up in CSV format like this:

<body><h3>Full Results for race 376338</h3>"Category","Position","Name","Time","Team"
<br>"A","1","James","20:20:00","5743"<br>
<br>"A","2","Matt","20:15:00"<br>

I have this code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_string, features="html.parser")

for br in soup.find_all('br'):
    soup.replace_with("\n")

With this I get the error: ValueError: Cannot replace one element with another when the element to be replaced is not part of a tree. What do I need to change?

【Comments】:

  • What is html_string('br') supposed to do? I think you mean soup.find_all('br').
  • What is html_string? Check its type. It may be a byte string.
  • @JohnGordon Correct, I did that, but whether I use body or br, it returns ValueError: Cannot replace one element with another when the element to be replaced is not part of a tree.
  • @PythonIsBae, can you post the updated code?
  • @PythonIsBae, try calling replace_with on br instead of on soup.
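
The last comment's suggestion can be sketched as follows. The loop in the question calls replace_with on soup, which is the root of the tree and has no parent to be replaced in, hence the ValueError; calling it on each br tag works. The html_string below is a shortened copy of the sample from the question, with a closing </body> added as an assumption.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample HTML from the question (closing </body> is assumed).
html_string = (
    '<body><h3>Full Results for race 376338</h3>'
    '"Category","Position","Name","Time","Team"<br>'
    '"A","1","James","20:20:00","5743"<br><br>'
    '"A","2","Matt","20:15:00"<br></body>'
)

soup = BeautifulSoup(html_string, features="html.parser")

# Replace each <br> tag itself, not the soup object, with a newline.
for br in soup.find_all("br"):
    br.replace_with("\n")

# The remaining text now has one record per line.
text = soup.get_text()
print(text)
```

Note that the doubled <br><br> in the sample leaves an empty line between the two records, so blank lines may need to be filtered out afterwards.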

Tags: python html beautifulsoup


【Solution 1】:

You want the text attribute.

In [15]: soup.text
Out[15]: 'Full Results for race 376338"Category","Position","Name","Time","Team"\n"A","1","James","20:20:00","5743"\n"A","2","Matt","20:15:00"'

In [16]: soup.text.split()
Out[16]: 
['Full',
 'Results',
 'for',
 'race',
 '376338"Category","Position","Name","Time","Team"',
 '"A","1","James","20:20:00","5743"',
 '"A","2","Matt","20:15:00"']

In [17]: soup.text.split()[4:]
Out[17]: 
['376338"Category","Position","Name","Time","Team"',
 '"A","1","James","20:20:00","5743"',
 '"A","2","Matt","20:15:00"']

Or the get_text method.

In [24]: soup.get_text()
Out[24]: 'Full Results for race 376338"Category","Position","Name","Time","Team"\n"A","1","James","20:20:00","5743"\n"A","2","Matt","20:15:00"'

Or

In [25]: [text for text in soup.stripped_strings]
Out[25]: 
['Full Results for race 376338',
 '"Category","Position","Name","Time","Team"',
 '"A","1","James","20:20:00","5743"',
 '"A","2","Matt","20:15:00"']

The last two come straight from the documentation.
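
If the goal is actual CSV rows rather than raw text, one possible follow-up is to feed the strings from stripped_strings into the standard csv module. This is only a sketch: html_string is a shortened copy of the sample from the question, and dropping the first string assumes it is always the <h3> heading.

```python
import csv

from bs4 import BeautifulSoup

# Shortened copy of the sample HTML from the question (closing </body> is assumed).
html_string = (
    '<body><h3>Full Results for race 376338</h3>'
    '"Category","Position","Name","Time","Team"<br>'
    '"A","1","James","20:20:00","5743"<br><br>'
    '"A","2","Matt","20:15:00"<br></body>'
)

soup = BeautifulSoup(html_string, features="html.parser")

# stripped_strings yields one string per text node, skipping whitespace-only ones.
# Assumption: the first string is the <h3> heading, so it is dropped.
lines = list(soup.stripped_strings)[1:]

# csv.reader accepts any iterable of strings and handles the quoted fields.
rows = list(csv.reader(lines))

for row in rows:
    print(row)
```

Rows can have different lengths here, since the second record in the sample has no Team value.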

【Comments】:
