【Question title】: Extract text between specified HTML chunks in Python
【Posted】: 2019-01-11 05:32:39
【Question】:

I have the piece of HTML below and only need to extract the text between

<p>Current</p> and <p>Archive</p>

The HTML chunk looks like:

<p>Current</p>
<a href="some link to somewhere 1">File1</a>
<br>
<a href="some link to somewhere 2">File2</a>
<br>
<a href="some link to somewhere 3">File3</a>
<br>
<p>Archive</p>
<a href="Some another link to another file">Some another file</a>

So the desired output should be something like File1, File2, File3.

This is what I have tried so far:

import re
m = re.compile('<p>Current</p>(.*?)<p>Archive</p>').search(text)

But it did not work as expected.

Is there a simple way to extract the text between specified HTML tag chunks in Python?

Tags: python python-3.x python-2.7


【Solution 1】:

If you insist on using a regex, you can combine it with list slicing, like so:

chunk="""<p>Current</p>
<a href="some link to somewhere 1">File1</a>
<br>
<a href="some link to somewhere 2">File2</a>
<br>
<a href="some link to somewhere 3">File3</a>
<br>
<p>Archive</p>
<a href="Some another link to another file">Some another file</a>"""

import re

# find everything between > and <, as short as possible (non-greedy)
found = re.findall(r">(.+?)<", chunk)

# keep only the entries after "Current" and before "Archive"
found = found[found.index("Current") + 1:found.index("Archive")]

print(found)  # works as written in Python 3 and in Python 2.7

Output:

['File1', 'File2', 'File3']
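
The pattern from the question can also be made to work. A likely reason it returned nothing is that `.` does not match newlines by default, so the lazy group cannot span the lines between the two markers. The sketch below assumes the markup is as regular as in the sample; the names `html_text`, `block` and `names` are only illustrative.

```python
import re

html_text = """<p>Current</p>
<a href="some link to somewhere 1">File1</a>
<br>
<a href="some link to somewhere 2">File2</a>
<br>
<a href="some link to somewhere 3">File3</a>
<br>
<p>Archive</p>
<a href="Some another link to another file">Some another file</a>"""

# re.DOTALL lets "." match newlines, so the lazy group can span several lines
block = re.search(r"<p>Current</p>(.*?)<p>Archive</p>", html_text, re.DOTALL)

# Extract the link texts from the captured block only
names = re.findall(r"<a\b[^>]*>(.*?)</a>", block.group(1), re.DOTALL) if block else []

print(names)
```

For the sample above this prints ['File1', 'File2', 'File3']. Regexes like this tend to break on nested or irregular markup, so an HTML parser is usually the safer choice for real pages.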

【Solution 2】:
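    from bs4 import BeautifulSoup as bs

    html_text = """
    <p>Current</p>
    <a href="some link to somewhere 1">File1</a>
    <br>
    <a href="some link to somewhere 2">File2</a>
    <br>
    <a href="some link to somewhere 3">File3</a>
    <br>
    <p>Archive</p>
    <a href="Some another link to another file">Some another file</a>"""

    # Parse the markup with the parser bundled with Python
    soup = bs(html_text, "html.parser")

    # Collect every <a> tag in the document
    a_tag = soup.find_all("a")

    text = []
    for i in a_tag:
        # get_text() is a method of each tag, so call it on the loop variable
        text.append(i.get_text())

    print(text)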
    

Output:

    ['File1', 'File2', 'File3', 'Some another file']
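
Note that this output also contains the link that follows <p>Archive</p>. One possible way to keep only the links between the two markers is to start at the "Current" paragraph and walk its following siblings until the "Archive" paragraph is reached. This is a sketch that assumes both markers and the links are siblings at the same level, as in the sample; `start` and `names` are illustrative names.

```python
from bs4 import BeautifulSoup

html_text = """<p>Current</p>
<a href="some link to somewhere 1">File1</a>
<br>
<a href="some link to somewhere 2">File2</a>
<br>
<a href="some link to somewhere 3">File3</a>
<br>
<p>Archive</p>
<a href="Some another link to another file">Some another file</a>"""

soup = BeautifulSoup(html_text, "html.parser")

# Locate the <p> whose text is exactly "Current"
start = soup.find("p", string="Current")

names = []
if start is not None:
    # Walk the sibling tags that follow the start marker, in document order
    for tag in start.find_next_siblings():
        if tag.name == "p" and tag.get_text(strip=True) == "Archive":
            break  # reached the end marker
        if tag.name == "a":
            names.append(tag.get_text(strip=True))

print(names)
```

For the sample markup this prints ['File1', 'File2', 'File3'].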
    

The BeautifulSoup library is very useful for parsing HTML files and extracting text from them.
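
If adding a third-party dependency is not an option, a similar result can be sketched with `html.parser` from the standard library. The class name `BetweenMarkers` and its attributes are hypothetical, and the approach assumes the markers are plain <p> elements containing only the marker text.

```python
from html.parser import HTMLParser


class BetweenMarkers(HTMLParser):
    """Collect <a> texts that appear between two <p> marker paragraphs."""

    def __init__(self, start_text, end_text):
        super().__init__()
        self.start_text = start_text
        self.end_text = end_text
        self.current_tag = None  # tag whose text is currently being read
        self.inside = False      # True once the start marker has been seen
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between tags
        if self.current_tag == "p":
            if text == self.start_text:
                self.inside = True
            elif text == self.end_text:
                self.inside = False
        elif self.current_tag == "a" and self.inside:
            self.names.append(text)


html_text = """<p>Current</p>
<a href="some link to somewhere 1">File1</a>
<br>
<a href="some link to somewhere 2">File2</a>
<br>
<a href="some link to somewhere 3">File3</a>
<br>
<p>Archive</p>
<a href="Some another link to another file">Some another file</a>"""

parser = BetweenMarkers("Current", "Archive")
parser.feed(html_text)
parser.close()

print(parser.names)
```

For the sample markup this prints ['File1', 'File2', 'File3']. The parser tracks only the most recently opened tag, which is enough for flat markup like this but not for nested elements.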
