【Question Title】: Best way to extract mentions from text with links, such as a Wikipedia dump
【Posted】: 2020-12-06 09:51:03
【Question】:

I have data from a Wikipedia dump that looks like this:

' The atomic number or proton number (symbol Z) of a <a href="chemical%20element">chemical element</a> is the number of <a href="proton">proton</a>s found in the <a href="atomic%20nucleus">nucleus</a> of an <a href="atom">atom</a>.'  
' It is identical to the <a href="charge%20number">charge number</a> of the nucleus.',  
' The atomic number uniquely identifies a chemical element.'  
' In an <a href="electric%20charge">uncharged</a> atom, the atomic number is also equal to the number of <a href="electron">electron</a>s.'

I want to extract the mentions (the text spans that carry a hyperlink) from these sentences. The expected output is:

["chemical element", "proton", "nucleus", "atom"]  
["charge number"],  
[] 
["uncharged", "electron"]

What is the best way to extract this kind of information from text? Thanks.

【Comments】:

Tags: python-3.x


【Solution 1】:

Since you are dealing with HTML, you could try running the Beautiful Soup library over the dump.
The code would look like this:

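from bs4 import BeautifulSoup

# Put one line from the dump here (this one is taken from the question)
your_string = ' It is identical to the <a href="charge%20number">charge number</a> of the nucleus.'

soup = BeautifulSoup(your_string, "html.parser")

# Retrieve all of the anchor tags from the parsed markup
tags = soup.find_all('a')
for tag in tags:
    # get_text() returns the visible link text, even when the anchor has nested tags
    print('Contents:{}\n'.format(tag.get_text()))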
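
The loop above prints each link text, while the question asks for one list per sentence. Here is a minimal sketch of that shape, swapping in the standard library's `html.parser` for Beautiful Soup so that it runs without extra installs. The names `MentionParser` and `extract_mentions` are made up for illustration, and the sample sentences are copied from the question.

```python
from html.parser import HTMLParser


class MentionParser(HTMLParser):
    """Collects the visible text of every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.mentions = []   # finished link texts
        self._parts = None   # text pieces of the anchor being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears while an anchor is open
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._parts is not None:
            self.mentions.append("".join(self._parts))
            self._parts = None


def extract_mentions(sentence):
    """Return the list of link texts found in one sentence."""
    parser = MentionParser()
    parser.feed(sentence)
    parser.close()
    return parser.mentions


sentences = [
    ' It is identical to the <a href="charge%20number">charge number</a> of the nucleus.',
    ' The atomic number uniquely identifies a chemical element.',
    ' In an <a href="electric%20charge">uncharged</a> atom, the atomic number is also equal to the number of <a href="electron">electron</a>s.',
]

for sentence in sentences:
    print(extract_mentions(sentence))
# ['charge number']
# []
# ['uncharged', 'electron']
```

A sentence without links yields an empty list, which matches the third line of the expected output in the question.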

【Comments】:

    【Solution 2】:

    You should be able to capture these values with re.findall:

    import re

    vals = [
        ' The atomic number or proton number (symbol Z) of a <a href="chemical%20element">chemical element</a> is the number of <a href="proton">proton</a>s found in the <a href="atomic%20nucleus">nucleus</a> of an <a href="atom">atom</a>.',
        ' It is identical to the <a href="charge%20number">charge number</a> of the nucleus.',
        ' The atomic number uniquely identifies a chemical element.',
        ' In an <a href="electric%20charge">uncharged</a> atom, the atomic number is also equal to the number of <a href="electron">electron</a>s.',
    ]

    for val in vals:
        # Raw string so \w and \s reach the regex engine unchanged; "/" needs no escaping
        matches = re.findall(r'<a[^>]*>([\w\s]+)</a>', val)
        print(matches)
    
    

    See the pattern here
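
One caveat with the character class `[\w\s]+`: it only matches link text made of word characters and whitespace, so a mention containing a hyphen, an apostrophe, or an HTML entity is skipped silently. The sketch below shows a more permissive variant. It assumes anchors are not nested inside each other, `find_mentions` is a hypothetical helper name, and the sample sentence is invented for illustration rather than taken from the dump.

```python
import html
import re

# Non-greedy match of everything between an opening <a ...> tag and the next </a>
ANCHOR_RE = re.compile(r'<a\b[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)
# Matches any tag nested inside the link text, e.g. <i>...</i>
TAG_RE = re.compile(r'<[^>]+>')


def find_mentions(sentence):
    """Return link texts, including ones that contain punctuation or entities."""
    mentions = []
    for inner in ANCHOR_RE.findall(sentence):
        text = TAG_RE.sub('', inner)          # drop nested markup
        mentions.append(html.unescape(text))  # turn &amp; back into &
    return mentions


# Hypothetical sentence, not taken from the dump in the question
sample = 'An <a href="x-ray">X-ray</a> tube and <a href="AT%26T">AT&amp;T</a>.'

print(re.findall(r'<a[^>]*>([\w\s]+)</a>', sample))  # []
print(find_mentions(sample))                          # ['X-ray', 'AT&T']
```

Regular expressions remain a shortcut here. For markup less regular than these dump lines, an HTML parser is the safer choice.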

    【Comments】:
