【Question Title】:Problems while transforming the following xml elements into a pandas dataframe?
【Posted】:2019-01-26 02:56:21
【Question Description】:

I am using Beautiful Soup to parse a bunch of xml files and extract some information from them, like this:

import os
from glob import glob

from bs4 import BeautifulSoup

a_lis = []
for filepath in glob(os.path.join('../data/trainingFiles/', '*.xml')):
    with open(filepath) as f:
        content = f.read()
        results = BeautifulSoup(content, 'lxml')
        #print(results)
        for LabelInteractions in results.find_all("labelinteractions"):
            #print(LabelInteractions)
            for labelinteractions in LabelInteractions.findAll('labelinteraction'):
                print(labelinteractions)

Output:

<labelinteraction precipitant="ritonavir" precipitantcode="N0000007423" type="Unspecified interaction"></labelinteraction>
<labelinteraction precipitant="gc stimulator" precipitantcode="NO MAP" type="Unspecified interaction"></labelinteraction>
....
<labelinteraction precipitant="riociguat" precipitantcode="N0000188995" type="Unspecified interaction"></labelinteraction>
<labelinteraction effect=" 25064002: Headache (finding)" precipitant="alcohol" precipitantcode="N0000007432" type="Pharmacodynamic interaction"></labelinteraction>

How can I convert these xml attributes into a pandas dataframe? The columns should look like this:

precipitant  precipitantcode type effect

【Question Comments】:

    Tags: python python-3.x pandas beautifulsoup lxml


    【Solution 1】:

    You can collect each column's values in a list, then create the dataframe:

    from collections import defaultdict
    
    from bs4 import BeautifulSoup
    import pandas as pd
    
    soup = BeautifulSoup("""
    <labelinteraction precipitant="ritonavir" precipitantcode="N0000007423" type="Unspecified interaction"></labelinteraction>
    <labelinteraction precipitant="gc stimulator" precipitantcode="NO MAP" type="Unspecified interaction"></labelinteraction>
    <LabelInteraction type="Pharmacodynamic interaction" precipitant="alcohol" precipitantCode="N0000007432" effect=" 25064002: Headache (finding)"/>
    """, 'lxml')  # name the parser explicitly; 'lxml' lowercases tag and attribute names
    
    columns = ['precipitant', 'precipitantcode', 'type', 'effect']
    d = defaultdict(list)
    
    for labelinteraction in soup.find_all('labelinteraction'):
        for col in columns:
            # .get() returns None when the attribute is missing
            d[col].append(labelinteraction.get(col))
    
    df = pd.DataFrame(d)
    

    Output:

         precipitant precipitantcode                         type                         effect
    0      ritonavir     N0000007423      Unspecified interaction                           None
    1  gc stimulator          NO MAP      Unspecified interaction                           None
    2        alcohol     N0000007432  Pharmacodynamic interaction   25064002: Headache (finding)
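
A variation on the same idea, offered only as a sketch: collect one dict per element from its `attrs` and let pandas align the columns, so an element without `effect` ends up with NaN in that column. The sample string and the use of the built-in `html.parser` (instead of `lxml`) are assumptions made to keep the example self-contained.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Sample data copied from the question (assumption: two elements are enough to illustrate).
SAMPLE = """
<labelinteraction precipitant="ritonavir" precipitantcode="N0000007423" type="Unspecified interaction"></labelinteraction>
<labelinteraction effect=" 25064002: Headache (finding)" precipitant="alcohol" precipitantcode="N0000007432" type="Pharmacodynamic interaction"></labelinteraction>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Each tag's .attrs is a dict mapping attribute name to value.
rows = [dict(tag.attrs) for tag in soup.find_all("labelinteraction")]

# Fixing the column order here also fills attributes absent from a row with NaN.
columns = ["precipitant", "precipitantcode", "type", "effect"]
df = pd.DataFrame(rows, columns=columns)
print(df)
```

Building from a list of dicts avoids keeping several parallel lists in sync, which is where a missing attribute would otherwise shift values between rows.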
    

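    【Comments】:

    • Yes, I had actually tried defining lists and merging everything into a pandas dataframe. But I think this is a more Pythonic way... Thanks!
    • I found a problem... I realized that some elements have an effect attribute. For example: <LabelInteraction type="Pharmacodynamic interaction" precipitant="alcohol" precipitantCode="N0000007432" effect=" 25064002: Headache (finding)"/> Is there a way to put NaN when an element has no effect attribute?
    • Thanks... Actually I just realized that, because of this, using lists was the wrong approach... because when I try to get the effect value, BS does not find it in some cases...
    【Solution 2】: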

    If you have a list of the columns you want:

    cols = ['precipitant', 'precipitantcode', 'type']
    

    Then you can iterate over them and append to lists stored in a dictionary:

    d = {}
    for labelinteractions in LabelInteractions.find_all('labelinteraction'):
        for c in cols:
            # setdefault creates the list on first use; .get() yields None
            # instead of raising KeyError when an attribute is missing
            d.setdefault(c, []).append(labelinteractions.get(c))
    

    Once that is done, you can build the DataFrame:

    df = pd.DataFrame(d)
    

    This is what I get from your sample:

         precipitant precipitantcode                         type
    0      ritonavir     N0000007423      Unspecified interaction
    1  gc stimulator          NO MAP      Unspecified interaction
    2      riociguat     N0000188995      Unspecified interaction
    3        alcohol     N0000007432  Pharmacodynamic interaction
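
Putting the question's file loop and the column list together might look like the sketch below. `interactions_to_frame` and the `source_file` column are hypothetical names introduced for illustration, and `html.parser` is assumed in place of `lxml` so that only `bs4` and `pandas` are required; both parsers lowercase tag and attribute names.

```python
import os
from glob import glob

from bs4 import BeautifulSoup
import pandas as pd

COLUMNS = ["precipitant", "precipitantcode", "type", "effect"]


def interactions_to_frame(directory, columns=COLUMNS):
    """Hypothetical helper: one row per <labelinteraction> across all *.xml files."""
    rows = []
    for filepath in sorted(glob(os.path.join(directory, "*.xml"))):
        with open(filepath, encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "html.parser")
        for tag in soup.find_all("labelinteraction"):
            # .get() returns None for attributes the element does not carry.
            row = {col: tag.get(col) for col in columns}
            # Extra column (an assumption) recording which file the row came from.
            row["source_file"] = os.path.basename(filepath)
            rows.append(row)
    return pd.DataFrame(rows, columns=[*columns, "source_file"])
```

Calling `interactions_to_frame('../data/trainingFiles/')` would then return a single dataframe for the whole directory, with missing attributes left empty rather than raising an error.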
    

    【Comments】:
