【Question title】: How to parse potentially malformed XML into a dataframe?
【Posted】: 2019-01-21 10:46:09
【Question】:

I have XML that comes from an API.

import requests
import pandas as pd
import lxml.etree as et
from lxml import etree


url = 'abc.com'

xml_data1 = requests.get(url).content
print(xml_data1)

xml_data1:

    <?xml version="1.0" encoding="utf-8"?>
    <Leads>
      <Lead Id="123" LeadTitle="test, test.,  , (123) 456-7890, " CreateDate="01/01/2017 11:11:11" ModifyDate="01/04/2017 03:03:03" ACount="1" LCount="4" RCount="0" ROnly="false" Flagged="false" LastDistributionDate="01/01/2017 10:10:10" LeadFormType="test test">
    <Campaign CampaignId="123" CampaignTitle="abc" />
    <Status StatusId="123" StatusTitle="test" />
    <Agent AgentId="123" AgentName="test, test" AgentEmail="a@a.com">
      <AgentCustomFields custom1="test test, test" custom2="test" custom3="" custom4="" />
    </Agent>
    <Fields>
      <Field FieldId="7" Value="a@a.com" FieldTitle="test" FieldType="test" />
      <Field FieldId="8" Value="test" FieldTitle="test 1" FieldType="test" />
      <Field FieldId="9" Value="test" FieldTitle="City" FieldType="Text" />
      <Field FieldId="10" Value="test" FieldTitle="State" FieldType="State" />
      <Field FieldId="11" Value="test" FieldTitle="test" FieldType="Zip" />
      <Field FieldId="950" Value="test." FieldTitle="Business Name" FieldType="Text" />
      <Field FieldId="1261" Value="Intuit Desktop" FieldTitle="test" FieldType="Text" />
      <Field FieldId="1262" Value="test" FieldTitle="test" FieldType="Text" />
      <Field FieldId="1263" Value="test" FieldTitle="test" FieldType="Number" />
      <Field FieldId="1267" Value="test" FieldTitle="test" FieldType="Text" />
      <Field FieldId="1310" Value="test" FieldTitle="test" FieldType="Phone" />
      <Field FieldId="1319" Value="test" FieldTitle="test" FieldType="Number" />
      <Field FieldId="1485" Value="test" FieldTitle="tst" FieldType="State" />
    </Fields>
    <Logs>
      <StatusLog>
        <Status LogId="123" LogDate="01/04/2017 03:08:44" StatusId="28" StatusTitle="test" AgentId="19" AgentName="test" AgentEmail="test@test.com" />
      </StatusLog>
      <ActionLog>
        <Action LogId="123" ActionTypeId="73" ActionTypeName="test" MilestoneId="1" ActionDate="01/04/2017 03:08:44" ActionNote="test" AgentId="19" AgentName="test,test" AgentEmail="test@test.com" />
      </ActionLog>
      <EmailLog>
        <Email LogId="123" SendDate="01/01/2017 20:53:39" EmailTemplateId="1" EmailTemplateName="test " AgentId="1" AgentName="test" AgentEmail="test@test.com" />
      </EmailLog>
      <DistributionLog>
        <Distribution LogId="1" LogDate="01/01/2017 10:10:08" DistributionProgramId="1" DistributionProgramName="test" AssignedAgentId="1" AssignedAgentName="test,test" AssignedAgentEmail="test@test.com" />
      </DistributionLog>
      <CreationLog LogId="1" LogDate="01/01/2017 10:10:05" Imported="true" CreatedByAgentId="1" CreatedByAgentName="test, test" CreatedByAgentEmail="test@test.com" />
    </Logs>
  </Lead>
</Leads>

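Due to work concerns I can't post the entire XML string, but it follows the structure above. According to an XML validator the XML is valid. However, when I make another API call, it returns a different XML string that looks like this: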

<?xml version="1.0" encoding="utf-8"?>\r\n<Leads>\r\n  <Lead Id="123" />\r\n  <Lead Id="456" />\r\n</Leads>'

I can successfully load the XML above into a dataframe with the following code:

class XML2DataFrame:

    def __init__(self, xml_data):
        self.root = ET.XML(xml_data)

    def parse_root(self, root):
        """Return a list of dictionaries from the text
         and attributes of the children under this XML root."""
        return [self.parse_element(child) for child in iter(root)]

    def parse_element(self, element, parsed=None):
        """ Collect {key:attribute} and {tag:text} from this XML
         element and all its children into a single dictionary of strings."""
        if parsed is None:
            parsed = dict()

        for key in element.keys():
            if key not in parsed:
                parsed[key] = element.attrib.get(key)
            else:
                raise ValueError('duplicate attribute {0} at element {1}'.format(key, element.getroottree().getpath(element)))           


        """ Apply recursion"""
        for child in list(element):
            self.parse_element(child, parsed)

        return parsed

    def process_data(self):
        """ Initiate the root XML, parse it, and return a dataframe"""
        structure_data = self.parse_root(self.root)
        return pd.DataFrame(structure_data)

xml2df = XML2DataFrame(xml_data)
xml_dataframe = xml2df.process_data()

However, when I pass the potentially malformed XML string to the code above, I get this error:

AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'getroottree'

Since the potentially malformed XML has multiple values in the same tag, I think the function can't parse it.
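
A note on the traceback: `getroottree()` is an lxml method, and the error names `xml.etree.ElementTree.Element`, which suggests `ET` was bound to the standard-library module rather than to `lxml.etree`. If so, the `AttributeError` is raised while building the "duplicate attribute" message, which means a repeated attribute name was found among the nested elements. The sketch below illustrates both points under that assumption; the sample XML and the helper `find_duplicate_keys` are hypothetical and not part of the original code.

```python
import xml.etree.ElementTree as ET  # standard library, not lxml

# Hypothetical stand-in for the API response: two children share StatusId.
xml_data = (
    '<Leads><Lead Id="1">'
    '<Status StatusId="2" /><Log StatusId="3" />'
    '</Lead></Leads>'
)
lead = ET.XML(xml_data)[0]

# Standard-library elements do not provide getroottree(); lxml elements do.
has_getroottree = hasattr(lead, "getroottree")


def find_duplicate_keys(element, seen=None, duplicates=None):
    """Collect attribute names that occur more than once under `element`."""
    seen = set() if seen is None else seen
    duplicates = [] if duplicates is None else duplicates
    for key in element.keys():
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    for child in element:
        find_duplicate_keys(child, seen, duplicates)
    return duplicates


duplicate_keys = find_duplicate_keys(lead)
print(has_getroottree, duplicate_keys)
```

With `lxml.etree` bound to `ET` instead, the same input would presumably reach the intended `ValueError`, so the flattening step would still need a way to deal with repeated attribute names.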

I would like to push the potentially malformed XML into a flat dataframe.

Edit: column headers of the output row, taken from the XML:

 ActionCount           CreateDate Flagged      Id LastDistributionDate  LeadFormType                                   LeadTitle LogCount FieldId                 FieldTitle FieldType                          Value CampaignId  CampaignTitle  AgentEmail AgentId     AgentName              LogDate   LogId  StatusId       StatusTitle AssignedAgentEmail AssignedAgentId AssignedAgentName DistributionProgramId DistributionProgramName              LogDate   LogId  

【Comments】:

    Tags: python xml python-3.x pandas


    【Answer 1】:

    I think you'll find BeautifulSoup much easier for XML/HTML parsing. It also copes well with malformed XML and HTML.

    pip install beautifulsoup4 lxml

    Here is how to parse the XML you provided with BeautifulSoup.

    from bs4 import BeautifulSoup 
    import pandas as pd
    
    xml = """
    <?xml version="1.0" encoding="utf-8"?>
    <Leads>
        <Lead Id="123" LeadTitle="test, test.,  , (123) 456-7890, " CreateDate="01/01/2017 11:11:11" ModifyDate="01/04/2017 03:03:03" ACount="1" LCount="4" RCount="0" ROnly="false" Flagged="false" LastDistributionDate="01/01/2017 10:10:10" LeadFormType="test test"></Lead>
        <Lead Id="123" />
        <Lead Id="456" />
    </Leads>
    """
    
    soup = BeautifulSoup(xml, "xml")  # the "xml" parser requires lxml
    leads = soup.find_all('Lead')
    # Each tag's .attrs is a dict of its attributes; one dict becomes one row
    lead_list = [lead.attrs for lead in leads]
    
    df = pd.DataFrame(lead_list)
    df
    

    Output:

           ACount           CreateDate  Flagged   Id  LCount  LastDistributionDate  LeadFormType                       LeadTitle           ModifyDate  RCount  ROnly
    0       1  01/01/2017 11:11:11    false  123       4   01/01/2017 10:10:10     test test  test, test., , (123) 456-7890,  01/04/2017 03:03:03       0  false
    1     NaN                  NaN      NaN  123     NaN                   NaN           NaN                             NaN                  NaN     NaN    NaN
    2     NaN                  NaN      NaN  456     NaN                   NaN           NaN                             NaN                  NaN     NaN    NaN
    

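    【Comments】:

    • Thanks for your answer, it works great for the leads. However, I've now added the rest of the XML, which I haven't sorted out yet. How do I adjust your code to take in the entire XML string I posted?
    • Well, it's basically the same principle. Lead is a node in the XML DOM, and soup.findAll() can be used to query all nodes in the XML. lead.attrs is a dictionary representation of all of the node's attributes. However, what I showed you above is how to turn each Lead node into a row. Mixing rows and nodes may be more challenging. If you show me the rest of your XML structure, I may be able to help.
    • I posted the remaining XML below the earlier example. That's all of it. With your approach I could go into each element and write it out as a dataframe.
    • Could you provide an example of what the output row for that XML should look like?
    • I added the output row. It's every name in the XML that comes before a value with an equals sign. Hopefully that makes sense.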
    【Answer 2】:

    Since you updated the question, I decided to post another answer using the new XML.

    from bs4 import BeautifulSoup 
    import pandas as pd
    
    xml = """
        <?xml version="1.0" encoding="utf-8"?>
        <Leads>
          <Lead Id="123" LeadTitle="test, test.,  , (123) 456-7890, " CreateDate="01/01/2017 11:11:11" ModifyDate="01/04/2017 03:03:03" ACount="1" LCount="4" RCount="0" ROnly="false" Flagged="false" LastDistributionDate="01/01/2017 10:10:10" LeadFormType="test test">
        <Campaign CampaignId="123" CampaignTitle="abc" />
        <Status StatusId="123" StatusTitle="test" />
        <Agent AgentId="123" AgentName="test, test" AgentEmail="a@a.com">
          <AgentCustomFields custom1="test test, test" custom2="test" custom3="" custom4="" />
        </Agent>
        <Fields>
          <Field FieldId="7" Value="a@a.com" FieldTitle="test" FieldType="test" />
          <Field FieldId="8" Value="test" FieldTitle="test 1" FieldType="test" />
          <Field FieldId="9" Value="test" FieldTitle="City" FieldType="Text" />
          <Field FieldId="10" Value="test" FieldTitle="State" FieldType="State" />
          <Field FieldId="11" Value="test" FieldTitle="test" FieldType="Zip" />
          <Field FieldId="950" Value="test." FieldTitle="Business Name" FieldType="Text" />
          <Field FieldId="1261" Value="Intuit Desktop" FieldTitle="test" FieldType="Text" />
          <Field FieldId="1262" Value="test" FieldTitle="test" FieldType="Text" />
          <Field FieldId="1263" Value="test" FieldTitle="test" FieldType="Number" />
          <Field FieldId="1267" Value="test" FieldTitle="test" FieldType="Text" />
          <Field FieldId="1310" Value="test" FieldTitle="test" FieldType="Phone" />
          <Field FieldId="1319" Value="test" FieldTitle="test" FieldType="Number" />
          <Field FieldId="1485" Value="test" FieldTitle="tst" FieldType="State" />
        </Fields>
        <Logs>
          <StatusLog>
            <Status LogId="123" LogDate="01/04/2017 03:08:44" StatusId="28" StatusTitle="test" AgentId="19" AgentName="test" AgentEmail="test@test.com" />
          </StatusLog>
          <ActionLog>
            <Action LogId="123" ActionTypeId="73" ActionTypeName="test" MilestoneId="1" ActionDate="01/04/2017 03:08:44" ActionNote="test" AgentId="19" AgentName="test,test" AgentEmail="test@test.com" />
          </ActionLog>
          <EmailLog>
            <Email LogId="123" SendDate="01/01/2017 20:53:39" EmailTemplateId="1" EmailTemplateName="test " AgentId="1" AgentName="test" AgentEmail="test@test.com" />
          </EmailLog>
          <DistributionLog>
            <Distribution LogId="1" LogDate="01/01/2017 10:10:08" DistributionProgramId="1" DistributionProgramName="test" AssignedAgentId="1" AssignedAgentName="test,test" AssignedAgentEmail="test@test.com" />
          </DistributionLog>
          <CreationLog LogId="1" LogDate="01/01/2017 10:10:05" Imported="true" CreatedByAgentId="1" CreatedByAgentName="test, test" CreatedByAgentEmail="test@test.com" />
        </Logs>
      </Lead>
    </Leads>
    """
    
    soup = BeautifulSoup(xml, "xml")
    # Get Attributes from all nodes
    attrs = []
    for elm in soup():  # soup() is equivalent to soup.find_all()
        attrs.append(elm.attrs)
    
    # Since you want the data in a dataframe, it makes sense for each field to be a new row consisting of all the other node attributes
    fields_attribute_list= [x for x in attrs if 'FieldId' in x.keys()]
    other_attribute_list = [x for x in attrs if 'FieldId' not in x.keys() and x != {}]
    
    # Make a single dictionary with the attributes of all nodes except for the `Field` nodes.
    attribute_dict = {}
    for d in other_attribute_list:
        for k, v in d.items():  
            attribute_dict.setdefault(k, v)
    
    # Update each field row with attributes from all other nodes.
    full_list = []
    for field in fields_attribute_list:
        field.update(attribute_dict)
        full_list.append(field)
    
    # Make Dataframe
    df = pd.DataFrame(full_list)
    

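    Note, however, that this approach does not keep every value when several nodes share an attribute name, such as LogId: setdefault keeps only the first value it encounters. Either way, this code should help you get started.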
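
One possible way around the shared-name problem is to prefix each attribute with the tag it came from, so that the `LogId` of `Status` and the `LogId` of `Distribution` end up in separate columns. The sketch below is only an illustration under stated assumptions: it swaps in the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so it runs without extra packages, it uses a shortened sample of the XML, and the helper name `flatten_lead` and the `Tag_Attribute` naming scheme are made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample following the structure in the question.
xml = """<Leads>
  <Lead Id="123">
    <Fields>
      <Field FieldId="7" Value="a@a.com" />
      <Field FieldId="8" Value="test" />
    </Fields>
    <Logs>
      <StatusLog><Status LogId="123" StatusId="28" /></StatusLog>
      <DistributionLog><Distribution LogId="1" /></DistributionLog>
    </Logs>
  </Lead>
</Leads>"""


def flatten_lead(lead):
    """Return one dict per Field, plus all other attributes as Tag_Attribute."""
    shared = {}
    fields = []
    for elem in lead.iter():  # the Lead itself, then every descendant
        if elem.tag == "Field":
            fields.append(dict(elem.attrib))
        else:
            for key, value in elem.attrib.items():
                # Prefix with the tag name so equal attribute names stay apart.
                shared.setdefault(f"{elem.tag}_{key}", value)
    return [{**shared, **field} for field in fields]


rows = [
    row
    for lead in ET.fromstring(xml).iter("Lead")
    for row in flatten_lead(lead)
]
print(rows[0])
```

The resulting `rows` list can be passed to `pd.DataFrame(rows)` just like `full_list` above. This still keeps only the first value when the same tag occurs more than once with the same attribute (for example the two `Status` elements in the full XML); a longer prefix, such as the parent tag as well, would be needed to separate those.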

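    【Comments】:

    • Thanks a lot. In the line field.update(super_dict), is super_dict supposed to be an empty dictionary? I'm getting the error "undefined name super_dict".
    • Oh, sorry. That was a mistake made while reformatting. Use attribute_dict instead of super_dict.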