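【Question title】: Parsing KML with Beautiful Soup
【Posted】: 2017-04-25 23:38:57
【Question】:

I'm having trouble parsing a KML file (XML) with Beautiful Soup. For my sample, the lxml parser returns 2 tables and the xml parser returns 0. The loop in this snippet should iterate a non-zero number of times, and the table count should be 3.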

from bs4 import BeautifulSoup

url = "sample.kml"

with open(url, 'r') as page:
    # "lxml" is Beautiful Soup's HTML parser, not its XML parser
    soup = BeautifulSoup(page, "lxml")

    tables = soup.find_all('table')
    print(len(tables))  # number of tables found in the whole file

    for table in tables:
        rows = table.find_all('tr')

        for row in rows:
            cols = row.find_all('td')

The first example script returns 2 tables with lxml (rather than 3) and 0 with the xml parser.

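from bs4 import BeautifulSoup

url = "sample.kml"

with open(url, 'r') as page:
    # Second attempt: parse as XML and start from the Placemark tags
    soup = BeautifulSoup(page, "xml")

    placemark = soup.find_all('Placemark')
    print(len(placemark))  # the Placemark count comes out right

    for place in placemark:
        tables = place.find_all('table')
        print(len(tables))  # but no tables are found inside any of them

        for table in tables:
            rows = table.find_all('tr')

            for row in rows:
                cols = row.find_all('td')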

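Walking the tree, I first searched for tables directly. len(tables) returned 2, which I know is wrong: there should be about 92,000. So I found another tag to start from (one that returned the correct count) and then tried to find the rows and columns inside each one, but they all returned zero, which surprised me. I experimented with different parsers and eventually decided that xml was the appropriate one, yet I still couldn't find the right number of tables, even though I can find them with re.search or by searching in Sublime Text. That led me to check whether the tables might be encapsulated in some way, but to no avail. I'm stumped and can't find a way to reach the 92,000 tables with find_all("TAG"). Any suggestions?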

KML sample

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document id="laaSECS" xsi:schemaLocation="http://www.opengis.net/kml/2.2 http://schemas.opengis.net/kml/2.2.0/ogckml22.xsd http://www.google.com/kml/ext/2.2 http://code.google.com/apis/kml/schema/kml22gx.xsd">
    <name>laaSECS</name>
    <Snippet maxLines="0"></Snippet>
    <Style id="PolyStyle00">
        <LabelStyle>
            <color>00000000</color>
            <scale>0</scale>
        </LabelStyle>
        <LineStyle>
            <color>ff7f5555</color>
            <width>0.2</width>
        </LineStyle>
        <PolyStyle>
            <color>ffc5d9fa</color>
            <fill>0</fill>
        </PolyStyle>
    </Style>
    <Style id="PolyStyle000">
        <LabelStyle>
            <color>00000000</color>
            <scale>0</scale>
        </LabelStyle>
        <LineStyle>
            <color>ff7f5555</color>
            <width>0.2</width>
        </LineStyle>
        <PolyStyle>
            <color>ffc5d9fa</color>
            <fill>0</fill>
        </PolyStyle>
    </Style>
    <StyleMap id="PolyStyle001">
        <Pair>
            <key>normal</key>
            <styleUrl>#PolyStyle00</styleUrl>
        </Pair>
        <Pair>
            <key>highlight</key>
            <styleUrl>#PolyStyle000</styleUrl>
        </Pair>
    </StyleMap>
    <Folder id="FeatureLayer0">
        <name>laaSECS</name>
        <Snippet maxLines="0"></Snippet>
        <Placemark id="ID_00000">
            <name>AL</name>
            <Snippet maxLines="0"></Snippet>
            <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt">

<head>

<META http-equiv="Content-Type" content="text/html">

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

</head>

<body style="margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;">

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px">

<tr style="text-align:center;font-weight:bold;background:#9CBCE2">

<td>AL</td>

</tr>

<tr>

<td>

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px">

<tr>

<td>FID</td>

<td>0</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>STATE</td>

<td>AL</td>

</tr>

<tr>

<td>MER</td>

<td>25</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>TWP</td>

<td>22</td>

</tr>

<tr>

<td>TDIR</td>

<td>N</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>RNG</td>

<td>4</td>

</tr>

<tr>

<td>RDIR</td>

<td>W</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>SEC</td>

<td>24</td>

</tr>

<tr>

<td>MODDATE</td>

<td>20050311</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>DATUM</td>

<td>NAD27</td>

</tr>

<tr>

<td>SOURCE</td>

<td>WhiteStar</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>MTR</td>

<td>25 22.0N  4.0W</td>

</tr>

</table>

</td>

</tr>

</table>

</body>

</html>]]></description>
            <styleUrl>#PolyStyle001</styleUrl>
            <MultiGeometry>
                <Polygon>
                    <outerBoundaryIs>
                        <LinearRing>
                            <coordinates>
                                -88.35570867858526,32.86011073571817,0 -88.35570870147141,32.86253443065814,0 -88.35597594524225,32.86011537400984,0 -88.35570867858526,32.86011073571817,0 
                            </coordinates>
                        </LinearRing>
                    </outerBoundaryIs>
                </Polygon>
            </MultiGeometry>
        </Placemark>
        <Placemark id="ID_00001">
            <name>AL</name>
            <Snippet maxLines="0"></Snippet>
            <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt">

<head>

<META http-equiv="Content-Type" content="text/html">

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

</head>

<body style="margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;">

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px">

<tr style="text-align:center;font-weight:bold;background:#9CBCE2">

<td>AL</td>

</tr>

<tr>

<td>

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px">

<tr>

<td>FID</td>

<td>1</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>STATE</td>

<td>AL</td>

</tr>

<tr>

<td>MER</td>

<td>25</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>TWP</td>

<td>22</td>

</tr>

<tr>

<td>TDIR</td>

<td>N</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>RNG</td>

<td>4</td>

</tr>

<tr>

<td>RDIR</td>

<td>W</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>SEC</td>

<td>25</td>

</tr>

<tr>

<td>MODDATE</td>

<td>20050311</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>DATUM</td>

<td>NAD27</td>

</tr>

<tr>

<td>SOURCE</td>

<td>WhiteStar</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>MTR</td>

<td>25 22.0N  4.0W</td>

</tr>

</table>

</td>

</tr>

</table>

</body>

</html>]]></description>
            <styleUrl>#PolyStyle001</styleUrl>
            <MultiGeometry>
                <Polygon>
                    <outerBoundaryIs>
                        <LinearRing>
                            <coordinates>
                                -88.35597594524225,32.86011537400984,0 -88.3567389068841,32.85292852502473,0 -88.35768486975799,32.84508568993779,0 -88.35570853700197,32.84511675513796,0 -88.35570867858526,32.86011073571817,0 -88.35597594524225,32.86011537400984,0 
                            </coordinates>
                        </LinearRing>
                    </outerBoundaryIs>
                </Polygon>
            </MultiGeometry>
        </Placemark>
        <Placemark id="ID_00002">
            <name>AL</name>
            <Snippet maxLines="0"></Snippet>
            <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt">

<head>

<META http-equiv="Content-Type" content="text/html">

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

</head>

<body style="margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;">

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px">

<tr style="text-align:center;font-weight:bold;background:#9CBCE2">

<td>AL</td>

</tr>

<tr>

<td>

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px">

<tr>

<td>FID</td>

<td>2</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>STATE</td>

<td>AL</td>

</tr>

<tr>

<td>MER</td>

<td>25</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>TWP</td>

<td>22</td>

</tr>

<tr>

<td>TDIR</td>

<td>N</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>RNG</td>

<td>4</td>

</tr>

<tr>

<td>RDIR</td>

<td>W</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>SEC</td>

<td>36</td>

</tr>

<tr>

<td>MODDATE</td>

<td>20050311</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>DATUM</td>

<td>NAD27</td>

</tr>

<tr>

<td>SOURCE</td>

<td>WhiteStar</td>

</tr>

<tr bgcolor="#D4E4F3">

<td>MTR</td>

<td>25 22.0N  4.0W</td>

</tr>

</table>

</td>

</tr>

</table>

</body>

</html>]]></description>
            <styleUrl>#PolyStyle001</styleUrl>
            <MultiGeometry>
                <Polygon>
                    <outerBoundaryIs>
                        <LinearRing>
                            <coordinates>
                                -88.35768486975799,32.84508568993779,0 -88.35843183642189,32.83843382961495,0 -88.35914980106479,32.83165897171819,0 -88.35908878782671,32.83049899571662,0 -88.35570839957039,32.83056244880483,0 -88.35570853700197,32.84511675513796,0 -88.35768486975799,32.84508568993779,0 
                            </coordinates>
                        </LinearRing>
                    </outerBoundaryIs>
                </Polygon>
            </MultiGeometry>
        </Placemark>
        <Placemark id="ID_00003">
            <name>AL</name>
            <Snippet maxLines="0"></Snippet>
            <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt">

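Link to the original file: KML FILE

【Comments on the question】:

  • Please provide a small XML sample that illustrates the problem. A 175 MB file is not a minimal reproducible example!
  • OK @miken32, I've cut it down to 500 lines.

Tags: python xml beautifulsoup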


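【Solution 1】:

The root of the problem is that you have an XML document with HTML documents nested inside it. Trying to parse the whole thing with an HTML parser won't work, because the HTML documents appear to be stored inside the description tags. So while the file is valid XML, it isn't even remotely valid HTML.

To solve this, I parse the whole document as XML, extract each HTML section (as a string), and then parse that section as HTML. Note that, somewhat confusingly, lxml is the HTML parser and lxml-xml is the XML parser.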

from bs4 import BeautifulSoup as Soup

with open('sample.kml') as data:
    kml_soup = Soup(data, 'lxml-xml') # Parse as XML

descriptions = kml_soup.find_all('description')
for description in descriptions:
    html_soup = Soup(description.text, 'lxml') # Parse as HTML
    tables = html_soup.find_all('table')
    print(len(tables))
    for table in tables:
        rows = table.find_all('tr')

        for row in rows:
            cols = row.find_all('td')
            ...

For the sample you provided, there are six tables. The code above prints "2" three times, so all 6 are found.
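The loop body above is left open (`...`). As one possible way to finish the job, the sketch below turns each Placemark's inner two-column table into a field/value dictionary. It deliberately swaps Beautiful Soup for the standard library (`xml.etree.ElementTree` for the KML, `html.parser` for the embedded HTML) so that it runs without extra installs; the same two-stage idea applies. `SAMPLE_KML`, `RowCollector`, and `placemark_fields` are hypothetical names introduced for illustration, and the sample is a cut-down version of the question's data.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# KML elements live in this default namespace (taken from the sample file).
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Hypothetical, shortened sample in the same shape as the question's KML.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
  <Placemark id="ID_00000">
    <description><![CDATA[<html><body>
<table><tr><td>AL</td></tr><tr><td>
<table>
<tr><td>FID</td><td>0</td></tr>
<tr><td>STATE</td><td>AL</td></tr>
<tr><td>SEC</td><td>24</td></tr>
</table>
</td></tr></table>
</body></html>]]></description>
  </Placemark>
</Document>
</kml>"""


class RowCollector(HTMLParser):
    """Collect the cell texts of every <tr>, including rows of nested tables."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._open_rows = []  # stack of rows whose <tr> is still open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._open_rows.append([])
        elif tag == "td" and self._open_rows:
            self._open_rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._open_rows:
            cells = self._open_rows.pop()
            self.rows.append([cell.strip() for cell in cells])

    def handle_data(self, data):
        # Text is appended to the most recent cell of the innermost open row.
        if self._open_rows and self._open_rows[-1]:
            self._open_rows[-1][-1] += data


def placemark_fields(kml_text):
    """Map each Placemark id to the {field: value} pairs of its inner table."""
    root = ET.fromstring(kml_text)  # stage 1: parse the KML as XML
    result = {}
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        description = placemark.find("kml:description", KML_NS)
        if description is None:
            continue
        collector = RowCollector()
        collector.feed(description.text or "")  # stage 2: parse the HTML text
        collector.close()
        # Keep only the two-cell rows, which hold the field/value pairs.
        result[placemark.get("id")] = {
            row[0]: row[1] for row in collector.rows if len(row) == 2
        }
    return result


if __name__ == "__main__":
    print(placemark_fields(SAMPLE_KML))
```

Filtering on two-cell rows is an assumption based on the sample's layout: the outer table's rows have a single cell, so only the inner field/value rows survive.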

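【Comments】:

  • Wow, thank you. What tipped you off to this nested structure? I don't think I'd ever have figured it out, @supersam654
  • @TylerCowan You can see that the HTML is stored in CDATA sections, so as far as XML is concerned it's just text. That's why the HTML needs to be parsed separately.
  • @miken32 Thanks, that helps
  • For those wondering about lxml vs. lxml-xml, see the "Installing a parser" section of the documentation: crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser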
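
The point made in the comments, that a CDATA section is plain text to an XML parser, can be checked with a few lines. This is a minimal sketch using the standard library's `xml.etree.ElementTree` rather than Beautiful Soup; the one-Placemark `snippet` is a made-up, namespace-free reduction of the question's sample.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-Placemark reduction of the KML sample (no namespace).
snippet = (
    '<Placemark id="ID_00000">'
    "<description><![CDATA[<table><tr><td>FID</td><td>0</td></tr></table>]]>"
    "</description>"
    "</Placemark>"
)

placemark = ET.fromstring(snippet)
description = placemark.find("description")

# The XML parser sees no child elements under <description> ...
child_count = len(description)
# ... so searching the XML tree for <table> finds nothing ...
tables_seen_by_xml = placemark.findall(".//table")
# ... because the markup is only available as the element's text.
embedded_html = description.text

print(child_count)
print(len(tables_seen_by_xml))
print(embedded_html)
```

This is why searching the XML tree for table tags comes back empty, while a plain text search such as re.search still finds them.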