【Question title】: Convert HTML (unordered) list to nested Python dictionary
【Posted】: 2019-10-19 08:49:17
【Question description】:

Suppose I have a nested HTML (unordered) list like the following:

<ul style="">
  <li class="jstree-last jstree-open" id="wfo-7000000004">
    <ins class="jstree-icon"> </ins>
    <a class="" href="taxon/wfo-7000000004">
      <ins class="jstree-icon"> </ins>
      Acoraceae
    </a>
    <ul style="">
      <li class="jstree-last jstree-open" id="wfo-4000000350">
        <ins class="jstree-icon"> </ins>
        <a class="" href="taxon/wfo-4000000350">
          <ins class="jstree-icon"> </ins>
          Acorus
        </a>
        <ul style="">
          <li class="jstree-open" id="wfo-0000350733">
            <ins class="jstree-icon"> </ins>
            <a class="" href="taxon/wfo-0000350733">
              <ins class="jstree-icon"> </ins>
              Acorus calamus
            </a>
            <ul style="">
              <li class="jstree-leaf" id="wfo-0000350841">
                <ins class="jstree-icon"> </ins>
                <a class="" href="taxon/wfo-0000350841">
                  <ins class="jstree-icon"> </ins>
                  Acorus calamus var. americanus
                </a>
              </li>
              <li class="jstree-last jstree-leaf" id="wfo-0000350949">
                <ins class="jstree-icon"> </ins>
                <a class="" href="taxon/wfo-0000350949">
                  <ins class="jstree-icon"> </ins>
                  Acorus calamus var. angustatus
                </a>
              </li>
            </ul>
          </li>
          <li class="jstree-last jstree-leaf" id="wfo-0000352676">
            <ins class="jstree-icon"> </ins>
            <a class="" href="taxon/wfo-0000352676">
              <ins class="jstree-icon"> </ins>
              Acorus gramineus
            </a>
          </li>
        </ul>
      </li>
    </ul>
  </li>
</ul>

How can I build a nested dictionary from it in Python? For example:

{
  Acorales: {
    Acoraceae: {
      Acorus: {
        Acorus calamus: [
          Acorus calamus var. americanus,
          Acorus calamus var. angustatus
        ],
        Acorus gramineus
      }
    }
  }
}

I think a library such as Beautiful Soup or HTML Parser can do this (with a for loop in Python), but I can't figure it out. Thanks for your help!

I tried this:

def create_dic(soup):
    return {li.a.get_text().replace("\xa0", ""): create_dic(li)
            for ul in soup('ul', recursive=False)
            for li in ul('li', recursive=False)}

However, the output looks like this (Acorus calamus var. americanus and Acorus calamus var. angustatus should be in a list, and Acorus gramineus should not be a dictionary):

{'Acorales': {'Acoraceae': {'Acorus': {'Acorus calamus': {'Acorus calamus var. americanus': {},
                                                          'Acorus calamus var. angustatus': {}},
                                       'Acorus gramineus': {}}}}}

【Comments】:

Tags: python web-scraping beautifulsoup html-parsing


【Solution 1】:

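I'm posting an answer because, for the answers to "Parsing nested HTML list with BeautifulSoup" to work, you first have to call BeautifulSoup to parse the <ul> elements in your HTML. I have also flagged the question as a duplicate, so if it is one, just close or delete it.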

from bs4 import BeautifulSoup

htmlbody = '''
<ul style="">
  <li class="jstree-last jstree-open" id="wfo-7000000004">
    <ins class="jstree-icon"> </ins>
    <a class="" href="taxon/wfo-7000000004">
      <ins class="jstree-icon"> </ins>
      Acoraceae
    </a>
    <ul style="">
      <li class="jstree-last jstree-open" id="wfo-4000000350">
        <ins class="jstree-icon"> </ins>
        <a class="" href="taxon/wfo-4000000350">
          <ins class="jstree-icon"> </ins>
          Acorus
        </a>
        <ul style="">
          <li class="jstree-open" id="wfo-0000350733">
            <ins class="jstree-icon"> </ins>
            <a class="" href="taxon/wfo-0000350733">
              <ins class="jstree-icon"> </ins>
              Acorus calamus
            </a>
            <ul style="">
              <li class="jstree-leaf" id="wfo-0000350841">
                <ins class="jstree-icon"> </ins>
                <a class="" href="taxon/wfo-0000350841">
                  <ins class="jstree-icon"> </ins>
                  Acorus calamus var. americanus
                </a>
              </li>
              <li class="jstree-last jstree-leaf" id="wfo-0000350949">
                <ins class="jstree-icon"> </ins>
                <a class="" href="taxon/wfo-0000350949">
                  <ins class="jstree-icon"> </ins>
                  Acorus calamus var. angustatus
                </a>
              </li>
            </ul>
          </li>
          <li class="jstree-last jstree-leaf" id="wfo-0000352676">
            <ins class="jstree-icon"> </ins>
            <a class="" href="taxon/wfo-0000352676">
              <ins class="jstree-icon"> </ins>
              Acorus gramineus
            </a>
          </li>
        </ul>
      </li>
    </ul>
  </li>
</ul>

'''

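def ul_to_dict(ul):
    """Recursively convert a <ul> Tag into a nested dict keyed by item text."""
    result = {}
    for li in ul.find_all("li", recursive=False):
        # The first non-empty string inside the <li> is the link text.
        key = next(li.stripped_strings)
        # Use a separate name so the `ul` parameter is not shadowed.
        child_ul = li.find("ul")
        if child_ul:
            result[key] = ul_to_dict(child_ul)
        else:
            result[key] = None
    return result

# Let BeautifulSoup do its magic and parse the first <ul> from the HTML.
# Naming the parser explicitly avoids the "no parser specified" warning.
root_ul = BeautifulSoup(htmlbody, "html.parser").ul
# Run our function and print the nested dictionary.
print(ul_to_dict(root_ul))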
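
The function above maps every leaf to None, whereas the question asks for a list wherever all items of a level are leaves. A minimal sketch of that variant follows. The name `ul_to_tree` and the shortened `sample` markup are hypothetical, introduced only for illustration. A level that mixes leaves and branches (such as Acorus gramineus next to Acorus calamus) cannot be a bare entry in a Python dict, so this sketch assumes it is acceptable to keep such leaves as keys mapped to None.

```python
from bs4 import BeautifulSoup


def ul_to_tree(ul):
    """Convert a <ul> Tag into nested dicts, using a list where every item is a leaf."""
    items = []
    for li in ul.find_all("li", recursive=False):
        # The first non-empty string inside the <li> is the item name.
        name = next(li.stripped_strings)
        # Only consider a <ul> that is a direct child of this <li>.
        child_ul = li.find("ul", recursive=False)
        items.append((name, ul_to_tree(child_ul) if child_ul else None))

    # If no item on this level has children, a plain list of names is enough.
    if all(subtree is None for _, subtree in items):
        return [name for name, _ in items]
    # Otherwise keep a dict; leaves on a mixed level map to None.
    return dict(items)


# Shortened, hypothetical markup with the same nesting pattern as the question.
sample = """
<ul>
  <li><a href="#">Acorus</a>
    <ul>
      <li><a href="#">Acorus calamus</a>
        <ul>
          <li><a href="#">Acorus calamus var. americanus</a></li>
          <li><a href="#">Acorus calamus var. angustatus</a></li>
        </ul>
      </li>
      <li><a href="#">Acorus gramineus</a></li>
    </ul>
  </li>
</ul>
"""

tree = ul_to_tree(BeautifulSoup(sample, "html.parser").ul)
print(tree)
```

On the sample, the two varieties come back as a list under 'Acorus calamus', while 'Acorus gramineus' stays a key with the value None because its level also contains a branch.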

【Comments】:
