Use find_next_sibling() for a specific class value only: answers

【Question title】: Use find_next_sibling() for specific class value only
【Posted】: 2022-01-06 21:16:07
【Question description】:

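I have a bunch of p elements on an HTML page, and I am using BeautifulSoup to parse the page. The page is the index of an online book. What I need to do is create a nested JSON structure, which the page currently lacks, because some of the index terms are children of a single term. So you can think of the index like this: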

parent term
    child term
    child term
    child term
parent term
parent term

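However, the HTML is not nested; everything is listed in flat <p> tags, as shown below. As you can see, the term Action(s) is a parent term with 8 child terms. The next parent term is Actionable insights, which has 0 children. I have a loop that goes through every <p> tag, and I need to nest the children under their parent in the JSON file. I can't use find_next_siblings() (plural), because it would grab all the tags indiscriminately. But if I could find a way to use find_next_sibling() (singular) only for tags with 'class': 'index2' and add them to a list, I could then attach that list to the parent as its children. At least, that is my logic so far.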

<h2>A</h2>
    <p class="index1">Acceptance of insights, merit-based, <a href="c01.xhtml#Page_3">3</a></p>
    <p class="index1">Accuracy of data, <a href="c05.xhtml#Page_125">125</a>, <a href="c05.xhtml#Page_126">126</a></p>
    <p class="index1">Action(s):</p>
    <p class="index2">of audience, “so what?” question about, <a href="c05.xhtml#Page_133">133</a>–135</p>
    <p class="index2">communicating to turn insights into, <a href="c01.xhtml#Page_10">10</a>–12</p>
    <p class="index2">in deriving value from analytics, <a href="c01.xhtml#Page_11">11</a>–12</p>
    <p class="index2">driving, <i>see</i> Driving action</p>
    <p class="index2">empowering audience to act, <a href="c06.xhtml#Page_178">178</a>–180</p>
    <p class="index2">in 4D Framework, <a href="c05.xhtml#Page_128">128</a>–132</p>
    <p class="index2">inspired by insights, <a href="c01.xhtml#Page_9">9</a>–10</p>
    <p class="index2">as objective of communication, <a href="c02.xhtml#Page_36">36</a>, <a href="c02.xhtml#Page_37">37</a></p>
    <p class="index1">Actionable insights, <a href="c03.xhtml#Page_51">51</a>, <a href="c05.xhtml#Page_132">132</a>–135</p>
    <p class="index1">Additive annotations, <a href="c08.xhtml#Page_244">244</a></p>
    <p class="index1">Aggregating data, <a href="c08.xhtml#Page_232">232</a></p>
    <p class="index1">AGT/HEED, <a href="c04.xhtml#Page_108">108</a>–109</p>
    <p class="index1">Aha Moment:</p>
    <p class="index2">connecting Hook and, <a href="c06.xhtml#Page_176">176</a></p>
    <p class="index2">in Data Storytelling Arc, <a href="c06.xhtml#Page_163">163</a>–167</p>
    <p class="index2">in data trailers, <a href="c06.xhtml#Page_181">181</a>, <a href="c06.xhtml#Page_182">182</a>, <a href="c09.xhtml#Page_292">292</a>–293</p>
    <p class="index2">identified in storyboarding, <a href="c06.xhtml#Page_172">172</a>–173</p>
    <p class="index2">initial interest generated by, <a href="c06.xhtml#Page_178">178</a></p>
    <p class="index2">in manufacturing gross margin story, <a href="c09.xhtml#Page_295">295</a></p>
    <p class="index2">in Rosling story, <a href="c09.xhtml#Page_273">273</a></p>
    <p class="index2">in US education system story, <a href="c09.xhtml#Page_286">286</a></p>

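The problem, however, is that I can't figure out the logic. It is complicated because I also need recursion, and I keep getting a NoneType error (described below). If I take out the block of code I'm stuck on, the rest of the code works. But how can I use BeautifulSoup to get only the next <p> tags that have the index2 class? At least the children are identified as index2. I just want to avoid scanning the whole document every time I need a few child terms. It seems like it should be straightforward, but no luck. Thanks for your help.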

My code:

from bs4 import BeautifulSoup
import json

# convert html to bs4 object
def bs4_convert(file):
    with open(file, encoding='utf8') as fp:
        html = BeautifulSoup(fp, 'html.parser')
    return html

# create a tag
def p_parser(el, link_prefix):
    tags = []
    for p in el:
        tag = {
            'tag': p.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)]
        }
        # add all child terms of a parent term to a list

I'm stuck here: this block keeps failing with a NoneType error saying that p.find_next_sibling('p')['class'] is not subscriptable, even though I check for None.

        children = []
        if(p.find_next_sibling('p') is not None):
            while(p.find_next_sibling('p')['class'] == ['index2']):
                next_child = p.find_next_sibling('p')
                if(next_child is not None):
                    children.append(next_child)
                    p = next_child
                else:
                    break
                
        # make child tags
        tag['children'] = p_parser(children, link_prefix)

        tags.append(tag)

    return tags

# loop through all indices
def html_parser(html, link_prefix):
    tags = []
    # extract index
    html.find('section', {'role': "doc-index"})
    # iterate over every indented letter in index
    letters = html.find_all('section')
    for letter in letters:
        tags += p_parser(letter.find_all('p'), link_prefix)

    return tags

# add the course name as parent to all tags
def add_course_tag(course_name, tags):
    complete_tags = {
        'tag': course_name,
        'definition': '',
        'source': tags
    }

    return complete_tags

# write tags to JSON file
def write_to_json(course_name, tags):
    # Serializing json 
    json_object = json.dumps(tags, indent = 4)

    # Writing to course_name.json
    with open(course_name + '_tags.json', 'w') as outfile:
        outfile.write(json_object)

def main():
    # course information for the book
    course = {
        'course': 'data_storytelling', # exact course name
        'file': 'data_storytelling.html', # the html file you extracted
        'parse_type': 'index'
    }

    # this link prefix should be the same for all pages of one book
    prefix_id = 'effective-data-storytelling/9781119615712'
    link_prefix = 'https://learning.oreilly.com/library/view/' + prefix_id + '/'

    tags = []
    # parse the html
    html = bs4_convert(course['file'])
    # create tags
    tags = html_parser(html, link_prefix)
    # add course name as outermost tag
    tags = add_course_tag(course['course'], tags)
    # write results to json file
    write_to_json(course['course'], tags)

if __name__ == "__main__":
    main()
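
A likely cause of the NoneType error described above: the `is not None` check runs only once, before the `while` loop, but `p` is reassigned inside the loop and the `while` condition calls `find_next_sibling('p')` again. When `p` is the last `<p>` among its siblings, that call returns `None`, and subscripting it with `['class']` fails. The following is a minimal sketch of a guarded loop, not the asker's final code; the helper name `collect_index2_children` and the short inline HTML are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Small inline sample (an assumption; the real page is much longer).
html = """
<p class="index1">Action(s):</p>
<p class="index2">of audience</p>
<p class="index2">driving</p>
<p class="index1">Aha Moment:</p>
<p class="index2">in Rosling story</p>
"""

def collect_index2_children(p):
    """Hypothetical helper: gather the run of index2 <p> tags right after p."""
    children = []
    nxt = p.find_next_sibling('p')
    # Re-check for None on every iteration, not only once before the loop.
    while nxt is not None and 'index2' in nxt.get('class', []):
        children.append(nxt)
        nxt = nxt.find_next_sibling('p')
    return children

soup = BeautifulSoup(html, 'html.parser')
parents = soup.find_all('p', class_='index1')
result = {
    p.get_text(): [c.get_text() for c in collect_index2_children(p)]
    for p in parents
}
print(result)
```

The last parent here ends the document, which is exactly the case where an unguarded `find_next_sibling('p')['class']` would raise.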

EDIT: I tried this code, but it never stops running on the command line (and nothing new is written to the JSON file).

# create a tag
def p_parser(el, link_prefix):
    tags = []
    for p in el:
        tag = {
            'tag': p.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)]
        }
        # add all child terms of a parent term to a list
        children = []
        for child in p.next_siblings:
            if child.name == 'p' and 'index2' not in child['class']:
                break
            elif child.name == 'p' and 'index2' in child['class']:
                children.append(child) 

        tags.append(tag)
        # make child tags
        tag['children'] = p_parser(children, link_prefix)

    return tags
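
One plausible reason this version appears to hang: each child passed back into `p_parser` collects its own later index2 siblings as "children", and so on recursively, so the work roughly doubles with every extra sub-entry. The sketch below only counts how many tag dictionaries such a recursion would build; `build_html` and `count_nodes` are hypothetical names introduced for this illustration, and the generated HTML is synthetic.

```python
from bs4 import BeautifulSoup

def build_html(n_children):
    """Synthetic index: one parent entry followed by n_children sub-entries."""
    rows = ['<p class="index1">Parent:</p>']
    rows += [f'<p class="index2">child {i}</p>' for i in range(n_children)]
    return '\n'.join(rows)

def count_nodes(paragraphs):
    """Count the tags a recursive parser of this shape would create."""
    total = 0
    for p in paragraphs:
        children = []
        for sib in p.next_siblings:
            if sib.name != 'p':
                continue  # skip whitespace text nodes between tags
            if 'index2' not in sib.get('class', []):
                break
            children.append(sib)
        # Recursing on the children re-collects each child's later siblings.
        total += 1 + count_nodes(children)
    return total

counts = {}
for n in (2, 4, 8):
    soup = BeautifulSoup(build_html(n), 'html.parser')
    parent = soup.find('p', class_='index1')
    counts[n] = count_nodes([parent])
print(counts)
```

With a few dozen sub-entries under one parent, growth of this kind would be enough to make the script look as if it never finishes.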

【Comments】:

Tags: python beautifulsoup

【Solution 1】:

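You are close to your goal; it only needs a small adjustment. While iterating, check tag.name and its class, and break as soon as you reach a <p> whose class does not contain index2: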

children = []

for c in p.next_siblings:
    if c.name == 'p' and 'index2' not in c['class']:
        break
    elif c.name == 'p' and 'index2' in c['class']:
        children.append(c)
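
For comparison, `find_next_sibling()` does accept a class filter, but filtering by class alone does not stop at the next parent entry: it skips over any index1 tags in between and returns a later index2 tag that belongs to a different parent. The sketch below contrasts the two behaviours on a tiny sample; `children_of` is a hypothetical helper name, and the HTML is shortened from the question.

```python
from bs4 import BeautifulSoup

html = """
<p class="index1">Actionable insights</p>
<p class="index1">Aha Moment:</p>
<p class="index2">in Rosling story</p>
"""
soup = BeautifulSoup(html, 'html.parser')
first = soup.find('p', class_='index1')  # "Actionable insights", no children

# The class filter jumps past "Aha Moment:" and finds someone else's child.
skipped = first.find_next_sibling('p', class_='index2')

def children_of(p):
    """Hypothetical helper: index2 siblings up to the next non-index2 <p>."""
    found = []
    for c in p.next_siblings:
        if c.name != 'p':
            continue  # ignore whitespace between tags
        if 'index2' not in c.get('class', []):
            break
        found.append(c)
    return found

print(skipped.get_text())
print(children_of(first))
```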

Example

This is just a demonstration, but I'm sure you can adapt it to your code.

import bs4
html='''
<h2>A</h2>
    <p class="index1">Acceptance of insights, merit-based, <a href="c01.xhtml#Page_3">3</a></p>
    <p class="index1">Accuracy of data, <a href="c05.xhtml#Page_125">125</a>, <a href="c05.xhtml#Page_126">126</a></p>
    <p class="index1">Action(s):</p>
    <p class="index2">of audience, “so what?” question about, <a href="c05.xhtml#Page_133">133</a>–135</p>
    <p class="index2">communicating to turn insights into, <a href="c01.xhtml#Page_10">10</a>–12</p>
    <p class="index2">in deriving value from analytics, <a href="c01.xhtml#Page_11">11</a>–12</p>
    <p class="index2">driving, <i>see</i> Driving action</p>
    <p class="index2">empowering audience to act, <a href="c06.xhtml#Page_178">178</a>–180</p>
    <p class="index2">in 4D Framework, <a href="c05.xhtml#Page_128">128</a>–132</p>
    <p class="index2">inspired by insights, <a href="c01.xhtml#Page_9">9</a>–10</p>
    <p class="index2">as objective of communication, <a href="c02.xhtml#Page_36">36</a>, <a href="c02.xhtml#Page_37">37</a></p>
    <p class="index1">Actionable insights, <a href="c03.xhtml#Page_51">51</a>, <a href="c05.xhtml#Page_132">132</a>–135</p>
    <p class="index1">Additive annotations, <a href="c08.xhtml#Page_244">244</a></p>
    <p class="index1">Aggregating data, <a href="c08.xhtml#Page_232">232</a></p>
    <p class="index1">AGT/HEED, <a href="c04.xhtml#Page_108">108</a>–109</p>
    <p class="index1">Aha Moment:</p>
    <p class="index2">connecting Hook and, <a href="c06.xhtml#Page_176">176</a></p>
    <p class="index2">in Data Storytelling Arc, <a href="c06.xhtml#Page_163">163</a>–167</p>
    <p class="index2">in data trailers, <a href="c06.xhtml#Page_181">181</a>, <a href="c06.xhtml#Page_182">182</a>, <a href="c09.xhtml#Page_292">292</a>–293</p>
    <p class="index2">identified in storyboarding, <a href="c06.xhtml#Page_172">172</a>–173</p>
    <p class="index2">initial interest generated by, <a href="c06.xhtml#Page_178">178</a></p>
    <p class="index2">in manufacturing gross margin story, <a href="c09.xhtml#Page_295">295</a></p>
    <p class="index2">in Rosling story, <a href="c09.xhtml#Page_273">273</a></p>
    <p class="index2">in US education system story, <a href="c09.xhtml#Page_286">286</a></p>
'''
soup = bs4.BeautifulSoup(html, 'html.parser')

# this link prefix should be the same for all pages of one book
prefix_id = 'effective-data-storytelling/9781119615712'
link_prefix = 'https://learning.oreilly.com/library/view/' + prefix_id + '/'

data = []

for p in soup.select('p.index1'):
    tag = {
            'tag': p.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)],
            'children':[]
        }
    
    for c in p.next_siblings:
        if c.name == 'p' and 'index1' in c['class']:
            break
        elif c.name == 'p' and 'index2' in c['class']:
            tag['children'].append({
                'tag': c.text,
                'definition': '',
                'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in c.find_all('a', recursive=False)],
            })
    data.append(tag)
    
data

EDIT

#create tag
def create_tag(p, link_prefix):
    tag = {
        'tag': p.text,
        'definition': '',
        'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)]
    }
    return tag

#parse p and p children
def p_parser(el, link_prefix):
    tags = []
    for p in el:
        tag = create_tag(p, link_prefix)
        # add all child terms of a parent term to a list
        children = []
        for child in p.next_siblings:
            if child.name == 'p' and 'index2' not in child['class']:
                break
            elif child.name == 'p' and 'index2' in child['class']:
                if child is not None:
                    children.append(create_tag(child, link_prefix)) 
       
        # make child tags
        if children:
            tag['children'] = children

        # add any parent tags to tags
        tags.append(tag)

    return tags
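
As an alternative to looking ahead from each parent, the same nesting can be sketched as a single pass over all `<p>` tags in which every index2 entry is attached to the most recent index1 entry. This is a different technique from the two functions above, shown under assumptions: `make_tag` and `nest_index` are hypothetical names, the link prefix `https://example.com/book/` is a placeholder, and the HTML is a shortened sample.

```python
import json
from bs4 import BeautifulSoup

def make_tag(p, link_prefix):
    """Build one tag dict with the same fields as in the question."""
    return {
        'tag': p.get_text(),
        'definition': '',
        'source': [{'title': a.get_text(), 'href': link_prefix + a['href']}
                   for a in p.find_all('a', recursive=False)],
    }

def nest_index(paragraphs, link_prefix):
    """Single pass: attach each index2 entry to the last index1 entry seen."""
    tags = []
    for p in paragraphs:
        tag = make_tag(p, link_prefix)
        if 'index2' in p.get('class', []) and tags:
            tags[-1].setdefault('children', []).append(tag)
        else:
            tags.append(tag)
    return tags

html = """
<p class="index1">Accuracy of data, <a href="c05.xhtml#Page_125">125</a></p>
<p class="index1">Action(s):</p>
<p class="index2">driving, <i>see</i> Driving action</p>
<p class="index2">in 4D Framework, <a href="c05.xhtml#Page_128">128</a>–132</p>
<p class="index1">Actionable insights, <a href="c03.xhtml#Page_51">51</a></p>
"""
soup = BeautifulSoup(html, 'html.parser')
nested = nest_index(soup.find_all('p'), 'https://example.com/book/')
print(json.dumps(nested, indent=4))
```

Each paragraph is visited once, and only entries that actually have sub-entries get a 'children' key.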

【Discussion】:

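Hi, thanks for the tip, but I ran it and it never stopped running. If you'd like to see how I added it, I put the new function in my post under the EDIT heading.
I think you should remove the line # make child tags tag['children'] = p_parser(children, link_prefix) from your edit and let me know.
Unfortunately that line is needed; it is the only line that adds the list of children inside the parent tag. The idea is to create child tags just like the parent tags, but the list of child tags should be nested inside the parent, as in 'children': [tags]. Not all index entries have children, hence the code snippet, just in case.
Does p.next_siblings iterate over the whole document of <p> tags? That is the problem I kept running into with my other code. I'm trying to find a way to traverse only the next few lines, as indicated by the condition, when the class is index2.
Added an EDIT to my answer that shows a solution based on your original code. Replace your p_parser function with these two functions, because you can't call p_parser again with the child elements; that would lead to an infinite loop.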
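
On the question of whether p.next_siblings walks the whole document: as far as I know it yields siblings one at a time, so a `break` stops the walk at that point and the later siblings are never visited. The sketch below counts how many siblings are actually touched before the break; the synthetic HTML with 1000 trailing entries and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

rows = ['<p class="index1">Parent:</p>',
        '<p class="index2">child a</p>',
        '<p class="index2">child b</p>']
# Many later parent entries that should never be visited.
rows += [f'<p class="index1">later entry {i}</p>' for i in range(1000)]
soup = BeautifulSoup(''.join(rows), 'html.parser')
parent = soup.find('p', class_='index1')

visited = 0
children = []
for sib in parent.next_siblings:
    visited += 1
    if sib.name != 'p':
        continue
    if 'index2' not in sib.get('class', []):
        break  # first non-index2 <p>: stop walking here
    children.append(sib.get_text())

print(visited, children)
```

Only the two children and the first following parent entry are touched, regardless of how long the rest of the index is.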