【Question title】: Use find_next_sibling() for specific class value only
【Posted】: 2022-01-06 21:16:07
【Question】:

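I have a bunch of p elements in an HTML page and am using BeautifulSoup to parse it. The page is the index of an online book. I need to build a nested JSON structure, which the page itself does not have, because some index terms are children of a single parent term. You can think of the index like this: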

parent term
    child term
    child term
    child term
parent term
parent term

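However, the HTML is not nested; every entry is a flat <p> tag, as shown below. As you can see, the term Action(s) is a parent term with 8 child terms. The next parent term is Actionable insights, which has 0 children. I have a loop that goes through every <p> tag, and I need to nest the children under their parent in the JSON file. I can't use find_next_siblings() (plural), because it grabs all following <p> tags indiscriminately. But if I could use find_next_sibling() (singular) so that it only picks up tags with 'class': 'index2' and adds them to a list, I could attach that list to the parent as its children. At least, that is my logic so far.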

<h2>A</h2>
    <p class="index1">Acceptance of insights, merit-based, <a href="c01.xhtml#Page_3">3</a></p>
    <p class="index1">Accuracy of data, <a href="c05.xhtml#Page_125">125</a>, <a href="c05.xhtml#Page_126">126</a></p>
    <p class="index1">Action(s):</p>
    <p class="index2">of audience, “so what?” question about, <a href="c05.xhtml#Page_133">133</a>–135</p>
    <p class="index2">communicating to turn insights into, <a href="c01.xhtml#Page_10">10</a>–12</p>
    <p class="index2">in deriving value from analytics, <a href="c01.xhtml#Page_11">11</a>–12</p>
    <p class="index2">driving, <i>see</i> Driving action</p>
    <p class="index2">empowering audience to act, <a href="c06.xhtml#Page_178">178</a>–180</p>
    <p class="index2">in 4D Framework, <a href="c05.xhtml#Page_128">128</a>–132</p>
    <p class="index2">inspired by insights, <a href="c01.xhtml#Page_9">9</a>–10</p>
    <p class="index2">as objective of communication, <a href="c02.xhtml#Page_36">36</a>, <a href="c02.xhtml#Page_37">37</a></p>
    <p class="index1">Actionable insights, <a href="c03.xhtml#Page_51">51</a>, <a href="c05.xhtml#Page_132">132</a>–135</p>
    <p class="index1">Additive annotations, <a href="c08.xhtml#Page_244">244</a></p>
    <p class="index1">Aggregating data, <a href="c08.xhtml#Page_232">232</a></p>
    <p class="index1">AGT/HEED, <a href="c04.xhtml#Page_108">108</a>–109</p>
    <p class="index1">Aha Moment:</p>
    <p class="index2">connecting Hook and, <a href="c06.xhtml#Page_176">176</a></p>
    <p class="index2">in Data Storytelling Arc, <a href="c06.xhtml#Page_163">163</a>–167</p>
    <p class="index2">in data trailers, <a href="c06.xhtml#Page_181">181</a>, <a href="c06.xhtml#Page_182">182</a>, <a href="c09.xhtml#Page_292">292</a>–293</p>
    <p class="index2">identified in storyboarding, <a href="c06.xhtml#Page_172">172</a>–173</p>
    <p class="index2">initial interest generated by, <a href="c06.xhtml#Page_178">178</a></p>
    <p class="index2">in manufacturing gross margin story, <a href="c09.xhtml#Page_295">295</a></p>
    <p class="index2">in Rosling story, <a href="c09.xhtml#Page_273">273</a></p>
    <p class="index2">in US education system story, <a href="c09.xhtml#Page_286">286</a></p>

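The problem is that I can't work out the logic. It is complicated because I also need recursion, and I keep getting a NoneType error (described below). If I take out the block I'm stuck on, the rest of the code works. How can I use BeautifulSoup to get only the next <p> tags that have the index2 class? The children are at least identified as index2. I just want to avoid scanning the whole document every time I need a few child terms. It seems like it should be straightforward, but I've had no luck. Thanks for your help.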

My code:

from bs4 import BeautifulSoup
import json

# convert html to bs4 object
def bs4_convert(file):
    with open(file, encoding='utf8') as fp:
        html = BeautifulSoup(fp, 'html.parser')
    return html

# create a tag
def p_parser(el, link_prefix):
    tags = []
    for p in el:
        tag = {
            'tag': p.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)]
        }
        # add all child terms of a parent term to a list

I'm stuck here. This block keeps failing with a NoneType error saying that p.find_next_sibling('p')['class'] is not subscriptable, even though I check for None first.

        children = []
        if(p.find_next_sibling('p') is not None):
            while(p.find_next_sibling('p')['class'] == ['index2']):
                next_child = p.find_next_sibling('p')
                if(next_child is not None):
                    children.append(next_child)
                    p = next_child
                else:
                    break
                
        # make child tags
        tag['children'] = p_parser(children, link_prefix)

        tags.append(tag)

    return tags

# loop through all indices
def html_parser(html, link_prefix):
    tags = []
    # extract index
    html.find('section', {'role': "doc-index"})
    # iterate over every indented letter in index
    letters = html.find_all('section')
    for letter in letters:
        tags += p_parser(letter.find_all('p'), link_prefix)

    return tags

# add the course name as parent to all tags
def add_course_tag(course_name, tags):
    complete_tags = {
        'tag': course_name,
        'definition': '',
        'source': tags
    }

    return complete_tags

# write tags to JSON file
def write_to_json(course_name, tags):
    # Serializing json 
    json_object = json.dumps(tags, indent = 4)

    # Writing to course_name.json
    with open(course_name + '_tags.json', 'w') as outfile:
        outfile.write(json_object)

def main():
    # course information for the book
    course = {
        'course': 'data_storytelling', # exact course name
        'file': 'data_storytelling.html', # the html file you extracted
        'parse_type': 'index'
    }

    # this link prefix should be the same for all pages of one book
    prefix_id = 'effective-data-storytelling/9781119615712'
    link_prefix = 'https://learning.oreilly.com/library/view/' + prefix_id + '/'

    tags = []
    # parse the html
    html = bs4_convert(course['file'])
    # create tags
    tags = html_parser(html, link_prefix)
    # add course name as outermost tag
    tags = add_course_tag(course['course'], tags)
    # write results to json file
    write_to_json(course['course'], tags)

if __name__ == "__main__":
    main()

EDIT: I tried this code, but it never stops running on the command line (and nothing new is written to the JSON file).

# create a tag
def p_parser(el, link_prefix):
    tags = []
    for p in el:
        tag = {
            'tag': p.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)]
        }
        # add all child terms of a parent term to a list
        children = []
        for child in p.next_siblings:
            if child.name == 'p' and 'index2' not in child['class']:
                break
            elif child.name == 'p' and 'index2' in child['class']:
                children.append(child) 

        tags.append(tag)
        # make child tags
        tag['children'] = p_parser(children, link_prefix)

    return tags

【Comments】:

    Tags: python beautifulsoup


    【Solution 1】:

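    You are close to your goal and only need a small adjustment. While iterating, check each sibling's tag.name and its class, and break as soon as you reach a <p> whose class does not contain index2: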

    children = []
    
    for c in p.next_siblings:
        if c.name == 'p' and 'index2' not in c['class']:
            break
        elif c.name == 'p' and 'index2' in c['class']:
            children.append(c)
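
As for the NoneType error in the question: the `is not None` check there runs only once, before the `while` loop. The loop condition then calls `find_next_sibling('p')` again on the reassigned `p`, and once the last `<p>` in its container is reached that call returns `None`, so `['class']` fails. Below is a minimal sketch of the singular `find_next_sibling()` approach with the check moved into the loop condition. The helper name `collect_children` and the shortened sample HTML are assumptions made for illustration, not part of the original code.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption for this sketch).
html = """
<div>
<p class="index1">Action(s):</p>
<p class="index2">driving</p>
<p class="index2">in 4D Framework</p>
<p class="index1">Actionable insights</p>
<p class="index1">Aha Moment:</p>
<p class="index2">in Rosling story</p>
</div>
"""


def collect_children(parent, child_class="index2"):
    """Return the run of <p> siblings directly after `parent` that carry `child_class`."""
    children = []
    sibling = parent.find_next_sibling("p")
    # Test for None on every iteration, and use .get() so a <p>
    # without a class attribute does not raise KeyError.
    while sibling is not None and child_class in sibling.get("class", []):
        children.append(sibling)
        sibling = sibling.find_next_sibling("p")
    return children


soup = BeautifulSoup(html, "html.parser")
result = {
    p.get_text(): [c.get_text() for c in collect_children(p)]
    for p in soup.select("p.index1")
}
print(result)
```

The loop stops at the first `<p>` that is not an `index2` entry, or when there are no more `<p>` siblings.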
    

    Example

    This is just for demonstration, but I'm sure you can adapt it to your code.

    import bs4
    html='''
    <h2>A</h2>
        <p class="index1">Acceptance of insights, merit-based, <a href="c01.xhtml#Page_3">3</a></p>
        <p class="index1">Accuracy of data, <a href="c05.xhtml#Page_125">125</a>, <a href="c05.xhtml#Page_126">126</a></p>
        <p class="index1">Action(s):</p>
        <p class="index2">of audience, “so what?” question about, <a href="c05.xhtml#Page_133">133</a>–135</p>
        <p class="index2">communicating to turn insights into, <a href="c01.xhtml#Page_10">10</a>–12</p>
        <p class="index2">in deriving value from analytics, <a href="c01.xhtml#Page_11">11</a>–12</p>
        <p class="index2">driving, <i>see</i> Driving action</p>
        <p class="index2">empowering audience to act, <a href="c06.xhtml#Page_178">178</a>–180</p>
        <p class="index2">in 4D Framework, <a href="c05.xhtml#Page_128">128</a>–132</p>
        <p class="index2">inspired by insights, <a href="c01.xhtml#Page_9">9</a>–10</p>
        <p class="index2">as objective of communication, <a href="c02.xhtml#Page_36">36</a>, <a href="c02.xhtml#Page_37">37</a></p>
        <p class="index1">Actionable insights, <a href="c03.xhtml#Page_51">51</a>, <a href="c05.xhtml#Page_132">132</a>–135</p>
        <p class="index1">Additive annotations, <a href="c08.xhtml#Page_244">244</a></p>
        <p class="index1">Aggregating data, <a href="c08.xhtml#Page_232">232</a></p>
        <p class="index1">AGT/HEED, <a href="c04.xhtml#Page_108">108</a>–109</p>
        <p class="index1">Aha Moment:</p>
        <p class="index2">connecting Hook and, <a href="c06.xhtml#Page_176">176</a></p>
        <p class="index2">in Data Storytelling Arc, <a href="c06.xhtml#Page_163">163</a>–167</p>
        <p class="index2">in data trailers, <a href="c06.xhtml#Page_181">181</a>, <a href="c06.xhtml#Page_182">182</a>, <a href="c09.xhtml#Page_292">292</a>–293</p>
        <p class="index2">identified in storyboarding, <a href="c06.xhtml#Page_172">172</a>–173</p>
        <p class="index2">initial interest generated by, <a href="c06.xhtml#Page_178">178</a></p>
        <p class="index2">in manufacturing gross margin story, <a href="c09.xhtml#Page_295">295</a></p>
        <p class="index2">in Rosling story, <a href="c09.xhtml#Page_273">273</a></p>
        <p class="index2">in US education system story, <a href="c09.xhtml#Page_286">286</a></p>
    '''
    soup = bs4.BeautifulSoup(html, 'html.parser')
    
    # this link prefix should be the same for all pages of one book
    prefix_id = 'effective-data-storytelling/9781119615712'
    link_prefix = 'https://learning.oreilly.com/library/view/' + prefix_id + '/'
    
    data = []
    
    for p in soup.select('p.index1'):
        tag = {
                'tag': p.text,
                'definition': '',
                'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)],
                'children':[]
            }
        
        for c in p.next_siblings:
            if c.name == 'p' and 'index1' in c['class']:
                break
            elif c.name == 'p' and 'index2' in c['class']:
                tag['children'].append({
                    'tag': c.text,
                    'definition': '',
                    'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in c.find_all('a', recursive=False)],
                })
        data.append(tag)
        
    data
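
If the goal is to avoid looking at siblings more than once, an alternative is a single pass over all `<p>` tags that attaches each `index2` entry to the most recent `index1` entry. This is a sketch of a different technique from the sibling loop above, not a change to it. The function name `group_index` and the reduced dictionary shape (only `tag` and `children`) are assumptions to keep the example short.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption for this sketch).
html = """
<h2>A</h2>
<p class="index1">Accuracy of data</p>
<p class="index1">Action(s):</p>
<p class="index2">driving</p>
<p class="index2">in 4D Framework</p>
<p class="index1">Actionable insights</p>
"""


def group_index(soup):
    """Walk every <p> once; index2 entries go under the last index1 entry seen."""
    entries = []
    for p in soup.find_all("p"):
        classes = p.get("class", [])
        if "index1" in classes:
            entries.append({"tag": p.get_text(), "children": []})
        elif "index2" in classes and entries:
            entries[-1]["children"].append({"tag": p.get_text()})
    return entries


soup = BeautifulSoup(html, "html.parser")
entries = group_index(soup)
for entry in entries:
    print(entry["tag"], [child["tag"] for child in entry["children"]])
```

Each `<p>` is visited exactly once, so the cost grows linearly with the size of the index.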
    

    Edit

    #create tag
    def create_tag(p, link_prefix):
        tag = {
            'tag': p.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)]
        }
        return tag
    
    #parse p and p children
    def p_parser(el, link_prefix):
        tags = []
        for p in el:
            tag = create_tag(p, link_prefix)
            # add all child terms of a parent term to a list
            children = []
            for child in p.next_siblings:
                if child.name != 'p':
                    continue  # skip whitespace text nodes and non-<p> tags
                if 'index2' not in child.get('class', []):
                    break  # reached the next parent term
                children.append(create_tag(child, link_prefix))
           
            # make child tags
            if children:
                tag['children'] = children
    
            # add any parent tags to tags
            tags.append(tag)
    
        return tags
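
One thing to watch, assuming the parser is still fed every `<p>` via `letter.find_all('p')` as in the question: the `index2` paragraphs would then also be processed as top-level entries and would pick up the `index2` siblings that follow them as their own children. The sketch below restricts the outer loop to `p.index1`. The function name `parse_section`, the `https://example.com/book/` prefix, and the sample markup are assumptions for illustration.

```python
import json

from bs4 import BeautifulSoup, Tag


def create_tag(p, link_prefix):
    return {
        "tag": p.get_text(),
        "definition": "",
        "source": [
            {"title": a.get_text(), "href": link_prefix + a["href"]}
            for a in p.find_all("a", recursive=False)
        ],
    }


def parse_section(section, link_prefix):
    tags = []
    # Only parent terms drive the outer loop.
    for p in section.select("p.index1"):
        tag = create_tag(p, link_prefix)
        children = []
        for node in p.next_siblings:
            if not isinstance(node, Tag) or node.name != "p":
                continue  # skip text nodes and non-<p> tags
            if "index2" not in node.get("class", []):
                break  # next parent term reached
            children.append(create_tag(node, link_prefix))
        if children:
            tag["children"] = children
        tags.append(tag)
    return tags


# Shortened sample markup (an assumption for this sketch).
html = """<section role="doc-index">
<p class="index1">Action(s):</p>
<p class="index2">in 4D Framework, <a href="c05.xhtml#Page_128">128</a>-132</p>
<p class="index1">Additive annotations, <a href="c08.xhtml#Page_244">244</a></p>
</section>"""

soup = BeautifulSoup(html, "html.parser")
section = soup.find("section", {"role": "doc-index"})
tags = parse_section(section, "https://example.com/book/")
print(json.dumps(tags, indent=4))
```

Entries without children simply have no `children` key, matching the `if children:` guard in the answer.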
    

    【Discussion】:

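    • Hi, thanks for the tip, but I ran it and it never stopped running. If you want to see how I added it, I'll put the new function in my post under the EDIT heading.
    • I think you should remove the line # make child tags tag['children'] = p_parser(children, link_prefix) from your edit and let me know.
    • Unfortunately that line is needed; it is the only line that adds the list of children inside the parent tag. The idea is to create child tags just like the parent tags, but the list of child tags should be nested in the parent, as in 'children': [tags]. Not every index entry has children, hence the code snippet, just in case.
    • Does p.next_siblings traverse the whole document of <p> tags? That's the problem I keep running into with my other code. I'm trying to find a way to go through only the next few lines, as determined by the condition that the class is index2.
    • Added an EDIT to my answer that shows a solution based on your original code. Replace your p_parser function with these two functions, because you cannot call p_parser again with the child elements; that leads to an infinite loop.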
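
On the question of whether `p.next_siblings` walks the whole document: it yields siblings one at a time, so a `break` ends the walk at that point. The sketch below counts how many siblings are actually visited before the `break`. The sample markup is an assumption and is written without whitespace between tags so that every sibling is a `<p>` and the counts are easy to follow.

```python
from bs4 import BeautifulSoup, Tag

# Sample markup (an assumption for this sketch), with no whitespace between tags.
html = (
    "<div>"
    '<p class="index1">A</p>'
    '<p class="index2">a1</p>'
    '<p class="index2">a2</p>'
    '<p class="index1">B</p>'
    '<p class="index1">C</p>'
    '<p class="index2">c1</p>'
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("p", class_="index1")  # the first parent term, "A"

visited = 0
children = []
for node in parent.next_siblings:
    visited += 1
    if not isinstance(node, Tag) or node.name != "p":
        continue
    if "index2" not in node.get("class", []):
        break  # stop at the next parent term
    children.append(node.get_text())

total = len(list(parent.next_siblings))
print(children, visited, total)
```

Here only the two children and the paragraph that triggers the `break` are visited, not all five following siblings.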