【Posted】: 2022-01-06 21:16:07
【Question】:
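I have a bunch of p elements in an HTML page, and I'm using BeautifulSoup to parse the page. The page is the index of an online book. What I need to do is build a nested JSON structure, which the page doesn't currently provide, because some of the index terms are children of another term.
So you can think of the index like this: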
parent term
    child term
    child term
    child term
parent term
parent term
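However, the HTML isn't nested; everything is listed in flat <p> tags, as shown below. As you can see, the term Action(s) is a parent term with 8 child terms. The next parent term is Actionable insights, which has 0 children. I have a loop that goes through every <p> tag, and I need to nest the children under their parent in the JSON file. I can't use find_next_siblings() (plural), because it would just grab all of the following <p> tags indiscriminately. But if I could find a way to use find_next_sibling() (singular), taking only the tags with 'class': 'index2' and adding them to a list, then I could attach that list to the parent as its children. At least, that's my logic so far.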
<h2>A</h2>
<p class="index1">Acceptance of insights, merit-based, <a href="c01.xhtml#Page_3">3</a></p>
<p class="index1">Accuracy of data, <a href="c05.xhtml#Page_125">125</a>, <a href="c05.xhtml#Page_126">126</a></p>
<p class="index1">Action(s):</p>
<p class="index2">of audience, “so what?” question about, <a href="c05.xhtml#Page_133">133</a>–135</p>
<p class="index2">communicating to turn insights into, <a href="c01.xhtml#Page_10">10</a>–12</p>
<p class="index2">in deriving value from analytics, <a href="c01.xhtml#Page_11">11</a>–12</p>
<p class="index2">driving, <i>see</i> Driving action</p>
<p class="index2">empowering audience to act, <a href="c06.xhtml#Page_178">178</a>–180</p>
<p class="index2">in 4D Framework, <a href="c05.xhtml#Page_128">128</a>–132</p>
<p class="index2">inspired by insights, <a href="c01.xhtml#Page_9">9</a>–10</p>
<p class="index2">as objective of communication, <a href="c02.xhtml#Page_36">36</a>, <a href="c02.xhtml#Page_37">37</a></p>
<p class="index1">Actionable insights, <a href="c03.xhtml#Page_51">51</a>, <a href="c05.xhtml#Page_132">132</a>–135</p>
<p class="index1">Additive annotations, <a href="c08.xhtml#Page_244">244</a></p>
<p class="index1">Aggregating data, <a href="c08.xhtml#Page_232">232</a></p>
<p class="index1">AGT/HEED, <a href="c04.xhtml#Page_108">108</a>–109</p>
<p class="index1">Aha Moment:</p>
<p class="index2">connecting Hook and, <a href="c06.xhtml#Page_176">176</a></p>
<p class="index2">in Data Storytelling Arc, <a href="c06.xhtml#Page_163">163</a>–167</p>
<p class="index2">in data trailers, <a href="c06.xhtml#Page_181">181</a>, <a href="c06.xhtml#Page_182">182</a>, <a href="c09.xhtml#Page_292">292</a>–293</p>
<p class="index2">identified in storyboarding, <a href="c06.xhtml#Page_172">172</a>–173</p>
<p class="index2">initial interest generated by, <a href="c06.xhtml#Page_178">178</a></p>
<p class="index2">in manufacturing gross margin story, <a href="c09.xhtml#Page_295">295</a></p>
<p class="index2">in Rosling story, <a href="c09.xhtml#Page_273">273</a></p>
<p class="index2">in US education system story, <a href="c09.xhtml#Page_286">286</a></p>
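The problem, though, is that I can't work out the logic. It's complicated because I need recursion too, and I keep getting a NoneType error (described below). If I take out the block I'm stuck on, the rest of the code works. But how can I use BeautifulSoup to get only the next <p> tags that have the index2 class? The children, at least, are identified by index2. I just want to avoid scanning the whole document every time I need a few child terms. It seems like it should be straightforward, but no luck. Thanks for your help.
My code: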
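One caveat about filtering find_next_sibling() by class, shown as a minimal sketch on a few lines taken from the index above: the class filter does not stop at the next parent term, so a parent with 0 children would pick up a child that belongs to a later parent. The variable names here are only for illustration.

```python
from bs4 import BeautifulSoup

# A few entries from the index: "Actionable insights" has no children,
# but a later parent ("Aha Moment:") does.
sample = """
<p class="index1">Actionable insights, <a href="c03.xhtml#Page_51">51</a></p>
<p class="index1">Additive annotations, <a href="c08.xhtml#Page_244">244</a></p>
<p class="index1">Aha Moment:</p>
<p class="index2">connecting Hook and, <a href="c06.xhtml#Page_176">176</a></p>
"""
soup = BeautifulSoup(sample, "html.parser")
parent = soup.find("p", class_="index1")  # "Actionable insights"

# Filtering by class skips over the index1 tags in between ...
filtered = parent.find_next_sibling("p", class_="index2")
print(filtered.get_text())

# ... whereas the unfiltered call returns the immediate next <p>,
# whose class can then be inspected to decide whether to stop.
unfiltered = parent.find_next_sibling("p")
print(unfiltered["class"])
```

So the class check probably has to happen on the immediate next <p>, rather than inside the search itself.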
from bs4 import BeautifulSoup
import json
# convert html to bs4 object
def bs4_convert(file):
    with open(file, encoding='utf8') as fp:
        html = BeautifulSoup(fp, 'html.parser')
    return html
# create a tag
def p_parser(el, link_prefix):
    tags = []
    for p in el:
        tag = {
            'tag': p.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)]
        }
        # add all child terms of a parent term to a list
        # STUCK HERE: this block keeps failing with a NoneType error saying that
        # p.find_next_sibling('p')['class'] is not subscriptable, even though I check for None.
        children = []
        if(p.find_next_sibling('p') is not None):
            while(p.find_next_sibling('p')['class'] == ['index2']):
                next_child = p.find_next_sibling('p')
                if(next_child is not None):
                    children.append(next_child)
                    p = next_child
                else:
                    break
        # make child tags
        tag['children'] = p_parser(children, link_prefix)
        tags.append(tag)
    return tags
# loop through all indices
def html_parser(html, link_prefix):
    tags = []
    # extract index
    html.find('section', {'role': "doc-index"})
    # iterate over every indented letter in index
    letters = html.find_all('section')
    for letter in letters:
        tags += p_parser(letter.find_all('p'), link_prefix)
    return tags
# add the course name as parent to all tags
def add_course_tag(course_name, tags):
    complete_tags = {
        'tag': course_name,
        'definition': '',
        'source': tags
    }
    return complete_tags
# write tags to JSON file
def write_to_json(course_name, tags):
    # Serializing json
    json_object = json.dumps(tags, indent = 4)
    # Writing to course_name.json
    with open(course_name + '_tags.json', 'w') as outfile:
        outfile.write(json_object)
def main():
    # course information for the book
    course = {
        'course': 'data_storytelling', # exact course name
        'file': 'data_storytelling.html', # the html file you extracted
        'parse_type': 'index'
    }
    # this link prefix should be the same for all pages of one book
    prefix_id = 'effective-data-storytelling/9781119615712'
    link_prefix = 'https://learning.oreilly.com/library/view/' + prefix_id + '/'
    tags = []
    # parse the html
    html = bs4_convert(course['file'])
    # create tags
    tags = html_parser(html, link_prefix)
    # add course name as outermost tag
    tags = add_course_tag(course['course'], tags)
    # write results to json file
    write_to_json(course['course'], tags)
if __name__ == "__main__":
    main()
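A likely source of the NoneType error in p_parser: the `is not None` check runs once before the loop, but `p` is reassigned inside it, so once `p` is the last <p> among its siblings, `p.find_next_sibling('p')` in the `while` condition returns None and the `['class']` subscript fails. The sketch below re-checks for None on every step; `index2_run` is a hypothetical helper name, not something from the code above.

```python
from bs4 import BeautifulSoup

def index2_run(p):
    """Collect the consecutive index2 <p> tags that directly follow p."""
    children = []
    nxt = p.find_next_sibling("p")
    # Re-test for None each time, because the run may end at the last <p>.
    while nxt is not None and "index2" in nxt.get("class", []):
        children.append(nxt)
        nxt = nxt.find_next_sibling("p")
    return children

# The end of a section, where the last child has no following <p> at all.
sample = """
<section>
<p class="index1">Aha Moment:</p>
<p class="index2">in Rosling story, <a href="c09.xhtml#Page_273">273</a></p>
<p class="index2">in US education system story, <a href="c09.xhtml#Page_286">286</a></p>
</section>
"""
soup = BeautifulSoup(sample, "html.parser")
paragraphs = soup.find_all("p")
run = index2_run(paragraphs[0])
print([c.get_text() for c in run])
print(index2_run(paragraphs[-1]))  # last <p>: nothing follows it
```

The helper only walks as far as the end of the run, so it does not scan the rest of the document for each parent.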
Edit: I tried this code, but it never stops running on the command line (and nothing new gets written to the JSON file).
# create a tag
def p_parser(el, link_prefix):
    tags = []
    for p in el:
        tag = {
            'tag': p.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)]
        }
        # add all child terms of a parent term to a list
        children = []
        for child in p.next_siblings:
            if child.name == 'p' and 'index2' not in child['class']:
                break
            elif child.name == 'p' and 'index2' in child['class']:
                children.append(child)
        tags.append(tag)
        # make child tags
        tag['children'] = p_parser(children, link_prefix)
    return tags
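A possible reason this version seems to hang: every child passed to the recursive call is still attached to the page, so each child collects all of the index2 siblings after it as its own children and recurses again, which roughly doubles the work for every extra child under a parent. Since the index has only two levels, one alternative is a single pass with no recursion, sketched below. `make_tag`, `group_index`, and the example.invalid prefix are hypothetical names for illustration; the dictionary fields mirror the ones used above.

```python
import json
from bs4 import BeautifulSoup

def make_tag(p, link_prefix):
    """Build one tag dict with the same fields as in p_parser."""
    return {
        "tag": p.get_text(),
        "definition": "",
        "source": [
            {"title": a.get_text(), "href": link_prefix + a["href"]}
            for a in p.find_all("a", recursive=False)
        ],
        "children": [],
    }

def group_index(paragraphs, link_prefix):
    """Visit each <p> once: index2 goes under the most recent parent."""
    tags = []
    for p in paragraphs:
        tag = make_tag(p, link_prefix)
        if "index2" in p.get("class", []) and tags:
            tags[-1]["children"].append(tag)
        else:
            tags.append(tag)
    return tags

sample = """
<p class="index1">Action(s):</p>
<p class="index2">driving, <i>see</i> Driving action</p>
<p class="index2">in 4D Framework, <a href="c05.xhtml#Page_128">128</a>-132</p>
<p class="index1">Actionable insights, <a href="c03.xhtml#Page_51">51</a></p>
"""
soup = BeautifulSoup(sample, "html.parser")
nested = group_index(soup.find_all("p"), "https://example.invalid/")
print(json.dumps(nested, indent=4))
```

Each <p> is handled exactly once, so the running time grows linearly with the size of the index, and children are no longer repeated as top-level entries.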
【Discussion】:
Tags: python beautifulsoup