【发布时间】:2019-01-10 14:24:45
【问题描述】:
我正在参加一门在线课程,我正在尝试自动化捕获课程结构的过程以用于我的个人笔记,我将其保存在本地的 Markdown 文件中。
这是一个示例章节:
下面是 HTML 外观的示例:
<!-- Header of the chapter -->
<div class="chapter__header">
<div class="chapter__title-wrapper">
<span class="chapter__number">
<span class="chapter-number">1</span>
</span>
<h4 class="chapter__title">
Introduction to Experimental Design
</h4>
<span class="chapter__price">
Free
</span>
</div>
<div class="dc-progress-bar dc-progress-bar--small chapter__progress">
<span class="dc-progress-bar__text">0%</span>
<div class="dc-progress-bar__bar chapter__progress-bar">
<span class="dc-progress-bar__fill" style="width: 0%;"></span>
</div>
</div>
</div>
<p class="chapter__description">
An introduction to key parts of experimental design plus some power and sample size calculations.
</p>
<!-- !Header of the chapter -->
<!-- Body of the chapter -->
<ul class="chapter__exercises hidden">
<li class="chapter__exercise ">
<a class="chapter__exercise-link" href="https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1">
<span class="chapter__exercise-icon exercise-icon ">
<img width="23" height="23" src="https://cdn.datacamp.com/main-app/assets/courses/icon_exercise_video-3b15ea50771db747f7add5f53e535066f57d9f94b4b0ebf1e4ddca0347191bb8.svg" alt="Icon exercise video" />
</span>
<h5 class="chapter__exercise-title" title='Intro to Experimental Design'>Intro to Experimental Design</h5>
<span class="chapter__exercise-xp">
50 xp
</span>
</a> </li>
到目前为止,我已经使用BeautifulSoup提取了所有相关信息:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.datacamp.com/courses/experimental-design-in-r'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
lesson_outline = soup.find_all(['h4', 'li'])
outline_list = []
for item in lesson_outline:
attributes = item.attrs
try:
class_type = attributes['class'][0]
if class_type == 'chapter__title':
outline_list.append(item.text.strip())
if class_type == 'chapter__exercise':
lesson_name = item.find('h5').text
lesson_link = item.find('a').attrs['href']
outline_list.append(lesson_name)
outline_list.append(lesson_link)
except KeyError:
pass
这给了我一个这样的列表:
['Introduction to Experimental Design', 'Intro to Experimental Design', 'https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1',...]
我的目标是将所有内容放入一个类似于以下内容的 .md 文件中:
# Introduction to Experimental Design
* [Intro to Experimental Design](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1)
* ['A basic experiment](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=2)
我的问题是:构建这些数据的最佳方式是什么,以便我以后在编写文本文件时可以轻松访问它?拥有一个包含chapter、lesson、lesson_link 列的 DataFrame 会更好吗?带有 MultiIndex 的 DataFrame?嵌套字典?如果它是一本字典,我应该如何命名这些键?还是我缺少另一种选择?某种数据库?
任何想法将不胜感激!
【问题讨论】:
标签: python dictionary dataframe web-scraping beautifulsoup