Store web scraping results in a DataFrame or dictionary

[Question title]: Store web scraping results in DataFrame or dictionary
[Posted]: 2019-01-10 14:24:45
[Question description]:

I'm taking an online course and trying to automate capturing the course structure for my personal notes, which I keep locally in Markdown files.

Here is an example of what the HTML for a sample chapter looks like:

  <!-- Header of the chapter -->
  <div class="chapter__header">
      <div class="chapter__title-wrapper">
        <span class="chapter__number">
          <span class="chapter-number">1</span>
        </span>
        <h4 class="chapter__title">
          Introduction to Experimental Design
        </h4>
          <span class="chapter__price">
            Free
          </span>
      </div>
      <div class="dc-progress-bar dc-progress-bar--small chapter__progress">
        <span class="dc-progress-bar__text">0%</span>
        <div class="dc-progress-bar__bar chapter__progress-bar">
          <span class="dc-progress-bar__fill" style="width: 0%;"></span>
        </div>
      </div>
  </div>
  <p class="chapter__description">
    An introduction to key parts of experimental design plus some power and sample size calculations.
  </p>
  <!-- !Header of the chapter -->

<!-- Body of the chapter -->
  <ul class="chapter__exercises hidden">
      <li class="chapter__exercise ">
        <a class="chapter__exercise-link" href="https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1">
          <span class="chapter__exercise-icon exercise-icon ">
            <img width="23" height="23" src="https://cdn.datacamp.com/main-app/assets/courses/icon_exercise_video-3b15ea50771db747f7add5f53e535066f57d9f94b4b0ebf1e4ddca0347191bb8.svg" alt="Icon exercise video" />
          </span>
          <h5 class="chapter__exercise-title" title='Intro to Experimental Design'>Intro to Experimental Design</h5>
          <span class="chapter__exercise-xp">
            50 xp
          </span>
        </a>
      </li>

So far, I have used BeautifulSoup to extract all the relevant information:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.datacamp.com/courses/experimental-design-in-r'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

lesson_outline = soup.find_all(['h4', 'li'])

outline_list = []

for item in lesson_outline:
    attributes = item.attrs
    try:
        class_type = attributes['class'][0]
        if class_type == 'chapter__title':
            outline_list.append(item.text.strip())
        if class_type == 'chapter__exercise':
            lesson_name = item.find('h5').text
            lesson_link = item.find('a').attrs['href']
            outline_list.append(lesson_name)
            outline_list.append(lesson_link)
    except KeyError:
        pass

This gives me a list like this:

['Introduction to Experimental Design', 'Intro to Experimental Design', 'https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1',...]

My goal is to put everything into a .md file that looks something like this:

# Introduction to Experimental Design

* [Intro to Experimental Design](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1)
* [A basic experiment](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=2)

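My question is: what is the best way to structure this data so that I can easily access it later when writing the text file? Would it be better to have a DataFrame with the columns chapter, lesson, and lesson_link? A DataFrame with a MultiIndex? A nested dictionary? If it is a dictionary, how should I name the keys? Or is there another option I'm missing, such as some kind of database?

Any thoughts would be much appreciated!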

[Question discussion]:

Tags: python dictionary dataframe web-scraping beautifulsoup

[Solution 1]:

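If I read this correctly, you are currently appending every element to the list outline_list in the order in which it appears. But you clearly have not one but three different kinds of data: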

chapter__title
chapter__exercise.name
chapter__exercise.link

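Each title can have multiple exercises, and each exercise is always a pair of name and link. Since you also want to keep this structure for your text file, you can use any data structure that represents this hierarchy. One example: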

from urllib.request import urlopen
from bs4 import BeautifulSoup
from collections import OrderedDict

url = 'https://www.datacamp.com/courses/experimental-design-in-r'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

lesson_outline = soup.find_all(['h4', 'li'])

# OrderedDict keeps the chapters in the same order as in the source
# (plain dicts also preserve insertion order since Python 3.7)
chapters = OrderedDict()   # {chapter: [(lesson_name, lesson_link), ...], ...}

for item in lesson_outline:
    attributes = item.attrs
    try:
        class_type = attributes['class'][0]
        if class_type == 'chapter__title':
            chapter = item.text.strip()
            chapters[chapter] = []
        if class_type == 'chapter__exercise':
            lesson_name = item.find('h5').text
            lesson_link = item.find('a').attrs['href']
            chapters[chapter].append((lesson_name, lesson_link))
    except KeyError:
        pass
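
For the DataFrame option raised in the question, the same mapping can be flattened into one row per lesson. The following is a minimal sketch and not part of the original answer: `flatten_chapters` is a hypothetical helper name, the sample data is hand-written from the question's example rather than scraped, and the column names chapter, lesson and lesson_link are the ones proposed in the question.

```python
def flatten_chapters(chapters):
    """Turn {chapter: [(lesson_name, lesson_link), ...]} into a list of row dicts."""
    rows = []
    for chapter, lessons in chapters.items():
        for lesson_name, lesson_link in lessons:
            rows.append({
                'chapter': chapter,
                'lesson': lesson_name,
                'lesson_link': lesson_link,
            })
    return rows


# Hand-written sample data (an assumption; nothing is fetched from the site).
base = ('https://campus.datacamp.com/courses/experimental-design-in-r/'
        'introduction-to-experimental-design')
sample_chapters = {
    'Introduction to Experimental Design': [
        ('Intro to Experimental Design', base + '?ex=1'),
        ('A basic experiment', base + '?ex=2'),
    ],
}

rows = flatten_chapters(sample_chapters)
print(rows[0]['lesson'])
```

A list of dicts like this can be passed to pandas.DataFrame(rows) if pandas is available. If the only goal is writing the Markdown file, the nested dictionary is probably the simpler choice.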

From the chapters dictionary, writing the text file should be straightforward:

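# 'outline.md' is a placeholder filename; use whatever path suits your notes
with open('outline.md', 'w', encoding='utf-8') as f:
    for chapter, lessons in chapters.items():
        f.write(f'# {chapter}\n\n')  # write chapter title as a heading
        for lesson_name, lesson_link in lessons:
            f.write(f'* [{lesson_name}]({lesson_link})\n')  # write lesson as a linked bullet
        f.write('\n')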
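
To preview the result without touching the disk, the loop can also be wrapped in a function that returns the Markdown as a string. This is only a sketch under stated assumptions: `render_markdown` is a hypothetical name, the sample dictionary is hand-written from the question's example, and the heading and bullet layout follows the target format shown in the question.

```python
def render_markdown(chapters):
    """Return the course outline as a Markdown string."""
    lines = []
    for chapter, lessons in chapters.items():
        lines.append(f'# {chapter}')   # chapter title as a level-1 heading
        lines.append('')               # blank line between heading and list
        for lesson_name, lesson_link in lessons:
            lines.append(f'* [{lesson_name}]({lesson_link})')
        lines.append('')               # blank line after each chapter
    return '\n'.join(lines)


# Hand-written sample data (an assumption; nothing is fetched from the site).
base = ('https://campus.datacamp.com/courses/experimental-design-in-r/'
        'introduction-to-experimental-design')
sample_chapters = {
    'Introduction to Experimental Design': [
        ('Intro to Experimental Design', base + '?ex=1'),
        ('A basic experiment', base + '?ex=2'),
    ],
}

markdown = render_markdown(sample_chapters)
print(markdown)
```

Returning a string keeps the formatting logic separate from file handling, so the same text can be printed, checked, or written to a file later.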

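[Discussion]:

Thank you! I knew there had to be an elegant solution: dictionary > list > tuple. One question: why do you suggest an OrderedDict over a regular dictionary? Does it have to do with how you access it via .items()?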
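Using OrderedDict ensures that the order of the result is the same as in the source. A standard dict does not guarantee the order in which you iterate over its keys or items (insertion order only became a language guarantee in Python 3.7).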
Got it. Thanks again!