【Question Title】: Store web scraping results in DataFrame or dictionary
【Posted】: 2019-01-10 14:24:45
【Question】:

I'm taking an online course and trying to automate capturing the course structure for my personal notes, which I keep locally in Markdown files.

Here is what the HTML for a sample chapter looks like:

  <!-- Header of the chapter -->
  <div class="chapter__header">
      <div class="chapter__title-wrapper">
        <span class="chapter__number">
          <span class="chapter-number">1</span>
        </span>
        <h4 class="chapter__title">
          Introduction to Experimental Design
        </h4>
          <span class="chapter__price">
            Free
          </span>
      </div>
      <div class="dc-progress-bar dc-progress-bar--small chapter__progress">
        <span class="dc-progress-bar__text">0%</span>
        <div class="dc-progress-bar__bar chapter__progress-bar">
          <span class="dc-progress-bar__fill" style="width: 0%;"></span>
        </div>
      </div>
  </div>
  <p class="chapter__description">
    An introduction to key parts of experimental design plus some power and sample size calculations.
  </p>
  <!-- !Header of the chapter -->

<!-- Body of the chapter -->
  <ul class="chapter__exercises hidden">
      <li class="chapter__exercise ">
        <a class="chapter__exercise-link" href="https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1">
          <span class="chapter__exercise-icon exercise-icon ">
            <img width="23" height="23" src="https://cdn.datacamp.com/main-app/assets/courses/icon_exercise_video-3b15ea50771db747f7add5f53e535066f57d9f94b4b0ebf1e4ddca0347191bb8.svg" alt="Icon exercise video" />
          </span>
          <h5 class="chapter__exercise-title" title='Intro to Experimental Design'>Intro to Experimental Design</h5>
          <span class="chapter__exercise-xp">
            50 xp
          </span>
</a>      </li>

So far I've used BeautifulSoup to extract all the relevant information:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.datacamp.com/courses/experimental-design-in-r'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

lesson_outline = soup.find_all(['h4', 'li'])

outline_list = []

for item in lesson_outline:
    attributes = item.attrs
    try:
        class_type = attributes['class'][0]
        if class_type == 'chapter__title':
            outline_list.append(item.text.strip())
        if class_type == 'chapter__exercise':
            lesson_name = item.find('h5').text
            lesson_link = item.find('a').attrs['href']
            outline_list.append(lesson_name)
            outline_list.append(lesson_link)
    except KeyError:
        pass

This gives me a list like this:

['Introduction to Experimental Design', 'Intro to Experimental Design', 'https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1',...]

My goal is to put everything into an .md file that looks something like this:

# Introduction to Experimental Design

* [Intro to Experimental Design](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1)
* [A basic experiment](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=2)

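My question is: what is the best way to structure this data so that I can easily access it later when writing the text file? Would a DataFrame with chapter, lesson, and lesson_link columns be better? A DataFrame with a MultiIndex? A nested dictionary? If it is a dictionary, how should I name the keys? Or is there another option I'm missing, such as some kind of database?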

Any ideas would be greatly appreciated!

【Question Discussion】:

    Tags: python dictionary dataframe web-scraping beautifulsoup


    【Solution 1】:

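    If I understand correctly, you are currently appending each element to the list outline_list in the order it appears. But you clearly have not one but three different kinds of data: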

    • chapter__title
    • chapter__exercise.name
    • chapter__exercise.link

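    Each title can have several exercises, and each exercise is always a pair of name and link. Since you also want to keep this structure for your text file, you can use any structure that represents this hierarchy. One example: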

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    from collections import OrderedDict
    
    url = 'https://www.datacamp.com/courses/experimental-design-in-r'
    html = urlopen(url)
    soup = BeautifulSoup(html, 'lxml')
    
    lesson_outline = soup.find_all(['h4', 'li'])
    
    # Using OrderedDict assures that the order of the result will be the same as in the source
    chapters = OrderedDict()   # {chapter: [(lesson_name, lesson_link), ...], ...}
    
    for item in lesson_outline:
        attributes = item.attrs
        try:
            class_type = attributes['class'][0]
            if class_type == 'chapter__title':
                chapter = item.text.strip()
                chapters[chapter] = []
            if class_type == 'chapter__exercise':
                lesson_name = item.find('h5').text
                lesson_link = item.find('a').attrs['href']
                chapters[chapter].append((lesson_name, lesson_link))
        except KeyError:
            pass
    

    From there, writing the text file should be straightforward:

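    with open('outline.md', 'w', encoding='utf-8') as f:  # 'outline.md' is an assumed file name
        for chapter, lessons in chapters.items():
            f.write(f'# {chapter}\n\n')  # chapter title as a heading
            for lesson_name, lesson_link in lessons:
                f.write(f'* [{lesson_name}]({lesson_link})\n')  # one bullet per lesson
            f.write('\n')  # blank line between chapters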
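
If a tabular layout is still wanted, as the question suggests with chapter, lesson, and lesson_link columns, the nested structure can be flattened into one row per lesson. The following is a minimal sketch, not part of the original answer: the sample data is hand-written from the question's example output, and the helper name flatten_outline is hypothetical.

```python
from collections import OrderedDict

BASE = ('https://campus.datacamp.com/courses/experimental-design-in-r/'
        'introduction-to-experimental-design')

# Hand-written sample in the shape {chapter: [(lesson_name, lesson_link), ...]}
chapters = OrderedDict()
chapters['Introduction to Experimental Design'] = [
    ('Intro to Experimental Design', BASE + '?ex=1'),
    ('A basic experiment', BASE + '?ex=2'),
]


def flatten_outline(chapters):
    """Return one dict per lesson with chapter, lesson and lesson_link keys."""
    return [
        {'chapter': chapter, 'lesson': name, 'lesson_link': link}
        for chapter, lessons in chapters.items()
        for name, link in lessons
    ]


rows = flatten_outline(chapters)
for row in rows:
    print(row['chapter'], '|', row['lesson'], '|', row['lesson_link'])
```

A list of dictionaries like rows is a common input for tabular tools. For instance, it can be passed to pandas.DataFrame(rows) or written out with csv.DictWriter, while the nested form stays more convenient for generating the Markdown outline.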
    

    【Discussion】:

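    • Thank you! I knew there had to be an elegant solution: dictionary > list > tuple. One question: why do you suggest OrderedDict over a regular dictionary? Does it have to do with how you access it through .items()?
    • Using OrderedDict ensures that the order of the result is the same as in the source. A standard dict does not guarantee the order in which you will iterate over its keys or items.
    • Got it. Thanks again!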
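
A note on the ordering discussed above, offered as a sketch and not as part of the original thread: since Python 3.7, plain dict objects are guaranteed to preserve insertion order, so on those versions a regular dict would also keep the chapters in source order. OrderedDict remains useful on older interpreters, and it differs in that equality between two OrderedDicts is order-sensitive. The chapter titles other than the first are made up for illustration.

```python
from collections import OrderedDict

# Hypothetical chapter titles, inserted in "source" order
titles = ['Introduction to Experimental Design', 'Chapter B', 'Chapter A']

plain = {}
ordered = OrderedDict()
for title in titles:
    plain[title] = []
    ordered[title] = []

# Both keep insertion order on Python 3.7+
plain_keeps_order = list(plain) == titles
ordered_keeps_order = list(ordered) == titles

# Equality: plain dicts ignore order, OrderedDicts do not
reversed_plain = {title: [] for title in reversed(titles)}
reversed_ordered = OrderedDict((title, []) for title in reversed(titles))
plain_equal = plain == reversed_plain          # same keys and values
ordered_equal = ordered == reversed_ordered    # same content, different order

print(plain_keeps_order, ordered_keeps_order, plain_equal, ordered_equal)
```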