【Question title】:Finding the data corresponding to a specific selection in an html form [closed]
【Posted】:2021-02-14 01:14:15
【Question description】:

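I am trying to scrape data from the form located at http://appl101.lsu.edu/booklet2.nsf/Selector2?OpenForm. The form's action is "/booklet2.nsf/f5e6e50d1d1d05c4862584410071cd2e?CreateDocument". For each (semester, department) pair selected, we get a corresponding page containing data.

My goal is to write some Python code that finds the URL of the page for each (semester, department) pair. To start, I am trying to find the URL for one specific selection, for example (Fall 2020, Mathematics).

I am new to web scraping and know only some basic HTML. I would appreciate it if someone could point me in the right direction. Also, please explain in some detail what this form does.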

【Question comments】:

Tags: python html forms web-scraping


【Solution 1】:

You can use this example to get the URL for each pair (most of them return NoCourseDept, though):

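import requests
from bs4 import BeautifulSoup

# Page that renders the form, and the URL the form posts to (its "action")
base_url = 'http://appl101.lsu.edu/booklet2.nsf/Selector2?OpenForm'
post_url = 'http://appl101.lsu.edu/booklet2.nsf/f5e6e50d1d1d05c4862584410071cd2e?CreateDocument'

# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(requests.get(base_url).content, 'lxml')

# Collect the value of every element with a value attribute inside the SemesterDesc field
semesters = []
for s in soup.select('[name="SemesterDesc"] [value]'):
    semesters.append(s['value'])

# Collect the visible text of every option of the Department field
departments = []
for d in soup.select('[name="Department"] option'):
    departments.append(d.get_text(strip=True))

# Submit the form once per (semester, department) pair
for s in semesters:
    for d in departments:
        data = {
            '%%Surrogate_SemesterDesc': 1,
            'SemesterDesc': s,
            '%%Surrogate_Department': 1,
            'Department': d
        }
        r = requests.post(post_url, data=data)
        # r.url is the final URL after any redirects
        print('{:<30} {:<30} {}'.format(s, d, r.url))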

Prints:

...

Second Summer Module 2021      BUSINESS ADMINISTRATION        https://appl101.lsu.edu/booklet2.nsf/All/FFAC316D00E5F3D5862585C7002EF1AA?OpenDocument
Second Summer Module 2021      BUSINESS EDUCATION             https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      BUSINESS LAW                   https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      CHEMICAL ENGINEERING           https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      CHEMISTRY                      https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      CHILD AND FAMILY STUDIES       https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      CHINESE                        https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      CIVIL ENGINEERING              https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      CIVIL & ENVIRONMENTAL ENGINEER https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      CLASSICAL STUDIES              https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      COMMUNICATION DISORDERS        https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      COMMUNICATION STUDIES          https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      COMPARATIVE BIOMEDICAL SCIENCE https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      COMPARATIVE LITERATURE         https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      COMPUTER SCIENCE               https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      CONSTRUCTION MANAGEMENT        https://appl101.lsu.edu/booklet2.nsf/All/637EAD668A213EDC862585F200296FAE?OpenDocument
Second Summer Module 2021      CURRICULUM & INSTRUCTION       https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      DAIRY SCIENCE                  https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      DIGITAL MEDIA ARTS & ENGINEERI https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      DISASTER SCIENCE MANAGEMENT    https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      DOCTOR OF DESIGN               https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      ECONOMICS                      https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform
Second Summer Module 2021      EDUC LEADERSHIP RESEARCH COUNS https://appl101.lsu.edu/booklet2.nsf/All/B0D27015A5F630CF86258602002C263E?OpenDocument
Second Summer Module 2021      EDUCATION                      https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform

...
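
Since most pairs in the output above end at the NoCourseDept page, it may help to keep only the pairs that lead to an actual document. The following is a minimal sketch of that filtering step; `has_courses` and `keep_course_pages` are hypothetical helper names, and the check assumes that a URL containing "NoCourseDept" always means there is no data for that pair, which is inferred from the output above rather than confirmed by the site.

```python
def has_courses(url):
    """Return True unless the final URL is the NoCourseDept placeholder page.

    Assumption: any URL containing 'NoCourseDept' means no data for the pair.
    """
    return 'nocoursedept' not in url.lower()


def keep_course_pages(results):
    """Map (semester, department) -> url for pairs that have a real page.

    `results` is an iterable of (semester, department, url) tuples, for
    example collected in the loop above instead of being printed.
    """
    return {(s, d): u for s, d, u in results if has_courses(u)}


# Two rows copied from the printed output above
results = [
    ('Second Summer Module 2021', 'BUSINESS ADMINISTRATION',
     'https://appl101.lsu.edu/booklet2.nsf/All/FFAC316D00E5F3D5862585C7002EF1AA?OpenDocument'),
    ('Second Summer Module 2021', 'BUSINESS EDUCATION',
     'https://appl101.lsu.edu/booklet2.nsf/NoCourseDept?readform'),
]
pages = keep_course_pages(results)
```

Here `pages` holds only the BUSINESS ADMINISTRATION entry, keyed by its (semester, department) pair.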

【Comments】:

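【Solution 2】:

The form's action: it is just another page on the same server, only with a strange name.

On the scraping part: there is no unique URL for the page corresponding to each (semester, department) pair. What happens is that the semester and department you select are submitted to the server in a POST request, and the content of the page at the URL in the form's action varies depending on the submitted data (a dynamic web page).

Solution: one approach is to store the strange URL together with the (semester, department) pairs. When you need the page for a pair, send a POST request to that URL and supply the pair as key-value data in the request, where the keys are the names of the semester and department select inputs in the form. You should then receive the page as HTML, which you can scrape to extract the information you need.

For example, [key-value pair in the POST request] SemesterDesc=Fall 2020, because "SemesterDesc" is the name of the semester/year select input in the form; the same applies to the department input.

You can search for how to make the request.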
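
As a rough illustration of such a request for a single pair, here is a minimal sketch using the `requests` library. The field names are the ones used in Solution 1. The helper names `build_payload` and `fetch_page`, the `timeout` value, and the department spelling "MATHEMATICS" in the usage note are assumptions and should be checked against the actual form.

```python
POST_URL = ('http://appl101.lsu.edu/booklet2.nsf/'
            'f5e6e50d1d1d05c4862584410071cd2e?CreateDocument')


def build_payload(semester, department):
    """Return the form fields for one (semester, department) selection."""
    return {
        '%%Surrogate_SemesterDesc': 1,
        'SemesterDesc': semester,
        '%%Surrogate_Department': 1,
        'Department': department,
    }


def fetch_page(semester, department, timeout=30):
    """POST one selection and return (final URL, HTML text)."""
    # Third-party package; imported here so build_payload works without it
    import requests

    response = requests.post(
        POST_URL,
        data=build_payload(semester, department),
        timeout=timeout,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.url, response.text
```

Calling `fetch_page('Fall 2020', 'MATHEMATICS')` would then return the URL the server ends up at, together with the HTML to parse, provided the department value matches one of the options in the form.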

【Comments】:
