【Question title】: Problem with web scraping of required content from a URL link in R
【Posted】: 2020-10-24 16:28:25
【Question description】:

I am using a script to scrape the required content from a link that covers different subjects.

library(rvest)
url   <- "https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec"

query <- list(term_in = "202110", sel_subj = "dummy", sel_day = "dummy",
              sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
              sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
              sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = "ARCH",
              sel_crse = "",      sel_title = "",     sel_insm = "%",
              sel_from_cred = "", sel_to_cred = "",   sel_camp = "%",
              sel_levl = "%",     sel_ptrm = "%",     sel_instr = "%",
              sel_attr = "%",     begin_hh =  "0",    begin_mi = "0",
              begin_ap = "a",     end_hh = "0",       end_mi = "0",
              end_ap = "a")

In the query above, the second sel_subj value (here "ARCH") changes for each subject.
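A minimal sketch of how the per-subject query could be generated, assuming only the subject code varies between requests. The helper name `make_query` and the `subjects` vector are hypothetical, not part of the original post; the field names and values are copied from the query above.

```r
# Hypothetical helper (not from the post): POST body for one subject code.
# Only the second sel_subj entry varies; the duplicated field names are kept
# exactly as in the original query.
make_query <- function(subject, term = "202110") {
  list(term_in = term, sel_subj = "dummy", sel_day = "dummy",
       sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
       sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
       sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = subject,
       sel_crse = "",      sel_title = "",     sel_insm = "%",
       sel_from_cred = "", sel_to_cred = "",   sel_camp = "%",
       sel_levl = "%",     sel_ptrm = "%",     sel_instr = "%",
       sel_attr = "%",     begin_hh = "0",     begin_mi = "0",
       begin_ap = "a",     end_hh = "0",       end_mi = "0",
       end_ap = "a")
}

# Build one query per subject code (the codes here are examples).
subjects <- c("ARCH", "ART")
queries  <- lapply(subjects, make_query)
names(queries) <- subjects
```

Each element of `queries` could then be passed as `body` to `httr::POST()` in the same way as the single `query` object.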

html <- read_html(httr::POST(url, body = query))
classes <- html %>% html_nodes(xpath = "//th/a") %>% html_text()
instructor_nodes <- html %>% 
  html_nodes(xpath = "//td[@class='dddefault']/a[contains(@href, 'mailto')]")

instructors <- html_attr(instructor_nodes, "target") 
emails <- html_attr(instructor_nodes, "href")

length(classes)
[1] 32
length(instructors)
[1] 39
length(emails)
[1] 39

sq <- seq(max(length(classes), length(instructors), length(emails)))
data.frame(classes[sq], instructors[sq], emails[sq])

The result looks like this, which is wrong:

                                                classes.sq.      instructors.sq.                  emails.sq.
1   Fundamentals of Design Studio - 23838 - ARCH 1111 - 001 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
2   Fundamentals of Design Studio - 23839 - ARCH 1111 - 002     Pamela J. Hurley mailto:pjhurley@memphis.edu
3            Design Visualization - 11107 - ARCH 1113 - 001 Michael K. Chisamore mailto:mkchsmre@memphis.edu
4            Design Visualization - 18386 - ARCH 1113 - 002 Michael K. Chisamore mailto:mkchsmre@memphis.edu
5       History of Architecture 1 - 23218 - ARCH 1211 - 001     Pamela J. Hurley mailto:pjhurley@memphis.edu
6           Building Technology 2 - 23840 - ARCH 2412 - 001     Marika E. Snider mailto:mesnider@memphis.edu
7       Computer Apps in Design 2 - 11111 - ARCH 2612 - 001   Timothy E. Michael mailto:tmichael@memphis.edu
8                 Design Studio 2 - 11112 - ARCH 2712 - 001   Timothy E. Michael mailto:tmichael@memphis.edu
9                 Design Studio 2 - 15408 - ARCH 2712 - 002      Andrew M. Parks  mailto:amparks@memphis.edu
10  Survey of Interiors+Furniture - 25734 - ARCH 3213 - 001      Andrew M. Parks  mailto:amparks@memphis.edu
11  Determinants of Modern Design - 27436 - ARCH 3221 - 001     Michael D. Hagge  mailto:mdhagge@memphis.edu
12            Structural Design 2 - 23837 - ARCH 3322 - 001     Michael D. Hagge  mailto:mdhagge@memphis.edu
13          Professional Practice - 25097 - ARCH 3431 - 001      Andrew M. Parks  mailto:amparks@memphis.edu
14                Design Studio 4 - 11115 - ARCH 3714 - 001         Sonia Raheel  mailto:sraheel@memphis.edu
15                Design Studio 4 - 23221 - ARCH 3714 - 002     Pamela J. Hurley mailto:pjhurley@memphis.edu
16 Architecture Independent Study - 11117 - ARCH 4021 - 201   Jennifer L. Barker mailto:jlbrker1@memphis.edu
17             Sustainable Design - 19491 - ARCH 4421 - 001   Jennifer L. Barker mailto:jlbrker1@memphis.edu
18     Internship in Architecture - 21000 - ARCH 4430 - 001     Marika E. Snider mailto:mesnider@memphis.edu
19                Design Studio 6 - 11134 - ARCH 4716 - 001     Pamela J. Hurley mailto:pjhurley@memphis.edu
20             Sustainable Design - 19492 - ARCH 6421 - 001     Marika E. Snider mailto:mesnider@memphis.edu
21      Advanced Design Seminar 2 - 18387 - ARCH 7012 - 001     Marika E. Snider mailto:mesnider@memphis.edu
22    Contemporary Architecture 2 - 24104 - ARCH 7222 - 001     Pamela J. Hurley mailto:pjhurley@memphis.edu
23     Internship in Architecture - 19495 - ARCH 7430 - 001   Jennifer L. Barker mailto:jlbrker1@memphis.edu
24      Adv Professional Practice - 19496 - ARCH 7431 - 001 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
25       Advanced Design Studio 2 - 18389 - ARCH 7712 - 001     Michael D. Hagge  mailto:mdhagge@memphis.edu
26          Architecture Research - 25098 - ARCH 7930 - 001     Brian D. Andrews mailto:bdndrews@memphis.edu
27     Architecture Thesis Studio - 19499 - ARCH 7996 - 003 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
28     Architecture Thesis Studio - 19500 - ARCH 7996 - 004     Brian D. Andrews mailto:bdndrews@memphis.edu
29     Architecture Thesis Studio - 19501 - ARCH 7996 - 005      Andrew M. Parks  mailto:amparks@memphis.edu
30     Architecture Thesis Studio - 19502 - ARCH 7996 - 006     Michael D. Hagge  mailto:mdhagge@memphis.edu
31     Architecture Thesis Studio - 19503 - ARCH 7996 - 007     Brian D. Andrews mailto:bdndrews@memphis.edu
32     Architecture Thesis Studio - 20972 - ARCH 7996 - 008 Michael K. Chisamore mailto:mkchsmre@memphis.edu
33                                                     <NA>     Pamela J. Hurley mailto:pjhurley@memphis.edu
34                                                     <NA>   Jennifer L. Barker mailto:jlbrker1@memphis.edu
35                                                     <NA> Michael K. Chisamore mailto:mkchsmre@memphis.edu
36                                                     <NA>     Pamela J. Hurley mailto:pjhurley@memphis.edu
37                                                     <NA> Jennifer L. Thompson mailto:jlthmps5@memphis.edu
38                                                     <NA>     Brian D. Andrews mailto:bdndrews@memphis.edu
39                                                     <NA>     Marika E. Snider mailto:mesnider@memphis.edu

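But on the page itself the data looks different.
For example:
A few classes have no instructor or email at all (they are listed as TBA).

A few other classes have two, three, four, or more instructors.

And some classes list the same instructor multiple times.

For data like this, I would like my output to look like the following: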

                                                classes.sq.      instructors.sq.                  emails.sq.
1   Fundamentals of Design Studio - 23838 - ARCH 1111 - 001 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
2   Fundamentals of Design Studio - 23839 - ARCH 1111 - 002          TBA         
3            Design Visualization - 11107 - ARCH 1113 - 001 Michael K. Chisamore,Pamela J. Hurley mailto:mkchsmre@memphis.edu,pjhurley@memphis.edu
4            Design Visualization - 18386 - ARCH 1113 - 002 Pamela J. Hurley,Michael K. Chisamore mailto:pjhurley@memphis.edu,mkchsmre@memphis.edu
5       History of Architecture 1 - 23218 - ARCH 1211 - 001     Marika E. Snider mailto:mesnider@memphis.edu
6           Building Technology 2 - 23840 - ARCH 2412 - 001     Timothy E. Michael mailto:tmichael@memphis.edu

P.S. If the posted URL does not work, follow these steps:

In this link `https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_disp_dyn_sched` 
Select by term -> Spring Term 2021 (view only) -> Submit
Subject -> select ARCH Architecture -> scroll down and click Class Search

How can I handle missing data (TBA), multiple instructors, and the same instructor being listed multiple times?

【Question comments】:

    Tags: html r xml web-scraping rvest


    【Solution 1】:

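    The problem lies in how html_nodes() is used. Called on the whole document, it returns one flat set of every match, with no record of which class each match was found in. Because a class on this page can have several instructors or none at all, the matches cannot be paired with the class titles by position, so a more targeted approach is needed.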
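A small sketch of the difference, using a toy page rather than the real site. The markup below is a simplified assumption that only mimics the class names `ddtitle` and `dddefault`; the names and addresses in it are made up.

```r
library(rvest)

# Toy page (hypothetical markup, not the real Banner HTML): three classes,
# the second has no instructor and the third has two.
page <- read_html('
<table>
  <tr><th class="ddtitle"><a>Class A</a></th></tr>
  <tr><td class="dddefault"><a href="mailto:x@example.edu" target="Ann X">E</a></td></tr>
  <tr><th class="ddtitle"><a>Class B</a></th></tr>
  <tr><td class="dddefault">TBA</td></tr>
  <tr><th class="ddtitle"><a>Class C</a></th></tr>
  <tr><td class="dddefault">
    <a href="mailto:y@example.edu" target="Bob Y">E</a>
    <a href="mailto:z@example.edu" target="Cy Z">E</a>
  </td></tr>
</table>')

# Document-wide search: one flat node set, the grouping by class is lost.
flat <- page %>% html_nodes(xpath = "//a[contains(@href, 'mailto')]")
length(flat)

# Per-class search: select one cell per class first, then search inside it.
cells <- page %>% html_nodes("td.dddefault")
per_class <- lapply(seq_along(cells), function(i) {
  links <- html_nodes(cells[[i]], xpath = ".//a[contains(@href, 'mailto')]")
  html_attr(links, "target")
})
lengths(per_class)
```

The flat search finds three links for three classes, which looks aligned but is not: the per-class counts show that the second class has none and the third has two.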

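    In this code block we first find, for each class, the node that contains all the information we want. We then parse each node individually (inside the lapply function) to extract the instructors and emails, checking for empty fields along the way. Each data frame has one row per instructor, so a class with several instructors yields a data frame with several rows.

    We combine the list of per-class data frames into a single data frame (bind_rows), then consolidate the instructor and email results for each class.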

    library(rvest)
    library(dplyr)
    
    url   <- "https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec"
    
    query <- list(term_in = "202110", sel_subj = "dummy", sel_day = "dummy",
                  sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
                  sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
                  sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = "ARCH",
                  sel_crse = "",      sel_title = "",     sel_insm = "%",
                  sel_from_cred = "", sel_to_cred = "",   sel_camp = "%",
                  sel_levl = "%",     sel_ptrm = "%",     sel_instr = "%",
                  sel_attr = "%",     begin_hh =  "0",    begin_mi = "0",
                  begin_ap = "a",     end_hh = "0",       end_mi = "0",
                  end_ap = "a")
    
    html <- read_html(httr::POST(url, body = query))
    classes <- html %>% html_nodes("th.ddtitle") %>% html_text()
    
    # all td.dddefault cells; the class detail blocks are among them
    classinfo <- html %>% html_nodes(xpath = ".//tr/td[@class='dddefault']") 
    classinfo <- classinfo[nchar(html_text(classinfo)) > 50]   # keep only the long detail cells, drop the short extra matches
    
    classlink <- classinfo %>% html_nodes("a") %>% html_attr("href")  #find all links
    classlinktext <- classinfo %>% html_nodes("a") %>% html_text()    #find the link text
    classlink <- classlink[classlinktext=="View Catalog Entry"]       #keep only the links for "View Catalog Entry"
    
    dfs <- lapply(seq_along(classinfo), function(i) {
      # instructor links sit in the nested table.datadisplaytable of each class cell
      instructor_node <- classinfo[i] %>% html_nodes("table.datadisplaytable") %>% 
        html_nodes(xpath = ".//a[contains(@href, 'mailto')]")
      
      instructors <- html_attr(instructor_node, "target") 
      emails <- html_attr(instructor_node, "href")
      # if no instructor has been assigned, fill in placeholders
      if (length(instructors) == 0) {
        instructors <- "TBD"
        emails <- "NA"
      }
      # one row per instructor; classname and link are recycled
      data.frame(classname = classes[i], link = classlink[i], instructors, emails)
    })
       
    # merge the list into one data frame
    answer <- bind_rows(dfs)
    
    # consolidate the instructors within the same class
    finalanswer <- answer %>% 
      group_by(classname) %>% 
      summarize(instructors2 = paste(instructors, collapse = ", "), 
                emails = paste(emails, collapse = ", "))
    # the paste(instructors, collapse = ", ") could go inside the lapply 
    # loop, but doing it here adds some flexibility depending on whether
    # answer or finalanswer is the end result.
    head(finalanswer, 16)
    tail(finalanswer, 16)
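The consolidation step can be tried on its own with toy data. This is a sketch under assumptions: the data frame `toy` is invented to stand in for `answer`, and the `distinct()` call is an addition, not part of the answer above, meant to cover the question's case of the same instructor being listed several times for one class.

```r
library(dplyr)

# Toy input standing in for `answer`: one row per (class, instructor) pair.
# "Class D" lists the same instructor twice.
toy <- data.frame(
  classname   = c("Class A", "Class B", "Class C", "Class C", "Class D", "Class D"),
  instructors = c("Ann X", "TBD", "Bob Y", "Cy Z", "Ann X", "Ann X"),
  emails      = c("mailto:x@example.edu", "NA", "mailto:y@example.edu",
                  "mailto:z@example.edu", "mailto:x@example.edu",
                  "mailto:x@example.edu"),
  stringsAsFactors = FALSE
)

consolidated <- toy %>%
  distinct() %>%                       # drop fully duplicated rows
  group_by(classname) %>%
  summarize(instructors = paste(instructors, collapse = ", "),
            emails      = paste(emails, collapse = ", "),
            .groups = "drop")
consolidated
```

Removing the duplicates before grouping keeps the instructor and email columns aligned, since each row is dropped as a whole.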
    

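    【Comments】:

    • Please help. When I use sel_subj = "ART" in the query, I get an error: Error in xml_nodeset(NextMethod()) : Expecting an external pointer: [type=NULL]. Please let me know how to fix this. Thanks.
    • The error occurs because the vector index is larger than the number of elements in the vector. If you look, there are 120 classes but only 119 MeetingTimeNodes. The last class is a thesis and has no meeting time. The quick and dirty fix is to change the lapply to 1:length(MeetingTimeNodes). Another option is to find the class name inside the lapply loop, but that will take more time to figure out.
    • Yes, thank you. I found the one class and removed it from classes, and now I get clean output.
    • When I click on each class in each subject, I see an attribute, capacity, that shows a number. How can I extract it for every subject? Any help would be appreciated.
    • I made some revisions. The code now handles the case where the class information has no scheduled meeting time, as described in the earlier comments. I also extract the link to each class's additional information. You will now need to write a loop that visits each class's link and extracts the capacity. If you have trouble with that, I suggest creating a new question. These comments have drifted from the intent of the original question.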
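The length mismatch described in the comments can be guarded against before looping. Below is a base R sketch of the "quick and dirty" fix, with toy vectors standing in for the scraped titles and detail nodes. Pairing by position is only safe if the unmatched class really is the last one, which is an assumption here.

```r
# Toy stand-ins: three class titles but only two detail blocks.
classes   <- c("Class A", "Class B", "Thesis")
classinfo <- list("info A", "info B")

# Loop only over indices that exist in both vectors.
n <- min(length(classes), length(classinfo))
if (length(classes) != length(classinfo)) {
  message("titles and detail blocks differ in length; pairing only the first ", n)
}

rows <- lapply(seq_len(n), function(i) {
  data.frame(classname = classes[i], info = classinfo[[i]],
             stringsAsFactors = FALSE)
})
result <- do.call(rbind, rows)
result
```

Looking up the class title inside each detail node, as the comment suggests, avoids positional pairing altogether and is the more robust option.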