【Question title】: In R, get messy data scraped and organized into a data frame
【Posted】: 2021-07-17 06:08:33
【Question】:

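We are trying to collect general information about college basketball coaches. Here are two example pages I am trying to scrape (their URLs appear in the code further down):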

Our ideal output is:

data.frame(
  name = c('Mark Schmidt', 'Sean Neal', 'Matt Pappano', 'Steve Curran', 'Tray Woodall', NA, 'Dominique Broadus'),
  title = c("Head Men's Basketball Coach", "Assistant Men's Basketball Coach", "Director Of Basketball Operations", "Associate Head Coach, Men's Basketball", "Assistant Men's Basketball Coach", "Head Women's Basketball Coach", "Assistant Women's Basketball Coach"),
  email = c(NA, 'sneal@sbu.edu', 'mpappano@sbu.edu', 'scurran@sbu.edu', 'twoodall@sbu.edu', NA, 'dbroadus@ozarks.edu'),
  phone = c('716-375-2207', '716-375-2257', '716-375-2218', '716-375-2258', '716-375-2259', '479-979-1325', '479-979-1325'),
  stringsAsFactors = FALSE
)

               name                                  title               email        phone
1      Mark Schmidt            Head Men's Basketball Coach                <NA> 716-375-2207
2         Sean Neal       Assistant Men's Basketball Coach       sneal@sbu.edu 716-375-2257
3      Matt Pappano      Director Of Basketball Operations    mpappano@sbu.edu 716-375-2218
4      Steve Curran Associate Head Coach, Men's Basketball     scurran@sbu.edu 716-375-2258
5      Tray Woodall       Assistant Men's Basketball Coach    twoodall@sbu.edu 716-375-2259
6              <NA>          Head Women's Basketball Coach                <NA> 479-979-1325
7 Dominique Broadus     Assistant Women's Basketball Coach dbroadus@ozarks.edu 479-979-1325

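This is causing us problems for a few reasons:

  • On both pages, the data is not stored in a table but in a separate div for each person.
  • Some data is missing: two emails and one name.

Here is what we have so far: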

library(rvest)  # also re-exports the %>% pipe

# go to pages, grab person bios
page1 <- 'https://gobonnies.sbu.edu/sports/m-baskbl/coaches/index' %>% read_html()
page1_bios <- page1 %>% html_nodes('div.coach-bios .coach-bio .info')

page2 <- 'https://uofoathletics.com/sports/wbkb/coaches/index' %>% read_html()
page2_bios <- page2 %>% html_nodes('div.coach-bios .coach-bio .info')


# turn bios into 1-column dataframes (not really what we need)
page1_list <- lapply(page1_bios, function(x) paste(x %>% html_children() %>% html_text(), collapse = " "))
page1_bios_df <- unlist(page1_list) %>% as.data.frame()

page2_list <- lapply(page2_bios, function(x) paste(x %>% html_children() %>% html_text(), collapse = " "))
page2_bios_df <- unlist(page2_list) %>% as.data.frame()

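We are not very close, and in fact we are not sure this is even feasible. I think we first need to get the data into a data frame, even if the column names are wrong, and then inspect the contents of each column (e.g. look for the @ sign for emails, digits for phone numbers, the word "coach" for titles, etc.) to try to name them correctly.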
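The content-based idea described above could be sketched roughly as follows. This is only an illustration, not code from the question: `classify_field()` and `fields_to_row()` are hypothetical helper names, and the patterns (an `@` for emails, a `ddd-ddd-dddd` shape for phones, the words "coach" or "director" for titles) are assumptions based on the sample output shown earlier.

```r
# Guess what kind of value a single string holds, based on its content.
classify_field <- function(x) {
  if (grepl("@", x, fixed = TRUE)) return("email")
  if (grepl("[0-9]{3}-[0-9]{3}-[0-9]{4}", x)) return("phone")
  if (grepl("coach|director", x, ignore.case = TRUE)) return("title")
  "name"  # fallback: anything that matches no other pattern
}

# Turn the text pieces of one bio into a one-row data frame,
# leaving NA wherever a kind of field was not found.
fields_to_row <- function(fields) {
  fields <- trimws(fields)
  fields <- fields[!is.na(fields) & nzchar(fields)]
  kinds <- vapply(fields, classify_field, character(1), USE.NAMES = FALSE)
  pick <- function(kind) {
    hit <- fields[kinds == kind]
    if (length(hit) == 0) NA_character_ else hit[1]
  }
  data.frame(
    name  = pick("name"),
    title = pick("title"),
    email = pick("email"),
    phone = sub("^Phone:\\s*", "", pick("phone")),
    stringsAsFactors = FALSE
  )
}

row <- fields_to_row(c("Mark Schmidt", "Head Men's Basketball Coach",
                       "Phone: 716-375-2207"))
row
```

A guess like this is fragile (a title without "coach" or "director" would be filed as a name), so selecting each field by its position in the markup is usually the safer route when the pages share a structure.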

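【Discussion】:

    Tags: r web-scraping data-manipulation rvest


    【Solution 1】:

    One option to achieve your desired result could look like this. Basically, my approach uses specific CSS selectors to extract the desired pieces of information one by one: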

    library(rvest)
    library(magrittr)
    
    # go to pages, grab person bios
    page1 <- 'https://gobonnies.sbu.edu/sports/m-baskbl/coaches/index' %>% read_html()
    page1_bios <- page1 %>% html_nodes('div.coach-bios .coach-bio .info')
    
    page2 <- 'https://uofoathletics.com/sports/wbkb/coaches/index' %>% read_html()
    page2_bios <- page2 %>% html_nodes('div.coach-bios .coach-bio .info')
    
    get_bios <- function(x) {
      data.frame(
        name = x %>% html_node("span.name") %>% html_text(),
        title = x %>% html_node("p:nth-of-type(2)") %>% html_text(),
        email = x %>% html_node("p.email a") %>% html_attr("href"),
        phone = x %>% html_node("p:last-of-type") %>% html_text()
      )
    }
    
    
    # build a one-row data frame per bio, then bind all rows together
    page1_list <- lapply(page1_bios, get_bios)
    page2_list <- lapply(page2_bios, get_bios)
    
    bios_df <- do.call("rbind", c(page1_list, page2_list))
    
    bios_df$email <- gsub("^mailto:(.*)$", "\\1", bios_df$email)
    bios_df$phone <- gsub("^Phone:\\s(.*)$", "\\1", bios_df$phone)
    
    bios_df
    #>                name                                  title               email
    #> 1      Mark Schmidt            Head Men's Basketball Coach                <NA>
    #> 2      Steve Curran Associate Head Coach, Men's Basketball     scurran@sbu.edu
    #> 3         Sean Neal       Assistant Men's Basketball Coach       sneal@sbu.edu
    #> 4      Tray Woodall       Assistant Men's Basketball Coach    twoodall@sbu.edu
    #> 5      Matt Pappano      Director Of Basketball Operations    mpappano@sbu.edu
    #> 6                            Head Women's Basketball Coach                <NA>
    #> 7 Dominique Broadus     Assistant Women's Basketball Coach dbroadus@ozarks.edu
    #>          phone
    #> 1 716-375-2207
    #> 2 716-375-2258
    #> 3 716-375-2257
    #> 4 716-375-2259
    #> 5 716-375-2218
    #> 6 479-979-1325
    #> 7 479-979-1325
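
The reason this approach copes with the missing emails can be shown on a small stand-in document. The HTML below is made up for illustration and only mimics the structure the selectors above assume. The sketch also turns the blank name (as in row 6 of the output) into `NA`, which is what the question asked for.

```r
library(rvest)

# Hypothetical markup: the second bio has an empty name and no email link.
page <- read_html('<div class="coach-bio"><div class="info">
  <span class="name">A Coach</span>
  <p class="email"><a href="mailto:a@example.edu">a@example.edu</a></p>
</div></div>
<div class="coach-bio"><div class="info">
  <span class="name"></span>
</div></div>')

bios <- html_nodes(page, ".coach-bio .info")

# html_node() (singular) yields one result per bio, so a missing link becomes NA
email <- bios %>% html_node("p.email a") %>% html_attr("href")
email <- sub("^mailto:", "", email)

# html_nodes() (plural) on the whole page drops the gap: only one match is found
n_links <- length(html_nodes(page, "p.email a"))

# convert blank names to NA
name <- bios %>% html_node("span.name") %>% html_text() %>% trimws()
name[!nzchar(name)] <- NA

data.frame(name, email)
```

Because every column is extracted per bio, all columns keep the same length and stay aligned row by row, even when a field is absent.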
    

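    【Discussion】:

    • Accepted, because get_bios is very useful, as are the gsub calls that clean up the columns. Really helpful, thanks!

    【Solution 2】:

    I could only open the first URL, so my solution covers that page only: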

    library(rvest)
    page1 <- 'https://gobonnies.sbu.edu/sports/m-baskbl/coaches/index' %>% read_html()
    
    name <- page1 %>% 
        html_nodes(css = "div.coach-bios-wrapper.clearfix span.name") %>%
        html_text()
    
    title <- page1 %>% html_nodes(css = "div.coach-bios-wrapper.clearfix > div > div > div > div > div > p:nth-child(2)") %>%
        html_text()
    
    email <- page1 %>% 
        html_nodes(css = "div.coach-bios-wrapper.clearfix > div > div > div > div > div") %>%
        html_text() %>%
        gsub(".*\n(.*@.*)\nPhone.*","\\1",.)
    
    # bios without an email keep their full text after gsub(); set those to NA
    email[!grepl("@", email)] <- NA
    
    phone <- page1 %>% 
        html_nodes(css = "div.coach-bios-wrapper.clearfix > div > div > div > div > div") %>%
        html_text() %>%
        gsub(".*\nPhone: (.*)\n.*","\\1",.)
    
    df <- data.frame(name,title,email,phone)
    # df$email[which(!grepl("@",df$email))] <- NA
    df
    #>           name                                  title            email
    #> 1 Mark Schmidt            Head Men's Basketball Coach             <NA>
    #> 2 Steve Curran Associate Head Coach, Men's Basketball  scurran@sbu.edu
    #> 3    Sean Neal       Assistant Men's Basketball Coach    sneal@sbu.edu
    #> 4 Tray Woodall       Assistant Men's Basketball Coach twoodall@sbu.edu
    #> 5 Matt Pappano      Director Of Basketball Operations mpappano@sbu.edu
    #>          phone
    #> 1 716-375-2207
    #> 2 716-375-2258
    #> 3 716-375-2257
    #> 4 716-375-2259
    #> 5 716-375-2218
    

    Created on 2021-07-17 by the reprex package (v2.0.0)
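
As a variation on the `gsub()` calls above, the matching piece can be pulled out directly with `regexpr()` and `regmatches()`, which avoids having to reset non-matching entries afterwards. This is a swapped-in technique, not the answer's method. `extract_first()` is a hypothetical helper, and the sample strings only assume that each bio's text has one field per line.

```r
# Return the first match of `pattern` in each element of `text`, or NA if none.
extract_first <- function(text, pattern) {
  m <- regexpr(pattern, text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)  # regmatches() returns only the hits
  out
}

# Assumed shape of the bio text: one field per line, email optional.
bio_text <- c(
  "Sean Neal\nAssistant Men's Basketball Coach\nsneal@sbu.edu\nPhone: 716-375-2257\n",
  "Mark Schmidt\nHead Men's Basketball Coach\nPhone: 716-375-2207\n"
)

email <- extract_first(bio_text, "[[:alnum:]._-]+@[[:alnum:].-]+")
phone <- extract_first(bio_text, "[0-9]{3}-[0-9]{3}-[0-9]{4}")

data.frame(email, phone)
```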

    【Discussion】:
