【Posted】: 2021-07-17 06:08:33
【Question】:
We are trying to collect general information about college basketball coaches. Here are two example pages I am trying to scrape:
- https://gobonnies.sbu.edu/sports/m-baskbl/coaches/index
- https://uofoathletics.com/sports/wbkb/coaches/index
Our ideal output is:
data.frame(
name = c('Mark Schmidt', 'Sean Neal', 'Matt Pappano', 'Steve Curran', 'Tray Woodall', NA, 'Dominique Broadus'),
title = c("Head Men's Basketball Coach", "Assistant Men's Basketball Coach", "Director Of Basketball Operations", "Associate Head Coach, Men's Basketball", "Assistant Men's Basketball Coach", "Head Women's Basketball Coach", "Assistant Women's Basketball Coach"),
email = c(NA, 'sneal@sbu.edu', 'mpappano@sbu.edu', 'scurran@sbu.edu', 'twoodall@sbu.edu', NA, 'dbroadus@ozarks.edu'),
phone = c('716-375-2207', '716-375-2257', '716-375-2218', '716-375-2258', '716-375-2259', '479-979-1325', '479-979-1325'),
stringsAsFactors = FALSE
)
               name                                  title               email        phone
1      Mark Schmidt            Head Men's Basketball Coach                <NA> 716-375-2207
2         Sean Neal       Assistant Men's Basketball Coach       sneal@sbu.edu 716-375-2257
3      Matt Pappano      Director Of Basketball Operations    mpappano@sbu.edu 716-375-2218
4      Steve Curran Associate Head Coach, Men's Basketball     scurran@sbu.edu 716-375-2258
5      Tray Woodall       Assistant Men's Basketball Coach    twoodall@sbu.edu 716-375-2259
6              <NA>          Head Women's Basketball Coach                <NA> 479-979-1325
7 Dominique Broadus     Assistant Women's Basketball Coach dbroadus@ozarks.edu 479-979-1325
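This is causing us problems for a couple of reasons:
- On both pages, the data is not stored in a table but in a separate div for each person.
- Some data is missing: two emails and one name.
Here is what we have so far: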
library(rvest)  # provides read_html(), html_nodes(), html_text() and re-exports %>%
# go to pages, grab person bios
page1 <- 'https://gobonnies.sbu.edu/sports/m-baskbl/coaches/index' %>% read_html()
page1_bios <- page1 %>% html_nodes('div.coach-bios .coach-bio .info')
page2 <- 'https://uofoathletics.com/sports/wbkb/coaches/index' %>% read_html()
page2_bios <- page2 %>% html_nodes('div.coach-bios .coach-bio .info')
# turn bios into 1-column dataframes (not really what we need)
page1_list <- lapply(page1_bios, function(x) paste(x %>% html_children() %>% html_text(), collapse = " "))
page1_bios_df <- unlist(page1_list) %>% as.data.frame()
page2_list <- lapply(page2_bios, function(x) paste(x %>% html_children() %>% html_text(), collapse = " "))
page2_bios_df <- unlist(page2_list) %>% as.data.frame()
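We are not very close to a solution, and in fact we are not sure this is even feasible. I think that, even if the column names end up wrong, we first need to get the data into a data frame and then inspect the contents of each column (e.g., look for the @ symbol for emails, digits for phone numbers, the word "coach" for titles, etc.) to try to name them correctly.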
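To make that idea concrete, here is a minimal sketch of the content-based classification in base R. It assumes each bio is kept as a character vector of its child texts (i.e. the result of `html_children() %>% html_text()` without the `paste()` step). The helper name `classify_bio`, the regular expressions, and the sample inputs are all hypothetical and only shaped like the desired output above; they are not taken from the actual pages.

```r
# Hypothetical helper: classify the text fragments of one bio by their content.
# The patterns below are illustrative assumptions, not derived from the real pages.
classify_bio <- function(fields) {
  fields <- trimws(fields)
  fields <- fields[nzchar(fields)]

  # Return the first fragment in `pool` matching `pattern`, or NA if none does.
  pick <- function(pattern, pool) {
    hit <- grep(pattern, pool, ignore.case = TRUE, value = TRUE)
    if (length(hit) > 0) hit[1] else NA_character_
  }

  email <- pick("@", fields)
  phone <- pick("[0-9]{3}-[0-9]{3}-[0-9]{4}", fields)
  # Look for a title only among fragments not already claimed.
  title <- pick("coach|director", setdiff(fields, c(email, phone)))
  # Whatever is left over is assumed to be the name.
  rest <- setdiff(fields, c(email, phone, title))
  name <- if (length(rest) > 0) rest[1] else NA_character_

  data.frame(name = name, title = title, email = email, phone = phone,
             stringsAsFactors = FALSE)
}

# Made-up inputs shaped like rows 2 and 6 of the desired output
bios <- list(
  c("Sean Neal", "Assistant Men's Basketball Coach", "sneal@sbu.edu", "716-375-2257"),
  c("Head Women's Basketball Coach", "479-979-1325")
)
coaches <- do.call(rbind, lapply(bios, classify_bio))
print(coaches)
```

Fields that match no pattern come back as NA, so a bio with a missing name or email still yields one complete row. The "leftover is the name" rule is the weakest assumption here and would likely need tightening against the real markup.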
【Discussion】:
Tags: r web-scraping data-manipulation rvest