【Question title】: R Scraping IMDB: Better way to handle missing information?
【Posted】: 2020-09-13 11:26:48
【Question description】:

I followed this tutorial to scrape information from IMDB: https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/

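However, some of the data is missing on IMDB. The tutorial suggests finding the gaps by visual inspection and then writing code like the following: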

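# Positions (found by visual inspection) where a Metascore is missing
for (i in c(39, 73, 80, 89)) {
  # Split the vector just before position i
  a <- metascore_data[1:(i - 1)]
  b <- metascore_data[i:length(metascore_data)]
  # Insert a placeholder (the string "NA", as in the tutorial) at position i
  metascore_data <- append(a, list("NA"))
  # Reattach the tail after the placeholder
  metascore_data <- append(metascore_data, b)
}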

Is there a better way to handle this programmatically?

【Question discussion】:

  • What information do you want to scrape? Please also add the link you are scraping.
  • This is the link I used: imdb.com/search/title/…

Tags: r web-scraping missing-data rvest imdb


【Solution 1】:

The following works for me:

library(rvest)
URL <- 'https://www.imdb.com/search/title/?title_type=feature&online_availability=US/IMDbTV&start=1251&ref_=adv_nxt'
webpage <- read_html(URL)
genres <- webpage %>%
  html_nodes('span.genre') %>%
  html_text() %>%
  trimws()

This returns 50 values:

genres
# [1] "Comedy, Romance"              "Action, Crime, Drama"        
# [3] "Action, Horror, Sci-Fi"       "Action, Adventure, Thriller" 
# [5] "Adventure, Comedy, Family"    "Comedy"                      
# [7] "Action, Adventure, Thriller"  "Comedy, Drama, Romance"      
# [9] "Comedy"                       "Comedy"                      
#[11] "Action, Adventure, Drama"     "Action, Thriller"            
#[13] "Action, Crime, Thriller"      "Mystery, Thriller"           
#[15] "Crime, Drama, Thriller"       "Drama, Horror"               
#[17] "Animation, Drama, War"        "Drama, Thriller"             
#[19] "Action, Crime, Drama"         "Drama, Sci-Fi"               
#[21] "Adventure, Comedy, Family"    "Crime, Drama"                
#[23] "Action, Adventure, Thriller"  "Action, Adventure, Sci-Fi"   
#[25] "Thriller"                     "Comedy, Crime"               
#[27] "Comedy, Romance"              "Action, Biography, Drama"    
#[29] "Adventure, Comedy"            "Crime, Drama, Thriller"      
#[31] "Drama, Sci-Fi, Thriller"      "Comedy, Romance"             
#[33] "Action, Drama, Thriller"      "Action, Adventure, Sci-Fi"   
#[35] "Action, Crime, Drama"         "Action, Adventure, Drama"    
#[37] "Action, Thriller"             "Action, Drama, War"          
#[39] "Drama, Sci-Fi, Thriller"      "Animation, Adventure, Family"
#[41] "Drama, Romance"               "Action, Drama, Fantasy"      
#[43] "Action, Adventure, Fantasy"   "Comedy, Crime, Drama"        
#[45] "Action, Crime, Drama"         "Action, Adventure, Sci-Fi"   
#[47] "Drama, Romance"               "Animation, Family, Fantasy"  
#[49] "Action, Adventure, Fantasy"   "Mystery, Thriller"           
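
For fields that some titles lack (such as the Metascore in the question), a common rvest idiom is to select one container node per movie first and then call html_node() (singular) inside each container. html_node() returns exactly one result per container, and html_text() turns a missing match into NA, so the columns stay aligned without hard-coded positions. The sketch below is a minimal illustration on a small inline HTML snippet rather than the live page. The container class lister-item-content, the span.metascore selector, the h3 a title selector, and the helper extract_field() are assumptions for illustration and may not match the current IMDB markup.

```r
library(rvest)

# Stand-in for the IMDB results page so the sketch runs offline.
# Class names other than span.genre are assumptions about the page markup.
page <- read_html('
<html><body>
  <div class="lister-item-content">
    <h3><a>Film A</a></h3>
    <span class="genre"> Comedy, Romance </span>
    <span class="metascore"> 61 </span>
  </div>
  <div class="lister-item-content">
    <h3><a>Film B</a></h3>
    <span class="genre"> Drama </span>
  </div>
  <div class="lister-item-content">
    <h3><a>Film C</a></h3>
    <span class="metascore"> 48 </span>
  </div>
</body></html>')

# One node per movie
items <- html_nodes(page, "div.lister-item-content")

# Hypothetical helper: html_node() yields one result per item,
# and html_text() returns NA where the field is absent
extract_field <- function(items, css) {
  items %>% html_node(css) %>% html_text() %>% trimws()
}

movies <- data.frame(
  title     = extract_field(items, "h3 a"),
  genre     = extract_field(items, "span.genre"),
  metascore = as.numeric(extract_field(items, "span.metascore")),
  stringsAsFactors = FALSE
)

print(movies)
```

With the real page, the same pattern would apply after replacing the inline snippet with read_html(URL), provided the selectors match the page.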

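【Discussion】:

  • @Ronak_Shah I copied your code and ran it, but I still get only 48.
  • @Ronak_Shah code: URL <- 'imdb.com/search/title/…' > webpage <- read_html(URL) > genres <- webpage %>% + html_nodes('span.genre') %>% + html_text() %>% + trimws()
  • @Ronak_Shah code: [1] "Comedy" "Comedy, Drama, History" "Action" [4] "Action, Adventure, Drama" "Family" "Family" [7] "Drama" "Comedy, Family, Romance" "Family" [10] "Action" "Thriller" "Comedy, Drama" [13] "Action" "Drama" "Animation" [16] "Animation, Adventure, Family" "Animation" "Action, Romance, Sport" [19] "Drama, Fantasy, Mystery" "Adventure, Drama, Sci-Fi" "Drama, Thriller, Crime" [22] "Comedy" "Comedy, Family" "Drama, Family"
  • @Ronak_Shah code: [25] "Family, Fantasy, Musical" "Thriller" "Drama" [28] "Action, Crime, Drama" "Drama, Romance, Thriller" "Western" [31] "Drama, Family" "Drama" "Comedy, Family" [34] "Drama, History" "Drama" "Crime, Drama" [37] "Comedy, Drama, Romance" "Comedy" "Crime, Horror, Thriller" [40] "Drama, Mystery, Thriller" "Drama, Horror, Thriller" "Animation, Comedy" [43] "Action" "Comedy, Horror" "Comedy, Horror" [46] "Comedy" "Action" "Drama"
  • Does length(genres) return 48? That is strange, because it works for me as shown. Do you have a firewall or antivirus software that blocks scraping?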
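
One way to narrow down a 48-versus-50 mismatch like the one above is to compare the number of result containers with the number of genre matches and list the positions that have no genre. Whether missing genres actually explain the difference in this thread is not confirmed; this is only a diagnostic sketch. It runs on a small inline HTML snippet, and the container class lister-item-content is an assumption about the page markup.

```r
library(rvest)

# Inline stand-in for the results page; the second entry has no genre.
# The container class is an assumption, not taken from the thread.
page <- read_html('
<html><body>
  <div class="lister-item-content"><span class="genre"> Comedy </span></div>
  <div class="lister-item-content"><span class="runtime">90 min</span></div>
  <div class="lister-item-content"><span class="genre"> Drama, War </span></div>
</body></html>')

items <- html_nodes(page, "div.lister-item-content")

# Page-wide count: silently skips entries without a genre
n_genres <- length(html_nodes(page, "span.genre"))

# Per-entry lookup: one result per container, NA where the genre is absent
genre_per_item <- items %>% html_node("span.genre") %>% html_text() %>% trimws()

n_items    <- length(items)
missing_at <- which(is.na(genre_per_item))

cat("containers:", n_items, " genres found:", n_genres, "\n")
cat("positions without a genre:", missing_at, "\n")
```

If the two counts agree on the real page, the shortfall would have to come from somewhere else, for example the server returning a different page to different clients.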