【Question title】: How to download multiple files with the same name from an HTML page?
【Posted】: 2020-03-08 11:10:02
【Question】:

I want to download every file named "listings.csv.gz" for US cities from http://insideairbnb.com/get-the-data.html. I could do it by writing out each link by hand, but can it be done in a loop?

In the end I will keep only a few columns from each file and merge them into a single file.

Since @CodeNoob solved the problem, I want to share how it was solved:

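library(rvest)  # read_html(), html_nodes(), html_attr()
library(dplyr)  # %>%, select(), mutate()
library(purrr)  # map()

page <- read_html("http://insideairbnb.com/get-the-data.html")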

# Get all hrefs (i.e. all links present on the website)
links <- page %>%
  html_nodes("a") %>%
  html_attr("href")

# Filter for listings.csv.gz, USA cities, data for March 2019
wanted <- grep('listings.csv.gz', links)
USA <- grep('united-states', links)
wanted.USA = wanted[wanted %in% USA]
wanted.links <- links[wanted.USA]
wanted.links = grep('2019-03', wanted.links, value = TRUE)

wanted.cols = c("host_is_superhost", "summary", "host_identity_verified", "street", 
                "city", "property_type", "room_type", "bathrooms", 
                "bedrooms", "beds", "price", "security_deposit", "cleaning_fee", 
                "guests_included", "number_of_reviews", "instant_bookable", 
                "host_response_rate", "host_neighbourhood", 
                "review_scores_rating", "review_scores_accuracy","review_scores_cleanliness",
                "review_scores_checkin" ,"review_scores_communication", 
                "review_scores_location", "review_scores_value", "space", 
                "description", "host_id", "state", "latitude", "longitude")


read.gz.url <- function(link) {
  con <- gzcon(url(link))
  df  <- read.csv(textConnection(readLines(con)))
  close(con)
  # all_of() makes it explicit that wanted.cols is a character vector of column names
  df  <- df %>% select(all_of(wanted.cols)) %>%
    mutate(source.url = link)
  df
}

all.df = list()
for (i in seq_along(wanted.links)) {
  all.df[[i]] = read.gz.url(wanted.links[i])
}

all.df = map(all.df, as_tibble)
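
The final merge into a single file is not shown above. A minimal sketch of that step, assuming `all.df` is a list of data frames with compatible column types (the toy data and the output file name `listings_usa.csv.gz` are made up for illustration):

```r
library(dplyr)

# Toy stand-ins for the per-city data frames held in all.df (hypothetical values)
all.df <- list(
  data.frame(city = "Austin", price = "$100.00", source.url = "url-1",
             stringsAsFactors = FALSE),
  data.frame(city = "Boston", price = "$150.00", source.url = "url-2",
             stringsAsFactors = FALSE)
)

# Stack the per-city tables into one data frame
combined <- bind_rows(all.df)

# Write the merged table to one gzip-compressed CSV (file name is an assumption)
out.file <- file.path(tempdir(), "listings_usa.csv.gz")
con <- gzfile(out.file, "w")
write.csv(combined, con, row.names = FALSE)
close(con)
```

With the real files, `bind_rows()` may stop with an error if the same column was read as different types in different cities, in which case the columns would need converting to a common type first.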

【Question comments】:

Tags: r


【Solution 1】:

You can extract all the links, keep the ones that contain listings.csv.gz, and then download them in a loop:

library(rvest)
library(dplyr)

# Get all download links

page <- read_html("http://insideairbnb.com/get-the-data.html")

# Get all hrefs (i.e. all links present on the website)
links <- page %>%
  html_nodes("a") %>%
  html_attr("href")

# Filter for listings.csv.gz
wanted <- grep('listings.csv.gz', links)
wanted.links <- links[wanted]

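for (link in wanted.links) {
  con <- gzcon(url(link))   # decompress the gzip stream on the fly
  txt <- readLines(con)
  close(con)                # release the connection before the next download
  df <- read.csv(textConnection(txt))
  # Do what you want
}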
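
The filter can also be narrowed by country and date in one step. A small sketch with `grepl()`, using made-up example.com links in place of the scraped `links` vector; `fixed = TRUE` treats each pattern as a literal string, so the dots in `listings.csv.gz` do not act as regex wildcards:

```r
# Toy stand-in for the scraped hrefs (hypothetical URLs; NA mimics an <a> without href)
links <- c(
  "http://example.com/united-states/tx/austin/2019-03-12/data/listings.csv.gz",
  "http://example.com/united-states/tx/austin/2019-03-12/visualisations/listings.csv",
  "http://example.com/belgium/vlg/antwerp/2019-03-20/data/listings.csv.gz",
  "http://example.com/united-states/ma/boston/2019-02-09/data/listings.csv.gz",
  NA
)

# Keep only links that satisfy all three conditions; grepl() returns FALSE for NA
keep <- grepl("listings.csv.gz", links, fixed = TRUE) &
  grepl("united-states", links, fixed = TRUE) &
  grepl("2019-03", links, fixed = TRUE)

wanted.links <- links[keep]
```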

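Example: download and merge the files
To get the result you want, I suggest writing a download function that keeps only the columns you need, then combining the results into one data frame, for example: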

read.gz.url <- function(link) {
  con <- gzcon(url(link))
  df  <- read.csv(textConnection(readLines(con)))
  close(con)
  df  <- df %>% select(c('calculated_host_listings_count_shared_rooms', 'cancellation_policy' )) %>% # random columns I chose
    mutate(source.url = link) # You may need to remember the origin of each row
  df
}

all.df <- do.call('rbind', lapply(head(wanted.links,2), read.gz.url)) 

Note that I only tested this on the first two files, because they are very large.
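
Because the files are large and all share the name listings.csv.gz, it may help to save each one to disk under a unique name and reuse it on later runs. This is only a sketch: `fetch.cached`, `read.cached`, the cache directory, and the naming scheme are hypothetical helpers, not part of the answer above.

```r
# Download a file once and reuse the local copy on later runs.
# cache.dir and the file-naming scheme are assumptions for illustration.
fetch.cached <- function(link, cache.dir = file.path(tempdir(), "airbnb")) {
  dir.create(cache.dir, showWarnings = FALSE, recursive = TRUE)
  # Every file is called listings.csv.gz, so build a unique name from the whole URL
  dest <- file.path(cache.dir, gsub("[^A-Za-z0-9.]+", "_", link))
  if (!file.exists(dest)) {
    download.file(link, dest, mode = "wb", quiet = TRUE)  # binary mode keeps the gzip intact
  }
  dest
}

# Read a cached file and keep only the requested columns that are present.
# read.csv() decompresses a local gzip file transparently.
read.cached <- function(link, cols) {
  df <- read.csv(fetch.cached(link), stringsAsFactors = FALSE)
  df <- df[, intersect(cols, names(df)), drop = FALSE]
  df$source.url <- link  # remember where each row came from
  df
}
```

It could then be used in place of `read.gz.url`, for example `lapply(wanted.links, read.cached, cols = c("city", "price"))`, so that repeated runs skip the download.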

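【Comments】:

  • Thanks, very helpful. I also filtered them by country and date and ended up with a nice list for each city. The only problem is that R is too slow for this kind of thing.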