【Posted】: 2015-10-02 15:47:15
【Question】:
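I am trying to write an R script that scrapes data from tables spread across multiple pages of a website. To do that, I first want to build a list of the specific pages to scrape. The page addresses follow the format "www.urlpart1/[year]/urlpart2/[page]", where [year] ranges from 2003 to 2015 (13 elements) and [page] runs from 1 to 281 in steps of 40 (8 elements), so the final list should contain 104 elements. Here is my code: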
#specify components of URLs
url1 <- "www.urlpart1/"
url2 <- "/urlpart2/"
#specify range of years to scrape
years <- as.list(seq(from = 2003, to = 2015, by = 1)) #13 elements
#specify specific pages within each year to scrape
pages <- as.list(seq(from = 1, to = 281, by = 40)) #8 elements
#specify length of final list of URLs for scraping
loops <- as.list(seq(from = 1, to = (length(years)*length(pages)), by = 1)) #104 elements
#create empty list for storing output of for-loop
list1 <- list()
#initialize loop
for (i in loops){
  for (j in years){
    for (k in pages){
      list1[[i]] <- paste0(url1,j,url2,k)
    }
  }
}
list1 #outputs 104 elements of last iteration of loop
The final list should contain 104 elements that look like this:
"www.urlpart1/2003/urlpart2/1",
"www.urlpart1/2003/urlpart2/41",
"www.urlpart1/2003/urlpart2/81",
"www.urlpart1/2003/urlpart2/121",
"www.urlpart1/2003/urlpart2/161",
"www.urlpart1/2003/urlpart2/201",
"www.urlpart1/2003/urlpart2/241",
"www.urlpart1/2003/urlpart2/281",
"www.urlpart1/2004/urlpart2/1",
"www.urlpart1/2004/urlpart2/41",
"www.urlpart1/2004/urlpart2/81",
"www.urlpart1/2004/urlpart2/121",
"www.urlpart1/2004/urlpart2/161",
"www.urlpart1/2004/urlpart2/201",
"www.urlpart1/2004/urlpart2/241",
"www.urlpart1/2004/urlpart2/281",
...
"www.urlpart1/2015/urlpart2/1",
"www.urlpart1/2015/urlpart2/41",
"www.urlpart1/2015/urlpart2/81",
"www.urlpart1/2015/urlpart2/121",
"www.urlpart1/2015/urlpart2/161",
"www.urlpart1/2015/urlpart2/201",
"www.urlpart1/2015/urlpart2/241",
"www.urlpart1/2015/urlpart2/281"
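Unfortunately, I get a list of the correct length, but every value is the one produced by the last iteration of the loop. Earlier threads on similar problems do not seem to cover writing to a list from inside nested loops. I am completely open to solutions that do not rely on for loops. I could easily do this through Excel's GUI, but I need to improve my coding skills to make the work more reproducible. Thanks!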
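For reference, here is a minimal sketch of two ways to get the intended 104 URLs, assuming the placeholder values from the question. The likely cause of the symptom is the outer loop over `i`: for each `i`, the two inner loops run to completion, so `list1[[i]]` is overwritten until it holds the last combination (2015, 281). The names `grid`, `urls`, `urls_loop`, `idx`, `yr`, and `pg` are illustrative and not part of the original code.

```r
# Components of the URLs (same placeholder values as in the question)
url1 <- "www.urlpart1/"
url2 <- "/urlpart2/"
years <- 2003:2015               # 13 values
pages <- seq(1, 281, by = 40)    # 8 values

# Option 1: no loop. expand.grid() builds every (page, year) combination;
# its first argument varies fastest, so pages cycle within each year.
grid <- expand.grid(page = pages, year = years)
urls <- paste0(url1, grid$year, url2, grid$page)

# Option 2: keep the two nested loops, but advance a running index
# instead of wrapping them in a third loop over 1:104.
urls_loop <- character(length(years) * length(pages))
idx <- 1
for (yr in years) {
  for (pg in pages) {
    urls_loop[idx] <- paste0(url1, yr, url2, pg)
    idx <- idx + 1
  }
}

# Both approaches should produce the same character vector
identical(urls, urls_loop)
```

Both versions return a character vector rather than a list; wrapping the result in `as.list()` would give a list if one is needed downstream.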
【Comments】: