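[Posted]: 2016-09-23 11:17:00
[Question]:
Is there a way to combine multiple CSV files into one large file without using read.csv/read_csv?
I want to combine all the tables (CSV) in a folder into a single CSV file, since each table represents a single month. The folder looks like this:
list.files(folder)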
[1] "2013-07 - Citi Bike trip data.csv" "2013-08 - Citi Bike trip data.csv" "2013-09 - Citi Bike trip data.csv"
[4] "2013-10 - Citi Bike trip data.csv" "2013-11 - Citi Bike trip data.csv" "2013-12 - Citi Bike trip data.csv"
[7] "2014-01 - Citi Bike trip data.csv" "2014-02 - Citi Bike trip data.csv" "2014-03 - Citi Bike trip data.csv"
[10] "2014-04 - Citi Bike trip data.csv" "2014-05 - Citi Bike trip data.csv" "2014-06 - Citi Bike trip data.csv"
[13] "2014-07 - Citi Bike trip data.csv" "2014-08 - Citi Bike trip data.csv" "201409-citibike-tripdata.csv"
[16] "201410-citibike-tripdata.csv" "201411-citibike-tripdata.csv" "201412-citibike-tripdata.csv"
[19] "201501-citibike-tripdata.csv" "201502-citibike-tripdata.csv" "201503-citibike-tripdata.csv"
[22] "201504-citibike-tripdata.csv" "201505-citibike-tripdata.csv" "201506-citibike-tripdata.csv"
[25] "201507-citibike-tripdata.csv" "201508-citibike-tripdata.csv" "201509-citibike-tripdata.csv"
[28] "201510-citibike-tripdata.csv" "201511-citibike-tripdata.csv" "201512-citibike-tripdata.csv"
[31] "201601-citibike-tripdata.csv" "201602-citibike-tripdata.csv" "201603-citibike-tripdata.csv"
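I tried the following and did get the data: a large list of 33 elements that takes up 3.6 GB. However, the whole process takes a while. Since the site is updated every month, the growing amount of data will make the merge even slower. Can someone help me combine all the data files without loading them into the environment? The data source can be found here: https://s3.amazonaws.com/tripdata/index.html.
library(readr)
filenames <- list.files(folder, full.names = TRUE)
data <- lapply(filenames, read_csv)  # a list with one data frame per file
The result looks like this, which is not the form I want. I want one big table with all the information combined.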
> head(data)
[[1]]
Source: local data frame [843,416 x 15]
   tripduration           starttime            stoptime start station id      start station name start station latitude
          (int)              (time)              (time)            (int)                   (chr)                  (dbl)
1           634 2013-07-01 00:00:00 2013-07-01 00:10:34              164         E 47 St & 2 Ave               40.75323
2          1547 2013-07-01 00:00:02 2013-07-01 00:25:49              388        W 26 St & 10 Ave               40.74972
3           178 2013-07-01 00:01:04 2013-07-01 00:04:02              293   Lafayette St & E 8 St               40.73029
4          1580 2013-07-01 00:01:06 2013-07-01 00:27:26              531  Forsyth St & Broome St               40.71894
5           757 2013-07-01 00:01:10 2013-07-01 00:13:47              382 University Pl & E 14 St               40.73493
6           861 2013-07-01 00:01:23 2013-07-01 00:15:44              511      E 14 St & Avenue B               40.72939
7           550 2013-07-01 00:01:59 2013-07-01 00:11:09              293   Lafayette St & E 8 St               40.73029
8           288 2013-07-01 00:02:16 2013-07-01 00:07:04              224   Spruce St & Nassau St               40.71146
9           766 2013-07-01 00:02:16 2013-07-01 00:15:02              432       E 7 St & Avenue A               40.72622
10          773 2013-07-01 00:02:23 2013-07-01 00:15:16              173      Broadway & W 49 St               40.76065
..          ...                 ...                 ...              ...                     ...                    ...
Variables not shown: start station longitude (dbl), end station id (int), end station name (chr), end station latitude (dbl), end
station longitude (dbl), bikeid (int), usertype (chr), birth year (chr), gender (int)
[[2]]
Source: local data frame [1,001,958 x 15]
   tripduration           starttime            stoptime start station id        start station name start station latitude
          (int)              (time)              (time)            (int)                     (chr)                  (dbl)
1           664 2013-08-01 00:00:00 2013-08-01 00:11:04              449           W 52 St & 9 Ave               40.76462
2          2115 2013-08-01 00:00:01 2013-08-01 00:35:16              254           W 11 St & 6 Ave               40.73532
3           385 2013-08-01 00:00:03 2013-08-01 00:06:28              460        S 4 St & Wythe Ave               40.71286
4           653 2013-08-01 00:00:10 2013-08-01 00:11:03              398  Atlantic Ave & Furman St               40.69165
5           954 2013-08-01 00:00:11 2013-08-01 00:16:05              319       Park Pl & Church St               40.71336
6           145 2013-08-01 00:00:37 2013-08-01 00:03:02              521           8 Ave & W 31 St               40.75045
7           331 2013-08-01 00:01:25 2013-08-01 00:06:56             2000  Front St & Washington St               40.70255
8           194 2013-08-01 00:01:26 2013-08-01 00:04:40              313 Washington Ave & Park Ave               40.69610
9           598 2013-08-01 00:01:40 2013-08-01 00:11:38              528           2 Ave & E 31 St               40.74291
10          360 2013-08-01 00:01:45 2013-08-01 00:07:45              500        Broadway & W 51 St               40.76229
..          ...                 ...                 ...              ...                       ...                    ...
Variables not shown: start station longitude (dbl), end station id (int), end station name (chr), end station latitude (dbl), end
station longitude (dbl), bikeid (int), usertype (chr), birth year (chr), gender (int)
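For reference, the "one big table" asked for above can be sketched by stacking the list elements row-wise. This is only a minimal sketch: the helper name read_all_trips is hypothetical, and reading every column as character is an assumption made so that files whose column types differ can still be stacked. It still loads everything into memory, so it does not address the speed concern.

```r
library(readr)
library(dplyr)

# Hypothetical helper: read every CSV in `folder` and stack the rows
# into a single data frame.
read_all_trips <- function(folder) {
  filenames <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
  # Assumption: reading all columns as character avoids type clashes
  # between files; convert the columns afterwards as needed.
  tables <- lapply(filenames, read_csv,
                   col_types = cols(.default = col_character()))
  # Stack the per-file tables; columns are matched by name.
  bind_rows(tables)
}
```

Calling read_all_trips(folder) would return one table in place of the 33-element list.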
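[Comments]:
-
When you say merge, do you mean a true merge or simply appending? Do you want to modify the data at all, or just paste the files together?
-
Appending them together works, as long as the repeated column-name rows are left out. Is there a way to paste the files together and get rid of the duplicated header rows? @jamieRowen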
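The append-and-drop-headers idea from this comment could look roughly like the sketch below. It copies lines from file to file in base R without parsing them into data frames. The helper name append_csv_files and the chunk_lines parameter are hypothetical, and the sketch assumes every file has exactly one header line and the same columns in the same order.

```r
# Hypothetical helper: append several CSV files into one output file,
# writing the header line only once.
append_csv_files <- function(filenames, out_path, chunk_lines = 100000L) {
  out <- file(out_path, open = "w")
  on.exit(close(out), add = TRUE)
  for (i in seq_along(filenames)) {
    con <- file(filenames[i], open = "r")
    # Assumption: the first line of every file is its header.
    header <- readLines(con, n = 1L)
    if (i == 1L) writeLines(header, out)
    # Copy the remaining lines in chunks so that only a small part
    # of each file is held in memory at a time.
    repeat {
      chunk <- readLines(con, n = chunk_lines)
      if (length(chunk) == 0L) break
      writeLines(chunk, out)
    }
    close(con)
  }
  invisible(out_path)
}
```

The output path should sit outside the input folder, otherwise a later list.files call would pick up the combined file as well.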