[Posted]: 2016-08-01 17:26:45
[Question]:
I have a data frame (dtetags.df) whose Date column contains many repeated dates:
dtetags.df$Date
"2016-07-22" "2016-07-22" "2016-07-21" "2016-07-21" "2016-07-20" "2016-07-20" "2016-07-19" "2016-07-19" "2016-07-18" "2016-07-18" "2016-07-15" "2016-07-15" "2016-07-15" "2016-07-14"
"2016-07-14" "2016-07-13" "2016-07-13" "2016-07-13" "2016-07-12" "2016-07-12" "2016-07-12" "2016-07-12" "2016-07-11" "2016-07-11" "2016-07-11" "2016-07-11" "2016-07-08" "2016-07-08"
"2016-07-08" "2016-07-07" "2016-07-07" "2016-07-07" "2016-07-07" "2016-07-06" "2016-07-06" "2016-07-05" "2016-07-05" "2016-07-05" "2016-07-05" "2016-07-01" "2016-07-01" "2016-06-30"
"2016-06-30" "2016-06-29" "2016-06-29" "2016-06-29" "2016-06-29" "2016-06-29" "2016-06-28" "2016-06-28" "2016-06-28" "2016-06-27" "2016-06-27" "2016-06-27" "2016-06-24" "2016-06-24"
"2016-06-23" "2016-06-23" "2016-06-22" "2016-06-22" "2016-06-21" "2016-06-21" "2016-06-20" "2016-06-20" "2016-06-17" "2016-06-17" "2016-06-16" "2016-06-16" "2016-06-15" "2016-06-15"
"2016-06-14" "2016-06-13" "2016-06-13" "2016-06-10" "2016-06-10" "2016-06-09" "2016-06-09" "2016-06-09" "2016-06-09" "2016-06-08" "2016-06-08" "2016-06-07" "2016-06-07" "2016-06-06"
"2016-06-06" "2016-06-06" "2016-06-01" "2016-06-01" "2016-05-29" "2016-05-29" "2016-05-27" "2016-05-27" "2016-05-26" "2016-05-26" "2016-05-25" "2016-05-25" "2016-05-24" "2016-05-23"
"2016-05-23" "2016-05-20"
It also has several binary tag columns that indicate whether a post with that tag was published on that date, for example:
dtetags.df$Technology
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "1" "0" "0" "0" "0" "1" "1" "0" "1" "0" "1"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "1" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "1" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
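Following this question, I am trying to use ddply(dtetags.df,"Date",numcolwise(sum)), but instead of the summed data it returns the message <0 rows> (or 0-length row.names). I have tried many different ways of writing the ddply call, but I cannot get it to work.
The ideal output would look like this: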
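One thing worth checking, offered as a guess rather than a confirmed diagnosis: numcolwise only applies its function to numeric columns, so if the tag columns are stored as text (as the quoted "0" and "1" values above suggest), there may be nothing left for it to sum. A minimal base R sketch of that check, using a hypothetical toy.df in place of dtetags.df:

```r
# toy.df is a hypothetical miniature of dtetags.df with the tag stored as text
toy.df <- data.frame(
  Date = c("2016-07-06", "2016-07-06", "2016-07-05"),
  Technology = c("0", "1", "0"),
  stringsAsFactors = FALSE
)

# Class of every column in the data frame
col.classes <- sapply(toy.df, class)
print(col.classes)

# Names of the columns that a numeric-only summary could operate on
num.cols <- names(toy.df)[sapply(toy.df, is.numeric)]
print(length(num.cols))
```

If the second print shows zero numeric columns on the real data, converting the tag columns before calling ddply would be the first thing to try.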
Date Technology
1 2016-07-22 0
2 2016-07-21 0
3 2016-07-20 0
4 2016-07-19 0
5 2016-07-18 0
6 2016-07-15 0
7 2016-07-14 0
8 2016-07-13 0
9 2016-07-12 0
10 2016-07-11 0
11 2016-07-08 0
12 2016-07-07 0
13 2016-07-06 1
14 2016-07-05 0
15 2016-07-01 2
16 2016-06-30 1
17 2016-06-29 1
18 2016-06-28 0
19 2016-06-27 0
20 2016-06-24 1
21 2016-06-23 0
22 2016-06-22 0
23 2016-06-21 0
24 2016-06-20 0
25 2016-06-17 0
26 2016-06-16 0
27 2016-06-15 0
28 2016-06-14 1
29 2016-06-13 0
30 2016-06-10 0
31 2016-06-09 0
32 2016-06-08 0
33 2016-06-07 0
34 2016-06-06 0
35 2016-06-01 0
36 2016-05-29 0
37 2016-05-27 0
38 2016-05-26 0
39 2016-05-25 0
40 2016-05-24 0
41 2016-05-23 0
42 2016-05-20 0
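Is there something obvious that I am doing wrong?
Converting from factor to numeric
I removed the Date column, applied data.frame(apply(dtetags.df, 2, function(x) as.numeric(as.character(x)))) to the rest of the data frame, and then added the Date column back.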
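A column-wise alternative to the apply call, sketched under the assumption that every column other than Date is a 0/1 tag. lapply converts the columns in place, so Date never has to be removed, and aggregate from base R (swapped in here for ddply) then sums each tag per date. The names toy.df, tag.cols and daily.df are hypothetical and used only for illustration:

```r
# toy.df is a hypothetical miniature of dtetags.df with the tag stored as text
toy.df <- data.frame(
  Date = c("2016-07-06", "2016-07-06", "2016-07-05", "2016-07-01", "2016-07-01"),
  Technology = c("0", "1", "0", "1", "1"),
  stringsAsFactors = FALSE
)

# Every column except Date is assumed to be a tag column
tag.cols <- setdiff(names(toy.df), "Date")

# Convert tag columns to numeric; as.character() also guards against factors
toy.df[tag.cols] <- lapply(toy.df[tag.cols],
                           function(x) as.numeric(as.character(x)))

# Sum each tag column within each date
daily.df <- aggregate(toy.df[tag.cols], by = list(Date = toy.df$Date), FUN = sum)

# aggregate() returns dates in ascending order; put the newest date first
daily.df <- daily.df[order(as.character(daily.df$Date), decreasing = TRUE), ]
print(daily.df)
```

Because tag.cols is derived from the column names, the same code covers any number of tag columns without listing them.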
dput(dtetags.df)
structure(list(Date = c("2016-07-22", "2016-07-22", "2016-07-21",
"2016-07-21", "2016-07-20", "2016-07-20", "2016-07-19", "2016-07-19",
"2016-07-18", "2016-07-18", "2016-07-15", "2016-07-15", "2016-07-15",
"2016-07-14", "2016-07-14", "2016-07-13", "2016-07-13", "2016-07-13",
"2016-07-12", "2016-07-12", "2016-07-12", "2016-07-12", "2016-07-11",
"2016-07-11", "2016-07-11", "2016-07-11", "2016-07-08", "2016-07-08",
"2016-07-08", "2016-07-07", "2016-07-07", "2016-07-07", "2016-07-07",
"2016-07-06", "2016-07-06", "2016-07-05", "2016-07-05", "2016-07-05",
"2016-07-05", "2016-07-01", "2016-07-01", "2016-06-30", "2016-06-30",
"2016-06-29", "2016-06-29", "2016-06-29", "2016-06-29", "2016-06-29",
"2016-06-28", "2016-06-28", "2016-06-28", "2016-06-27", "2016-06-27",
"2016-06-27", "2016-06-24", "2016-06-24", "2016-06-23", "2016-06-23",
"2016-06-22", "2016-06-22", "2016-06-21", "2016-06-21", "2016-06-20",
"2016-06-20", "2016-06-17", "2016-06-17", "2016-06-16", "2016-06-16",
"2016-06-15", "2016-06-15", "2016-06-14", "2016-06-13", "2016-06-13",
"2016-06-10", "2016-06-10", "2016-06-09", "2016-06-09", "2016-06-09",
"2016-06-09", "2016-06-08", "2016-06-08", "2016-06-07", "2016-06-07",
"2016-06-06", "2016-06-06", "2016-06-06", "2016-06-01", "2016-06-01",
"2016-05-29", "2016-05-29", "2016-05-27", "2016-05-27", "2016-05-26",
"2016-05-26", "2016-05-25", "2016-05-25", "2016-05-24", "2016-05-23",
"2016-05-23", "2016-05-20"), `Technology` = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Date",
"Technology"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -100L))
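[Comments]:
- Please show a small reproducible example using dput, together with your expected output.
- Your input and expected output appear to have different values. Perhaps library(dplyr); dtetags.df %>% group_by(Date) %>% mutate(new = row_number() * as.numeric(as.character(Technology)))
- Is there a general solution? That is what I was trying to achieve by not specifying the columns. Also, I am a little confused about what you mean by the input and output being different. Thanks!
- @arebearit: If you mean applying the summary to all columns, you can use dplyr with summarise_each, but we are still trying to work out exactly what you want. A consistent example of input and output would help.
- I have corrected one inconsistency in the output, but the overall idea is that this ddply call takes every set of rows that share a date and sums across those rows to give a single combined value per date. Is that what you meant by the input and output being inconsistent?
Tags: r dplyr plyr consolidation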