【Posted】: 2018-06-28 10:38:14
【Question】:
df <- structure(list(ID = c("1", "2", "3", "4", "5", "6"), Column1 = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), Column2 = c("2011", "2015", "2015", "2006, 2006, 2005, 2005, 2007",
"2014, 2011", "2007"), `Cut-Off` = c("2011", "2015", "2015",
"2005", "2011", "2007"), `2005` = c(NA, NA, NA, "30", "18", NA
), `2006` = c(NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_), `2007` = c("15", NA, "18", NA,
"30, 18", NA), `2008` = c("16", NA, NA, "30, 27", "18, 30", NA
), `2009` = c("15", NA, NA, "20", "30, 18", NA), `2010` = c(NA,
NA, NA, "30, 20", NA, NA), `2011` = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_),
`2012` = c(NA, NA, NA, "20, 30", NA, "26"), `2013` = c("15",
NA, "19", NA, NA, NA), `2014` = c(NA, NA, "18", NA, NA, NA
), `2015` = c(NA, NA, "18", NA, "18", NA), `2016` = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_)), .Names = c("ID", "Column1", "Column2", "Cut-Off",
"2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012",
"2013", "2014", "2015", "2016"), row.names = c(NA, 6L), class = "data.frame")
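Given the data frame above, I want R to look at the cut-off year (column 4) and then add two new columns at the end of the data frame: one with the total number of unique "identifiers" found across the cells of all years before the cut-off year, and one with the same total for the years after the cut-off year. Identifiers in the cut-off year's own column should not be counted.
The data frame below shows the desired output.
For example, in the first row the cut-off year is 2011, and the years 2007, 2008 and 2009 before it hold the identifiers 15, 16 and 15 respectively. The unique identifiers are therefore 15 and 16 (the second 15 is dropped), so the "Before" column is 2. After the cut-off year only 2013 has an identifier, so the "After" column is 1.
If a cell holds two or more identifiers (for example "30, 27" or "30, 18" in rows 4 and 5), they should still be treated as separate, comma-separated identifiers.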
df_solution <- structure(list(ID = c("1", "2", "3", "4", "5", "6"), Column1 = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), Column2 = c("2011", "2015", "2015", "2006, 2006, 2005, 2005, 2007",
"2014, 2011", "2007"), `Cut-Off` = c("2011", "2015", "2015",
"2005", "2011", "2007"), `2005` = c(NA, NA, NA, "30", "18", NA
), `2006` = c(NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_), `2007` = c("15", NA, "18", NA,
"30, 18", NA), `2008` = c("16", NA, NA, "30, 27", "18, 30", NA
), `2009` = c("15", NA, NA, "20", "30, 18", NA), `2010` = c(NA,
NA, NA, "30, 20", NA, NA), `2011` = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_),
`2012` = c(NA, NA, NA, "20, 30", NA, "26"), `2013` = c("15",
NA, "19", NA, NA, NA), `2014` = c(NA, NA, "18", NA, NA, NA
), `2015` = c(NA, NA, "18", NA, "18", NA), `2016` = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), Before = c(2, 0, 2, 0, 2, 0), After = c(1,
0, 0, 3, 1, 1)), .Names = c("ID", "Column1", "Column2", "Cut-Off",
"2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012",
"2013", "2014", "2015", "2016", "Before", "After"), row.names = c(NA,
6L), class = "data.frame")
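One way the counting rule described above could be sketched in base R is shown here. This is only an illustration, not the accepted method: the helper names `count_unique_ids` and `add_before_after` and the data frame `demo` are hypothetical, `demo` is a compact rebuild of the relevant columns of `df`, and year columns are assumed to be exactly those whose names are four digits.

```r
# Hypothetical helper: count the unique comma-separated identifiers
# found in a character vector of cells (NA cells are ignored).
count_unique_ids <- function(cells) {
  cells <- cells[!is.na(cells)]
  if (length(cells) == 0) return(0)
  ids <- trimws(unlist(strsplit(cells, ",", fixed = TRUE)))
  as.numeric(length(unique(ids[nzchar(ids)])))
}

# Hypothetical helper: append Before/After counts relative to the cut-off year.
# Assumption: year columns are the ones whose names are exactly four digits.
add_before_after <- function(data, cutoff_col = "Cut-Off") {
  year_cols <- grep("^[0-9]{4}$", names(data), value = TRUE)
  years <- as.integer(year_cols)
  cutoff <- as.integer(data[[cutoff_col]])

  # Count identifiers in row i, restricted to the year columns flagged by `keep`.
  count_side <- function(i, keep) {
    cells <- unlist(data[i, year_cols[keep], drop = FALSE], use.names = FALSE)
    count_unique_ids(cells)
  }
  rows <- seq_len(nrow(data))
  # Strict comparisons leave the cut-off year itself out of both counts.
  data$Before <- vapply(rows, function(i) count_side(i, years < cutoff[i]), numeric(1))
  data$After  <- vapply(rows, function(i) count_side(i, years > cutoff[i]), numeric(1))
  data
}

# Compact stand-in for `df` (same ID, Cut-Off and year values as in the post).
demo <- data.frame(
  ID = as.character(1:6),
  `Cut-Off` = c("2011", "2015", "2015", "2005", "2011", "2007"),
  `2005` = c(NA, NA, NA, "30", "18", NA),
  `2006` = NA_character_,
  `2007` = c("15", NA, "18", NA, "30, 18", NA),
  `2008` = c("16", NA, NA, "30, 27", "18, 30", NA),
  `2009` = c("15", NA, NA, "20", "30, 18", NA),
  `2010` = c(NA, NA, NA, "30, 20", NA, NA),
  `2011` = NA_character_,
  `2012` = c(NA, NA, NA, "20, 30", NA, "26"),
  `2013` = c("15", NA, "19", NA, NA, NA),
  `2014` = c(NA, NA, "18", NA, NA, NA),
  `2015` = c(NA, NA, "18", NA, "18", NA),
  `2016` = NA_character_,
  check.names = FALSE, stringsAsFactors = FALSE
)

result <- add_before_after(demo)
print(result[c("ID", "Cut-Off", "Before", "After")])
```

On this stand-in data the `Before` and `After` columns should match the ones in `df_solution`. The same call should work on the original `df`, since its year columns are also character vectors.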
【Comments】:
-
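I would go so far as to say this data format is impossible to work with cleanly. Comparing strings that hold several numbers as text inside a cell, against columns whose names are numbers stored as text, against a column of cut-off numbers, is very hard. I would try reshaping the whole thing into a long-format dataset with ID/Year/Value running down the page and ID/Year repeated for each comma-separated value. I would also put ID/Cut-Off in a separate table that you can join against. Life is too short to keep fighting with it in its current form.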
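A rough base R sketch of the long-format idea in this comment might look like the following. It is an illustration under stated assumptions, not the commenter's code: `wide` is a reduced stand-in holding only rows 1 and 4 of the post and a subset of the year columns, and the names `cells`, `long`, `cutoffs` and `count_ids` are hypothetical.

```r
# Reduced stand-in for `df`: rows 1 and 4 of the post, a subset of year columns.
wide <- data.frame(
  ID = c("1", "4"),
  `Cut-Off` = c("2011", "2005"),
  `2007` = c("15", NA),
  `2008` = c("16", "30, 27"),
  `2009` = c("15", "20"),
  `2010` = c(NA, "30, 20"),
  `2013` = c("15", NA),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Assumption: year columns are the ones whose names are exactly four digits.
year_cols <- grep("^[0-9]{4}$", names(wide), value = TRUE)

# Step 1: one row per ID/Year cell (unlist() walks the columns one after another).
cells <- data.frame(
  ID = rep(wide$ID, times = length(year_cols)),
  Year = rep(as.integer(year_cols), each = nrow(wide)),
  Value = unlist(wide[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
cells <- cells[!is.na(cells$Value), ]

# Step 2: repeat ID/Year once per comma-separated identifier.
parts <- strsplit(cells$Value, ",", fixed = TRUE)
long <- data.frame(
  ID = rep(cells$ID, lengths(parts)),
  Year = rep(cells$Year, lengths(parts)),
  Value = trimws(unlist(parts)),
  stringsAsFactors = FALSE
)

# Step 3: keep ID/Cut-Off in its own table and join it onto the long data.
cutoffs <- data.frame(
  ID = wide$ID,
  CutOff = as.integer(wide[["Cut-Off"]]),
  stringsAsFactors = FALSE
)
long <- merge(long, cutoffs, by = "ID")

# Step 4: count unique identifiers per ID on each side of the cut-off year.
count_ids <- function(rows) {
  vapply(cutoffs$ID,
         function(id) length(unique(rows$Value[rows$ID == id])),
         integer(1), USE.NAMES = FALSE)
}
cutoffs$Before <- count_ids(long[long$Year < long$CutOff, ])
cutoffs$After  <- count_ids(long[long$Year > long$CutOff, ])
print(cutoffs)
```

Once the data is long, the comparison against the cut-off is an ordinary numeric filter on `Year`, and the per-ID counts in `cutoffs` can be merged back onto the wide table by `ID` if the original layout is still needed.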
Tags: r