Fill rows downward for a column with NAs (with base R or data.table): answers

【Question title】: fill rows downward for a column with NAs (with R base or data.table)
【Posted】: 2013-09-29 21:14:42
【Question】:

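I'd like to use the Census's county-adjacency data, but I am stuck getting it into a usable form. The data come in four columns: first county, first code, second county, second code. The first-county columns are not repeated; instead they are blank (the name reads in as "" and the code as NA), as I am reading the file now: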

                     c1   cd1                    c2   cd2
1   Alamance County, NC 37001   Alamance County, NC 37001
2                          NA    Caswell County, NC 37033
3                          NA    Chatham County, NC 37037
4                          NA   Guilford County, NC 37081
5                          NA     Orange County, NC 37135
6                          NA   Randolph County, NC 37151
7                          NA Rockingham County, NC 37157
8  Alexander County, NC 37003  Alexander County, NC 37003
9                          NA   Caldwell County, NC 37027
10                         NA    Catawba County, NC 37035
11                         NA    Iredell County, NC 37097
12                         NA     Wilkes County, NC 37193
13 Alleghany County, NC 37005  Alleghany County, NC 37005
14                         NA       Ashe County, NC 37009
15                         NA      Surry County, NC 37171
16                         NA     Wilkes County, NC 37193
17                         NA    Grayson County, VA 51077
18     Anson County, NC 37007      Anson County, NC 37007
19                         NA Montgomery County, NC 37123
20                         NA   Richmond County, NC 37153

I happen to be interested only in the North Carolina portion of the data found at that link, part of which is what you see above:

#
nc_cc <- structure(list(c1 = c("Alamance County, NC", "", "", "", "", "", "", "Alexander County, NC", "", "", "", "", "Alleghany County, NC", "", "", "", "", "Anson County, NC", "", ""), cd1 = c(37001L, NA, NA, NA, NA, NA, NA, 37003L, NA, NA, NA, NA, 37005L, NA, NA, NA, NA, 37007L, NA, NA), c2 = c("Alamance County, NC", "Caswell County, NC", "Chatham County, NC", "Guilford County, NC", "Orange County, NC", "Randolph County, NC", "Rockingham County, NC", "Alexander County, NC", "Caldwell County, NC", "Catawba County, NC", "Iredell County, NC", "Wilkes County, NC", "Alleghany County, NC", "Ashe County, NC", "Surry County, NC", "Wilkes County, NC", "Grayson County, VA", "Anson County, NC", "Montgomery County, NC", "Richmond County, NC" ), cd2 = c(37001L, 37033L, 37037L, 37081L, 37135L, 37151L, 37157L, 37003L, 37027L, 37035L, 37097L, 37193L, 37005L, 37009L, 37171L, 37193L, 51077L, 37007L, 37123L, 37153L)), .Names = c("c1", "cd1", "c2", "cd2"), row.names = c(NA, 20L), class = "data.frame")
#

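I want a clean adjacency association (the county names are superfluous), so my desired output could take a number of forms: a data.frame, a list...

The crude solution I came up with (after a good deal of thought) is this: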

library(data.table)
DT <- data.table(nc_cc)
# each non-NA cd1 starts a new group; cd1[1] is the value that group should carry
DT[, list(cd1 = cd1[1], cd2), by = cumsum(!is.na(cd1))][, list(cd1, cd2)]

which gives

      cd1   cd2
 1: 37001 37001
 2: 37001 37033
 3: 37001 37037
 4: 37001 37081
 5: 37001 37135
 6: 37001 37151
 7: 37001 37157
 8: 37003 37003
 9: 37003 37027
10: 37003 37035
11: 37003 37097
12: 37003 37193
13: 37005 37005
14: 37005 37009
15: 37005 37171
16: 37005 37193
17: 37005 51077
18: 37007 37007
19: 37007 37123
20: 37007 37153
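For readers who want to see why the cumsum grouping works, here is a minimal base R sketch on a toy vector (the names x, g and filled are hypothetical, not from the question). It assumes the first element is not NA; with a leading NA the group id would be 0 and the indexing step would drop that element.

```r
# Toy stand-in for the cd1 column: a value followed by the NAs it should fill
x <- c(37001L, NA, NA, 37003L, NA, 37005L)

# Each non-NA entry bumps the running count, so every block of
# "value followed by NAs" shares one group id
g <- cumsum(!is.na(x))
print(g)        # 1 1 1 2 2 3

# The k-th non-NA value is the fill value for group k
filled <- x[!is.na(x)][g]
print(filled)   # 37001 37001 37001 37003 37003 37005
```

The same two lines would apply to nc_cc$cd1 directly, with no extra packages.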

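I tagged this with data.table because I used it in my solution above and I suspect that roll could do something nice here. Honestly, I have never understood the documentation for roll, so I am hoping to learn something here... So: can this be done better?

Edit: This question asks the same thing, so I am revising my question to: "Is there a better way to do this using data.table or base R (since I am averse to installing more packages)?"

【Comments】: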

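Cool! I got a downvote; I haven't had one of those since my first (swiftly retracted) question.
If anyone knows an elegant non-cumsum solution, it would be appreciated. Someday I may come back and offer a bounty. For now, I am marking @eddi's as the answer, although it is clear he did not read the question (which, you know, contains that answer).
Although the answers to the question you linked require the zoo package, it offers some nice features, such as maxgap and fromLast, that may come in handy. There is no reason you can't use na.locf with data.table, i.e. DT[, cd1 := na.locf(cd1)]
@GSee: Thanks for the pointer; maybe I should install it. I see that it has no dependencies and is kept up to date. I just want to avoid (i) learning to use and (ii) relying on something that may not be usable in a few years.
@Downvoters: If there is something wrong with this question, you can tell me. If you can convince me what is wrong with it, I can edit, close, or delete it...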
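The na.locf suggestion from the comments can be sketched as follows, assuming the zoo package is installed. The vector x is a hypothetical toy example with a leading NA, chosen to show why na.rm = FALSE matters when the result is assigned back with := (the default drops leading NAs and shortens the vector).

```r
library(zoo)

# Toy vector with a leading NA (hypothetical data, not from the question)
x <- c(NA, 37001L, NA, NA, 37003L, NA)

# Carry the last observation forward, keeping the original length
fwd <- na.locf(x, na.rm = FALSE)

# Carry the next observation backward instead
bwd <- na.locf(x, na.rm = FALSE, fromLast = TRUE)

# With the default na.rm = TRUE the leading NA is dropped
short <- na.locf(x)

print(fwd)
print(bwd)
print(length(short))
```

Inside a data.table this would read DT[, cd1 := na.locf(cd1, na.rm = FALSE)], which is safe even if the column starts with NA.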

Tags: r data.table census

【Solution 1】:

A pretty standard way of doing this is:

library(data.table)
dt = data.table(nc_cc)

# group rows by the running count of non-NA cd1, then overwrite cd1 in place
dt[, cd1 := cd1[1], by = cumsum(!is.na(cd1))]
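As an alternative that did not exist when this answer was written, newer data.table releases (1.12.4 and later, as far as I know) ship nafill(), which fills NAs directly. The sketch below uses a hypothetical toy table that mirrors the cd1/cd2 layout; it is not the answerer's method.

```r
library(data.table)

# Toy table (hypothetical), same shape as the code columns in the question
toy <- data.table(
  cd1 = c(37001L, NA, NA, 37003L, NA),
  cd2 = c(37001L, 37033L, 37037L, 37003L, 37027L)
)

# type = "locf" carries the last observation forward; := updates by reference
toy[, cd1 := nafill(cd1, type = "locf")]
print(toy)
```

nafill() was introduced for numeric columns, so a character column such as c1 (where the gaps are "" rather than NA) would still need the cumsum grouping or a similar approach.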

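【Comments】:

【Solution 2】:

I found the roll solution, based on @Arun's answer!

In my application, it is a lot more convoluted than the cumsum answer used by @eddi (...and by me, in stating the question):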

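DT <- data.table(nc_cc)
# add a row index i and key the table on it
setkey(DT[, i := .I], i)

# keep only rows that carry a county name, roll them forward over indices 1:20,
# then join back on i and overwrite c1 and cd1 with the rolled values (i.c1, i.cd1)
DT[
    DT[c1 != ""][J(1:20), roll = TRUE][, list(c1, cd1), keyby = i],
    `:=`(c1 = i.c1, cd1 = i.cd1)
]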

I learned about the i.name thing from @eddi's answer to another question of mine.

【Comments】:
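The rolling-join idea can be shown more compactly on a toy table. This is a minimal sketch, assuming a data.table version that supports the on= argument (1.9.6 or later, to my knowledge); toy, known and filled are hypothetical names, and this is a simplified variant rather than the exact code above.

```r
library(data.table)

# Toy table (hypothetical): i is the row index, cd1 has gaps to fill
toy <- data.table(i = 1:6, cd1 = c(37001L, NA, NA, 37003L, NA, NA))

# Rows that actually carry a value (i = 1 and i = 4)
known <- toy[!is.na(cd1)]

# For every index in toy, look up the most recent row of known at or before it;
# roll = TRUE rolls the last matching observation forward
filled <- known[toy[, .(i)], on = "i", roll = TRUE]
print(filled)
```

Here each index 1..6 is matched to the nearest preceding row of known, which is exactly a last-observation-carried-forward fill expressed as a join.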