【Question title】: How to subtract a median only from 5 last integer values
【Posted】: 2018-06-19 11:52:54
【Question】:

I have this dataset:

    df=structure(list(Dt = structure(1:39, .Label = c("2018-02-20 00:00:00.000", 
"2018-02-21 00:00:00.000", "2018-02-22 00:00:00.000", "2018-02-23 00:00:00.000", 
"2018-02-24 00:00:00.000", "2018-02-25 00:00:00.000", "2018-02-26 00:00:00.000", 
"2018-02-27 00:00:00.000", "2018-02-28 00:00:00.000", "2018-03-01 00:00:00.000", 
"2018-03-02 00:00:00.000", "2018-03-03 00:00:00.000", "2018-03-04 00:00:00.000", 
"2018-03-05 00:00:00.000", "2018-03-06 00:00:00.000", "2018-03-07 00:00:00.000", 
"2018-03-08 00:00:00.000", "2018-03-09 00:00:00.000", "2018-03-10 00:00:00.000", 
"2018-03-11 00:00:00.000", "2018-03-12 00:00:00.000", "2018-03-13 00:00:00.000", 
"2018-03-14 00:00:00.000", "2018-03-15 00:00:00.000", "2018-03-16 00:00:00.000", 
"2018-03-17 00:00:00.000", "2018-03-18 00:00:00.000", "2018-03-19 00:00:00.000", 
"2018-03-20 00:00:00.000", "2018-03-21 00:00:00.000", "2018-03-22 00:00:00.000", 
"2018-03-23 00:00:00.000", "2018-03-24 00:00:00.000", "2018-03-25 00:00:00.000", 
"2018-03-26 00:00:00.000", "2018-03-27 00:00:00.000", "2018-03-28 00:00:00.000", 
"2018-03-29 00:00:00.000", "2018-03-30 00:00:00.000"), class = "factor"), 
    ItemRelation = c(158043L, 158043L, 158043L, 158043L, 158043L, 
    158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 
    158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 
    158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 
    158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 
    158043L, 158043L, 158043L, 158043L, 158043L, 158043L), stuff = c(200L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 3600L, 0L, 0L, 0L, 0L, 
    700L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1000L, 
    2600L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 700L), num = c(1459L, 
    1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 
    1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 
    1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 
    1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 
    1459L, 1459L), year = c(2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L), action = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)), .Names = c("Dt", "ItemRelation", 
"stuff", "num", "year", "action"), class = "data.frame", row.names = c(NA, 
-39L))

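The action column has only two values, 0 and 1. I need to compute the median of stuff for action category 1, and then the median for action category 0 using only the last five integer (non-zero) values that precede category 1. That is, within category 0 I take only the last 5 observations with non-zero values, rather than computing the median over all values of category 0. In our example these are

Then the median of category 0 is subtracted from the median of category 1.

The number of observations in action category 0 can vary from 0 to 10. If category 0 has 10 non-zero values, we take the last five. If it has only 1, 2, 3, 4 or 5 non-zero values, we subtract the median of the values that are actually present. If category 0 contains only zeros and no non-zero value, we simply subtract 0.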
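The rule above can be sketched in base R as follows. This is only an illustration under stated assumptions: `median_last_n` is a hypothetical helper name that does not appear in the original post, and "integer value" is read as "strictly greater than zero", as clarified in the comments.

```r
# Hypothetical helper (not from the original post): median of the last `n`
# positive values of x, or 0 when x has no positive value at all.
median_last_n <- function(x, n = 5) {
  x <- x[!is.na(x) & x > 0]      # keep positive values only, in original order
  if (length(x) == 0) return(0)  # only zeros: subtract 0
  median(tail(x, n))             # tail() also copes with fewer than n values
}

median_last_n((1:10) * 100)        # ten values, last five are 600..1000 -> 800
median_last_n(c(0, 300, 0, 100))   # two values -> 200
median_last_n(c(0, 0, 0))          # no positive value -> 0
```

Using `tail()` rather than explicit index arithmetic avoids special-casing vectors shorter than five elements.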

Akshay's solution from the related thread, How to subtract a median only from integer value, helped me:

df.0 <- df %>% filter(action == 0 & stuff != 0) %>% arrange(Dt) %>% top_n(5)
df.1 <- df %>% filter(action==1 & stuff!=0)

new.df <- rbind(df.0,df.1)


View(
  df %>% select (everything()) %>%  group_by(ItemRelation, num, year) %>%
    summarise(
      median.1 = median(stuff[action == 1 & stuff != 0], na.rm = T),
      median.0 = median(stuff[action == 0 &
                                stuff != 0], na.rm = T)
    ) %>%
    mutate(
      value = median.1 - median.0,
      DocumentNum = num,
      DocumentYear = year
    ) %>%
    select(ItemRelation, DocumentNum, DocumentYear, value)
)

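However, this code computes the median over all category-0 observations of action. It should compute the category-0 median using only the last 5 observations before category 1.

If someone helps me in the original (related) thread, I will delete this new thread rather than leave duplicate threads.

Note that action category 0 may contain values other than zero.

Edit 2: I added a new category, CustomerName.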

output <- data.frame(mydat[which.max(as.Date(mydat$Dt)),
                           c("CustomerName","ItemRelation","DocumentNum","DocumentYear")], 
                     value = m,
                     row.names = 1:length(which.max(as.Date(mydat$Dt))))


CustomerName ItemRelation DocumentNum DocumentYear value
1  orange TC       157214        1529         2018   162

Why do I get only one row? The output must look like the example below. There are many strata (groups), not just one.

CustomerName ItemRelation DocumentNum DocumentYear value
1  orange TC       157214        1529         2018   162
2  appleTC              5        1529         2018   164

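【Comments】:

Do I understand correctly: you want to subset the data by action (1 or 2) and stuff (!=0), then take the respective medians, and finally do arithmetic with those two values?
@nate.edwinton. action has only the categories 0 and 1. I want to compute the median of stuff over the last 5 observations of action category 0. We take the 5 observations of stuff (only those greater than zero) before action category 1, compute their median, and then subtract it from the median of action category 1. Do you understand me? The current code computes the overall median of stuff for action category 0.
Do you want to first subset stuff / action to "greater than zero / zero" and then take the last 5 observations, or first take the last 5 observations and then subset? Also, when computing the median for action = 1, do you likewise consider only particular observations (the last 5, or those greater than zero)?
@nate.edwinton "Do you want to first subset stuff / action to 'greater than zero / zero' and then take the last 5 observations?" - Yes. action = 1 can have any number of observations, but zeros and negative values must be removed from stuff for action category 1.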
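The order of the two steps discussed in these comments changes the result. A small sketch with a made-up vector `x` (not taken from the question's data) shows the difference between subsetting first and taking the last 5 observations first:

```r
# Made-up values for illustration only; zeros stand for "no stuff" days.
x <- c(200, 0, 3600, 0, 0, 700, 0, 0, 1000, 2600, 0, 0)

# Subset first, then take the last 5 (what the asker confirmed):
subset_then_tail <- tail(x[x > 0], 5)   # 200 3600 700 1000 2600
median(subset_then_tail)                # 1000

# Take the last 5 observations first, then subset:
last5 <- tail(x, 5)                     # 0 1000 2600 0 0
tail_then_subset <- last5[last5 > 0]    # 1000 2600
median(tail_then_subset)                # 1800
```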

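Tags: r dplyr plyr lapply

【Solution 1】:

I am not entirely sure what exactly you are trying to accomplish. Still, this may help.

You can use which and intersect to subset the part of the data you need: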

# df with action 0 and stuff > 0
v <- df$stuff[intersect(which(df$action == 0),
                        which(df$stuff > 0))]

# df with action 1 and stuff > 0
w <- df$stuff[intersect(which(df$action == 1),
                        which(df$stuff > 0))]

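v contains all elements of stuff for which action is 0 and stuff is greater than 0. From here on, computing the median is a formality. (You may want to add a safeguard in case intersect(...) is empty, e.g. if stuff is always 0 while action is 0.)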

# calculating the median of v for the last 5 observations
# tail() also works when v has fewer than 5 values; if v is empty, subtract 0
m0 <- if (length(v) > 0) median(tail(v, 5)) else 0
# computing the final difference
m <- median(w) - m0

Edit

To reproduce the output above, consider

output <- data.frame(df[which.max(as.Date(df$Dt)),
                        c("Dt","ItemRelation","num","year")], 
                     value = m,
                     row.names = 1:length(which.max(as.Date(df$Dt))))

where which.max(as.Date(df$Dt)) gives the row number of the latest date. However, the logic you apply to arrive at that result may differ, so caution is advised here.
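One reason for caution is that which.max returns a single index even when the latest date occurs in several rows. A minimal sketch with made-up dates (the vector `dates` is an assumption for illustration, not part of the question's data):

```r
# Made-up dates; the latest date appears twice.
dates <- as.Date(c("2018-03-29", "2018-03-30", "2018-03-30"))

first_latest <- which.max(dates)            # 2: only the first row with the latest date
all_latest   <- which(dates == max(dates))  # 2 3: every row with the latest date
```

With data that holds one "latest" row per group, the second form keeps all of them, while the first keeps just one.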

In any case, here is the output

> output
                       Dt ItemRelation  num year value
1 2018-03-30 00:00:00.000       158043 1459 2018  -300

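【Comments】:

nate.edwinton, I edited my post. Could you provide that output? After my edit, I need the groups ItemRelation, DocumentNum and DocumentYear. There must be a result for each group of ItemRelation, DocumentNum and DocumentYear.
In other words, your code should work within the strata ItemRelation + DocumentNum + DocumentYear.
I included Dt by mistake; it is not needed.
@D.Joe Note that which.max returns only the first instance that attains the maximum, not all instances (compare which.max(c(1:2,2)) with which(c(1:2,2) == max(c(1:2,2)))). So, depending on your needs, it may be safer to replace which.max(as.Date(df$Dt)) with which(as.Date(df$Dt) == max(as.Date(df$Dt))).
I did that and got an error: > output value = m, Error: unexpected ',' in "value = m," > row.names = 1:length(which(as.Date(mydat$Dt) == max(as.Date(mydat$Dt))))))))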
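For the per-group result requested in these comments, one possible base R sketch splits the data by the three key columns and applies the same calculation to each piece. The data frame `toy`, its values, and the helper name `median_diff` are made up for illustration and are not from the original thread. The sketch also assumes that rows are already in date order within each group and that every group has at least one positive stuff value with action 1.

```r
# Hypothetical helper: median of positive stuff for action 1 minus the median
# of the last `n` positive stuff values for action 0 (0 if there are none).
median_diff <- function(stuff, action, n = 5) {
  v <- stuff[action == 0 & stuff > 0]
  w <- stuff[action == 1 & stuff > 0]
  m0 <- if (length(v) > 0) median(tail(v, n)) else 0
  median(w) - m0
}

# Made-up data with two groups; rows are assumed to be sorted by date.
toy <- data.frame(
  ItemRelation = rep(c(157214L, 5L), each = 6),
  DocumentNum  = 1529L,
  DocumentYear = 2018L,
  stuff  = c(100L, 0L, 300L, 500L, 0L, 400L,   # group 157214
             0L,   0L, 0L,   0L, 250L, 150L),  # group 5: only zeros for action 0
  action = rep(c(0L, 0L, 0L, 0L, 1L, 1L), 2)
)

keys   <- c("ItemRelation", "DocumentNum", "DocumentYear")
pieces <- split(toy, toy[keys], drop = TRUE)   # one data frame per group

# One output row per group: the key columns plus the computed value.
result <- do.call(rbind, lapply(pieces, function(d) {
  data.frame(d[1, keys], value = median_diff(d$stuff, d$action),
             row.names = NULL)
}))
rownames(result) <- NULL
result
```

Here group 157214 gives 400 - 300 = 100, and group 5 gives 200 - 0 = 200, because it has no positive action-0 value. The earlier single-row output arises because m is one number computed over the whole data frame; computing it inside each piece yields one row per group.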