How do I reformat a data set to have this particular structure without for loops? Answers

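【Question title】: How do I reformat a data set to have this particular structure without for loops?
【Posted】: 2019-11-28 17:16:28
【Question description】:

I am trying to reorganize some raw data into a more condensed form. Currently the data looks like the output of the R code below. I would like the final output to have columns for time, ID, and every possible desired price. Each ID should then have only one row per time, with the quantity entered under the corresponding desired price (that is, how many units that ID wants at a specific price during that period). For example, a particular ID might have a quantity of 1 at 100 and a quantity of 2 at 101. The value should be negative for a buy and positive for a sell. For example, -1 means a buy at 100 and 2 means a sell at 101.

I first tried to do this with a double for loop, the outer loop over time and the inner loop over ID. Inside the loops I looked up the quantity column and the desired prices for the ID and put them into a vector. I then combined all the vectors and repeated the process. This was not workable in practice: the code is far too slow, because there are hundreds of IDs and thousands of time periods. Can someone help me do this in a faster and cleaner way?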

set.seed(1)
time <- rep(seq(1, 5), each = 15)
id <- sample(342:450,75,replace = TRUE)
price <- sample(99:103,75,replace = TRUE)
Desire.Price <- sample(97:105,75,replace = TRUE)
quantity <- sample(1:4,75,replace = TRUE)
data <- data.frame(time = time, id = id,price = price, Desire.Price = Desire.Price,quantity = quantity)
data$buysell <- 0
data$buysell <- ifelse( data$Desire.Price <= data$price, "BUY","SELL")

I would like the final data set to look like this.

Final.df <- data.frame(time=NA,id=NA,"97" = NA,"98"=NA ,"99"=NA,"100"=NA,"101"=NA,"102"=NA,"103"=NA
                       ,"104"=NA,"105"=NA)

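It would basically condense the original raw data so that, within each time period, all the information for a particular ID sits in a single row.

EDIT: If an ID is not sampled at a given time (for example, ID 342 does not appear at time 1), it should get a row of NAs for that time period (so ID 342 has a row of NAs at time 1). I edited the code that generates the sample to use more ids to reflect this (so they cannot all be sampled in every time period).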

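【Question comments】:

It may help to share your loop code or the final output that corresponds to the sample input. Some guidance, if you are interested: stackoverflow.com/questions/5963269/…

Tags: r data-manipulation data-cleaning

【Solution 1】:

Here is a tidyverse approach. First, sign the quantities according to BUY/SELL, then sum them for each id / time / Desire.Price, then spread them into wide format with one column per Desire.Price.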

library(dplyr); library(tidyr)
data %>%
  mutate(quantity_signed = if_else(buysell == "BUY", -quantity, quantity)) %>%
  count(id, time, Desire.Price, wt = quantity_signed) %>%
  complete(id, time) %>%  # EDIT to bring in all times for all id's
  spread(Desire.Price, n) %>% View("output")
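
For what it is worth, here is a minimal sketch of the same idea using `pivot_wider()`, which tidyr (1.0.0 and later) offers as the successor to `spread()`. The `toy` data frame and the column name `qty` are made up for illustration and are not from the question. The only real change is that `complete()` runs after the widening step, so the filler rows for absent id / time pairs never carry a missing Desire.Price into the reshape.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy input with the same columns the question uses.
toy <- data.frame(
  time         = c(1, 1, 1, 2),
  id           = c(342, 342, 343, 342),
  Desire.Price = c(100, 101, 100, 100),
  quantity     = c(1, 2, 3, 4),
  buysell      = c("BUY", "SELL", "SELL", "BUY")
)

wide <- toy %>%
  # BUY quantities become negative, SELL quantities stay positive.
  mutate(quantity_signed = if_else(buysell == "BUY", -quantity, quantity)) %>%
  # Sum the signed quantities per time / id / Desire.Price.
  count(time, id, Desire.Price, wt = quantity_signed, name = "qty") %>%
  # One column per Desire.Price.
  pivot_wider(names_from = Desire.Price, values_from = qty) %>%
  # Add an all-NA row for every time / id pair that never occurred.
  complete(time, id)

print(wide)
```

In this toy input, id 343 does not appear at time 2, so it should come back as a row whose price columns are all NA.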

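【Comments】:

This is almost exactly what I want. The execution speed is perfect too (I need to learn the packages you used for this). Is there any way to tweak it slightly so that if an ID is not in a time period, it just gets a row of NAs for that period? I have also edited the original question to reflect this update.
Edited so that every id shows up in every time period.
Perfect! Thank you!

【Solution 2】:

I think this approach is simpler.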

# Code
library(reshape2)
# Turn BUY quantities negative.
buy_rows <- data$buysell == "BUY"
data$quantity[buy_rows] <- -data$quantity[buy_rows]
# Use dcast to spread Desire.Price into columns, summing quantities per time/id.
final.df <- dcast(data, time + id ~ Desire.Price, fun.aggregate = sum, value.var = "quantity")
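
One thing to watch with this approach: when `fun.aggregate = sum` is used, `dcast` fills cells that have no data with the sum of an empty vector, which is 0 rather than NA, and id / time pairs that never occur get no row at all. The sketch below is one possible way to get closer to what the edited question asks for. It assumes a made-up `toy` data frame (not the question's sample) and uses `fill = NA_real_` plus a merge against the full time / id grid; the names `grid` and `full` are only illustrative.

```r
library(reshape2)

# Hypothetical toy input with the same columns the question uses.
toy <- data.frame(
  time         = c(1, 1, 1, 2),
  id           = c(342, 342, 343, 342),
  Desire.Price = c(100, 101, 100, 100),
  quantity     = c(1, 2, 3, 4),
  buysell      = c("BUY", "SELL", "SELL", "BUY")
)

# BUY quantities become negative, SELL quantities stay positive.
toy$quantity <- ifelse(toy$buysell == "BUY", -toy$quantity, toy$quantity)

# fill = NA_real_ leaves empty price cells as NA instead of 0.
wide <- dcast(toy, time + id ~ Desire.Price,
              fun.aggregate = sum, value.var = "quantity", fill = NA_real_)

# Build every time / id combination and left-join the wide table onto it,
# so pairs that never occurred come back as all-NA rows.
grid <- expand.grid(time = unique(toy$time), id = unique(toy$id))
full <- merge(grid, wide, by = c("time", "id"), all.x = TRUE)
full <- full[order(full$time, full$id), ]

print(full)
```

On real data the grid has (number of times) x (number of ids) rows, so it is worth checking that this fits in memory before running it.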

【Comments】: