Match column and rows then replace: answers

【Question title】: Match column and rows then replace
【Posted】: 2018-02-26 00:10:00
【Question description】:

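I have to analyze data from an economic experiment. My database consists of 14,976 observations and 212 variables. It also contains other information, such as profit, total profit, treatment, and other variables. As you can see, there are two types: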

Type 1 is for sellers
Type 2 is for buyers

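For some variables, the results are recorded in the buyer (Type 2) rows rather than in the seller rows (a completely arbitrary choice). However, I want to analyze, for example, the gender of the sellers who overcharge. So I need to manipulate my database, but I don't know how to do it.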

Here is part of the database:

ID       Gender   Period   Matching group   Group    Type  Overcharging ...
654        1           1            73         1        1      NA
654        1           2            73         1        1      NA
654        1           3            73         1        1      NA
654        1           4            73         1        1      NA 
435        1           1            73         2        1      NA
435        1           2            73         2        1      NA
435        1           3            73         2        1      NA
435        1           4            73         2        1      NA 
708        0           1            73         1        2       1
708        0           2            73         1        2       0
708        0           3            73         1        2       0
708        0           4            73         1        2       1   
546        1           1            73         2        2       0
546        1           2            73         2        2       0
546        1           3            73         2        2       1
546        1           4            73         2        2       0

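To do what I want, I have plenty of information (in period x, group x, matching group x, exactly one seller is matched with one buyer, plus treatment x...). For example, in matching group 73, we know that in period 1 subject 708 (the buyer in group 1) was overcharged. Since I know this subject belongs to group 1 and matching group 73, I can identify the seller who overcharged him in period 1: subject 654, gender = 1.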

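So I want to put the buyers' Overcharging values (and some others) on the seller rows (Type == 1) in order to analyze seller behavior, but in the right period, for the right group and the right matching group.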

【Question comments】:

Tags: r database dataframe replace dplyr

【Solution 1】:

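I have gone the long way round here by using plain data.frames. If you plan to code in R over the long term, I suggest you look at (i) the dplyr package, part of the tidyverse suite, or (ii) the data.table package. The first has the most popular syntax and ties in well with a bunch of useful packages. The second is harder to learn but faster. For data of your size, though, the speed difference is negligible.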

In base data.frames, I hope the following does what you want. Let me know if I have made any mistakes or if anything is unclear.

# sellers data eg
dt1 <- data.frame(Period = 1:4, MatchGroup = 73, Group = 1, Type = 1, 
                 Overcharging = NA)
# buyers data eg
dt2 <- data.frame(Period = 1:4, MatchGroup = 73, Group = 1, Type = 2, 
                 Overcharging = c(1,0,0,1))
# make my current data view
dt <- rbind(dt1, dt2)
dt[]

# split in to two data frames, on the Type column:
dt_split <- split(dt, dt$Type)
dt_split

# move out of list; split() names its elements after the Type values,
# so "1" holds the sellers and "2" holds the buyers
dt1 <- dt_split[["1"]]
dt2 <- dt_split[["2"]]
dt1[]
dt2[]

# define the columns in which to match up the buyer to seller
merge_cols <- c("Period", "MatchGroup", "Group")
# define the columns you want to merge, that you know are NA
na_cols <- c("Overcharging")
# now use merge operation, and filter dt2, to pull in only columns you want
# I suggest dropping the na_cols first in dt1, as otherwise merge() will create
# two columns post-merge: Overcharging.x, Overcharging.y
dt1 <- dt1[,setdiff(names(dt1), na_cols)]
dt1_new <- merge(dt1, 
                 dt2[, c(merge_cols, na_cols)], # filter dt2 
                 by = merge_cols, # columns to match on
                 all.x = TRUE) # dt1 is x, dt2 is y. Want to keep all of dt1

# if you want to bind them back together, ensure the column order matches, and
# bind e.g.
dt1_new <- dt1_new[, names(dt2)]
dt_final <- rbind(dt1_new, dt2)
dt_final[]

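My approach is to split the buyer and seller data into two separate data frames, then work out how they join, and move the data you need from the buyers to the sellers. Finally, bind them back together if needed.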
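As a rough sketch of that approach on the rows shown in the question, the following keeps the ID and Gender columns so the overcharging sellers can be read off directly. The column names (MatchGroup instead of "Matching group") and the helper names key_cols, move_cols, sellers_filled and overchargers are assumptions made for this illustration, not names from the real database.

```r
# Sample rows copied from the question; column names shortened for R
df <- data.frame(
  ID           = rep(c(654, 435, 708, 546), each = 4),
  Gender       = rep(c(1, 1, 0, 1), each = 4),
  Period       = rep(1:4, times = 4),
  MatchGroup   = 73,
  Group        = rep(c(1, 2, 1, 2), each = 4),
  Type         = rep(c(1, 1, 2, 2), each = 4),
  Overcharging = c(rep(NA, 8), 1, 0, 0, 1, 0, 0, 1, 0)
)

key_cols  <- c("Period", "MatchGroup", "Group")  # identify one seller-buyer pair
move_cols <- "Overcharging"                      # recorded on buyer rows only

# Sellers without the all-NA column; buyers reduced to keys plus the values to move
sellers <- df[df$Type == 1, setdiff(names(df), move_cols)]
buyers  <- df[df$Type == 2, c(key_cols, move_cols)]

# Left join: every seller row is kept and picks up its buyer's value
sellers_filled <- merge(sellers, buyers, by = key_cols, all.x = TRUE)

# Sellers who overcharged, with their gender, period and group
overchargers <- sellers_filled[which(sellers_filled$Overcharging == 1),
                               c("ID", "Gender", "Period", "Group")]
overchargers
```

On this sample, seller 654 is picked up for periods 1 and 4 and seller 435 for period 3, which matches the example worked through in the question. On the full data, any additional column that distinguishes sessions or treatments would probably need to be added to key_cols.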

【Comments】:

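Thank you. I think you understood the point of my question well. However, I tried your solution and I end up with a dt1_new database containing 16,704 observations (more than my original database). How is that possible? Another question: you mentioned the dplyr package; do you have a solution that uses it?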
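Usually, when the size of a data frame increases, it is because of the merge operation. If dt2 contains duplicates with respect to the columns in the by clause, the merge creates rows that did not exist before. Try running the code above again, this time with dt2 created by the following code: # buyers data eg dt2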
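If you do have duplicates, you can find them with the duplicated function. Stack Overflow has plenty of questions on duplicated data, e.g. stackoverflow.com/questions/13742446/… I use the latter package (data.table), so I am not the best person to ask for dplyr help. I think the code I posted would also work with data.table. I suggest looking at a tutorial for each first, then picking whichever you feel most comfortable with.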
Thank you for your time!
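To illustrate the duplicate-key explanation given in the comments, here is a minimal sketch with made-up data: the buyers frame below is hypothetical and deliberately contains period 1 twice for the same group. It shows the merge producing more rows than there are sellers, and one way of flagging the offending rows with duplicated.

```r
sellers <- data.frame(Period = 1:4, MatchGroup = 73, Group = 1, Type = 1)

# Hypothetical buyers data: period 1 appears twice for the same group
buyers <- data.frame(Period = c(1L, 1L, 2L, 3L, 4L), MatchGroup = 73, Group = 1,
                     Overcharging = c(1, 0, 0, 0, 1))

key_cols <- c("Period", "MatchGroup", "Group")

merged <- merge(sellers, buyers, by = key_cols, all.x = TRUE)
nrow(sellers)  # 4
nrow(merged)   # 5: the period-1 seller row is matched twice

# Flag every buyer row whose key combination occurs more than once.
# duplicated() alone marks only the later copies, so it is combined with a
# second pass from the end to mark the first copy as well.
keys   <- buyers[, key_cols]
is_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
buyers[is_dup, ]
```

If the flagged rows turn out to be genuinely distinct observations, that would suggest the key columns are incomplete (for example, a session or treatment column is missing from the by clause) rather than that the data contain true duplicates.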