How to add a binary variable to a data frame based on another variable in the data frame in R? Answers

【Question title】: How to add a binary variable to a data frame based on another variable in the data frame in r?
【Posted】: 2014-02-23 02:42:24
【Question】:

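My data (train) is a 443402 x 27 data frame, and I have initialized a new binary variable, train$researchedplan, to "1". There are 64,673 unique values of train$customer_ID (each customer appears in the data frame a varying number of times, but the rows are in order, i.e. the first customer has the first 9 rows, the second customer has the next 6 rows, etc.).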

> train[1:20,c(1,27)]
   customer_ID researchedplan
1     10000000              1
2     10000000              1
3     10000000              1
4     10000000              1
5     10000000              1
6     10000000              1
7     10000000              1
8     10000000              1
9     10000000              1
10    10000005              1
11    10000005              1
12    10000005              1
13    10000005              1
14    10000005              1
15    10000005              1
24    10000013              1
25    10000013              1
26    10000013              1
27    10000013              1
28    10000014              1

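I also have a vector (diff_than_researched) made up of unique train$customer_ID strings, identifying which customers did not research a particular plan. For every string in diff_than_researched that matches a string in train$customer_ID, I want train$researchedplan to be "0" for all of that customer's entries. For example: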

> head(diff_than_researched)
[1] "10000019" "10000033" "10000036" "10000037" "10000055" "10000075"

So for all "10000019" entries, I want train$researchedplan to equal "0".

Right now I can do all of this with a "for loop", but it takes a very long time to go through this many entries:

for(i in 1:17210) { train$researchedplan[train$customer_ID == diff_than_researched[i]] <- 0 }

【Comments】:

train$researchedplan <- as.numeric(!train$customer_ID %in% diff_than_researched)
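@JakeBurkhead Why is the ! applied after %in% is evaluated, rather than as !train$customer?
@rawr Operator precedence. Special operators (including %% and %/%) take precedence over ! negation.
Never noticed that. Good info.
@jake That looks like an answer. Please copy/paste it into an answer.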

Tags: r performance dataframe

【Solution 1】:

Using slightly different data for readability and to get some 0s in researchedplan.

train

##    customer_ID
## 1     10000000
## 10    10000005
## 24    10000013
## 28    10000014
## 5     10000019    

train$researchedplan <- as.numeric(!train$customer_ID %in% diff_than_researched)

##    customer_ID researchedplan
## 1     10000000              1
## 10    10000005              1
## 24    10000013              1
## 28    10000014              1
## 5     10000019              0

As @rawr pointed out in the comments, this first checks whether each customer_ID is in diff_than_researched, and then negates that logical vector, because of operator precedence.
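
For anyone who wants to try this, here is a minimal, self-contained sketch of the approach above. The sample values are made up for illustration, and customer_ID is assumed to be stored as character strings so that it matches the strings in diff_than_researched directly. It also checks that the unparenthesized form gives the same result as writing the parentheses out.

```r
# Hypothetical sample data (not the asker's real data); IDs are assumed
# to be character strings, like the entries of diff_than_researched.
train <- data.frame(
  customer_ID = c("10000000", "10000005", "10000013", "10000014", "10000019"),
  stringsAsFactors = FALSE
)
diff_than_researched <- c("10000019", "10000033", "10000036")

# %in% has higher precedence than !, so this is parsed as
# !(train$customer_ID %in% diff_than_researched).
# as.numeric() turns TRUE/FALSE into 1/0.
train$researchedplan <- as.numeric(!train$customer_ID %in% diff_than_researched)

# Writing the parentheses explicitly gives the same vector.
explicit <- as.numeric(!(train$customer_ID %in% diff_than_researched))
stopifnot(identical(train$researchedplan, explicit))

print(train)
```

Because %in% works on the whole column at once, there is no need to loop over the IDs. Note that %in% coerces both sides to a common type, so if customer_ID is numeric in your data it is worth checking that the IDs match the strings as expected.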

【Comments】: