【Question title】: How can I use the filter function to increase sample homogeneity?
【Posted】: 2021-08-02 07:54:57
【Question description】:

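I'm looking for some data-wrangling advice. My data set contains two groups of investors (signatory = 0 and signatory = 1). Each investor has a corresponding country, but the countries in the two groups do not match.

For my next analysis, I need to reduce my data set to only those countries that are present in both groups, so that each group has at least one unit (investor) in every remaining country.

To be clear: if one group has investors in 45 countries and the other has investors in 50 countries, but only 30 of those countries match, I want to keep only those 30 matching countries in the new data frame.

My data looks like this: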

investor year activity country region strategy signatory
123 IM 2002 4.45 France europe VC 1
123 IM 2003 3.2 France europe VC 1
123 IM 2004 7.8 France europe VC 1
21Invest 2002 4.45 France europe VC 0
21Invest 2003 3.2 France europe VC 0
21Invest 2004 7.8 France europe VC 0
Aegon 2005 5.4 Netherlands europe BY 1
Aegon 2006 4.2 Netherlands europe BY 1
Aegon 2007 1.3 Netherlands europe BY 1
ING 2005 5.4 Netherlands europe BY 0
ING 2006 4.2 Netherlands europe BY 0
ING 2007 1.3 Netherlands europe BY 0
aberdeen 2002 4.45 UK europe VC 1
aberdeen 2003 3.2 UK europe VC 1
aberdeen 2004 7.8 UK europe VC 1
JPM 2005 5.4 USA europe BY 0
JPM 2006 4.2 USA europe BY 0
JPM 2007 1.3 USA europe BY 0

The output I am looking for is:

investor year activity country region strategy signatory
123 IM 2002 4.45 France europe VC 1
123 IM 2003 3.2 France europe VC 1
123 IM 2004 7.8 France europe VC 1
21Invest 2002 4.45 France europe VC 0
21Invest 2003 3.2 France europe VC 0
21Invest 2004 7.8 France europe VC 0
Aegon 2005 5.4 Netherlands europe BY 1
Aegon 2006 4.2 Netherlands europe BY 1
Aegon 2007 1.3 Netherlands europe BY 1
ING 2005 5.4 Netherlands europe BY 0
ING 2006 4.2 Netherlands europe BY 0
ING 2007 1.3 Netherlands europe BY 0

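Note: the firms in the UK and the USA are dropped, while the firms in France and the Netherlands are kept.

This is because both investor samples (signatory = 0 and signatory = 1) have units in France and the Netherlands, whereas the UK and the USA each appear in only one of the samples.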

df <- data.frame(
  investor = c("123 IM", "123 IM", "123 IM", "21Invest", "21Invest", "21Invest", "Aegon", "Aegon", "Aegon", "ING", "ING", "ING", "aberdeen", "aberdeen", "aberdeen", "JPM", "JPM", "JPM"),
  year = c(2002, 2003, 2004, 2005, 2006, 2007, 2002, 2003, 2004, 2005, 2006, 2007, 2002, 2003, 2004, 2005, 2006, 2007),
  activity = c(4.45, 3.2, 7.8, 5.4, 4.2, 1.3, 4.45, 3.2, 7.8, 5.4, 4.2, 1.3, 4.45, 3.2, 7.8, 5.4, 4.2, 1.3),
  country = c("France", "France", "France", "France", "France", "France", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "UK", "UK", "UK", "USA", "USA", "USA"),
  region = c("europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "north america", "north america", "north america"),
  strategy = c("VC", "VC", "VC", "BY", "BY", "BY", "VC", "VC", "VC", "BY", "BY", "BY", "VC", "VC", "VC", "BY", "BY", "BY"),
  signatory = c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0))
# Desired result: rows for France and the Netherlands only
expected_df <- data.frame(
  investor = c("123 IM", "123 IM", "123 IM", "21Invest", "21Invest", "21Invest", "Aegon", "Aegon", "Aegon", "ING", "ING", "ING"),
  year = c(2002, 2003, 2004, 2005, 2006, 2007, 2002, 2003, 2004, 2005, 2006, 2007),
  activity = c(4.45, 3.2, 7.8, 5.4, 4.2, 1.3, 4.45, 3.2, 7.8, 5.4, 4.2, 1.3),
  country = c("France", "France", "France", "France", "France", "France", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands"),
  region = c("europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe"),
  strategy = c("VC", "VC", "VC", "BY", "BY", "BY", "VC", "VC", "VC", "BY", "BY", "BY"),
  signatory = c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0))

Any tips would be greatly appreciated!

Rory

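【Comments】:

  • Your example data frame suggests that you would end up with no rows at all, because all signatories are in France and all non-signatories are in the Netherlands. There is no overlap. Could you provide a better data set?
  • Hi, I have just revised it following your comment!
  • Yes, I saw that. I have revised my answer as well.
  • If my answer below solves your problem, please accept it by clicking the check mark next to it. You can do the same for the other answered questions you have asked previously. If you found any of them helpful, you can also upvote them.
  • Hi coffieinjunky, thanks for your help so far. It improved the data, but it does not seem to work perfectly - I still have 3 countries in the data set that have only 1 investor. Any idea how this could happen?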
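
Regarding the last comment above (countries that remain with only one investor), the thread does not say what caused it. One possible explanation, which is purely an assumption here, is that a single investor appears with both signatory values (for example, because its status changed over time), so its country counts as present in both groups. The base R sketch below shows how such countries could be listed; the investor "X Capital" and the country "Spain" are hypothetical and do not come from the question.

```r
# Hypothetical data: "X Capital" appears with signatory = 0 and signatory = 1
df <- data.frame(
  investor  = c("123 IM", "21Invest", "X Capital", "X Capital"),
  country   = c("France", "France", "Spain", "Spain"),
  signatory = c(1, 0, 0, 1)
)

# Number of distinct investors and of distinct signatory values per country
n_investors <- tapply(df$investor, df$country, function(x) length(unique(x)))
n_groups    <- tapply(df$signatory, df$country, function(x) length(unique(x)))

# Countries that appear in both groups but contain only a single investor
suspect <- names(n_investors)[as.vector(n_investors == 1 & n_groups == 2)]
suspect
```

If this lists any countries in the real data, checking whether an investor's signatory value varies across years would be a reasonable next step.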

Tags: r filter data-wrangling


【Solution 1】:

You can do it like this:

library(tidyverse)

# Countries that occur in each of the two groups
signatory_countries <- unique(df[df$signatory==1, 'country'])
non_signatory_countries <- unique(df[df$signatory==0, 'country'])

# Keep a row only if its country also occurs in the other group
new_df <- bind_rows(
  df %>% filter(signatory==1, country %in% non_signatory_countries),
  df %>% filter(signatory==0, country %in% signatory_countries)
)
new_df
   investor year activity     country region strategy signatory
1    123 IM 2002     4.45      France europe       VC         1
2    123 IM 2003     3.20      France europe       VC         1
3    123 IM 2004     7.80      France europe       VC         1
4     Aegon 2002     4.45 Netherlands europe       VC         1
5     Aegon 2003     3.20 Netherlands europe       VC         1
6     Aegon 2004     7.80 Netherlands europe       VC         1
7  21Invest 2005     5.40      France europe       BY         0
8  21Invest 2006     4.20      France europe       BY         0
9  21Invest 2007     1.30      France europe       BY         0
10      ING 2005     5.40 Netherlands europe       BY         0
11      ING 2006     4.20 Netherlands europe       BY         0
12      ING 2007     1.30 Netherlands europe       BY         0
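
As a possible alternative, the same rule can be written as a grouped filter: group by country and keep a group only if both signatory values occur in it. This is a minimal sketch rather than part of the answer above; it uses a reduced, illustrative data set with one row per investor and assumes `signatory` is coded as 0 and 1, as in the question.

```r
library(dplyr)

# Reduced illustrative data: one row per investor, based on the question's example
df <- data.frame(
  investor  = c("123 IM", "21Invest", "Aegon", "ING", "aberdeen", "JPM"),
  country   = c("France", "France", "Netherlands", "Netherlands", "UK", "USA"),
  signatory = c(1, 0, 1, 0, 1, 0)
)

# Keep only countries in which both signatory values (0 and 1) occur
new_df <- df %>%
  group_by(country) %>%
  filter(all(c(0, 1) %in% signatory)) %>%
  ungroup()

new_df
```

Unlike the `bind_rows()` version, this keeps the rows in their original order instead of stacking all signatory rows above the non-signatory rows.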

【Comments】:
