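【Posted】: 2021-08-02 07:54:57
【Question】:
I'm looking for some data-wrangling advice. My dataset contains two groups of investors (signatory = 0 and signatory = 1). Every investor has a corresponding country, but the countries covered by the two groups do not match.
For my next analysis, I need to reduce my dataset to only those countries that appear in both groups, so that each group has at least one unit (investor) in every listed country.
To be clear: if one group has investors in 45 countries and the other has investors in 50 countries, but only 30 of those countries match, I want to keep only those 30 matching countries in the new data frame.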
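One way to express this rule in base R is a set intersection followed by a row filter. This is a minimal sketch under assumptions, not a definitive solution: `toy`, `in_sig`, `in_nonsig`, `both`, and `result` are made-up names, and the six rows are a stand-in for the real data.

```r
# Illustrative stand-in data: one row per investor.
toy <- data.frame(
  investor  = c("123 IM", "21Invest", "Aegon", "ING", "aberdeen", "JPM"),
  country   = c("France", "France", "Netherlands", "Netherlands", "UK", "USA"),
  signatory = c(1, 0, 1, 0, 1, 0)
)

# Countries that occur in each group.
in_sig    <- unique(toy$country[toy$signatory == 1])
in_nonsig <- unique(toy$country[toy$signatory == 0])

# Countries that occur in both groups.
both <- intersect(in_sig, in_nonsig)

# Keep only the rows whose country is in the intersection.
result <- toy[toy$country %in% both, ]
```

With the real data, the same two steps would apply to the full data frame; only the column names `country` and `signatory` are taken from the question.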
My data looks like this:
| investor | year | activity | country | region | strategy | signatory |
|---|---|---|---|---|---|---|
| 123 IM | 2002 | 4.45 | France | europe | VC | 1 |
| 123 IM | 2003 | 3.2 | France | europe | VC | 1 |
| 123 IM | 2004 | 7.8 | France | europe | VC | 1 |
| 21Invest | 2002 | 4.45 | France | europe | VC | 0 |
| 21Invest | 2003 | 3.2 | France | europe | VC | 0 |
| 21Invest | 2004 | 7.8 | France | europe | VC | 0 |
| Aegon | 2005 | 5.4 | Netherlands | europe | BY | 1 |
| Aegon | 2006 | 4.2 | Netherlands | europe | BY | 1 |
| Aegon | 2007 | 1.3 | Netherlands | europe | BY | 1 |
| ING | 2005 | 5.4 | Netherlands | europe | BY | 0 |
| ING | 2006 | 4.2 | Netherlands | europe | BY | 0 |
| ING | 2007 | 1.3 | Netherlands | europe | BY | 0 |
| aberdeen | 2002 | 4.45 | UK | europe | VC | 1 |
| aberdeen | 2003 | 3.2 | UK | europe | VC | 1 |
| aberdeen | 2004 | 7.8 | UK | europe | VC | 1 |
| JPM | 2005 | 5.4 | USA | north america | BY | 0 |
| JPM | 2006 | 4.2 | USA | north america | BY | 0 |
| JPM | 2007 | 1.3 | USA | north america | BY | 0 |
The output I'm looking for is:
| investor | year | activity | country | region | strategy | signatory |
|---|---|---|---|---|---|---|
| 123 IM | 2002 | 4.45 | France | europe | VC | 1 |
| 123 IM | 2003 | 3.2 | France | europe | VC | 1 |
| 123 IM | 2004 | 7.8 | France | europe | VC | 1 |
| 21Invest | 2002 | 4.45 | France | europe | VC | 0 |
| 21Invest | 2003 | 3.2 | France | europe | VC | 0 |
| 21Invest | 2004 | 7.8 | France | europe | VC | 0 |
| Aegon | 2005 | 5.4 | Netherlands | europe | BY | 1 |
| Aegon | 2006 | 4.2 | Netherlands | europe | BY | 1 |
| Aegon | 2007 | 1.3 | Netherlands | europe | BY | 1 |
| ING | 2005 | 5.4 | Netherlands | europe | BY | 0 |
| ING | 2006 | 4.2 | Netherlands | europe | BY | 0 |
| ING | 2007 | 1.3 | Netherlands | europe | BY | 0 |
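Note: the firms in the UK and the USA are dropped, while the firms in France and the Netherlands are kept.
This is because both investor samples (signatory = 0 and signatory = 1) have units in France and the Netherlands, whereas the UK and the USA each appear in only one of the samples.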
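A cross-tabulation of country by group makes this reasoning easy to check. The sketch below is illustrative only: `toy`, `tab`, `in_both`, and `keep` are hypothetical names, and the rows are a reduced stand-in for the example data.

```r
# Illustrative stand-in data: one row per investor.
toy <- data.frame(
  country   = c("France", "France", "Netherlands", "Netherlands", "UK", "USA"),
  signatory = c(1, 0, 1, 0, 1, 0)
)

# Rows are countries, columns are signatory groups, cells are row counts.
tab <- table(toy$country, toy$signatory)

# A country qualifies when it has at least one row in every group.
in_both <- rowSums(tab > 0) == ncol(tab)
keep <- names(in_both)[in_both]
keep
```

Printing `tab` itself shows a zero in one column for the UK and the USA, which is why those two countries drop out.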
df <- data.frame(
  investor = c("123 IM", "123 IM", "123 IM", "21Invest", "21Invest", "21Invest", "Aegon", "Aegon", "Aegon", "ING", "ING", "ING", "aberdeen", "aberdeen", "aberdeen", "JPM", "JPM", "JPM"),
  year = c(2002, 2003, 2004, 2005, 2006, 2007, 2002, 2003, 2004, 2005, 2006, 2007, 2002, 2003, 2004, 2005, 2006, 2007),
  activity = c(4.45, 3.2, 7.8, 5.4, 4.2, 1.3, 4.45, 3.2, 7.8, 5.4, 4.2, 1.3, 4.45, 3.2, 7.8, 5.4, 4.2, 1.3),
  country = c("France", "France", "France", "France", "France", "France", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "UK", "UK", "UK", "USA", "USA", "USA"),
  region = c("europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "north america", "north america", "north america"),
  strategy = c("VC", "VC", "VC", "BY", "BY", "BY", "VC", "VC", "VC", "BY", "BY", "BY", "VC", "VC", "VC", "BY", "BY", "BY"),
  signatory = c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0)
)
# Desired output as a data frame (renamed so it does not overwrite df;
# a stray 13th investor entry was removed so all columns have 12 values).
df_expected <- data.frame(
  investor = c("123 IM", "123 IM", "123 IM", "21Invest", "21Invest", "21Invest", "Aegon", "Aegon", "Aegon", "ING", "ING", "ING"),
  year = c(2002, 2003, 2004, 2005, 2006, 2007, 2002, 2003, 2004, 2005, 2006, 2007),
  activity = c(4.45, 3.2, 7.8, 5.4, 4.2, 1.3, 4.45, 3.2, 7.8, 5.4, 4.2, 1.3),
  country = c("France", "France", "France", "France", "France", "France", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands"),
  region = c("europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe", "europe"),
  strategy = c("VC", "VC", "VC", "BY", "BY", "BY", "VC", "VC", "VC", "BY", "BY", "BY"),
  signatory = c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0)
)
Any tips would be greatly appreciated!
Rory
【Comments】:
-
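Your example data frame suggests you would end up with no rows at all, because all signatories are in France and all non-signatories are in the Netherlands. There is no overlap whatsoever. Can you provide a better dataset?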
-
Hi, I've just revised it following your comment!
-
Yes, I saw that. I've updated my answer as well.
-
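If my answer below answers your question, please click the check mark next to it to accept the answer. You can do the same for other answered questions you have asked previously. If you found any of them helpful, you can also upvote them.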
-
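Hi coffieinjunky, thanks for your help so far. It improved the data, but it doesn't seem to work perfectly: I still have 3 countries in the dataset with only 1 investor. Any idea how this happens?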
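To track down countries like these, it may help to count distinct investors per country after filtering. The sketch below is a diagnostic under assumptions, not an explanation of the actual data: `toy`, `n_investors`, and `single` are hypothetical names, and the investor "Solo" is invented to show one possible cause, namely an investor whose signatory status changes over time and who therefore puts a country in both groups alone.

```r
# Illustrative data: "Solo" appears with signatory = 0 and signatory = 1,
# so Spain passes a both-groups filter with a single investor.
# This cause is an assumption, not something confirmed by the thread.
toy <- data.frame(
  investor  = c("123 IM", "21Invest", "Solo", "Solo"),
  country   = c("France", "France", "Spain", "Spain"),
  year      = c(2002, 2002, 2002, 2003),
  signatory = c(1, 0, 0, 1)
)

# Number of distinct investors per country.
n_investors <- tapply(toy$investor, toy$country, function(x) length(unique(x)))

# Countries represented by exactly one investor.
single <- names(n_investors)[n_investors == 1]
single
```

If the countries flagged this way have an investor in both groups, the filter did what was asked of it, and the requirement would need to be tightened to "at least one distinct investor per group".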
Tags: r filter data-wrangling