【Question title】: Filtering data using dplyr function in R
【Posted】: 2019-06-10 17:12:41
【Question description】:

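I have a dataset of company directors. For example, company X had 3 directors in 2005, so there are three observations for company X in 2005. Each director has a unique ID, and each company has a unique ID such as an ISIN. I want to keep only the observations where this year's directors are the same as the previous year's, taken as a whole board: if this year's board consists of 1 new member and 2 members from the previous year, I do not want those observations.

The dataset for a single company looks like this: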

          ISIN year                    DirectorName   DirectorID
1  US9898171015 2006            Thomas (Tom) E Davin   2247441792
2  US9898171015 2006           Matthew (Matt) L Hyde   4842568996
3  US9898171015 2007             James (Jim) M Weber   3581636766
4  US9898171015 2007           Matthew (Matt) L Hyde   4842568996
5  US9898171015 2007         David (Dave) M DeMattei    759047198
6  US9898171015 2008             James (Jim) M Weber   3581636766
7  US9898171015 2008           Matthew (Matt) L Hyde   4842568996
8  US9898171015 2008         David (Dave) M DeMattei    759047198
9  US9898171015 2009 William (Bill) Milroy Barnum Jr  20462211719
10 US9898171015 2009             James (Jim) M Weber   3581636766
11 US9898171015 2009           Matthew (Matt) L Hyde   4842568996
12 US9898171015 2009         David (Dave) M DeMattei    759047198
13 US9898171015 2010 William (Bill) Milroy Barnum Jr  20462211719
14 US9898171015 2010             James (Jim) M Weber   3581636766
15 US9898171015 2010           Matthew (Matt) L Hyde   4842568996
16 US9898171015 2011      Sarah (Sally) Gaines McCoy  11434863691
17 US9898171015 2011 William (Bill) Milroy Barnum Jr  20462211719
18 US9898171015 2011             James (Jim) M Weber   3581636766
19 US9898171015 2011           Matthew (Matt) L Hyde   4842568996
20 US9898171015 2012      Sarah (Sally) Gaines McCoy  11434863691
21 US9898171015 2012                Ernest R Johnson  40425210975
22 US9898171015 2013      Sarah (Sally) Gaines McCoy  11434863691
23 US9898171015 2013                Ernest R Johnson  40425210975
24 US9898171015 2013                  Travis D Smith  53006212569
25 US9898171015 2014      Sarah (Sally) Gaines McCoy  11434863691
26 US9898171015 2014                Ernest R Johnson  40425210975
27 US9898171015 2014                  Travis D Smith  53006212569
28 US9898171015 2015                  Kalen F Holmes  11051172801
29 US9898171015 2015      Sarah (Sally) Gaines McCoy  11434863691
30 US9898171015 2015                Ernest R Johnson  40425210975
31 US9898171015 2015                  Travis D Smith  53006212569
32 US9898171015 2016      Sarah (Sally) Gaines McCoy  11434863691
33 US9898171015 2016                Ernest R Johnson  40425210975
34 US9898171015 2016                  Travis D Smith  53006212569
35 US9898171015 2017      Sarah (Sally) Gaines McCoy  11434863691
36 US9898171015 2017             Scott Andrew Bailey 174000000000
37 US9898171015 2017                Ernest R Johnson  40425210975
38 US9898171015 2017                  Travis D Smith  53006212569

I tried this code:

endo <- ac %>% 
  group_by(ISIN) %>% 
  filter(DirectorID == lag (DirectorID, 1))

After running the code above, I got the following result.

          ISIN year                    DirectorName  DirectorID
1  US9898171015 2007           Matthew (Matt) L Hyde  4842568996
2  US9898171015 2008             James (Jim) M Weber  3581636766
3  US9898171015 2008           Matthew (Matt) L Hyde  4842568996
4  US9898171015 2008         David (Dave) M DeMattei   759047198
5  US9898171015 2009             James (Jim) M Weber  3581636766
6  US9898171015 2009           Matthew (Matt) L Hyde  4842568996
7  US9898171015 2009         David (Dave) M DeMattei   759047198
8  US9898171015 2010 William (Bill) Milroy Barnum Jr 20462211719
9  US9898171015 2010             James (Jim) M Weber  3581636766
10 US9898171015 2010           Matthew (Matt) L Hyde  4842568996
11 US9898171015 2011 William (Bill) Milroy Barnum Jr 20462211719
12 US9898171015 2011             James (Jim) M Weber  3581636766
13 US9898171015 2011           Matthew (Matt) L Hyde  4842568996
14 US9898171015 2012      Sarah (Sally) Gaines McCoy 11434863691
15 US9898171015 2013      Sarah (Sally) Gaines McCoy 11434863691
16 US9898171015 2013                Ernest R Johnson 40425210975
17 US9898171015 2014      Sarah (Sally) Gaines McCoy 11434863691
18 US9898171015 2014                Ernest R Johnson 40425210975
19 US9898171015 2014                  Travis D Smith 53006212569
20 US9898171015 2015      Sarah (Sally) Gaines McCoy 11434863691
21 US9898171015 2015                Ernest R Johnson 40425210975
22 US9898171015 2015                  Travis D Smith 53006212569
23 US9898171015 2016      Sarah (Sally) Gaines McCoy 11434863691
24 US9898171015 2016                Ernest R Johnson 40425210975
25 US9898171015 2016                  Travis D Smith 53006212569
26 US9898171015 2017      Sarah (Sally) Gaines McCoy 11434863691
27 US9898171015 2017                Ernest R Johnson 40425210975
28 US9898171015 2017                  Travis D Smith 53006212569

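Checking the first dataset (the data before running the code) by hand, the board composition is clearly identical only in 2007 and 2008, and in 2013 and 2014. So I want only those observations.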

But the second dataset (the data after running the code) does not give the expected result.
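
As a rough illustration of why a row-wise lag() cannot express "same board as last year", the sketch below uses only the 2007 and 2008 rows from the data above. It assumes dplyr is installed and that lag refers to dplyr::lag; the names board, row_match and same_board are made up for this example.

```r
# Toy subset of the question's data: the 2007 and 2008 boards,
# which contain the same three directors.
board <- data.frame(
  year = c(2007, 2007, 2007, 2008, 2008, 2008),
  DirectorID = c(3581636766, 4842568996, 759047198,
                 3581636766, 4842568996, 759047198)
)

# dplyr::lag() shifts the column down by one row, so each ID is compared
# with the ID on the row directly above it, not with last year's board.
row_match <- board$DirectorID == dplyr::lag(board$DirectorID)
print(row_match)   # NA FALSE FALSE FALSE FALSE FALSE

# Comparing the boards as sets of IDs shows that the two years match.
same_board <- setequal(
  board$DirectorID[board$year == 2007],
  board$DirectorID[board$year == 2008]
)
print(same_board)  # TRUE
```

So the comparison has to be made between whole groups of rows (one group per company and year) rather than between neighbouring rows.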

The expected result is:

          ISIN year               DirectorName  DirectorID
1  US9898171015 2007        James (Jim) M Weber  3581636766
2  US9898171015 2007      Matthew (Matt) L Hyde  4842568996
3  US9898171015 2007    David (Dave) M DeMattei   759047198
4  US9898171015 2008        James (Jim) M Weber  3581636766
5  US9898171015 2008      Matthew (Matt) L Hyde  4842568996
6  US9898171015 2008    David (Dave) M DeMattei   759047198
7  US9898171015 2013 Sarah (Sally) Gaines McCoy 11434863691
8  US9898171015 2013           Ernest R Johnson 40425210975
9  US9898171015 2013             Travis D Smith 53006212569
10 US9898171015 2014 Sarah (Sally) Gaines McCoy 11434863691
11 US9898171015 2014           Ernest R Johnson 40425210975
12 US9898171015 2014             Travis D Smith 53006212569

Thanks for your help.

【Question discussion】:

Tags: r dplyr


【Solution 1】:

This is verbose and probably inefficient, but it gets the job done using nested data frames.

library(dplyr)
library(purrr)
library(readr)
library(tidyr)

"ROW,ISIN,YEAR,DIRECTOR_NAME,DIRECTOR_ID
1,US9898171015,2006,Thomas (Tom) E Davin,2247441792
2,US9898171015,2006,Matthew (Matt) L Hyde,4842568996
3,US9898171015,2007,James (Jim) M Weber,3581636766
4,US9898171015,2007,Matthew (Matt) L Hyde,4842568996
5,US9898171015,2007,David (Dave) M DeMattei,759047198
6,US9898171015,2008,James (Jim) M Weber,3581636766
7,US9898171015,2008,Matthew (Matt) L Hyde,4842568996
8,US9898171015,2008,David (Dave) M DeMattei,759047198
9,US9898171015,2009,William (Bill) Milroy Barnum Jr,20462211719
10,US9898171015,2009,James (Jim) M Weber,3581636766
11,US9898171015,2009,Matthew (Matt) L Hyde,4842568996
12,US9898171015,2009,David (Dave) M DeMattei,759047198
13,US9898171015,2010,William (Bill) Milroy Barnum Jr,20462211719
14,US9898171015,2010,James (Jim) M Weber,3581636766
15,US9898171015,2010,Matthew (Matt) L Hyde,4842568996
16,US9898171015,2011,Sarah (Sally) Gaines McCoy,11434863691
17,US9898171015,2011,William (Bill) Milroy Barnum Jr,20462211719
18,US9898171015,2011,James (Jim) M Weber,3581636766
19,US9898171015,2011,Matthew (Matt) L Hyde,4842568996
20,US9898171015,2012,Sarah (Sally) Gaines McCoy,11434863691
21,US9898171015,2012,Ernest R Johnson,40425210975
22,US9898171015,2013,Sarah (Sally) Gaines McCoy,11434863691
23,US9898171015,2013,Ernest R Johnson,40425210975
24,US9898171015,2013,Travis D Smith,53006212569
25,US9898171015,2014,Sarah (Sally) Gaines McCoy,11434863691
26,US9898171015,2014,Ernest R Johnson,40425210975
27,US9898171015,2014,Travis D Smith,53006212569
28,US9898171015,2015,Kalen F Holmes,11051172801
29,US9898171015,2015,Sarah (Sally) Gaines McCoy,11434863691
30,US9898171015,2015,Ernest R Johnson,40425210975
31,US9898171015,2015,Travis D Smith,53006212569
32,US9898171015,2016,Sarah (Sally) Gaines McCoy,11434863691
33,US9898171015,2016,Ernest R Johnson,40425210975
34,US9898171015,2016,Travis D Smith,53006212569
35,US9898171015,2017,Sarah (Sally) Gaines McCoy,11434863691
36,US9898171015,2017,Scott Andrew Bailey,174000000000
37,US9898171015,2017,Ernest R Johnson,40425210975
38,US9898171015,2017,Travis D Smith,53006212569
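" %>% 
  read_csv() %>% 
  # One row per company-year; the director rows move into a list-column.
  # Note: `.key` is deprecated in tidyr >= 1.0.0.
  group_by(ISIN, YEAR) %>% 
  nest(.key = "OTHER_DATA") %>% 
  group_by(ISIN) %>% 
  # Put the previous and the next year's board next to the current one.
  mutate(OTHER_DATA_LAG = lag(OTHER_DATA, 1), 
         OTHER_DATA_LEAD = lead(OTHER_DATA, 1), 
         # Keep a year if its director IDs match the previous or the next year's.
         # Note: all_equal() is deprecated in dplyr >= 1.1.0.
         KEEP = pmap(list(OTHER_DATA_LAG, OTHER_DATA, OTHER_DATA_LEAD), function(x, y, z) {
           isTRUE(all_equal(x["DIRECTOR_ID"], y["DIRECTOR_ID"])) || 
           isTRUE(all_equal(y["DIRECTOR_ID"], z["DIRECTOR_ID"]))
         })) %>% 
  filter(unlist(KEEP)) %>% 
  # Drop the helper columns and expand back to one row per director.
  select(-OTHER_DATA_LAG, -OTHER_DATA_LEAD, -KEEP) %>% 
  unnest() %>% 
  ungroup()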

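【Discussion】:

  • Hi Matthew: it works great. Thank you very much for your help. Do you know how I can get the observations where the board composition changed? Mathematically, it would be (original dataset - new dataset created with your command). Thanks again for your help.
  • @Sharif To get the complement, negate isTRUE() and change || to &&. So the lines now look like: !isTRUE(all_equal(x["DIRECTOR_ID"], y["DIRECTOR_ID"])) && !isTRUE(all_equal(y["DIRECTOR_ID"], z["DIRECTOR_ID"])).
  • Hi @Matthew, thank you very much for your help. This helped me a lot.
  • I had been using your code (the accepted answer) until recently, but now it no longer works. It shows this error: Error: Problem with `mutate()` input `KEEP`. x argument is of length zero. i Input `KEEP` is `pmap(...)`. i The error occurred in group 1: ISIN = "AN8068571086". Run `rlang::last_error()` to see where the error occurred. In addition: Warning message: `.key` is deprecated. Do you know how I can run your code? I asked about this in a recent post and got an answer that changes your code slightly, but when I run that code I do not get the same observations as with yours. Thanks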
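
One possible way around the error reported above is to avoid nest(.key = ...) and all_equal() altogether. The following is a minimal sketch, not the accepted answer's method: it swaps in a different technique (a per-year string "signature" of the sorted director IDs, compared with lag() and lead(), followed by a semi_join()). It assumes dplyr >= 1.0 and the column names ISIN, YEAR and DIRECTOR_ID from the answer; same_board_years, BOARD and KEEP are hypothetical names. Like the original, it compares neighbouring years present in the data and does not check for gaps between calendar years. Duplicate director rows within a year are collapsed by unique(), which may differ from all_equal().

```r
library(dplyr)

# Hypothetical helper: keep the rows of every company-year whose board equals
# the board of the previous or the next year present for that company.
same_board_years <- function(df) {
  boards <- df %>%
    group_by(ISIN, YEAR) %>%
    # One string per company-year, e.g. "id1|id2|id3", with the IDs sorted
    # so that row order within a year does not matter.
    summarise(BOARD = paste(sort(unique(DIRECTOR_ID)), collapse = "|"),
              .groups = "drop") %>%
    arrange(ISIN, YEAR) %>%
    group_by(ISIN) %>%
    # The first/last year has no neighbour; coalesce() turns that NA into FALSE.
    mutate(KEEP = coalesce(BOARD == lag(BOARD), FALSE) |
                  coalesce(BOARD == lead(BOARD), FALSE)) %>%
    ungroup()

  semi_join(df, filter(boards, KEEP), by = c("ISIN", "YEAR"))
}

# Rows 20 to 31 of the question's data (years 2012 to 2015).
directors <- data.frame(
  ISIN = "US9898171015",
  YEAR = c(2012, 2012, 2013, 2013, 2013, 2014, 2014, 2014,
           2015, 2015, 2015, 2015),
  DIRECTOR_ID = c(11434863691, 40425210975,
                  11434863691, 40425210975, 53006212569,
                  11434863691, 40425210975, 53006212569,
                  11051172801, 11434863691, 40425210975, 53006212569)
)

unchanged <- same_board_years(directors)  # rows for 2013 and 2014
# The complement asked about earlier in the comments: original minus unchanged.
changed <- anti_join(directors, unchanged, by = c("ISIN", "YEAR"))
```

Here anti_join() plays the role of "original dataset minus new dataset"; negating the condition inside the pipeline, as suggested above, should give the same rows.
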
【Solution 2】:

It looks like what you are trying to do is identify when a repeat occurs. You might expect

a <- c(1,2,2,3)
a == lag(a)

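to produce TRUE for the third element and FALSE elsewhere. But it doesn't; what's going on? With base R's stats::lag(), only the time-series attributes are changed and the values stay in place, so nothing is actually shifted (dplyr::lag() does shift the values).

The problem with lag is discussed in more detail in this blog post: https://heuristically.wordpress.com/2012/10/29/lag-function-for-data-frames/

The blog post has a more elaborate version, but depending on your needs the following may be enough: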

mylag <- function(v) { c(NA, head(v, -1)) }
a == mylag(a)
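
To make the difference concrete, here is a small sketch that runs the same comparison with base R's stats::lag(), with dplyr::lag() and with mylag() from above. It assumes dplyr is installed; the variable names base_cmp, dplyr_cmp and mylag_cmp are made up for this example.

```r
a <- c(1, 2, 2, 3)
mylag <- function(v) { c(NA, head(v, -1)) }

# stats::lag() only changes the time-series attribute (tsp); the values are
# not moved, so every element is compared with itself.
base_cmp <- as.vector(a == stats::lag(a))
print(base_cmp)   # TRUE TRUE TRUE TRUE

# dplyr::lag() and mylag() both shift the values down by one position.
dplyr_cmp <- a == dplyr::lag(a)
mylag_cmp <- a == mylag(a)
print(dplyr_cmp)  # NA FALSE TRUE FALSE
print(mylag_cmp)  # NA FALSE TRUE FALSE
```

Which lag() is called depends on whether dplyr is attached, so writing the namespace explicitly (stats::lag or dplyr::lag) avoids the ambiguity.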

【Discussion】:
