【Question title】: R - Subset of data to get corresponding columns but no duplicates
【Posted】: 2021-01-14 19:49:48
【Question】:

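I have a dataset in which each row is an individual crime. One column is the LSOA code and another is the place name. Naturally, multiple rows share the same name and LSOA code. I want to end up with a data frame that lists each area name once, alongside its corresponding LSOA code. I have been going round in circles trying to do this with subsetting, counts, frequencies and so on, but I either lose one of the columns or it simply doesn't work.

Here is a sample of the dataset.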

 Code   |    Name      | Crime_Type |   Outcome
----------------------------------------------------
E01000852 Camden 026C     Vehicle     Under investigation
E01000982 Croydon 017C    Other       Unable to prosecute
E01000982 Croydon 017C    Other       Under investigation
E01003950 Southwark 032B  Assault     Status update unavailable
E01003950 Southwark 032B  Violence    Under investigation   
E01003950 Southwark 032B  Other       Under investigation

This is the output I want:

Code  |    Name      
-----------------
E01000852 Camden 026C
E01000982 Croydon 017C
E01003950 Southwark 032B

I tried the following, but I lose the name column.

name <- as.data.frame(table(data$Code)) 

Any help is appreciated.

dput(head(data, 10))

structure(list(code = c("E01000013", "E01000852", "E01000982", 
"E01000982", "E01000996", "E01001227", "E01001591", "E01001751", 
"E01002848", "E01003171"), name = c("Barking and Dagenham 013A", 
"Camden 026C", "Croydon 017C", "Croydon 017C", "Croydon 009C", 
"Ealing 019D", "Greenwich 012C", "Hackney 021D", "Kensington and Chelsea 015C", 
"Lambeth 020B"), crime_type = c("Public order", "Vehicle crime", 
"Other crime", "Violence and sexual offences", "Violence and sexual offences", 
"Violence and sexual offences", "Violence and sexual offences", 
"Violence and sexual offences", "Other crime", "Violence and sexual offences"
), outcome_category = c("Unable to prosecute suspect", "Further investigation is not in the public interest", 
"Under investigation", "Under investigation", "Under investigation", 
"Status update unavailable", "Status update unavailable", "Under investigation", 
"Under investigation", "Unable to prosecute suspect"), outcome_recode = c("0", 
"1", NA, NA, NA, NA, NA, NA, NA, "0"), density = c(8927, 16348, 
11760, 11760, 11302, 8537, 10382, 11269, 17929, 16309), population = c(1855, 
2037, 1610, 1610, 1189, 1476, 2095, 1732, 1472, 1701), IMD_value = c(2, 
6, 5, 5, 5, 8, 3, 5, 3, 4), urban_rural_class = c("Urban major conurbation", 
"Urban major conurbation", "Urban major conurbation", "Urban major conurbation", 
"Urban major conurbation", "Urban major conurbation", "Urban major conurbation", 
"Urban major conurbation", "Urban major conurbation", "Urban major conurbation"
)), row.names = c(NA, 10L), class = "data.frame")

【Comments】:

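  • Do you only want the unique names? What should happen to the rest of the information?
  • I don't need the other two columns, only the first two.
  • Could you provide your data via dput(head(df, n))? Have you tried unique(df$code)?
  • The data seems to differ from the example, but try df %>% filter(!duplicated(name)) %>% select(1:2) using dplyr.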

Tags: r


【Solution 1】:
library(dplyr)

df %>% group_by(code, name) %>% slice_head() %>% select(1:2)

# A tibble: 9 x 2
# Groups:   code, name [9]
  code      name                       
  <chr>     <chr>                      
1 E01000013 Barking and Dagenham 013A  
2 E01000852 Camden 026C                
3 E01000982 Croydon 017C               
4 E01000996 Croydon 009C               
5 E01001227 Ealing 019D                
6 E01001591 Greenwich 012C             
7 E01001751 Hackney 021D               
8 E01002848 Kensington and Chelsea 015C
9 E01003171 Lambeth 020B  
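
As a possible alternative to the grouped approach above, dplyr's distinct() can return the unique code/name pairs directly. The sketch below is only an illustration: the data frame is a five-row sample rebuilt from the question's dput() output, and `lookup` is a name chosen here, not one from the original post. Unlike the group_by() result, the data frame returned by distinct() is not grouped.

```r
library(dplyr)

# Small sample rebuilt from the first rows of the question's dput() output
df <- data.frame(
  code = c("E01000013", "E01000852", "E01000982", "E01000982", "E01000996"),
  name = c("Barking and Dagenham 013A", "Camden 026C", "Croydon 017C",
           "Croydon 017C", "Croydon 009C"),
  crime_type = c("Public order", "Vehicle crime", "Other crime",
                 "Violence and sexual offences", "Violence and sexual offences")
)

# Keep one row per unique (code, name) pair; only the named columns are returned
lookup <- df %>% distinct(code, name)
print(lookup)
```

Naming the columns by hand avoids relying on their position, which select(1:2) does.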

Or in base R:

df[match(unique(df$code), df$code), 1:2]

        code                        name
1  E01000013   Barking and Dagenham 013A
2  E01000852                 Camden 026C
3  E01000982                Croydon 017C
5  E01000996                Croydon 009C
6  E01001227                 Ealing 019D
7  E01001591              Greenwich 012C
8  E01001751                Hackney 021D
9  E01002848 Kensington and Chelsea 015C
10 E01003171                Lambeth 020B

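Note that the row names here are carried over from the original data frame, which is why 4 is missing: that row was the duplicate.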
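
If the gaps in the row names are unwanted, one option is to take unique() over the two columns and then reset the row names. This is a minimal sketch on a sample rebuilt from the question's dput() output; `lookup` is an illustrative name. The final check assumes each LSOA code should map to exactly one name, which holds for the sample but is worth verifying on the full data.

```r
# Small sample rebuilt from the first rows of the question's dput() output
df <- data.frame(
  code = c("E01000013", "E01000852", "E01000982", "E01000982", "E01000996"),
  name = c("Barking and Dagenham 013A", "Camden 026C", "Croydon 017C",
           "Croydon 017C", "Croydon 009C"),
  crime_type = c("Public order", "Vehicle crime", "Other crime",
                 "Violence and sexual offences", "Violence and sexual offences")
)

# unique() on a data frame drops rows that repeat an earlier (code, name) pair
lookup <- unique(df[, c("code", "name")])

# Reset row names so they run 1..n instead of keeping the original indices
rownames(lookup) <- NULL

# Sanity check (assumption): no code appears with more than one name
stopifnot(anyDuplicated(lookup$code) == 0)
print(lookup)
```

Unlike match(unique(df$code), df$code), this deduplicates on both columns, so a code that appeared with two different names would show up twice and fail the check rather than being silently collapsed.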
