【Question title】: Using R to track and remove duplicated sets
【Posted】: 2020-06-06 22:57:27
【Question description】:

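I have a raw data set (metering data). The metering data is duplicated because a record is copied once for every Group name, Reading Client ID, or Reading Short Name that exists in the database. Unfortunately, this differs per Meter ID: in some cases there are no duplicates, in others the same data appears two or even three times. As an aid, every record carries its timestamp in the "Reading timestamp" column.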

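Problem: I want to scan by Meter ID only and, whenever the same data has been copied for another Group name, Reading Client ID, or Reading Short Name, drop the copies and keep a single set of data. See the example below: a new copy begins wherever the timestamp jumps back to 1580597999.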

What I have tried: the duplicated function, and the following:

    library(dplyr)

    # Keeps only the first row for each Meter.ID
    df2 <- df %>%
      distinct(Meter.ID, .keep_all = TRUE)

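My current approach is "too" selective and not generic, and I am struggling to find a general solution. Ideally it would use the timestamp, which starts over every time the data is copied.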
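The timestamp idea can be sketched in base R as follows. This is only a sketch under stated assumptions, not a tested solution for the real database export: it assumes the rows are still in their original file order and that timestamps strictly decrease inside each copy, as they do in the sample. The data frame, its column names (`meter_id`, `group_name`, `timestamp`), the meter 205, and the helper `copy_index` are all made up for illustration.

```r
# Made-up sample: two copies for meter 204, one copy for a hypothetical meter 205.
# Timestamps descend within each copy and start over when a new copy begins.
df <- data.frame(
  meter_id   = c(204, 204, 204, 204, 204, 204, 205, 205),
  group_name = c("A", "A", "A", "B", "B", "B", "C", "C"),
  timestamp  = c(1580597999, 1580594400, 1580590800,
                 1580597999, 1580594400, 1580590800,
                 1580597999, 1580594400),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number the copies within one meter.
# A new copy starts at the first row and wherever the timestamp does not decrease.
copy_index <- function(ts) cumsum(c(TRUE, diff(ts) >= 0))

# Apply the helper separately to each meter, keeping the original row order
df$copy <- ave(df$timestamp, df$meter_id, FUN = copy_index)

# Keep only the first copy of every meter
first_copy <- df[df$copy == 1, ]
```

Because the copy number is computed per meter, meters with one, two, or three copies are handled the same way. If the rows have been re-sorted since export, this approach no longer applies and a key-based deduplication is the safer choice.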

Data sample:

"Meter ID","Group name","Reading Client ID","Reading Short Name",Reading,"Reading timestamp",Reading2
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580597999," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580594400," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580590800," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580587200," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580583600," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580580000," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580576400," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580572800," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580569200," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580565600," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580562000," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580558400," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580554800," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580551200," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580547600," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580544000," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580540400," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580536800," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580533200," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580529600," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580526000," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580522400," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580518800," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580515200," - "
204,0100G,199,06865,90.436,1580597999," - "
204,0100G,199,06865,90.436,1580594400," - "
204,0100G,199,06865,90.436,1580590800," - "
204,0100G,199,06865,90.436,1580587200," - "
204,0100G,199,06865,90.436,1580583600," - "
204,0100G,199,06865,90.436,1580580000," - "
204,0100G,199,06865,90.436,1580576400," - "
204,0100G,199,06865,90.436,1580572800," - "
204,0100G,199,06865,90.436,1580569200," - "
204,0100G,199,06865,90.436,1580565600," - "
204,0100G,199,06865,90.436,1580562000," - "
204,0100G,199,06865,90.436,1580558400," - "
204,0100G,199,06865,90.436,1580554800," - "
204,0100G,199,06865,90.436,1580551200," - "
204,0100G,199,06865,90.436,1580547600," - "
204,0100G,199,06865,90.436,1580544000," - "
204,0100G,199,06865,90.436,1580540400," - "
204,0100G,199,06865,90.436,1580536800," - "
204,0100G,199,06865,90.436,1580533200," - "
204,0100G,199,06865,90.436,1580529600," - "
204,0100G,199,06865,90.436,1580526000," - "
204,0100G,199,06865,90.436,1580522400," - "
204,0100G,199,06865,90.436,1580518800," - "
204,0100G,199,06865,90.436,1580515200," - "
204,"0100G test2",199,06865,90.436,1580597999," - "
204,"0100G test2",199,06865,90.436,1580594400," - "
204,"0100G test2",199,06865,90.436,1580590800," - "
204,"0100G test2",199,06865,90.436,1580587200," - "
204,"0100G test2",199,06865,90.436,1580583600," - "
204,"0100G test2",199,06865,90.436,1580580000," - "
204,"0100G test2",199,06865,90.436,1580576400," - "
204,"0100G test2",199,06865,90.436,1580572800," - "
204,"0100G test2",199,06865,90.436,1580569200," - "
204,"0100G test2",199,06865,90.436,1580565600," - "
204,"0100G test2",199,06865,90.436,1580562000," - "
204,"0100G test2",199,06865,90.436,1580558400," - "
204,"0100G test2",199,06865,90.436,1580554800," - "
204,"0100G test2",199,06865,90.436,1580551200," - "
204,"0100G test2",199,06865,90.436,1580547600," - "
204,"0100G test2",199,06865,90.436,1580544000," - "
204,"0100G test2",199,06865,90.436,1580540400," - "
204,"0100G test2",199,06865,90.436,1580536800," - "
204,"0100G test2",199,06865,90.436,1580533200," - "
204,"0100G test2",199,06865,90.436,1580529600," - "
204,"0100G test2",199,06865,90.436,1580526000," - "
204,"0100G test2",199,06865,90.436,1580522400," - "
204,"0100G test2",199,06865,90.436,1580518800," - "
204,"0100G test2",199,06865,90.436,1580515200," - "

Desired result after processing:

204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580597999," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580594400," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580590800," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580587200," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580583600," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580580000," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580576400," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580572800," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580569200," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580565600," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580562000," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580558400," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580554800," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580551200," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580547600," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580544000," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580540400," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580536800," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580533200," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580529600," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580526000," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580522400," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580518800," - "
204,"Strefa XX Pomorzany Kępa",199,06865,90.436,1580515200," - "

【Comments】:

  • I can think of a solution, but how should the code decide which of the three copies to keep? This matters for the Group name column.

Tags: r dataframe


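【Solution 1】:

Here is a dplyr-based solution, assuming the OP wants to keep the first instance of each duplicated row. I assume the data is stored in a CSV file named Rtmp.csv.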

    library(dplyr)

    ## Read the data
    readr::read_csv("Rtmp.csv") %>%
      ## Clean column names to remove spaces
      janitor::clean_names() %>%
      ## Remove duplicates, keeping the first row per meter/client/timestamp
      distinct(meter_id, reading_client_id, reading_timestamp, .keep_all = TRUE)
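For readers without dplyr, readr, or janitor, the same key-based idea can be sketched in base R. This is a rough equivalent rather than the answer's own code: the inline `csv_text` holds only four rows copied from the sample above, and the variable names `csv_text`, `keys`, and `deduped` are chosen here for illustration. Reading "Reading Short Name" as character is an added precaution, since `read.csv` would otherwise turn 06865 into the number 6865.

```r
# Four rows taken from the sample: two group names, two timestamps each
csv_text <- '"Meter ID","Group name","Reading Client ID","Reading Short Name",Reading,"Reading timestamp",Reading2
204,0100G,199,06865,90.436,1580597999," - "
204,0100G,199,06865,90.436,1580594400," - "
204,"0100G test2",199,06865,90.436,1580597999," - "
204,"0100G test2",199,06865,90.436,1580594400," - "'

# check.names = FALSE keeps the original column names (with spaces);
# the named colClasses entries keep leading zeros in the short name
df <- read.csv(
  text = csv_text,
  check.names = FALSE,
  colClasses = c("Group name" = "character",
                 "Reading Short Name" = "character"),
  stringsAsFactors = FALSE
)

# Columns that identify one reading regardless of the group it was copied for
keys <- c("Meter ID", "Reading Client ID", "Reading timestamp")

# duplicated() on a data frame flags rows whose key combination was seen before
deduped <- df[!duplicated(df[, keys]), ]
```

As with `distinct(..., .keep_all = TRUE)`, the row that survives is the first one encountered, so here the `0100G` rows are kept and the `0100G test2` rows are dropped.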

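【Comments】:

    【Solution 2】:

    You can use the duplicated function to filter out rows. It is not necessary for this data set, but you may want to sort the data by group name before removing duplicates. Negating duplicated() keeps the first instance of each duplicated value.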

    # Sort by group name so the preferred copy comes first
    df2 <- df[order(df[['Group name']], decreasing=TRUE),]
    # Keep the first row for each timestamp
    df <- df2[!duplicated(df2[["Reading timestamp"]]),]
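The effect of sorting before deduplicating can be seen in a small sketch. Two things here are assumptions added for illustration, not part of the answer: the tiny inline data frame, and the use of both "Meter ID" and "Reading timestamp" as the key. The second matters once several meters are present, because a key on the timestamp alone would merge readings from different meters that share a timestamp.

```r
# Made-up subset of the sample: one meter, two group names, two timestamps
df <- data.frame(
  `Meter ID`          = c(204, 204, 204, 204),
  `Group name`        = c("0100G", "0100G", "0100G test2", "0100G test2"),
  `Reading timestamp` = c(1580597999, 1580594400, 1580597999, 1580594400),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# The sort order decides which copy survives: with decreasing = TRUE,
# "0100G test2" sorts before "0100G"
sorted <- df[order(df[["Group name"]], decreasing = TRUE), ]

# Deduplicate on meter and timestamp together, keeping the first row of each pair
keys <- c("Meter ID", "Reading timestamp")
result <- sorted[!duplicated(sorted[, keys]), ]
```

With this ordering the `0100G test2` rows are kept; dropping `decreasing = TRUE` would keep the `0100G` rows instead. This is one way to answer the question raised in the comments about which copy to choose.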
    

    【Comments】:
