【Question title】: Unique values by date
【Posted】: 2017-03-17 17:05:00
【Question】:

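I would like to get a data frame with two columns: 1. the distinct fruits (no duplicates) 2. the first date on which each specific fruit (e.g. kiwis) appears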

fruits <- c("apples, oranges, pears, bananas",
"pineapples, mangos, guavas",
"bananas, apples, kiwis") 


fruits<-as.data.frame(fruits)
fruits$date<-c( "12.8.16", "22.4.17", "12.9.16")

fruits[with(fruits, order(date)), ]

I tried writing a loop and using the match command, but neither identifies the unique string values.

Thanks in advance! 詹尼斯

【Question comments】:

    Tags: r match unique


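    【Solution 1】:

    Here are some solutions:

    1) strsplit/unnest/summarize This uses dplyr and tidyr. First convert the date column to "Date" class, then split the fruits column, which produces a list column in which each cell holds a vector of fruits. Then unnest that column and take the minimum date for each fruit: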

    library(dplyr)
    library(tidyr)
    
    fruits %>%
           mutate(date = as.Date(date, "%d.%m.%y"),
                  fruits = strsplit(as.character(fruits), ", ")) %>%
           unnest(fruits) %>%  # name the list column; tidyr >= 1.0 warns if `cols` is omitted
           group_by(fruits) %>%
           summarize(date = min(date)) %>%
           ungroup
    

    giving:

    # A tibble: 8 × 2
          fruits       date
           <chr>     <date>
    1     apples 2016-08-12
    2    bananas 2016-08-12
    3     guavas 2017-04-22
    4      kiwis 2016-09-12
    5     mangos 2017-04-22
    6    oranges 2016-08-12
    7      pears 2016-08-12
    8 pineapples 2017-04-22
    

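    1a) separate_rows/summarize This slightly shorter variation uses separate_rows, which replaces the strsplit and unnest lines with a single simpler command. It requires tidyr 0.5 or later. It gives the same result: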

    fruits %>%
           mutate(date = as.Date(date, "%d.%m.%y")) %>%
           separate_rows(fruits) %>%
           group_by(fruits) %>%
           summarize(date = min(date)) %>%
           ungroup
    

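    2) strsplit/stack/aggregate This does not use any packages. First we split the fruits column and name the components of the resulting list L with the dates. Then we stack the list into a data frame and rename its columns, creating a true "Date" class column at the same time. Finally we use aggregate to find the minimum date for each fruit.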

    L <- with(fruits, setNames(strsplit(as.character(fruits), ", "), as.Date(date,"%d.%m.%y")))
    stk <- with(stack(L), data.frame(fruits = values, date = as.Date(ind)))
    aggregate(date ~ fruits, stk, min)
    

    giving this data.frame:

          fruits       date
    1     apples 2016-08-12
    2    bananas 2016-08-12
    3     guavas 2017-04-22
    4      kiwis 2016-09-12
    5     mangos 2017-04-22
    6    oranges 2016-08-12
    7      pears 2016-08-12
    8 pineapples 2017-04-22
    
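To make the intermediate objects in approach 2 easier to follow, here is a minimal sketch that rebuilds the question's data and prints each step. It assumes the fruits column is kept as character (`stringsAsFactors = FALSE`), and the names `dates`, `long` and `first_seen` are only illustrative, not part of the original answer.

```r
# Data from the question, with strings kept as character (an assumption)
fruits <- data.frame(
  fruits = c("apples, oranges, pears, bananas",
             "pineapples, mangos, guavas",
             "bananas, apples, kiwis"),
  date = c("12.8.16", "22.4.17", "12.9.16"),
  stringsAsFactors = FALSE
)

# Named list: one character vector of fruits per row, named by its date
dates <- as.Date(fruits$date, "%d.%m.%y")
L <- setNames(strsplit(fruits$fruits, ", "), format(dates))
str(L)

# stack() reshapes the named list to long form:
# 'values' holds the fruit, 'ind' holds the list name (the date) as a factor
long <- stack(L)
print(long)

# Convert 'ind' back to Date and keep the earliest date per fruit
stk <- data.frame(fruits = long$values,
                  date = as.Date(as.character(long$ind)))
first_seen <- aggregate(date ~ fruits, stk, min)
print(first_seen)
```

Printing `long` shows one row per fruit occurrence, so fruits that appear in several rows of the input (apples, bananas) appear several times until `aggregate` reduces them to their earliest date.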

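    【Comments】:

      【Solution 2】:

      Here is an approach using the splitstackshape package, which uses the data.table package under the hood. We can use cSplit() to split the fruits column at the commas, then use data.table syntax to take the minimum date for each fruit: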

      library(splitstackshape)
      ## create the long data frame from the split 'fruits' column
      DT <- cSplit(fruits, "fruits", sep = ",", direction = "long")
      ## convert the 'date' column to date class and take the minimum row
      DT[, .(date = min(as.IDate(date, "%d.%m.%y"))), by = fruits]
      #        fruits       date
      # 1:     apples 2016-08-12
      # 2:    oranges 2016-08-12
      # 3:      pears 2016-08-12
      # 4:    bananas 2016-08-12
      # 5: pineapples 2017-04-22
      # 6:     mangos 2017-04-22
      # 7:     guavas 2017-04-22
      # 8:      kiwis 2016-09-12
      

      【Comments】:

        【Solution 3】:

        I think this is what you want.

        fruits <- c("apples, oranges, pears, bananas",
                "pineapples, mangos, guavas",
                "bananas, apples, kiwis") 
        
        fruits<-as.data.frame(fruits,stringsAsFactors=FALSE) #probably easier for the fruits to be strings rather than factors
        fruits$date<-as.Date(c( "12.8.16", "22.4.17", "12.9.16"),format="%d.%m.%y") #and set your dates to be Dates rather than strings (otherwise they will be sorted alphabetically)
        
        fruits[with(fruits, order(date)), ]
        
        #need to convert your df to one-fruit-per-row
        fruits2 <- do.call(rbind, #this binds together the data frames created by the lapply loop
                       lapply(1:nrow(fruits), #loops through the rows of fruits df to create a list of data frames, each corresponding to one row
                              function(i) data.frame(fruit=trimws(strsplit((fruits$fruits),",")[[i]]), #splits your strings at commas, and trims off the whitespace
                                                     date=fruits$date[i],stringsAsFactors = FALSE))) #adds the date corresponding to each row
        
        #finding the first appearance is easily done using dplyr
        library(dplyr)
        fruits3 <- fruits2 %>% group_by(fruit) %>% summarise(firstdate=min(date))
        
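The comment in the code above about converting the dates to Date class can be illustrated with a small sketch. The two date strings below are made up for illustration (they are not from the question), chosen so that alphabetical and chronological order disagree.

```r
# Hypothetical date strings in the question's day.month.year format
x <- c("01.02.17", "15.01.16")

# As plain strings they sort alphabetically: "01.02.17" comes first
string_order <- order(x)

# As Date objects they sort chronologically: 2016-01-15 comes first
d <- as.Date(x, format = "%d.%m.%y")
date_order <- order(d)

print(x[string_order])
print(d[date_order])
```

With the question's own three dates the two orderings happen to agree, which is why the problem is easy to miss there.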

        Another way is to set up a data frame with the unique fruit names, then use grep to find the first date for each fruit...

        fruits <- fruits[order(fruits$date),]
        firstfruits <- data.frame(fruit=unique(trimws(unlist(strsplit(fruits$fruits,",")))),stringsAsFactors = FALSE)
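        # \\b (word boundary) restricts the match to whole words, so "apples" cannot match inside "pineapples"
        firstfruits$date <- do.call(c,lapply(firstfruits$fruit, function(f) fruits$date[grep(paste0("\\b",f,"\\b"),fruits$fruits)[1]]))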
        

        【Comments】:
