【Question title】: Create a variable in `df1` depending on one variable of `df1` (`df1$var1`) and one variable of `df2` that is changeable depending on `df1$var1`
【Posted】: 2019-05-17 16:20:19
【Question description】:

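I have a data frame `df1` that summarizes fish depths over time. `df1$Site` gives the location of the fish, `df1$Ind` the individual, and `df1$Depth` the depth of the fish at a particular `df1$Datetime`.

I also have `df2`, which summarizes current intensity over time (every three hours) from the surface down to 39 m depth, in 8 m layers (`m0-7`, `m8-15`, `m16-23`, `m24-31`, `m32-39`). As an example: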

df1<-data.frame(Datetime=c("2016-08-01 15:34:07","2016-08-01 16:25:16","2016-08-01 17:29:16","2016-08-01 18:33:16","2016-08-01 20:54:16","2016-08-01 22:48:16"),Site=c("BD","HG","BD","BD","BD","BD"),Ind=c(16,17,19,16,17,16), Depth=c(5.3,24,36.4,42,NA,22.1))
df1$Datetime<-as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")


> df1
             Datetime Site Ind Depth
1 2016-08-01 15:34:07   BD  16   5.3
2 2016-08-01 16:25:16   HG  17  24.0
3 2016-08-01 17:29:16   BD  19  36.4
4 2016-08-01 18:33:16   BD  16  42.0
5 2016-08-01 20:54:16   BD  17    NA
6 2016-08-01 22:48:16   BD  16  22.1

df2<-data.frame(Datetime=c("2016-08-01 12:00:00","2016-08-01 15:00:00","2016-08-01 18:00:00","2016-08-01 21:00:00","2016-08-02 00:00:00"), Site=c("BD","BD","BD","BD","BD"),var1=c(2.75,4,6.75,2.25,4.3),var2=c(3,4,4.75,3,2.1),var3=c(2.75,4,5.75,2.25,1.4),var4=c(3.25,3,6.5,2.75,3.4),var5=c(3,4,4.75,3,1.7))
df2$Datetime<-as.POSIXct(df2$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
colnames(df2)<-c("Datetime","Site","m0-7","m8-15","m16-23","m24-31","m32-39")

> df2
             Datetime Site m0-7 m8-15 m16-23 m24-31 m32-39
1 2016-08-01 12:00:00   BD 2.75  3.00   2.75   3.25   3.00
2 2016-08-01 15:00:00   BD 4.00  4.00   4.00   3.00   4.00
3 2016-08-01 18:00:00   BD 6.75  4.75   5.75   6.50   4.75
4 2016-08-01 21:00:00   BD 2.25  3.00   2.25   2.75   3.00
5 2016-08-02 00:00:00   BD 4.30  2.10   1.40   3.40   1.70

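I want to create a new column in `df1`, called `df1$Current.Int`, that gives the current intensity at the depth, time and site where each fish was, according to the currents described in `df2`.

I would like to get this: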

> df1
             Datetime Site Ind Depth Current.Int
1 2016-08-01 15:34:07   BD  16   5.3        4.00
2 2016-08-01 16:25:16   HG  17  24.0          NA # Currents of this site are not included in df2
3 2016-08-01 17:29:16   BD  19  36.4        4.75
4 2016-08-01 18:33:16   BD  16  42.0        4.75
5 2016-08-01 20:54:16   BD  17    NA          NA
6 2016-08-01 22:48:16   BD  16  22.1        1.40

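Just to point out: since currents are recorded every three hours, each time in `df2$Datetime` covers an hour and a half before and an hour and a half after it. That is, the current intensity given in `df2` at `21:00:00` reflects the current between `19:30:00` and `22:30:00`. The same applies to the other times.

Does anyone know how to do this?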

【Question comments】:

Tags: r dplyr tidyverse lubridate


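【Solution 1】:

The dates did not match, so I changed them for the example. With this approach you can check exactly how the matching works and make sure it does what you want.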

df1<-data.frame(Datetime=c("2016-08-18 15:34:07","2016-08-18 16:25:16","2016-08-18 17:29:16","2016-08-18 18:33:16","2016-08-18 20:54:16","2016-08-18 22:48:16"),Site=c("BD","HG","BD","BD","BD","BD"),Ind=c(16,17,19,16,17,16), Depth=c(5.3,24,36.4,42,NA,22.1))
df1$Datetime<-as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")

df2<-data.frame(Datetime=c("2016-08-18 12:00:00","2016-08-18 15:00:00","2016-08-18 18:00:00","2016-08-18 21:00:00","2016-08-19 00:00:00"), Site=c("BD","BD","BD","BD","BD"),var1=c(2.75,4,6.75,2.25,4.3),var2=c(3,4,4.75,3,2.1),var3=c(2.75,4,5.75,2.25,1.4),var4=c(3.25,3,6.5,2.75,3.4),var5=c(3,4,4.75,3,1.7))
df2$Datetime<-as.POSIXct(df2$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
colnames(df2)<-c("Datetime","Site","m0-7","m8-15","m16-23","m24-31","m32-39")

library(dplyr)
library(lubridate)

# Round the date and convert the depth to match the look-up. 
df1 = df1 %>% 
  mutate(
    Datetime_rounded = round_date(Datetime, "3 hour"),
    Depth_ind = ifelse(Depth < 8, "m0-7", 
                  ifelse(Depth > 7 & Depth < 16, "m8-15", 
                    ifelse(Depth > 15 & Depth < 24, "m16-23",
                      ifelse(Depth > 23 & Depth < 32, "m24-31",
                        ifelse(Depth > 31 & Depth < 40, "m32-39", NA)
                      )
                    )
                  )
                )
  )

# Wide to long on the intensity columns. 
df2 = df2 %>% 
  tidyr::gather("Depth_ind", "Intensity", 3:7)

# Join
df1 %>% 
  left_join(df2, by = c("Datetime_rounded" = "Datetime", 
                        "Site",
                        "Depth_ind"))

             Datetime Site Ind Depth    Datetime_rounded Depth_ind Intensity
1 2016-08-18 15:34:07   BD  16   5.3 2016-08-18 15:00:00      m0-7      4.00
2 2016-08-18 16:25:16   HG  17  24.0 2016-08-18 15:00:00    m24-31        NA
3 2016-08-18 17:29:16   BD  19  36.4 2016-08-18 18:00:00    m32-39      4.75
4 2016-08-18 18:33:16   BD  16  42.0 2016-08-18 18:00:00      <NA>        NA
5 2016-08-18 20:54:16   BD  17    NA 2016-08-18 21:00:00      <NA>        NA
6 2016-08-18 22:48:16   BD  16  22.1 2016-08-19 00:00:00    m16-23      1.40

# EDIT ----
## As per the request, the width of the final depth range can be adjusted as you wish, e.g. to a max depth of 60 m.

# Round the date and convert the depth to match the look-up. 
df1 = df1 %>% 
  mutate(
    Datetime_rounded = round_date(Datetime, "3 hour"),
    Depth_ind = ifelse(Depth < 8, "m0-7", 
                  ifelse(Depth > 7 & Depth < 16, "m8-15", 
                    ifelse(Depth > 15 & Depth < 24, "m16-23",
                      ifelse(Depth > 23 & Depth < 32, "m24-31",
                        ifelse(Depth > 31 & Depth < 60, "m32-39", NA)
                      )
                    )
                  )
                )
  )

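【Comments】:

  • Hi! Your code is almost perfect!!! I have just one problem: in `df1[4, ]` you get `NA` for `Intensity` because the depth is greater than the deepest interval (32-39 m). In these cases, when the fish is deeper than 39 m, I would like to assign the intensity of the deepest layer, e.g. 4.75 in this case. Do you know how to do that? Thanks in advance :)
  • Thanks. Just change the range for "m32-39" to whatever you want, e.g. `ifelse(Depth > 31 & Depth < 60, "m32-39", NA)`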
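
As an alternative to widening the last `ifelse()` range by hand, the depth-to-layer mapping can be sketched with base R's `cut()`, using `Inf` as the final break so that the deepest layer is open-ended. This is only an illustration and not part of the original answer; the vector names `depth`, `breaks`, `labels` and `depth_ind` are chosen here for the example.

```r
# Depths taken from the question's df1 (NA = missing depth).
depth <- c(5.3, 24, 36.4, 42, NA, 22.1)

# Lower edge of each layer; Inf makes the deepest layer open-ended.
breaks <- c(0, 8, 16, 24, 32, Inf)
labels <- c("m0-7", "m8-15", "m16-23", "m24-31", "m32-39")

# right = FALSE gives the intervals [0, 8), [8, 16), ..., [32, Inf).
depth_ind <- as.character(
  cut(depth, breaks = breaks, labels = labels, right = FALSE)
)

print(depth_ind)
```

With this mapping a depth of 42 m falls into `m32-39`, and a missing depth stays `NA`.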
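【Solution 2】:

This can be done directly in a single SQL statement. We left join `df1` to `df2` using the `on` condition shown, grouping by the rows of `df1`. Computing `max(b.Datetime)` over each group picks out the appropriate row of `df2`. (If `a.Datetime` and `a.Site` do not uniquely identify a row of `df1`, group by `a.rowid` instead.) Finally we drop that column using `[-1]`.

Since the data in the question has no corresponding dates in `df1` and `df2`, we used the data shown in the Note at the end.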

library(sqldf)

sqldf("select max(b.Datetime), a.*,
  case when a.Depth <= 7 then b.[m0-7]
       when a.Depth <= 15 then b.[m8-15]
       when a.Depth <= 23 then b.[m16-23]
       when a.Depth <= 31 then b.[m24-31]
       else b.[m32-39]
  end as [Current.Int]
  from df1 a
  left join df2 b on a.Site = b.Site and a.Datetime >= b.Datetime
  group by a.Datetime, a.Site")[-1]

giving:

             Datetime Site Ind Depth Current.Int
1 2016-08-01 15:34:07   BD  16   5.3        4.00
2 2016-08-01 16:25:16   HG  17  24.0          NA
3 2016-08-01 17:29:16   BD  19  36.4        4.00
4 2016-08-01 18:33:16   BD  16  42.0        4.75
5 2016-08-01 20:54:16   BD  17    NA        4.75
6 2016-08-01 22:48:16   BD  16  22.1        2.25

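Note

This is the input used. It is the same as in the question except that:

  1. The UTC time zone has been removed. If you want to keep the UTC time zone, use `Sys.setenv(TZ='UTC')` to change the session time zone to UTC. Another way of dealing with time zones is to use character strings rather than POSIXct for the `Datetime` columns, in which case you do not run into time zone problems in the first place.

  2. The last line was added to fix up the example, since the dates did not match.

The input is: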

df1<-data.frame(Datetime=c("2016-08-01 15:34:07","2016-08-01 16:25:16","2016-08-01 17:29:16","2016-08-01 18:33:16","2016-08-01 20:54:16","2016-08-01 22:48:16"),Site=c("BD","HG","BD","BD","BD","BD"),Ind=c(16,17,19,16,17,16), Depth=c(5.3,24,36.4,42,NA,22.1))
df1$Datetime<-as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S")

df2<-data.frame(Datetime=c("2016-08-18 12:00:00","2016-08-18 15:00:00","2016-08-18 18:00:00","2016-08-18 21:00:00","2016-08-19 00:00:00"), Site=c("BD","BD","BD","BD","BD"),var1=c(2.75,4,6.75,2.25,4.3),var2=c(3,4,4.75,3,2.1),var3=c(2.75,4,5.75,2.25,1.4),var4=c(3.25,3,6.5,2.75,3.4),var5=c(3,4,4.75,3,1.7))
df2$Datetime<-as.POSIXct(df2$Datetime, format="%Y-%m-%d %H:%M:%S")
colnames(df2)<-c("Datetime","Site","m0-7","m8-15","m16-23","m24-31","m32-39")

df2$Datetime <- as.POSIXct(paste("2016-08-01", sub(".* ", "", df2$Datetime)))
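
The time zone point can be illustrated in base R. A POSIXct value stores one instant as seconds since the epoch, and the time zone only changes how that instant is printed. The sketch below is an illustration only; the zone `America/New_York` is an arbitrary choice made here, not something taken from the thread.

```r
# One instant, stored internally as seconds since 1970-01-01 UTC.
x <- as.POSIXct("2016-08-01 15:34:07", tz = "UTC")

# The same instant rendered in two different time zones.
shown_utc <- format(x, "%Y-%m-%d %H:%M:%S", tz = "UTC")
shown_ny  <- format(x, "%Y-%m-%d %H:%M:%S", tz = "America/New_York")

print(shown_utc)  # "2016-08-01 15:34:07"
print(shown_ny)   # "2016-08-01 11:34:07"
```

If a value is written with one time zone and read back with another, the printed clock time shifts in this way even though the underlying instant is unchanged.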

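【Comments】:

  • Thanks G. Grothendieck! You are right! I changed the dates in `df2` because they did not match the times in `df1`. After running your code, I do not quite understand your output. Your output changes `Datetime`. For example, the first row should contain the time `15:34:07`. I guess that is why the values of `Current.Int` do not match what I expected... Just one extra comment: I need to use the current information from `df2` that is closest to the `Datetime` in `df1`. For example, for `16:25:16` in `df1`, the appropriate `Datetime` in `df2` is `15:00:00`.
  • Another example: for `2016-08-01 22:48:16` in `df1`, the closest `Datetime` in `df2` is `2016-08-02 00:00:00`. Do you know how to solve these issues? Thanks in advance for your time :)
  • This is a time zone problem. I have now removed `tz=` from the input shown in the Note to avoid it. Alternatively, you could set the session time zone to UTC or define the `Datetime` columns as character strings.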
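
Apart from the time zone issue, the join condition `a.Datetime >= b.Datetime` combined with `max(b.Datetime)` selects the most recent `df2` record at or before each fish time, whereas the asker wants the closest record in either direction. The difference can be sketched in base R by snapping times to a 3-hour grid with `floor()` versus `round()`. This is a minimal illustration; the names `fish_time`, `step`, `preceding` and `nearest` are introduced here.

```r
# Two fish times from df1, in UTC.
fish_time <- as.POSIXct(c("2016-08-01 16:25:16", "2016-08-01 22:48:16"), tz = "UTC")

step <- 3 * 60 * 60            # three hours in seconds
secs <- as.numeric(fish_time)  # seconds since the epoch

# Most recent 3-hour mark at or before each time.
preceding <- as.POSIXct(floor(secs / step) * step, origin = "1970-01-01", tz = "UTC")

# Closest 3-hour mark in either direction (+/- 1.5 h).
nearest <- as.POSIXct(round(secs / step) * step, origin = "1970-01-01", tz = "UTC")

print(format(preceding, "%Y-%m-%d %H:%M:%S"))
print(format(nearest, "%Y-%m-%d %H:%M:%S"))
```

For `22:48:16` the preceding mark is `21:00:00`, while the nearest mark is `00:00:00` on the following day, which is the record the asker expects. Times that fall exactly on a half-way point (for example `19:30:00`) need a rule of their own, since `round()` rounds halves to the even multiple.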
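【Solution 3】:

As long as your data is not very large, you do not have to go down the route of conditional joins. Instead, join on `Site` alone first and then filter out the extra observations. This is not particularly efficient, but it may be easier than turning to `sqldf`.

Note that I made a few changes to the data you provided so that the dates match.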

library(tidyverse)  

df1<-data.frame(Datetime=c("2016-08-01 15:34:07","2016-08-01 16:25:16","2016-08-01 17:29:16","2016-08-01 18:33:16","2016-08-01 20:54:16","2016-08-01 22:48:16"),
                Site=c("BD","HG","BD","BD","BD","BD"),
                Ind=c(16,17,19,16,17,16), 
                Depth=c(5.3,24,36.4,42,NA,22.1),
                stringsAsFactors = FALSE)
df1$Datetime<-as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")

df2<-data.frame(Datetime=c("2016-08-01 12:00:00","2016-08-01 15:00:00","2016-08-01 18:00:00","2016-08-01 21:00:00","2016-08-02 00:00:00"), 
                Site=c("BD","BD","BD","BD","BD"),
                var1=c(2.75,4,6.75,2.25,4.3),
                var2=c(3,4,4.75,3,2.1),
                var3=c(2.75,4,5.75,2.25,1.4),
                var4=c(3.25,3,6.5,2.75,3.4),
                var5=c(3,4,4.75,3,1.7),
                stringsAsFactors = FALSE)
df2$Datetime<-as.POSIXct(df2$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
colnames(df2)<-c("Datetime_CI","Site","m0-7","m8-15","m16-23","m24-31","m32-39")



#Tidy the data in df2 so that we have two columns for min and max Depth
#and a single column for the value of the current intensity
df2 <- df2 %>% 
  gather(-Datetime_CI, -Site, key = Depth, value = Current.Int) %>% 
  separate(Depth, c("minDepth", "maxDepth")) %>% 
  mutate(minDepth = as.numeric(str_sub(minDepth, 2, nchar(minDepth))))

#join df1 and df2 based on the Site alone
df1 %>% 
  inner_join(df2, by = "Site") %>% 
  #now filter out any observations where depth is not between the min and max
  filter(Depth >= minDepth,
         Depth <= maxDepth,
         #keep only current intensity observations recorded before Datetime
         Datetime > Datetime_CI) %>% 
  #finally, keep the most recent current intensity observation before Datetime
  group_by(Datetime, Site, Ind, Depth) %>% 
  filter(Datetime_CI == max(Datetime_CI))


# A tibble: 6 x 8
# Groups:   Datetime, Site, Ind, Depth [4]
Datetime            Site    Ind Depth Datetime_CI         minDepth maxDepth Current.Int
<dttm>              <chr> <dbl> <dbl> <dttm>                 <dbl> <chr>          <dbl>
1 2016-08-01 15:34:07 BD       16   5.3 2016-08-01 15:00:00        0 7               4   
2 2016-08-01 17:29:16 BD       19  36.4 2016-08-01 15:00:00        0 7               4   
3 2016-08-01 17:29:16 BD       19  36.4 2016-08-01 15:00:00       32 39              4   
4 2016-08-01 18:33:16 BD       16  42   2016-08-01 18:00:00        0 7               6.75
5 2016-08-01 22:48:16 BD       16  22.1 2016-08-01 21:00:00        0 7               2.25
6 2016-08-01 22:48:16 BD       16  22.1 2016-08-01 21:00:00       16 23              2.25
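
One likely reason for the duplicated rows in the output above: `maxDepth` is shown with type `<chr>`, so `Depth <= maxDepth` compares a number with a string. In that case R converts the number to text and compares the two as strings, character by character. The following base R sketch illustrates the effect; the names `depth` and `max_depth_chr` are made up for the example.

```r
# Depths of the BD fish in df1 and the upper bound of the "m0-7" layer,
# stored as text as in the output above.
depth <- c(5.3, 36.4, 42, 22.1)
max_depth_chr <- "7"

# Number vs string: compared as text, so "36.4" sorts before "7".
as_text <- depth <= max_depth_chr

# Number vs number: the intended comparison.
as_number <- depth <= as.numeric(max_depth_chr)

print(as_text)    # TRUE TRUE TRUE TRUE
print(as_number)  # TRUE FALSE FALSE FALSE
```

Converting `maxDepth` with `as.numeric()` in the `mutate()` step, as is already done for `minDepth`, would presumably remove the spurious matches against the `0`-`7` layer.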

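【Comments】:

  • Hi Jordo82, thanks for your time. When I run your code, at the last step (joining `df1` and `df2`) I get the following message: "Warning message: Column `Site` joining factors with different levels, coercing to character vector". The script runs, but I get a data frame with zero rows and 8 variables. Do you know what the problem might be? I tried changing `df1$Site` and `df2$Site` to character (instead of factor), but I got the same result.
  • Also, I do not understand your final output. It contains neither the time `16:25:16` nor `20:54:16`. Why? Instead, the times `17:29:16` and `22:48:16` are repeated. I also see that the time assignment is wrong. For example, the time `17:29:16` corresponds to a `Current.Int` of 4.75, because you have to take the row of `df2` for the time `18:00:00` and look at the column `df2$m32-39` (the depth of this fish is 42 m; in that case I want to take the current value from the deepest layer).
  • Another example regarding the assignment: the current assigned to `df1[6, ]` corresponds to the time `21:00:00` from `df2`. Since the time of `df1[6, ]` is `22:48:16`, the closest time in `df2` is `00:00:00` of the next day (`2016-08-01`). Do you understand my comments? Do you know how to fix them? Thanks in advance for your time. The code is really close to what I need.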
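
Putting the requests from the comments together (closest `df2` record in time for the same site, deepest layer for fish below 39 m, `NA` for unknown sites or depths), one possible base R sketch is shown below. It is an illustration under those assumptions rather than any answerer's method; the helper `nearest_current()` and the vectors `layer_lower` and `layer_cols` are hypothetical names introduced here.

```r
df1 <- data.frame(
  Datetime = as.POSIXct(c("2016-08-01 15:34:07", "2016-08-01 16:25:16",
                          "2016-08-01 17:29:16", "2016-08-01 18:33:16",
                          "2016-08-01 20:54:16", "2016-08-01 22:48:16"), tz = "UTC"),
  Site = c("BD", "HG", "BD", "BD", "BD", "BD"),
  Ind = c(16, 17, 19, 16, 17, 16),
  Depth = c(5.3, 24, 36.4, 42, NA, 22.1),
  stringsAsFactors = FALSE
)

df2 <- data.frame(
  Datetime = as.POSIXct(c("2016-08-01 12:00:00", "2016-08-01 15:00:00",
                          "2016-08-01 18:00:00", "2016-08-01 21:00:00",
                          "2016-08-02 00:00:00"), tz = "UTC"),
  Site = "BD",
  `m0-7` = c(2.75, 4, 6.75, 2.25, 4.3),
  `m8-15` = c(3, 4, 4.75, 3, 2.1),
  `m16-23` = c(2.75, 4, 5.75, 2.25, 1.4),
  `m24-31` = c(3.25, 3, 6.5, 2.75, 3.4),
  `m32-39` = c(3, 4, 4.75, 3, 1.7),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Lower edge of each layer in metres, and the matching df2 column names.
layer_lower <- c(0, 8, 16, 24, 32)
layer_cols <- c("m0-7", "m8-15", "m16-23", "m24-31", "m32-39")

# Hypothetical helper: current intensity for row i of df1.
nearest_current <- function(i) {
  rows <- which(df2$Site == df1$Site[i])
  # No currents for this site, or unknown depth: nothing to look up.
  if (length(rows) == 0 || is.na(df1$Depth[i])) return(NA_real_)
  # Closest df2 record in time, before or after the fish observation.
  gap <- abs(as.numeric(df2$Datetime[rows]) - as.numeric(df1$Datetime[i]))
  nearest_row <- rows[which.min(gap)]
  # Layer index; every depth of 32 m or more maps to the deepest layer.
  layer <- findInterval(df1$Depth[i], layer_lower)
  if (layer == 0) return(NA_real_)  # negative depth
  df2[nearest_row, layer_cols[layer]]
}

df1$Current.Int <- vapply(seq_len(nrow(df1)), nearest_current, numeric(1))
print(df1)
```

On the question's data this gives 4.00, NA, 4.75, 4.75, NA and 1.40 for `Current.Int`, matching the desired output. Depths between two labelled layers (for example 7.5 m) are assigned to the shallower layer here, which is a choice made for this sketch. The row-by-row loop is simple rather than fast, so a rounding-plus-join approach may suit large data better.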