Create a variable in `df1` depending on one variable of `df1` (`df1$var1`) and one variable of `df2` that is changeable depending on `df1$var1`: answers

【Question title】: Create a variable in `df1` depending on one variable of `df1` (`df1$var1`) and one variable of `df2` that is changeable depending on `df1$var1`
【Posted】: 2019-05-17 16:20:19
【Question】:

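I have the data frame df1, which summarizes fish depth over time. df1$Site tells you where the fish was, df1$Ind identifies the individual, and df1$Depth gives the depth of the fish at a particular df1$Datetime.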

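On the other hand, I have df2, which summarizes the current intensity over time (every three hours) from the surface down to 39 m depth, in 8 m layers (m0-7, m8-15, m16-23, m24-31 and m32-39). As an example: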

df1<-data.frame(Datetime=c("2016-08-01 15:34:07","2016-08-01 16:25:16","2016-08-01 17:29:16","2016-08-01 18:33:16","2016-08-01 20:54:16","2016-08-01 22:48:16"),Site=c("BD","HG","BD","BD","BD","BD"),Ind=c(16,17,19,16,17,16), Depth=c(5.3,24,36.4,42,NA,22.1))
df1$Datetime<-as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")


> df1
             Datetime Site Ind Depth
1 2016-08-01 15:34:07   BD  16   5.3
2 2016-08-01 16:25:16   HG  17  24.0
3 2016-08-01 17:29:16   BD  19  36.4
4 2016-08-01 18:33:16   BD  16  42.0
5 2016-08-01 20:54:16   BD  17    NA
6 2016-08-01 22:48:16   BD  16  22.1

df2<-data.frame(Datetime=c("2016-08-01 12:00:00","2016-08-01 15:00:00","2016-08-01 18:00:00","2016-08-01 21:00:00","2016-08-02 00:00:00"), Site=c("BD","BD","BD","BD","BD"),var1=c(2.75,4,6.75,2.25,4.3),var2=c(3,4,4.75,3,2.1),var3=c(2.75,4,5.75,2.25,1.4),var4=c(3.25,3,6.5,2.75,3.4),var5=c(3,4,4.75,3,1.7))
df2$Datetime<-as.POSIXct(df2$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
colnames(df2)<-c("Datetime","Site","m0-7","m8-15","m16-23","m24-31","m32-39")

> df2
             Datetime Site m0-7 m8-15 m16-23 m24-31 m32-39
1 2016-08-01 12:00:00   BD 2.75  3.00   2.75   3.25   3.00
2 2016-08-01 15:00:00   BD 4.00  4.00   4.00   3.00   4.00
3 2016-08-01 18:00:00   BD 6.75  4.75   5.75   6.50   4.75
4 2016-08-01 21:00:00   BD 2.25  3.00   2.25   2.75   3.00
5 2016-08-02 00:00:00   BD 4.30  2.10   1.40   3.40   1.70

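I want to create a new column in df1 called df1$Current.Int that gives, based on the currents described in df2, the current intensity at the depth, time and site where each fish was.

I would like to get this: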

> df1
             Datetime Site Ind Depth Current.Int
1 2016-08-01 15:34:07   BD  16   5.3        4.00
2 2016-08-01 16:25:16   HG  17  24.0          NA # Currents of this site are not included in df2
3 2016-08-01 17:29:16   BD  19  36.4        4.75
4 2016-08-01 18:33:16   BD  16  42.0        4.75
5 2016-08-01 20:54:16   BD  17    NA          NA
6 2016-08-01 22:48:16   BD  16  22.1        1.40

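Just to point out: since the currents are recorded every three hours, each time indicated in df2$Datetime covers one and a half hours before and one and a half hours after it. That is, the current intensity indicated in df2 at 21:00:00 reflects the currents between 19:30:00 and 22:30:00. The same applies to the other times.

Does anyone know how to do it?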

【Comments】:

You need to use some kind of conditional/non-equi join. This blog post would help: r-bloggers.com/in-between-a-rock-and-a-conditional-join/amp

Tags: r dplyr tidyverse lubridate

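【Solution 1】:

The dates didn't match, so they have been changed for the example. With this approach you can check exactly how the matching works and make sure it does what you want.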

df1<-data.frame(Datetime=c("2016-08-18 15:34:07","2016-08-18 16:25:16","2016-08-18 17:29:16","2016-08-18 18:33:16","2016-08-18 20:54:16","2016-08-18 22:48:16"),Site=c("BD","HG","BD","BD","BD","BD"),Ind=c(16,17,19,16,17,16), Depth=c(5.3,24,36.4,42,NA,22.1))
df1$Datetime<-as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")

df2<-data.frame(Datetime=c("2016-08-18 12:00:00","2016-08-18 15:00:00","2016-08-18 18:00:00","2016-08-18 21:00:00","2016-08-19 00:00:00"), Site=c("BD","BD","BD","BD","BD"),var1=c(2.75,4,6.75,2.25,4.3),var2=c(3,4,4.75,3,2.1),var3=c(2.75,4,5.75,2.25,1.4),var4=c(3.25,3,6.5,2.75,3.4),var5=c(3,4,4.75,3,1.7))
df2$Datetime<-as.POSIXct(df2$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
colnames(df2)<-c("Datetime","Site","m0-7","m8-15","m16-23","m24-31","m32-39")

library(dplyr)
library(lubridate)

# Round the date and convert the depth to match the look-up. 
df1 = df1 %>% 
  mutate(
    Datetime_rounded = round_date(Datetime, "3 hour"),
    Depth_ind = ifelse(Depth < 8, "m0-7", 
                  ifelse(Depth > 7 & Depth < 16, "m8-15", 
                    ifelse(Depth > 15 & Depth < 24, "m16-23",
                      ifelse(Depth > 23 & Depth < 32, "m24-31",
                        ifelse(Depth > 31 & Depth < 40, "m32-39", NA)
                      )
                    )
                  )
                )
  )

# Wide to long on the intensity columns. 
df2 = df2 %>% 
  tidyr::gather("Depth_ind", "Intensity", 3:7)

# Join
df1 %>% 
  left_join(df2, by = c("Datetime_rounded" = "Datetime", 
                        "Site",
                        "Depth_ind"))

             Datetime Site Ind Depth    Datetime_rounded Depth_ind Intensity
1 2016-08-18 15:34:07   BD  16   5.3 2016-08-18 15:00:00      m0-7      4.00
2 2016-08-18 16:25:16   HG  17  24.0 2016-08-18 15:00:00    m24-31        NA
3 2016-08-18 17:29:16   BD  19  36.4 2016-08-18 18:00:00    m32-39      4.75
4 2016-08-18 18:33:16   BD  16  42.0 2016-08-18 18:00:00      <NA>        NA
5 2016-08-18 20:54:16   BD  17    NA 2016-08-18 21:00:00      <NA>        NA
6 2016-08-18 22:48:16   BD  16  22.1 2016-08-19 00:00:00    m16-23      1.40

# EDIT ----
## As per the request, the width of the final depth range can be adjusted as you wish, e.g. to a max depth of 60 m.

# Round the date and convert the depth to match the look-up. 
df1 = df1 %>% 
  mutate(
    Datetime_rounded = round_date(Datetime, "3 hour"),
    Depth_ind = ifelse(Depth < 8, "m0-7", 
                  ifelse(Depth > 7 & Depth < 16, "m8-15", 
                    ifelse(Depth > 15 & Depth < 24, "m16-23",
                      ifelse(Depth > 23 & Depth < 32, "m24-31",
                        ifelse(Depth > 31 & Depth < 60, "m32-39", NA)
                      )
                    )
                  )
                )
  )
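
The ±1.5 hour window described in the question is what rounding to the nearest 3 hours implements. As a minimal base R sketch of the same step without lubridate, assuming the timestamps are in UTC (`round_3h` is a hypothetical helper name, not part of the answer):

```r
# Hypothetical helper: round POSIXct times to the nearest 3-hour mark (UTC).
round_3h <- function(x) {
  step <- 3 * 3600  # three hours in seconds
  as.POSIXct(round(as.numeric(x) / step) * step,
             origin = "1970-01-01", tz = "UTC")
}

x <- as.POSIXct(c("2016-08-18 16:25:16", "2016-08-18 22:48:16"), tz = "UTC")
rounded <- format(round_3h(x), "%Y-%m-%d %H:%M:%S")
rounded
# "2016-08-18 15:00:00" "2016-08-19 00:00:00"
```

Both values agree with the Datetime_rounded column in the output above.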

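【Comments】:

Hi Ewen! Your code is almost perfect!!! I have just one problem: in df1[4, ] you get NA in Intensity because the depth is greater than the deepest interval (32-39 m). In these cases, when the fish is deeper than 39 m, I would like to assign the intensity of the deepest layer, which in this case would be 4.75. Do you know how to do that? Thanks in advance :)
Thanks. Just change the range of "m32-39" to whatever you want, e.g. ifelse(Depth > 31 & Depth < 60, "m32-39", NA)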
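
Instead of widening the last ifelse() range, the depth-to-layer mapping could also be written as a lookup. The following is only a sketch in base R, assuming layer boundaries at 8, 16, 24 and 32 m and that every depth of 32 m or more belongs to the deepest layer, as requested in the comment; `depth_layer` is a hypothetical helper name.

```r
# Hypothetical helper: map a depth in metres to the matching df2 column name.
depth_layer <- function(depth) {
  layers <- c("m0-7", "m8-15", "m16-23", "m24-31", "m32-39")
  lower  <- c(0, 8, 16, 24, 32)  # lower bound of each layer
  # findInterval() gives the index of the last lower bound <= depth,
  # so every depth >= 32 (including 42) maps to the deepest layer.
  idx <- findInterval(depth, lower)
  idx[which(idx == 0)] <- NA     # negative depths belong to no layer
  layers[idx]                    # NA depths stay NA
}

layer <- depth_layer(c(5.3, 24, 36.4, 42, NA, 22.1))
layer
# "m0-7" "m24-31" "m32-39" "m32-39" NA "m16-23"
```

The result can be used as the Depth_ind column in place of the nested ifelse() calls.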

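【Solution 2】:

This can be done straightforwardly in a single SQL statement. We left join df1 to df2 on the indicated on condition, grouping by the rows of df1. Calculating max(b.Datetime) over each group picks out the appropriate row of df2. (If a.Datetime, a.Site does not uniquely define a row of df1, group by a.rowid instead.) Finally, we drop that column using [-1].

Since the data in the question has no corresponding dates in df1 and df2, we used the data shown in the Note at the end.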

library(sqldf)

sqldf("select max(b.Datetime), a.*,
  case when a.Depth <= 7 then b.[m0-7]
       when a.Depth <= 15 then b.[m8-15]
       when a.Depth <= 23 then b.[m16-23]
       when a.Depth <= 31 then b.[m24-31]
       else b.[m32-39]
  end as [Current.Int]
  from df1 a
  left join df2 b on a.Site = b.Site and a.Datetime >= b.Datetime
  group by a.Datetime, a.Site")[-1]

giving:

             Datetime Site Ind Depth Current.Int
1 2016-08-01 15:34:07   BD  16   5.3        4.00
2 2016-08-01 16:25:16   HG  17  24.0          NA
3 2016-08-01 17:29:16   BD  19  36.4        4.00
4 2016-08-01 18:33:16   BD  16  42.0        4.75
5 2016-08-01 20:54:16   BD  17    NA        4.75
6 2016-08-01 22:48:16   BD  16  22.1        2.25

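Note

The input used is the same as in the question except:

The UTC time zone has been removed. If you want to keep the UTC time zone, use Sys.setenv(TZ='UTC') to change the session time zone to UTC. Another way of dealing with time zones is to use character strings rather than POSIXct for the Datetime column, in which case you won't run into time zone problems in the first place.
A final line has been added to fix up the example, since the dates did not match.

This is the input used: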

df1<-data.frame(Datetime=c("2016-08-01 15:34:07","2016-08-01 16:25:16","2016-08-01 17:29:16","2016-08-01 18:33:16","2016-08-01 20:54:16","2016-08-01 22:48:16"),Site=c("BD","HG","BD","BD","BD","BD"),Ind=c(16,17,19,16,17,16), Depth=c(5.3,24,36.4,42,NA,22.1))
df1$Datetime<-as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S")

df2<-data.frame(Datetime=c("2016-08-18 12:00:00","2016-08-18 15:00:00","2016-08-18 18:00:00","2016-08-18 21:00:00","2016-08-19 00:00:00"), Site=c("BD","BD","BD","BD","BD"),var1=c(2.75,4,6.75,2.25,4.3),var2=c(3,4,4.75,3,2.1),var3=c(2.75,4,5.75,2.25,1.4),var4=c(3.25,3,6.5,2.75,3.4),var5=c(3,4,4.75,3,1.7))
df2$Datetime<-as.POSIXct(df2$Datetime, format="%Y-%m-%d %H:%M:%S")
colnames(df2)<-c("Datetime","Site","m0-7","m8-15","m16-23","m24-31","m32-39")

df2$Datetime <- as.POSIXct(paste("2016-08-01", sub(".* ", "", df2$Datetime)))
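
As a minimal illustration of why the time zone matters here: the same clock time parsed in two zones is two different instants, and one instant prints differently depending on the zone used for display. This sketch assumes the fixed-offset zone `Etc/GMT-2` (which is UTC+2) is available in the system's time zone database; the zone is chosen only for illustration.

```r
# The same clock time parsed in two time zones gives two different instants.
utc   <- as.POSIXct("2016-08-01 15:34:07", tz = "UTC")
other <- as.POSIXct("2016-08-01 15:34:07", tz = "Etc/GMT-2")  # UTC+2
gap_hours <- as.numeric(difftime(utc, other, units = "hours"))
gap_hours
# 2

# One instant, displayed in another zone, shows a different clock time.
shown <- format(utc, "%Y-%m-%d %H:%M:%S", tz = "Etc/GMT-2")
shown
# "2016-08-01 17:34:07"
```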

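【Comments】:

Thanks G. Grothendieck! You are right! I changed the dates in df2 because they didn't match the times in df1. After running your code, I don't quite understand your output. Your output changes Datetime. For example, the first row should contain the time 15:34:07. I guess that is why the values of Current.Int don't match what I expected... Just one additional comment: I need to use the current information from df2 that is closest to the Datetime in df1. For example, for 16:25:16 in df1, the appropriate Datetime in df2 is 15:00:00.
Another example: for 2016-08-01 22:48:16 in df1, the closest Datetime in df2 is 2016-08-02 00:00:00. Do you know how to solve these issues? Thanks in advance for your time :)
That is a time zone problem. I have now removed tz= from the input shown in the Note to avoid it. Alternatively, you could set the session's time zone to UTC or define the Datetime columns as character strings.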
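
Separately from the time zone issue, the join condition a.Datetime >= b.Datetime combined with max(b.Datetime) selects the latest df2 record at or before each fish time, which is not always the closest one. A minimal base R sketch of a nearest-time lookup, using two of the times mentioned in the comments (the variable names are illustrative, not from the answer):

```r
fish_time <- as.POSIXct(c("2016-08-01 16:25:16", "2016-08-01 22:48:16"),
                        tz = "UTC")
current_time <- as.POSIXct(c("2016-08-01 12:00:00", "2016-08-01 15:00:00",
                             "2016-08-01 18:00:00", "2016-08-01 21:00:00",
                             "2016-08-02 00:00:00"), tz = "UTC")

# Work on seconds since the epoch so the gaps are plain numbers.
fish_num    <- as.numeric(fish_time)
current_num <- as.numeric(current_time)

# For each fish time, the index of the current record with the smallest gap.
nearest <- vapply(fish_num,
                  function(t) which.min(abs(current_num - t)),
                  integer(1))
nearest
# 2 5

format(current_time[nearest], "%Y-%m-%d %H:%M:%S")
# "2016-08-01 15:00:00" "2016-08-02 00:00:00"
```

These indices could then be used to pick the matching rows of df2 for one site before selecting the depth column.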

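【Solution 3】:

As long as your data isn't very large, you don't have to go down the path of a conditional join. Instead, first join on Site alone, then filter out the extra observations. This isn't particularly efficient, but it is probably easier than turning to sqldf.

Note that I made a few changes to the data you provided so that the dates match.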

library(tidyverse)  

df1<-data.frame(Datetime=c("2016-08-01 15:34:07","2016-08-01 16:25:16","2016-08-01 17:29:16","2016-08-01 18:33:16","2016-08-01 20:54:16","2016-08-01 22:48:16"),
                Site=c("BD","HG","BD","BD","BD","BD"),
                Ind=c(16,17,19,16,17,16), 
                Depth=c(5.3,24,36.4,42,NA,22.1),
                stringsAsFactors = FALSE)
df1$Datetime<-as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")

df2<-data.frame(Datetime=c("2016-08-01 12:00:00","2016-08-01 15:00:00","2016-08-01 18:00:00","2016-08-01 21:00:00","2016-08-02 00:00:00"), 
                Site=c("BD","BD","BD","BD","BD"),
                var1=c(2.75,4,6.75,2.25,4.3),
                var2=c(3,4,4.75,3,2.1),
                var3=c(2.75,4,5.75,2.25,1.4),
                var4=c(3.25,3,6.5,2.75,3.4),
                var5=c(3,4,4.75,3,1.7),
                stringsAsFactors = FALSE)
df2$Datetime<-as.POSIXct(df2$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
colnames(df2)<-c("Datetime_CI","Site","m0-7","m8-15","m16-23","m24-31","m32-39")



#Tidy the data in df2 so that we have two columns for min and max Depth
#and a single column for the value of the current intensity
df2 <- df2 %>% 
  gather(-Datetime_CI, -Site, key = Depth, value = Current.Int) %>% 
  separate(Depth, c("minDepth", "maxDepth")) %>% 
  mutate(minDepth = as.numeric(str_sub(minDepth, 2, nchar(minDepth))))

#join df1 and df2 based on the Site alone
df1 %>% 
  inner_join(df2, by = "Site") %>% 
  #now filter out any observations where depth is not between the min and max
  filter(Depth >= minDepth,
         Depth <= maxDepth,
         #now keep only current intensity observations recorded before Datetime
         Datetime > Datetime_CI) %>% 
  #finally, keep the most recent current intensity observation before Datetime
  group_by(Datetime, Site, Ind, Depth) %>% 
  filter(Datetime_CI == max(Datetime_CI))


# A tibble: 6 x 8
# Groups:   Datetime, Site, Ind, Depth [4]
Datetime            Site    Ind Depth Datetime_CI         minDepth maxDepth Current.Int
<dttm>              <chr> <dbl> <dbl> <dttm>                 <dbl> <chr>          <dbl>
1 2016-08-01 15:34:07 BD       16   5.3 2016-08-01 15:00:00        0 7               4   
2 2016-08-01 17:29:16 BD       19  36.4 2016-08-01 15:00:00        0 7               4   
3 2016-08-01 17:29:16 BD       19  36.4 2016-08-01 15:00:00       32 39              4   
4 2016-08-01 18:33:16 BD       16  42   2016-08-01 18:00:00        0 7               6.75
5 2016-08-01 22:48:16 BD       16  22.1 2016-08-01 21:00:00        0 7               2.25
6 2016-08-01 22:48:16 BD       16  22.1 2016-08-01 21:00:00       16 23              2.25

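【Comments】:

Hi Jordo82, thanks for your time. When I run your code, at the last step (joining df1 and df2) I get the following message: "Warning message: Column Site joining factors with different levels, coercing to character vector". The script runs, but I get a data frame with zero rows and 8 variables. Do you know what the problem could be? I tried changing df1$Site and df2$Site to character (instead of factor), but I got the same result.
Also, I don't understand your final output. In your output there is neither the time 16:25:16 nor 20:54:16. Why? Instead, the times 17:29:16 and 22:48:16 are repeated. I also see that the time assignment is wrong. For example, the time 17:29:16 should correspond to a Current.Int of 4.75, because you have to take the row of df2 for the time 18:00:00 and look at the column df2$m32-39 (the depth of this fish is 42 m; in that case I want to take the current value from the deepest layer).
Another example regarding the assignment: the current assigned to df1[6, ] corresponds to the time 21:00:00 in df2. Since the time of df1[6, ] is 22:48:16, the closest time in df2 is 00:00:00 of the next day (2016-08-02). Do you understand my comments? Do you know how to fix them? Thanks in advance for your time. The code is really close to what I need.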
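
The repeated rows in the output above are consistent with maxDepth still being a character column (the tibble header shows <chr>), so Depth <= maxDepth is evaluated as a text comparison. The missing times have a simpler cause: inner_join() drops the HG row because df2 has no HG site, and filter() drops the row whose Depth is NA. A minimal base R sketch of the comparison issue, with values taken from the example (the variable names are illustrative):

```r
max_depth_chr <- c("7", "15", "23", "31", "39")  # upper bounds stored as text
depth <- 36.4

# Number vs character: R converts the number to text and compares strings,
# so "36.4" sorts before "7" and the 0-7 layer is wrongly accepted.
as_text <- depth <= max_depth_chr
as_text
# TRUE FALSE FALSE FALSE TRUE

# After converting to numeric only the 32-39 layer passes.
as_number <- depth <= as.numeric(max_depth_chr)
as_number
# FALSE FALSE FALSE FALSE TRUE
```

Under that reading, converting maxDepth with as.numeric() in the mutate() step would remove the spurious 0-7 matches; it would not by itself address the nearest-time or deeper-than-39 m requests.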