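How to find the value for/match to the coordinates of closest proximity in a second df

【Question title】: How to find value for/match to coordinates of closest proximity in a second df
【Posted】: 2015-01-05 12:24:33
【Question】:

I have a series of geographic positions at sea for which I am trying to obtain geological sediment type information. I am using an export of the British national geological sediment database (df1), a large dataset of coordinates and sediment information. So far I have been rounding the coordinates in the BGS export file (df1) and averaging/recalculating the sediment type for each of the resulting coordinate squares. I then round my own coordinates (df2) and match them to those squares to get a sediment classification.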

The BGS export looks like this (df1):

    NUM     X       Y           GRAV    SAND    MUD
1   228     1.93656 52.31307    1.07    98.83   0.10
2   142     1.84667 52.45333    0.00    52.60   47.40
3   182     1.91950 52.17750    9.48    90.38   0.14
4   124     1.88333 52.70833    0.00    98.80   1.20
5   2807    1.91050 51.45000    2.05    97.91   0.05
6   2787    1.74683 51.99382    41.32   52.08   6.60
7   2776    1.66117 51.63550    9.83    87.36   2.81
8   2763    1.82467 51.71767    43.92   47.25   8.83
9   2753    1.76867 51.96349    57.66   39.18   3.15
10  68      2.86967 52.96333    0.30    98.90   0.80
11  2912    1.70083 51.77783    26.90   64.87   8.22
12  2914    1.59750 51.88882    32.00   65.02   2.97
13  2886    1.98833 51.34267    1.05    98.91   0.04
14  2891    1.87817 51.31549    68.57   31.34   0.08
15  2898    1.37433 51.41249    35.93   61.48   2.59
16  45      2.06667 51.82500    9.70    88.10   2.20
17  2904    1.63617 51.45999    16.28   66.67   17.05

My positions at sea look like this (df2):

haul    DecStartLat DecStartLong
1993H_2 55.23983    -5.512830
2794H_1 55.26670    -5.516700
1993H_1 55.27183    -5.521330
0709A_71    55.26569    -5.519730
0396H_2 55.44120    -5.917800
0299H_2 55.44015    -5.917310
0514A_26    55.46897    -5.912167
0411A_64    55.47289    -5.911820
0410A_65    55.46869    -5.911930
0514A_24    55.63585    -5.783500
0295H_4 55.57250    -5.754300
0410A_62    55.63656    -6.041870
0413A_53    55.73280    -6.020600
0396H_13    55.66470    -6.002300
2794H_8 55.83330    -5.883300
0612A_15    55.84025    -5.912130
0410A_74    55.84311    -5.910180
0299H_16    55.90568    -5.732490
0200H_18    55.88600    -5.742900
0612A_18    55.90450    -5.835880

This is my script:

get.Sed.type <- function(x,y) {
  x$Y2 <- round(x$Y, digits=1)
  x$X2 <- round(x$X, digits=1)
  x$BGSQ <- paste(x$Y2,x$X2,sep="_")
  x$RATIO <- x$SAND/x$MUD
  x <- aggregate(cbind(GRAV,RATIO)~BGSQ,data=x,FUN=mean)

  FOLK <- (x$GRAV)
  FOLK[(FOLK)<1] <- 0
  FOLK[(FOLK)>=1&(FOLK)<5] <- 1
  FOLK[(FOLK)>=5&(FOLK)<30] <- 5
  FOLK[(FOLK)>=30&(FOLK)<80] <- 30
  FOLK[(FOLK)>=80] <- 80

  R_CLASS <- (x$RATIO)
  R_CLASS[(R_CLASS)<1/9] <- 0
  R_CLASS[(R_CLASS)>=1/9&(R_CLASS)<1] <- 0.1
  R_CLASS[(R_CLASS)>=1&(R_CLASS)<9] <- 1
  R_CLASS[(R_CLASS)>=9] <- 9

  x$FOLK_CLASS <- NA_character_  #start from an all-NA column so unmatched rows stay NA
  x$FOLK_CLASS[(R_CLASS)==0&(FOLK)==0] <- "M"
  x$FOLK_CLASS[(R_CLASS)%in%c(0,0.1)&(FOLK)==5] <- "gM"
  x$FOLK_CLASS[(R_CLASS)==0.1&(FOLK)==0] <- "sM"
  x$FOLK_CLASS[(R_CLASS)==0&(FOLK)==1] <- "(g)M"
  x$FOLK_CLASS[(R_CLASS)==0.1&(FOLK)==1] <- "(g)sM"
  x$FOLK_CLASS[(R_CLASS)==9&(FOLK)==0] <- "S"
  x$FOLK_CLASS[(R_CLASS)==1&(FOLK)==0] <- "mS"
  x$FOLK_CLASS[(R_CLASS)==9&(FOLK)==1] <- "(g)S"
  x$FOLK_CLASS[(R_CLASS)==1&(FOLK)==1] <- "(g)mS"
  x$FOLK_CLASS[(R_CLASS)==1&(FOLK)==5] <- "gmS"
  x$FOLK_CLASS[(R_CLASS)==9&(FOLK)==5] <- "gS"
  x$FOLK_CLASS[(FOLK)==80] <- "G"
  x$FOLK_CLASS[(R_CLASS)%in%c(0,0.1)&(FOLK)==30] <- "mG"
  x$FOLK_CLASS[(R_CLASS)==1&(FOLK)==30] <- "msG"
  x$FOLK_CLASS[(R_CLASS)==9&(FOLK)==30] <- "sG"

  y$Lat <- round(y$DecStartLat, digits=1)
  y$Long <- round(y$DecStartLong, digits=1)
  y$LATLONG100_sq <- paste(y$Lat,y$Long,sep="_")

  y <- merge(y, x[,c(1,4)],all.x=TRUE,by.x="LATLONG100_sq",by.y="BGSQ")

  #Delete unwanted columns
  y <- y[, !(colnames(y) %in% c("Lat","Long","LATLONG100_sq"))]
  #Name column something logical
  colnames(y)[colnames(y) == 'FOLK_CLASS'] <- 'BGS_class'

  return(y)
}

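However, I have a dozen or so positions in db2 (df2 above) for which there is no corresponding value in the BGS export (db1, i.e. df1). I would like to know how to either take another average over the squares surrounding the corresponding square (i.e. round to a coarser precision and repeat the process), or find the closest coordinates in the BGS export file and take the existing value.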

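【Question comments】:

It would be very helpful to limit the sample you post to the minimum needed to answer the question. The data and script you posted contain a lot of irrelevant material, which makes the question harder to answer. See also stackoverflow.com/questions/5963269/…
stackoverflow.com/questions/47382708/… may help

Tags: r coordinates match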

【Solution 1】:

For the second option stated in the question, I suggest framing the problem as follows:

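Suppose you have a set of m coordinates from db1 and n coordinates from db2 (presumably with m ≤ n, so that every db1 point can receive its own db2 point).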

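You want to match each point in db1 with a point in db2 so that the "error" of the matching, for example the sum of the distances, is minimised.

A simple greedy approach is to generate an m x n matrix containing the distance between every pair of coordinates and then, for each point in turn, pick the closest match that has not been used yet. Of course, if there are many points to match, or if you want an optimal solution, you may need to consider a more elaborate matching algorithm (e.g. the Hungarian algorithm).

Code: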

  #generate some data (this data will generate sub-optimal matching with greedy matching)
  db1 <- data.frame(id=c("a1","a2","a3","a4"), x=c(1,5,10,20), y=c(1,5,10,20))
  db2 <- data.frame(id=c("b1","b2","b3","b4"),x=c(1.1,2.1,8.1,14.1), y=c(1.1,1.1,8.1,14.1))

  #create cartesian product
  product <- merge(db1, db2, by=NULL)
  #calculate Euclidean distances for each possible matching
  product$d <- sqrt((product$x.x - product$x.y)^2 + (product$y.x - product$y.y)^2)

  #(naively & greedily) find the best match for each point
  sorted <- product[ order(product[,"d"]), ]
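  found <- character(0) #db2 ids that have already been assigned
  res <- character(nrow(db1)) #this vector will hold the result
  for (i in seq_len(nrow(db1))) {
    for (j in seq_len(nrow(sorted))) {
      db2_val <- as.character(sorted[j,"id.y"])
      #exact comparison: grep() would also match partial ids (e.g. "b1" inside "b10")
      if (sorted[j,"id.x"] == db1[i, "id"] && !(db2_val %in% found)) {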
        #print(paste("matching ", db1[i, "id"], " with ", db2_val))
        res[i] <- db2_val      
        found <- c(found, db2_val)
        break
      }
    }
  }

Note that I am sure the code could be improved and made more elegant by using something other than loops.
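
As a rough sketch of what a more compact version might look like, the block below builds the m x n distance matrix directly with `outer()` and runs the same greedy pass on it. It then checks the result against a brute-force search over all permutations, which is only practical for a handful of points. The helper names `greedy_match`, `perms` and `total` are made up for this illustration, and the code assumes there are at least as many db2 points as db1 points.

```r
db1 <- data.frame(id = c("a1","a2","a3","a4"), x = c(1,5,10,20), y = c(1,5,10,20))
db2 <- data.frame(id = c("b1","b2","b3","b4"), x = c(1.1,2.1,8.1,14.1), y = c(1.1,1.1,8.1,14.1))

# m x n matrix of Euclidean distances (rows: db1 points, columns: db2 points)
d <- sqrt(outer(db1$x, db2$x, "-")^2 + outer(db1$y, db2$y, "-")^2)
dimnames(d) <- list(as.character(db1$id), as.character(db2$id))

# Greedy pass: walk through db1 in order, take the nearest db2 point not used yet.
# Assumes nrow(d) <= ncol(d); greedy_match is a hypothetical helper name.
greedy_match <- function(d) {
  res <- integer(nrow(d))
  for (i in seq_len(nrow(d))) {
    res[i] <- which.min(d[i, ])
    d[, res[i]] <- Inf   # this db2 point is no longer available
  }
  res
}

# All permutations of v, one per row (brute force, small inputs only)
perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v), function(k) cbind(v[k], perms(v[-k]))))
}

# Sum of distances for an assignment p, where p[i] is the db2 column given to db1 row i
total <- function(p) {
  rows <- seq_len(nrow(d))
  sum(d[cbind(rows, p[rows])])
}

greedy <- greedy_match(d)
all_p <- perms(seq_len(ncol(d)))
optimal <- all_p[which.min(apply(all_p, 1, total)), ]

data.frame(db1 = rownames(d),
           greedy = colnames(d)[greedy],
           optimal = colnames(d)[optimal])
c(greedy = total(greedy), optimal = total(optimal))
```

On this toy data the greedy pass assigns b1, b3, b4, b2 while the exhaustive search finds b1, b2, b3, b4 with a smaller total distance, which is the sub-optimality mentioned in the data comment above. For real data sizes an assignment solver would be needed instead of the permutation search; I believe `solve_LSAP` in the `clue` package is one such option, but that is not tested here.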

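【Discussion】:

Could you save time by iterating for (j in i:nrow(sorted)) in the inner loop?
@BondedDust, no, I don't think you can. Consider, for example, the case where the first element of the sorted data frame relates to the second db1 element. Then, when i equals 2, the best match is that first element.
So "best match" is not transitive?
This is a greedy solution, not an optimal one, so it may well be non-transitive...
So if y is the best match for x, there might still be a better match for y?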

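【Solution 2】:

Hopefully I have not misunderstood, but as far as I can tell from the title, you need to match on minimum distance. If it is acceptable for that distance to be the Euclidean distance, you can use the fast RANN package; if not, you need to compute the great circle distance.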
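
To get a feel for whether Euclidean distance on raw degrees is acceptable, it may help to compare the ground length of one degree of longitude with one degree of latitude at the latitude of the hauls. The short calculation below is only an illustration: it assumes a spherical Earth with a radius of 6378137 m and takes 55.5 degrees north as a rough latitude for the df2 positions.

```r
# Length of one degree of arc on a sphere of radius 6378137 m (assumed Earth radius)
km_per_deg <- 2 * pi * 6378137 / 360 / 1000

lat <- 55.5  # assumed representative latitude of the hauls in df2

km_per_deg_lat <- km_per_deg                        # roughly constant with latitude
km_per_deg_lon <- km_per_deg * cos(lat * pi / 180)  # shrinks towards the poles

ratio <- km_per_deg_lon / km_per_deg_lat
round(c(lat = km_per_deg_lat, lon = km_per_deg_lon, ratio = ratio), 2)
```

At this latitude a degree of longitude covers noticeably less ground than a degree of latitude, so a Euclidean search on unscaled degrees gives east-west offsets more weight than they deserve. One possible workaround, if the Euclidean route is kept, is to multiply the longitudes by `cos(lat)` before searching; otherwise the great circle distance avoids the issue.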

Some of the provided data

BGS_df <- 
  read.table(text = 
               "    NUM     X       Y           GRAV    SAND    MUD
                1   228     1.93656 52.31307    1.07    98.83   0.10
                2   142     1.84667 52.45333    0.00    52.60   47.40
                3   182     1.91950 52.17750    9.48    90.38   0.14
                4   124     1.88333 52.70833    0.00    98.80   1.20
                5   2807    1.91050 51.45000    2.05    97.91   0.05",
             header = TRUE)

my_positions <-
  read.table(text = 
               "haul    DecStartLat DecStartLong
                1993H_2 55.23983    -5.512830
                2794H_1 55.26670    -5.516700
                1993H_1 55.27183    -5.521330",
             header = TRUE)

Euclidean distance (using the `RANN` package)

library(RANN)
# For each point in my_positions, find the nearest neighbor from BGS_df:
# Give X and then Y (longitude and then latitude)
# Note that argument k sets the number of nearest neighbours, here 1 (the closest)
closest_RANN <- RANN::nn2(data = BGS_df[, c("X", "Y")], 
                          query = my_positions[, c("DecStartLong", "DecStartLat")], 
                          k = 1)
results_RANN <- cbind(my_positions[, c("haul", "DecStartLong", "DecStartLat")],
                      BGS_df[closest_RANN$nn.idx, ])
results_RANN
#        haul DecStartLong DecStartLat NUM       X        Y GRAV SAND MUD
# 4   1993H_2     -5.51283    55.23983 124 1.88333 52.70833    0 98.8 1.2
# 4.1 2794H_1     -5.51670    55.26670 124 1.88333 52.70833    0 98.8 1.2
# 4.2 1993H_1     -5.52133    55.27183 124 1.88333 52.70833    0 98.8 1.2

Great circle distance (using the `geosphere` package)

library(geosphere)
# Compute matrix of great circle distances
dist_mat <- geosphere::distm(x = BGS_df[, c("X", "Y")],
                             y = my_positions[, c("DecStartLong", "DecStartLat")],
                             fun = distHaversine) # can try other distances
# For each column (point in my_positions) get the index of row of min dist
# (corresponds to row index in BGS_df) 
BGS_idx <- apply(dist_mat, 2, which.min)

results_geo <- cbind(my_positions[, c("haul", "DecStartLong", "DecStartLat")],
                     BGS_df[BGS_idx, ])
identical(results_geo, results_RANN) # here TRUE, but not always expected
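
If adding a package is not an option, the great circle variant can also be sketched in base R. The version below additionally keeps the distance to the matched sample, so that matches which are too far away to be meaningful can be flagged. `haversine_m` is an illustrative helper and not part of any package, the radius of 6378137 m is an assumption (I believe it corresponds to the default used by `distHaversine`), and the 50 km cutoff in `max_km` is an arbitrary placeholder.

```r
# Haversine great-circle distance in metres; all inputs in decimal degrees.
# haversine_m is a hypothetical helper written for this example.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  rad <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Same sample rows as above, re-created so the block runs on its own
BGS_df <- data.frame(
  NUM = c(228, 142, 182, 124, 2807),
  X = c(1.93656, 1.84667, 1.91950, 1.88333, 1.91050),
  Y = c(52.31307, 52.45333, 52.17750, 52.70833, 51.45000))
my_positions <- data.frame(
  haul = c("1993H_2", "2794H_1", "1993H_1"),
  DecStartLat = c(55.23983, 55.26670, 55.27183),
  DecStartLong = c(-5.512830, -5.516700, -5.521330))

# Rows: BGS samples, columns: my positions (same layout as the distm matrix)
dist_mat <- outer(seq_len(nrow(BGS_df)), seq_len(nrow(my_positions)),
                  function(i, j) haversine_m(BGS_df$X[i], BGS_df$Y[i],
                                             my_positions$DecStartLong[j],
                                             my_positions$DecStartLat[j]))

# Index of the nearest BGS sample for each position, and the distance to it
BGS_idx <- apply(dist_mat, 2, which.min)
nearest_km <- dist_mat[cbind(BGS_idx, seq_along(BGS_idx))] / 1000

max_km <- 50  # assumed cutoff beyond which a match is treated as unreliable
results <- data.frame(my_positions,
                      BGS_NUM = BGS_df$NUM[BGS_idx],
                      dist_km = round(nearest_km, 1),
                      too_far = nearest_km > max_km)
results
```

With these five sample rows every haul is matched to sample NUM 124, as in the results above, but that sample is more than 500 km away, so `too_far` is TRUE for all three hauls. With the full BGS export the nearest samples should be much closer; the distance column makes it possible to check that instead of assuming it.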
