Trying to create a function to join two datasets by closest GPS coordinate: answers

【Question title】: Trying to create a function to join two datasets by closest gps coordinate
【Posted】: 2016-08-30 16:49:12
【Question】:

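I am trying to merge two datasets that contain GPS coordinates, so that I end up with a single dataset containing the variables from both. I am attempting to do this with a function. The problem is that the GPS coordinates in the two datasets do not match exactly, so the task is to match the variables of one dataset to those of the other by finding the closest pair of GPS coordinates.

I have used the fuzzyjoin package successfully, but I only get partial matches (~75%). With the function below I am hoping for a higher match rate. One dataset is shorter than the other, so the idea is to use two for loops, one iterating over each dataset.

An "anchor" is established (the distance between the first observations of the two datasets), so that whenever the distance between two points is smaller than the anchor, the new (shorter) distance becomes the new anchor. The for loops continue until the shortest distance is found, and the variables from both datasets are appended to the end of a new dataset, here called pairedData. I should be left with one dataset that has as many rows as the shorter of the two inputs (6314 rows).

I think this function should work, but rbind() is very slow and I have had trouble implementing rbindlist(). Any ideas on how to accomplish this?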

combineGPS <- function(harvest, planting) {
  require(sp)
  require(data.table)
  longH <- harvest$long
  latH <- harvest$lat
  longP <- planting$long
  latP <- planting$lat
  rowsH <- nrow(harvest)
  rowsP <- nrow(planting)
  harvestCoords <- cbind(longH, latH)
  harvestPoints <- SpatialPoints(harvestCoords)
  plantingCoords <- cbind(longP, latP)
  plantingPoints <- SpatialPoints(plantingCoords)

  # The planting data is shorter than the harvest data.

  # Need to take each row of the planting data (6314), find the nearest harvest data point (16626), then attach it.

  anchor <- spDistsN1(plantingPoints[1,], harvestPoints[1,], longlat=FALSE)
  pairedData <- data.frame(long=numeric(),
                           lat=numeric(),
                           variety=factor(),
                           seedling_rate=numeric(),
                           seed_spacing=numeric(),
                           speed=numeric(),
                           yield=numeric(),
                           stringsAsFactors=FALSE)

  for (p in 1:rowsP) {
    for (h in 1:rowsH) {

      if (spDistsN1(plantingPoints[p,], harvestPoints[h,], longlat=FALSE) <= anchor) {
        anchor <- spDistsN1(plantingPoints[p,], harvestPoints[h,], longlat=FALSE)
        pairedData[p,] <- c(planting[p,]$long, planting[p,]$lat, planting[p,]$variety, planting[p,]$seedling_rate, planting[p,]$seed_spacing, planting[p,]$speed, harvest[h,]$yield)
      }

    }
  }
  return(pairedData)
}
doesItWork = combineGPS(harvest, planting)
doesItWork

【Comments】:

Tags: r for-loop gps spatial sp

【Solution 1】:

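If I understand your problem correctly, I'm not sure why you need a for loop over the harvest data. The function spDistsN1 returns a vector of distances from a set of points to one specified point. I think you should pass your harvest data as pts and one planting point as pt, then find the index of the harvest point with the shortest distance to each pt. Loop over the planting data only; that will save a lot of time. Also, don't specify longlat in spDistsN1, because your data are SpatialPoints and the documentation says not to specify it for such objects.

Example loop: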

for (p in 1:rowsP) {
  # Get the distance from the pth planting point to all of the harvest points
  Dists <- spDistsN1(pts = harvestPoints, pt = plantingPoints[p,])

  # Find the index of the nearest harvest point to p. There may be more than one minimum;
  # which.min() returns the first, so exactly one harvest row is used.
  NearestHarvest <- which.min(Dists)

  # Add information to the paired data
  pairedData[p,] <- c(planting[p,]$long, planting[p,]$lat, planting[p,]$variety, planting[p,]$seedling_rate, planting[p,]$seed_spacing, planting[p,]$speed, harvest[NearestHarvest,]$yield)
}

Let me know if this is what you are looking for.

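Additionally, you could initialize the pairedData data frame with the planting data and, inside the for loop, add only the harvest yield to it. That will also save you some time in the loop.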
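The idea of starting from the planting data and adding only the yield can be sketched as follows. This is a minimal sketch under stated assumptions, not the answerer's code: it assumes planar (projected) coordinates so that Euclidean distance is meaningful, it uses base R arithmetic in place of spDistsN1, and the helper name `nearest_index` and the toy data frames are hypothetical.

```r
# Hypothetical helper: for each row of `from`, return the row index of the
# nearest point in `to`, using squared Euclidean distance on planar coordinates.
nearest_index <- function(from, to) {
  vapply(seq_len(nrow(from)), function(i) {
    d2 <- (to$long - from$long[i])^2 + (to$lat - from$lat[i])^2
    which.min(d2)  # first minimum if there are ties
  }, integer(1))
}

# Toy data standing in for the real planting and harvest tables (values made up).
planting <- data.frame(long = c(0, 10, 20), lat = c(0, 0, 0),
                       variety = c("A", "B", "C"))
harvest  <- data.frame(long = c(19, 1, 11, 5), lat = c(1, 1, 1, 5),
                       yield = c(30, 10, 20, 99))

# Start from the planting data and add only the matched yield column,
# so no row-by-row rbind() or c() coercion is needed.
pairedData <- planting
idx <- nearest_index(planting, harvest)
pairedData$yield <- harvest$yield[idx]
```

Because the result is built by adding one column, the other planting columns keep their original types, which the row-wise `c(...)` assignment does not guarantee.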

【Comments】:

【Solution 2】:

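You need to map each row of the harvest file (16626) to a row of the planting file (6314), not the other way around. The accompanying figure plots the harvest and planting GPS coordinates on the xy-plane; the red dots are the harvester points.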

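Precision agriculture machinery consists of multi-row planters and harvesters. The GPS device is mounted inside the machine, so each GPS point refers to many rows of crop. In this case the planter covers 2X as many rows per pass as the harvester, which explains why the harvest file has ~2X+ as many data points.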
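The reversed direction described above, one output row per harvest point, might look like the following. It is an illustrative sketch only: the data are made up, planar coordinates are assumed, and `plant_idx` and `paired` are hypothetical names.

```r
# Toy planar coordinates; column names follow the question, values are made up.
planting <- data.frame(long = c(0, 10), lat = c(0, 0),
                       variety = c("A", "B"), stringsAsFactors = FALSE)
harvest  <- data.frame(long = c(1, 2, 9, 11, 12), lat = c(1, -1, 1, 1, -1),
                       yield = c(10, 12, 20, 22, 24))

# For each harvest point, the index of the closest planting point.
plant_idx <- vapply(seq_len(nrow(harvest)), function(h) {
  which.min((planting$long - harvest$long[h])^2 +
            (planting$lat  - harvest$lat[h])^2)
}, integer(1))

# One output row per harvest point, carrying the matched planting attribute.
paired <- cbind(harvest, variety = planting$variety[plant_idx])

# Several harvest points can share one planting point.
table(plant_idx)
```

With this orientation the result has as many rows as the harvest file, and each planting row can be matched more than once.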

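The basic approach is a brute-force search, because the GPS coordinates do not overlap between the files. I solved this in both R and Python by dividing the whole field into smaller uniform grid cells and restricting the search to the nearest neighboring cells. In terms of efficiency, it takes about 3-4 minutes to solve, and the average distance between matched planting and harvest points is about 3 meters, which is reasonable.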
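The answerer's code is not shown here, so the following is only one possible reading of the grid idea, written in base R. It assumes planar coordinates; the function name `nearest_by_grid`, the `cell` size, and the random test data are all hypothetical. It searches the 3 x 3 block of cells around each query point and falls back to a full scan when that block cannot guarantee the true nearest point.

```r
# Hypothetical helper: nearest point in `to` for each point in `from`,
# searching only the 3 x 3 block of grid cells around each query point.
nearest_by_grid <- function(from, to, cell) {
  # Row indices of `to`, grouped by a cell key such as "3_7".
  buckets <- split(seq_len(nrow(to)),
                   paste(floor(to$long / cell), floor(to$lat / cell), sep = "_"))

  vapply(seq_len(nrow(from)), function(i) {
    qx <- floor(from$long[i] / cell)
    qy <- floor(from$lat[i] / cell)
    # Keys of the query cell and its eight neighbours.
    nb <- expand.grid(x = qx + (-1:1), y = qy + (-1:1))
    keys <- paste(nb$x, nb$y, sep = "_")
    cand <- unlist(buckets[intersect(keys, names(buckets))], use.names = FALSE)
    if (length(cand) > 0) {
      d2 <- (to$long[cand] - from$long[i])^2 + (to$lat[cand] - from$lat[i])^2
      best <- which.min(d2)
      # Any point outside the block is at least one cell width away,
      # so a hit within that distance is a true nearest neighbour.
      if (d2[best] <= cell^2) return(cand[best])
    }
    # Fallback: brute-force scan over every candidate point.
    which.min((to$long - from$long[i])^2 + (to$lat - from$lat[i])^2)
  }, integer(1))
}

# Made-up points on a 100 x 100 field.
set.seed(1)
planting <- data.frame(long = runif(50, 0, 100), lat = runif(50, 0, 100))
harvest  <- data.frame(long = runif(200, 0, 100), lat = runif(200, 0, 100))
idx <- nearest_by_grid(planting, harvest, cell = 10)
```

The cell size is a tuning choice: cells that are too small trigger the fallback often, while cells that are too large leave many candidates per block.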

You can find the code on my Github.

【Comments】: