Join data frames with ranges: answers

【Question title】: Join data frames with ranges
【Posted】: 2020-10-01 18:26:19
【Question】:

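I have two data frames.

The first has a column "Zip Codes" that contains full zip codes. It also has several other columns, such as the shop name.

The second has a column called "Zip ranges", which holds only the zip code range for each city.

How can I join these two data frames on zip code so that the correct city is added to data frame 1?

I can think of nested for loops that compare each zip code against the min/max of every range in the second data frame. But that would take a long time to run (~100 comparisons).

Edit: Data frame 1: this one only has the zip codes. I want the city to end up here.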

| Shop names            | Zip Codes |
|-----------------------|-----------|
| Bergin and botts      | 029888    |
| WW and Co             | 100397    |
| Higgin Bothams        | 100430    |
| Bertie's Beans        | 100459    |
| Leaky Cauldron        | 310283    |
| Pet Peeves            | 310330    |
| Lucy's coffee shop    | 910345    |
| Dream cathers         | 465250    |
| Dragon supplies       | 479187    |
| SLUG AND   JIGGER'S   | 934464    |
| FLOURISH AND BLOTTS.  | 937833    |
| MADAM MALKIN'S ROBES  | 931283    |

Data frame 2: this one has the zip code ranges and the corresponding cities.

| City           | Zip ranges    |
|----------------|---------------|
| braavos        | 029918-100290 |
| highgarden     | 100389-100440 |
| vale           | 200410-219000 |
| dorne          | 310229-367890 |
| storms end     | 389032-567000 |
| king's landing | 601000-898000 |
| winterfell     | 910230-940200 |

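I made up some sample data here. The original data has about a million rows in dataframe1 and about 5k rows in dataframe2, so the for-loop approach would be cumbersome.

Any help is appreciated!

Input (df1)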

structure(list(Shop.names = c(" Bergin and botts      ", 
" WW and Co             ", " Higgin Bothams        ", " Bertie's Beans        ", 
" Leaky Cauldron        ", " Pet Peeves            ", " Lucy's coffee shop    ", 
" Dream cathers         ", " Dragon supplies       ", " SLUG AND   JIGGER'S   ", 
" FLOURISH AND BLOTTS.  ", " MADAM MALKIN'S ROBES  "), Zip.Codes = c("29888", 
"100397", "100430", "100459", "310283", "310330", "910345", "465250", 
"479187", "934464", "937833", "931283")), class = "data.frame", row.names = c(NA, 
-12L))

Input (df2)

structure(list(City = c(" braavos        ", " highgarden     ", 
" vale           ", " dorne          ", " storms end     ", " king's landing ", 
" winterfell     "), Zip.ranges = c(" 029918-100290 ", " 100389-100440 ", 
" 200410-219000 ", " 310229-367890 ", " 389032-567000 ", " 601000-898000 ", 
" 910230-940200 ")), class = "data.frame", row.names = c(NA, 
-7L))

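【Comments on the question】:

Could you use dput to provide a sample of both datasets?
Hi, I just edited the question. Thanks!
Thanks for the feedback: the question is clear and the solution is simple. However, I need the output of dput, which is a text representation of the object that can be pasted straight into the console to recreate both data frames. Recreating them by hand is too tedious.
Added to the main question. Thanks!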
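
For readers unfamiliar with the dput request in the comments above, here is a minimal sketch of what it does. The two-row `df1` is a hypothetical stand-in for the real data, and the round trip through `capture.output()` and `eval(parse())` is only there to show that the printed text rebuilds the same object.

```r
# Hypothetical two-row sample standing in for the real data frame
df1 <- data.frame(
  Shop.names = c("WW and Co", "Pet Peeves"),
  Zip.Codes  = c("100397", "310330"),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object when pasted into a console
dput(df1)

# Round trip: capture the printed code, evaluate it, compare with the original
code    <- paste(capture.output(dput(df1)), collapse = "\n")
rebuilt <- eval(parse(text = code))
same    <- identical(df1, rebuilt)
```

For a large table, something like `dput(head(df1, 20))` keeps the pasted sample short.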

Tags: r join range

【Solution 1】:

Using data.table:

df <- structure(list(
  Shop.names = c(
    " Bergin and botts      ", " WW and Co             ",
    " Higgin Bothams        ", " Bertie's Beans        ",
    " Leaky Cauldron        ", " Pet Peeves            ",
    " Lucy's coffee shop    ", " Dream cathers         ",
    " Dragon supplies       ", " SLUG AND   JIGGER'S   ",
    " FLOURISH AND BLOTTS.  ", " MADAM MALKIN'S ROBES  "
  ),
  Zip.Codes = c(
    "29888", "100397", "100430", "100459", "310283", "310330",
    "910345", "465250", "479187", "934464", "937833", "931283"
  )
), class = "data.frame", row.names = c(NA, -12L))
dfranges <- structure(list(
  City = c(
    " braavos        ", " highgarden     ", " vale           ",
    " dorne          ", " storms end     ", " king's landing ",
    " winterfell     "
  ),
  Zip.ranges = c(
    " 029918-923004 ", " 100389-100440 ", " 200410-219000 ",
    " 310229-367890 ", " 389032-567000 ", " 601000-898000 ",
    " 910230-940200 "
  )
), class = "data.frame", row.names = c(NA, -7L))
# Extract from-to, convert to numeric
dfranges <- cbind(
  dfranges,
  purrr::map_df(
    stringr::str_split(dfranges$Zip.ranges, "-"),
    ~ data.frame(from = as.numeric(.x[1]), to = as.numeric(.x[2]))
  )
)

library(data.table)
setDT(df)
setDT(dfranges)

# convert Zip.Code to numeric
df[,Zip.Codes:=as.numeric(Zip.Codes)]

# Non-equi join: for each shop in df, keep every range with from <= Zip.Codes <= to
dfranges[df, .(City, x.from, x.to, Zip.Codes, Shop.names), on = .(from <= Zip.Codes, to >= Zip.Codes)]
#>                 City x.from   x.to Zip.Codes              Shop.names
#>  1:             <NA>     NA     NA     29888  Bergin and botts      
#>  2:  braavos          29918 923004    100397  WW and Co             
#>  3:  highgarden      100389 100440    100397  WW and Co             
#>  4:  braavos          29918 923004    100430  Higgin Bothams        
#>  5:  highgarden      100389 100440    100430  Higgin Bothams        
#>  6:  braavos          29918 923004    100459  Bertie's Beans        
#>  7:  braavos          29918 923004    310283  Leaky Cauldron        
#>  8:  dorne           310229 367890    310283  Leaky Cauldron        
#>  9:  braavos          29918 923004    310330  Pet Peeves            
#> 10:  dorne           310229 367890    310330  Pet Peeves            
#> 11:  braavos          29918 923004    910345  Lucy's coffee shop    
#> 12:  winterfell      910230 940200    910345  Lucy's coffee shop    
#> 13:  braavos          29918 923004    465250  Dream cathers         
#> 14:  storms end      389032 567000    465250  Dream cathers         
#> 15:  braavos          29918 923004    479187  Dragon supplies       
#> 16:  storms end      389032 567000    479187  Dragon supplies       
#> 17:  winterfell      910230 940200    934464  SLUG AND   JIGGER'S   
#> 18:  winterfell      910230 940200    937833  FLOURISH AND BLOTTS.  
#> 19:  winterfell      910230 940200    931283  MADAM MALKIN'S ROBES

Created on 2020-10-02 by the reprex package (v0.3.0)
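
The from/to extraction in the code above relies on purrr and stringr. As a possible alternative, here is a base-R sketch of the same step. The two-row `dfranges` is a hypothetical stand-in, and trimming the padded `City` values is an extra assumption that the original answer does not make.

```r
# Hypothetical two-row stand-in for dfranges, with the same padded strings
dfranges <- data.frame(
  City       = c(" braavos        ", " highgarden     "),
  Zip.ranges = c(" 029918-923004 ", " 100389-100440 "),
  stringsAsFactors = FALSE
)

# Trim the padding, then split each "from-to" string on the literal hyphen
parts <- strsplit(trimws(dfranges$Zip.ranges), "-", fixed = TRUE)

# Take the first and second piece of every range and convert to numeric;
# as.numeric() drops the leading zero, so "029918" becomes 29918
dfranges$from <- as.numeric(vapply(parts, `[`, character(1), 1))
dfranges$to   <- as.numeric(vapply(parts, `[`, character(1), 2))

# Assumption: the padded city names are not wanted downstream
dfranges$City <- trimws(dfranges$City)
```

Because the zip codes are compared as numbers, the lost leading zero does not affect the range test, but it would matter if the codes had to be printed again as six-character strings.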

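Note that some of the Zip ranges in this data overlap (braavos, 029918-923004, spans several other cities' ranges), so the same shop gets two results in the join output.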
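
If a single city per shop is needed despite overlapping ranges, one possible approach is to pick one match per shop after the join and then write it back onto the shop table. The sketch below assumes that the narrowest matching range should win; that rule is an illustration, not something stated in the question. The names `shops`, `ranges`, `res`, `width`, and `one_per_shop` are hypothetical.

```r
library(data.table)

# Hypothetical three-shop subset of the example, already trimmed and numeric
shops <- data.table(
  Shop.names = c("WW and Co", "Bertie's Beans", "Bergin and botts"),
  Zip.Codes  = c(100397, 100459, 29888)
)
ranges <- data.table(
  City = c("braavos", "highgarden"),
  from = c(29918, 100389),
  to   = c(923004, 100440)
)

# Same non-equi join as in the answer: one row per (shop, matching range),
# plus an all-NA row for shops that fall in no range
res <- ranges[shops,
              .(City, x.from, x.to, Zip.Codes, Shop.names),
              on = .(from <= Zip.Codes, to >= Zip.Codes)]

# Assumption: when several ranges match, keep the narrowest one
res[, width := x.to - x.from]
one_per_shop <- unique(res[order(width)], by = c("Shop.names", "Zip.Codes"))

# Add the chosen city to the shop table by reference (equi update join)
shops[one_per_shop, City := i.City, on = .(Shop.names, Zip.Codes)]
```

Here "WW and Co" matches both braavos and highgarden, and the narrower highgarden range is kept. Shops outside every range, such as "Bergin and botts", end up with `NA` in `City`.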

【Comments】: