R renaming rows and creating primary and foreign key: answers

【Question title】: R renaming rows and creating primary and foreign key
【Posted】: 2021-06-08 13:08:27
【Question description】:

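I am working on a project in R. I created a data frame from a table of all the projects implemented by an institution. The data frame includes a Country column containing the name of the country where each project was implemented.

It looks like this and has more than 20,000 rows: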

$ ProjectID                      <chr> "P163945", "P169561", "P171613", "P172627"…
$ Region                         <chr> "Africa West", "Africa East", "Africa West…
$ Country                        <chr> "Western Africa", "United Republic of Tanz…
$ PName                          <chr> "Investments towards Resilient Management …

I also have a second table that contains country names too, but in a shorter format:

$ Rank                         <int> 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
$ `Country/region`             <chr> "Kenya", "Libya", "Dominica", "Ethiopia", "B…
$ `Real GDP growthrate (%)[1]` <chr> "1.9", "-66.7", "-8.8", "1.9", "3.8", "4.5",…

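Now I want to rename the country names in table 1 so that they look the same as those in table 2 (i.e., "United Republic of Tanzania" in table 1 becomes "Tanzania", as in table 2). I tried the countrycode package, but it did not seem to help in my case. I want to avoid manually renaming more than 100 names. Once the names in the two columns are identical, I want to use a SQL package in R to set primary and foreign keys and join the data from the tables together. I would appreciate any suggestions!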

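【Question comments】:

It is hard to help, especially given what we have (i.e., only one complete Country value). I think the best approach is to build a separate frame that maps Country to Country/region, which might be about 100 rows long. Yes, that may need to be done programmatically. From there, you can merge this mapping frame with your 20,000-row frame and select the country column you need. (References on merging/joining: stackoverflow.com/q/1299871/3358272, stackoverflow.com/a/6188334/3358272.)
If that does not make sense, I can't do anything with the sample we see here: please edit your question and paste the output of dput(x) for each frame, where x is large enough to clearly provide an easy-to-use frame but not so large that it breaks the page. The two frames should have enough countries in common; please do not provide two samples with nothing in common. Thanks.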
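
The mapping-frame idea from the first comment can be sketched in base R as follows. The frames `projects` and `country_map` and the column `CountryShort` are made-up names for illustration, and the sample values are assumptions rather than the asker's real data.

```r
# Hypothetical sample standing in for the 20,000-row projects frame
projects <- data.frame(
  ProjectID = c("P163945", "P169561", "P171613"),
  Country   = c("Western Africa", "United Republic of Tanzania",
                "Republic of Madagascar")
)

# Mapping frame: one row per distinct long name and its short form
country_map <- data.frame(
  Country      = c("United Republic of Tanzania", "Republic of Madagascar"),
  CountryShort = c("Tanzania", "Madagascar")
)

# all.x = TRUE keeps projects without a mapping; CountryShort is NA for them
projects_mapped <- merge(projects, country_map, by = "Country", all.x = TRUE)

# Long names that still lack a mapping and need a row in country_map
unmapped <- setdiff(projects$Country, country_map$Country)
unmapped
```

Keeping the mapping in its own frame means the 20,000 rows are never edited by hand: only the roughly 100 distinct names need a mapping row, and `unmapped` lists the ones still missing.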

Tags: r

【Solution 1】:

This is exactly what the countrycode package is for...

library(countrycode)

df1 <- 
  data.frame(
    ProjectID = c("P163945", "P169561", "P171613"),
    Region = c("Africa West", "Africa East", "Africa South"),
    Country = c(" Republic of Guinea-Bissau", "United Republic of Tanzania", "Republic of Madagascar")
  )

df2 <- 
  data.frame(
    Rank = c(1, 3, 4),
    `Country/region` = c("Tanzania", "Guinea-Bissau", "Madagascar"),
    `Real GDP growthrate (%)[1]` = c("1.9", "-66.7", "-8.8")
  )


df1$iso3c <- countrycode(df1$Country, "country.name", "iso3c")
df2$iso3c <- countrycode(df2$Country.region, "country.name", "iso3c")


dplyr::full_join(df1, df2, by = "iso3c")
#>   ProjectID       Region                     Country iso3c Rank Country.region
#> 1   P163945  Africa West   Republic of Guinea-Bissau   GNB    3  Guinea-Bissau
#> 2   P169561  Africa East United Republic of Tanzania   TZA    1       Tanzania
#> 3   P171613 Africa South      Republic of Madagascar   MDG    4     Madagascar
#>   Real.GDP.growthrate.....1.
#> 1                      -66.7
#> 2                        1.9
#> 3                       -8.8
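
If the goal is to rename the long names rather than join on a code, countrycode can also return a standardized English name. The sketch below assumes the `country.name` destination yields names close to the short forms in table 2, which is not guaranteed for every country; the vector `long_names` is illustrative only. Values that match no country, such as a region, come back as `NA` with a warning, so they are listed for manual review.

```r
library(countrycode)

# Illustrative inputs: two countries and one region that is not a country
long_names <- c("United Republic of Tanzania", "Republic of Madagascar",
                "Western Africa")

# Convert free-form names to countrycode's standard English names;
# inputs that match no country are returned as NA (with a warning)
short_names <- countrycode(long_names,
                           origin = "country.name",
                           destination = "country.name")

# Inputs that still need manual attention
needs_review <- long_names[is.na(short_names)]
needs_review
```

Joining on `iso3c`, as in the answer above, remains the safer key because it does not depend on both tables spelling a name identically.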

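【Comments】:

Thank you for your answer! It works great.

【Solution 2】:

This is a string-matching problem. Take a look at the stringdist package. The stringdistmatrix(a, b) function computes the pairwise distances between two vectors of strings.

So the strategy can be to compute the pairwise string distances and, for each name, pick the candidate with the smallest distance.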

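library(stringdist)
# Pairwise distances: rows are table1 names, columns are table2 names
dmat <- stringdistmatrix(table1$country, table2$country)
# For each table1 name, the index of the closest table2 name
matched <- apply(dmat, 1, which.min)
new_id <- table2$country[matched]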

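new_id can then be added to table 1 as a column. As a one-liner, this would be

table2$country[apply(stringdistmatrix(table1$country, table2$country), 1, which.min)]

You need to check the results, because there can be ambiguity (as in most string-matching operations). Still, this approach should reduce the number of cases that need manual adjustment.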
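
One way to do that check is to keep the distance of each best match next to the match itself and inspect the weakest ones first. This is a minimal sketch with made-up input vectors; the choice of the Jaro-Winkler method (`method = "jw"`) is an assumption, and other distance methods may rank the candidates differently.

```r
library(stringdist)

# Illustrative inputs, not the asker's real data
long_names  <- c("United Republic of Tanzania", "Republic of Madagascar",
                 "Western Africa")
short_names <- c("Tanzania", "Guinea-Bissau", "Madagascar")

# Rows correspond to long_names, columns to short_names
dmat <- stringdistmatrix(long_names, short_names, method = "jw")

# Column index of the closest short name for each long name
best <- apply(dmat, 1, which.min)

# Keep the match and its distance side by side for manual review
review <- data.frame(
  original = long_names,
  matched  = short_names[best],
  distance = dmat[cbind(seq_along(best), best)]
)

# Largest distances first: these are the matches most likely to be wrong
review[order(-review$distance), ]
```

Note that `which.min` always returns some match, even for an entry such as a region that has no counterpart in the second table, so a high distance is a prompt to check the row by hand rather than proof of a wrong match.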

【Comments】: