Automate file joining/merging in R [closed]: answers

【Question title】: Automate file joining/merging in R [closed]
【Posted】: 2017-02-13 15:07:00
【Question】:

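I have a large number of data files that need to be joined in the following way.

Step 1.

The data need to be joined within "locations", which are annoyingly split into two parts (Part A and Part B). The files are stored in a single folder and have random, non-consecutive location numbers, for example: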

PartA_location012843.csv
PartB_location012843.csv
PartA_location465475.csv
PartB_location465475.csv
...

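Is there a way to loop through all the files and join Parts A and B for each location without specifying the location ID numbers by hand? The join itself would then be a simple left_join(PartA_locationX, PartB_locationX, by='common_field') from dplyr. I guess the output would be a series of data frame objects in the R workspace, one per location: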

location012843
location465475
...

Step 2.

Next, all locations need to be appended together into one data frame with rjoin, preserving location_id, such that:

     location_id fieldA fieldB common_field
1 location012843      x      y            c
2 location012843      x      y            c
...

【Question comments】:

Tags: r loops for-loop merge

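【Solution 1】:

You didn't provide many details, so this code assumes that every location has two CSVs and that no CSV files are missing. It also assumes that all location codes are six digits long. It creates a single data.frame for all locations, as you specified at the end of Step 2, and skips creating a separate data frame for each location as you mentioned in Step 1. If you need those, you can filter them out of the combined data frame later.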

library(dplyr)
library(stringr)

# List the CSV files to read. full.names = TRUE keeps the folder in each path,
# so read.csv() finds the files from any working directory.
# list.files() returns sorted names, so a_files[i] and b_files[i] pair up
# as long as every location has both parts.
a_files <- list.files("your_folder/", pattern = "PartA", full.names = TRUE)
b_files <- list.files("your_folder/", pattern = "PartB", full.names = TRUE)

# Empty data frame that accumulates the final output
final_df <- data.frame()

for (i in seq_along(a_files)) {

  # Extract the location name from each file name
  loc_a <- str_extract(a_files[i], "location[0-9]{6}")
  loc_b <- str_extract(b_files[i], "location[0-9]{6}")

  # Read in the CSVs and store the location as a variable
  parta <- read.csv(a_files[i]) %>%
    mutate(location_id = loc_a)
  partb <- read.csv(b_files[i]) %>%
    mutate(location_id = loc_b)

  # Join on the common field and the location.
  # If the two files belong to different locations, no rows match
  # and the PartB columns come back as NA.
  final_df <- left_join(parta, partb, by = c("common_field", "location_id")) %>%
    bind_rows(final_df)

}
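
If the separate per-location data frames from Step 1 are wanted as well, one possible variation is to keep them in a named list and stack that list afterwards. The following is only a sketch under assumptions: join_locations is a hypothetical helper name, the file names are assumed to follow the PartA_locationNNNNNN.csv / PartB_locationNNNNNN.csv pattern from the question, and the demo files are made up and written to a temporary folder. It finds each PartB file from its PartA name rather than relying on the two file lists lining up by position.

```r
library(dplyr)

# Hypothetical helper: join PartA and PartB for every location found in `folder`.
# Assumes names like PartA_location012843.csv / PartB_location012843.csv
# and a shared column called common_field.
join_locations <- function(folder) {
  a_files <- list.files(folder, pattern = "^PartA_location[0-9]+\\.csv$",
                        full.names = TRUE)

  # Derive each location ID from the PartA file name, e.g. "location012843"
  names(a_files) <- sub("^PartA_(location[0-9]+)\\.csv$", "\\1", basename(a_files))

  # Step 1: one joined data frame per location, kept in a named list
  lapply(a_files, function(a_path) {
    b_path <- file.path(dirname(a_path), sub("^PartA", "PartB", basename(a_path)))
    if (!file.exists(b_path)) stop("Missing PartB file for ", basename(a_path))
    left_join(read.csv(a_path), read.csv(b_path), by = "common_field")
  })
}

# Demo on made-up files in a temporary folder
demo_dir <- file.path(tempdir(), "locations_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (id in c("012843", "465475")) {
  write.csv(data.frame(common_field = c("c1", "c2"), fieldA = c("x1", "x2")),
            file.path(demo_dir, paste0("PartA_location", id, ".csv")),
            row.names = FALSE)
  write.csv(data.frame(common_field = c("c1", "c2"), fieldB = c("y1", "y2")),
            file.path(demo_dir, paste0("PartB_location", id, ".csv")),
            row.names = FALSE)
}

by_location <- join_locations(demo_dir)

# Step 2: stack the list; .id turns the list names into a location_id column
all_locations <- bind_rows(by_location, .id = "location_id")
print(all_locations)
```

Individual locations stay accessible as by_location[["location012843"]], which avoids creating one workspace object per location.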

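【Comments】:

Thanks! It works well up to bind_rows. Instead of binding all the values in field_A and field_B together, it creates additional field_A.x and field_B.x columns and fills the combinations with NA.
Are field_A and field_B both present in the PartA and PartB files? If so, you may want to drop the field_B variable from the PartA dataset, and vice versa.
No, field_A is only in PartA and field_B is only in PartB. If I knew how, I could share the dummy files I'm playing with.
Could you dput() some of what ends up in the parta and partb data frames? dplyr adds .x and .y to disambiguate columns when joining tbls that have the same variables, which suggests they are sharing column names...
Actually, I just looked at the output you provided earlier: it looks like in location312523 you have field_A in both CSVs, whereas things merged correctly in location012345. It may be a problem with the CSVs for that location.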
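
One way to check for the clash described in these comments is to compare the column names of the two parts before joining. This is a minimal sketch, not part of the original answer: shared_columns is a hypothetical helper, and the data frames below are made up to mimic one clean pair and one pair where PartB also carries field_A.

```r
# Hypothetical helper: columns that PartA and PartB share apart from the join key.
# Any name returned here would get .x/.y suffixes from left_join().
shared_columns <- function(parta, partb, key = "common_field") {
  setdiff(intersect(names(parta), names(partb)), key)
}

# Made-up data: bad_b wrongly carries field_A as well as field_B
good_a <- data.frame(common_field = "c", field_A = "x")
good_b <- data.frame(common_field = "c", field_B = "y")
bad_b  <- data.frame(common_field = "c", field_A = "x", field_B = "y")

shared_columns(good_a, good_b)  # character(0): nothing clashes
shared_columns(good_a, bad_b)   # "field_A": this pair would yield field_A.x / field_A.y
```

Running such a check on parta and partb inside the loop would point to the location whose CSVs have overlapping columns.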