Automate a series of actions done on a single .csv file to all .csv files within the same directory in R: answers

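【Question title】: Automate a series of actions done on a single .csv file to all .csv files within the same directory in R
【Posted】: 2021-08-18 05:30:43
【Question】:

I'm working on a research project in which I need to process data from a pair of haptic gloves. The exported data has 4 rows containing the date and time that I don't need for my analysis, as well as many columns I don't need. Long story short, I need to delete the first 4 rows and keep only columns [1,2,33,53,76,95,114,133,164,184,207,226,245]. I wrote a very simple R script for this, but I'd like to know how to apply this set of actions to all the .csv files in the same directory. Typing in every file name by hand each time is very painful. Thanks in advance!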

# read uncleaned, raw, data
uncleaned_data<-read.csv("C:/Users/jiang/Desktop/Ready_Clean/Hongjiao_Medium_High1.csv", header = FALSE)

# remove the date and time headers
data_without_head<-uncleaned_data[-c(1,2,3,4),]

# extract the useful columns
cleaned_data<-data_without_head[,c(1,2,33,53,76,95,114,133,164,184,207,226,245)]

# write the new cleaned data into a new file name (adding "_cleaned" in the end)
write.table(cleaned_data,"C:/Users/jiang/Desktop/Ready_Clean/Hongjiao_Medium_High1_Cleaned.csv",row.names=FALSE,col.names=FALSE,sep=",")

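【Comments on the question】:

Welcome to Stack Overflow. R is well suited to this kind of task. The process can be done as follows: (1) write a function that does the job, (2) list the files with dir, (3) apply the function to all the files returned by dir using lapply or a for loop. More details can be found in the documentation.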
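The three steps in this comment can be sketched in base R as follows. This is a minimal sketch rather than the commenter's code: the function name `clean_file`, the temporary directories, and the tiny demo files (six columns, of which columns 1, 2 and 4 are kept) are all assumptions made so that the example runs on its own. For the glove data you would pass the column indices from the question instead.

```r
# Step 1: a function that cleans one file (hypothetical name, for illustration).
clean_file <- function(in_path, out_dir, keep_cols, n_skip = 4) {
  # skip the leading date/time rows, then keep only the wanted columns
  dat <- read.csv(in_path, header = FALSE, skip = n_skip)[, keep_cols, drop = FALSE]
  # build the output name by inserting "_Cleaned" before ".csv"
  out_path <- file.path(out_dir, sub("\\.csv$", "_Cleaned.csv", basename(in_path)))
  write.table(dat, out_path, row.names = FALSE, col.names = FALSE, sep = ",")
  out_path
}

# Demo input: two small made-up files in a temporary directory.
in_dir <- file.path(tempdir(), "raw_demo")
out_dir <- file.path(tempdir(), "clean_demo")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
for (nm in c("a.csv", "b.csv")) {
  writeLines(c("date", "time", "x", "y",          # four rows to drop
               "1,2,3,4,5,6", "7,8,9,10,11,12"),  # two data rows
             file.path(in_dir, nm))
}

# Step 2: list the files. Step 3: apply the function to each of them.
files <- dir(in_dir, pattern = "\\.csv$", full.names = TRUE)
written <- unlist(lapply(files, clean_file, out_dir = out_dir, keep_cols = c(1, 2, 4)))
print(basename(written))
```

Writing the output to a separate directory is a design choice of this sketch: it keeps the cleaned files from being picked up by `dir()` when the script is run again.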

Tags: r csv series

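【Solution 1】:

Hi, I've written some code to answer your question.

First, set the working directory.
List all the files you need to process. My assumption is that all the file names start with "Hongjiao_Medium_High" followed by some digits.
Use a for loop to iterate over the list of file names.
Paste your code into the for loop with a few adjustments.

Here is the code: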

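setwd("C:/Users/jiang/Desktop/Ready_Clean")
# match the raw files only: "Hongjiao_Medium_High", some digits, then ".csv"
list_of_file_names <- list.files(pattern = "^Hongjiao_Medium_High[0-9]+\\.csv$")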

for(i in list_of_file_names){
  # read uncleaned, raw, data
  print(i)
  uncleaned_data<-read.csv( i , header = FALSE)
  
  # remove the date and time headers
  data_without_head<-uncleaned_data[-c(1,2,3,4),]
  
  # extract the useful columns
  cleaned_data<-data_without_head[,c(1,2,33,53,76,95,114,133,164,184,207,226,245)]
  
  # write the cleaned data to a new file, inserting "_Cleaned" before ".csv"
  write.table(cleaned_data,sub("\\.csv$","_Cleaned.csv",i),row.names=FALSE,col.names=FALSE,sep=",")
}

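【Comments】:

This is the only one I've seen so far that doesn't require any packages. Thanks a lot, Anup!

【Solution 2】:

You can list all the files in the directory and then keep only those ending in .csv.

I'm assuming your directory path is "C:/Users/jiang/Desktop/Ready_Clean/".

Unfortunately I can't test the code on my computer, but let me know if you run into any problems.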

library(tidyverse)
library(stringr)

#get all the .csvs present in the directory and then fabricate the new names just by appending '_cleaned' before .csv 

paths <- list.files(path = "C:/Users/jiang/Desktop/Ready_Clean/") %>%
          str_subset(pattern = '\\.csv$') #capture all the files ending in .csv

paths <- str_c("C:/Users/jiang/Desktop/Ready_Clean/", paths)

paths_cleaned <- str_replace(paths, '\\.csv$', '_cleaned.csv')

get_csv <- function(path, path_clean){
    # read uncleaned, raw, data
    uncleaned_data    <- read.csv(path, header = FALSE)
    
    # remove the date and time headers
    data_without_head <- uncleaned_data[-c(1,2,3,4),]
    
    # extract the useful columns
    cleaned_data      <- data_without_head[, c(1,2,33,53,76,95,114,133,164,184,207,226,245)]
    
    # write the new cleaned data into a new file name (adding "_cleaned" in the end)
    write.table(cleaned_data,
                path_clean,
                row.names = FALSE,
                col.names = FALSE,
                sep = ",")
}

#walk2 would also be an option because we only care about side effects here.
map2(paths, paths_cleaned, ~get_csv(.x, .y))
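For readers who would rather not load purrr, the same pairing of input and output paths can be sketched with base R's `Map()`. This is an illustrative alternative, not part of the answer: `copy_csv` is a hypothetical stand-in for `get_csv()`, and the demo directory and file names are made up so the example runs on its own.

```r
# Made-up demo files in a temporary directory.
demo_dir <- file.path(tempdir(), "map_demo")
dir.create(demo_dir, showWarnings = FALSE)
paths <- file.path(demo_dir, c("a.csv", "b.csv"))
for (p in paths) writeLines("1,2,3", p)
paths_cleaned <- sub("\\.csv$", "_cleaned.csv", paths)

# Hypothetical stand-in for get_csv(): takes an input and an output path.
copy_csv <- function(path, path_clean) {
  dat <- read.csv(path, header = FALSE)
  write.table(dat, path_clean, row.names = FALSE, col.names = FALSE, sep = ",")
}

# Map() walks both vectors in parallel, much like map2(); invisible() hides
# the returned list of NULLs, since only the written files matter here.
invisible(Map(copy_csv, paths, paths_cleaned))
```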

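【Comments】:

I tried this approach, and I'd like to save the cleaned csv files into a new folder inside the Ready_Clean folder. I think I should change paths_cleaned?
You can create a new folder inside Ready_Clean and name it New_Folder. After that you can try: paths_cleaned <- paths %>% str_replace('Ready_Clean/', 'Ready_Clean/New_Folder/') %>% str_replace('\\.csv$', '_cleaned.csv')
Wow, that's great! I need to learn more about %>%. Thanks a lot!
In R you can use the dir.create() function to create a new directory, so there is no need to create the subdirectory by hand. You can also use dir.exists() to check whether the directory exists before trying to create it, as shown in my updated answer. @Paul726 - as of R 4.1 there is a pipe operator, |>, in base R, so you can use a pipe without loading the magrittr package used by the tidyverse family of packages.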
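The two suggestions in the last comment can be combined in a short base R sketch. It assumes R 4.1 or later for `|>`. The folder name `New_Folder` comes from the earlier comment, while the temporary root directory and the example file names are made up for illustration.

```r
# Stand-in for the Ready_Clean folder (assumption: a temporary directory).
root <- file.path(tempdir(), "Ready_Clean")
new_dir <- file.path(root, "New_Folder")

# Create the output folder only if it is missing; recursive = TRUE also
# creates `root` when it does not exist yet.
if (!dir.exists(new_dir)) dir.create(new_dir, recursive = TRUE)

# Example input paths (hypothetical file names).
paths <- file.path(root, c("run1.csv", "run2.csv"))

# The same two replacements as in the comment, written with the native pipe.
# Because `pattern` and `replacement` are named, the piped value becomes `x`.
paths_cleaned <- paths |>
  sub(pattern = "Ready_Clean/", replacement = "Ready_Clean/New_Folder/", fixed = TRUE) |>
  sub(pattern = "\\.csv$", replacement = "_cleaned.csv")

print(basename(paths_cleaned))
```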

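【Solution 3】:

A base R solution looks like this. First, we use list.files() to pick out the files ending in .csv, then use the list of files to drive lapply(), which reads the data, subsets it, and writes it out with write.table().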

theFiles <- list.files(path="C:/Users/jiang/Desktop/Ready_Clean/",
                       pattern="\\.csv$",full.names=TRUE)
dataList <- lapply(theFiles,function(x){
     y <- read.csv(x,skip = 4,header=FALSE)[c(1,2,33,53,76,95,114,133,164,184,207,226,245)]
     write.table(y,file=paste0(x,".cleaned"),row.names=FALSE,col.names=FALSE,sep=",")
})

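Note that we use the skip = argument to skip the first four rows as each file is read, and then immediately subset the object created by read.csv() with the [ form of the extract operator.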
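To see what `skip =` and the `[` extraction do together, here is a tiny self-contained sketch. The inline text stands in for a file and is made-up data, not the glove or Pokémon data.

```r
# Made-up file contents: four junk rows followed by two data rows.
raw <- paste(c("exported 2021-08-18", "05:30:43", "device A", "----",
               "1,10,100,1000", "2,20,200,2000"),
             collapse = "\n")

# skip = 4 drops the first four lines before parsing; [c(1, 3)] then keeps
# columns 1 and 3 of the resulting data frame.
y <- read.csv(text = raw, skip = 4, header = FALSE)[c(1, 3)]
print(y)
```

Because `header = FALSE`, the columns get the default names `V1`, `V2`, and so on, so the subset keeps `V1` and `V3`.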

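In the write.table() call we use paste0() to append .cleaned to each original file name, which distinguishes the cleaned files from the originals.

Since the original question did not include a minimal reproducible example, we'll use data from my Pokémon Stats GitHub repository to illustrate the solution.

The dimensions of the Pokémon stats data are quite different from the data described in the original question, so we'll skip the first four rows of each file and keep only columns 1, 2, 4, and 6.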

download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/PokemonData.zip",
              "pokemonData.zip",mode="wb")
unzip("pokemonData.zip",exdir="./pokemonData")


theFiles <- list.files("./pokemonData",pattern="\\.csv$",full.names=TRUE)
dataList <- lapply(theFiles,function(x){
     y <- read.csv(x,skip = 4,header=FALSE)[c(1,2,4,6)]
     write.table(y,file=paste0(x,".cleaned"),row.names=FALSE,col.names=FALSE,sep=",")
})

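A screenshot of one of the original files can be used to verify the output. I highlighted columns 1, 2, 4, and 6 below the first four rows of input (the header row plus three data rows).

...and the first few rows of ./pokemonData/gen01.csv.cleaned are: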

4,"Charmander","Fire",309
5,"Charmeleon","Fire",405
6,"Charizard","Fire",534
7,"Squirtle","Water",314
8,"Wartortle","Water",405
9,"Blastoise","Water",530

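The file gen01.csv contains the first generation of Pokémon. The first three Pokémon in this file are Bulbasaur, Ivysaur, and Venusaur. From the output we can see that these Pokémon and the header row of the original file were skipped, so the first observation is Pokémon 4, Charmander. We can also see that the Total stat from the sixth column matches the input file for the rows written to the output file.

Verifying the written files

Because we appended .cleaned to the end of each file name, we can list the .cleaned files with the same technique we used to list the .csv files, and read them with read.csv(). This keeps the original files distinct from the cleaned ones.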

# now read the cleaned files
theFiles <- list.files("./pokemonData",pattern="\\.cleaned$",full.names=TRUE)
dataList <- lapply(theFiles,read.csv,header=FALSE)
head(dataList[[1]])

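At this point the dataList object is a list() containing 8 data frames, one for each generation of Pokémon.

We use head() to print the first few rows of the first data frame in the list, which match the results above: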

> head(dataList[[1]])
  V1         V2    V3  V4
1  4 Charmander  Fire 309
2  5 Charmeleon  Fire 405
3  6  Charizard  Fire 534
4  7   Squirtle Water 314
5  8  Wartortle Water 405
6  9  Blastoise Water 530

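Writing the cleaned files to a separate directory

As requested in the comments on my answer, here is a solution that creates a /cleaned subdirectory inside the directory where the files were originally stored and writes the files there.

First, we create objects for the input and output directories. Then we create a new subdirectory for the output files, if it does not already exist.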

# solution that creates a ./cleaned subdirectory

inputDirectory <- "./pokemonData"
outputDirectory <- paste0(inputDirectory,"/cleaned")
if(!dir.exists(outputDirectory)) dir.create(outputDirectory)

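By checking whether the directory exists before trying to create it, we avoid the warning that dir.create() would otherwise raise on the second and subsequent runs of the script.

Next, we list the files in the input directory. Because we will use the inputDirectory and outputDirectory objects later in the script to build the full path of each input and output file by hand, we set the full.names= argument of list.files() to FALSE.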

theFiles <- list.files(inputDirectory,pattern="\\.csv$",full.names=FALSE)

Next, we use lapply() to read the files, subset the right rows and columns, and write the cleaned files to the output directory.

dataList <- lapply(theFiles,function(x){
     y <- read.csv(paste0(inputDirectory,"/",x),skip = 4,header=FALSE)[c(1,2,4,6)]
     write.table(y,file=paste0(outputDirectory,"/",x),row.names=FALSE,col.names=FALSE,sep=",")
})

# verify that files were written to cleaned directory
list.files(outputDirectory,full.names=TRUE)

...and the output:

> list.files(outputDirectory,full.names=TRUE)
[1] "./pokemonData/cleaned/gen01.csv" "./pokemonData/cleaned/gen02.csv"
[3] "./pokemonData/cleaned/gen03.csv" "./pokemonData/cleaned/gen04.csv"
[5] "./pokemonData/cleaned/gen05.csv" "./pokemonData/cleaned/gen06.csv"
[7] "./pokemonData/cleaned/gen07.csv" "./pokemonData/cleaned/gen08.csv"
>

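Appendix

Since a commenter asserted that the dot in the file name in paste0() does not render correctly, the following screenshot of the subdirectory shows that the code does work as I intended.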

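【Comments】:

Thanks! I ran the code, but it seems to add ".cleaned" after .csv, which makes the new files of type "CLEANED file". Is there a way to put "_cleaned" or ".cleaned" before .csv instead? I tried write.table(y,paste0(x,".cleaned.csv")) but it didn't work.
You could try write.table(y,paste0(x,"_cleaned.csv")). The two dots may cause problems with how the extension is interpreted.
Yes, there is a way to add the token cleaned before .csv, but it would break the list.files() command, because the next time the program runs, both the cleaned and the uncleaned files will be read by list.files(). A more elegant solution is to write the files to a different subdirectory, which is easy to code. I'll update my solution to show how to do this.
I see. That's really helpful. Thanks a lot!
@Paul726 - I've updated my answer to include a version that writes the output files to a separate subdirectory.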
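The re-reading problem raised in this thread can be reproduced in a few lines. The sketch below also shows one possible workaround for anyone who does want the `_cleaned.csv` naming: filter the already-cleaned files out of the listing. This is an illustrative alternative to the subdirectory approach the answer recommends, and the directory and file names are made up.

```r
# Demo directory with one raw file and one already-cleaned file (made-up names).
demo_dir <- file.path(tempdir(), "rerun_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("gen01.csv", "gen01_cleaned.csv"))))

# A plain .csv pattern picks up both files on a second run ...
all_csv <- list.files(demo_dir, pattern = "\\.csv$")

# ... so drop anything that already carries the _cleaned suffix.
raw_only <- all_csv[!grepl("_cleaned\\.csv$", all_csv)]
print(raw_only)
```

The filter relies on no raw file name ending in `_cleaned.csv`. Writing to a separate subdirectory avoids that assumption entirely.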