【Question title】: R - subset data frame keeping only the rows that agree with multiple conditions over ALL columns
【Posted】: 2017-07-18 19:26:48
【Question】:

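A quick summary of what I want:

I have thousands of .csv files in the same folder that contain phrases such as discount rate and discounted cash flow, mostly in the first column but also scattered randomly across the first 10 columns.

Using some function (probably grepl(), subset(), or filter()), I want to extract the rows that contain these phrases and put them into a new data frame, along with the name of the file each row came from.

The problem I'm running into is that every function I have tried only lets me look at one or two columns at a time. Here is the code I have been using: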

#Reading in a single .csv file for now:
MyData <- read.csv("c:/____________/.csv", header = TRUE, sep=",")

#Assigning numbers to each column since each file I will be plugging in has different column headings:
colnames(MyData) <- c(1:ncol(MyData))

#Using subset to check the 1st column and 5th column for discount rate 
#(only because I knew these 2 columns contained the phrase "discount rate" ahead of time.)
my.data.frame <- subset(MyData, MyData$`1`=="discount rate" | MyData$`5`=="discount rate")

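So to reiterate, I would like to know whether there is a way to search every column of a data.frame for several phrases, such as discount rate, discounted rates, and discounted cash flow. Thanks for any help.

Also, the code I provided does return the rows where the specified columns contain exactly discount rate, but not rows where the cell contains additional words, such as discount rate is 5.0%. If anyone knows a solution to this problem, I would appreciate it.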

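【Comments】:

  • Take a look at the grep function.
  • Using grep seems complicated, because you have to specify a column name to search, but none of the files I am looking at have consistent column names or a consistent number of columns.

Tags: r csv filter subset grepl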


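【Solution 1】:

Consider using grepl (which returns TRUE/FALSE on a regex match) inside apply, all wrapped in a larger lapply that builds a list of data frames, each holding the matching rows from one of your many csv files. Then row-bind the whole list at the end: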

setwd("C:/path/to/my/folder")
# ONLY PICK UP .csv FILES IN THE FOLDER
myfiles <- list.files(path="C:/path/to/my/folder", pattern="\\.csv$")

dfList <- lapply(myfiles, function(file){
    df <- read.csv(file, header = TRUE, stringsAsFactors = FALSE)
    colnames(df) <- c(1:ncol(df))

    # TRUE IF ANY COLUMN IN THE ROW HAS A MATCH
    # (COMPUTED BEFORE ADDING filename SO THE FILE NAME ITSELF IS NOT SEARCHED)
    discountfound <- apply(df, 1, function(row) 
                           any(grepl("discount rate|discounted cash flow", row)))

    # KEEP ONLY MATCHING ROWS
    df <- df[discountfound, , drop = FALSE]

    # ADD COLUMN FOR FILENAME (rep() ALSO HANDLES ZERO MATCHING ROWS)
    df$filename <- rep(file, nrow(df))
    df
})

# ASSUMES ALL DATAFRAMES HAVE EQUAL NUMBER OF COLUMNS
finaldf <- do.call(rbind, dfList)
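
Since the question mentions that the files do not share a consistent number of columns, the final rbind may fail as written. One possible workaround, sketched in base R, is to pad each data frame with NA columns first. The helper name bind_ragged and the toy data frames a and b are made up for illustration and are not part of the original answer:

```r
# Hypothetical helper (not from the original answer): pad every data frame
# with NA columns so they all share the same column names before rbind.
bind_ragged <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    absent <- setdiff(all_cols, names(d))
    for (m in absent) d[[m]] <- rep(NA, nrow(d))
    d[, all_cols, drop = FALSE]   # same column order for every data frame
  })
  do.call(rbind, padded)
}

# Toy inputs standing in for two per-file results of different widths
a <- data.frame(`1` = "discount rate is 5.0%", filename = "a.csv",
                check.names = FALSE, stringsAsFactors = FALSE)
b <- data.frame(`1` = "x", `2` = "discounted cash flow", filename = "b.csv",
                check.names = FALSE, stringsAsFactors = FALSE)

combined <- bind_ragged(list(a, b))
```

With the list from the answer this would be called as bind_ragged(dfList) in place of do.call(rbind, dfList). Note that the filename column then ends up wherever it first appeared rather than last.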

【Discussion】:

    【Solution 2】:

    You could try this. I hope it is what you want.

    mydata = data.frame(a = c(1:3,"discount rate","discounted rates",2:5),
                        b = c("discount rate","discounted rates",2:8))
    
    row = c()
    for (i in 1:nrow(mydata)){
      good_row = grep(paste("discount rate","discounted rates",sep="|"),unlist(mydata[i,]))
      if (length(good_row) != 0){
        row = c(row,i)
      }
    }
    
    mydata = mydata[row,]
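
The row-by-row loop above can probably also be written without an explicit loop. The following is a minimal sketch, assuming the same sample data as in this answer; the names pattern, hits, and matched are introduced here for illustration only:

```r
# Same sample data as in the answer above
mydata <- data.frame(a = c(1:3, "discount rate", "discounted rates", 2:5),
                     b = c("discount rate", "discounted rates", 2:8),
                     stringsAsFactors = FALSE)

pattern <- "discount rate|discounted rates"

# One logical vector per column, combined element-wise with OR,
# so hits[i] is TRUE when any column of row i matches the pattern
hits <- Reduce(`|`, lapply(mydata, function(col) grepl(pattern, col)))

matched <- mydata[hits, , drop = FALSE]
```

Because grepl is vectorized over a whole column, this calls it once per column instead of once per row, which may matter when there are thousands of files.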
    

    【Discussion】:

      【Solution 3】:

      Would something like this work? You can replace 'discount' with a regular expression (regex) to suit your needs.

      #Sample dataframe with 'discount rate', 'discounted rates', or 'discounted cash flow' randomly placed
      df <- data.frame(a=c('discount rate', 'nothing', 'discounted cash flow', 'nothing', 'nothing'), b=1:5,
        c=6:10, d=c('nothing', 'discounted rates', 'nothing', 'nothing', 'nothing'), stringsAsFactors = F)
      df
                           a b  c                d
      1        discount rate 1  6          nothing
      2              nothing 2  7 discounted rates
      3 discounted cash flow 3  8          nothing
      4              nothing 4  9          nothing
      5              nothing 5 10          nothing
      
      #Get rows where the word 'discount' occurs in any column
      discountRows <- unique(unlist(apply(df, 2, function(x) grep('discount', x))))
      
      #Subset df with only rows where the word 'discount' occurs
      df[discountRows,]
                           a b c                d
      1        discount rate 1 6          nothing
      3 discounted cash flow 3 8          nothing
      2              nothing 2 7 discounted rates
      
      #Assign subsetted df to new dataframe with original name in it
      assign(paste0(deparse(substitute(df)), '_discountRows'), df[discountRows,])
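
As an illustration of adapting the regex, the sketch below narrows 'discount' to the three phrases from the question and ignores case. The pattern and the sample vector cells are assumptions made for this example, not part of the original answer:

```r
# Assumed pattern: "discount" or "discounted", one space, then
# "rate", "rates", or "cash flow"
pattern <- "discount(ed)? (rates?|cash flow)"

# Made-up cell values, including one with extra words around the phrase
cells <- c("discount rate", "Discount rate is 5.0%", "discounted rates",
           "Discounted Cash Flow", "nothing", "discount")

is_match <- grepl(pattern, cells, ignore.case = TRUE)
# is_match is TRUE for the first four cells and FALSE for the last two
```

Since grepl matches anywhere inside the string, a cell such as "Discount rate is 5.0%" is still found, which is the case the == comparison in the question misses.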
      

      【Discussion】:
