【Question title】: Remove data in parentheses from data frame
【Posted】: 2018-08-16 00:44:20
【Question description】:

I have a dataset with 49 rows and 109 features, formatted so that each entry holds a mean and an sd value. Here is a sample:

> head(score_data[,1:4])
# A tibble: 6 x 4
  Variable                                                     Overall       `18 to 29`    `30 to 39`   
  <chr>                                                        <chr>         <chr>         <chr>        
1 ts.tsmart_partisan_score (mean (sd))                         94.01 (9.73)  92.56 (10.82) 94.14 (9.55) 
2 ts.tsmart_presidential_general_turnout_score (mean (sd))     66.23 (24.38) 51.56 (20.02) 58.44 (24.36)
3 ts.tsmart_midterm_general_turnout_score (mean (sd))          50.29 (29.05) 31.09 (18.81) 34.82 (22.15)
4 ts.tsmart_offyear_general_turnout_score (mean (sd))          20.71 (15.08) 25.38 (17.36) 18.84 (14.35)
5 ts.tsmart_presidential_primary_turnout_score (mean (sd))     48.34 (28.12) 38.26 (22.26) 36.19 (22.72)
6 ts.tsmart_non_presidential_primary_turnout_score (mean (sd)) 40.21 (29.00) 27.03 (20.14) 23.52 (19.32)

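I want to extract the mean from all 109 columns of the dataset. Since the features are character strings, I know I can use the separate() command to split the data into two columns based on the index of the first parenthesis, like so: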

data <- data %>% separate(PrecinctName, into = c("Precinct", "PrecinctCode"), sep = 5)

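However, I want to do this for every feature in the dataset, and using the approach above is time-consuming and painful. Does anyone have a more elegant solution? I don't particularly care about keeping the sd data, so the method doesn't have to preserve it.

As requested, here is the dput() output: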

    > dput( head(score_data[,1:4]))
structure(list(Variable = c("ts.tsmart_partisan_score (mean (sd))", 
"ts.tsmart_presidential_general_turnout_score (mean (sd))", "ts.tsmart_midterm_general_turnout_score (mean (sd))", 
"ts.tsmart_offyear_general_turnout_score (mean (sd))", "ts.tsmart_presidential_primary_turnout_score (mean (sd))", 
"ts.tsmart_non_presidential_primary_turnout_score (mean (sd))"
), Overall = c("94.01 (9.73)", "66.23 (24.38)", "50.29 (29.05)", 
"20.71 (15.08)", "48.34 (28.12)", "40.21 (29.00)"), `18 to 29` = c("92.56 (10.82)", 
"51.56 (20.02)", "31.09 (18.81)", "25.38 (17.36)", "38.26 (22.26)", 
"27.03 (20.14)"), `30 to 39` = c("94.14 (9.55)", "58.44 (24.36)", 
"34.82 (22.15)", "18.84 (14.35)", "36.19 (22.72)", "23.52 (19.32)"
)), .Names = c("Variable", "Overall", "18 to 29", "30 to 39"), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

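【Question comments】:

  • Is your goal to separate them, or to remove the values in parentheses?
  • @Onyambu I mean to remove the values in parentheses, along with the leading space before the parenthesis.
  • Can you post the output of dput(head(score_data[,1:4]))?
  • I just edited the post to include it.

Tags: r dplyr substring character tidyr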


【Solution 1】:

Something like this?

means <- sapply(score_data[, -1], function(x) as.numeric(substr(x, 1,
     regexpr(" ", x) - 1)))
means
#      Overall 18 to 29 30 to 39
# [1,]   94.01    92.56    94.14
# [2,]   66.23    51.56    58.44
# [3,]   50.29    31.09    34.82
# [4,]   20.71    25.38    18.84
# [5,]   48.34    38.26    36.19
# [6,]   40.21    27.03    23.52
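
As a minimal sketch of what the one-liner above does, the two steps can be run separately on a few values copied from the question's `Overall` column (the vector name `x` and the intermediate `pos` are only for illustration):

```r
# Sample values copied from the question's `Overall` column
x <- c("94.01 (9.73)", "66.23 (24.38)", "50.29 (29.05)")

# Position of the first space in each string
pos <- regexpr(" ", x)

# Keep everything before that space and convert it to numeric
means <- as.numeric(substr(x, 1, pos - 1))
means
```

Note that sapply() in the answer returns a numeric matrix, so wrap the result in data.frame() or assign it back into the original columns if a data frame is needed.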

【Comments】:

【Solution 2】:

A simple regular expression should do it:

            for (i in names(score_data)[-(1)]) {   
                   score_data[[i]] <- as.numeric(gsub( " .*$", "", score_data[[i]] )) 
            }
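
The same idea can be written without an explicit loop. The sketch below is a variant under stated assumptions, not the answerer's code: it rebuilds a two-row stand-in for `score_data` from the question's dput() output and uses lapply() with sub() to overwrite every column except the first.

```r
# Two-row stand-in for score_data, copied from the question's dput() output
score_data <- data.frame(
  Variable = c("ts.tsmart_partisan_score (mean (sd))",
               "ts.tsmart_presidential_general_turnout_score (mean (sd))"),
  Overall = c("94.01 (9.73)", "66.23 (24.38)"),
  `18 to 29` = c("92.56 (10.82)", "51.56 (20.02)"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Drop everything from the first space onward, then convert to numeric.
# The first column (Variable) is left untouched.
score_data[-1] <- lapply(score_data[-1], function(col) {
  as.numeric(sub(" .*$", "", col))
})

str(score_data)
```

Assigning into `score_data[-1]` keeps the result a data frame and preserves the original column names.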
    

【Comments】:

【Solution 3】:

       # dat is the data frame from the question (called score_data there)
       read.table(stringsAsFactors = FALSE, text = gsub("\\(.*?\\)|\\)", "", do.call(paste, dat)))
                                                      V1    V3    V4    V5
      1                         ts.tsmart_partisan_score 94.01 92.56 94.14
      2     ts.tsmart_presidential_general_turnout_score 66.23 51.56 58.44
      3          ts.tsmart_midterm_general_turnout_score 50.29 31.09 34.82
      4          ts.tsmart_offyear_general_turnout_score 20.71 25.38 18.84
      5     ts.tsmart_presidential_primary_turnout_score 48.34 38.26 36.19
      6 ts.tsmart_non_presidential_primary_turnout_score 40.21 27.03 23.52
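
To see what read.table() hands back, the following sketch runs the same steps on a single row copied from the question and inspects the column classes. The one-row `dat` and the intermediate names `rows`, `cleaned`, and `out` are illustrative only.

```r
# One row in the question's format
dat <- data.frame(
  Variable = "ts.tsmart_partisan_score (mean (sd))",
  Overall = "94.01 (9.73)",
  `18 to 29` = "92.56 (10.82)",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Paste each row into a single space-separated string
rows <- do.call(paste, dat)

# Remove every parenthesised group, plus the stray ")" left by the nested "(mean (sd))"
cleaned <- gsub("\\(.*?\\)|\\)", "", rows)

# Re-parse the cleaned text; read.table() guesses the column types
out <- read.table(text = cleaned, stringsAsFactors = FALSE)
sapply(out, class)
```

Because the text is re-parsed from scratch, the original column names are lost and replaced by read.table()'s default `V` names.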
      

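【Comments】:

  • One thing to note is that the resulting columns are numeric, so you can manipulate them however you like.

【Solution 4】:

You can use gsub() with a regular expression to remove any characters inside the parentheses, like so: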

      test <- score_data %>% mutate_at(vars(-Variable),funs(gsub("\\([^\\)]+\\)", "", ., perl = T)))
      
      
                                                          Variable Overall X18.to.29 X30.to.39
      1                         ts.tsmart_partisan_score (mean (sd))  94.01     92.56     94.14 
      2     ts.tsmart_presidential_general_turnout_score (mean (sd))  66.23     51.56     58.44 
      3          ts.tsmart_midterm_general_turnout_score (mean (sd))  50.29     31.09     34.82 
      4          ts.tsmart_offyear_general_turnout_score (mean (sd))  20.71     25.38     18.84 
      5     ts.tsmart_presidential_primary_turnout_score (mean (sd))  48.34     38.26     36.19 
      6 ts.tsmart_non_presidential_primary_turnout_score (mean (sd))  40.21     27.03     23.52 
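
The gsub() call above leaves the values as character strings with a trailing space, as the output shows. Also, funs() has been deprecated since dplyr 0.8. A hedged sketch of the same idea with across() (this assumes dplyr 1.0 or later, and uses a two-row stand-in for `score_data` copied from the question) that also converts the result to numeric:

```r
library(dplyr)

# Two-row stand-in for score_data, copied from the question
score_data <- tibble(
  Variable = c("ts.tsmart_partisan_score (mean (sd))",
               "ts.tsmart_midterm_general_turnout_score (mean (sd))"),
  Overall = c("94.01 (9.73)", "50.29 (29.05)"),
  `18 to 29` = c("92.56 (10.82)", "31.09 (18.81)")
)

# For every column except Variable: strip the optional spaces and the
# parenthesised sd, then convert what is left to numeric
means <- score_data %>%
  mutate(across(-Variable, ~ as.numeric(sub(" *\\(.*$", "", .x))))

means
```

Removing the leading spaces together with the parenthesised part means as.numeric() receives a clean number, and the backticked column names are kept as they are.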
      

【Comments】:
