【Question title】: Shorter method to replace entries in R [duplicate]
【Posted】: 2016-11-19 06:16:51
【Question】:

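I recently started learning R. This is the source file I'm working with (https://github.com/cosname/art-r-translation/blob/master/data/Grades.txt). Is there any way to change the letter grades, from A to 4.0, A- to 3.7, and so on, without using a loop?

I ask because with 1M entries a "for" loop may not be the most efficient way to modify the data. I would appreciate any help.


Since one of the posters told me to post my code, I tried running a for loop to see whether I could do it. Here is my code: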

mygrades<-read.table("grades.txt",header = TRUE)

i <- for (i in 1:nrow(mygrades))
{
  #print(i)  
  #for now, see whether As get replaced with 4.0.
  if(mygrades[i,1]=="A")
  {
    mygrades[i,1]=4.0
  }
  else if (mygrades[i,2]=="A")
  {
    mygrades[i,2]=4.0
  }
  else if (mygrades[i,3]=="A")
  {
    mygrades[i,3]=4.0
  }
  else
  {
    #do nothing...continues
  }

}

write.table(mygrades,"newgrades.txt")

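However, the output is a bit odd. For some "A"s I get NA, while others are left as they are. Can someone help me with this code?


@alistaire, I did try Hadley's lookup table, and it works. I also looked at the dplyr code, and it runs well. However, for my own understanding, I am still trying to use a for loop. Please note that it has been only about two days since I first opened an R book. Here is the revised code.

# There was one mistake in my code: I didn't use stringsAsFactors = FALSE.
# Now this code doesn't work for all "A"s. It spits out 4.0 for some As and
# doesn't do so for others. Why would that be?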

mygrades<-read.table("grades.txt",header = TRUE,stringsAsFactors=FALSE)

i <- for (i in 1:nrow(mygrades))
{
  #print(i)  
  if(mygrades[i,1]=="A")
  {
    mygrades[i,1]=4.0
  }
  else if (mygrades[i,2]=="A")
  {
    mygrades[i,2]=4.0
  }
  else if (mygrades[i,3]=="A")
  {
    mygrades[i,3]=4.0
  }
  else
  {
    #do nothing...continues
  }

}

write.table(mygrades,"newgrades.txt")

The output is:

"final_exam" "quiz_avg" "homework_avg"
"1" "C" "4" "A"
"2" "C-" "B-" "4"
"3" "D+" "B+" "4"
"4" "B+" "B+" "4"
"5" "F" "B+" "4"
"6" "B" "A-" "4"
"7" "D+" "B+" "A-"
"8" "D" "A-" "4"
"9" "F" "B+" "4"
"10" "4" "C-" "B+"
"11" "A+" "4" "A"
"12" "A-" "4" "A"
"13" "B" "4" "A"
"14" "D-" "A-" "4"
"15" "A+" "4" "A"
"16" "B" "A-" "4"
"17" "F" "D" "A-"
"18" "B" "4" "A"
"19" "B" "B+" "4"
"20" "A+" "A-" "4"
"21" "4" "A" "A"
"22" "B" "B+" "4"
"23" "D" "B+" "4"
"24" "A-" "A-" "4"
"25" "F" "4" "A"
"26" "B+" "B+" "4"
"27" "A-" "B+" "4"
"28" "A+" "4" "A"
"29" "4" "A-" "A"
"30" "A+" "A-" "4"
"31" "4" "B+" "A-"
"32" "B+" "B+" "4"
"33" "C" "4" "A"

As you can see in the first row, the first A was recoded to 4, but the second A was not. Any idea why that happens?

Thanks in advance.

【Comments】:

  • key <- c(A = 4, 'A-' = 3.7, F = 0); key[c('F','A','A-')]
  • @rawr Where were you going with that? It's just a data frame with a key, without any actual replacement, right?
  • @rawr The code you posted. It doesn't do anything. I was trying to help you explain what it's for?
  • This is a general concept that can be used in many other ways. Here is one: grades <- as.matrix(read.table('https://raw.githubusercontent.com/cosname/art-r-translation/master/data/Grades.txt', header = TRUE)); un <- unique(c(grades)); key <- setNames(c(1:100, seq(un)), c(1:100, sort(un))); data.frame(matrix(key[grades], nrow(grades)))
  • @Hack-R Haha, okay: "I never downvote"

Tags: r data-analysis


【Solution 1】:

A typical approach in base R is to use a named vector as a lookup table, e.g.

# data with fewer levels for simplicity
df <- data.frame(x = rep(1:3, 2), y = rep(1:2, 3))

lookup <- c(`1` = "A", `2` = "B", `3` = "C")

and subset it with each column:

data.frame(lapply(df, function(x){lookup[x]}))
##   x y
## 1 A A
## 2 B B
## 3 C A
## 4 A B
## 5 B A
## 6 C B
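Applied to letter grades, the same idea might look like the sketch below. The small `grades` data frame is made up for illustration rather than read from Grades.txt, and every grade-point value other than A = 4.0 and A- = 3.7 is an assumption. The `as.character()` call guards against factor columns, which would otherwise index the lookup table by their integer codes instead of by name.

```r
# Hypothetical sample standing in for Grades.txt (three rows only)
grades <- data.frame(
  final_exam   = c("C", "A", "A-"),
  quiz_avg     = c("A", "C-", "B+"),
  homework_avg = c("A", "B+", "A"),
  stringsAsFactors = FALSE
)

# Named vector used as a lookup table; values other than A and A- are assumed
points <- c("A" = 4.0, "A-" = 3.7, "B+" = 3.3, "C" = 2.0, "C-" = 1.7)

# Index the lookup table by each column's letters (by name, not position);
# unname() drops the letter names from the resulting numeric vector
converted <- data.frame(
  lapply(grades, function(col) unname(points[as.character(col)]))
)
converted
```

Any grade missing from `points` would come back as NA, which makes gaps in the table easy to spot.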

Also, dplyr recently added a recode function that is useful for this kind of work:

library(dplyr)

df <- read.table('https://raw.githubusercontent.com/cosname/art-r-translation/master/data/Grades.txt', header = TRUE)

df %>% mutate_all(funs(recode(., A = '4.0', 
                              `A-` = '3.7'))) %>%    # etc.
    as_data_frame()    # for prettier printing

## # A tibble: 33 x 3
##    final_exam quiz_avg homework_avg
##        <fctr>   <fctr>       <fctr>
## 1           C      4.0          4.0
## 2          C-       B-          4.0
## 3          D+       B+          4.0
## 4          B+       B+          4.0
## 5           F       B+          4.0
## 6           B      3.7          4.0
## 7          D+       B+          3.7
## 8           D      3.7          4.0
## 9           F       B+          4.0
## 10        4.0       C-           B+
## # ... with 23 more rows
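As for the loop in the question, one likely reason that only one A per row gets replaced is that the checks are chained with `else if`: once one column in a row matches, the remaining columns in that row are never tested. A minimal sketch that checks every cell independently is shown below, using a small made-up data frame (`original`) in place of Grades.txt. The names `looped` and `vectorized` are illustrative only.

```r
# Hypothetical sample; the real data would come from read.table()
original <- data.frame(
  final_exam   = c("C", "A"),
  quiz_avg     = c("A", "A"),
  homework_avg = c("A", "B+"),
  stringsAsFactors = FALSE
)

# Loop version: test each cell on its own instead of chaining else-if
looped <- original
for (i in seq_len(nrow(looped))) {
  for (j in seq_len(ncol(looped))) {
    if (looped[i, j] == "A") {
      looped[i, j] <- "4.0"
    }
  }
}

# Vectorized version: a logical matrix selects every "A" cell in one step
vectorized <- original
vectorized[vectorized == "A"] <- "4.0"

looped
vectorized
```

Both versions should give the same result here. The vectorized assignment avoids the explicit loop, but it still handles only one grade at a time, so the lookup table remains the more compact option for a full grade scale.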

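【Comments】:

  • Thanks, Alistaire and the other posters. I have posted my code using a "for" loop. I'm getting wacky output. Do you think you could help me? I'm fairly new to R, so I apologize for the silly question. I'd appreciate any help you can offer.
  • @watchtower This solution makes heavy use of for loops and other slow control-flow statements. It just hides them in the source code of recode: github.com/hadley/dplyr/blob/master/R/recode.R
  • @watchtower Honestly, the approach you are taking would need about 400 lines of code, and at some point you would probably make a typo that would drive you crazy. You could start by looking at an example of a lookup table that Hadley wrote, then see whether you can work out how the first example works and how it applies to your data (which is bigger, but smaller than a for loop). The dplyr approach is a little more advanced, but it is an interesting alternative.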