【Posted】: 2017-10-13 22:55:12
【Question】:
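OK, so I have a bit of a dilemma, and I know there has to be a solution. I have a data table with 13 columns, but we only care about two of them (Fare and pClass). There are 1309 rows, 1308 of which have a Fare value, and I want to fill in the missing value with an average based on class (pClass). So what I want is to tell R to find the row where Fare = NA, read the value in pClass (1, 2, or 3), compute the average Fare for that class, and replace the missing Fare with that average.
So, to sum up the task for whoever is brave and kind enough to help me: I want to find a missing value, work out which class it belongs to, average that class specifically, and then replace the missing value with the correct average.
An approach I could reuse when there are several missing values, one that fills in the correct average whatever the deciding column is, would be a better route than just locating the one missing row and reading it off.
Thanks for your time,
-Dylan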
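OK, because this was too specific for the original question to get answered, new plan, boys (and girls, and whatever else you want to be, idrc, as long as you know what you're talking about). So! The new plan is to make 3 variables, one for each of the three pClass values (1, 2, and 3), each holding that class's average Fare (I'll call 'em pClassAVG.(x), where x = 1, 2, or 3). Then R should find the missing Fare values and replace them with the pClass variable (the average) of the corresponding pClass. R's thought process should go something like: "OK, a missing value. What's the pClass? OK, it's 2, so we should replace the missing value with pClassAVG.2."
Last time I got a -1 for not including my code, so here it is: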
setwd("C:/Users/Maker/Desktop/Data Science/Data/Dylan T/Titanic data")
titanic.train <- read.csv(file = "train.csv", stringsAsFactors = FALSE, header = TRUE)
titanic.test <- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE)
# line one tells it where to look for data. lines 2 & 3 read in the files we wanna manipulate
# stringsAsFactors = FALSE keeps text columns as plain strings instead of factors, which helps bc we are gonna clean the data sheets together
# header = TRUE makes the computer understand that there are headers and to not count or read the
# first line as data but as column names
#currently reads incorrectly
titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet <- FALSE
#makes a new column to tell us if it is the train set or test set
titanic.test$Survived <- NA
#makes a new column and fills it with NA to make the columns line up and have the same names
titanic.full <- rbind(titanic.train, titanic.test)
titanic.full[titanic.full$Embarked=='', "Embarked"] <- 'S'
#ended day 1 at 12 minutes
age.median <- median(titanic.full$Age, na.rm = TRUE)
#creates a variable called age.median and assigns it the median of the Age column excluding the missing values (if we included missing
#values the result would be NA, bc the median of data containing NA is undefined)
#this method is better for replacing data that can change for example real time data that changes over the course of the day and your
#data gets its info updated every so often thus eliminating the problem of missing values and an incorrect median.
titanic.full[is.na(titanic.full$Age), "Age"] <- age.median
#table(is.na(titanic.full$Age)) tallies the Age column of titanic.full; the TRUE count is the number of missing values
pClassAVG.1 <- median(titanic.full$Fare, na.rm = TRUE, titanic.full$Pclass == 1 )
pClassAVG.2 <- median(titanic.full$Fare, na.rm = TRUE, titanic.full$Pclass == 2 )
The last two lines are my attempt to tell it to generate the pClassAVG.1 and pClassAVG.2 described above.
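A note on why those two lines fall short, plus a minimal sketch of one possible fix. `median()` takes the data as its first argument and has no filtering argument, so the `titanic.full$Pclass == 1` condition is never applied as a filter (in recent R versions it appears to be absorbed by `...` and ignored; older versions may raise an unused-argument error instead). The usual approach is to subset `Fare` by class before calling `median()`. The sketch below uses a small made-up data frame named `toy` (hypothetical values, not the Titanic data) and assumes the class column is called `Pclass`, as in the code above. It shows the three-variable plan for one class, then base R's `ave()` to fill every missing fare from its own class in one step.

```r
# Toy stand-in for titanic.full (made-up values, for illustration only)
toy <- data.frame(
  Pclass = c(1, 1, 1, 2, 2, 3, 3, 3),
  Fare   = c(80, 100, 60, 20, 30, 8, NA, 10)
)

# Three-variable plan: subset Fare by class first, then take the median
pClassAVG.3 <- median(toy$Fare[toy$Pclass == 3], na.rm = TRUE)

# General version: ave() returns, for each row, the median of that row's class
class.median <- ave(toy$Fare, toy$Pclass,
                    FUN = function(x) median(x, na.rm = TRUE))

# Overwrite only the missing fares with the median of their own class
missing.fare <- is.na(toy$Fare)
toy$Fare[missing.fare] <- class.median[missing.fare]
```

Swapping `median` for `mean` inside `FUN` would give a true average instead. On the real data the same lines should work with `titanic.full` in place of `toy`, and because the replacement is driven by `is.na()`, it handles any number of missing fares.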
【Comments】:
-
Dylan, for your next question, please take a look at the link that @thecatalyst just provided.
Tags: r missing-data