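【Posted】: 2021-06-11 02:32:18
【Question】:
I am trying to write a function that computes descriptive statistics for numeric and categorical (factor) variables. For numeric variables, it should compute the mean (MEAN), median (MEDIAN), and standard deviation (SD), and count the missing values (NMiss). For character variables, it should tabulate the counts within each level of the variable and count the missing values.
The input data is: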
  ID GLUC TGL HDL LDL  HRT MAMM SMOKE
1  A   88  NA  32  99    Y <NA>  ever
2  B   NA 150  60  NA <NA>   no never
3  C  110  NA  NA 120    N <NA>  <NA>
4  D   NA 200  65 165 <NA>  yes never
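For a reproducible example, the four rows shown above can be rebuilt as a data frame. This is only a sketch: it assumes the character columns are plain character vectors (they could equally be factors), and that these four rows are just the head of the data, since the counts in the desired output further down are larger than four. In the printout, `NA` marks a missing numeric value and `<NA>` a missing character or factor value; both are ordinary `NA` values in R.

```r
# Rebuild the four visible rows; the real data set presumably has more rows.
patient <- data.frame(
  ID    = c("A", "B", "C", "D"),
  GLUC  = c(88, NA, 110, NA),
  TGL   = c(NA, 150, NA, 200),
  HDL   = c(32, 60, NA, 65),
  LDL   = c(99, NA, 120, 165),
  HRT   = c("Y", NA, "N", NA),
  MAMM  = c(NA, "no", NA, "yes"),
  SMOKE = c("ever", "never", NA, "never"),
  stringsAsFactors = FALSE
)

# Missing values per column, regardless of column type
na_counts <- colSums(is.na(patient))
print(na_counts)
```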
I would like the output to look like this:
> table1 (dat=patient, numvar=c("TGL", "HDL", "LDL"), charvar=c("HRT", "MAMM"))
$numericStats
  varName      MEAN MEDIAN       SD NMiss
1     TGL 180.66667  180.0 23.03620     4
2     HDL  55.66667   62.5 19.00175     4
3     LDL 160.28571  165.0 40.06126     3
$FactorStats
  varName group count
1     HRT     N     2
2             Y     3
3         NMiss     5
4    MAMM    no     2
5           yes     4
6         NMiss     4
Here is my code so far:
#numericstats
findnum = function(dat, numvar){
  numstats=data.frame()
  for (i in length(numvar[])){
    var_select = dat[[numvar[i]]]
    mean_value = round(mean(var_select, na.rm=T),2)
    median_value = round(median(var_select, na.rm=T),2)
    SD = round(sd(var_select, na.rm=T),2)
    N = length(var_select[!is.na(var_select)])
    N_miss = length(var_select[is.na(var_select)])
    numstats =
      cbind(varname = numvar, mean = mean_value, median = median_value, sd = SD, nmissing = N_miss)
  }
  return(numstats)
}
findnum(dat=patient, numvar=c("TGL","HDL","LDL"))
#factorstats
findfactor = function(dat, charvar){
  factstats=data.frame()
  for (i in length(charvar[])){
    var_select = dat[[charvar[i]]]
    count = length(charvar)
    group = charvar
    factstats =
      cbind(varname = charvar, group = charvar, count = count)
  }
  return(factstats)
}
findfactor(dat=patient, charvar=c("MAMM","SMOKE"))
#full function
table1 = function(dat, numvar, charvar){
  for (i in 1:length(dat)){
    if (!is.numeric(i))
      numericstats = findnum(dat, i)
    else factorstats = findfactor(dat, i)
    return(data.frame(numericstats, factorstats))
  }
}
【Discussion】:
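A few things in the attempt stand out. `for (i in length(numvar[]))` loops over a single value (the length) rather than over `1:length(numvar)`, so only the last variable is processed. The result of `cbind()` is overwritten on each pass instead of being accumulated. `findfactor` never tabulates `var_select`. And `table1` tests `is.numeric(i)` on the loop index, ignores `numvar` and `charvar`, and returns from inside the loop. Below is one possible base R sketch, not the only way to do it. It assumes the four sample rows shown in the question (the real data evidently has more), collects one small data frame per variable with `lapply()` and `do.call(rbind, ...)`, and returns a list, since the two tables have different columns and cannot be combined with `data.frame()`.

```r
# Sample data: only the four rows visible in the question (an assumption;
# the real data set has more rows, so the numbers will differ).
patient <- data.frame(
  TGL  = c(NA, 150, NA, 200),
  HDL  = c(32, 60, NA, 65),
  LDL  = c(99, NA, 120, 165),
  HRT  = c("Y", NA, "N", NA),
  MAMM = c(NA, "no", NA, "yes"),
  stringsAsFactors = FALSE
)

# One row of summary statistics per numeric variable
findnum <- function(dat, numvar) {
  rows <- lapply(numvar, function(v) {
    x <- dat[[v]]
    data.frame(
      varName = v,
      MEAN    = mean(x, na.rm = TRUE),
      MEDIAN  = median(x, na.rm = TRUE),
      SD      = sd(x, na.rm = TRUE),
      NMiss   = sum(is.na(x)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# One row per level of each character variable, plus an NMiss row
findfactor <- function(dat, charvar) {
  rows <- lapply(charvar, function(v) {
    x <- dat[[v]]
    counts <- table(x)  # table() drops NA by default
    data.frame(
      varName = v,
      group   = c(names(counts), "NMiss"),
      count   = c(as.vector(counts), sum(is.na(x))),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Combine both summaries in a named list, as in the desired output
table1 <- function(dat, numvar, charvar) {
  list(
    numericStats = findnum(dat, numvar),
    FactorStats  = findfactor(dat, charvar)
  )
}

res <- table1(dat = patient, numvar = c("TGL", "HDL", "LDL"),
              charvar = c("HRT", "MAMM"))
print(res)
```

Unlike the desired printout, this sketch repeats `varName` on every row of `FactorStats`; blanking the repeats (for example with `duplicated()`) would be a display-only step.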
Tags: r function statistics character numeric