【Question title】: If function returning values as NAs
【Posted】: 2021-06-28 00:32:41
【Question】:

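I am doing a basic analysis of how many values in a large dataset are missing, suppressed, and so on. I am using the function below to classify the various types of missing data.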

Iffunction2 <- function(x) ifelse(is.na(x), "NA",ifelse(x == "-1", "Suppressed", ifelse(is.null(x), "Blank", ifelse(x == "Not Provided", "Not Provided", "Value"))))

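Strangely, for one column (d138 below), this function returns what look like actual values as NA (the grey italic NA). The data in this column is of type "double". I also tried converting it to "integer", without success.

Any help is much appreciated!

Best, Rachel


Extract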

    structure(list(d101_eligible_training_provider = c("Quincy College", 
"LARE INSTITUTE", "Springfield Technical Community College", 
"Network Technology Academy Institute", "Network Technology Academy Institute", 
"John Mason Institute at Hellenic University"), d103_provider_address = c("1250 Hancock Street Quincy MA  02169", 
"6 Campanelli Drive Andover MA  01810", "1 ARMORY SQUARE SPRINGFIELD MA  01105", 
"100 Pleasant Street Malden MA  02148", "100 Pleasant Street Malden MA  02148", 
"436 Amherst Street Nashua NH  03063"), d104_entity_type = c("Other", 
"Private For-Profit", "Higher Ed: Associate's Degree", "Other", 
"Other", "Other"), d105_program_name = c("Certificate in Digital Marketing", 
"MEDICAL BILLING/MEDICAL SECRETARIAL(W/0 GED PREP)", "Online Spanish Medical Interpreting Certification Program", 
"Coding And Web Development 2 Program", "Certified Professional Ethical Hacker", 
"Desktop Application User / Introduction"), d106_program_description = c("Digital marketing helps organizations promote and sell products and services through online marketing methods such as social media messaging website ads Facebook marketing campaigns Google Adwords and more. It's vital to develop a marketing strategy that keeps up with the technology.", 
"THIS PROGRAM IS DESIGNED TO RESPOND TO THE NEEDS O F THE MEDICAL RELATED ENVIRONMENT.", 
"Online Course. English/Spanish.  This course will help prepare new and experienced interpreters to work in hospitals health clinics law offices governmental agencies and more.  This program is open to all languages but students must be able to fully comprehend and communicate in both English and Spanish. Prospective students will be screened for pronunciation accuracy comprehension and overall readiness for the course. Must be 18 years or older.", 
"This course covers some important technologies of modern Server-Side development.", 
"Certified Information Systems Security Office training and certification program prepares and certifies individuals to analyze an organization's information security threats and risks and design a security program to mitigate these risks.", 
"Core program includes several desktop application tools typically used in everyday business. This program can be customized with electives depending on the student's employment requirements and personal goals. The core program includes Word Processing Spreadsheets E-mailand Presentation Tools with electives such as general database operation accounting software office communication tool reporting and project management tools."
), d107_program_url = c("http://www.quincycollege.edu", NA, "http://www.stcc.edu/wdc", 
"http://www.ntai.net", "http://ntai.net", "http://www.JohnMasonInstitute.com"
), d108_program = c(1, 1, 1, 1, 1, 1), d109_associated_credential = c("None Provided", 
"None Provided", "None Provided", "None Provided", "None Provided", 
"None Provided"), d110_cip_code = c(52.1402, 52.0402, 16.0103, 
11.0201, 52.1206, 52.0402), d111_non_wioa_tuition_cost = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), d112_non_wioa_supplies_cost = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), d113_program_length_hours = c(330, 
440, 60, 240, 400, 272), d114_program_length_weeks = c(11, 22, 
10, 12, 20, 8), d115_program_prerequisites = c(0, 0, 0, 0, 0, 
0), d116_program_format = c("This program provides online instruction, e-learning, or distance learning only.", 
"This program provides online instruction, e-learning, or distance learning only.", 
"This program provides online instruction, e-learning, or distance learning only.", 
"This program provides online instruction, e-learning, or distance learning only.", 
"This program provides online instruction, e-learning, or distance learning only.", 
"This program provides online instruction, e-learning, or distance learning only."
), d117_program_soc_occupation_1 = c("13-116100", "43-601100", 
"27-309100", "15-113100", "11-302100", "43-601400"), d118_program_soc_occupation_2 = c("-", 
"-", "-", "-", "-", "-"), d119_program_soc_occupation_3 = c("-", 
"-", "-", "-", "-", "-"), d120_total_served = c(-1, -1, -1, -1, 
-1, -1), d121_total_exited = c(-1, -1, -1, -1, -1, -1), d122_total_completed = c(-1, 
-1, -1, -1, -1, -1), d123_total_employed_q2 = c(-1, -1, -1, -1, 
-1, -1), d124_total_employed_q4 = c(-1, -1, -1, -1, -1, -1), 
    d125_median_earnings = c(-1, -1, -1, -1, -1, -1), d126_total_credential = c(-1, 
    -1, -1, -1, -1, -1), d133_total_wioa_served = c(-1, -1, -1, 
    -1, -1, -1), d134_total_wioa_exiters = c(-1, -1, -1, -1, 
    -1, -1), d135_total_wioa_served_with_ita = c(-1, -1, -1, 
    -1, -1, -1), d136_total_wioa_exited_with_ita = c(-1, -1, 
    -1, -1, -1, -1), d137_total_wioa_completed = c(-1, -1, -1, 
    -1, -1, -1), d138_cost_per_wioa_num = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_), d139_total_wioa_exiters_employed_q2 = c(-1, 
    -1, -1, -1, -1, -1), d140_total_wioa_exiters_employed_q4 = c(-1, 
    -1, -1, -1, -1, -1), d142_total_wioa_credential = c(-1, -1, 
    -1, -1, -1, -1), c_wioa_completed_percent = c(-1, -1, -1, 
    -1, -1, -1), c_total_employed_WIOA_q2_percent = c(-1, -1, 
    -1, -1, -1, -1), c_total_employed_WIOA_q4_percent = c(-1, 
    -1, -1, -1, -1, -1), c_completed_percent = c(-1, -1, -1, 
    -1, -1, -1), c_total_emp_q2_perc_comp = c(-1, -1, -1, -1, 
    -1, -1), c_wioa_earned_cred_percent = c(-1, -1, -1, -1, -1, 
    -1), c_cost_per_wioa = c(-1, -1, -1, -1, -1, -1), c_q2_employment_percent = c(-1, 
    -1, -1, -1, -1, -1), address = c("1250 Hancock Street Quincy", 
    "6 Campanelli Drive Andover", "1 Armory Square Springfield", 
    "100 Pleasant Street Malden", "100 Pleasant Street Malden", 
    "436 Amherst Street Nashua"), city = c("Quincy", "Andover", 
    "Springfield", "Malden", "Malden", "Nashua"), state = c("MA", 
    "MA", "MA", "MA", "MA", "NH"), zip = c(2169, 1810, 1105, 
    2148, 2148, 3063), lat = c(42.26, 42.65, 42.1, 42.43, 42.43, 
    42.78), long = c(-71, -71.14, -72.58, -71.05, -71.05, -71.52
    ), cip_formatted_4 = c(52.14, 52.04, 16.01, 11.02, 52.12, 
    52.04), reportingstate = c("MA", "MA", "MA", "MA", "MA", 
    "MA"), CIP_Title = c("Marketing.", "Business Operations Support and Assistant Services.", 
    "Linguistic, Comparative, and Related Language Studies and Services.", 
    "Computer Programming.", "Management Information Systems and Services.", 
    "Business Operations Support and Assistant Services.")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

【Comments】:

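  • Could you please share a reproducible snippet of your data with dput(head(data)), so that we can see clearly what your dataset looks like?
  • @AnoushiravanR Thanks! I have added an extract made with dput(head(data)) to the post above. Also note that the offending column is d138!
  • First of all, there are no null values in your dataset, so the is.null test can be dropped from the logical expression. But I don't understand in what way you want to change your dataset. d138_cost_per_wioa_num is already of integer type before any conversion.
  • @AnoushiravanR, since I want to analyze what is missing, I want to use the "if" expression from the post above to recode everything into its type of missing data. For d138, I want to distinguish blank/NA from actual values. But when I apply the if function above, actual values show up as "NA". (I realize it already shows as integer because I had already tried converting from double to integer and did not remove that line from my code before running dput.) Does that make sense?
  • Unfortunately I don't follow. Your code specifies that where there is an NA value it should return "NA", which is not actually an NA value, and that is exactly what your function does.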
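
The two points raised in the comments can be checked on a small vector. The sketch below is only illustrative: the vector `x` and the helper name `classify_missing` are made up for this example, and treating the empty string "" as "Blank" is an assumption that stands in for the `is.null` test, since `is.null` looks at the whole object and never at individual elements.

```r
# Hypothetical toy vector: one real value, one suppressed value, one missing value
x <- c(5, -1, NA)

# is.null() inspects the whole object, so it yields a single FALSE,
# not one result per element
null_check <- is.null(x)

# Assumed variant of the asker's function: compare on character values and
# treat "" as blank instead of calling is.null()
classify_missing <- function(x) {
  x_chr <- as.character(x)
  ifelse(is.na(x_chr), "NA",
    ifelse(x_chr == "-1", "Suppressed",
      ifelse(x_chr == "", "Blank",
        ifelse(x_chr == "Not Provided", "Not Provided", "Value"))))
}

labels <- classify_missing(x)

# The label "NA" is an ordinary string, so none of the labels is a real missing value
labels_are_missing <- is.na(labels)
```

Here `labels` is a character vector, so a true missing value in the result would have to come from somewhere other than the first `ifelse` branch.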

Tags: r


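【Solution 1】:

As I see it, this is a way to characterize what I'll call constants, columns in which all values are the same, in terms of rle$lengths, with is.na handled separately.

I would record and control my indices differently from your ifelse approach. Doing so gives you room to do things in different ways, keeps you flexible if you find that the supplied data has changed, and lets you exclude from treatment other columns that hold legitimate values but look constant at first, such as reportingstate.

Assuming your data.frame/tbl is called constants_df and was built from the structure( above, get the indices of all columns whose (non-NA) entries all share the same value, working on the transposed data.frame: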

 constant_t <- t(constants_df)
 constant_data_t <- vector(mode = 'integer') # to receive rle$lengths
 for (j in 1:nrow(constant_t)) {
 constant_data_t[j] <- rle(constant_t[j, ])$lengths[1]
 }
 # index where all values are the same, in this case 6L
 constant_t_idx <- which(constant_data_t == 6L)
 > constant_t_idx
 [1]  7  8 14 15 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 35 36 37 38
[26] 39 40 41 42 50

 # what are the values? strip out tibble stuff 
 constant_t_vals <- unname(unlist(constants_df[1, constant_t_idx]))

 unique(constant_t_vals)
 [1] "1"                                                                               
 [2] "None Provided"                                                                   
 [3] "0"                                                                               
 [4] "This program provides online instruction, e-learning, or distance learning only."
 [5] "-"                                                                               
 [6] "-1"                                                                              
 [7] "MA" 

> which(constant_t_vals == "-1")
 [1]  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

> which(is.na(constants_df[1, ]) == TRUE)
[1] 10 11 31
# are there any NA values (see `?NA`) in the data at all?
> any(is.na(constants_df))
[1] TRUE
# are those three columns NA in every row?
> is.na(constants_df[, c(10:11, 31)])
     d111_non_wioa_tuition_cost d112_non_wioa_supplies_cost
[1,]                       TRUE                        TRUE
[2,]                       TRUE                        TRUE
[3,]                       TRUE                        TRUE
[4,]                       TRUE                        TRUE
[5,]                       TRUE                        TRUE
[6,]                       TRUE                        TRUE
     d138_cost_per_wioa_num
[1,]                   TRUE
[2,]                   TRUE
[3,]                   TRUE
[4,]                   TRUE
[5,]                   TRUE
[6,]                   TRUE
# yes, they are
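
The same two checks can also be sketched column by column, without transposing. This is an alternative to the rle approach above, not the method used in this answer; `toy_df` and its column names are made up for illustration.

```r
# Hypothetical toy data: one all -1 column, one varying column,
# one all-NA column and one constant text column
toy_df <- data.frame(
  served  = c(-1, -1, -1),
  program = c("x", "y", "x"),
  cost    = c(NA_real_, NA_real_, NA_real_),
  state   = c("MA", "MA", "MA"),
  stringsAsFactors = FALSE
)

# A column counts as constant when it has no NA and exactly one distinct value
is_constant <- vapply(
  toy_df,
  function(col) !anyNA(col) && length(unique(col)) == 1L,
  logical(1)
)
constant_idx <- which(is_constant)

# Columns that are NA in every row are tracked separately
all_na_idx <- which(vapply(toy_df, function(col) all(is.na(col)), logical(1)))
```

Both results are named integer vectors of column positions, so they can be used directly to subset `toy_df`.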

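You can take this idea, build indices, and apply your labels (suppressed, not provided, and so on). You might decide to treat the all-NA columns differently from the columns that contain only '-1', and replace '-1' with 'suppressed' only in those columns: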

> # positions in constant_t_vals are mapped back to column numbers through constant_t_idx
> simple_minus_1 <- setdiff(constant_t_idx[constant_t_vals == "-1"], which(is.na(constants_df[1, ])))
> simple_minus_1
 [1] 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 35 36 37 38 39 40 41 42
> constants_df[, simple_minus_1] <- 'suppressed'
> constants_df[[19]]
[1] "suppressed" "suppressed" "suppressed" "suppressed" "suppressed"
[6] "suppressed"

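On further reflection, you may decide not to change your source data at all, and instead use the index approach to see how things evolve over time, since this looks like a reporting system that you are evaluating across time. What was reported when the data was first obtained may differ later on.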
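
One way to keep the source data untouched is to compute the labels on the fly and store only per-column counts. The sketch below assumes a made-up `toy_df` and a hypothetical helper `label_column`; the set of categories is taken from the question, not from this answer.

```r
# Hypothetical toy data mixing real values, -1, NA and "Not Provided"
toy_df <- data.frame(
  served = c(-1, 12, NA),
  url    = c("http://a", NA, "Not Provided"),
  stringsAsFactors = FALSE
)

categories <- c("NA", "Suppressed", "Not Provided", "Value")

# Label each element of a column without modifying the column itself
label_column <- function(col) {
  col_chr <- as.character(col)
  out <- rep("Value", length(col_chr))
  out[is.na(col_chr)] <- "NA"
  out[!is.na(col_chr) & col_chr == "-1"] <- "Suppressed"
  out[!is.na(col_chr) & col_chr == "Not Provided"] <- "Not Provided"
  out
}

# One row per category, one column per variable
counts <- vapply(
  toy_df,
  function(col) tabulate(match(label_column(col), categories),
                         nbins = length(categories)),
  integer(length(categories))
)
rownames(counts) <- categories
```

Running the same summary on each new delivery of the data gives count matrices that can be compared over time, while `toy_df` itself is never overwritten.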
