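【Posted】: 2021-06-28 00:32:41
【Question】:
I am doing a basic analysis of how many values in a large dataset are missing, suppressed, and so on. I am using the function below to classify the various types of missing data.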
Iffunction2 <- function(x) ifelse(is.na(x), "NA",ifelse(x == "-1", "Suppressed", ifelse(is.null(x), "Blank", ifelse(x == "Not Provided", "Not Provided", "Value"))))
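Strangely, for one column (d138 below), this function returns what appear to be actual values as NA (that is, the gray italic NA). The data in this column is of type "double". I also tried converting it to "integer", without success.
Any help is much appreciated!
Best, Rachel
Excerpt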
structure(list(d101_eligible_training_provider = c("Quincy College",
"LARE INSTITUTE", "Springfield Technical Community College",
"Network Technology Academy Institute", "Network Technology Academy Institute",
"John Mason Institute at Hellenic University"), d103_provider_address = c("1250 Hancock Street Quincy MA 02169",
"6 Campanelli Drive Andover MA 01810", "1 ARMORY SQUARE SPRINGFIELD MA 01105",
"100 Pleasant Street Malden MA 02148", "100 Pleasant Street Malden MA 02148",
"436 Amherst Street Nashua NH 03063"), d104_entity_type = c("Other",
"Private For-Profit", "Higher Ed: Associate's Degree", "Other",
"Other", "Other"), d105_program_name = c("Certificate in Digital Marketing",
"MEDICAL BILLING/MEDICAL SECRETARIAL(W/0 GED PREP)", "Online Spanish Medical Interpreting Certification Program",
"Coding And Web Development 2 Program", "Certified Professional Ethical Hacker",
"Desktop Application User / Introduction"), d106_program_description = c("Digital marketing helps organizations promote and sell products and services through online marketing methods such as social media messaging website ads Facebook marketing campaigns Google Adwords and more. It's vital to develop a marketing strategy that keeps up with the technology.",
"THIS PROGRAM IS DESIGNED TO RESPOND TO THE NEEDS O F THE MEDICAL RELATED ENVIRONMENT.",
"Online Course. English/Spanish. This course will help prepare new and experienced interpreters to work in hospitals health clinics law offices governmental agencies and more. This program is open to all languages but students must be able to fully comprehend and communicate in both English and Spanish. Prospective students will be screened for pronunciation accuracy comprehension and overall readiness for the course. Must be 18 years or older.",
"This course covers some important technologies of modern Server-Side development.",
"Certified Information Systems Security Office training and certification program prepares and certifies individuals to analyze an organization's information security threats and risks and design a security program to mitigate these risks.",
"Core program includes several desktop application tools typically used in everyday business. This program can be customized with electives depending on the student's employment requirements and personal goals. The core program includes Word Processing Spreadsheets E-mailand Presentation Tools with electives such as general database operation accounting software office communication tool reporting and project management tools."
), d107_program_url = c("http://www.quincycollege.edu", NA, "http://www.stcc.edu/wdc",
"http://www.ntai.net", "http://ntai.net", "http://www.JohnMasonInstitute.com"
), d108_program = c(1, 1, 1, 1, 1, 1), d109_associated_credential = c("None Provided",
"None Provided", "None Provided", "None Provided", "None Provided",
"None Provided"), d110_cip_code = c(52.1402, 52.0402, 16.0103,
11.0201, 52.1206, 52.0402), d111_non_wioa_tuition_cost = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), d112_non_wioa_supplies_cost = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), d113_program_length_hours = c(330,
440, 60, 240, 400, 272), d114_program_length_weeks = c(11, 22,
10, 12, 20, 8), d115_program_prerequisites = c(0, 0, 0, 0, 0,
0), d116_program_format = c("This program provides online instruction, e-learning, or distance learning only.",
"This program provides online instruction, e-learning, or distance learning only.",
"This program provides online instruction, e-learning, or distance learning only.",
"This program provides online instruction, e-learning, or distance learning only.",
"This program provides online instruction, e-learning, or distance learning only.",
"This program provides online instruction, e-learning, or distance learning only."
), d117_program_soc_occupation_1 = c("13-116100", "43-601100",
"27-309100", "15-113100", "11-302100", "43-601400"), d118_program_soc_occupation_2 = c("-",
"-", "-", "-", "-", "-"), d119_program_soc_occupation_3 = c("-",
"-", "-", "-", "-", "-"), d120_total_served = c(-1, -1, -1, -1,
-1, -1), d121_total_exited = c(-1, -1, -1, -1, -1, -1), d122_total_completed = c(-1,
-1, -1, -1, -1, -1), d123_total_employed_q2 = c(-1, -1, -1, -1,
-1, -1), d124_total_employed_q4 = c(-1, -1, -1, -1, -1, -1),
d125_median_earnings = c(-1, -1, -1, -1, -1, -1), d126_total_credential = c(-1,
-1, -1, -1, -1, -1), d133_total_wioa_served = c(-1, -1, -1,
-1, -1, -1), d134_total_wioa_exiters = c(-1, -1, -1, -1,
-1, -1), d135_total_wioa_served_with_ita = c(-1, -1, -1,
-1, -1, -1), d136_total_wioa_exited_with_ita = c(-1, -1,
-1, -1, -1, -1), d137_total_wioa_completed = c(-1, -1, -1,
-1, -1, -1), d138_cost_per_wioa_num = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), d139_total_wioa_exiters_employed_q2 = c(-1,
-1, -1, -1, -1, -1), d140_total_wioa_exiters_employed_q4 = c(-1,
-1, -1, -1, -1, -1), d142_total_wioa_credential = c(-1, -1,
-1, -1, -1, -1), c_wioa_completed_percent = c(-1, -1, -1,
-1, -1, -1), c_total_employed_WIOA_q2_percent = c(-1, -1,
-1, -1, -1, -1), c_total_employed_WIOA_q4_percent = c(-1,
-1, -1, -1, -1, -1), c_completed_percent = c(-1, -1, -1,
-1, -1, -1), c_total_emp_q2_perc_comp = c(-1, -1, -1, -1,
-1, -1), c_wioa_earned_cred_percent = c(-1, -1, -1, -1, -1,
-1), c_cost_per_wioa = c(-1, -1, -1, -1, -1, -1), c_q2_employment_percent = c(-1,
-1, -1, -1, -1, -1), address = c("1250 Hancock Street Quincy",
"6 Campanelli Drive Andover", "1 Armory Square Springfield",
"100 Pleasant Street Malden", "100 Pleasant Street Malden",
"436 Amherst Street Nashua"), city = c("Quincy", "Andover",
"Springfield", "Malden", "Malden", "Nashua"), state = c("MA",
"MA", "MA", "MA", "MA", "NH"), zip = c(2169, 1810, 1105,
2148, 2148, 3063), lat = c(42.26, 42.65, 42.1, 42.43, 42.43,
42.78), long = c(-71, -71.14, -72.58, -71.05, -71.05, -71.52
), cip_formatted_4 = c(52.14, 52.04, 16.01, 11.02, 52.12,
52.04), reportingstate = c("MA", "MA", "MA", "MA", "MA",
"MA"), CIP_Title = c("Marketing.", "Business Operations Support and Assistant Services.",
"Linguistic, Comparative, and Related Language Studies and Services.",
"Computer Programming.", "Management Information Systems and Services.",
"Business Operations Support and Assistant Services.")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
【Discussion】:
-
Could you please share a reproducible snippet of your data with dput(head(data)), so that we can see clearly what your dataset looks like?
-
@AnoushiravanR Thanks! I have added an excerpt produced with dput(head(data)) to the post above. Also note that the offending column is d138!
-
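First of all, there are no NULL values in your dataset, so you can drop the is.null test from your logical expression. But I don't understand in what way you want to change the dataset. d138_cost_per_wioa_num is already of integer type before any conversion.
-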
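@AnoushiravanR, since I want to analyze what is missing, I want to use the "if" expression from the post above to recode every value as its type of missing data. For d138, I want to distinguish blanks/NA from actual values. But when I apply the function above, the actual values show up as "NA". (I realized that the column already shows as integer because I had tried converting it from double to integer and did not remove that line from my code before running dput.) Does that make sense?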
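One plausible explanation for this symptom, offered as a sketch rather than a confirmed diagnosis: is.null(x) returns a single FALSE for any vector, and ifelse() takes the shape of its result from the test, so that branch collapses to the classification of the first element only. If the first element is NA, that single NA is then recycled over every remaining row. The vector x below is made-up illustration data, not taken from the real d138 column.

```r
# Hypothetical column: a missing entry first, then real values and a suppressed one.
x <- c(NA, 250, 1200, -1)

is.null(x)  # a single FALSE, regardless of the length of x

# ifelse() shapes its result like the test, so this has length 1:
inner <- ifelse(is.null(x), "Blank",
                ifelse(x == "Not Provided", "Not Provided", "Value"))
length(inner)  # 1
inner          # NA, i.e. only the classification of x[1] survives

# The outer calls recycle that single NA over all remaining rows.
result <- ifelse(is.na(x), "NA",
                 ifelse(x == "-1", "Suppressed", inner))
result  # "NA" NA NA "Suppressed": the real values 250 and 1200 come back as NA
```

Under this reading, columns whose first element is a real value would recycle "Value" instead, which could be why only some columns look wrong.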
-
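Unfortunately, I don't follow. Your code specifies that wherever there is an NA value it should return "NA", which is a character string rather than an actual NA value, and that is exactly what your function does.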
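For reference, a minimal sketch of a classifier that returns exactly one label per element, built from vectorized logical indexing instead of nested ifelse() calls and without any is.null() test (which yields a single FALSE for a whole vector). The name classify_missing is hypothetical, and treating an empty string as "Blank" is an assumption, since the thread does not say how blanks are stored.

```r
# Hypothetical helper: one label per element of x.
classify_missing <- function(x) {
  x_chr <- as.character(x)             # compare as text so numeric -1 matches "-1"
  out <- rep("Value", length(x_chr))   # default label for anything not matched below
  known <- !is.na(x_chr)               # guards the comparisons against NA
  out[!known] <- "NA"
  out[known & x_chr == "-1"] <- "Suppressed"
  out[known & x_chr == ""] <- "Blank"  # assumption: "blank" means an empty string
  out[known & x_chr == "Not Provided"] <- "Not Provided"
  out
}

# Made-up inputs for illustration.
classify_missing(c(NA, 250, 1200, -1))
table(classify_missing(c("Not Provided", "", "abc", NA)))
```

Applied per column, for example with lapply(data, function(col) table(classify_missing(col))), this would give a count of each category for every column of the data frame.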
Tags: r