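[Question title]: How can I extract unique data from Excel to a table in R?
[Posted]: 2020-12-15 16:38:47
[Question description]:

I have a dataset in an Excel worksheet with district name and course code as columns. The data is at the student enrollment level, which is why each district appears multiple times. It looks like this: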

District Name	Course Code
District 1	Course A
District 1	Course A
District 2	Course A
District 2	Course B
District 2	Course B
District 2	Course D
District 3	Course B
District 3	Course C

I am looking for the districts that do not have Course A, B, or C.

I would like a table that returns something like this:

District Name	Course A	Course B	Course C
District 1	yes	no	no
District 2	yes	yes	no
District 3	no	yes	yes

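Any guidance would be greatly appreciated!

[Comments]:

The expected result is unclear. Can you show the expected output?
I'm not sure I understand your question. However, I would like a table that shows which districts do or do not have Course A, B, C, or D. Ideally the table would return "yes" when the district offers the course and "no" when it does not.
In the example input, I'm guessing District 1 would not be selected?
Read your Excel sheet with the readxl package. Then use the filter function from dplyr to filter on all the courses, selecting only the "no" values in A, B, C, and D.
Your desired output (4 columns) differs from the statement "extract the districts that do not have the courses...". If you need to know how to reshape a table from "long" to "wide" format, there are many questions about that on SO. If you need to know how to filter based on the presence of multiple rows within a group, the combination of dplyr's group_by, filter, and any/all is what you need.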

Tags: r excel dataframe if-statement unique

[Solution 1]:

Update

Perhaps you just need to go from "long" to "wide"?

dat <- structure(list(District.Name = c("District 1", "District 1", "District 2", "District 2", "District 2", "District 2", "District 3", "District 3"), Course.Code = c("Course A", "Course A", "Course A", "Course B", "Course B", "Course D", "Course B", "Course C")), class = "data.frame", row.names = c(NA, -8L))
library(dplyr)
dat %>%
  mutate(a = TRUE) %>%
  tidyr::pivot_wider(names_from="Course.Code", values_from="a", values_fill=FALSE, values_fn=any)
# # A tibble: 3 x 5
#   District.Name `Course A` `Course B` `Course D` `Course C`
#   <chr>         <lgl>      <lgl>      <lgl>      <lgl>     
# 1 District 1    TRUE       FALSE      FALSE      FALSE     
# 2 District 2    TRUE       TRUE       TRUE       FALSE     
# 3 District 3    FALSE      TRUE       FALSE      TRUE

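The temporary column a is really just a placeholder: it is needed for the pivot, it is transient, and it has no lasting effect on the result.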
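
The question asks for "yes"/"no" values and only Courses A, B, and C, so a variation on the pivot above may be closer to the desired table. This is a minimal sketch rather than part of the original answer: the `wanted` vector and the `offered` placeholder column are names assumed here for illustration.

```r
library(dplyr)
library(tidyr)

# Same long-format sample data as in the question
dat <- data.frame(
  District.Name = c("District 1", "District 1", "District 2", "District 2",
                    "District 2", "District 2", "District 3", "District 3"),
  Course.Code = c("Course A", "Course A", "Course A", "Course B",
                  "Course B", "Course D", "Course B", "Course C"),
  stringsAsFactors = FALSE
)

# Assumption: these are the only courses of interest
wanted <- c("Course A", "Course B", "Course C")

wide <- dat %>%
  filter(Course.Code %in% wanted) %>%  # drop courses we do not report on
  distinct() %>%                       # one row per district/course pair
  mutate(offered = "yes") %>%          # placeholder value to spread
  pivot_wider(names_from = Course.Code,
              values_from = offered,
              values_fill = "no")      # missing pairs become "no"

wide
```

One caveat with this sketch: a district that offers none of the wanted courses is removed by the `filter()` step and so would not appear in the result at all. That does not happen with the sample data, but it may matter for the real file.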

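Previous answer

I think it is best to split the logic into two steps: (1) read from the Excel file; (2) filter the rows.

The first is straightforward with either the readxl or the openxlsx package. I'll show both here, but I will use dat2 for the rest of the answer.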

dat1 <- openxlsx::readWorkbook("~/StackOverflow/14617210/Book1.xlsx", sheet = 1)
dat2 <- readxl::read_excel("~/StackOverflow/14617210/Book1.xlsx")
dat2
# # A tibble: 3 x 5
#   `district name` `course A` `course B` `course C` `course D`
#   <chr>           <chr>      <chr>      <chr>      <chr>     
# 1 District 1      no         yes        yes        no        
# 2 District 2      no         no         no         no        
# 3 District 3      yes        yes        yes        yes

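Notes:

The column names of dat1 are modified regardless of check.names=; there is an open issue at https://github.com/ycphs/openxlsx/issues/33, migrated from https://github.com/awalker89/openxlsx/issues/102. So while the file is easy to read with either package, if you prefer the openxlsx import you will need to adapt the rest of this answer, which is written for dat2.
Because the column names in dat2 contain spaces, we need to wrap the names in backticks (e.g., `course A`); if the actual column names have no spaces, or if you go with dat1, where the names are "fixed" by replacing spaces with dots (e.g., course.A), then the backticks are not needed.
I'm assuming the course columns are contiguous, so that a "sequence" of columns (e.g., `course A`:`course D`) makes sense. Column selection can also be done in other ways, including grep(..., value=TRUE).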
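
As an illustration of the last note, the course columns can be picked by name pattern instead of by position. This is a minimal base R sketch that rebuilds the sample data in code rather than reading Book1.xlsx; the pattern "^course " is an assumption about how the real columns are named.

```r
# Sample data matching the CSV in this answer (built inline, not read from Excel)
dat2 <- data.frame(
  `district name` = c("District 1", "District 2", "District 3"),
  `course A` = c("no", "no", "yes"),
  `course B` = c("yes", "no", "yes"),
  `course C` = c("yes", "no", "yes"),
  `course D` = c("no", "no", "yes"),
  check.names = FALSE,       # keep the spaces in the column names
  stringsAsFactors = FALSE
)

# Select every column whose name starts with "course " (assumed naming pattern)
course_cols <- grep("^course ", names(dat2), value = TRUE)

# Districts where none of the selected columns is "yes"
none_offered <- dat2[rowSums(dat2[course_cols] == "yes") == 0, ]
none_offered
```

Selecting by pattern avoids relying on the course columns being adjacent, at the cost of depending on a consistent naming scheme.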

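Your second task (filtering the data) is straightforward in any dialect of R. It's not clear to me exactly how your filtering logic should work, so I'll demonstrate a few variants.

Show the districts that have none of the courses (all are "no"):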

# base R
dat2[rowSums(subset(dat2, select = `course A`:`course D`) == "yes") == 0, ]
# # A tibble: 1 x 5
#   `district name` `course A` `course B` `course C` `course D`
#   <chr>           <chr>      <chr>      <chr>      <chr>     
# 1 District 2      no         no         no         no        

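# dplyr (if_all() needs dplyr >= 1.0.4; calling across() directly in filter() is deprecated in newer releases)
library(dplyr)
dat2 %>%
  filter(if_all(`course A`:`course D`, ~ . == "no"))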

Show the districts with at least one "no":

# base R
dat2[rowSums(subset(dat2, select = `course A`:`course D`) == "no") > 0, ]
# # A tibble: 2 x 5
#   `district name` `course A` `course B` `course C` `course D`
#   <chr>           <chr>      <chr>      <chr>      <chr>     
# 1 District 1      no         yes        yes        no        
# 2 District 2      no         no         no         no        

# dplyr
dat2 %>%
  filter(rowSums(across(`course A`:`course D`, ~ . == "no")) > 0)

Sample data, shown here as CSV; paste it into a spreadsheet and save it as Book1.xlsx:

district name,course A,course B,course C,course D
District 1,no,yes,yes,no
District 2,no,no,no,no
District 3,yes,yes,yes,yes

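[Discussion]:

Thank you so much! However, I think I should have explained the format of my current dataset better. I believe I can use a similar approach, right? Given that a district name may appear more than once because the data is at the student enrollment level, do I need to apply the unique or distinct function? Thanks again!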
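
On the follow-up about repeated districts, one way to check is to count district/course pairs and test whether each count is above zero; repeated enrollment rows then only raise the count and do not change the yes/no answer. The base R sketch below is an illustration under that reading of the question, not the answerer's own code, and the names `offered`, `result`, and `same_after_unique` are made up for the example.

```r
# Long-format sample data from the question (one row per student enrollment)
dat <- data.frame(
  District.Name = c("District 1", "District 1", "District 2", "District 2",
                    "District 2", "District 2", "District 3", "District 3"),
  Course.Code = c("Course A", "Course A", "Course A", "Course B",
                  "Course B", "Course D", "Course B", "Course C"),
  stringsAsFactors = FALSE
)

# Count rows per district/course pair, then reduce the counts to presence
counts <- unclass(table(dat$District.Name, dat$Course.Code))
offered <- counts > 0

# Turn the logical matrix into "yes"/"no" labels (dimnames are kept)
result <- ifelse(offered, "yes", "no")
result

# Removing duplicate rows first gives the same presence matrix
dedup <- unique(dat)
same_after_unique <- all(
  offered == (unclass(table(dedup$District.Name, dedup$Course.Code)) > 0)
)
```

Under these assumptions, calling unique() or distinct() beforehand is harmless but not required, because presence does not depend on how many times a pair occurs.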