Skip rows while reading multiple Excel worksheets in R: answers

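【Question title】: Skip rows while reading multiple Excel worksheets in R
【Posted】: 2018-08-14 02:25:05
【Question description】:

I am using the readxl library to read multiple Excel worksheets from the same Excel workbook (called data.xlsx), which are laid out as follows:

The data starts at row 3.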

  row1
  row2
 companyName   1980    1981    1982 ... 2016
 company1       5       6       7        8
 company2       10      20      30       40
 company3       20      40      60       80
 ....

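The size of the data range varies from sheet to sheet, in both rows and columns. However, all sheets share companyName as a common key. The years run from either 1980 or 1990 through 2016. Each worksheet name is the name of the data it holds.

I want to create a single Excel file that pulls together all the data from all the worksheets.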

 companyName   Year   dataname     values
 company1      1980   sheetname1     5
 company1      1981   sheetname1     6
 company1      1982   sheetname1     7
 company1      ...    sheetname1     ...
 company1      2016   sheetname1     8
 company2      1980   sheetname1     10
 company2      1981   sheetname1     20
 company2      1982   sheetname1     30
 company2      ...    sheetname1     ...
 company2      2016   sheetname1     40
 ....          ....     ...           ...
 company1      2000    sheetname2     xxx
 company1      2001    sheetname2     yyy
  etc
  etc
  etc

This is as far as I have managed to get:

  library(tidyverse)
  library(readxl)
  library(data.table)

  # read the Excel file (approach adapted from the source cited below)
  file.list <- "data.xlsx"

  # read all sheets (and skip the first two rows)
  df.list <- lapply(file.list, function(x) {
    sheets <- excel_sheets(x)
    dfs <- lapply(sheets, function(y) {
      read_excel(x, sheet = y, skip = 2)
    })
    names(dfs) <- sheets
    dfs
  })

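I have the following problems:

The first two rows are not being skipped.
How do I create a data frame that contains only selected worksheets (i.e. worksheet 5, worksheet 10 and worksheet 15)?

Thank you for your help.

Source: R: reading multiple excel files, extract first sheet names, and create new column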

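【Question comments】:

What version of the readxl package do you have? I have no problem skipping rows. Unless, of course, not all the sheets in your file start with the same number of rows before the header.
Hi - I am using version 1.0.0 of readxl. Yes, some sheets need to be excluded; how can I do that?
Do you mean the number of rows to skip differs between sheets?

Tags: r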

【Solution 1】:

I just removed one level of nesting from df.list.

df.list <- lapply(file.list, function(x) {
  sheets <- excel_sheets(x)
  dfs <- lapply(sheets, function(y) {
    read_excel(x, sheet = y, skip = 2)
  })
  names(dfs) <- sheets
  dfs
})[[1]]

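This works for me. I could not reproduce your problem with skip. Also, if those rows are simply blank, read_excel() already skips leading empty rows by default, so skip acts as a lower bound (trim_ws = TRUE, by contrast, only trims whitespace inside cells).

I am using the following list purely to demonstrate what to do after the import.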

df.list <- structure(list(sheetname1 = structure(list(companyName = c("company1", 
"company2", "company3"), `1980` = c(5, 10, 40), `1981` = c(6, 
20, 50), `1982` = c(7, 30, 60)), .Names = c("companyName", "1980", 
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame")), sheetname2 = structure(list(companyName = c("company1", 
"company2", "company3"), `1980` = c(6, 11, 42), `1981` = c(7, 
21, 52), `1982` = c(8, 31, 62)), .Names = c("companyName", "1980", 
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame")), sheetname3 = structure(list(companyName = c("company1", 
"company2", "company3"), `1990` = c(8, 12, 43), `1991` = c(9, 
22, 53), `1992` = c(10, 32, 63)), .Names = c("companyName", "1990", 
"1991", "1992"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"))), .Names = c("sheetname1", "sheetname2", 
"sheetname3"))

The following works whether the years start at 1980 or at 1990.

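dat <- lapply(df.list, function(x) {
  # reshape one sheet from wide (one column per year) to long format;
  # every column except companyName is treated as a year
  x %>% gather(year, values, -companyName)
}) %>% enframe() %>% unnest(value)  # enframe() stores the data frames in `value`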

dat

# # A tibble: 27 x 4
#    name       companyName year  values
#    <chr>      <chr>       <chr>  <dbl>
#  1 sheetname1 company1    1980      5.
#  2 sheetname1 company2    1980     10.
#  3 sheetname1 company3    1980     40.
#  4 sheetname1 company1    1981      6.
#  5 sheetname1 company2    1981     20.
#  6 sheetname1 company3    1981     50.
#  7 sheetname1 company1    1982      7.
#  8 sheetname1 company2    1982     30.
#  9 sheetname1 company3    1982     60.
# 10 sheetname2 company1    1980      6.
# # ... with 17 more rows
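
In newer versions of tidyr (1.0.0 and later), gather() is superseded by pivot_longer(). As a rough sketch of the same reshaping step, the snippet below uses a small made-up list (`sheets`, with illustrative values only, not the real workbook) in place of the imported df.list. The column names `dataname`, `year` and `values` are chosen to mirror the output layout requested in the question.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the list of imported sheets (values are illustrative only)
sheets <- list(
  sheetname1 = tibble(companyName = c("company1", "company2"),
                      `1980` = c(5, 10), `1981` = c(6, 20)),
  sheetname3 = tibble(companyName = c("company1", "company2"),
                      `1990` = c(8, 12), `1991` = c(9, 22))
)

# Reshape every sheet to long format, whatever its year columns are
long_list <- lapply(sheets, function(x) {
  pivot_longer(x, cols = -companyName,
               names_to = "year", values_to = "values")
})

# Stack the sheets; the list names become the `dataname` column
long <- bind_rows(long_list, .id = "dataname") %>%
  mutate(year = as.integer(year))  # years arrive as column names, i.e. text

long
```

Converting `year` to integer is optional, but it makes later filtering or sorting by year behave numerically rather than alphabetically.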

You can now pick out a specific sheet name by using dplyr::filter().

For example:

dat %>% filter(name == "sheetname1")

#   name       companyName year  values
#   <chr>      <chr>       <chr>  <dbl>
# 1 sheetname1 company1    1980      5.
# 2 sheetname1 company2    1980     10.
# 3 sheetname1 company3    1980     40.
# 4 sheetname1 company1    1981      6.
# 5 sheetname1 company2    1981     20.
# 6 sheetname1 company3    1981     50.
# 7 sheetname1 company1    1982      7.
# 8 sheetname1 company2    1982     30.
# 9 sheetname1 company3    1982     60.
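
Filtering as above selects sheets after everything has been imported. If only a few sheets are needed (such as sheets 5, 10 and 15 in the question), another option is to subset the sheet names before reading. The following is only a sketch: `read_selected_sheets()` is a hypothetical helper, not part of readxl, and the demonstration runs on the example workbook bundled with readxl rather than on data.xlsx.

```r
library(readxl)

# Hypothetical helper: read only the chosen sheets of one workbook.
# `which` may hold sheet positions (e.g. c(5, 10, 15)) or sheet names.
read_selected_sheets <- function(path, which, skip = 2) {
  all_sheets <- excel_sheets(path)
  wanted <- if (is.numeric(which)) all_sheets[which] else which
  dfs <- lapply(wanted, function(s) {
    read_excel(path, sheet = s, skip = skip)
  })
  names(dfs) <- wanted
  dfs
}

# Demonstration on the example workbook shipped with readxl
# (its sheets have no extra rows above the header, hence skip = 0)
demo_path <- readxl_example("datasets.xlsx")
demo <- read_selected_sheets(demo_path, which = c(1, 3), skip = 0)
names(demo)

# For the workbook in the question, the call would look like:
# df.list <- read_selected_sheets("data.xlsx", which = c(5, 10, 15))
```

The resulting named list has the same shape as df.list, so the reshaping step shown earlier applies to it unchanged.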

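【Comments】:

【Solution 2】:

I would recommend the openxlsx package, which lets you specify startRow, together with melt from the reshape2 package, which makes it easy to convert a data frame to long format.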

library(openxlsx)
library(reshape2)

first.Row <- 6 # supposing the data starts at row 6
sheets.2.read <- loadWorkbook(file.list)$sheet_names # retrieving the sheet names
df <- data.frame()
for(tmp.sheet in sheets.2.read){
  tmp.dat <- read.xlsx(file.list, sheet = tmp.sheet, startRow = first.Row, colNames = TRUE)
  tmp.dat <- cbind(melt(tmp.dat, id.vars = "companyName"), tmp.sheet)
  df <- rbind(df, tmp.dat)
}

Here is my output, using some dummy data (only 10 rows printed):

> df[c(1:3, 50:53, 300:302),]
    company.name variable     value tmp.sheet
1          comp7     1968 0.3359298    Sheet1
2          comp8     1968 0.3359298    Sheet1
3          comp9     1968 0.3359298    Sheet1
50        comp16     1966 0.3359298    Sheet2
51        comp17     1966 0.3359298    Sheet2
52        comp18     1966 0.3359298    Sheet2
53        comp19     1966 0.3359298    Sheet2
300       comp16     2000 0.3359298    Sheet3
301       comp17     2000 0.3359298    Sheet3
302       comp18     2000 0.3359298    Sheet3

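【Comments】:

Thanks. If I only want sheets 5, 10 and 15, how should I update the code?
@hpesoj626 - Thanks. No change with the skipping. But I am not sure how to read only sheets 5, 10 and 15 from the Excel workbook.
You can add the line sheets.2.read <- sheets.2.read[c(5,10,15)] after sheets.2.read <- loadWorkbook(...), or define it directly as sheets.2.read <- c("sheet5", "sheet10", "sheet15")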
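
To make the suggestion in the last comment concrete, here is a self-contained sketch under several assumptions: it builds a small temporary workbook (three made-up sheets `s1`, `s2` and `s3`, with two filler rows above the header) instead of using data.xlsx, selects sheets by position, and collects the pieces in a list before a single rbind instead of growing `df` inside the loop. The name `dataname` for the sheet column is likewise only illustrative.

```r
library(openxlsx)
library(reshape2)

# Build a throwaway workbook: filler text in rows 1-2, header on row 3
toy <- data.frame(companyName = c("company1", "company2"),
                  `1980` = c(5, 10), `1981` = c(6, 20),
                  check.names = FALSE)
path <- tempfile(fileext = ".xlsx")
wb <- createWorkbook()
for (s in c("s1", "s2", "s3")) {
  addWorksheet(wb, s)
  writeData(wb, s, c("row1", "row2"))  # two non-data rows in column A
  writeData(wb, s, toy, startRow = 3)  # header and data start at row 3
}
saveWorkbook(wb, path, overwrite = TRUE)

first.Row <- 3
sheets.2.read <- getSheetNames(path)
sheets.2.read <- sheets.2.read[c(1, 3)]  # keep only the sheets of interest

# Read and melt each selected sheet, then bind everything once
pieces <- lapply(sheets.2.read, function(tmp.sheet) {
  tmp.dat <- read.xlsx(path, sheet = tmp.sheet,
                       startRow = first.Row, colNames = TRUE)
  tmp.dat <- melt(tmp.dat, id.vars = "companyName")
  tmp.dat$dataname <- tmp.sheet
  tmp.dat
})
df <- do.call(rbind, pieces)
df
```

For the workbook in the question, the index vector would be c(5, 10, 15) and `path` would point at data.xlsx.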