【Question Title】: tidyr: pivot_wider replaces values with data type
【Posted】: 2020-01-26 03:25:34
【Question】:

I have a data frame that stores variables in both its rows and its columns, so I am trying to tidy it with pivot_wider. My data looks like this:

head(df)
# A tibble: 6 x 4
  State    Year Var                                                           X
  <chr>   <dbl> <chr>                                                     <dbl>
1 ALABAMA  2001 APPALACHIAN REGIONAL COMMISSION (ARC)                   3048031
2 ALABAMA  2001 CORPORATION FOR NATIONAL AND COMMUNITY SERVICE (CNCS)   1765835
3 ALABAMA  2001 DEPARTMENT OF AGRICULTURE (USDA)                      282530429
4 ALABAMA  2001 DEPARTMENT OF COMMERCE (DOC)                           17838084
5 ALABAMA  2001 DEPARTMENT OF DEFENSE (DOD)                            21160159
6 ALABAMA  2001 DEPARTMENT OF EDUCATION (ED)                          174634348

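State is the entity, Year is the time dimension, Var is the list of variables I am trying to pivot into columns, and X holds the value of each variable. When I run the following code: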

library(tidyverse)
library(magrittr)  # provides %<>%; library(tidyverse) does not attach it

df %<>% 
  pivot_wider(names_from = Var, values_from = X)

R returns a warning message stating:

Warning message:
Values in `X` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list(X = list)` to suppress this warning.
* Use `values_fn = list(X = length)` to identify where the duplicates arise
* Use `values_fn = list(X = summary_fun)` to summarise duplicates 

and every value in my data is replaced by its data type, as shown below.

head(df)
# A tibble: 6 x 35
  State  Year `APPALACHIAN RE~ `CORPORATION FO~ `DEPARTMENT OF ~ `DEPARTMENT OF ~ `DEPARTMENT OF ~ `DEPARTMENT OF ~ `DEPARTMENT OF ~ `DEPARTMENT OF ~
  <chr> <dbl>      <list<dbl>>      <list<dbl>>      <list<dbl>>      <list<dbl>>      <list<dbl>>      <list<dbl>>      <list<dbl>>      <list<dbl>>
1 ALAB~  2001              [1]              [1]              [1]              [1]              [1]              [1]              [1]              [1]
2 ALAS~  2001              [0]              [1]              [1]              [1]              [1]              [1]              [1]              [1]
3 ARIZ~  2001              [0]              [1]              [1]              [1]              [1]              [1]              [1]              [1]
4 ARKA~  2001              [0]              [1]              [1]              [1]              [1]              [1]              [1]              [1]
5 CALI~  2001              [0]              [1]              [1]              [1]              [1]              [1]              [1]              [1]
6 COLO~  2001              [0]              [1]              [1]              [1]              [1]              [1]              [1]              [1]
# ... with 25 more variables: `DEPARTMENT OF HOUSING AND URBAN DEVELOPMENT (HUD)` <list<dbl>>, `DEPARTMENT OF JUSTICE (DOJ)` <list<dbl>>, `DEPARTMENT OF
#   LABOR (DOL)` <list<dbl>>, `DEPARTMENT OF THE INTERIOR (DOI)` <list<dbl>>, `DEPARTMENT OF TRANSPORTATION (DOT)` <list<dbl>>, `ENVIRONMENTAL PROTECTION
#   AGENCY (EPA)` <list<dbl>>, `FEDERAL EMERGENCY MANAGEMENT AGENCY (FEMA)` <list<dbl>>, `INSTITUTE OF MUSEUM AND LIBRARY SERVICES (IMLS)` <list<dbl>>,
#   `NATIONAL AERONAUTICS AND SPACE ADMINISTRATION (NASA)` <list<dbl>>, `NATIONAL ENDOWMENT FOR THE ARTS (NEA)` <list<dbl>>, `NATIONAL ENDOWMENT FOR THE
#   HUMANITIES (NEH)` <list<dbl>>, `NATIONAL SCIENCE FOUNDATION (NSF)` <list<dbl>>, `SMALL BUSINESS ADMINISTRATION (SBA)` <list<dbl>>, `FEDERAL MEDIATION
#   AND CONCILIATION SERVICE (FMCS)` <list<dbl>>, `NATIONAL ARCHIVES AND RECORDS ADMINISTRATION (NARA)` <list<dbl>>, `AGENCY FOR INTERNATIONAL DEVELOPMENT
#   (USAID)` <list<dbl>>, `JAPAN-UNITED STATES FRIENDSHIP COMMISSION (JUSFC)` <list<dbl>>, `UNITED STATES INSTITUTE OF PEACE (USIP)` <list<dbl>>, `CORPS OF
#   ENGINEERS - CIVIL WORKS (USACE)` <list<dbl>>, `DEPARTMENT OF STATE (DOS)` <list<dbl>>, `NATIONAL LABOR RELATIONS BOARD (NLRB)` <list<dbl>>, `NUCLEAR
#   REGULATORY COMMISSION (NRC)` <list<dbl>>, `SOCIAL SECURITY ADMINISTRATION (SSA)` <list<dbl>>, `SELECTIVE SERVICE SYSTEM (SSS)` <list<dbl>>,
#   `NA` <list<dbl>>

I would like to know why the original values are dropped by the pivot, and what I can do to prevent it.

【Comments】:

    Tags: r tidyverse tidyr


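    【Solution 1】:

    I came here because the result of pivot_wider() was very different from what I expected (it produced NULLs and lists instead of plain numbers).

    In my case this was simply because I had duplicate rows, which are easy to remove: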

    df %>% distinct(x, y, .keep_all = TRUE)
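Before dropping rows it can help to see which key combinations are actually duplicated. The following is a minimal sketch on a hypothetical toy tibble (`toy`, with made-up values and shortened `Var` names), assuming the columns are named as in the question; it is not the asker's real data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the ALABAMA / 2001 / ARC row appears twice
toy <- tibble(
  State = c("ALABAMA", "ALABAMA", "ALABAMA", "ALASKA"),
  Year  = c(2001, 2001, 2001, 2001),
  Var   = c("ARC", "ARC", "DOC", "DOC"),
  X     = c(10, 10, 20, 30)
)

# List the State/Year/Var combinations that occur more than once
dupes <- toy %>%
  count(State, Year, Var) %>%
  filter(n > 1)

# Keep one row per key, then pivot; each cell now holds a single number
wide <- toy %>%
  distinct(State, Year, Var, .keep_all = TRUE) %>%
  pivot_wider(names_from = Var, values_from = X)
```

Note that distinct() with .keep_all = TRUE keeps the first row of each key, so it is only a safe fix when the duplicated rows carry the same X value.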
    


    【Discussion】:

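      【Solution 2】:

      We may need a sequence column because there are duplicates. Group by 'State', 'Year', and 'Var', create a sequence column with row_number(), and then apply pivot_wider. The sequence column becomes part of the row identifier, so each duplicate lands in its own row instead of sharing a cell.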

      library(dplyr)
      library(tidyr)
      df %>% 
        group_by(State, Year, Var) %>%
        mutate(rn = row_number()) %>%
        pivot_wider(names_from = Var, values_from = X)
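To see what the sequence column does, here is a small self-contained sketch on a hypothetical tibble (`toy`, made-up values). The explicit ungroup() is an addition for tidiness and is not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two different X values share the key ALABAMA / 2001 / ARC
toy <- tibble(
  State = c("ALABAMA", "ALABAMA", "ALABAMA"),
  Year  = c(2001, 2001, 2001),
  Var   = c("ARC", "ARC", "DOC"),
  X     = c(10, 15, 20)
)

wide_rn <- toy %>%
  group_by(State, Year, Var) %>%
  mutate(rn = row_number()) %>%   # 1, 2, ... within each State/Year/Var
  ungroup() %>%
  pivot_wider(names_from = Var, values_from = X)
```

Because rn is now one of the identifying columns, the result has one row per (State, Year, rn): both ARC values survive as plain numbers, and DOC is NA in the row that has no second DOC value.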
      

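      【Discussion】:

      • Why do duplicates force the output into a list?
      • @JasonHunter Because duplicates share the same identifier, their values get collapsed into a single cell, and a list is what can hold multiple elements there. With spread (the older function that pivot_wider supersedes), the same data would result in an error.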
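The warning in the question also suggests values_fn, which controls what happens when several values fall into one cell. A minimal sketch on a hypothetical tibble (`toy`, made-up values), using the named-list form shown in the warning; choosing sum as the summary is only an assumption for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: two X values share the key ALABAMA / 2001 / ARC
toy <- tibble(
  State = c("ALABAMA", "ALABAMA", "ALABAMA"),
  Year  = c(2001, 2001, 2001),
  Var   = c("ARC", "ARC", "DOC"),
  X     = c(10, 15, 20)
)

# Count how many values land in each cell; anything above 1 marks a duplicate
counts <- pivot_wider(toy, names_from = Var, values_from = X,
                      values_fn = list(X = length))

# Collapse duplicates to one number per cell (sum is an illustrative choice)
totals <- pivot_wider(toy, names_from = Var, values_from = X,
                      values_fn = list(X = sum))
```

Here counts reports 2 for ARC and 1 for DOC, and totals contains ordinary numeric columns rather than list-columns.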