Difficulty converting wide format to tidy format in dataset (answers)

【Question title】: Difficulty converting wide format to tidy format in dataset
【Posted】: 2018-07-18 16:13:37
【Question】:

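I am working with Kaggle's gun violence dataset. My goal is to build an interactive Tableau visualization of some of the regions and details related to gun crime there. To get there, I want to turn this data frame into a tidy format. Link: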

https://www.kaggle.com/jameslko/gun-violence-data/version/1

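There are a few columns, formatted as shown below, that I am having trouble with in R. The dataset has around 20 columns, and these 4 are formatted this way: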

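Some background: a crime can involve more than one gun and more than one participant. These columns therefore hold the information for each gun/participant, separated by "||". The "0::", "1::", ... prefixes identify the individual gun/participant a detail belongs to.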

My goal is to capture the unique instances in each column and ignore the "0::", "1::", "2::", ... prefixes.

Here is my code so far:

# stringsAsFactors = FALSE keeps the delimited columns as character vectors
df <- read.csv("C:/Users/rmahesh/Desktop/gun-violence-data_01-2013_03-2018.csv",
               stringsAsFactors = FALSE)

# Drop columns that will not be used
df$incident_id <- NULL
df$incident_url <- NULL
df$source_url <- NULL
df$participant_name <- NULL
df$participant_relationship <- NULL
df$sources <- NULL
df$incident_url_fields_missing <- NULL
df$participant_status <- NULL
df$participant_age_group <- NULL
df$participant_type <- NULL
df$incident_characteristics <- NULL

# Subset of columns with formatting issues:
df2 <- df[, c('gun_stolen', 'gun_type', 'participant_age', 'participant_gender')]

I have not run into a problem like this before, and any help would be greatly appreciated!

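Edit 1: I have created the first 3 rows of the relevant columns. The format is more or less the same throughout, although some columns are sometimes missing: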

gun_stolen,gun_type,participant_age,participant_gender
0::Unknown||1::Unknown,0::Unknown||1::Unknown,0::25||1::31||2::33||3::34||4::33,0::Male||1::Male||2::Male||3::Male||4::Male
0::Unknown||1::Unknown,0::22 LR||1::223 Rem [AR-15],0::51||1::40||2::9||3::5||4::2||5::15,0::Male||1::Female||2::Male||3::Female||4::Female||5::Male
0::Unknown,0::Shotgun,3::78||4::48,0::Male||1::Male||2::Male||3::Male||4::Male

【Question comments】:

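Please post a sample of the data, rather than screenshots and a link that makes us download the whole thing. Also, you have tagged this tidyverse but are not using any tidyverse functions. What is the plan there?
Also, from the [tidyverse] tag wiki: "Do not use if your question relates to one or two components of the tidyverse, such as dplyr or ggplot2. Use those tags, and tag with r for a better response." I am editing accordingly.
@camille Thanks for getting back to me. I have removed tidyverse. How should I go about posting a data sample?
In addition to the data sample, you should also show the corresponding output you expect/want. (I know what tidy data means to me, but interpretations differ, and this is generally an important thing to include in a question. Some general guidance on asking R questions: stackoverflow.com/questions/5963269/…)
@Frank Thanks for the reply. As I said, the output would be all the unique instances appearing in a row. Conceptually, I am having a hard time figuring out how to do this. As for the data, how should I send it?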

Tags: r dplyr tidyr

【Solution 1】:

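As Frank said in the comments, "tidy" can mean different things. Here we convert all the specified columns into just two columns: one holding the original column name ("key"), and one holding the individual values obtained by splitting the strings and removing the prefixes, one value per row ("value").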

library(tidyr)
library(dplyr)
library(stringr)

myvars <- c('gun_stolen', 'gun_type', 'participant_age', 'participant_gender')

res <- as_tibble(df2) %>% 
  tibble::rowid_to_column() %>%
  # Split strings in selected columns at "||". This turns those columns into
  # list-columns of character vectors
  mutate_at(myvars, str_split, pattern = fixed("||")) %>% 
  # Go from wide to long format: in the new 'key' column are the original column 
  # names, and 'value' is the one list-column of character vectors
  gather(key, value, one_of(myvars)) %>% 
  # unnest turns the 'value' list-column into a regular character column, with 
  # duplication of rows that contain a 'value' of length greater than 1
  unnest(value) %>% 
  filter(value != "") %>% 
  # Remove the "x::" prefixes
  mutate(value = str_split_fixed(value, fixed("::"), n = 2)[, 2]) %>% 
  # Deduplicate
  distinct() %>% 
  arrange(rowid, key, value)

# # A tibble: 732,017 x 3
#    rowid key                value  
#    <int> <chr>              <chr>  
#  1     1 participant_age    20     
#  2     1 participant_gender Female 
#  3     1 participant_gender Male   
#  4     2 participant_age    20     
#  5     2 participant_gender Male   
#  6     3 gun_stolen         Unknown
#  7     3 gun_type           Unknown
#  8     3 participant_age    25     
#  9     3 participant_age    31     
# 10     3 participant_age    33     
# # ... with 732,007 more rows
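
For readers on a recent tidyr, the same reshaping can be sketched with pivot_longer() and separate_rows() in place of gather() and the list-column step. This is a minimal sketch under assumptions, not the code that produced the output above: sample_df is a hand-typed copy of the three sample rows from the question, and the pattern "^\\d+::" assumes every prefix is a run of digits followed by "::".

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hand-typed copy of the three sample rows from the question (assumption:
# the real file has the same layout in these four columns)
sample_df <- tibble::tibble(
  gun_stolen = c("0::Unknown||1::Unknown", "0::Unknown||1::Unknown", "0::Unknown"),
  gun_type = c("0::Unknown||1::Unknown", "0::22 LR||1::223 Rem [AR-15]", "0::Shotgun"),
  participant_age = c("0::25||1::31||2::33||3::34||4::33",
                      "0::51||1::40||2::9||3::5||4::2||5::15",
                      "3::78||4::48"),
  participant_gender = c("0::Male||1::Male||2::Male||3::Male||4::Male",
                         "0::Male||1::Female||2::Male||3::Female||4::Female||5::Male",
                         "0::Male||1::Male||2::Male||3::Male||4::Male")
)

tidy_sample <- sample_df %>%
  # Keep track of which incident each value came from
  mutate(rowid = row_number()) %>%
  # Wide to long: one row per (incident, original column)
  pivot_longer(-rowid, names_to = "key", values_to = "value") %>%
  # One row per "||"-separated piece (sep is a regular expression)
  separate_rows(value, sep = "\\|\\|") %>%
  filter(value != "") %>%
  # Remove the "0::", "1::", ... prefixes
  mutate(value = str_remove(value, "^\\d+::")) %>%
  # Keep each value once per incident and column
  distinct() %>%
  arrange(rowid, key, value)
```

On these rows, the age 33 that appears twice in the first incident is kept only once, which is the "unique instances" behaviour asked for in the question.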

Also, expanding on @Ben G's comment:

res %>% 
  count(key, value) %>% 
  arrange(key, desc(n))

# # A tibble: 141 x 3
#    key             value                n
#    <chr>           <chr>            <int>
#  1 gun_stolen      Unknown         132099
#  2 gun_stolen      Stolen            7350
#  3 gun_stolen      Not-stolen        1560
#  4 gun_stolen      ""                 355
#  5 gun_type        Unknown          98892
#  6 gun_type        Handgun          17609
#  7 gun_type        9mm               6040
#  8 gun_type        Shotgun           3560
#  9 gun_type        Rifle             3196
# 10 gun_type        22 LR             3093
# 11 gun_type        40 SW             2624
# 12 gun_type        380 Auto          2323
# 13 gun_type        45 Auto           2234
# 14 gun_type        38 Spl            1758
# 15 gun_type        223 Rem [AR-15]   1248
# 16 gun_type        12 gauge           975
# 17 gun_type        Other              892
# 18 gun_type        7.62 [AK-47]       854
# 19 gun_type        357 Mag            800
# 20 gun_type        25 Auto            601
# 21 gun_type        32 Auto            481
# 22 gun_type        ""                 356
# 23 gun_type        20 gauge           194
# 24 gun_type        44 Mag             192
# 25 gun_type        30-30 Win          105
# 26 gun_type        410 gauge           96
# 27 gun_type        308 Win             88
# 28 gun_type        30-06 Spr           71
# 29 gun_type        10mm                50
# 30 gun_type        16 gauge            30
# 31 gun_type        300 Win             23
# 32 gun_type        28 gauge             6
# 33 participant_age 19               10541
# 34 participant_age 20                9919
# 35 participant_age 18                9826
# 36 participant_age 21                9795
# 37 participant_age 22                9642
# 38 participant_age 23                9383
# 39 participant_age 24                9204
# 40 participant_age 25                8562
# 41 participant_age 26                7815
# 42 participant_age 17                7416
# 43 participant_age 27                7228
# 44 participant_age 28                6528
# 45 participant_age 29                6055
# 46 participant_age 30                5652
# 47 participant_age 31                5145
# 48 participant_age 32                5039
# 49 participant_age 16                4977
# 50 participant_age 33                4662
# # ... with 91 more rows

【Comments】:

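In full transparency, I have not used tidyr much, so I am a bit lost as to what your code is doing. For example, suppose gun_type has the following classes: Unknown, Handgun, Shotgun, Assault Rifle. On import, they are all stored in a single row. I combined two columns to create a column called "Number_affected". I want to know how many times each "gun_type" class contributes to "number_affected". Maybe make each type in "gun_type" its own column?
@rmahesh I will add explanations to the code. As for your desired output, I second what Frank said in the comments: we may have different definitions of "tidy", and it would be best to show an example of the desired output (creating one by hand is a bit of extra work, but it would clarify things).
@rmahesh, this is the perfect answer for creating a tidy dataset from your data. The best way to answer your question ("I want to know how many times each 'gun_type' class contributes to 'number_affected'") is then to run summary computations on the tidy dataset. For example, filter(key == "gun_type") %>% group_by(value) %>% summarize(total = n()). If you have questions about how to do that, you can post another question...
@BenG That is more in line with what I was after, thanks!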
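
The summary suggested in the comment above can be tried out on a small stand-in for res. The tibble res_small below is made up for illustration and is not taken from the dataset; only the filter/group_by/summarize chain comes from the comment.

```r
library(dplyr)

# Hypothetical miniature of the tidy result (rowid / key / value)
res_small <- tibble::tibble(
  rowid = c(1, 1, 2, 2, 3),
  key   = c("gun_type", "participant_gender", "gun_type", "gun_type", "gun_type"),
  value = c("Handgun", "Male", "Handgun", "Shotgun", "Unknown")
)

gun_type_totals <- res_small %>%
  # Keep only the rows that came from the gun_type column
  filter(key == "gun_type") %>%
  # Count how many rows each gun type has
  group_by(value) %>%
  summarize(total = n()) %>%
  arrange(desc(total))
```

Because the tidy result was deduplicated per incident, total here counts the incidents in which a gun type appears at least once, not the number of guns.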

【Solution 2】:

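I take "tidying" to mean splitting the contents of the delimited columns and spreading them over rows. You can either keep just the first element, or give each element its own row.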

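# Toy data: one delimited gun_type string per incident (some are empty)
df <- data.frame(instance = 1:5,
                 gun_type = c("", "0::Unknown||1::Unknown", "",
                              "0::Handgun||1::Handgun", ""),
                 stringsAsFactors = FALSE)

# Option 1: keep only the first element of each string (NA where empty)
df$first <- sapply(strsplit(df$gun_type, "\\|\\|"), '[', 1)

# Option 2: one row per element; rows with an empty string are dropped
splitType <- strsplit(df$gun_type, "\\|\\|")
df.2 <- df[rep(1:nrow(df), sapply(splitType, length)), ]
df.2$splitType <- unlist(splitType)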

If you only want the unique values, use:

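# lapply (rather than sapply) always returns a list, even when every
# incident has the same number of unique values
splitTypeUnique <- lapply(splitType, unique)
df.2 <- df[rep(1:nrow(df), lengths(splitTypeUnique)), ]
df.2$splitType <- unlist(splitTypeUnique)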

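But you will have to do some wrangling to get the unique part to work, since the "0::", "1::" prefixes still make otherwise identical values differ.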
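
One way that wrangling might look, as a base R sketch: strip the numeric prefix from each piece before calling unique(). The pattern "^[0-9]+::" and the names stripped, unique_type and df_unique are assumptions introduced here, and the toy data is changed slightly (a Shotgun in the fourth incident) so that the deduplication is visible.

```r
# Toy data, adapted from the example in this answer
df <- data.frame(instance = 1:5,
                 gun_type = c("", "0::Unknown||1::Unknown", "",
                              "0::Handgun||1::Shotgun", ""),
                 stringsAsFactors = FALSE)

# Split each string at the literal "||" (fixed = TRUE avoids regex escaping)
split_type <- strsplit(df$gun_type, "||", fixed = TRUE)

# Drop the leading "0::", "1::", ... index from every piece
stripped <- lapply(split_type, function(x) sub("^[0-9]+::", "", x))

# Now identical values really are identical, so unique() can collapse them
unique_type <- lapply(stripped, unique)

# Repeat each incident once per unique value; empty incidents drop out
df_unique <- df[rep(seq_len(nrow(df)), lengths(unique_type)), "instance", drop = FALSE]
df_unique$gun_type <- unlist(unique_type)
rownames(df_unique) <- NULL
```

In this sketch, incident 2 collapses to a single "Unknown" row, while incident 4 keeps one row each for "Handgun" and "Shotgun".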

【Comments】:

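Yes, that is what I mean. For example, suppose gun_type has the following classes: Unknown, Handgun, Shotgun, Assault Rifle. On import, they are all stored in a single row. I combined two columns to create a column called "Number_affected". I want to know how many times each "gun_type" class contributes to "number_affected". Maybe make each type in "gun_type" its own column?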
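
If one column per gun type is what is wanted, cross-tabulating a long table gives a count per incident and type (0 or 1 once the long table is deduplicated). This is only a guess at the intended output, since the question never shows it. The long data frame below is hypothetical, and Number_affected is left out because its definition is not given.

```r
# Hypothetical long table: one row per (incident, unique gun type)
long <- data.frame(instance = c(2L, 4L, 4L, 5L),
                   gun_type = c("Unknown", "Handgun", "Shotgun", "Handgun"),
                   stringsAsFactors = FALSE)

# Cross-tabulate incidents against gun types
counts <- table(long$instance, long$gun_type)

# Turn the table into a data frame with one column per gun type
wide <- as.data.frame.matrix(counts)
wide <- cbind(instance = as.integer(rownames(wide)), wide)
rownames(wide) <- NULL

# Number of incidents in which each gun type appears
type_totals <- colSums(wide[, -1])
```

The per-type columns in wide could then be joined back to the incident-level data on instance to relate them to any other column.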