Using regular expressions in tidyr::extract: answers

【Question title】: Using regular expressions in tidyr::extract
【Posted】: 2017-11-21 22:15:37
【Question】:

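I am working with 3D motion capture data. This means I have three columns (X, Y, Z) of joint coordinates for each of several joints in the body (for example, the three columns describing the position of the left knee joint center are LKX, LKY, LKZ).

My end goal is to plot at least 9 joint centers, and I believe the only way to achieve this is to convert my wide-format data frame into a long one.

As you can tell, I am trying to convert multiple sets of joint-center columns whose names end in X, Y, or Z. I therefore tried using a regular expression in tidyr::extract, but I cannot get the code right.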

df_wide <- data.frame(
  ID = rep(1:2, each=10),
  JN = rep(1:2, each=5),
  Frame = rep(1:5, 4),
  System = rep(1:2, 10),
  RKX = rep(1:10+rnorm(10,mean=1,sd=0.5),2),
  RKY = rep(1:10+rnorm(10,mean=1,sd=0.5),2),
  RKZ = rep(1:10+rnorm(10,mean=1,sd=0.5), 2),
  LHeX = rep(1:10-rnorm(10,mean=1,sd=0.5),2),
  LHeY = rep(1:10-rnorm(10,mean=1,sd=0.5),2),
  LHeZ = rep(1:10-rnorm(10,mean=1,sd=0.5),2))

head(df_wide, 2)
  ID JN Frame System      RKX      RKY      RKZ        LHeX       LHeY      LHeZ
1  1  1     1      1 1.332827 2.068720 2.295742 -0.02336031 -0.3011227 -1.212326
2  1  1     2      2 3.570076 3.306799 3.136177  2.08828231  1.9226740  2.106496

I would like to get this result:

   ID JN Frame System joint         X         Y         Z
1   1  1     1      1    RK  1.440103  2.221676  1.621871
2   1  1     1      1   LHe  3.537940  3.060948  2.856955

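Here is my latest (of many) attempts. It has two problems: 1) extract produces only NA; 2) spread returns "Error: Duplicate identifiers for rows", which I suspect is related to the problem with extract.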

df_3D <- df_wide %>%
 gather(keys, values, -ID, -JN, -Frame, -System)%>% 
  extract(keys, c("X", "Y", "Z", "joint"), "(X$) (Y$) (Z$) ([A-Z].$)")%>% 
  spread(X, values)

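I have found several good questions and answers about reshaping, but none that specifically address the use of regular expressions.

【Comments】:

@Gregor I think it refers to tidyr::extract; it is easy to confuse what is in dplyr specifically with what is in the broader tidyverse.
Constantly confused. I got tripped up earlier because rlang uses :=, but I think of data.table.
@Marius - Reading your comment I realized I had made a mistake. I always thought extract was part of dplyr, but here I found that it is actually part of tidyr. Post edited.

Tags: r regex tidyr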

【Solution 1】:

Your approach is slightly off. After gathering, every element of the keys column has the structure <Joint><Coord>, so you need something like:

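library(tidyr)  # gather(), extract(), spread(), and the %>% pipe

df_wide %>%
    gather(keys, values, -ID, -JN, -Frame, -System) %>%
    # split each key into the joint name and its trailing coordinate letter
    extract(keys, c("Joint", "Coord"), "(.*)(X|Y|Z)$") %>%
    spread(Coord, values)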

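The regular expression I use here captures anything in the first group (because I don't know all the possible joint names), followed by X, Y, or Z as the last character in the second group. Many other regular expressions would achieve the same result.

Output: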

   ID JN Frame System Joint          X          Y           Z
1   1  1     1      1   LHe  0.1344259 -0.2927277  0.05375166
2   1  1     1      1    RK  1.8083539  2.4053498  2.32899399
3   1  1     2      2   LHe  1.1777492  1.1780538  0.96549849
4   1  1     2      2    RK  3.2254236  2.4100235  2.79816371
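
To make the behavior of that pattern concrete, here is a minimal sketch that applies it to the six coordinate column names from the question using base R's sub() and grepl(), without tidyr. The names keys, pattern_ok, pattern_bad, and parts are introduced here for illustration only. It also checks the pattern from the question, which expects X, Y, and Z, each anchored to the end of the string and separated by literal spaces, all within one name; that can never match, which is presumably why extract returned only NA.

```r
# Column names as they appear in the `keys` column after gather()
keys <- c("RKX", "RKY", "RKZ", "LHeX", "LHeY", "LHeZ")

# Pattern from this answer: greedy joint name, then a final X, Y or Z
pattern_ok <- "(.*)(X|Y|Z)$"
# Pattern from the question: literal spaces plus several end-of-string anchors
pattern_bad <- "(X$) (Y$) (Z$) ([A-Z].$)"

# Pull out each capture group with a backreference
parts <- data.frame(
  keys  = keys,
  Joint = sub(pattern_ok, "\\1", keys),
  Coord = sub(pattern_ok, "\\2", keys),
  stringsAsFactors = FALSE
)
print(parts)

# The answer's pattern matches every name; the question's matches none
print(all(grepl(pattern_ok, keys)))
print(any(grepl(pattern_bad, keys)))
```

Because the first group is greedy, it takes as much of the name as it can while still leaving one final X, Y, or Z for the second group, so joint names of any length are handled.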

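【Comments】:

Thank you very much. It works perfectly. I am always amazed at how much a few regex symbols can achieve. Can you recommend any resources for learning regular expressions better? I often find it challenging to capture specific patterns (such as the one in this example). Until now I have been using this link.
I learned regular expressions by reading this book. There may be faster and easier ways to learn, but I can only speak to what I did.
Thanks - the book's description doesn't specifically mention R. Are regular expressions the same across languages?
@SteenHarsted: Yes, as long as you remember to escape backslashes properly, it is the same underlying expression language. stringr and the tidyverse use perl-style regular expressions, which are used in many languages. Base R uses a slightly simpler version with fewer features by default, but also allows perl-style expressions.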
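
A small sketch of the two points in the comment above, using base R only. The variable names escaped and lookahead are made up for this illustration. The first call shows backslash escaping: the regex \w has to be typed as "\\w" inside an R string. The second uses a lookahead, which is Perl-style syntax and so needs perl = TRUE in base R.

```r
keys <- c("RKX", "LHeY")

# The regex \w must be written as "\\w" inside an R string literal;
# "\\1" and "\\2" in the replacement refer to the two capture groups
escaped <- sub("^(\\w+)([XYZ])$", "\\1;\\2", keys)

# A zero-width lookahead finds the position just before the final letter;
# lookarounds are Perl-style, so perl = TRUE is required in base R
lookahead <- sub("(?=[XYZ]$)", ";", keys, perl = TRUE)

print(escaped)
print(lookahead)
```

Both calls insert a separator between the joint name and the coordinate letter, so either result could then be split on ";".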

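【Solution 2】:

You need to gather the data into an extra-long format, then split off the dimension, and then spread the values into your X, Y, and Z columns: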

library(dplyr)    # mutate()
library(tidyr)
library(stringr)

df2 <- df_wide %>%
  # leave the other columns
  gather(jointid, position, -ID, -JN, -Frame, -System) %>%
  # insert a separator to make it easier to split the X/Y/Z from the joint name;
  # the $ anchor restricts the match to the trailing coordinate letter
  mutate(jointid = str_replace(jointid, "[XYZ]$", ";\\0")) %>%
  # split the joint name and the dimension apart
  tidyr::separate(jointid, c('joint', 'dim'), sep = ";") %>%
  # spread the joint and position apart into 3 columns
  spread(dim, position)
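
As a hedged aside, the gather / split / spread steps above can probably be collapsed into one call in newer tidyr versions (1.0.0 and later) using pivot_longer() with the special ".value" name. This is not part of the original answers. The two-row data frame df_small below is a made-up stand-in for the question's df_wide, and df_long is a name chosen here for illustration.

```r
library(tidyr)

# Two-row stand-in for the question's df_wide (values are made up)
df_small <- data.frame(
  ID = c(1, 1), JN = c(1, 1), Frame = c(1, 2), System = c(1, 2),
  RKX = c(1.3, 3.6), RKY = c(2.1, 3.3), RKZ = c(2.3, 3.1),
  LHeX = c(0.0, 2.1), LHeY = c(-0.3, 1.9), LHeZ = c(-1.2, 2.1)
)

df_long <- pivot_longer(
  df_small,
  cols = -c(ID, JN, Frame, System),   # reshape only the coordinate columns
  names_to = c("joint", ".value"),    # group 1 -> joint, group 2 -> column name
  names_pattern = "(.*)(X|Y|Z)$"      # same regex as in Solution 1
)
print(df_long)
```

Here ".value" tells pivot_longer() to use the second capture group (X, Y, or Z) as the names of the output columns, so no separate spread step is needed.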

【Comments】:

Also a very nice solution - thank you!