R merge function is unable to find shared matches between data frames

【Question title】: R merge function is unable to find shared matches between data frames
【Posted】: 2018-04-27 16:58:00
【Question】:

Hello, I have the following two data frames:

# dataframe 1 --> clst1_trimmed

> head(clst1_trimmed)
# A tibble: 6 x 2
  GeneName Clst.1
  <fct>     <dbl>
1 Cd74      1.20 
2 Lyz2      1.02 
3 Malat1    0.196
4 Ftl1      0.577
5 H2-Ab1    1.04 
6 B2m       0.639

# dataframe2 --> immgen_trimmed
> head(immgen_trimmed)
# A tibble: 6 x 6
  ProbeSetID GeneName Description                                      Cell.A Cell.B Cell.C
       <int> <fct>    <fct>                                             <dbl>  <dbl>  <dbl>
1   10344620 Cd74     " predicted gene 10568"                            15.6   15.3   17.2
2   10344622 Cd74     " predicted gene 10568"                           240.   255.   224. 
3   10344624 Lyz2     " lysophospholipase 1"                            421.   474.   349. 
4   10344633 Malat1   " transcription elongation factor A (SII) 1"      802.   950.   864. 
5   10344637 Flt1     " ATPase H+ transporting lysosomal V1 subunit H"  199.   262.   167. 
6   10344653 Cd3e     " opioid receptor kappa 1"                         14.8   12.8   18.0

I want to merge these on their shared GeneName values. I tried the following, and it worked:

merged <- merge(clst1_trimmed, immgen_trimmed)
 merged
  GeneName    Clst.1 ProbeSetID                                   Description    Cell.A    Cell.B
1     Cd74 1.1954372   10344622                          predicted gene 10568 239.86400 255.05600
2     Cd74 1.1954372   10344620                          predicted gene 10568  15.62080  15.33110
3   Ifitm3 1.7265938   10344674  family with sequence similarity 150 member A   9.40599   9.22875
4     Lyz2 1.0227826   10344624                           lysophospholipase 1 420.51800 474.19000
5   Malat1 0.1962251   10344633     transcription elongation factor A (SII) 1 801.62400 949.96800
    Cell.C
1 223.8960
2  17.2005
3  10.3231
4 349.0890
5 863.5060

However, merging two larger data frames the same way fails:

> dim(sel_clst)
[1] 984   2
> dim(immgen_log2)
[1] 24922   212

merge2 <- merge(sel_clst, immgen_log2)
str(merge2)
'data.frame':   0 obs. of  213 variables:
 $ GeneName                      : Factor w/ 984 levels "0610012G03Rik",..: 
 $ Cluster.1.Log2.Fold.Change    : num 
 $ ProbeSetID                    : int 
 $ Description                   : Factor w/ 21246 levels " "," 1-acylglycerol-3-phosphate O-acyltransferase 1 (lysophosphatidic acid acyltransferase alpha)",..: 
 $ X.proB_CLP_BM.                : num 
 $ X.proB_CLP_FL.                : num 
 $ X.proB_FrA_BM.                : num

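I think the problem is that GeneName is not being recognized correctly in the immgen_log2 data frame. I looked up a gene, "Cd74", that I know should be present in both data frames, but it does not show up in immgen_log2.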

> "Cd74" %in% sel_clst$GeneName
[1] TRUE
> "Cd74" %in% immgen_log2$GeneName
[1] FALSE

Any ideas why this fails?

【Comments】:

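Did you notice the leading space in the values of one of the variables? Neither "Cd74 " nor " Cd74" will match "Cd74". I have a function called trim that removes leading and trailing whitespace. I suggest first coercing all key columns to "character", then trimming your values before retrying the match.
Maybe also look at your data import command for an upstream fix.
Or use levels(df$var) <- trimws(levels(df$var)) (but it is always better to fix upstream)
@Moody_Mudskipper: I am always leery of using levels<-, but this does look like a reasonable application. It would need to be done on both data frames.
Yes, the space was the problem. I didn't catch it. I guess this question is obsolete with that realization. :)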
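
The mismatch described in these comments can be checked directly. What follows is only a minimal sketch on hypothetical toy keys (left_keys and right_keys are made up for illustration, not taken from the question's data); it assumes the padding is ordinary leading or trailing whitespace that trimws can remove.

```r
# Hypothetical toy keys; one side carries a leading space, as in the question.
left_keys  <- c("Cd74", "Lyz2", "Malat1")
right_keys <- c(" Cd74", " Lyz2", " Flt1")

# Exact matching fails because " Cd74" and "Cd74" are different strings.
found_before <- "Cd74" %in% right_keys

# Flag values that change when leading/trailing whitespace is removed.
has_padding <- right_keys != trimws(right_keys)
n_padded <- sum(has_padding)

# After trimming, the shared keys become visible.
shared <- intersect(left_keys, trimws(right_keys))
```

Running the same comparison against a real key column (after as.character) shows how many values are affected before anything is changed.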

Tags: r dataframe merge

【Solution 1】:

Try this (after making backup copies of these data frames):

levels(sel_clst$GeneName) <- trimws( levels( sel_clst$GeneName ))
levels(immgen_log2$GeneName) <- trimws( levels( immgen_log2$GeneName ))
merge2 <- merge(sel_clst, immgen_log2)
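
To see the effect on a small scale, here is a sketch with two hypothetical miniature data frames (small_clst and small_immgen are invented stand-ins, not the asker's data). It assumes the keys are factors and that only one side is padded.

```r
# Hypothetical miniature versions of the two data frames.
small_clst <- data.frame(GeneName = factor(c("Cd74", "Lyz2")),
                         Clst.1   = c(1.20, 1.02))
small_immgen <- data.frame(GeneName = factor(c(" Cd74", " Lyz2", " Cd3e")),
                           Cell.A   = c(15.6, 421, 14.8))

# Padded keys: merge() finds no common values, so the result has 0 rows.
rows_before <- nrow(merge(small_clst, small_immgen))

# Trim the factor levels in place, then merge again.
levels(small_immgen$GeneName) <- trimws(levels(small_immgen$GeneName))
fixed <- merge(small_clst, small_immgen)
rows_after <- nrow(fixed)
```

Trimming the levels rather than the values touches each distinct label once, which is why it stays cheap even on a column with many rows.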

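read.csv does not trim whitespace on input unless asked to (strip.white defaults to FALSE), so running trimws after every read.csv call may be a sensible safeguard for future work. The TL;DR version: set strip.white=TRUE as an argument whenever you use read.csv. I would even go so far as to say you should override your copy of read.csv with the following: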

read.csv <- 
       function ( ...){ utils::read.csv(..., strip.white=TRUE) }

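There is an options() setting, accessible through default.stringsAsFactors(), that lets you avoid a lot of newbie confusion about factor creation (since R 4.0.0, stringsAsFactors defaults to FALSE), but there is no comparable default that can be adjusted for strip.white.

See this transcript: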
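
As a rough illustration of the strip.white argument, the sketch below reads the same inline CSV text twice. The csv_text string is a made-up example, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# Hypothetical CSV content with padded, unquoted key values.
csv_text <- "GeneName,Value\n Cd74 ,1.2\nLyz2 ,1.0\n"

# Default behaviour: whitespace around unquoted fields is kept.
raw_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With strip.white = TRUE the padding is removed at import time.
clean_read <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                       strip.white = TRUE)

in_raw   <- "Cd74" %in% raw_read$GeneName
in_clean <- "Cd74" %in% clean_read$GeneName
```

Fixing the import this way means no later step has to remember to trim the key column.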

> dat <- read.csv(text= "hd1 , hd2, hd3\n 1, a ,   c\n1,b,d\n")
> dat
  hd1 hd2  hd3
1   1  a     c
2   1   b    d
> dput(dat)
structure(list(hd1 = c(1L, 1L), hd2 = structure(1:2, .Label = c(" a ", 
"b"), class = "factor"), hd3 = structure(1:2, .Label = c("   c", 
"d"), class = "factor")), .Names = c("hd1", "hd2", "hd3"), class = "data.frame", row.names = c(NA, 
-2L))
> dat <- data.frame(
             lapply(read.csv(text= "hd1 , hd2, hd3\n 1, a ,   c\n1,b,d\n"), 
                    trimws)
                    )
# could also have used a two step process starting with the original `dat` 
# dat[] <- lapply(dat, trimws)   .... the `[]` preserves structure

> dat
  hd1 hd2 hd3
1   1   a   c
2   1   b   d
> dput(dat)
structure(list(hd1 = structure(c(1L, 1L), .Label = "1", class = "factor"), 
    hd2 = structure(1:2, .Label = c("a", "b"), class = "factor"), 
    hd3 = structure(1:2, .Label = c("c", "d"), class = "factor")), .Names = c("hd1", 
"hd2", "hd3"), row.names = c(NA, -2L), class = "data.frame")
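
Note that in the transcript above, applying trimws to every column turned hd1 from an integer into a factor. One possible variation, offered only as a sketch, trims just the character columns so numeric columns keep their type; it assumes the data was read with stringsAsFactors = FALSE, and is_text is a helper name introduced here for illustration.

```r
# Same inline data as the transcript, read as character instead of factor.
dat <- read.csv(text = "hd1 , hd2, hd3\n 1, a ,   c\n1,b,d\n",
                stringsAsFactors = FALSE)

# Trim only character columns so the first column stays an integer column.
is_text <- vapply(dat, is.character, logical(1))
dat[is_text] <- lapply(dat[is_text], trimws)

str(dat)
```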

【Comments】:

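Can you pass trimws as an argument to my code: immgen_dat <- tbl_df(read.csv("RequestedImmGenData2018-04-20_18-48-38.csv"))
Perhaps: immgen_dat <- as_tibble( lapply( read.csv("RequestedImmGenData2018-04-20_18-48-38.csv"), trimws)). I don't think tbl_df (now deprecated) or as_tibble does any trimming automatically. data.table::fread will do so.
Could you explain in a few words why the data frame needs to be constructed with as_tibble for this to work?
lapply just returns a list without any other class attributes. Using data.frame or as_tibble restores the "data.frame" class attribute. I suggest you look at fread. It is faster and safer.
You can use the argument strip.white = TRUE in read.table, and I would guess in read.csv as well.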